The Great UrlEncode Mystery

The last two days have been pretty hectic. First thing Monday, we're hit with some odd behavior on a component that is part of the Windows Vista Enterprise registration process. Of course, anything that has to do with Vista receives pretty high priority :)

Some background on the system: I'm on the RegSys (Registration System) team. It's a component that's responsible for collecting information from users signed in through Passport. We're also responsible for the Profile Center. When a user creates a wizard (a set of questions for a user to answer), they have the option of providing branding information. This provides for a consistent user experience across Microsoft.com sites; for example, if the Microsoft Mouse and Keyboard group wanted to collect information, it would be inconsistent if the page that asked questions didn't have the red-colored branding.

Users also have the option of overriding branding. We could apply a querystring value that mapped to an existing brand, and the question set would render with the appropriate theme. This is important to this particular scenario; we have a wizard mapped out to an existing marketing campaign that has been around for almost a year. With the Vista release, that same campaign is coming into play as part of the Windows Vista registration experience. They want to use the same wizard, but handle theming for both the existing campaign *and* Windows Vista.

The brand override gets passed through the querystring, and the brand was getting encoded twice but decoded only once. This seems like a simple issue to resolve, but wait... there's more! This was only happening in some environments: it happened in production, but we couldn't reproduce it in our test environment. We knew we were running the same code in Test as in Production, but synchronized just to be sure. After the sync, the issue was still happening. So... where did this second, unexpected UrlEncode come from?
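To make the symptom concrete, here is a minimal sketch of what "encoded twice, decoded once" does to a value. It is written in Python purely for illustration and is not our actual code; the brand value is made up, and .NET's UrlEncode differs in details (for example, it turns a space into "+" rather than "%20"), but the double-encoding effect on the percent sign is the same idea.

```python
from urllib.parse import quote, unquote

# Hypothetical brand value, chosen only because it contains a character
# that needs encoding.
brand = "Windows Vista"

# First (expected) encode: the space becomes %20.
once = quote(brand, safe="")

# Second (unexpected) encode: the "%" itself gets encoded as %25.
twice = quote(once, safe="")

# The receiving page decodes exactly once, so it ends up with the
# still-encoded value instead of the original brand.
decoded_once = unquote(twice)

print(once)          # Windows%20Vista
print(twice)         # Windows%2520Vista
print(decoded_once)  # Windows%20Vista
```

A brand lookup against the once-decoded value finds nothing that matches, which is roughly why the theme failed to apply.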

We diagnosed the HTTP traffic using a tool called Fiddler and determined that the issue was happening on one of our pages. When the handoff from Passport was made after a sign-in, it landed on one of our pages with a specialized querystring value and some code in our global.asax.cs file took over from there. Make note of this. This becomes important later :)

The code itself seemed unlikely to be at fault, since the same codebase was producing two different results in two environments. Since this was a pretty high priority fix, we were presented with one option at the end of the day: UrlDecode the branding parameter a second time. That wouldn't have answered any of our questions about the behavior of the application, though. The whole encode/decode process is an extremely brittle one that happens in several places around the system; changing it in one place would most likely break something elsewhere. And while a hack would have fixed the issue, we wouldn't have understood the why behind it. As a developer, this is unacceptable :)

Day two, we continue diagnosing the issue. One of the developers is able to reproduce the issue on his box, but syncing with source control fixes it. We weren't able to diagnose why, but found that the sync changed only configuration files. Cue a request to get all our configuration files from Production, and we spend the next half of the day running a diff on the files we have and the files in production. There are no key differences... so we start replacing our configuration files with the ones from production, one at a time, until the functionality breaks in our test environment.

We eventually find that the file that caused the issue was %WINDIR%\Microsoft.NET\Framework\v2.0.50727\CONFIG\web.config. We run a diff on our two web.config files with incredible scrutiny, but the files are more or less the same... However, replacing our web.config with the one from production caused the feature to break. I start the process of painstakingly going through each difference, no matter how small, until our application breaks. The Assemblies node goes in and out without consequence, various nodes get added and removed...

Then, when the httpModules node is replaced with the version from production, things start breaking. Which is good! After two days of debugging, we're finally able to reproduce the issue in our test environment reliably. But why is the httpModules node breaking things? Both versions have the same contents... but the order in which the modules are added is different. When we shift the order within httpModules around, the feature flips between working and breaking.

Theories abounded on why this would be the case... but the particular problem happens with an assembly that another team is responsible for. We escalate to the team responsible for the assembly, and they reply saying that by changing the order of a dependency, we were effectively enabling/disabling their library. Their library contained code to handle the specialized request from Passport (mentioned above). When the request came in, the external library took over, mistakenly encoded an already encoded querystring parameter, and passed it on to us.

When we disabled the secondary library, our code handled this specialized request without encoding the branding parameter as expected.
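Here is a toy model of why the order mattered. It is only a sketch in Python, not the real ASP.NET pipeline or the other team's code: the module functions, the "handoff" flag and the first-one-wins rule are all assumptions made up for illustration. The idea is simply that two pieces of code both want to handle the Passport handoff, and whichever is registered first gets it.

```python
from urllib.parse import quote

def external_module(request):
    """Hypothetical stand-in for the other team's library: it handles the
    handoff but re-encodes a brand value that is already encoded."""
    if request.get("handoff"):
        return {**request, "brand": quote(request["brand"], safe="")}
    return None  # not interested in this request

def own_handler(request):
    """Hypothetical stand-in for our own handoff code: it handles the
    request and leaves the brand value untouched."""
    if request.get("handoff"):
        return dict(request)
    return None

def run_pipeline(request, modules):
    """Offer the request to each module in registration order; the first
    one that handles it wins and later modules never see it."""
    for module in modules:
        result = module(request)
        if result is not None:
            return result
    return request

# The brand arrives already encoded once, as expected.
request = {"handoff": True, "brand": "Windows%20Vista"}

production_order = [external_module, own_handler]
test_order = [own_handler, external_module]

print(run_pipeline(request, production_order)["brand"])  # Windows%2520Vista
print(run_pipeline(request, test_order)["brand"])        # Windows%20Vista
```

Same modules, same code, different order, different result, which matches what we saw when shuffling the httpModules entries.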

 

And so the great UrlEncode mystery is solved :D It eventually boiled down to pinpointing a problem with an external component, but getting there was an incredibly difficult process. This definitely ranks up there as one of the more troublesome problems to diagnose... What was your debugging session from hell? :)

EDIT: In other news, Gears of War, COD3 and Viva Pinata come out this week. Going to be a costly week! And my ill-gained King Kong points have temporarily put me at the top of the building 6 leaderboard. :)