Everybody who has worked at Microsoft for long enough has their war stories. I'll share one of my first, from Windows Mobile 6 development.
It Sounded Simple Enough
My first full feature at Microsoft was the storage card encryption feature in WM6. My team (Device Management & Security) collaborated with the Windows CE filesystem team to deliver it: the filesys team would deliver the encryption filter itself, while my team would provide the control panel UI, the framework that let Exchange manage the encryption settings, and the end-to-end performance testing.
Windows CE has always shipped an encryption filter as sample code, so we planned to take the sample, do a little bit of work to make it production-ready, and be done. Easy, right? To reduce development time we chose not to implement cache coherency in the filter - every file handle got its own instance of the filter with its own internal buffer. Cache coherency is a complex problem in filesystem design, so this choice shaved a lot of time off the schedule.
Once the filter was delivered, I ran various user scenarios on encrypted storage cards to make sure that the user experience hadn't degraded too much. We expected some performance hit from the encryption (it's not free), but I needed to ensure that nothing degraded so badly that users would turn encryption off. We also contacted several of the feature teams, and they went off and did their own testing on encrypted storage cards for the same reason.
It turned out that one particular app had a ridiculous performance degradation. Opening a file that previously took a few seconds now took several minutes. Obviously, this was unacceptable. I went to investigate.
The encryption filter decrypts each block of the file as you read it. Whenever you read a block, the filter first reads the first block of the file to get the decryption keys. Then it reads the block you requested, decrypts it, and returns the part you asked for. So every read costs one block more than you requested (the key block). I suspected this might be part of the performance problem, but I needed to do some measurements to be sure.
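The per-read cost described above can be sketched as a small model. This is illustrative only: the function name is invented, and the 4 KB page size is my assumption (the post doesn't state the actual block size).

```python
PAGE_SIZE = 4096  # assumed page/block size; the post doesn't specify it

def pages_read_from_disk(offset: int, length: int, page_size: int = PAGE_SIZE) -> int:
    """Pages the filter touches for one application read:
    the key block (block 0) plus every page the request spans.
    Hypothetical model of the behavior described in the text."""
    first = offset // page_size
    last = (offset + length - 1) // page_size
    data_pages = last - first + 1
    return 1 + data_pages  # +1 for the key-block read on every call

# Even a 4-byte read costs two full pages (8192 bytes) of disk I/O:
print(pages_read_from_disk(offset=100, length=4))        # → 2
print(pages_read_from_disk(offset=0, length=PAGE_SIZE))  # → 2
```

Note how the key-block read puts a hard floor of two pages under every request, no matter how small.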
I took the FSDSPY sample filter driver and modified it to log a little bit more information, which let me track every read and write to the disk. I ran some sample files through the app and then wrote a little PowerShell script to parse the log output. The results were eye-opening:
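The crunching itself is simple aggregation. The original was a PowerShell script over FSDSPY output; here is a rough equivalent in Python, assuming a hypothetical log format of "READ <offset> <length>" lines (the real FSDSPY log format differs):

```python
# Sketch of the log-summarizing step; the log format here is invented for illustration.
from collections import Counter

def summarize_reads(lines):
    """Aggregate per-read statistics from hypothetical 'READ <offset> <length>' log lines."""
    sizes = [int(line.split()[2]) for line in lines if line.startswith("READ")]
    return {
        "total_reads": len(sizes),
        "total_bytes": sum(sizes),
        "avg_size": sum(sizes) // len(sizes),
        "largest": max(sizes),
        "by_size": Counter(sizes),  # how many reads asked for each size
    }

log = ["READ 0 4", "READ 4 4", "READ 8 4096", "WRITE 0 512"]
print(summarize_reads(log))
```

The `by_size` histogram is what surfaces a pathology like "half the reads are 4 bytes" at a glance.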
Unique physical pages read (one read operation): 103
Minimum pages read (page-size reads): 202
Your pages read (optimistic calculation): 98792
Total Bytes Read  Total Reads  Average Read Size  Largest Read
----------------  -----------  -----------------  ------------
          411945        49396                  8          6428
The application's data amounted to about 100 pages on disk. If they had read it one page at a time, which is pretty common, they'd have read about 200 pages. But they were reading almost 100,000 pages. 25,000 of their 50,000 reads asked for only 4 bytes of data. Because the filter reads a minimum of two pages per request (as described above), this app was pulling 8192 bytes off the disk and doing a decryption operation for every 4 bytes of data that it wanted to consume. That explained the thousand-fold performance degradation. Diagnosing the problem was pretty easy. Solving it was harder.
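The back-of-the-envelope arithmetic works out, again assuming a 4 KB page (my assumption, not stated in the post):

```python
PAGE_SIZE = 4096          # assumed page size
total_reads = 49396       # from the log summary above
pages_needed = 103        # unique pages, if read in one operation

# Two pages per application read (key block + data block):
disk_bytes_moved = total_reads * 2 * PAGE_SIZE
amplification = disk_bytes_moved / (pages_needed * PAGE_SIZE)

print(f"{disk_bytes_moved:,} bytes moved")       # → 404,652,032 bytes moved
print(f"~{amplification:.0f}x read amplification")  # → ~959x read amplification
```

Roughly 400 MB of disk traffic (and decryption work) to deliver about 400 KB of data - which is exactly the thousand-fold figure above.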
We blamed the application team. How could they think that reading in their data 4 bytes at a time was acceptable? It's ridiculous. They should fix their read routines.
The application team blamed us. We had changed the rules of the world: their file load operation used to be fast, and now it was slow. It had been fast because the disk driver itself has a small cache. The encryption filter doesn't have one (because we would have had to make it coherent), so by inserting the filter we changed the behavior of the world. And their file code was several years old and in maintenance mode.
We didn't have schedule time to go back and add this functionality to the filter. The cache logic is almost as complex as the encryption filter itself, so it would essentially have meant a rewrite. The app team didn't have time reserved to rewrite all of their file I/O routines, and they certainly didn't want to cut other work to do so, because it wouldn't buy their users any perceived value. And we couldn't ship as-is, because the files this app uses are exactly the kind of thing we'd expect our customers to put on an encrypted storage card. It would have crippled the value of our feature if you couldn't use it with this application.
A Happy Ending
Luckily, we didn't have to take either of the above choices. The filesystem team was working on a global cache filter for Windows CE6. This cache filter sits at the top of the filter stack (above our filter), so it would buffer all of the app's filesystem accesses. The filesystem team hadn't yet finished this feature for CE6, so depending on it was a bit of a risk, but that was the path we chose. Once they got it finished and tested for CE6, they ported it back to Windows Mobile 6. With the cache filter installed, the performance tests went back to green - the performance hit from encryption with the cache filter in place was negligible. So in the end, everyone got what they wanted and we were all able to ship our features.
Incidentally, this incident inspired a blog entry from my colleagues on the filesystem team.