Priorities

So we’re deep in the drive toward Beta2 right now. Lots of pressure and an incredibly high quality bar that we’re trying to meet so that people can actually “go live” with VS2005. What’s that mean? Well, with Beta1 we advised customers that this was a product that they could use to see what we were working on and to try out some of the new technologies that we were rolling out. But (and it’s a big “but”) we specifically told them that they shouldn’t use VS2005 for any production code or websites. We knew that there were many problems with the first beta and it would have been irresponsible for us not to tell customers about this. Otherwise someone might try to roll out a website based on 2k5, have the system not cope with it, and possibly damage their business in the process. That changes with Beta2. This is a release that we consider high enough quality that you can build enterprise applications on it and even ship those apps if you choose to. There will be bugs (heck, that’s true of the final release), but hopefully those bugs won’t significantly impair your ability to use the product.

My previous post discussed part of this process and how difficult it was to make changes. Less change means less chance for regressions or destabilization. But less change combined with a bar that is rising on a day-by-day basis makes progress toward the quality release we want very difficult. It’s something we take very seriously, though, and we know it’s the only way to get 1000 people to pull it all together. Interestingly enough, there is something else that can happen that we consider an even higher priority than the work toward Beta2. Something so important that it gets all your attention and supersedes everything else. I’m talking, of course, about skiing trips to Whistler. Wait… no I’m not. That was just on my mind for other reasons (like: I need a break!!! :)). What I’m actually talking about is “servicing”, a general term we use for product support for customers of our previously released apps and frameworks. In my case that’s providing support for customers of VC# 2002 and 2003.

I brought this up because I thought that many people might not know about these commitments, and even if they were aware of them they might not know just how seriously we take them. From the top on down, servicing is, without question, the most important thing that we can do. You can hear it from Soma, who says: “Looking ahead, servicing existing customers and shipping a high quality Visual Studio 2005 and .NET Framework continue to be our highest priorities”, and you can trust me when I say that that message continues on down all the way to the developers who are partly responsible for this task.

So how does servicing work? It starts with a customer opening a support incident with Product Support Services (PSS). The PSS guys work with the customer to figure out what the problem is and to do an initial screening to see if a previously released fix might address the issue for them. For C# there are patches for a couple of nasty bugs that were fixed post 2003 that often end up solving a lot of customer problems. If there are no known resolutions for the customer issue then it gets escalated to a Days To Solution (DTS) issue, which will then land on my plate. A DTS is exactly what its name implies: an issue that needs to be solved in a matter of a couple of days or less. There’s a little flexibility here because of disparate timezones, as well as the fact that we often need additional information from the customer. The purpose of the DTS is to find the simplest resolution that unblocks the customer. For example, for a compiler bug we might see if it’s possible to rewrite the user’s code slightly so that the same semantic meaning is kept but the compiler bug is bypassed. If we’re able to come up with a workaround then we supply it to the customer to see if it’s acceptable for them. If it’s not acceptable, or there are simply no known workarounds, then the issue gets escalated once again and becomes a Quick Fix Engineering (QFE) issue.
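For anyone who prefers to see that escalation path as code, here’s a minimal sketch of it in Python. To be clear, this is just an illustration of the flow described above, not anything we actually run: the names `Stage`, `next_stage` and `route` are made up for this example, and the two yes/no inputs are a simplification of what is really a lot of back-and-forth with the customer.

```python
from enum import Enum


class Stage(Enum):
    """Stages an issue can be in (hypothetical names for illustration)."""
    PSS = "support incident with PSS"
    DTS = "Days To Solution"
    QFE = "Quick Fix Engineering"
    RESOLVED = "resolved"


def next_stage(stage, resolved):
    """Return where an issue goes next from the given stage."""
    if resolved:
        return Stage.RESOLVED
    # Unresolved issues escalate one level: PSS -> DTS -> QFE.
    escalation = {Stage.PSS: Stage.DTS, Stage.DTS: Stage.QFE}
    # A QFE has nowhere further to escalate, so it stays a QFE.
    return escalation.get(stage, stage)


def route(known_fix_helps, workaround_accepted):
    """Trace the path of one issue through the process.

    known_fix_helps: a previously released fix solves the problem (PSS screening).
    workaround_accepted: the customer accepts a workaround found during the DTS.
    """
    stage = Stage.PSS
    path = [stage]
    stage = next_stage(stage, known_fix_helps)
    path.append(stage)
    if stage is Stage.DTS:
        stage = next_stage(stage, workaround_accepted)
        path.append(stage)
    return path


if __name__ == "__main__":
    # No existing patch helps and no workaround is acceptable.
    print([s.name for s in route(False, False)])
```

The only point of the sketch is that each stage either resolves the issue or hands it up exactly one level, which is why a QFE usually lands back with the person who worked the DTS.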

What’s a QFE?  Well, for us, it’s a specialized build of our product that we will create that specifically unblocks the customer.  Note: that doesn’t mean that the issue is necessarily fixed, merely that it’s “resolved”.  So a crash might be “resolved” by limiting functionality in a certain scenario so that the customer can continue to do work.  However, in general issues will actually be fixed.  The reasoning for this is three-part.  First, the customer is better served with a real fix.  We don’t want to unblock them while simultaneously degrading the rest of VS.  Second, if we create a fix that regresses other components then this QFE will literally be specifically for this customer.  That means that if another customer comes along with this problem then it’s very possible that this QFE wouldn’t suffice since it might regress VS in ways that are unacceptable to them.  Finally, this kind of QFE might be sufficient for this customer, but we won’t be able to integrate it into a Service Pack.

So once it’s escalated to a QFE what happens? Well, it comes back on my plate and now it’s time to actually come up with a fix for the problem. Now, since I worked on the DTS I’m already acquainted with the bug and there’s a good chance that I’ve determined what code is at fault. If I haven’t then it’s up to me to figure out what the problem is at that point. This is actually quite difficult. Why? Well, these QFEs concern the 2003 codebase, and between 2003 and 2005 it’s fair to say that a majority of the code has been rewritten and certainly all of the code has been modified. I came in after 2003 shipped, so I end up trying to uncover a bug in a codebase that’s completely alien. And once you find the location where you can definitely tell something wrong is happening, you now have to figure out how to fix the bug. Do you patch it up at that point, or do you try to determine if there is a deeper seated bug that you need to fix much earlier in the process? Do you make a massive change which has a huge risk of breaking things in this codebase which you are so unfamiliar with, or do you just do a quick surgical fix that addresses this specific issue with a low chance of negatively affecting everything else? Amazingly enough there are developers out here who work on this stuff as their full time job. They excel at working with codebases written years and years ago and finding the right fixes for the issues that are being seen.

Now, you think that’s hard? Well, here’s the best part. Sometimes you have to do all this work without any repro case to work with at all! Why? Well, if this issue is hitting the customer’s Uber-Important-Sekrit-Mission-Kritical app then there’s a fair chance that they consider the code to that app extremely sensitive and they might not be comfortable with handing over that code to MS even with protections in place (would you? :)). So what does that mean for us? Well, we have to try to repro this issue ourselves based solely on the characteristics of the project as well as the symptoms they’re experiencing. This is enormously difficult. I’ve worked on QFEs for projects involving 30+ MB of source code, and projects that produced assemblies that were 300+ MB large (yes, you read both of those numbers correctly). With something that huge the amount of complexity as well as interdependencies grows astronomically. What might be a tiny memory leak that doesn’t affect 99+% of customers is now completely breaking everything for one of these projects.

Now the QFE process is given more time than the DTS, but it’s still on a clock (generally determined by how badly the customer needs the fix and how severe the problem is).  So you work hard on the fix and then work with QA to make sure it isn’t causing problems and with the customer to make sure it’s solving their problem.  If you’re lucky this only takes one try.  But it can often take several tries as your first fix solves one issue, but then another is discovered, and another, and another.

Eventually the fix is created and signed off on by all relevant parties and you can now go back to working on the regular work you’ve got (which suddenly seems ridiculously easy compared to the work of the last week!).  All very fun, and gratifying too especially when you can unblock a customer and help them to do the work that they want to do.  Software has bugs and right now we don’t know how to change that fact, but with our servicing plan we can help the situation out a whole lot.