My name is Bogdan Mihalcea and I’m a developer on the VC++ Development Team. For the last three years on the team I have worked in almost all areas of the IDE (Intellisense, Project System, Resource Editor, Debugger). Recently we were engaged in fixing various bugs in our product for VS9 SP1. In multiple instances we faced similar bugs that can easily be described as “a crash in a corner-case scenario, with a fix in a common code path which might impact everybody”.
Corner-case bugs typically surface when there are unforeseen circumstances early in the development process, or when major design changes are applied to an old code base. Like any other mature code base, VC is no exception and the same rules apply: occasional design changes due to new requirements or breaking changes in our dependencies, new developers joining the team, together with constant changes driven by market requirements and evolution, can all be significant contributing factors for these bugs.
So how do we approach this kind of bug when it surfaces? It is not easy, and usually there is no clear path to follow. We use the experience we have gained along the way while trying to minimize risk at the same time. There are always risks: breaking backwards compatibility, degrading performance, and yes, introducing new bugs (no developer is perfect; if that were the case we would have no bugs to begin with). Each of these risks is important to us, as any of them might impact millions of customers.
For a service pack we usually follow a very strict plan, as we try to avoid all of the above risks. In a recent bug that was reported to us, a customer experienced a project system crash when building a project that uses short paths (i.e. DOS 8.3) in Additional Directories. Analyzing this bug, we concluded the following:
- Although a corner case, it was a real crash in the product
- The customer had a context that required a fix
- The fix was in a common code path, with the potential to affect every customer who uses the VC project system
Without going too deep into the implementation details, but enough for this exercise: in order to improve performance we use a file hash map, keyed by the file path, to prevent unnecessary re-creation of File Manager objects. Three operations involving the map concerned us: Create, Lookup, and Remove. Create and Lookup were operations revolving only around the map; Remove belonged to the destructor of the File Manager object.
And here was the bug: in the customer’s case the object was added to the map using the path “as is”, which in this case was short, but when the object was destroyed we used the path stored in the object, which had been processed and stored as a long path. The Remove operation failed, and we were left with a map entry for an object that had already been destroyed. It does not take much imagination to see what happened next, when we queried the map for that file path and used the object we got back.
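The mismatch can be sketched in a few lines of C++. Everything here is illustrative, not the actual project system code: `FileManagerObj`, `ToLongPath`, and the example paths are made-up names for this post. Only the pattern matters: insert under the path as given, remove under the processed path.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

struct FileManagerObj;
using FileMap = std::unordered_map<std::string, FileManagerObj*>;

// Stand-in for the path processing done at construction time; the real
// code used a Windows API to expand DOS 8.3 names to long paths.
std::string ToLongPath(const std::string& path) {
    if (path == "C:\\PROGRA~1\\proj") return "C:\\Program Files\\proj";
    return path;
}

struct FileManagerObj {
    std::string storedPath;  // processed and stored as a long path
    FileMap* owner;
    FileManagerObj(const std::string& path, FileMap* map)
        : storedPath(ToLongPath(path)), owner(map) {}
    ~FileManagerObj() {
        owner->erase(storedPath);  // BUG: the map key may be the short path
    }
};

// Create inserts using the caller's path "as is" (short, in this case).
FileManagerObj* Create(FileMap& map, const std::string& path) {
    auto* obj = new FileManagerObj(path, &map);
    map[path] = obj;  // key: the unprocessed path
    return obj;
}
```

After destroying the object, the map still holds the stale entry under the short key, waiting to crash the next lookup.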
The first iteration of the fix was to normalize access to the map by always using lowercased long path names. This was a clean and safe approach, but it introduced a 2 ms per-item degradation in our project load performance tests, which may seem small, but when projects have thousands of items it easily adds up to seconds.
The reason it introduced this extra cost was the Windows API we used to get the long path, which hits the disk in all cases. So whether we convert a short path to a long one, or try to convert an already long path to a long path, we pay the same cost: a disk hit. Most customers’ projects always use long paths, so with this fix we were addressing a small number of customers’ cases while degrading the performance of all build operations. In order to validate the fix we had to run all our build system tests, including the performance tests. What we gained after the first iteration was a clear boundary around the problem and confirmation that the fix was good and targeted at the right place.
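A minimal sketch of that first fix, assuming a single normalization routine guarding every map access. `LongPathFromDisk` and the disk-hit counter are hypothetical stand-ins; the point is that the real Windows API paid the disk cost even when the input was already a long path.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

static int g_diskHits = 0;  // instrumentation for the sketch only

// Stand-in for the Windows API call: it touches the disk on EVERY call,
// whether the input is a short 8.3 path or already a long one.
std::string LongPathFromDisk(const std::string& path) {
    ++g_diskHits;
    if (path == "C:\\PROGRA~1\\proj") return "C:\\Program Files\\proj";
    return path;  // already long: same cost, no change
}

// First-iteration fix: every key passed to the map goes through this.
std::string NormalizeKey(const std::string& path) {
    std::string longPath = LongPathFromDisk(path);
    std::transform(longPath.begin(), longPath.end(), longPath.begin(),
                   [](unsigned char c) { return std::tolower(c); });
    return longPath;
}
```

Both spellings of the same file now normalize to one key, but every access pays the disk hit, including the common long-path case.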
The strategy for the second iteration was to optimize the first fix and minimize the number of times we called the Windows API. We made sure that the key was always in sync with the path stored in the File Manager object, which prevented the object from failing to remove itself from the map when destroyed. We deliberately did not change the lookup and destroy functions, so as not to introduce unnecessary API calls. Everything looked nice, but it was far from ready: during the routine code review we held for the fix, we noticed that we still did not prevent the map from holding both a short and a long path for the same file, which in complicated scenarios would end up confusing the dependency graph.
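Sketching the second iteration under the same hypothetical names as before (one plausible reading of “key in sync with the stored path”): the object remembers the exact key it was inserted under, so Remove always succeeds, while Lookup stays free of API calls. The residual problem shows up immediately: both spellings of the same file can coexist in the map.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

struct FileManagerObj;
using FileMap = std::unordered_map<std::string, FileManagerObj*>;

std::string ToLongPath(const std::string& path) {  // stand-in for the API
    if (path == "C:\\PROGRA~1\\proj") return "C:\\Program Files\\proj";
    return path;
}

struct FileManagerObj {
    std::string mapKey;      // exactly the key used at insertion
    std::string storedPath;  // processed long path, used elsewhere
    FileMap* owner;
    FileManagerObj(const std::string& path, FileMap* map)
        : mapKey(path), storedPath(ToLongPath(path)), owner(map) {}
    ~FileManagerObj() {
        owner->erase(mapKey);  // always matches: removal cannot fail now
    }
};

FileManagerObj* Create(FileMap& map, const std::string& path) {
    auto it = map.find(path);  // lookup unchanged: no extra API call
    if (it != map.end()) return it->second;
    auto* obj = new FileManagerObj(path, &map);
    map[path] = obj;
    return obj;
}
```

Removal is fixed and no API call was added, but creating the object via the short path and then via the long path yields two distinct entries for one file.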
For the third iteration we leveraged the fact that, before creation, we already performed a lookup in the table to see whether the required object was in the map; if it was, we returned the existing object. To address the second iteration’s problem, we introduced a second lookup in the table, this time using the path stored in the File Manager object, since it has the potential to differ from the original path we were asked to create the object with when the target file exists on disk (the API has no impact on paths that do not exist). We ran our tests, and this time we noticed no performance drop.
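The third iteration can be sketched the same way, again with hypothetical names: the second lookup with the processed path runs only on the creation path, never on plain lookups, which is why it cost nothing measurable.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

struct FileManagerObj;
using FileMap = std::unordered_map<std::string, FileManagerObj*>;

std::string ToLongPath(const std::string& path) {  // stand-in for the API
    if (path == "C:\\PROGRA~1\\proj") return "C:\\Program Files\\proj";
    return path;  // already long, or nonexistent on disk: returned as-is
}

struct FileManagerObj {
    std::string storedPath;  // processed path; also the map key
    FileMap* owner;
    FileManagerObj(const std::string& path, FileMap* map)
        : storedPath(path), owner(map) {}
    ~FileManagerObj() { owner->erase(storedPath); }
};

FileManagerObj* Create(FileMap& map, const std::string& path) {
    auto it = map.find(path);  // first lookup, same as before
    if (it != map.end()) return it->second;
    std::string longPath = ToLongPath(path);  // API call only when creating
    if (longPath != path) {
        it = map.find(longPath);  // second lookup with the processed path
        if (it != map.end()) return it->second;  // no short/long duplicates
    }
    auto* obj = new FileManagerObj(longPath, &map);
    map[longPath] = obj;  // key matches what the destructor will remove
    return obj;
}
```

Now the short and the long spelling resolve to one object, and removal still succeeds, with no API call on the hot lookup path.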
Even though all three iterations’ fixes were functionally correct, we chose to go with the fix from the third iteration. The main reasons, based on the risks described above, were:
- A smaller, more surgical fix, with fewer chances of breaking other scenarios
- No performance hit for the common code path (always long paths)
- It addresses the dependency graph issue for existing files
The process is far from perfect, and we do our best to improve it every day. I used this past experience only to give you a glimpse of the complexity and impact that apparently small bugs with obvious solutions can have on the product cycle and on the end customer (which includes us as well, because of course we use the same version of VS to develop it, which is “fun” during development, when it is missing features, slow, and most of the time built “Debug”, but I will save that for another blog).