You’ve seen me write a fair amount recently about the VS 2010 Beta 2 performance problems – well, you’re going to see me keep writing about them 🙂 We’ve had enough time, at this point, to understand the feedback and characterize the problem – I’d like to share it with you in hopes of you understanding it and perhaps learning something that can help you avoid the same issue in the future.
The irony of the whole thing is that in this product cycle, we had a bigger concerted focus across the division on performance than we had ever had before. From the very beginning we focused on defining scenarios, goals, regression tests, etc. We went into this product cycle knowing that performance was going to be a challenge. The adoption of WPF for some of our UI elements and a new editor certainly were among the highlights of features likely to lead to issues. We wanted to head off performance problems from the start.
As I look back, there were many things that led to the situation that we currently find ourselves in. However, if I were to pick one that was the most impactful, I’d say it was the way we measured and set goals for performance.
Prior to this release, our performance efforts were hindered by regressions and unreliability of measurements. We set out this product cycle to fix that problem by having every team create a clear list of scenarios and automated tests to measure them (TFS, for instance, has about 150). We focused on making sure the tests were repeatable and had very small standard deviations so that if a test showed a significantly different time than expected, we had a strong reason to believe there was an actual regression to investigate. We set up automated suites that would run some of the tests on virtually every checkin, some every day and some every week. We call the effort RPS (or Regression Prevention System). It was to be our canary in the coal mine.
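To make the "small standard deviation" point concrete, here is a minimal sketch of how a tight baseline lets a single run be flagged as a likely regression. It is an illustration only, not how RPS is implemented: the function name `is_regression`, the 3-sigma threshold, and the timing values are all assumptions made for the example.

```python
import statistics

def is_regression(baseline_ms, new_ms, threshold_sigmas=3.0):
    """Flag new_ms if it exceeds the baseline mean by more than
    threshold_sigmas sample standard deviations.

    The 3-sigma default is an assumed policy, not a documented one.
    """
    mean = statistics.mean(baseline_ms)
    stdev = statistics.stdev(baseline_ms)
    return new_ms > mean + threshold_sigmas * stdev

# Hypothetical timings (ms) for one scenario across earlier builds.
baseline = [1000, 1004, 998, 1002, 996]

print(is_regression(baseline, 1003))  # False: within run-to-run noise
print(is_regression(baseline, 1050))  # True: far outside the noise band
```

The tighter the baseline, the smaller the slowdown that can be detected, which is why so much effort goes into making the tests repeatable.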
Clearly it didn’t work as well as we had hoped. Why? My analysis is this… In order to get a very reliable, very repeatable set of regression tests, we had to spend a lot of time refining the tests (almost the whole product cycle). We iteratively removed randomization and focused the tests so that we got consistent runs every time. The result is a set of very precise tests that test individual features very well. The problem? No one actually uses the product one very isolated feature at a time. The reality is when you load up a real world application, you have 5 editor windows open, 3 forms designers, an architectural diagram, you’ve just finished debugging and you now hit “Go to definition” on a symbol – it’s totally different than what we were testing. I learned 2 things from this exercise about how to ensure your performance testing is measuring the “real world”:
- You must use a “real” dataset – We must have realistic solutions with a realistic mix of projects, realistically complex artifacts (like forms or architecture diagrams).
- You must use a “real” usage pattern – We can’t start up the IDE, test one thing and shut it back down. We must measure in the context of an appropriate amount of data/context loaded and we must do it within the context of a normal user session (for example, we might run a debug session and then test how fast the form designer is).
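The difference between the two styles of test can be sketched with a toy model. Everything below is hypothetical: `ToySession` and its methods are stand-ins invented for this example, and the cost model (work grows with the amount of state loaded in the session) is an assumption chosen only to show why an isolated measurement can look much better than the same action measured mid-session.

```python
class ToySession:
    """Stand-in for an IDE session; every name here is hypothetical."""

    def __init__(self):
        self.loaded = []  # state accumulated over the session

    def open_editor(self, name):
        self.loaded.append(("editor", name))

    def open_designer(self, name):
        self.loaded.append(("designer", name))

    def run_debug_session(self):
        self.loaded.append(("debug-state", "app"))

    def go_to_definition(self, symbol):
        # Assumed cost model: one unit of work plus one per loaded item.
        return 1 + len(self.loaded)

def isolated_cost():
    """Start a fresh session, test one thing, shut down."""
    return ToySession().go_to_definition("Foo")

def in_session_cost():
    """Measure the same action after a realistic amount of prior activity."""
    session = ToySession()
    for i in range(5):
        session.open_editor(f"file{i}.cs")
    for i in range(3):
        session.open_designer(f"form{i}")
    session.open_designer("architecture diagram")
    session.run_debug_session()
    return session.go_to_definition("Foo")

print(isolated_cost())    # 1
print(in_session_cost())  # 11
```

A real harness would time the action rather than count work units, but the structure is the same: build up the session first, then measure.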
So you might ask, don’t you use the product? Why are you just relying on these “microbenchmarks”? Yes, we do use the product. After we started getting the Beta 2 feedback about performance, we did a survey of internal users (just within DevDiv) on satisfaction with the Beta and ship readiness. 70% of respondents said they were dissatisfied with VS performance (that’s more than twice the 30% of external respondents who said the same thing). So we had the data that it wasn’t good enough but we weren’t listening to it. Why? I think we made up all kinds of stories. We said it was “long memory” – people’s impressions were tainted by the performance of Beta 1. We said it was taint from their development environment (most devs actually use components that they build themselves – not just ones from the official build lab; after all, we do build the IDE :)). We said it was lag – all the great performance improvements we were working on just hadn’t made it into all of the branches yet, etc. etc.
Clearly a learning to be had there too.
There are many other issues too, including:
1) Performance hardware – We didn’t do a great job ensuring that we were validating on an appropriately wide range of hardware. For example, we weren’t looking at netbooks at all and if you look at my intellisense videos from yesterday, you can see that Beta 2 netbook performance was abysmal. There were other hardware-related issues – for example, we found that WPF hardware accelerated performance varies dramatically depending on the quality of the video card and is sometimes slower than software rendering.
2) Carving out room for new technologies – Anytime you add something it’s likely to take more resources. You have to figure out how to make room for it. We did not do a good job of this in this release. In Windows 7, the Windows team had some really good practices around this. For instance, there was a rule that if you are going to add anything to boot (CPU, Disk, Memory, etc) you must find enough savings elsewhere to pay for it before you are allowed to check in. There was a strict “no-growth” rule in key scenarios. Definitely a practice that we’ll be looking at.
3) Coordination across the division – In all aspects of our work we struggle with the tension between letting individual product units run their own business and coordinating efforts centrally across the division. It’s clear to me that in this release coordination was not good enough but it’s a very tough balance. We’ll be spending some real time thinking about this in the next release.
4) I’m sure there are more we’ll learn as we continue to reflect on this product cycle. In the meantime, we will continue working to ensure a top quality product when we ship.
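As an illustration of the “no-growth” rule mentioned in item 2, here is a minimal sketch of a check-in gate that compares a candidate build’s resource usage against a baseline. The function name, the resource keys, and the numbers are all assumptions made for the example; this is not the Windows team’s actual tooling.

```python
def check_no_growth(baseline, candidate,
                    tracked=("cpu_ms", "disk_mb", "memory_mb")):
    """Return the tracked resources where candidate exceeds baseline.

    An empty list means the change has "paid for" whatever it added.
    """
    return [r for r in tracked if candidate[r] > baseline[r]]

# Hypothetical measurements for one key scenario.
baseline = {"cpu_ms": 1200, "disk_mb": 40, "memory_mb": 300}

# A change that adds 20 MB of memory with no offsetting savings.
unpaid = {"cpu_ms": 1200, "disk_mb": 40, "memory_mb": 320}

# The same change after savings were found elsewhere.
paid_for = {"cpu_ms": 1180, "disk_mb": 40, "memory_mb": 300}

print(check_no_growth(baseline, unpaid))    # ['memory_mb']
print(check_no_growth(baseline, paid_for))  # []
```

The point of a rule like this is that the comparison is against the total, so a new feature can still land as long as something else gets cheaper.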
So what do we know about the performance issues we are having? We’ve categorized them as follows:
Virtual Memory Exhaustion – Large solutions and lots of feature usage are causing virtual memory (you’ve got 2GB on a 32-bit OS) to fill up and for VS to become unstable and crash. It’s not exactly a performance problem because it doesn’t generally cause VS to slow down, however we treat it as one because it’s a resource exhaustion problem and many of the things you do to address it also improve actual memory usage and therefore improve performance too.
Leaks – Allocation and retention of memory that is never used again and accumulates over time. It ultimately leads to Virtual Memory Exhaustion but can also cause working set and other issues due to heap fragmentation that keeps the unwanted memory in the working set. I also like to separate this out because the way I think about it is different. With Virtual Memory, you are balancing trade-offs and it’s an optimization problem. With Leaks, it’s a zero tolerance policy. No leaks are acceptable in any circumstance. The way you test for them is different, etc. Beta 2 had a LOT of leaks.
Performance – Specific scenarios where the application doesn’t perform according to user expectations. This can be due to any bottlenecked resource (CPU, memory bandwidth, network bandwidth, disk I/O, etc). The two most common underlying causes are inefficient algorithms (too much CPU usage) and too large a working set (more memory used than the physical RAM can accommodate, resulting in thrashing pages in and out to the disk). Based on the feedback, we’ve identified a number of areas of common performance complaints, including Editing/Intellisense, WPF Designer, Debugger, and Project Load.
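Testing for leaks differs from testing speed: you repeat a scenario many times and check that retained memory does not grow. The sketch below shows that pattern using Python’s `tracemalloc` module as an analogy. It only tracks allocations made by the Python interpreter, so it is not a tool for native or .NET code, and the scenario functions, iteration counts, and cache are hypothetical.

```python
import gc
import tracemalloc

def retained_growth(scenario, warmup=3, iterations=20):
    """Run scenario repeatedly; return bytes still allocated afterwards
    that were not allocated before the measured loop began."""
    for _ in range(warmup):      # let one-time caches settle first
        scenario()
    gc.collect()
    tracemalloc.start()
    before, _peak = tracemalloc.get_traced_memory()
    for _ in range(iterations):
        scenario()
    gc.collect()
    after, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before

_cache = []  # hypothetical global that is never cleared

def leaky_scenario():
    # Keeps a reference forever: memory is retained on every run.
    _cache.append(bytearray(10_000))

def clean_scenario():
    # Allocates the same amount but releases it before returning.
    data = bytearray(10_000)
    del data

print(retained_growth(leaky_scenario))  # grows with the iteration count
print(retained_growth(clean_scenario))  # stays close to zero
```

Under a zero tolerance policy, any scenario whose retained memory scales with the number of iterations is a bug to fix, not a trade-off to tune.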
Yesterday, I blogged about performance gains in editing/intellisense (including before and after videos). Over the next few weeks, I hope to do a post every couple of days on our efforts and the results of our performance work. Stay tuned and hopefully it will be both entertaining and informative. Since Beta 2, we have had some tremendous improvements and I’m confident we’re going to ship with good performance and high customer satisfaction, but it has certainly been a bit of a call to action for us.