Since Windows XP was released, teams at Microsoft have been hard at work collecting, aggregating and analyzing crash information. The disparate technologies involved bore different names and "brands". To try to make some sense of them (or further confuse everyone), I broke down the technologies we support by functionality and code name. The figure below illustrates the collection and use of error data.
In the Windows PC portion, applications can let Windows handle crashes or use the WER APIs. In most cases it is recommended that developers let Windows handle crashes and then monitor quality and reliability through the WER Services portal and its set of web services.
The Collect quadrant represents the technologies used to communicate with the Windows client. The technologies included in this area are Watson and Online Crash Analysis (OCA). Watson is used to collect application crashes (user mode) and application-defined events, while OCA is designed to collect memory dumps resulting from system crashes ("blue screen of death"). Both technologies collect crash information from the Windows client and allow developers to collect "secondary level data". Secondary level data can be full memory dumps, log files, registry settings or WMI queries. All data collected is covered by the Microsoft data collection policy. Software and driver developers can request secondary level data collection through the WinQual site. Subjective user feedback may be collected through the Responses mechanism described later.
The Aggregate quadrant is where incoming data is "sliced and diced" based on fault signature dimensions. Subsets of the data stored in the Watson back-end warehouse are exposed to Windows developers through many facets. Some examples of the data available in the Watson data warehouse are failures pivoted by executable and version, or by OS version and language, as well as many other options (more on those in future posts...).
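To make the "slice and dice" idea concrete, here is a minimal Python sketch of pivoting failure reports by a chosen set of dimensions. It is only an illustration: the report fields (`exe`, `version`, `os`, `lang`), the sample data and the `pivot` helper are all made up for this example and do not reflect the actual Watson warehouse schema or any WER API.

```python
from collections import Counter

# Hypothetical failure reports. The field names and values are
# illustrative assumptions, not the real Watson data model.
reports = [
    {"exe": "app.exe", "version": "1.0", "os": "XP", "lang": "en-US"},
    {"exe": "app.exe", "version": "1.0", "os": "Vista", "lang": "de-DE"},
    {"exe": "app.exe", "version": "1.1", "os": "XP", "lang": "en-US"},
    {"exe": "tool.exe", "version": "2.0", "os": "XP", "lang": "en-US"},
]


def pivot(reports, *dims):
    """Count reports grouped by the given dimensions (hypothetical helper)."""
    return Counter(tuple(report[dim] for dim in dims) for report in reports)


# Failures pivoted by executable and version.
by_exe_version = pivot(reports, "exe", "version")

# The same reports pivoted by OS version and language.
by_os_lang = pivot(reports, "os", "lang")
```

Each pivot is just a count of reports per combination of values, so adding another facet amounts to passing a different set of dimension names.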
The Prioritize phase is all about making the gobs and gobs of data we collect into useful information. A great example of how some teams at Microsoft use this approach is an internal tool called AutoBug. AutoBug takes symbolic data (results of debugger analysis of memory dumps) from application crashes and uses the results to open bugs in our bug database. Those bugs are then assigned to the developers who own the faulting code. This type of integration between the development process and WER data allows development organizations to address issues earlier and more efficiently.
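The general shape of that workflow can be sketched in a few lines of Python. This is not how AutoBug is implemented (it is an internal tool and its details are not described here). The owner table, the analyzed crash tuples, the `open_bugs` function and the "triage" fallback owner are all assumptions invented for illustration.

```python
from collections import Counter

# Hypothetical mapping from faulting module to the owning developer.
OWNERS = {"render.dll": "alice", "net.dll": "bob"}

# Hypothetical debugger analysis results: (faulting module, function).
analyzed_crashes = [
    ("render.dll", "DrawFrame"),
    ("net.dll", "SendPacket"),
    ("render.dll", "DrawFrame"),
    ("render.dll", "DrawFrame"),
    ("codec.dll", "Decode"),
    ("net.dll", "SendPacket"),
]


def open_bugs(crashes, owners, default_owner="triage"):
    """Create one bug per distinct fault, ordered by hit count (sketch)."""
    bugs = []
    for (module, function), hits in Counter(crashes).most_common():
        bugs.append({
            "title": f"Crash in {module}!{function}",
            "owner": owners.get(module, default_owner),
            "hits": hits,
        })
    return bugs


bugs = open_bugs(analyzed_crashes, OWNERS)
for bug in bugs:
    print(bug["hits"], bug["owner"], bug["title"])
```

Ordering by hit count puts the most frequently hit faults first, which is the prioritization step. Faults in modules with no known owner fall back to a shared triage queue in this sketch.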
When issues are fixed or well understood, it is time to Respond to error reports. The WER infrastructure allows software developers to interact with users after a crash: point them to fixes, explanations and support, and even ask them for more details on the circumstances of the crash. Developers can manage responses through WinQual as well as view the quality of their responses as measured through customer satisfaction surveys. More information about Responses in future posts...
Lead Program Manager, WER Services