Larry goes to Layer Court

Two weeks ago, my boss, another developer in my group, and I had the opportunity to attend "Layer Court".

Layer Court is the end product of a really cool part of the quality gate process we've introduced for Windows Vista.  This is a purely internal process, but the potential end-user benefits are quite cool.

As systems get older, and as features get added, systems grow more complex.  The operating system (or database, or whatever) that started out as a 100,000 line of code paragon of elegant design slowly turns into fifty million lines of code that have a distinct resemblance to a really big plate of spaghetti.

This isn't something specific to Windows, or Microsoft, it's a fundamental principal of software engineering.  The only way to avoid it is extreme diligence - you have to be 100% committed to ensuring that your architecture remains pure forever.

It's no secret that regardless of how architecturally pure the Windows codebase was originally, over time, lots of spaghetti-like issues have crept into the  product over time.

One of the major initiatives that was ramped up with the Longhorn Windows Vista reset was the architectural layering initiative.  The project had existed for quite some time, but with the reset, the layering team got serious.

What they've done is really quite remarkable.  They wrote tools that perform static analysis of the windows binaries and they work out the architectural and engineering dependencies between various system components.

These can be as simple as DLL dependencies (program A references DLLs B and C, DLL B references DLL D, DLL D in turn references DLL C), they can be as complicated as RPC dependencies (DLL A has a dependency on process B because DLL A contacts an RPC server that is hosted in process B).

The architectural layering team then went out and assigned a number to every single part of the system starting at ntoskrnl.exe (which is the bottom, at layer 0).

Everything that depended only on ntoskrnl.exe (things like win32k.sys or kernel32.dll) was assigned layer 1 , the pieces that depend on those (for example, user32.dll) got layer 2, and so forth (btw, I'm making these numbers up - the actual layering is somewhat more complicated, but this is enough to show what's going on).

As long as the layering is simple, this is pretty straightforward.  But then the spaghetti problem starts to show up.  Raymond may get mad, but I'm going to pick on the shell team as an example of how a layering violation can appear.  Consider a DLL like SHELL32.DLL.  SHELL32 contains a host of really useful low level functions that are used by lots of applications (like PathIsExe, for example).  These functions do nothing but string manipulation of their input functions, so they have virtually no lower level dependencies.   But other functions in SHELL32 (like DefScreenSaverProc or DragAcceptFiles) manipulate windows and interact with large number of lower components.  As a result of these high level functions, SHELL32 sits relatively high in the architectural layering map (since some of its functions require high level functionality).

So if relatively low level component (say the Windows Audio service) calls into SHELL32, that's what is called a layering violation - the low level component has taken an architectural dependency on a high level component, even if it's only using the low level functions (like PathIsExe). 

They also looked for engineering dependencies - when low level component A gets code that's delivered from high level component B - the DLLs and other interfaces might be just fine, but if a low level component A gets code from a higher level component, it still has a dependency on that higher level component - it's a build-time dependency instead of a runtime dependency, but it's STILL a dependency.

Now there are times when low level components have to call into higher level components - it happens all the time (windows media player calls into skins which in turn depend on functionality hosted within windows media player).  Part of the layering work was to ensure that when this type of violation occurred that it fit into one of a series of recognized "plug-in" patterns - the layering team defined what were "recognized" plug-in design patterns and factored this into their analysis.

The architectural layering team went through the entire Windows product and identified every single instance of a layering violation.  They then went to each of the teams, in turn and asked them to resolve their dependencies (either by changing their code (good) or by explaining why their code matches the plugin pattern (also good), or by explaining the process by which their component will change to remove the dependency (not good, because it means that the dependency is still present)).   For this release, they weren't able to deal with all the existing problems, but at least they are preventing new ones from being introduced.  And, since there's a roadmap for the future, we can rely on the fact that things will get better in the future.

This was an extraordinarily painful process for most of the teams involved, but it was totally worth the effort.  We now have a clear map of which Windows components call into which other Windows components.  So if a low level component changes, we can now clearly identify which higher level components might be effected by that change.  We finally have the ability to understand how changes ripple throughout the system, and more importantly, we now have mechanisms in place to ensure that no lower level components ever take new dependencies on higher level components (which is how spaghetti software gets introduced).

In order to ensure that we never introduce a layering violation that isn't understood, the architectural layering team has defined a "quality gate" that ensures that no new layering violations are introduced into the system (there are a finite set of known layering violations that are allowed for a number of reasons).  Chris Jones mentioned "quality gates" in his Channel9 video, essentially they are a series of hurdles that are placed in front of a development team - the team is not allowed to check code into the main Windows branches unless they have met all the quality gates.  So by adding the architectural layering quality gate, the architectural layering team is drawing a line in the sand to ensure that no new layering violations ever get added to the system.

So what's this "layer court" thingy I talked about in the title?  Well, most of the layering issues can be resolved via email, but for some set of issues, email just doesn't work - you need to get in front of people with a whiteboard so you can draw pretty pictures and explain what's going on.  And that's where we were two weeks ago - one of the features I added for Beta2 restored some functionality that was removed in Beta1, but restoring the functionality was flagged as a layering violation.  We tried, but were unable to resolve it via email, so we had to go to explain what we were doing and to discuss how we were going to resolve the dependency.

The "good" news (from our point of view) is that we were able to successfully resolve the issue - while we are still in violation, we have a roadmap to ensure that our layering violation will be fixed in the next release of Windows.  And we will be fixing it :)