Software archaeology

There are times when I think my job is the same as an archaeologist's.

Rick touched on this a bit on his “Anatomy of a Software Bug” post (an excellent read, btw, if you haven’t already seen it).

Code, like people, gets old.  And, just like old people, old code tends to be pretty brittle.

This is especially true after the original developer of the code has moved on to greener pastures and the remaining members of the team are left to maintain it.  Since they don’t always understand the details of the implementation, they’re not willing to modify the code, for fear of breaking it.  But they’re still tasked with adding new features, fixing bugs, and so on.

As a result, they tend to make small, surgical changes that affect the component as little as possible.  And the more these surgical changes accumulate, the more fragile the code gets.

As these changes build up, the code in the component starts to resemble a piece of conglomerate rock.  The more people add onto the base, the more blobs of other rocks get added to the conglomerate.

When I work on code like this, it sometimes feels like I’m an archaeologist digging through the layers of an ancient city – if you pay really close attention, you can see which developers wrote which portions of the code.

One of the clues I have to work with is tab style – different developers use different tabbing conventions.  Similarly, different developers use different bracing styles and different naming conventions for variables.

I’ve been working in winmm.dll for the past several months, and it is a classic example of this.  The base code is about fifteen years old; it was written for Windows 3.1.  Over the years, it’s been ported to Win95, then to Windows NT, then to Win64.  The code’s been tweaked, support for new technologies has been added (WDM, PnP), and functionality’s been replaced (the current joystick logic is totally different from the original logic).  So there’s been a lot of stuff added to the original core.  But it’s all been bolted on piecemeal.

In addition, since the code was written, Microsoft as a whole has gotten a whole lot better at software engineering – the coding standards and conventions that are standard today are orders of magnitude better than what was done 15 years ago, so it’s sometimes painful to see the old code.  As I’ve written before, back in the late 1980’s, code size was king and memory was expensive, so design decisions were made to favor techniques that reduced code size at the cost of code clarity.

By the way, this isn’t a dig on WINMM, or Microsoft’s coding – ANY sufficiently old piece of code will have these characteristics.  I’m sure that you could perform the exact same archaeological analysis of any old piece of code and find the exact same kind of issues.


Comments (16)

  1. Anonymous says:

    I appreciate that this laying down of different strata of features and fixes over several generations occurs in all environments, but how much longer do you think this code can meaningfully be included in Windows?

We’ve seen issues such as Blaster which affect a decade’s worth of products even after the extensive code reviews carried out on Server 2003. Server 2003 only escaped Sasser due to other architectural modifications. If your own engineers cannot detect these subtle, yet compromising bugs through code reviews, perhaps it is time to think about rewriting some of these older components.

Even with all Microsoft’s resources, you seem too busy building the next generation’s subsystems and APIs to spare the manpower to refactor the legacy stuff, which you will no doubt include in the finished product to support your incredible binary backwards compatibility targets. It seems to be a policy of "don’t touch what works" even if there’s no longer anyone around who understands why, or even how, it works. The security training of all your developers won’t increase the security of the platform if all the old bugs are still sitting there waiting to be exploited.

This is perhaps an area where the open source arena has an advantage. They are less pressed for time and developer resources, and someone can take it upon themselves to do some refactoring just for aesthetic reasons or to scratch an itch.

I don’t know for sure how much this happens in reality, though it was interesting to see that some of the supposed source stolen from SCO had in fact already been removed from the kernel and rewritten due to being ugly and inefficient.

Perhaps there should be another development team whose task is to comb the legacy subsystems and bring them up to a similar standard to all the newer stuff. With the privilege of running on 90% of the world’s desktop computers ought to come the responsibility to maintain them to the best of your ability.

  2. Anonymous says:

Actually, refactoring IS a constant effort; there is a significant amount of it happening in the Longhorn timeframe. I can’t talk about the refactoring that’s going on, but it IS happening.

    One issue that Microsoft has to deal with that open source vendors do not have to deal with is the issue of compatibility.

We can’t break ANYTHING when we release a new version of the operating system (ok, we can break things that exploit security holes). If there’s an application out there that depends on a behavior of the current system, we MUST make the current system work with the app. Even if the app is "broken".

    Open source vendors don’t have that level of compatibility. If refactoring breaks something, then it’s the responsibility of the guy whose component was broken to fix it, not the responsibility of the group that did the refactoring.

There’s actually been a bunch written about this recently. I’m swamped at work right now, so I can’t dig up the links (there was one about the maintainability of Linux systems last week, for example), but the gist of the articles is that the constant refactor/break cycle in open source development is considered to be one of the things that hurts open source.

Customers like to be able to rely on the fact that their software works from one release of Windows to the next. And if their software stops working, they don’t buy new copies of Windows.

    Raymond Chen’s written extensively on this over the past several months, check out his ‘blog (and read the commentary threads).

  3. Anonymous says:

I’ve been reading Raymond’s blog for several months now; it’s often fascinating.

I do think it’s a tremendous achievement to have been able to maintain binary compatibility across 5 development cycles and two entirely different kernels.

That in-house VB line-of-business applications created by novice developers largely continue to function over such transitions is probably one of the reasons that Windows remains so firmly entrenched in corporate environments. I would probably consider VB the killer app of Windows.

But, as my Panda antivirus is so keen on drawing to my attention, there are some 76,000 viruses that also depend on the various Win32, Office and scripting environments, as well as a healthy ecosystem of spyware and trojans. Wouldn’t it be nice to be able to leave them all behind, or at least contain them to an optional compatibility subsystem?

Apple seems to have been able to make such a transition without too much disruption, though Apple has a much smaller and more fiercely loyal install base.

It’s good to hear you have people busy refactoring. Hopefully some of the changes will be backported to the earlier releases.

The rapid redevelopments in open source are somewhat more acceptable because most application code is also distributed in source form; a distribution can include the most common programs and make sure everything is recompiled against the latest kernel and libc, etc. – a very different environment than exists on the Windows platform.

    Perhaps with the wider adoption of auto-updating distributions through ClickOnce etc you can start to be more adventurous in improving the platform for its own sake rather than maintaining compatibility.

  4. Anonymous says:

    Actually Apple’s transition to OS X was quite painful from what I’ve heard. Essentially they (correctly, imho) decided to ditch application compatibility in favor of moving to a stable platform (and anyone who claims that MacOS versions before OSX were stable hasn’t actually used the platform).

    In many ways, what Apple did was very similar to the changes that Microsoft went through when moving from Win16 to Win9x, and then to NT.

    The big difference is that my Win16 apps still run on NT, while many OS 8 apps don’t run on OS X.

  5. Anonymous says:

I’d tend to think that Apple made a transition similar in concept to jumping from Windows 98 to Longhorn: a change of both the kernel architecture and the primary APIs and subsystems at the same time. Admittedly it took until version 10.2 or so for things to fully shake themselves out, but the result seems to be very successful and clean.

Windows was left in the limbo situation of applications written to the old APIs still functioning on the newer kernel, but many of them fundamentally ignoring the new features and security model. Which leaves us in the rather messy situation where many of the capabilities of NT are completely ignored, and a huge number of people are running their systems with full privileges, leaving them rather vulnerable.

I’ve read the MSDN article "Focus on Least Privilege" and it looks like progress is being made, e.g. the copy-on-write system for the file system and registry, but it will be a long time coming, and many people are already having negative eXPeriences that may put them off looking to Windows in the future.

  6. Anonymous says:

    Could be. But the jump to Longhorn won’t be as far as it appears for 99.9% of the apps out there.

    The changes that are coming up for Longhorn aren’t going to involve externally visible changes to the existing APIs, instead there’s just a lot of stuff going to be added for new apps to use.

    There will be a LOT of eye candy and usability improvements, but those won’t actually affect existing applications (with the possible exception of shell extensions).

  7. Anonymous says:

Since you’re working on winmm, maybe you know the answer to this question: is there a way to play back a MIDI file stored in a resource using an MCI command or some other high-level API? Something like PlaySound and SND_RESOURCE.

  8. Anonymous says:

    B.Y. I asked the MIDI developer (it’s great working in multimedia).

    His answer:

"Not via MCI. The closest you could get is to use the midiStreamXxx functions, but those functions don’t use standard MIDI files, so you would have to convert a MIDI file into MIDIEVENT structs and store them in a binary blog in your resource file.

    I’m pretty sure there is a quick and easy way to play MIDI files using DShow, but I’m not familiar enough with that API to say how difficult it is."

  9. Anonymous says:

Thanks for asking. I was trying to avoid parsing out MIDI events. I did read the DMusic docs and they seemed very complicated. I did what seemed the easiest way to me: save the MIDI resource to a temp file and play it via MCI.
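    That temp-file approach can be sketched in Win32 C roughly as follows. This is a minimal sketch, not the commenter's actual code; the resource name "IDR_MYMIDI" and resource type "MIDI" are hypothetical, and error handling is abbreviated:

    ```c
    /* Extract a MIDI resource, write it to a temp file, play it via MCI. */
    #include <windows.h>
    #include <mmsystem.h>   /* mciSendString; link with winmm.lib */

    BOOL PlayMidiResource(HINSTANCE hInst)
    {
        /* Locate the resource in the module (names are hypothetical) */
        HRSRC hRes = FindResource(hInst, TEXT("IDR_MYMIDI"), TEXT("MIDI"));
        if (!hRes) return FALSE;

        HGLOBAL hData = LoadResource(hInst, hRes);
        DWORD   cb    = SizeofResource(hInst, hRes);
        void   *pData = LockResource(hData);
        if (!pData || cb == 0) return FALSE;

        /* Write the resource bytes out to a temporary file */
        TCHAR szDir[MAX_PATH], szFile[MAX_PATH];
        GetTempPath(MAX_PATH, szDir);
        GetTempFileName(szDir, TEXT("mid"), 0, szFile);

        HANDLE hFile = CreateFile(szFile, GENERIC_WRITE, 0, NULL,
                                  CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
        if (hFile == INVALID_HANDLE_VALUE) return FALSE;
        DWORD cbWritten;
        WriteFile(hFile, pData, cb, &cbWritten, NULL);
        CloseHandle(hFile);

        /* Hand the file to MCI; "type sequencer" selects the MIDI sequencer */
        TCHAR szCmd[MAX_PATH + 64];
        wsprintf(szCmd, TEXT("open \"%s\" type sequencer alias tune"), szFile);
        if (mciSendString(szCmd, NULL, 0, NULL) != 0) return FALSE;
        mciSendString(TEXT("play tune wait"), NULL, 0, NULL);
        mciSendString(TEXT("close tune"), NULL, 0, NULL);

        DeleteFile(szFile);
        return TRUE;
    }
    ```

    The obvious caveat is the one inherent in the approach: the MIDI data briefly exists on disk where other processes can see it, which is the price of MCI only accepting filenames.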

    Oh, don’t get me started on DShow, I hate that API.

  10. Anonymous says:

Hey! WinMM helpdesk! 🙂 I had a similar problem recently, too. I wanted to get frames from an AVI file that is stored in memory. But I could not get AVIFile to use that; the only interface it offers requires a filename. So I had to preprocess the file (split it into frames) and use only the lower layers of the library. And it works fine, so I am satisfied with it. But if there’s a simpler way, it would be nice to hear about it (OK, I know, it’s not exactly winmm).

    "store them in a binary blog" — well that would be a proper geek blog! Blog in binary! :o)

  11. Anonymous says:

    Code refactoring. *shiver*

I’m refactoring an application right now and I’m taking the leap. I’m wholesale replacing various pieces of the program in order to make it more functional and streamlined. In some places I’ve cut the number of SQL queries down from 16–20 to 1. Memory caching of common result sets is helping tons as well.

I’m definitely going to have to pay the piper, however. The changes I’m making are broad strokes against an application that I’m in charge of, but that supports our business. The testing and release of this version will have to be done with extra care and time. And I’ll probably miss one of the situations the original code was modified for; those strange things that only happen once in a blue moon.

Luckily I’m not Microsoft, and I can afford to take these chances. I’d love to see an OS, API or framework written from the ground up by Microsoft that was NOT binary compatible with past releases. I think it would rock.

  12. Anonymous says:

    You say "back in the late 1980’s, code size was king, memory was expensive, so design decisions were made to favor techniques that reduced code size at the cost of code clarity."

But it would seem that in striving for code clarity, size and speed have been sacrificed. How about keeping size and speed as priorities and focusing on better commenting!

  13. Anonymous says:

    scooter: It’s not always clear that size and speed have to be sacrificed for clarity.

The NT filesystems’ source code is some of the best-written code I’ve ever seen. The filesystems are written entirely in C, but they’re very well structured and utterly readable. Gary Kimura (dev lead for the filesystems team) did a remarkable job on them.

    And they’re darned fast too.

I just wish people had realized this back then.

  14. Anonymous says:

    Re: slamming OS X:

    Funny that the Objective-C NS* classes written by NeXT back in the 80’s still live on, at least as interfaces… and still seem quite modern. During that time, MS has worked up and abandoned how many application frameworks?

  15. Anonymous says:

    Binary compatibility breakage is a very serious problem for free software, and especially Linux. What’s shameful is that the worst culprits are the vital things like the C runtime library (glibc/stdc++).

    Just this week I "upgraded" glibc on my Linux system and about half of my core applications broke.

    relocation error: /usr/local/valgrind-2.0.0/lib/valgrind/ symbol __libc_sigaction, version GLIBC_2.2 not defined in file with link time reference

    bah. I am considering switching from Linux to FreeBSD; they seem to be doing a little better job of this.
