Critical Driver or Cargo Cult Programming?


I’ve been self-hosting Vista on my laptop since sometime in January.  Every Monday morning, without fail, I installed the latest build available from the “main” Windows branch and tried it.

There have been good builds and bad builds: the first few were pretty painful, but everything since sometime in March has been wonderfully smooth.

But sometime late in May, things changed for the worse.  Weekly builds installed just fine on my main development machine, but my laptop would get about three quarters of the way through the install and then stop after a reboot, complaining about a problem with the critical system driver <driver>.sys.

Of course, I filed a bug on the problem and moved on – every week I’d update my laptop and it’d fail.  While I was away on vacation, the guys looking into the bug finally figured out what was happening. 

The first part of the problem was easy – something was causing <driver>.sys to fail to load (we don’t know what).  But that didn’t explain the unbootable system.

Well, the <driver>.sys driver is the modem driver for my laptop.  Eventually one of the setup devs figured out the root cause.  For some totally unknown reason, their INF has the following lines:

[DDInstall.Services]
AddService=<driver>_Service_Inst

[<driver>_Service_Inst]
StartType=0

If you go to msdn and look up DDInstall.Services, you get this page.

If you follow the documentation a bit you find the documentation for the service install section which describes the StartType key – it’s the same as the start type for Windows services.

In particular, you find:

StartType=start-code
Specifies when to start the driver as one of the following numerical values, expressed either in decimal or, as shown here, in hexadecimal notation.
0x0 (SERVICE_BOOT_START)
Indicates a driver started by the operating system loader. This value must be used for drivers of devices required for loading the operating system.

0x1 (SERVICE_SYSTEM_START)
Indicates a driver started during operating system initialization.

This value should be used by PnP drivers that do device detection during initialization but are not required to load the system.

For example, a PnP driver that also can detect a legacy device should specify this value in its INF so that its DriverEntry routine will be called to find the legacy device, even if that device cannot be enumerated by the PnP manager.

0x2 (SERVICE_AUTO_START)
Indicates a driver started by the service control manager during system startup.

This value should never be used in the INF files for WDM or PnP device drivers.

0x3 (SERVICE_DEMAND_START)
Indicates a driver started on demand, either by the PnP manager when the corresponding device is enumerated or possibly by the service control manager in response to an explicit user demand for a non-PnP device.

This value should be used in the INF files for all WDM drivers of devices that are not required to load the system and for all PnP device drivers that are neither required to load the system nor engaged in device detection.

0x4 (SERVICE_DISABLED)
Indicates a driver that cannot be started.

This value can be used to temporarily disable the driver services for a device, but a device/driver cannot be installed if this value is specified in the service-install section of its INF file.

So in this case, the authors of the modem driver decided that their driver was a boot time critical driver – which, as the documentation clearly states is only intended for drivers required to load the operating system.
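For readers who want to check an INF of their own, here is a minimal Python sketch of the lookup described above: it finds every section with a StartType entry and flags the ones that ask for boot start. The function names, the sample text, and the `example_Service_Inst` section name are all hypothetical, and the parsing is deliberately simplistic (it treats everything after a `;` as a comment and ignores the rest of the INF syntax), so treat it as an illustration rather than a real INF validator.

```python
import re

# Start codes as listed in the documentation quoted above.
START_TYPES = {
    0: "SERVICE_BOOT_START",
    1: "SERVICE_SYSTEM_START",
    2: "SERVICE_AUTO_START",
    3: "SERVICE_DEMAND_START",
    4: "SERVICE_DISABLED",
}


def find_start_types(inf_text):
    """Return {section name: start code} for each section with a StartType entry."""
    found = {}
    section = None
    for raw in inf_text.splitlines():
        # Simplification: anything after ';' is treated as a comment.
        line = raw.split(";", 1)[0].strip()
        if not line:
            continue
        header = re.fullmatch(r"\[(.+)\]", line)
        if header:
            section = header.group(1).strip()
            continue
        key, sep, value = line.partition("=")
        if sep and section and key.strip().lower() == "starttype":
            # Base 0 accepts decimal ("0", "3") and hex ("0x3") spellings.
            found[section] = int(value.strip(), 0)
    return found


def boot_start_sections(inf_text):
    """Return the names of sections that request boot start (StartType=0)."""
    return [name for name, code in find_start_types(inf_text).items() if code == 0]


# Hypothetical sample that mirrors the abbreviated snippet in the post.
SAMPLE = """\
[DDInstall.Services]
AddService=example_Service_Inst

[example_Service_Inst]
StartType=0   ; boot start
"""

for name, code in find_start_types(SAMPLE).items():
    print(name, START_TYPES.get(code, "unknown start code"))
```

Run against a modem driver’s INF, anything `boot_start_sections` returns deserves a second look, since the documentation reserves that value for drivers of devices needed to load the operating system.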

So I’ll leave it up to you to decide – is this an example of cargo cult programming, or did the authors of this modem driver REALLY think that the driver is a critical system driver?

What makes things worse is that this is a 3rd party driver – we believe that their INF is in error, but we can’t fix it because it’s owned by the 3rd party.  Our only choice is to baddriver it and prevent Vista from loading that particular driver.  The modem chip in question hasn’t been made for many, many years, and the vendor for that chip has absolutely no interest in supporting it on Vista, so we can’t get it fixed (the laptop is old enough that it’s out of OEM support, so there’s no joy from that corner either – nobody wants to support this hardware anymore).

Please note: This is NOT an invitation for a “If only the drivers were open source, then you could just fix it” discussion in the comments thread.  The vendor for the modem driver owns the rights to their driver, they get to choose whether or not they want to support it, not Microsoft.

 

Comments (37)

  1. Anonymous says:

    Not really related in any way, but I quickly tried 5381 before 5384 (beta 2), and on my laptop 5381 for reasons unknown felt faster and didn’t seem to have some of the nasty issues present in B2, which I am running now. I’ve disabled Indexing and UAP but still 5381 felt faster. But since I have no data to back this up I just might be hallucinating.

    Looking at the performance events is also tough, since if one place causes disk thrashing then it’ll affect a dozen other things, and the log shows it as if every one of those services or drivers were stalling.

  2. Anonymous says:

    This sounds like a job for Drew in the past!

    It’s not so hard to make that fix and re-sign everything with your test cert. As long as the test root certificate is installed on your box everything will work.

    For that matter the driver wasn’t loading because it doesn’t have a signature.

    This might help:

    http://download.microsoft.com/download/9/c/5/9c5b2167-8017-4bae-9fde-d599bac8184a/64bitDriverSigning.doc

    Anyone outside Microsoft can go buy a certificate as explained in the doc and make your scenario work.

    I guess with all the time spent externally evangelizing the driver changes in Vista there wasn’t enough attention paid to internal developers. 🙁

    My job here is done. Back to being Drew in the present . . .

  3. Anonymous says:

    Why not support driver compatibility workarounds like you do for applications?  Just have a compatibility setting that essentially states "for driver X, override setting Y in the INF."

    So, in the case of this modem driver, disallow the use of "StartType=0".

    I don’t think you will violate any copyright laws by changing the way you interpret a data file. Otherwise every new version of a compiler could technically be illegal.

    Oh, and please tell me this sort of thing would not happen today, and would have been detected by the WHQL process.  I’d like to think that certification does serve a purpose.

  4. Unfortunately this is a 32bit platform and the problem isn’t that the driver isn’t signed.  I wish it was, that would make it "easy".

    We don’t know what’s wrong with the driver, and making it a critical boot driver makes it essentially undebuggable (the kernel debugger doesn’t work on critical boot driver errors).

  5. baljemmett says:

    Presumably there’s something different between the environment Vista is presenting to the driver and that presented by the version of Windows it was originally released for?  Or have the StartType codes changed at some point since it was written?

    Either way, I guess it’s one of those issues where a 3rd party does something bone-headed and whatever you do, MS get to take the blame 😐  

  6. Anonymous says:

    > If you go to msdn and look up DDInstall.Services, you get

    > this page.

    You linked to

    http://msdn.microsoft.com/library/default.asp?url=/library/en-us/devinst_r/hh/DevInst_r/inf-format_d402e9dc-1a6f-423c-b80e-43dd5779b4cc.xml.asp

    You need to link to

    http://msdn.microsoft.com/library/default.asp?url=/library/en-us/DevInst_r/hh/DevInst_r/inf-format_10bcb43e-0799-4dff-981f-2d8c4bf8f835.xml.asp

    And this was a lucky day because "sync toc" worked.

    If you agree with customers that MSDN’s table of contents could be improved, perhaps you could suggest that to the maker?

    Meanwhile I’ll bet you could fix the inf file yourself.  If the driver works when loading on demand, then you’ll still be able to use it.  Even cargo cult inf file editing persons have been known to accomplish such feats.

  7. Anonymous says:

    Running on a chk build may give you clues about why the driver isn’t loading.

    I have no idea why it would be boot start.

    I thought all boot start drivers needed signatures regardless of architecture. Maybe the design has changed (again), though.

    And I realized that I was spreading a piece of bad info in my last comment. I forgot that CI doesn’t actually use the cert stores. And they may be planning to remove CI’s approving of test-signed code by RTM. I have no idea where the final decision on that fell.

  8. Anonymous says:

    Or then again maybe you should ignore what I said about a chk build. I just re-read what you said about the debugger. The kd won’t do anything? Crazy. I’d hope that there’s a log somewhere . . .

  9. JeffCurless says:

    Why can’t the kernel debugger work?  I haven’t debugged critical boot time drivers, as I work on drivers that load a lot later.  Of course… you could have always changed the .inf to have the driver start later, rebuilt the install image, and debugged the driver that way.  Note, I’m not saying to release it that way.  Of course, that debugging would just be for your information, but it might be interesting and point to a bug in windows.

  10. dispensa says:

    Um, not to ask the obvious question, but why don’t you change the start type?

  11. Manip says:

    Doesn’t Windows allow you to patch stuff on load? Thus, couldn’t you patch the INF file as it is read without modifying the original file?

    [Note: I don’t actually know anything about how Windows Patches files for compatibility]

  12. Anonymous says:

    He could change it for his single install, but that’s not the point.

    The point is that the OEM provided INF sets the start type incorrectly, and Microsoft cannot change it, because they don’t own the driver.  The problem comes in including this driver with Vista, as it will cause this same problem if you happen to own that particular device.

  13. Baljemmett, something changed, we don’t know what.  The start types haven’t changed since long before NT 3.1 shipped.

    Jeff: Why can’t it be debugged? Because the kernel debugger is loaded after the critical drivers are loaded, it can’t be used to debug them (at least I can’t, others might be able to) 🙁

    Dispensa, we can’t change the start type because it’s not our driver.  Maybe it really DOES need to be a critical driver for some reason we’re not aware of.  And I can’t change the start type after the fact because I can’t boot the OS to change the start type because a critical driver isn’t loading (Catch 22).

    And Manip, we don’t do that appcompat stuff for drivers (as far as I know, I’m not a driver guru).  And the IHV owns their INF file, we don’t.  What happens if we decide to unilaterally change their INF file and we break stuff by doing it?  The IHV has told us that they’re not supporting this device any more, and the OEM has explicitly removed my laptop from their list of supported machines (for ANY operating system), so there’s not much we can do about it at this point.

  14. Anonymous says:

    The solution for the catch-22 is, of course, to boot another OS and change the setting from outside. Having just installed Vista, I noticed you can boot from the CD, choose recovery something and get a command prompt. From there, you could mount HKLM of the borked OS, and change the start type in the registry.

    Not a long-term solution though..

  15. Anonymous says:

    Just wondering – how old is the laptop? And could you please at least tell us who the vendor is, so the rest of us can try to use an alternate vendor if we are planning on buying kit that we intend to use for more than X years? (Where X is the age of your laptop)

    It’s not libel if it’s true.

  16. Jeff Parker says:

    You know this line might be there and be as a critical driver being the age of it and everything. But I seem to remember something in NT 4 where there was an option at first login to log in through VPN or dial up or something like that. If this driver wasn’t loaded then you could not authenticate. I might be off here I am going strictly from memory on an old NT 4 setup I had once.

  17. Anonymous says:

    I’d like to point out that most of you really haven’t answered Larry’s question yet.  

    Larry,

    I think Microsoft has done about all it can do.  While I am all for the "Raymond Chen" way of doing things, where Microsoft goes out of its way to make things work despite how incredibly crappy that stuff may be, sooner or later we must face the fact that the third party guys have to get on board.  

    In other words, Microsoft can’t do it all (despite coming very close).

    Document the issue so that customers can easily find out what’s going on via Google or MSN search and move on.

    James

  18. Jeff, the OS doesn’t get that far in booting – this failure occurs BEFORE ntoskrnl.exe is loaded.

    Jonathan, if I had access to a Vista DVD, I think I could fix it with emergency repair mode, but unfortunately, I don’t.

    Adam, I can’t tell you which laptop it is, because I don’t know if the place where I got the information from is company confidential (that’s why I obscured all the info in the post).

    What I will say is that it’s a 3 year old laptop, it’s long out of warranty, and it technically doesn’t even come close to meeting Vista’s hardware requirements (although it runs Vista just fine for me).

  19. Anonymous says:

    My guess is cargo cult programming. A system which required the modem driver before the kernel loaded would be a pretty messed up design, IMO.

  20. Anonymous says:

    Then Vista’s "Requirements" really aren’t requirements, are they?  Maybe the requirements need to be made more realistic.

  21. Anonymous says:

    Last I saw, in a Windows\Inf directory, every INF file other than those starting with OEMnn was provided by Microsoft on Microsoft’s Windows CD, and they SAID that they were provided by Microsoft.

    Microsoft can’t change the StartType line in an INF file provided by Microsoft because an OEM vendor owns an INF file provided by Microsoft?

    In old days, one could build a modified Windows CD containing a modified INF file and use that to do an install.  In Vista it’s more troublesome to build an image containing a modified INF file, but somewhere I read about some company that does rebuild images and test them on occasion.

  22. Anonymous says:

    Perhaps talk with the driver owner and ask them if they wish to include the driver in the default driver set of Vista? Then people don’t need to install the driver from disk, and the fixed default version will "just work".

    Honestly, if I owned a piece of old hardware on the system I’m installing Vista on, and found that Vista automatically recognized it, I wouldn’t bother to reinstall the driver from the old driver disk, especially when the old driver is not "made for Vista".

  23. nksingh says:

    From casual reading of the NTDEV mailing list at osronline (where many driver-writers seem to go), it seems like the whole field of creating/editing INF files is just cargo-cult scripting.  There was even talk of replacing INF with some XML-based configuration system, but driver-writers have been tweaking these files for so long that they just don’t want to throw them away.

    Based on that list, it seems like a lot of kernel-mode driver programming involves taking a sample and just modding it to work with one’s device.  The world will be a better place when everyone’s writing user-mode drivers and MSFT is the only one mucking about in the kernel.

  24. Manip says:

    *cough* Dell Latitude C610 *cough* 😉

  25. Anonymous says:

    Heh – this is, of course, part of the reason sensible Linux users avoid closed-source drivers (the other reasons being an unwillingness to maintain a stable ABI and a semi-religious devotion to OSS). Unfortunately, were Linux to catch on in any big way, you’d probably end up with users installing badly-written binary-only drivers left, right and center, just like they do with unsigned Windows drivers today, and with exactly the same results.

    To be honest, I don’t think there’s any real solution to the problem of bad drivers. It’s slightly odd, though, that Windows driver development would suffer so much from cargo cult development and Linux driver development doesn’t seem to that often (even though the documentation is often sparse and driver writers have full access to the source of the kernel and lots of other drivers). Perhaps it’s just less noticeable because they use more suitable drivers as a base…

  26. vince says:

    Your operating system is only as strong as its weakest link.  In this case it’s an outside vendor.  Having your OS dependent on outside forces, especially ones that are probably going for quick-and-easy rather than correct can always cause problems, no matter what the development model is.

    So in this case having the source code could have helped.  It wouldn’t necessarily have had to be open source, though. MS could have required the source be kept in escrow for just such an eventuality.  Or they could have a team that would review INF files before certifying a driver.  There are many ways this could have been avoided, but as you said it’s pretty much too late to do anything about it now.

  27. Manip, not that one 🙂  It turns out that there are at least two other laptop models from different OEMs with the same problem (another reason I didn’t mention it).

    Cheong, I believe that we asked them and they weren’t interested in supporting that chipset on Vista (reading between the lines in some emails).

  28. Anonymous says:

    > the OS doesn’t get that far in booting –

    > this failure occurs BEFORE ntoskrnl.exe is loaded.

    I doubt it.

    ntldr loads ntoskrnl, HAL, and all the boot start drivers into memory, and then transfers control to ntoskrnl.  ntoskrnl is responsible for initializing the drivers (e.g. calling DriverEntry, doing the PNP dance, etc.).  It is impossible for a driver to fail before ntoskrnl is loaded.

    Also, ntoskrnl initializes the kernel debugger VERY early in the boot process.  You absolutely can debug boot start drivers; I’ve done it many times.

  29. Anonymous says:

    "So in this case, the authors of the modem driver decided that their driver was a boot time critical driver – which, as the documentation clearly states is only intended for drivers required to load the operating system."

    That’s not accurate.

    "0x0 (SERVICE_BOOT_START)

    Indicates a driver started by the operating system loader.

    This value must be used for drivers of devices required for loading the operating system. "

    The documentation clearly states that drivers that are required for loading the operating system must use this start type.  It does NOT at ALL, let alone clearly or ambiguously, say that it is ONLY for drivers that are required for the loading of the operating system.

  30. Anonymous says:

    Why wouldn’t the company want to make such a minor change to support that hardware? Is it because they’re excruciatingly lazy, or because they think that if they make it work, they’re going to get support calls in the future from people trying to find out what’s up with potential bugs in the driver?

    I just hate to see old (but, evidently, perfectly serviceable) hardware go to waste because hardware vendors decide not to update their drivers anymore…

  31. Anonymous says:

    Ugh…not sure why anyone would *want* to run Vista on a Latitude C-series…it’s best to leave them chugging away on 2K/XP, which they do quite happily with RAM upgrades.

  32. Anonymous says:

    I know it’s not really the point, but wouldn’t disabling the modem through BIOS prevent the driver from wanting to load in the first place?

    (Yeah, I’m kind of assuming you don’t need it..)

  33. Anonymous says:

    >Please note: This is NOT an invitation for a "If only the drivers were open source,

    >then you could just fix it" discussion in the comments thread.  The vendor

    >for the modem driver owns the rights to their driver, they get to choose

    >whether or not they want to support it, not Microsoft.

    Thinking forward, I’d hope that MS would have just a little bit more control over the code that they ship with Vista.

    If you don’t have the source, you can’t do a thorough code review and make any worthwhile claims about stability or security. At least some user-space drivers limit their capability for damage to their domains, which is bad enough.

    If you don’t have permission to maintain the drivers, then you’ve given IHVs control over Vista’s market. One person’s aging laptop is another computer that won’t be getting Vista, whether MS wanted to sell another copy or not.

  34. Anonymous says:

    I hope this thread isn’t closed yet, ’cause Larry is so wrong it’s not even funny!

    Larry claimed to Jeff "this failure occurs BEFORE ntoskrnl.exe is loaded". Uhuh, sure Larry, sure.

    Let’s get our facts straight, shall we?

    The (specific/specified) HAL and (again, specific/specified) kernel (be it NTOSKRNL.EXE, *PA, or some other)  are loaded by the OS loader, basically as a pair. They have circular dependencies, which is wicked, but it’s solved by the fact they are loaded as a pair, and both PE’s imports are resolved before anything fails. We can for all intents and purposes treat them as one unit from now on, and they ARE loaded.

    All BOOT drivers are ALSO loaded into memory before the kernel even gets a chance to run a single byte of code. Failing to even load one of these drivers is obviously a catastrophic error, and prevents system boot.

    But to in this context claim that a BOOT-tagged driver is loaded (and somehow fails to execute) before ntoskrnl is even LOADED, is at best a misunderstanding.

  35. Mike, there are a set of drivers loaded by osloader before ntoskrnl and the HAL are loaded, these are the critical boot drivers – they include drivers like the disk drivers (which are required to load the rest of the OS).  By marking this driver as boot critical, it marked the driver as one of the drivers that is loaded before the OS (and thus before KD).  If you boot with tracing enabled, you see ntoskrnl.exe get loaded at the end of a dozen or so drivers (including the HAL).  This driver was one of the dozen or so.

    I suspect that if I got one of the ntldr developers engaged, I could figure out how to get KD working, but in this case it wasn’t worth the effort.

    Nar, there is a VERY small set of software shipped in the box that is owned by 3rd party developers, and we don’t have the source to those drivers.  In this case, the driver wasn’t in the box, but it WAS available on Windows Update, and Vista figured that out and installed it.

  36. ndiamond says:

    I used to have a machine where NTBOOTDD.SYS did indeed have to be loaded before the HAL and NTOSKRNL.EXE, because the HAL and NTOSKRNL.EXE were in a partition that the BIOS’s INT13 wouldn’t reach.

    I never figured out how to rename both a suitable version of ATAPI.SYS and a SCSI driver to both be C:\NTBOOTDD.SYS in order to use the boot loader’s menu sensibly.  Eventually I got an even larger internal hard drive and no longer had to put an alternate installation on an external drive.

    Meanwhile I still think Microsoft has the source code of a modem’s INF file and can change that one from boot start to demand start.