How my lack of understanding of how processes exit on Windows XP forced a security patch to be recalled


Last year, a Windows security update got a lot of flak for causing some machines to hang, and it was my fault. (This makes messing up a demo at the Financial Analysts Meeting look like small potatoes.)

The security fix addressed a category of attacks wherein people could construct shortcut files or other items which specified a CLSID that was never intended to be used as a shell extension. As we saw earlier, lots of people mess up IUnknown::QueryInterface, and if you pass the CLSID of one of these buggy implementations, Explorer would dutifully create it and try to use it, and then bad things would happen. The object might crash or hang or even corrupt memory and keep running (sort of).

To protect against buggy shell extensions, Explorer was modified to use a helper program called verclsid.exe whose job was to be the "guinea pig" and host the shell extension and do some preliminary sniffing around to make sure the shell extension passed some basic functionality tests before letting it run loose in Explorer. That way, if the shell extension went crazy, the victim would be the verclsid.exe process and not the main Explorer process.

The verclsid.exe program created a watchdog thread: If the preliminary sniffing took too long, the watchdog assumed that the shell extension was hung and the watchdog told Explorer, "Don't use this shell extension."

I was one of the people brought in to study this new behavior, poke holes in its design, poke holes in its implementation, review every line of code that changed and make sure that it did exactly what it was supposed to do without introducing any new bugs along the way. We found some issues, testers found some other issues, and all the while, the clock was ticking since this was a security patch and people enjoy mocking Microsoft over how long it takes to put a security patch together.

The patch went out, and reports started coming in that machines were hanging. How could that be? We created a watchdog thread specifically to catch the buggy shell extensions that hung; why isn't the watchdog thread doing its job?

That was a long set-up for today's lesson.

After running its sanity tests, the verclsid.exe program releases the shell extension, un-initializes COM, and then calls ExitProcess with a special exit code that means, "All tests passed." If you read yesterday's installment, you already know where I messed up.

The DLL that implemented the shell extension created a worker thread, so it did an extra LoadLibrary on itself so that it wouldn't get unloaded when COM freed it as part of CoUninitialize tear-down. When the DLL got its DLL_PROCESS_DETACH, it shut down its worker thread by the common technique of setting a "clean up now" event that the worker thread listened for, and then waiting for the worker thread to respond with an "Okay, I'm all done" event.

But recall that the first stage in process exit is the termination of all threads other than the one that called ExitProcess. That means that the DLL's worker thread no longer exists. After setting the event to tell the (nonexistent) thread to clean up, it then waited for the (nonexistent) thread to say that it was done. And since there was nobody around listening for the clean-up event, the "all done" event never got set. The DLL hung in its DLL_PROCESS_DETACH.
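
To make the sequence concrete, here is a minimal model of that handshake in Python. It is not the shell extension's Win32 code, which this article does not show. The function name `detach_cleanup`, the `worker_alive` flag, and the timeout are all assumptions made for illustration; the real DLL waited with no timeout, so a `False` result here stands for a wait that never returns.

```python
import threading


def detach_cleanup(worker_alive: bool, timeout: float = 1.0) -> bool:
    """Model of the DLL_PROCESS_DETACH shutdown handshake (hypothetical).

    Returns True if the "all done" event was signalled within `timeout`.
    The real code had no timeout, so False here corresponds to a hang.
    """
    cleanup_now = threading.Event()  # "clean up now" signal to the worker
    all_done = threading.Event()     # the worker's "okay, I'm all done" reply

    def worker() -> None:
        cleanup_now.wait()           # block until asked to clean up
        all_done.set()               # acknowledge

    if worker_alive:
        # Ordinary unload: the worker thread still exists and can answer.
        threading.Thread(target=worker, daemon=True).start()
    # Otherwise: process exit has already terminated the worker thread,
    # so nobody is listening for the clean-up event.

    cleanup_now.set()
    return all_done.wait(timeout)


if __name__ == "__main__":
    print("worker still running:", detach_cleanup(worker_alive=True))
    print("worker terminated:   ", detach_cleanup(worker_alive=False))
```

With the worker running, the reply arrives and the function returns `True`. With the worker gone, the event is set into the void and the wait times out, which in the untimed original means waiting forever.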

Why didn't our watchdog thread save us? Because the watchdog thread got killed too!
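
A watchdog thread shares the fate of every other thread in its process; a watchdog that lives in a different process does not. The sketch below, again a hypothetical Python model, shows a parent that applies the time limit itself and treats a child that fails to exit in time as a failure. This article does not say how the reissued update addressed the hang, so this is only an illustration of the general idea. The helper name `probe` and its return strings are invented for the example.

```python
import subprocess
import sys


def probe(child_code: str, time_limit: float) -> str:
    """Run `child_code` in a separate Python process and classify the result.

    Hypothetical helper: the time limit is enforced by the parent, so it
    still works if every thread in the child is stuck or already dead.
    """
    child = subprocess.Popen([sys.executable, "-c", child_code])
    try:
        exit_code = child.wait(timeout=time_limit)
    except subprocess.TimeoutExpired:
        child.kill()   # the parent is unaffected by the child's hang
        child.wait()   # reap the killed child
        return "hung"
    return "ok" if exit_code == 0 else "failed"


if __name__ == "__main__":
    print(probe("pass", 30.0))
    print(probe("import sys; sys.exit(3)", 30.0))
    print(probe("import time; time.sleep(60)", 0.5))
```

Because the parent waits for the child to exit rather than for a message sent before exit, a hang that occurs during process tear-down is caught like any other.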

Now, the root cause for all this was a buggy shell extension that did bad things in its DLL_PROCESS_DETACH, but blaming the shell extension misses the point. After all, it was the fact that there existed buggy shell extensions that created the need for the verclsid.exe program in the first place.

Welcome Slashdot readers. Since you won't read the existing comments before posting your own, I'll float some of the more significant ones here.

The buggy shell extension was included with a printer driver for a printer that is no longer manufactured. Good luck finding one of those in your test suite.

The security update was recalled and reissued in a single action, which most people would call an update or refresh, but the word recall works better in a title.

Comments (78)
  1. Nathan says:

    For accepting responsibility in this. Nice job. Something we can all learn from.

  2. charless says:

    Ditto to Nathan’s comment!

    But, the question that came to my mind was why didn’t anyone see/report this hang during internal testing? Based on what you have said over the last few days, this hang seems to fall somewhere between likely and very likely. Or maybe a better question would be, now that you have explained why this design “can’t” work, can you explain how verclsid.exe ever exits cleanly?

    And again, thank you for some very insightful reading!

    -charles

    [A few days of internal testing is not going to come anywhere near 100% coverage of all shell extensions on the planet. It takes only one bad program to foul an upgrade. -Raymond]
  3. tony roth says:

    I love it when somebody else screws up!

  4. stanley says:

    I learned the hard way when I was creating a shell extension recently that verclsid.exe will not allow Explorer to load your DLL if you haven’t finished implementing the interfaces you say you implement.

    During development, it may be helpful to manually add your shell extension to the allow list:

    http://support.microsoft.com/kb/918165

    That’s kind of a hackish approach, and you’ve got to make sure to remove the entry before you deploy to make sure your final DLL will get verclsid’s blessing.  Is there a better way to do this?

    I almost wanted to ask if the suite of tests verclsid runs is published anywhere, but I imagine Microsoft would rather leave it unspecified so as not to imply a contract.  It does make it a little frustrating to try and figure out why verclsid.exe won’t let Explorer load your DLL, though.

  5. tcliu says:

    Interesting story, Raymond.

    One question though: I get the impression that a shortcut can specify any CLSID, and Explorer will try to load it as an extension. The problem was that some CLSIDs referred to objects that were never intended to be used as shell extensions, and therefore Explorer would crash.

    Now, verclsid.exe will catch the currently existing ones, but is there any way to mark a COM object as "Under no circumstances use this as a shell extension"? I’ve searched around a bit, but didn’t find anything. (I’ve never written a COM object in my life, so I don’t even know where to start.)

  6. Adam says:

    Just out of curiosity, how did explorer and verclsid.exe communicate?

    It’s just that the way to do this that springs to my mind would be for explorer to launch verclsid.exe, and have explorer wait for verclsid.exe to exit and check its exit status. If verclsid.exe exits indicating success within the time limit that explorer is willing to wait for, all is OK. If it returns indicating a failure, or crashes, or doesn’t exit within the time limit, the CLSID is bad.

    So, I guess I’m wondering why exit codes weren’t adequate for the communication that was needed (what else apart from yes/no is there?), and what communication mechanism was used instead.

    I’m also wondering what would happen if a buggy/malicious extension managed to scribble over the part of the address space being used by the watchdog thread. Was it just assumed that having the watchdog in the same process as the thing you’re testing for its ability to trash a process was too tiny a risk to worry about?

  7. charless says:

    Oops, I misinterpreted the sentence that starts: "The DLL that hosted the shell extension…" Thanks for the clarification. Now the end of the story makes sense too.

    -charles

  8. Fred Schtiener says:

    Looks like a classic case of bad testing; not that I haven’t done it myself, I’m just pointing out the obvious. I know you can’t test for every possible case (sometimes), but still you need more testing. Then rinse and repeat.

    [More testing = more time = more people complaining that Microsoft is slow to release patches. You can’t have everything. The bug was in a shell extension that came with a particular model of printer that the manufacturer doesn’t even make any more! Good luck testing that. -Raymond]
  9. JamesNT says:

    Mr. Chen,

    Excellent article.  I seem to recall being bitten myself by said patch.  Regardless, you are still my programming god.  The only thing this situation proves is that you are human and, like the rest of us, are expected to work magic half the time.

    Thank you for all your hard work.

    JamesNT

  10. dislyxec says:

    Integrity and Honesty, a microsoft core value :)

    I love these stories–it reminds us all that we’re not the only ones that screw up :P

  11. Brody says:

    This article reminded me of how it feels to wade through the kludged-up tangle of anti-pollution hoses and devices on a 1972 Ford Pinto. Those Pintos would only work if everything was adjusted perfectly. There were too many complex interdependencies for mere mortals to grasp. If Raymond Chen can’t even predict how something will work, something tells me the design is way too complex in the first place.

    http://en.wikipedia.org/wiki/Ford_Pinto

  12. Nick says:

    So after all this time it was YOU! :)

    Great post.

  13. Cody says:

    You’re about to be Slashdotted.

  14. Doug Harrison says:

    >When the DLL got its DLL_PROCESS_DETACH, it shut down its worker thread by the common technique of setting a “clean up now” event that the worker thread listened for, and then waiting for the worker thread to respond with a “Okay, I’m all done” event.<

    Why didn’t you just wait on the thread handle? I think this would have avoided the hang. That said, I think this is more the duty of the main program, and I’d try to avoid doing this in DLL_PROCESS_DETACH. In general, the problem with an “all done” event is that the thread continues to run after setting it, so it isn’t really “all done”. This becomes more and more of a problem as the distance between the raw API and the abstraction you’re using increases, e.g. CreateThread vs. _beginthreadex vs. AfxBeginThread.

    [You’ll have to ask the author of the buggy shell extension, but I suspect the answer would be “Because that guarantees a deadlock.” -Raymond]
  15. Aaron says:

    Like charless, I was a little confused as to how the process exit procedure got tripped up.  (caveat: I’m not a windows developer)  The procedure is listed as:

    1. releases the shell extension
    2. un-initializes COM
    3. calls ExitProcess

    From "the first stage in process exit is the termination of all threads other than the one that called ExitProcess." I deduce that step 3 is reached, and the shell extension worker thread is NOT shut down at this point due to COM "unloading" because the extension did a double LoadLibrary.  When step 3 proceeds, the worker thread is prematurely terminated (with respect to its own shutdown protocol) and therefore the extension hangs.

    If that’s all correct, then it sounds like the hole here is just that proper shutdown should have been included in the behavior checking/testing.

    I can see how that could be overlooked (especially if the problem you are trying to detect in code in the Real World is not during shutdown).  Although I assume it would still be worthwhile to implement, because every once in a while somebody will want to log off, shut down explorer, whatever, and that’s when such bad behavior would bite them (although maybe not so clearly associable with explorer).

  16. re: How my lack of understanding of how processes exit on Windows XP forced a security patch to be recalled

  17. Matt says:

    Your executive summary for /. assumes they read the article before commenting.  You’ve met them, right?

  18. dlz says:

    As a /. reader that actually reads the articles, I found the summary great.  Too bad probably no other /. readers will actually see it.

  19. Wyatt Best says:

    The first wave is here! (me)

    Very interesting. I am a beginning programmer, and just learned a ton about threads. But if I can see the explorer.exe process in taskmgr, why can’t I see verclsid.exe? (I have all the XP updates as of this morning.)

  20. Neal says:

    Slashdotters should be sure to read not just this article, but the other articles on processes that Raymond wrote this week.  

    Once that’s done I’d also suggest they start at the beginning of Raymond’s blog postings and read all up through today’s.  They’ll thank themselves for it later.

  21. Random Reader says:

    verclsid.exe is a temporary process that starts when needed, then exits shortly afterward.  You’ll only see it in taskmgr if you happen to be watching the instant explorer wants to load a new shell extension and asks verclsid to test it first.

  22. Do-do Brown says:

    (Im a /.’er btw)

    The buggy shell extension was included with a printer driver for a printer that is no longer manufactured. Good luck finding one of those in your test suite.

    Now what were you saying about us?

  23. Slashdot Reader says:

    [quote]Welcome Slashdot readers. Since you won’t read the existing comments before posting your own…[/quote]

    Yeah, fuck you too.

  24. onu says:

    Slashdotter here.  Teaching is the cornerstone of all open source virtues.  This article turns an unfortunate mistake into a valuable lesson.  You’re an asset to your company and everyone else in your profession.  Just keep those thoughts with you when you dive into the murky depths of the slashdot thread for this article.

  25. Zimboptoo says:

    I wonder how many more /.’ers are reading the comments because of your little summary?

  26. Anonymous Coward says:

    It takes a lot of courage to own up to a mistake like that (even if it wasn’t actually your fault). Congratulations!

  27. Nis says:

    "Since you won’t read the existing comments before posting your own, I’ll float some of the more significant ones here."

    Not true in all cases; please don’t make generalizations.  /. users could make general statements about stability, but that wouldn’t serve any purpose.

    Regards,

    Nix

  28. Sohail says:

    So wouldn’t this pretty much always happen? Sounds like something that should be caught in testing?

    [If you were clairvoyant enough to know which buggy shell extension to install to demonstrate the problem. -Raymond]
  29. Doug Harrison says:

    [You’ll have to ask the author of the buggy shell extension, but I suspect the answer would be “Because that guarantees a deadlock.” -Raymond]

    That’s a good one. :)

    P.S. I see now I misinterpreted your article the first time I read it. After reading about verclsid.exe “hosting the shell extension”, when I later read about “the DLL that hosted the shell extension”, I must have equated the two. I see now you’re using “hosting” in different ways, the former to more or less mean “loading” and the latter “implementing”. So what was the solution? I’d think it might be similar to what Adam suggested in his comment.

    [Yeah, I admit that it was kind of confusing; I’ve tweaked it. -Raymond]
  30. rhk says:

    So… you didn’t try simulating the defective behavior while testing your workaround? Then released it to the entire world, thinking “I guess that should work”?

    Kudos for owning up, but I think your testing should be a bit more diligent, including testing for the actual error condition being guarded against. Unless I’m wrong, it should have been caught easily.

    [Hanging in PROCESS_DETACH was not a defective behavior we were guarding against since no known shell extensions exhibited that failure mode. The issues we were addressing were crashes and hangs during CoCreateInstance. -Raymond]
  31. Paul says:

    Please lay out your page with the side text not displayed on top of the main text. It is hard to read your double text design.

    [As I explain elsewhere on the site, I don’t control the blog software. -Raymond]
  32. Ivan Rouzanov says:

    Somehow I feel verclsid.exe was actually my fault. :)

    There was another scenario for screw-up when another buggy 3rd-party Shell extension wanted to talk to Explorer from its DllMain which was waiting on verclsid.exe to finish thus creating a deadlock. Watchdog did not kill it because it was waiting on ntdll LoaderLock to get started.

  33. rhk says:

    Ah, I see now. Perhaps you should make it more clear that you’re talking about TWO different buggy shell extensions. The one that you were working around, and the one that triggered the bug in the workaround.

    Indeed, it is a tricky bug to predict.

    [The purpose of the change was to address entire classes of buggy shell extensions, not any shell extension in particular. But the class of shell extensions that hang during PROCESS_DETACH was not one of the classes targeted. I didn’t go into detail on this topic since the goal was to discuss how processes exit; this article is part three of a series. The security angle was just a motivator, not the focus. Slashdot turned it into the focus. In retrospect, perhaps I should’ve used an unmotivated sample program. -Raymond]
  34. Silent Node says:

    Fascinating. Thanks for sharing.

    ( from::/. )

  35. Sohail says:

    [The purpose of the change was to address entire classes of buggy shell extensions, not any shell extension in particular. But the class of shell extensions that hang during PROCESS_DETACH was not one of the classes targeted. I didn’t go into detail on this topic since the goal was to discuss how processes exit; this article is part three of a series. The security angle was just a motivator, not the focus. Slashdot turned it into the focus. In retrospect, perhaps I should’ve used an unmotivated sample program. -Raymond]

    Don’t do that, Raymond. I appreciate these stories for what they are worth. To me, it’s a guy trying to get his job done with some oopsies along the way. I think only someone very immature would say "oh look it’s your fault because you suck, oh and this is another reason why M$ suxx0rz!"

  36. Ben Scott says:

    Just stumbled across this.  Don’t even remember how.  But I just wanted to say that I found this post of yours interesting and informative, and that I’ve read many of your other posts and found the same. Even though I dislike some of Microsoft’s business practices and products, it sounds like you do a good job, and I for one appreciate that, too.  Keep up the good work.

  37. TK says:

    "is there any way to mark a COM object as ‘Under no circumstances use this as a shell extension’?"

    Sort of. There’s a registry value that can be set to prevent Internet Explorer add-ons from also being loaded by Windows Explorer. The name is "NoExplorer", under a key for "Browser Helper Objects".  But then you’d probably want to do something else to make sure that the COM object doesn’t get loaded by IE.

  38. young grasshopper says:

    I bet no /.er’s would ever be able to steal the pebble from Master Chen’s palm!!!!

  39. Andrew says:

    Good read Raymond.  The complete set of test cases is approaching infinity (ok – exaggeration).  Full regression testing can only really occur in the target environment which is why we Nazi IT admins lock things down and run UAT on a trial group before releasing to the larger user-base and why we fight to limit the number of applications, hardware types etc to the minimum that will support the business. So you can’t have the google toolbar on your PC?  Well boo-hoo. Give me a business case… :-)

    I don’t know of any way you could have avoided the release/recall event in this case.  Props for the prompt fix and recognition of error.

    From /. too  :-)

  40. DriverDude says:

    So if I read this right, Microsoft is having to workaround other people’s bugs… again? Talk about writing defensive code. (Remember http://www.microsoft.com/technet/security/bulletin/ms06-020.mspx)

    Kudos, Raymond, for owning up to this. I’m also impressed MS is allowing you to blog about this. Now if only I and everyone else will learn from these mistakes…

    Hey, I wonder if anyone in that big printer company indirectly mentioned in a comment above is reading this.

    I’m beginning to wonder if we should all be writing OFFENSIVE code. I don’t mean anything that violates a spec or intentionally causes errors, but rather always varying your responses so nobody gets lazy and starts assuming things. Kind of like how Perl randomizes hash ordering because some people disobey the "don’t count on hash order" warning.

    Obviously Windows’ process termination is a good example of something that changes and is unpredictable – but instead of being defensive and doing the "right things" (whatever they are) it seems most programmers just wing it.

    Sigh.

  41. CoderDude says:

    [I’m beginning to wonder if we should all be writing OFFENSIVE code. I don’t mean anything that violates a spec or intentionally causes errors, but rather always varying your responses so nobody gets lazy and starts assuming things.]

    You mean stuff like locking down parts of the registry and filesystem unless the user types in an admin logon, or locking down the kernel so that no unauthorized drivers run inside it?

  42. Gustavo says:

    Someday, I hope to be the cause of a bug with this level of impact.

  43. Ooh, bummer. One of those "what? why’s it doing that?" moments. Thanks for sharing it; a very pointed illustration of yesterday’s point!

  44. JB says:

    I’m a long-time /. reader (and also holder of an exmsft.com email address that I never use) and I not only read TFA but also all the comments before posting, so there! ;-)

    I agree/sympathize with the problem of finding drivers/other software that came with now-obsolete printer hardware, but would put forth the following observations and questions;

    – Unless the actual printer was needed to cause the bug, in which case you’d need to scour ebay, computer junk dealers, etc, would just having the software be enough? Many manufacturers make available for download drivers and software versions going back a long way

    – If the software is not available online, I’m sure most vendors would be willing to supply it to MSFT anyway. The problem here is the bureaucracies in between the developer at MSFT and the person at the vendor who could supply that obsolete software. I’m certain it could be obtained, but during my time at Microsoft, I would have had no idea who to ask to help me with such a request, even if I thought of loading up my test suite with obsolete software.

    – Even if I know who to ask, that person might not have a contact at the vendor, so s/he might have to do some research/relationship building to get what I need.

    Which, of course, would take a good bit of time, which would make people complain even more about the speed of patch releases. Not that those complaints are unjustified – Microsoft generally *is* slow to get patches out the door – but no one wants to make it worse. Thus, testing becomes a matter of covering as much as you realistically can and getting it out the door. In the case of serious security holes with exploits in the wild, you probably do more damage by waiting to release something you’re certain is bug-free rather than releasing early with something you’re pretty confident is bug-free.

    [Don’t forget MP3 players, video cards, digital cameras, webcams, wireless network cards, USB thumb drives, mice, keyboards… Now do some back-of-the-envelope calculations how long it would take to test every last one of them. -Raymond]
  45. tcltk says:

    It bothers me, that the watchdog was in the same process together with the buggy code! Isn’t this a big no-no? Didn’t a code review uncover this?

  46. Chronos says:

    Out of curiosity, I have two questions:

    A. Is there any particular reason that the main thread wasn’t the watchdog thread, with the COM stuff being done by the worker thread?  (It seems the more logical assignment of duties to me.  Then again, I come from a Unix background, where “worker processes” are much more common than “worker threads” — a drastic change of mindset.)

    B. Is there any particular reason that TerminateProcess wasn’t used instead of ExitProcess?  (Although I’d understand if the thought just never crossed anyone’s mind.)

    [Switching the roles wouldn’t have changed anything; the hang would still occur. (A thread’s a thread; there’s nothing special about the “main” thread aside from the fact that it happened to be created first.) -Raymond]
  47. Christian says:

    Many thanks, Raymond, for explaining what verclsid actually does.

    I searched very hard back when that fix was released to learn exactly what it is and why it was needed.

    Too bad that MS does not publish exploits for its own security holes ;-) (That would be the best way to understand and learn the actual problem)

    This article was really interesting, but it would still be very nice if someone could explain the last bit to me:

    You wrote "The object might crash or hang or even corrupt memory and keep running (sort of)."

    In what way would that be a security hole? How could an attacker create a file that ends with a dot and a bad CLSID? I mean it’s strange that explorer loads shell extensions like that, but I don’t see a real security hole here!

    And verclsid.exe runs with the same rights as explorer, doesn’t it? So if loading a shell extension that does not implement interfaces or IUnknown or whatever correctly is a security hole, then why would verclsid.exe not expose that security hole when it loads the extension? Are we even talking about a real attack vector here? Or is it just Denial of Service against explorer.exe?

    But then there would not have been a need to hurry to release that patch, would there?

    Sorry for taking this slightly off-topic (because this is a series about process exit), but it would be great if someone could answer this!

    Many thanks!

  48. Peaker says:

    Threads are only to be used when:

    A. Utilization of SMP multiprocessing is a significant performance benefit.

    B. Old/crap APIs only support synchronous action.

    I am not sure if B is the case here, but A definitely isn’t. If the case is B, then blame Microsoft for creating those crap APIs in the first place.  If the case is A, blame Microsoft for incompetence in complicating situations with threads, instead of using processes or better: asynchronous programming.

  49. I was wondering how you tested the mechanism against purposely broken shell extensions. Did you write broken shell extensions that mimicked known bad extension behaviour? How did you catalog broken extensions to use as examples?

    [I wasn’t involved in that part. See the fifth paragraph. -Raymond]
  50. Bearxor says:

    Awesome article/insight.  It’s fascinating stuff.

    Question though: Why not make the printer vendor fix their shell extension rather than changing the way Windows checks the shell extensions?  Was it just in case another vendor messed up in the future or did the vendor refuse?

  51. Stu says:

    So is the solution just to let explorer do the watchdog?

    Ie, explorer starts verclsid and waits a set time for a response. After a timeout it kills verclsid and assumes that the clsid being tested is bad.

    Seems simple, and has the added bonus of preventing extensions from trashing the watchdog thread accidentally.

  52. Your article confirms the low quality of windows code. Not just the implementation was flawed (this happens), but the design itself is stupid.

    [See the previous day’s discussion of the so-called “design” of process exit. -Raymond]
  53. Chronos says:

    @Raymond, nevermind.  I think when I posted that, I didn’t quite have the right order of events in my head.  It still seems the more logical program layout to me, but you’re right that it shouldn’t change anything.

  54. Shivaree says:

    Fascinating and well-written. You are exonerated by merit of your own mea culpa. (a /. reader who read all the comments, so :P)

  55. Peaker:  Unfortunately many of those stupid synchronous APIs come from clearly defective operating systems like Unix.  For example the open(), create(), read(), write() and close() all still appear to be synchronous, and all are directly from Unix.

    Don’t blame Windows for having synchronous APIs; ALL operating systems have them, because they’re easier to deal with. In fact, at some level EVERY API is synchronous, even the "asynchronous" ones.

    Bearxor: I wasn’t involved in this, but my guess is that the vendor didn’t support the printer in question any more, and thus wasn’t interested in updating the shell extension.  And even if they DID provide a fix, it doesn’t change all the other shell extensions that had the exact same defective implementation.

  56. Picky says:

    Bearxor, I’d imagine it was because even if they fixed Shell Extension X, not everyone would upgrade and it is just a land mine waiting to pop up again when (as you suggested) someone else does the same thing.

    Especially as a security concern, it had to be fixed and I’m impressed that Microsoft was able to fix it, break it (oops), and then fix it again, then talk about it.

    Thanks Raymond for the insight.

  57. Jare says:

    Raymond, I too am interested in knowing why a watchdog thread was considered a good idea, since defective / insecure code would possibly thrash it.

    Sort of the same as standing in the same room as a guy with a gun and asking if he’s a murderer.

  58. JamesNT says:

    I just went to slashdot and read some of the feed back.  I can clearly see that Slashdot is still the cesspool of the Internet.

    @LarryOsterman.  I was beginning to wonder what was taking you so long to join the fray to Mr. Chen’s defense.  Nice articles on volume in Windows, by the way.  You and Mr. Chen are my gods.

    JamesNT

  59. tcltk says:

    LarryOsterman: Oh c’mon, unix was initially designed in 60/70s, calling it "clearly defective" in 2007 is a bit uncool IMO :). Especially because winnt is certainly not light years ahead, compared to modern unices. I love what Singularity people are doing though!

  60. A 'softie says:

    Bearxor: You write "Why not make the printer vendor fix their shell extension?"   Now you *want* Microsoft to throw our weight around and force people to do things?

    More seriously, we have a hard time making people do things (driver verification) even when we have really really good reasons to want them to do them (like avoid blue screens.)

  61. DriverDude says:

    "You mean stuff like locking down parts of the registry and filesystem unless the user types in an admin logon, or locking down the kernel so that no unauthorized drivers run inside it?"

    No, that is defensive – because vendors could not be bothered to make their software work properly, or work for non-admin users (among other reasons). Despite the fact that Win2000 software guidelines recommended it; UNIX software has done it right for decades; and Fast User Switching is five+ years old. (This is a pet peeve of mine; can you tell?)

    "Now you *want* Microsoft to throw our weight around and force people to do things?"

    Not really; that just makes people think MS is, well, a bully, even when it is THEIR bug to fix.

    Go for full disclosure instead – instead of a "hardware compatibility list", make a Hardware/Software INcompatibility List. Document all the problems. Inform the vendors’ customers so their *customers* demand a fix.

    After all, /. relishes full disclosure of Microsoft blunders. Microsoft does not have a monopoly on laziness or incompetence.

    And just to be clear: I think it’s OK to make a mistake – once. What ticks me off is that the same bugs are made again and again (overflows, privilege escalation, etc.); tools are available but not used (driver qualification/testing); and business reasons are used to justify ignorance ("those printers aren’t sold anymore, so why fix it").

  62. Tihiy says:

    So how was it fixed?

    Other than by adding those buggy extensions to the "pre-approve" list?

  63. Norman Diamond says:

    > A few days of internal testing is not going

    > to come anywhere near 100% coverage of all

    > shell extensions on the planet.

    I’ve read all the comments so far (60 of them) and still don’t understand why this reply answers charless’s question.

    Of course there are gazillions of shell extensions that couldn’t be
    tested in advance, but the bug that forced a recall isn’t in one of
    those gazillions.  The recall was forced by a bug in the
    verclsid.exe program, right?  I still wonder why testing of the
    verclsid.exe program didn’t reveal the bug?  As the article says:

    > The verclsid.exe program created a watchdog thread

        and

    > the verclsid.exe program releases the shell

    > extension, un-initializes COM, and then calls

    > ExitProcess

    So the verclsid.exe program killed its own watchdog thread, without needing “help” from any other bugs, right?

    Saturday, May 05, 2007 1:33 PM by Shivaree

    > You are exonerated by merit of your own mea culpa

    I think Mr. Chen is exonerated by releasing a fix for the bug.  I hope this kind of practice will spread further.

    Saturday, May 05, 2007 1:40 PM by LarryOsterman

    > Unfortunately many of those stupid

    > synchronous APIs come from clearly defective

    > operating systems like Unix.  For example

    > the open(), create(), read(), write() and

    > close() all still appear to be synchronous,

    Agreed.  Unix was deliberately designed to be less powerful
    than its predecessors (including one predecessor whose name it punned
    upon).  Some real OSes had both synchronous and asynchronous I/O,
    serving different kinds of development needs.  Of course they’re
    all gone now, sort of like Gresham’s law.

    [The problem was caused by a bug in a third-party shell extension which did something unanticipated which verclsid did not protect against. You can’t protect against everything. It’s easier to protect against a specific bug in retrospect. -Raymond]

  64. steveg says:

    I’m confused about the reaction! It’s just a bug. (shrugs). It’s not a bad one. No one died, no rockets exploded at launch, no satellites crashed into Mars, no-one lost billions of dollars.

    All software has bugs, regardless of the vendor. As usual, a lot of comments seem to be coming from people who have little to no experience with software development.

    I don’t know if you want to write more stories like this, but as a software developer I really enjoy these "Bugs of Our Lives".

    (My personal favourite mess-up was upgrading a system to a new version of an OS (Iris) and failing to spot that the PID had changed from 16-bit to 32-bit. Ahh… that was fun. It cost a lot of $ when the PID eventually clocked over 2^16.)

  65. Amon Houndsbreath says:

    "I’m confused about the reaction! It’s just a bug."

    No it’s not just a bug if you’re a Unix bigot. It’s an anecdote that shows that Windows is unmaintainable and impossible to understand.

    Mind you, looking at the comments from the Slashdot crowd, particularly the +5 Insightful ones, I can quite believe that something as complex as Windows would be impossible for them to understand or maintain. They’d get stuck in an endless cycle of refactoring to make the code easier to understand, only to find that much of the complexity is inherent in the problem.

    So they’re right in a way, just in one which is less flattering to them than they think.

  66. Chris Becke says:

    Having installed a printer from the affected manufacturer, all I can say is: what the $*%@! 300MB for a printer driver install. And it’s not even installed. I still get some kind of run-on-login setuplett running every time I restart.

    It’s a sad world in which the blame for such crappy software can’t be laid directly on the perpetrators of such utter bloatware.

  67. BryanK says:

    Regarding the mentioned synchronous APIs in Unix:

    That’s why Unix also has select() / poll() / epoll() / whatever else.  Yes, open() and close() (and probably create()) are still synchronous.  But read() and write() (which are where you’ll be spending the vast majority of your time anyway: opening a file handle is pretty fast compared to writing to it) can be put into a select()-type loop.  So no, the APIs aren’t asynchronous — but they won’t block, either.  All the blocking will be in the call to select() or poll().

    (OTOH, select() and poll() can’t take POSIX semaphores, or any of the other types of thread synchronization primitives.  I am guessing that this is because threads are almost always the wrong answer, especially when process creation is as fast as it is on most UNIX-like OSes; "most" binaries are single-threaded.)
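
    The select()-style loop described above can be sketched roughly as follows. This is a minimal illustration in Python, whose select module wraps the same system call; the helper name read_all_ready and the use of pipes are assumptions made for the example, and select() on pipes works on Unix-like systems only. The only call that waits is select() itself; each read happens after the descriptor has been reported readable.

```python
import os
import select


def read_all_ready(read_fds, timeout=1.0):
    """Drain several descriptors until EOF; only select() ever waits.

    read_all_ready is a hypothetical helper written for this sketch.
    """
    chunks = {fd: b"" for fd in read_fds}
    open_fds = set(read_fds)
    while open_fds:
        # Block here, and only here, until at least one descriptor is readable.
        ready, _, _ = select.select(list(open_fds), [], [], timeout)
        if not ready:
            break  # nothing became readable within the timeout
        for fd in ready:
            data = os.read(fd, 4096)  # select() reported this fd as readable
            if data:
                chunks[fd] += data
            else:
                open_fds.discard(fd)  # empty read means EOF
    return chunks


if __name__ == "__main__":
    r1, w1 = os.pipe()
    r2, w2 = os.pipe()
    os.write(w1, b"hello")
    os.write(w2, b"world")
    os.close(w1)
    os.close(w2)
    print(read_all_ready([r1, r2]))
```

    Note that regular files are always reported as ready, so this pattern mainly helps with pipes, sockets, and terminals.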

  68. Rick C says:

    tcltk, that was said for effect, since the poster to whom Larry was replying was making uncalled-for insults.

  69. herd says:

    btw, it is flak not flack ;)

    (FLugzeug-Abwehr-Kanone)

  70. Michiel says:

    [It bothers me, that the watchdog was in the same process together with the buggy code! Isn’t this a big no-no? Didn’t a code review uncover this?]

    Running in-process, well – that’s COM for you. The whole point of the verifier application was to check the buggy code in a separate victim process. The assumption was that a failure would be detected and reported by the verifier. Now, that *is* a bug. You don’t report failure, you report success. Had that been done, then this class of bugs wouldn’t have happened.
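
    The "report success, not failure" idea can be illustrated with a small sketch. This is not how verclsid.exe or Explorer is implemented; it is a hypothetical Python analogue in which the function name extension_is_approved, the "OK" token, and the check_code parameter are all assumptions. The parent approves only when the helper process explicitly says so, which means a hang, a crash, or plain silence all count as failure.

```python
import subprocess
import sys


def extension_is_approved(check_code, timeout=5.0):
    """Return True only if the helper process explicitly reports success.

    check_code is a hypothetical stand-in for the probing work: Python
    source that runs in a separate "victim" process.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", check_code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False  # helper hung; run() kills it before raising
    # Silence is not approval: require a clean exit AND the success token.
    return result.returncode == 0 and result.stdout.strip() == "OK"


if __name__ == "__main__":
    print(extension_is_approved('print("OK")'))
    print(extension_is_approved("import time; time.sleep(30)", timeout=0.5))
```

    With this shape, the timeout lives in the caller rather than in a watchdog thread inside the process being tested, so nothing that goes wrong in the helper can suppress the verdict.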

  71. Herman the german says:

    flak is the abbreviation for Flieger-Abwehr-Kanone. Without any c.

  72. tcltk says:

    I think the real issue in the sync/async debate may be the programming language.

    If the underlying API is sync, the language and runtime simply need to support light-weight threads (as in Erlang), so the costs of heavy-weight threads go away.

    If the underlying API is async, the compiler could rewrite the sync code to the underlying async paradigm, so we get all the benefits of async API and simplicity of sync coding:

    <pre>
    using(f = open("somefile")) {
      x = f.readall()
      try {
        f.write(somestuff)
      } catch(E e) { handle write error… }
    }
    </pre>

    gets rewritten to something like:

    <pre>
    async_open("somefile",
      lambda(h) {
        h.readall(
          lambda(x) {
            h.write(somestuff,
              exception_lambda<E>(e) {
                handle write error…
              }
            )
          }
        )
        h.close()
      }
    )
    </pre>

    There is one problem with this solution – you can’t span multiple methods with this kind of rewriting transparently. One option is to use continuations, but they are quite costly to implement AFAIK. I believe the best option is to use futures and related synchronization primitives.
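
    Languages with async/await do something close to this rewriting. The following is a minimal Python sketch, not the scheme proposed above: FAKE_DISK, read_all, write, and update are hypothetical names, and an in-memory dictionary stands in for real asynchronous file I/O. The body of update reads like synchronous code, including ordinary try/except, yet every await is a point where the function suspends instead of blocking a thread, and awaiting works across function boundaries.

```python
import asyncio

# Hypothetical in-memory stand-in for a file system.
FAKE_DISK = {"somefile": "old contents"}


async def read_all(name):
    await asyncio.sleep(0)  # yield to the event loop, as real async I/O would
    return FAKE_DISK[name]


async def write(name, data):
    await asyncio.sleep(0)
    if not isinstance(data, str):
        raise TypeError("can only write text")
    FAKE_DISK[name] = data


async def update(name, data):
    # Sequential-looking code; each await suspends this coroutine
    # rather than blocking the thread it runs on.
    old = await read_all(name)
    try:
        await write(name, data)
    except TypeError as e:
        return old, f"write failed: {e}"
    return old, "ok"


if __name__ == "__main__":
    print(asyncio.run(update("somefile", "new contents")))
```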

    [Remember when we were talking about how processes exit? That was cool. -Raymond]
  73. peterchen says:

    great post, Raymond

  74. Great article!  The description of the error that you had made was very clear.  

    Great job on ‘fessing up’. Responsibility is something we desperately need in this industry.

  75. Dwain says:

    I’ve made many mistakes in the past (including wiping out a live database).  Some I’ve fessed up to… and some, well… but I MUST commend you for your efforts in coming clean. I think we all have a lesson to learn from you about honesty.

Comments are closed.
