This one’s for you John. The core OS team didn’t forget you

Way back when, back in the very early days of this blog (actually it was the 3rd post to my blog), I wrote a story about John Vert complaining about CTRL-C not working on network commands.

Well, yesterday I got a piece of email from one of the developers in COSD.  I’ve sanitized it a bit, but here’s the important part:

Microsoft Windows [Version 6.0.<build>]
(C) Copyright 1985-2005 Microsoft Corp.

d:\>dir \\<server>\dfg
The I/O operation has been aborted because of either a thread exit or an applica
tion request.

d:\>dir \\<server>\dfg
The I/O operation has been aborted because of either a thread exit or an applica
tion request. 

So John, this one’s for you. Even though it’s been 13 years since I worked on that code, your complaint wasn’t ignored, and it’s finally been fixed.

I have no idea what build will contain the fix, or even if the fix will make the final product, but it’s getting there.

As I type this, I can just imagine the /. headline: “Microsoft takes 13 years to fix a bug”.  The reality is WAY more complicated than that.  To actually make this fix work required a significant amount of change to the I/O subsystem and a number of changes to the way that I/O cancellation works. The biggest piece of the picture is the new CancelSynchronousIo API that was added to Vista to handle just this situation. Without that support (as mentioned in my original article), it wouldn’t have been possible to fix the problem.

Comments (27)

  1. Mike says:

    I hope 6.0 will finally allow you to "net stop rdr" shortly (!) after last mounted network resource (drive-letter) was "dismounted" ("net use x: /d") without either hanging it in an undefined state forever (stopping, but never stops, and can’t be started) or simply BSODing the box. That last point also displays a really, *really* unhealthy inbreeding between the redirector ("Workstation") service and the kernel component(s). A simple user-mode app (even if it in this case _is_ the redirector service) being able to BSOD the system isn’t exactly painting a flattering picture of Microsoft.

    Letting it silently time out (somewhere between 2 and 10 minutes I think) it works, but who counts the seconds for MS timing-related bugs when time-to-fix can be measured in decades. :->

    Not only on slashdot! 🙂

    That MS needed over a decade to introduce even the concept of CSQ, not to mention how many years apparently were needed to actually start to use them, is IMHO more telling about priorities than a potential slashdot story about "Microsoft has after 15 years finally got CTRL+C working!".

    Could you after this manage to beat Creative (*) and their drivers into submission, I think you might be on to something. 😉

    (*) There are other vendors just as bad, many worse if you count the abomination of NIC’s as USB devices, but due to market penetration Creative has a place of its own I think.

  2. Mike says:

    Not to create the time machine to go back 13 years with a list of bugs.


  3. Mike, I’ve never seen the redirector take more than a couple of seconds to stop (I actually do that every few days, go figure), and I’ve never seen it CRASH.

    If you’ve reported the crashes to MS, the redirector team should have your crash data and can figure out what went wrong.

    And the quality of 3rd party drivers is a significant issue.

  4. Norman Diamond says:

    > If you’ve reported the crashes to MS

    I’ve had around a dozen kernel crashes (BSODs) where Windows didn’t offer to report the crashes to Microsoft because whatever bug caused the network connection to not work also prevented reports of its own crash.

    In user mode I’ve had a few hundred process crashes where dumpprep.exe and another Dr. Watson process were executing and nearly hanging the CPU but they never offered to send crash reports because whatever bug caused the network connection to not work also prevented reports of its own crash.  These didn’t cause BSODs but still the only way out was to reboot.

  5. Jonathan says:

    The behavior that annoys me the most about this is when I try to use tab-completion:

    > dir \\misspelled-server\share<hits tab> <curses for 30 seconds>

    Will I be able to stop that?

  6. Gabe says:

    Does anybody know how long it took other OSes to implement the ability to cancel synchronous IOs? Maybe I just don’t know what to look for, but I couldn’t find any other OS that implements it. All I can find are calls to cancel async IO (Solaris, Linux, VMS).

    It really is a shame that other systems don’t implement synch IO cancellation, because it’s really annoying when your whole group of Unix systems goes down due to a single NFS hard mount failure.

  7. vince says:

    > The reality is WAY more complicated than that.

    Not really.  The richest software company in the world, with the best engineers money can buy, takes 13 years to make control-C work.  I think that’s pretty simple actually.

    You can change the statement to say that your engineers were so incompetent they designed things so poorly that it took 13 years of valiant redesign and effort to fix the bug, but I’d argue that’s even worse.

  8. Vince,

    Have you ever had a class or read an in-depth book on operating systems engineering?

    Didn’t think so.


  9. Ryan Bemrose says:

    Vince –

    Something that Larry entertainingly points out in many of his blog posts is the simple fact that Writing Software is Hard.  Not everybody can do it, and of those that can, even fewer can do it well.

    The technical reasons behind getting this working, amidst all of the other complexities involved in writing an *operating system*, are undoubtedly good ones.  The decisions were also almost certainly colored by weighing the impact of not fixing the bug versus the risk and technical difficulty of fixing it.  Larry has given us a glimpse of the latter, and for understanding his reasons, I am a little bit better coder.

    Microsoft has a lot of resources, as you rightly point out, but they are not unlimited, and indeed, spread over all of the products that Microsoft creates, they are not overly large.  Microsoft is not in the business of writing flawless software (an impossible goal), they are in the business of shipping products (an achievable goal).  To do that, some hard decisions have to be made.  Not all bugs can be fixed.

    Writing Software is Hard.  If you honestly think that you can do better, then by all means do so, and compete in the marketplace.  However, your comment suggests that you do not have the first clue about how real software is made, and if all you can contribute to the conversation is uninformed anti-MS spew, please take it back to slashdot.

  10. Why couldn’t the problem have been solved earlier by calling TerminateProcess in the CTRL-C handler?

  11. vince says:

    Why do people assume you have to hang out on Slashdot to be anti-MS?  I’ve been anti-MS since before slashdot was a glimmer in Rob Malda’s eye.

    Taking 13 years to fix a bug like this is inexcusable.  If a company is going to hide its code and development processes and not have an open bug tracking system, then it will be judged by what info is released.

    How am I supposed to believe all of those smarmy "you can do anything with our software" MS ads if all I wanted to do was get control-C to cancel some IO?  In any case the 13 year bug will likely be a 15 or more one, because I am sure it’s not going to be fixed in any sort of release any time soon.

  12. Vince, you’re right.  13 years ago, a REALLY stupid decision was made by Microsoft.  We decided that it was reasonable to allow Windows to access a LAN.

    This decision is the root cause of this problem.  The problem is that the timeouts that are appropriate for networked devices are totally unacceptable for human beings.

    It took us a while to realize this. There has actually been a steady stream of fixes in every single release of the OS since NT 3.1 that combined to improve the situation (for instance, in NT 3.1, you couldn’t ctrl-c the NET USE command; in NT4 (I believe) support was added to allow that (it might have been Win2K)).

    The final piece of the puzzle was the CancelSynchronousIo API that was added for Vista.  Due to the way that cancellation was implemented in NT 3.1, it required a significant amount of effort to ensure that it worked correctly and reliably with existing drivers.

    Vince, I don’t know you or your experience, but it’s clear to me from your comments that you’ve never ever written software for a platform with widespread use.  As Ryan mentions above, this stuff is HARD, especially if you want to get it right.

    For instance, the command interpreter guys could have executed their command interpreter on a different thread than the UI thread and just abandoned the operation on ctrl-c.  But that would have introduced even more problems (what do you do when that abandoned operation completes, and what if there were side effects of that abandoned operation).

    But they didn’t do that because it was more important to fix it CORRECTLY than it was to hack around the problem.
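
The trade-off in the comment above can be sketched in a few lines. This is an illustrative Python simulation, not the command interpreter's actual code: `slow_network_operation` and `run_abandonable` are hypothetical names, and `time.sleep` stands in for a blocking network call. Under those assumptions it shows that abandoning a worker on a simulated Ctrl-C does not stop the operation, so its side effect still lands later.

```python
import threading
import time


def slow_network_operation(side_effects, delay=0.5):
    """Hypothetical stand-in for a blocking call that cannot be cancelled."""
    time.sleep(delay)                  # pretend to wait on the network
    side_effects.append("completed")   # the side effect happens regardless


def run_abandonable(operation, wait_for_user=0.05):
    """Run `operation` on a worker thread and stop waiting after a short while.

    Returns (abandoned, worker). Giving up on the wait is the simulated Ctrl-C;
    the worker thread itself keeps running.
    """
    worker = threading.Thread(target=operation, daemon=True)
    worker.start()
    worker.join(wait_for_user)         # wait only briefly, then walk away
    return worker.is_alive(), worker


side_effects = []
abandoned, worker = run_abandonable(lambda: slow_network_operation(side_effects))
effects_at_abandon = list(side_effects)   # nothing has happened yet

worker.join()                             # the abandoned operation finishes anyway
effects_after = list(side_effects)
```

The prompt comes back quickly, but the abandoned operation completes behind the user's back. That is the "what do you do when it completes" question raised above, and it is why true cancellation of the in-flight I/O is the cleaner fix.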

  13. brent says:

    "Writing Software is Hard.  If you honestly think that you can do better, then by all means do so, and compete in the marketplace."

    In this day and age and after all of the anti-competitive and unlawful things MS has done to competitors why would I want to compete with them?  Just to give them things to copy for ‘innovation’ and add to the OS so I can go out of business?

    The Justice Department really screwed up what they had.  MS should have been split up and the OS made a regulated public service.  Then their apps would have had to stand on their own two legs without being propped up for years with the cash cow of the OS.  How many of them would have survived in that environment?

  14. vince says:

    > Vince, I don’t know you or your experience, but it’s

    > clear to me from your comments that you’ve never

    > ever written software for platform with widespread

    > use.

    I like how people can somehow analyze my software experience from a few posts I make on a blog.

    If by "widespread use" you mean code that is in Windows, well of course not.

    If you mean "is currently running on millions of computers", then yes.  Code of mine is included in the Linux kernel.  You’re free to download the Linux source and view it, critique it all you want.  

    I’ll notice I can’t view any kernel code that you’ve written, or for that matter any of the kernel code your company produces.  So I’m the one who’s at a disadvantage when considering your programming skills.

    >  As Ryan mentions above, this stuff is HARD,

    > especially if you want to get it right.

    Well of course, if you want to be whiny about it.  Honestly, all programming is hard.  That’s no excuse.

  15. What vince[sic] is missing here is that we’re talking about behavior on builtin commands to the shell.  Maybe we can debate whether "dir" should be builtin or not but it is and as such, this isn’t just a "simple" decision to terminate a process.

    As Gabe points out above, (just about) nobody else has support for cancelling in-flight synchronous I/O.  (VMS had it indirectly since all sync I/O was actually async I/O followed by an EF wait but I’m not sure that sys$waitef actually was interruptible…)

    Oh, wait, that’s right.  Don’t feed the troll.  Someday I’ll learn.

  16. Lorenzo says:

    Vince, you are very annoying. You are saying no more than: "MS, you are the bad boys, because you are not open source". So, please stop bothering us with your useless posts.

  17. Rune says:

    Jonathan wrote: "> dir \\misspelled-server\share<hits tab> <curses for 30 seconds>"

    Perhaps not cancel it, but perhaps nudge the default timeout? I don’t need the command line to die for 30 seconds. 3 will suffice for my configurations. :->

  18. Rune,

     That’s exactly the crux of the problem.  The 30 second timeout comes from TCP/IP, not from the user.  The timeouts that are appropriate for a networking stack aren’t appropriate for interacting with a user.

  19. Rune says:

    OK, so how do I change the timeouts used by the TCP/IP stack? 😉

    Our own applications tend to use their own timeout mechanisms in order to detect a broken connection faster. When dealing with realtime stock information, you do not want to wait 30 seconds before being hooked up to an alternative server… (no, we do not utilise satellite connections to our end users)

    I read a whitepaper once on TCP/IP and all related registry settings, but I do not remember seeing anything about timeouts. Could we have some additional tweakings in Vista Server please? (while we’re at it: where are the Vista Server betas? 😉 )
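
The application-level approach described in the comment above (use your own timeout, then move to an alternative server) might look roughly like the following. This is a sketch under stated assumptions, not anyone's production code: `connect_with_failover` is a hypothetical helper, the server addresses are placeholders, and the 3-second default is only the figure mentioned earlier in the thread. It bounds how long the application waits for each connection attempt and leaves the TCP/IP stack's own timers untouched.

```python
import socket


def connect_with_failover(servers, timeout=3.0, connect=socket.create_connection):
    """Try each (host, port) in order, waiting at most `timeout` seconds each.

    Returns (connection, address) for the first server that answers and
    raises ConnectionError if none of them do. The `connect` parameter is
    only there so the connection function can be swapped out in tests.
    """
    errors = []
    for address in servers:
        try:
            return connect(address, timeout=timeout), address
        except OSError as exc:          # socket.timeout is a subclass of OSError
            errors.append((address, exc))
    raise ConnectionError(f"no server reachable: {errors}")
```

A caller would pass its list of primary and backup servers, for example `connect_with_failover([("primary.example", 9000), ("backup.example", 9000)])`, where both host names are placeholders.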

  20. Rune,

     You don’t.  You get a choice.  

    You can accept the timeouts in the TCP/IP stack (which are the exact same timeouts in every single TCP/IP stack in the world) and interoperate with all the other TCP/IP implementations.

    Or, you can mess with the timeouts and interoperate with somewhere around none of the TCP/IP stacks in the world.

    Microsoft originally shipped a TCP/IP stack with timeouts that were reasonable for a LAN (and thus more palatable to human beings), and got slammed hideously when it hit the internet because the MS TCP stack timed out too early for real-world situations.

    For the first service pack of NT 3.1, the timeouts were reset to match the values that everyone else expected.

  21. Norman Diamond says:

    More precisely, the answer to Rune is to use UDP.  Yes you’ll have to write code yourself to reimplement half of what TCP does, but you’ll get to set your own timeouts and fallback procedures.  You’ll be able to do all of that and maintain compatibility with UDP.

    Regarding the base issue, I haven’t commented yet because every individual component of the problem seems to be reasonable and the system doesn’t fall apart until you put them all together.  I’m beginning to think that there should be another division similar to the division between UI threads and worker threads.  Any thread that waits for synchronous I/O operations shouldn’t make the user wait for that thread.

  22. Mike says:

    "in NT 3.1, you couldn’t ctrl-c the NET USE command, in NT4 (I believe) support was added to allow that (it might have been Win2K)"

    Nope, must have been an XP or later addition. Just the other day on NT5sp4 I managed to miss the first ‘1’ in "net use * \\192.168"…, and after hitting CTRL+C and realizing it would be useless to try to kill net.exe from Taskmugger (as it’s the kernel mode part that hangs, and the only way to stop that is a hard machine reset), I was again reminded of how dearly I dislike this design. Eventually, after the I/O timed out and returned to net.exe I got ^C printed in the console. Oh well, at least it did acknowledge it got my CTRL-C. 🙂

    About the redirector BSODing: I’ve only experienced that once, on XPsp2 (we normally don’t use XP). However, doing "net use x: /d" and at least while "netstat -an" still reports the connection in state TIME_WAIT, I always get the redirector hanging if I run "net stop rdr" using NT5sp4.

    AFAIK we haven’t reported it to MS (the common opinion simply is "Oh well, that’s the way MS software works. *sigh*"), and I personally see no reason to do it now. Even if the fix is relatively simple, Microsoft won’t issue another sp for NT5.0, so we obviously have little to no incentive.


    I wonder if in some distant future we’ll see Microsoft actually test, perhaps even QA, getting rid of stuff at runtime too (such as stopping services and devices), as opposed to just optimizing the loading of more and more and more services, drivers, software and resource-consuming thingamajigs and gizmos onto users’ machines.</dreaming>

  23. Norman Diamond says:

    Which is better:  Microsoft takes 13 years to fix a bug?  or Microsoft resolves to refuse to fix a bug, which Microsoft has acknowledged to be a bug, because Microsoft has repetitively shipped the same bug for maybe longer than 13 years?

    Don’t worry about whether anything should be done about it.  (That’s only if you had any inclinations of wondering whether to worry about it.  You didn’t need to.  Microsoft’s decision is quite clear enough.  But if the occasion ever arises for me to submit another bug report in this here public bug database, don’t be surprised when the level of cynicism will be even higher.)

  24. Ulric says:

    Wow, Norman, you seem pretty upset because of a DllMain declared with the generic type HANDLE instead of HINSTANCE?

    That doesn’t cause any problem, it’s only theoretical, and I didn’t really get from your posts on usenet if it caused a problem when using STRICT.  The two types are exactly equal in RAM, void pointers.

    They could have easily put an intern on fixing the wizard, but I can imagine from my company’s experience the bug request getting lost by being badly logged.

    It says "Visual C++ generates bugs in DLLs".  That’s false,  that’s like saying the house is on fire.  It’s merely a semantic problem in the wizard’s template.  Even less important if as you say the doc used to be that way as well.

    When bugs are logged more specifically they have a better chance.  There is no anonymous company entity, somewhere there is a human that’s reading and triaging what gets done over what.

    I don’t work for Microsoft, but I can easily imagine someone looking at this bug toward the end of a cycle, thinking by the title it’s a serious compiler bug, then seeing it’s not, then thinking hey, it’s been that way for years and no one ever complained.  Resolved : "Won’t fix."  We do the same.

    And why did no one else complain?  Easy!  No one cares! Defining your DllMain with HANDLE or HINSTANCE results in the same compiled code. It’s just syntactic sugar.

    This type of bug would be logged by multiple users if it were really important, so it’s no big deal if one of them gets resolved as ‘won’t fix’.  It’s the critical compiler bugs, crashes, leaks, etc, you need to worry about.
