The day shell.windows.com went down


When the file association Web service was first being developed, the programmer responsible for implementing the feature just scrounged around and found an old unused computer and set it up as a simple Web server under his desk, so there would be something to test the code against. That server happily churned away serving out file extension information, and when people asked for their program to be added, he would manually add it to the server. The server worked just fine, and like most things which work just fine, it was forgotten.

And then remembered once things no longer worked just fine.

I think it was during one of the beta cycles, or maybe it was RC1, when the quiet neglected computer went offline. I forget why, so let's pretend that the programmer unplugged it as part of an office redecoration project. Suddenly, the shell.windows.com service went down.

What? The file association Web service went down?

Everybody had forgotten that shell.windows.com was still running on a computer under that programmer's desk. It had done such a good job up until now that nobody gave it a second thought.

He plugged the computer back in and watched the server bang out requests like nobody's business.

Wheels were quickly set into motion to transfer the file association Web service to a machine with a little bit more professional attention and maintenance.

Comments (41)
  1. Tom says:

    Somehow, it’s reassuring to know that even the Big Boys still make rookie mistakes when it comes to can’t-fail production services.  Now I don’t feel so bad when I do it at my little company.

  2. xyz says:

    Fascinating article. I always wondered if anyone ever unplugged their computers at Microsoft. They must be running so many critical services, how do you guys manage when they are down? I now have a fuller understanding of what it means to be in such a powerful position of having important infrastructure underneath my desk. So when I next kick the computer b0xen underneath my desk, I will be reminded of this article.

    Also, the way you approached the article is enviable. It was almost.. poetic.

  3. phi says:

    Yeah. My boss used to say that stuff that works is usually easily forgotten. We usually remember what doesn’t work.

  4. John says:

    Do you mean that developer offices have inbound access from the public Internet? I’m shocked that you’re not behind a firewall.

    [“set it up as a simple Web server under his desk” includes “filled out the necessary paperwork to obtain inbound access from the public Internet.” I’m disappointed that I had to write that. -Raymond]
  5. rbirkby says:

    You would think that if the service recognised .txt and .html, it would recognise other OOB extensions such as .reg

    "Windows does not recognize this file type. "

    erm, I think it does.

  6. BOFH says:

    Sometimes I wonder if the machine running blogs.msdn.com isn’t a 486 stuffed under some code monkey’s desk somewhere.

    Some days it’s horribly slow.

    (And I know it’s in no way Raymond’s fault nor responsibility, no need for a snarky nitpicker’s corner.)

  7. Neil says:

    [“set it up as a simple Web server under his desk” includes “filled out the necessary paperwork to obtain inbound access from the public Internet.”]

    Well that was his mistake then; if he hadn’t done that then the web service would have been put somewhere useful as soon as someone pointed out that the code only worked internally.

    [And it would have slowed down the development cycle tremendously. Instead of “fix a bug and copy the fix to the server under your desk, oh there’s a bug in my fix, let me try again (total time: 2 hours)” it’s “fill out a server content update request and wait N days, and then oh there’s a bug in my fix, let me try again (total time: 2N days).” It was originally just a prototype that only gradually morphed into its final form over a long period. -Raymond]
  8. Adriano says:

    "It had done such a good job up until now"

    Should be "then".

  9. Aardvark says:

    I find it odd that this could actually go unnoticed. I’d like to hope that this just slipped through as an issue since it was during a beta/rc release. Hopefully, some project plan somewhere had a line item for setting up the "real" public web service before RTM.

  10. Alexandre Grigoriev says:

    BOFH,

    And it runs unpatched Windows NT 3.5. I would not be surprised if it is actually.

    "Some days" means 29 days in a month; February too.

  11. Mike Dimmick says:

    @Alexandre Grigoriev, BOFH: hosted by ORCS Web, runs Windows Server 2003.

    http://toolbar.netcraft.com/site_report?url=blogs.msdn.com

  12. @Adriano – Thanks for that, I was totally confused until you pointed that out…

  13. John says:

    When will shell.windows.com go down for real?  I’ve read all the help and support life-cycle policy documents, but they only talk about support issues (hot-fixes, security updates, incident support, etc); they don’t really mention any kind of tangential services such as shell.windows.com or Windows Update for that matter.  On a side note, Windows Update is still functional for Windows 98 (at least the last time I tried a couple months ago).  And of course the real concern, product activation servers.  Not that WPA actually stops piracy, but that’s beside the point.  Perhaps all of this will be moot when Operating Systems are sold as services instead of products.  Sadly, I think that’s the way this whole thing is headed…

  14. Alexandre Grigoriev says:

    @Mike Dimmick,

    Why then it’s performing so poorly? Nobody is watching it, and it leaks memory badly? What would it take to fix it? A certified petition signed by 500000 readers (not that there’s that many)?

  15. Yuhong Bao says:

    [And it would have slowed down the development cycle tremendously. Instead of "fix a bug and copy the fix to the server under your desk, oh there’s a bug in my fix, let me try again (total time: 2 hours)" it’s "fill out a server content update request and wait N days, and then oh there’s a bug in my fix, let me try again (total time: 2N days)." It was originally just a prototype that only gradually morphed into its final form over a long period. -Raymond]

    "Wheels were quickly set into motion to transfer the file association Web service to a machine with a little bit more professional attention and maintenance."

    After that happened, how was change control done? Did it slow the dev cycle or was that machine that originally hosted shell.windows.com used to test changes before production?

  16. Matt Ginzton says:

    OK, so this was a developer’s machine under a desk, but the developer filled out paperwork to allow access from the public Internet to this under-desk machine.

    Then everyone forgot the machine existed, because it’s working so well.

    Meanwhile, who’s installing security updates etc. on this machine?  Nice juicy target.  You’re lucky that it got unplugged (and you noticed) before it got hacked (and you didn’t notice).

    This is all refreshingly web 2.0 in its fly-by-the-seat-of-your-pants-ness, but it does seem rather haphazard for Microsoft.

  17. microbe says:

    [And it would have slowed down the development cycle tremendously. Instead of “fix a bug and copy the fix to the server under your desk, oh there’s a bug in my fix, let me try again (total time: 2 hours)” it’s “fill out a server content update request and wait N days, and then oh there’s a bug in my fix, let me try again (total time: 2N days).” It was originally just a prototype that only gradually morphed into its final form over a long period. -Raymond]

    This makes no sense. If you still need this kind of “development cycle”, you shouldn’t release it as beta. Internal access would have been enough.

    [I’m just guessing. And you need public access so alpha testers can use it. -Raymond]
  18. peterchen says:

    @rbirkby: It doesn’t even recognize .txt and .html for me – because I get region 0407, which is probably near a dragon’s lair.

    Changing the magic number to 0409 does the trick, though.

  19. Yuhong Bao says:

    “Meanwhile, who’s installing security updates etc. on this machine?”

    Can Raymond answer this? Because I was wondering this as well.

    And note that the paperwork was probably filled in around 2000-2001. Does MS pay more attention to security of this kind of thing now?

    BTW, personally I’d prefer to KISS and batch the changes in a single change request once it is all tested on a test machine.

    [Um, this was the test machine. -Raymond]
  20. Yuhong Bao says:

    [Um, this was the test machine. -Raymond]

    I did suspect that this would be a good test machine, but I did not mention it because that is beside my point. It was, BTW, a response to this:

    "This is all refreshingly web 2.0 in its fly-by-the-seat-of-your-pants-ness, but it does seem rather haphazard for Microsoft."

    And BTW, this web site was created long before Web 2.0 even existed.

    Also, you didn’t answer Matt Ginzton’s question about how the machine was patched, can you answer it or it is that you don’t know?

  21. Aaron says:

    Wow, a lot of snarky comments over a whimsical story.

  22. Jim says:

    Like a good old story, people soon forget the moral of the story and dive into the useless details!!!

  23. bdodson says:

    Perhaps someday the developer could have hosted it on Windows Azure, forgotten about it and never worried about it again :).

  24. Mark says:

    Adriano: read http://www.reshmaanand.com/2006/09/right-here-right-now-use-of-present.html

    Yuhong Bao: I’m not sure you realise how demanding you are.  Raymond has said he doesn’t get paid to answer questions on his blog.  Your enthusiasm is great, but it would fill a full-time job, and often requires a careful response to avoid controversy.  You’d save everyone’s time (not to mention goodwill) by being a bit more selective in your questions.  Can you try that?

    P.S. It’s unlikely Raymond will know how the computer was updated if he can’t remember why it was unplugged.  Most likely, it was set to perform automatic updates at 3am, and therefore rebooted every few weeks (well, that ruins the story).  Or perhaps he deliberately turned updates off, but is this blog the place to discuss that?

  25. Yuhong Bao says:

    Mark: Well, to be honest, I did not originally ask the question, Matt Ginzton did, Raymond did not respond and I was wondering about the answer to it.

    [I am under no obligation to answer any question, especially if the answer does nothing to further the story. Suppose I told you, “Oh, it was updated by space centaurs from the planet Wombat.” The story remains unchanged. I wonder if you go to the movies and ask things like “Hey, who imported those olives that James Bond used to make that martini, and how did they get through customs?” -Raymond]
  26. frymaster says:

    "Turns out it was spyware that interrupted any requests to windowsupdate and redirected them to msn.com. My bad, guys. Sorry"

    I once worked out exactly what virus was on a user’s machine by noticing exactly which web searches it blocked :D (it only blocked ones that had detection/removal instructions for it)

    commenter-pre-emptive-anti-snark: this was just so I knew what it was.. the computer got a complete format + reinstall before going back to the user

  27. Dean Harding says:

    Mark: Wow, am I the only one who had trouble reading that reshmaanand.com link? It seems like somebody replaced this guy’s "," key with the "-" key…

    Sorry that’s rather off-topic, but I really struggled to parse his sentences until I figured out what the "-"s all meant!

  28. Yuhong Bao says:

    "Perhaps someday the developer could have hosted it on Windows Azure, forgotten about it and never worried about it again :)."

    shell.windows.com was created long before Windows Azure existed. Plus I am not sure if this would be allowed.

  29. Alexander Grigoriev says:

    Merus,

    Change your parents accounts type to "Limited User". Put a password to the Administrator account. Then you won’t have to fix your parents’ computer that often.

  30. Yuhong Bao says:

    Merus: Another example is that users often blame the OS for blue screen crashes when even trivial crash analysis would have shown that a driver is at fault, not the OS.

  31. Matt Ginzton says:

    I wasn’t trying to be snarky; it’s a fun story and I’m glad it got told. I’m just honestly surprised that modern Microsoft is that process-light.

    This kind of thing probably happens all the time on a much smaller scale, with internal services or services nobody cares about, but I’d think that pretty much anything MS builds would have the expectation that lots of people are going to be relying on it.

  32. Merus says:

    Speaking of unreasonable demands:

    I was convinced that windowsupdate.microsoft.com had gone down over Christmas, and ranted and raved to my parents (of course I was fixing my parents’ computer, it *was* Christmas) about how Microsoft can’t keep a web server up.

    Turns out it was spyware that interrupted any requests to windowsupdate and redirected them to msn.com. My bad, guys. Sorry.

  33. frymaster says:

    @matt:

    depends on your definition of modern, I suppose… this was developed for XP, which makes this story around 8 years old.  I don’t know to what extent MS changed their processes after XP, but I know they completely overhauled them after Vista, so it doesn’t offer that much insight into what they’re doing these days

    I know in general MS has a pretty low bar to jump over in order to get hold of some hardware for testing/prototyping… it’s likely someone did this for developing the web service and simply forgot until forcibly reminded that it was no longer a test app

  34. Tony Nitpicker says:

    This webserver was probably well maintained with regards to patches and such (even if you don’t mention it), but for me it is still a frightening prospect to have a server "that fell into oblivion" servicing inbound web-traffic, sitting under the desk of a developer – to paraphrase you.

    Was there code running on that server written for this project? Was it audited? Was the machine sitting in a DMZ or such? And so on… Even if you don’t mention it (and I presume that people at MS took care of these questions), I would feel more than uneasy about this whole situation, given the lax attitude towards this server you portray here.

    To nitpick my nitpicking: I assume it was probably a journalistic freedom you took here. Still, I would feel very uneasy about this and I would take it much more serious.

  35. FavoringCurry says:

    I’m an enthusiastic quiet reader of your blog. Don’t quit writing interesting posts like this, just because lots of morons like to pick them apart.

  36. Carl says:

    ‘Change your parents accounts type to "Limited User". Put a password to the Administrator account. Then you won’t have to fix your parents’ computer that often.’

    Unfortunately, the first time your parents want to install something genuine and you’re not immediately contactable by telephone (some of us do have a life) this will all go down the pan and they’ll want this ‘restriction’ removing.

    Even if you are on the end of the phone, you only really have two options:

    1) Drive over to your parents to admin them up (or remote desktop if you are in a position to do so)…

    2) Give them the password so they can do it themselves…

    1) will very rapidly become tiresome (see note about having a life, above)

    2) This almost (almost) defeats the whole purpose

    From past experience in the two or three times I’ve tried to set up my dad’s PC with limited access, I’ve ended up having to dole out the password.

    Obviously that had nothing to do with the original post, but hey-ho

  37. Duke of New York says:

    … and this is why, when your work creates a dependency, you make d*** sure to write that dependency down somewhere that other people can see (such as a bug DB).

  38. LongTimeListener says:

    Raymond,

    I – and others, evidently – enjoy the occasional chucklesome and interesting tale.  It’s a shame that you seem to attract people that can’t enjoy the story in the spirit it was told, but I hope that doesn’t stop you doing so in the future.

  39. J says:

    "Still, I would feel very uneasy about this and I would take it much more serious."

    You’d take it much more serious?  Really?  So several years later you still wouldn’t be able to tell a light-hearted story about it?  When you told the story of the "almost-forgotten" server, would you need a moment of silence afterwards for everyone to reflect on the seriousness?  If someone smiled during the story, would you yell at them that it’s not something to be laughed about?  Would you never tell the story because you sit in dread every night about the disaster that could have happened?

    Lighten up already.

  40. Alexander Grigoriev says:

    Carl,

    Choose one:

    1. Have a computer without malware and endure occasional hassle of remoting to it to install something, or:

    2. Have to periodically clean it of crap, spyware, etc..

Comments are closed.
