2012 Q3 link clearance: Microsoft research edition


My Q1 and Q3 link clearances are traditionally for links to other Microsoft bloggers, but this time I'm going to link to a few Microsoft research papers I found interesting.

Why do Nigerian scammers say they're from Nigeria?

Short answer: Because it ensures that the replies come only from the most gullible people on earth.

Bonus chatter: I received a scam email purportedly from Sir Humphrey Appleby, secretary to the Prime Minister. I could tell it was a fake because the message was comprehensible.

Sketch2Cartoon: Composing Cartoon Images by Sketching

Okay, I admit I haven't read the paper. But the video is fun to watch.

Debugging in the (Very) Large: Ten Years of Implementation and Experience

This is the paper on Windows Error Reporting that everybody cites. To me, it gets interesting starting in Section 6.

An Empirical Analysis of Hardware Failures on a Million Consumer PCs

I had the good fortune of seeing an early version of this paper. The thing that jumped out at me was the hard drive failure information:

  • The probability of a failure in the first 5 days of uptime is 1 in 470.
  • Once you've had one failure, the probability of a second failure is 1 in 3.4.
  • Once you've had two failures, the probability of a third failure is 1 in 1.9.

Translation: That hard drive failure you experienced? It was no fluke. Once you experience your first hard drive failure, the odds of a second one increase by a factor of over 100.
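
As a rough sanity check on that "factor of over 100," here is a small Python sketch that treats the three "1 in N" figures above as plain probabilities. The constant and function names are made up for illustration, and this is only arithmetic on the quoted numbers, not a reproduction of the paper's analysis.

```python
# The "1 in N" hard drive failure figures quoted in the list above.
ODDS_FIRST = 470.0   # first failure within the first 5 days of uptime
ODDS_SECOND = 3.4    # second failure, given one failure already happened
ODDS_THIRD = 1.9     # third failure, given two failures already happened


def probability(one_in_n: float) -> float:
    """Convert '1 in N' into a probability between 0 and 1."""
    return 1.0 / one_in_n


def risk_multiplier(before_one_in: float, after_one_in: float) -> float:
    """How many times more likely the 'after' event is than the 'before' event."""
    return probability(after_one_in) / probability(before_one_in)


first_to_second = risk_multiplier(ODDS_FIRST, ODDS_SECOND)
second_to_third = risk_multiplier(ODDS_SECOND, ODDS_THIRD)

print(f"P(first failure)         = {probability(ODDS_FIRST):.2%}")
print(f"P(second | one failure)  = {probability(ODDS_SECOND):.2%}")
print(f"P(third | two failures)  = {probability(ODDS_THIRD):.2%}")
print(f"first -> second multiplier: {first_to_second:.1f}x")
print(f"second -> third multiplier: {second_to_third:.1f}x")
```

Going from 1 in 470 to 1 in 3.4 works out to a multiplier of roughly 138, which is where "over 100" comes from. The step from the second failure to the third is much smaller, a bit under a factor of two.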

What's more, that second failure is highly likely (86%) to occur within the next ten days, and almost certainly (99%) within the next thirty.

Conclusion: When you get a hard drive failure, replace the drive immediately.

Comments (14)
  1. The link about the Nigerian scams is pure genius, thank you Raymond!

  2. Joshua says:

    So Raymond (indirectly) blogged on WER after all. After intercepting WER messages for a crash I was able to reproduce in a controlled way, I found the dumps I was able to extract completely useless, even with the exact versions and PDB files loaded into a debugger.

    However, the paper contained enough ideas that I should be able to make a local equivalent that harvests the crash data that I actually need (unfortunately I'll have to depend on crash-hardening a routine within the process to be able to submit data, now that's some trick).

  3. Zan Lynx says:

    Why would you use the process to report on itself? All the crashed program needs to do is launch a new program to collect the info then sit around and wait.

  4. underclocking=good says:

    Running hardware at its rated frequency is also bad. Underclocking improves stability significantly.

  5. Antonio Rodríguez says:

    The hardware failure paper confirms many things I have learned over the years: that overclocking is *bad*; that when a machine fails, it will fail again; that "mature" machines (i.e., those that are neither too new nor too old) are far more reliable; that cheap hardware, typically found in white-box computers, is far less reliable; and that the typical lifespan for a 24/7 machine is about 6-7 years (in fact, I use that figure when planning the build and upgrade of my main work machine). But the result indicating that laptops are more reliable caught me off-guard: it doesn't match my experience.

  6. Danny says:

    I bought my current PC with the idea of using it intensively, ~15h/day, with very few days off during the course of a year. Initially I set up RAID0 with its 2 HDDs in order to get good performance. One year later I started to experience a lot of bad sectors: every day at startup CHKDSK ran, finding more and more cross-linked files, recovering more and more files with bad data, and so on. Then the noises got louder and louder, and at BIOS start the RAID self-check marked one HDD red (as opposed to green, as it had always been). I kept working with it red for a few days, and then one morning disaster struck. No booting of my "beloved" Micro$oft OS.

    So I got the HDDs out, put them into another PC, ran a low level factory format – you know, the one that the HDD manufacturer provides via its tools – and installed them back into the PC. But this time I said "no more RAID, let them be separate C: and D: drives".

    And today it has been 4 years since that day, doing work/games on the same PC and on the same HDDs.

    Moral of the story – I maybe just broke the law of "Once you've had one failure, the probability of a second failure is 1 in 3.4."

    Or maybe this happened before the article was queued by Ray and the law only took effect afterwards, so I should quickly replace those HDDs when the next failure occurs ("Conclusion: When you get a hard drive failure, replace the drive immediately.").

  7. steveg says:

    OMG. I wish I had those error databases to play with (and that number of customers :-).

    Stats are awesome.

  8. Matt says:

    @Danny "I maybe just broke the law of 'Once you've had one failure, the probability of a second failure is 1 in 3.4'"

    Winning the lottery doesn't break the law that you'll win 1 in 4-million plays. It just means you were lucky.

    Your single instance of your drive working is great for you, but Microsoft's 1:3.4 number comes from statistics on millions of machines, not single anecdotal examples.

  9. Danny says:

    @Matt

    My previous work PC was a P2 350 MHz from 1997. It has 512 MB SDRAM and runs XP SP3. Today. And it has had the same 3 HDDs of 4 GB, 8 GB and 10 GB for like 10 years. It's still in use by my 6 yo son. And while it does not run 15 hours/day like my current work one, it still does 4-5 hours/day. Every day.

    How's that for luck? :P

  10. voo says:

    @Danny Yes and there are people who win the lottery after playing only once in their whole life. Statistics may seem unintuitive sometimes, but it works…

  11. Engywuck says:

    @Danny: well, there is this Norwegian family that has just lately won the jackpot at the lottery for the third time in six years, winning over a million dollars each time… statistics seem to work :D

  12. JMThomas says:

    @Danny — One would think that consumer hard drives could easily be used with hardware RAID. They don't work because the microcode delays responding to some re-tryable errors for too long (defined by the RAID chip), and RAID starts recovery mode before the drive is ready for the next command.  There is a command RAID can send the drive to disable extended retry, but most don't and many drives ignore it anyway.

    The (much) higher priced RAID version of a drive ships with this timeout disabled, and longer burn-in/testing at the factory.  The margin on the RAID drive is excellent for manufacturers, and impossible for most consumers to swallow.

    If you want RAID, do it in software.

    PS: Your 'low level' format almost certainly wasn't.  I've yet to see a drive which doesn't keep the bad track/relocation tables, nor test marked sectors for recovery, nor even reorganize the bad track accounting so every possible alternate track can be assigned.  But you did clear the file system's list of bad sectors!

    @voo — Danny wasn't playing the odds; he changed the rules of the game (by not using RAID hardware with consumer drives) to give him much better odds.  (Just like James T. Kirk at Starfleet Academy.)

  13. Chris Hutchcroft says:

    Great link for the hardware failure analysis. It has the right combination of 'who knew' and 'well duh'.

  14. David Walker says:

    I just got the weirdest Nigerian spam e-mail I have ever seen, in my inbox.  It was a long dissertation from someone saying they are sorry that I had sent money repeatedly to "my representative/intermediary" without receiving anything in return.  The sender is sorry about that, but he no longer trusts "my representative".  If I want to complete the transaction, I must work only with the sender of the e-mail and not with "my representative" any more, and to keep the details secret (of course).  If I don't follow those instructions, I will not see any of the money that I have already paid so many fees to receive.

    There was a long discussion of betrayals, and lots of talk about honesty.  It was quite surreal.

    I suppose the audience for this message is the VERY small, select group of people who have gone through this once and are truly wondering why they have not yet seen their money, and are still willing to hope.  If they follow the instructions in this message, they will be rewarded, of course!
