Puzzle: Can you explain this program’s crash profile?


Some time ago, I was asked to help a customer study a set of crashes that had been collected by Windows Error Reporting. (You too can sign up to obtain access to crash data for your application.) The issue itself was the 325th most common crash in the ISV crash database, so fixing it would mean a lot toward improving the overall perceived stability of Windows. Fortunately, the issue was resolved relatively easily, but that's not what made the story interesting. What I found interesting was a little puzzle that faced me when I called up their crash profile.

One of the items in the crash profile report is a histogram plotting how many crashes per day were reported over the past three months. Most crash profiles take the form of an erratic graph with random day-to-day fluctuations. Sometimes you'll see a gradual trend (for example, as more and more people upgrade to a newer version). But this one had a strong pattern:

The number of crashes per day remains high for several days, then plummets for two days, then returns to its high value, repeating on a regular cycle.

It took me a bit of thought, but soon I understood why. Perhaps you can figure it out, too. Hints after the break.

If you actually stop and count (I didn't; I just eyeballed it), there are five days with high crash frequency, followed by two days with low crash frequency.

What happens on a seven-day cycle?

The five days with high crash frequency correspond to Monday through Friday; the two days with low crash frequency correspond to Saturday and Sunday.

The program in question targets a business audience. People use the program when they're at work (during the work week), but they don't use the program when they're at home (on the weekend). A program that isn't running can't crash.
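If you would rather check than eyeball, the weekday/weekend split is easy to test once you have per-day crash counts. The sketch below is only an illustration: the function name `weekday_weekend_means` and the sample numbers are made up, and it is not how the crash profile report itself is produced. It assumes you have a mapping from calendar date to the number of crashes reported that day.

```python
from datetime import date, timedelta
from statistics import mean


def weekday_weekend_means(daily_counts):
    """Return (mean crashes on Mon-Fri, mean crashes on Sat-Sun).

    daily_counts is a hypothetical mapping of datetime.date -> crash count.
    date.weekday() returns 0 for Monday through 6 for Sunday.
    """
    weekday = [n for d, n in daily_counts.items() if d.weekday() < 5]
    weekend = [n for d, n in daily_counts.items() if d.weekday() >= 5]
    return mean(weekday), mean(weekend)


# Made-up data: four weeks with 100 crashes on work days, 10 on weekend days.
start = date(2010, 6, 7)
counts = {}
for i in range(28):
    day = start + timedelta(days=i)
    counts[day] = 100 if day.weekday() < 5 else 10

weekday_avg, weekend_avg = weekday_weekend_means(counts)
print(f"weekday average: {weekday_avg}, weekend average: {weekend_avg}")
```

A large gap between the two averages suggests the crash rate is simply tracking usage, which is the conclusion above: a program that isn't running can't crash.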

Comments (31)
  1. Anonymous says:

    You have an extra opening tag for the table which makes it somewhat invisible…

    Btw, I see such graphs regularly – in my server logs :)

    [Busted table fixed, thanks. -Raymond]
  2. Dan Bugglin says:

    Not sure if my post went through, blog isn't saying anything, I'll try again…

    Haha the table is a flat blue line in my RSS Reader (Google Reader specifically).

    The usage of a table for that just horrifies me though. IE team needs to hurry up with 9 and canvas tag support! Or maybe you should consider using an image next time. I dunno, your usage of a table for this just baffles me, I can only assume the site you grabbed it off of used a table for it because:

    1) The site targeted IE and so couldn't use canvas (well it's an MS site I assume so that's pretty much a basic requirement).

    2) It dynamically generated the graph and the developer didn't realize you could generate images dynamically or didn't think they wanted to waste the bandwidth/disk space (hint: 1-bit two color PNGs compress really REALLY well).

    3) It was generated client-side using pure JavaScript. With no canvas and no data: uris like in Firefox/Chrome, no way to generate images, not even easy BMPs.

    [Interesting that you didn't complain when I used exactly the same technique a few days ago. I don't use images. -Raymond]
  3. Anonymous says:

    Wow. I am. In awe. Of this analysis.

  4. Anonymous says:

    I don't see 5 days with high values, and 2 days with low values in the image.

    I see 2 days with low values, then 2 days with high values, then 1 day again with a low value, and then 7 days with high values.

  5. Anonymous says:

    Are the two lower weeks American Thanksgiving and the week between Christmas and New Year's Day?

  6. Anonymous says:

    @Martins: read each vertical bar as representing an entire day, not, say, about 3 hours.

  7. Anonymous says:

    From that graph, you can possibly also deduce something about the geographic distribution of users: they're mostly in a similar timezone to each other, and it's close to the timezone that the crash profile reporter uses – otherwise the late Friday or early Monday users would likely skew into the weekend bins and you'd get a less prominent pattern. Grouping per day seems a little dangerous, and counting events per 8 hours might let the patterns stand out better when your 'day' boundary is a bit offset from the users' days.

    Dan: canvas would be completely inappropriate even if universally supported – it requires scripts so it would fail in all feed readers. A bitmap image is the only thing that's likely to survive in feeds. But, really, it's not that hard to click to open the original post and see the illustration, and this isn't a blog about writing conceptually perfect markup for blog posts.

  8. Anonymous says:

    I recognized it right off – it corresponds to server load logs. :-)

  9. Anonymous says:

    I don't use images. -Raymond

    Which explains why I couldn't see it. (High contrast enabled for accessibility. Changing web browser colors has no effect but inserting images with <IMG> tags does).

  10. Anonymous says:

    Apparently that's a table? And so was the 058 graphic from a few days ago? Neat.

    058 rendered correctly in Google Reader in Firefox 3.6.3 on Ubuntu 10.04, but this graph didn't.

    As for what the graph signified, I immediately grasped the workweek correlation, but was thinking of a second-order correlation; a product that was used at the same rate throughout the week, but with some incredibly obscure bug that only showed its face Monday through Friday; (like the hilarious Android autofocus bug http://www.engadget.com/…/23182303/) rather than the far more obvious variable usage explanation.

    [The 058 graphic had problems too, but I was lucky enough to fix it before you visited. -Raymond]
  11. Anonymous says:

    Whenever I hear of this kind of pattern I try to match them to actual (real-life) events before matching them to server load or anything like that. When you said "The number of crashes per day remains high for several days, and then plummet for two days, then return to their high values, repeating on a regular cycle" that seems enough to figure out that it's weekly even without the graph (using the graph to verify that it looks right).

    [I don't use images. -Raymond]

    This piqued my interest – why is that?

    [The text is for people who can't see diagrams. The original server didn't support images, and I just got into the habit. And it saves me from having to add image upload support to my autoposter. -Raymond]
  12. Anonymous says:

    As Raymond has explained before the blog software supplied has poor-to-zero support for images.  (I forget exactly whether it's absolutely none or just that the interface is unusable.)

    [Actually, image support was added a few years ago, but I don't want to change my workflow. Images make asset management more complicated. Right now, each post fits in a single file in my queue. -Raymond]
  13. Anonymous says:

    >Images make asset management more complicated. Right now, each post fits in a single file in my queue. -Raymond

    data: URLs in CSS can still make that true (at the cost of some inefficiency, but that's not a terribly big deal for very simple vector drawings)

    [Hm, but first I'll have to wait for IE7 to die. -Raymond]
  14. Anonymous says:

    From that graph, you can possibly also deduce something about the geographic distribution of users

    Also, you can see that there is not a significant number of users in Israel or Muslim countries, otherwise you would see the impact of the different weekday-weekend pattern in those countries (Fri/Sat weekend in Israel, Thu/Fri weekend in many Muslim countries).

  15. Anonymous says:

    "wait for IE7 to die."

    ROFL, once again Microsoft's broken legacy gets in the way.

    [Not sure what you're getting at here. How do you force people to upgrade? -Raymond]
  16. Anonymous says:

    wait for IE7 to die.

    I'm still waiting for IE6 to die on my employer's network. It's really funny since the webapp we use for most of our work is made by a third party, who designed it for Firefox. But someone somewhere in our organization mandated IE6 for all computers, and now the webapp works better in IE 6 than Firefox.

    It's fools that make those policies that make "Microsoft's broken legacy" get in the way, not Microsoft.

  17. Anonymous says:

    Interesting pattern, I've seen it a few times in our CMS statistics as well. Always makes me wonder who's working the weekends :)

  18. Anonymous says:

    Amusing.  (And an off-topic aside: you can only sign up to access crash data if you want to pay lots of money to the certificate-provider protection racket.  Which makes me sad.)

  19. Anonymous says:

    Isn't a pattern like this extremely common?

  20. Anonymous says:

    [Hm, but first I'll have to wait for IE7 to die. -Raymond]

    Is there some sort of rule that MSDN blogs must support legacy browsers? Also, being a technically focused blog, I'd assume your browser stats are skewed toward more modern browsers?

  21. Anonymous says:

    It's fools that make those policies

    There's nothing foolish about mandating a specific browser version for all staff. You can remove an entire class of support queries from your helpdesk, make life easier for your own developers as well as vendors, and improve employee productivity when they're mobile.

    And when you're a CTO, having stability is more important than having the latest version. Whether it be web browsers or anything else.

  22. Anonymous says:

    "How do you force people to upgrade?"

    Critical Windows Updates?

    [Been there, done that, been slashdotted, IE6 still not dead. -Raymond]
  23. Anonymous says:

    The company I work for mandates a specific version of IE. We discovered this when we were developing a webapp for internal use. We had to make some changes to support the 1% of developers who actually complied with the mandate.

  24. Anonymous says:

    [Not sure what you're getting at here. How do you force people to upgrade? -Raymond]

    Surely Microsoft can afford to hire some goons?

  25. Anonymous says:

    Does the fact that this pattern was surprising to Raymond mean that most apps in the db are not business apps? This indeed tracks a very regular usage pattern in many contexts. And apparently the pattern itself is the red herring in this post — it does not explain crashes! In fact it shows that crashes are happening uniformly and randomly when the app is in use, so the pattern has no information that can help understand the cause of the crash.

  26. > first I'll have to wait for IE7 to die

    Wait… are you saying that IE8 supports data: URLs?

    (Tries it)

    Wouldyalookatthat!  Props to IE8.

    As to your queue-of-single-files management, surely there are options… MHTML, or even just a .zip file…

    Inflicting images-tables-with-single-pixel-cells on your audience is a little ugly.

    [Right now, the autoposter is just "Here's some HTML, just post it verbatim." I'd have to teach it how to take MHTML/zip files apart, upload the image, then go in and edit the HTML to refer to the image, then post the HTML, then re-save the HTML into the MHTML/zip file with the edited URL. All to support a feature I don't even like! -Raymond]
  27. All to support a feature I don't even like!

    And you like tables-***-images?

  28. Fine, fine… rewording… tables-qua-images

    [I don't like them either, but I like them more than images since they don't disrupt my workflow. -Raymond]
  29. Anonymous says:

    Well, I have a system where Windows Update will never put IE7 or IE8 on the update list. I have not disabled those updates, and the computer gets its updates directly from Microsoft.

    I'm stuck on IE6 for it.

  30. Anonymous says:

    There's nothing foolish about mandating a specific browser version for all staff.

    I can understand that, really. What I can't understand is how we can be 2 versions beyond and still be stuck with a browser that isn't well known for being security conscious. Especially when the mission critical webapp was originally developed for a different browser entirely.
