There’s a reason why envelopes have backs


For some reason, people are upset that I don't have hard data for the cost difference between "slow" and "fast" mode enumeration. I already did a back-of-the-envelope calculation that showed that fast mode reduces the total time to enumerate the items in a folder from five minutes to two seconds. That's what's so great about back-of-the-envelope calculations: They let you make decisions without actually having to implement every possible solution.

Some quick estimation shows that using slow mode enumeration would be 200 times slower than fast mode. Does it really matter whether the speed-up is 195.1231 times or even 103.4761 times? Even if the estimate were off by an order of magnitude, a 20-fold speed-up is still worth it.

Imagine if people had to carry out experiments for every possible optimization (both before and after) to prove that it was worthwhile.

"Could you please deliver these letters?"

"Sure, here let me grab them."

"Why are you doing that? Why don't you take one letter at a time?"

"Um, because it'll be faster to take all of them at once so I don't have to keep coming back here."

"Do you have any hard data to support that? I'm not going to believe you until you show me some hard data."

Comments (34)
  1. richard says:

    Sounds like you wouldn’t make it as a scientist. I find scientists are unwilling to believe anything unless it is backed up with hard data that has been published in a respected peer reviewed journal and subsequently reproduced.

  2. andy says:

    In science it is most important that a claim is falsifiable (http://en.wikipedia.org/wiki/Falsifiability).

    And the nice thing about falsifiability is that it leaves the job of proving you wrong to the person who doesn’t believe your result.

    So, instead of asking Raymond for hard data, you should get it yourself to try to falsify his claim that fast-mode is quicker than slow-mode.

  3. Erik says:

    Good analogy.  Reminds me of something Richard Feynman said about the value of science:

    "But we always must make statements about the regions that we have not seen, or the whole business is no use."

    Full quote here:

    http://www.collectedthoughts.com/quote.aspx?id=11306

  4. vince says:

    andy:

    That’s rediculous, and lazy.

    Raymond hasn’t done a peer-reviewed set of experiments providing hard numbers.

    He has done a back-of-the-envelope calculation.

    If he wants people to believe his numbers, he should back them up somehow.

    If you want engineers to believe you, provide numbers.  Real numbers.

    Posting a patch to the linux-kernel list and saying "I predict this increases performance 200 times" without providing benchmark results will get you laughed out of there.  For good reason.

    The important thing in science is being able to reproduce someone else’s results.  You can’t reproduce (or prove wrong) someone’s "results" if they haven’t produced any in the first place.

  5. James says:

    "In theory grabbing all the letters at once is faster, but without actual measured results how do you know there’s any real-world benefit?"–

    –a variation on the joke where a mathematician starts to present a proof of a new theorem, when a mathematician in the audience calls out, "I have a counterexample!" "That’s OK," says the first mathematician, "I have another proof."

    Or, you say to a scientist (or engineer, apparently), "Your argument contains a fallacy; your conclusion doesn’t follow from your assumptions." "Well," he replies, "I’ll have to see some hard data to support your claim." (Or, alternatively, "I’ll have to see some real-world, actual measured results to support your claim.")

    To the man with a hammer, everything looks like a nail, and similarly for the man with "hard data that has been published in a respected peer reviewed journal and subsequently reproduced" or "actual measured results."

  6. Euro says:

    Actually, he would make a great engineer. Engineering requires great skills at knowing what’s worth measuring and what’s not. That’s why an engineer doesn’t usually commission a finite-element analysis to verify that a two story house will hold up. That’s why an engineer will not usually calculate torsion stresses on an ordinary beam, unless s/he has a reason to believe it might be an issue.

    Remember the adage about an engineer being the guy who can do for $50 what any fool can do for $1000? Well, measuring costs money.

    (Graduated as a Civil Engineer abroad, not a licensed Civil Engineer anywhere in the US, so this is not advice: always consult your personal Civil Engineer)

  7. The profound thoughts above notwithstanding, some optimizations simply dominate previous behavior.

    One cannot prove this, but it /is/… in the same sense that Mt. Everest /is/, or that Alma Cogan /isn’t/.

  8. ajb says:

    Vince-

    There is a difference between saying ‘I predict this will increase performance 200 times’ and saying:

    This will reduce the number of client-server round trips by one per file in the folder.  If a round trip costs, for example, 1.5 seconds, a 200-file lookup will take only a few seconds instead of 5 minutes.

  9. Mihai says:

    <<try to falsify his claim that fast-mode is quicker than slow-mode>>

    No need to.

    The fast-X is quicker than slow-X, by definition :-)

  10. Ulric says:

    You guys are tense… this is a blog, someone talking about stuff he loves, interacting with other people.  Blogs are practically at the level of water-fountain conversations, if water fountains came with a blackboard to scribble some stuff on.

    If this were a published book you paid for, no doubt there would be some extra time and verification, on top of the already existing checks.  From my personal experience, I figure Raymond can probably spend more than an hour verifying a post (and just imagine how many people will put him back in his place if he gets details wrong!)  Deeper testing would probably take several days.

    I’ve been maintaining a file browser that uses IShellFolder, which has been through several iterations based on client complaints about its network speed, and I’m interested in the topic.  We have had these problems.

    imho the topic is about the general idea behind it, and not about a specific quoted speedup figure.

  11. vince says:

    Actually, you wouldn’t make a good engineer.

    Scientists (including Computer Scientists) are very good at “proving” optimal solutions that turn out to be unworkable in practice.

    As engineers the reason we want to see  *actual measured results* is because then you know for sure.

    See:

    http://www.gettysfamily.org/wordpress/?p=26

    Also, there are a lot of times that making something 200 times faster in theory has absolutely no real-world benefit.  Using a quick-sort instead of a bubble sort on a 10 item list.  Giving your CPU a 200 item issue queue.  Things like that.

    [Naturally, the results were verified afterwards to confirm that the predicted optimization actually helped, but I don’t have that data. (Because, surprise, people don’t send me copies of every performance test run anywhere at Microsoft.) But you shouldn’t need that hard data to realize that the change was worth trying. -Raymond]
  12. David Walker says:

    Vince:  It’s "ridiculous", not "rediculous".

  13. Not Raymond says:

    If I were Raymond you guys would make me quit posting.

  14. Norman Diamond says:

    Tuesday, October 03, 2006 3:11 PM by Ulric

    > Blogs are practically at the level of water-fountain conversations

    I agree.  That’s why agreements and disagreements get into the comments, and that’s why conversations fragment along several tangents.

    Also consider if a water fountain conversation had one speaker and multiple listeners.  Of course some kinds of announcements are designed to operate that way, but they’re still a lot less popular than typical water fountain conversations.

    Oops sorry, this tangent went on longer than expected.  I could prove this assertion but don’t have an envelope at the moment.

  15. Andres says:

    "And you’re a civil engineer?  Remind me not to drive across any bridge you design."

    Then I suggest you don’t drive across ANY bridge, because that’s how they are designed. A bunch of "people who know about bridges" sit for a while doing math, until they are convinced that the bridge will hold.

    Of course, the "hard-data approach" may seem better, but it’s just too expensive… before I can say how much weight my bridge is going to hold, I have to spend millions of dollars to build one and start putting weight on it until it breaks. Then I’ll have "hard data" on what the bridge can hold. Except that maybe I was lucky, so I really need to build 100 bridges, and then see if they all hold the same weight. Only then can I build the bridge I actually wanted to build.

  16. N. Velope says:

      This would be a good time to remind everyone to read the chapter "The Back of the Envelope" in "Programming Pearls" by Jon Bentley.  It actually talks about building bridges and programming.

    http://www.cs.bell-labs.com/cm/cs/pearls/bote.html

    and

    http://www.cs.bell-labs.com/cm/cs/pearls/quiz.html

  17. Matt Ryall says:

    The problem is that in general by enhancing performance we degrade some other aspect of the software, typically maintainability.

    Degrading maintainability will have a longer-term impact on the speed and performance of the application, so it’s important to analyse the improvement in short-term performance before making this sacrifice.

    That’s the reason behind the oft-heard, "premature optimisation is the root of all evil".

  18. Slombo says:

    "The problem is that in general by enhancing performance we degrade some other aspect of the software, typically maintainability."

    While I’ve not seen the code here, it would appear to me that this optimisation is both obviously faster and (presumably) more readable/maintainable.

    How is GetAllFiles() harder than While(FilesExist);GetNextFile();loop
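    (A toy mock makes the round-trip difference concrete; GetAllFiles and GetNextFile are hypothetical stand-ins from the comment above, not a real API. The point is one server call versus one call per file:)

    ```c
    #include <stdio.h>

    /* Toy in-memory "server"; each call to it counts as one round trip. */
    const char *folder[] = { "a.txt", "b.txt", "c.txt" };
    const int folder_size = 3;
    int round_trips = 0;

    /* "Fast" mode: one request returns the whole listing. */
    int GetAllFiles(const char **out, int max)
    {
        round_trips++;
        int n = folder_size < max ? folder_size : max;
        for (int i = 0; i < n; i++)
            out[i] = folder[i];
        return n;
    }

    /* "Slow" mode: one request per file; NULL means no more files. */
    const char *GetNextFile(int index)
    {
        round_trips++;
        return index < folder_size ? folder[index] : NULL;
    }

    int main(void)
    {
        const char *all[16];

        round_trips = 0;
        int n = GetAllFiles(all, 16);
        printf("fast mode: %d files, %d round trip(s)\n", n, round_trips);

        round_trips = 0;
        int count = 0;
        for (int i = 0; GetNextFile(i) != NULL; i++)
            count++; /* process the file here */
        printf("slow mode: %d files, %d round trip(s)\n", count, round_trips);
        return 0;
    }
    ```

    Same three files either way; the only difference is 1 round trip versus 4.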

  19. vince says:

    > Actually, he would make a great engineer. Engineering requires great skills at knowing what’s worth measuring and what’s not. That’s why an engineer doesn’t usually commission a finite-element analysis to verify that a two story house will hold up. That’s why an engineer will not usually calculate torsion stresses on an ordinary beam, unless s/he has a reason to believe it might be an issue.

    And you’re a civil engineer?  remind me not to drive across any bridge you design.

    People making decisions because of back-of-the-envelope calculations and not actually testing things give us things like the ceiling of the Big Dig collapsing.

    The whole issue, if you go back to the original blog entry, was that the "200 times speedup" given was calculated from a really contrived scenario.  Some of us were like "nice theory, but we’d like to see some actual numbers before believing that’s the real-world result".

    From experience I’ve seen many times people claim that you’ll get big improvements doing something a different way, only for it to turn out to be the same, or even *worse*.  

    Estimating, back-of-the-enveloping, and brainstorming are great, and good for a discussion.  They are never a substitute for actual implementation and testing.  And some of us prefer our software to actually run faster in real life, not just on the back of some envelope.

  20. Dean Harding says:

    > The whole issue, if you go back to the original blog entry, was that the "200 times speedup" given was calculated from a really contrived scenario.  Some of us were like "nice theory, but we’d like to see some actual numbers before believing that’s the real-world result".

    Why would you need that? It’s trivial to see. If you have to do 201 requests in the "slow" way and only 1 in the "fast" way, then it’s *by definition* 200x faster. Each individual request takes approximately the same amount of time (that’s what’s great about networks – a 1KB request takes about the same time as a 100KB request).

    I can’t believe anyone would need "hard data" to be able to see that.

  21. Kin says:

    In the examples, replace letters with 15kg packages which slow down your movement quite a bit.

    Soon fetching all of them at once is not the faster way.

    This is probably a very trivial optimization, but it’s easy to think something will bring benefits when it actually does not, or (worse) when it effectively slows down some other interaction.

    That said, an (absurd) network connection with 1ms latency, 1kbit/sec bandwidth and 50% error rate (and a 1MB MTU and no hardware error check) will probably kill you in fast mode.

  22. Kuwanger says:

    "While I’ve not seen the code here, it would appear to me that this optimisation is both obviously faster and (presumably) more readable/maintainable.

    How is GetAllFiles() harder than While(FilesExist);GetNextFile();loop"

    While it might not be harder directly, things like caching, swapping, and a slow read medium can make getting a listing of more files than actually necessary detrimental, both in performance and in memory management.  So, when one is searching for a specific file in a directory on a remote server, it’s not necessarily clear that the slow method is actually slower.  And it can actually be harder, if one repeatedly allocates and deallocates a large memory chunk (which might actually be proper design) for a whole file listing when a few reads and a simple loop would have done.

  23. Jade Philosopher says:

    The general thrust of the anti-envelope comments reminds me of a Feynman observation about results presented at a gravity conference – they were either obviously wrong or convoluted demonstrations of obvious truths.

  24. Dean Harding says:

    > While it might not be harder directly, things like caching, swapping, and a slow read medium can make getting a listing of more files than actually necessary detrimental, both in performance and in memory management.

    So what? We’re not trying to speed up getting only partial directory listings here. We’re speeding up the SPECIFIC case of Explorer listing a folder’s contents. The ENTIRE folder’s contents. If you only want one file, the "slow" method is still there.

    I think Jade Philosopher is right. Either that, or people only read half-way before commenting.

  25. James says:

    Dean: alternatively, Explorer could optimise by working out that it will be displaying, say, nine items, and fetching only those nine. (Depending on sort order and what information is needed for the status bar, obviously.)

    Just opening my i386 directory on a network server takes around 12 seconds to enumerate the 5,941 objects within – and it only has to display the first 12, the rest are out of sight! (Come to think of it, displaying the first 12 straight away then fetching the rest asynchronously would be nice too…)

    [So you’re saying that Explorer should not sort the results? How else can it tell which are the “first 12”? -Raymond]
  26. swe dude says:

    > [So you’re saying that Explorer should not sort the results? How else can it tell which are the “first 12”? -Raymond]

    Why can’t the server return the first 12 first?

    If that’s not possible, it would at least be possible for Explorer to show the first 12 files returned and later insert other files into the shown result if required.

    [What is the definition of “first”? Alphabetical order? (Sorted according to which language?) Order of creation? By song length? Alphabetical by author? FindFirstFile is not a SQL query. -Raymond]
  27. Aaron says:

    Seems to me that some people don’t understand the difference between a reasonable estimate and a wild guess.  And those people feel that they know what makes a good engineer.  That’s rather ironic.

    On the other hand, the concept of "falsifiability" in science has been grossly misinterpreted here as removing the burden of proof from the one making a claim.  Falsifiability means that a hypothesis *can* be disproven, and a claim is unscientific if it is not falsifiable.  It certainly does not mean that a claim is scientifically *true* unless *disproven*!

    Software has strong roots in both the science and engineering fields.  I think what we’re seeing in here is a bizarre argument between two groups who are firmly on their own side of the fence and perhaps don’t realize that the other side even exists.

    I’m pretty sure that Raymond has a good grasp of both sides.  The fact is, he’s neither proposing a scientific theory nor building a bridge.  It’s not necessary to have reams of hard data as long as the logic is sound (which it is) and the estimate is reasonable (which it is).

  28. James says:

    “So you’re saying that Explorer should not sort the results? How else can it tell which are the ‘first 12’? -Raymond”

    In the Explorer window I just opened, there are four sort order options: Name, Size, Type and Modified. Since none is ticked, no order has been specified; the first 12 returned by the server would fulfil this requirement as well as anything else. Getting any 12 straight away while the other few thousand get retrieved beats an animated flashlight and nothing else until it’s possible to apply an unspecified and unsolicited order, IMO! (If/when I actually ask for the results to be sorted in some order, that’s different – but I haven’t.)

    In this case, the meaning of ‘first’ is as obvious as in ‘findfirstfile’: the first one returned, whatever the source order might happen to be. Apart from anything else, it’s the only meaning which works; to query it is like asking which timezone someone is asking about when they ask what time it is…

    [I think if you look more closely, your “unsorted” view is still sorted. -Raymond]
  29. BryanK says:

    Couldn’t you ask the server for all the names (but not the handles) in large chunks, sort the names, then go back and get the handles for the first 15 files synchronously, then display the first 15 files, and only *then* go back and get the rest of the handles asynchronously?

    (Now this won’t work if the redirector’s implementation of NtQueryDirectoryFile doesn’t return all the possible attributes that the folder may be sorted by.  However, FindFirstFile’s return structure has members for both size and the various dates, so it should be possible to get that data the first time through.)

    And certainly the "fast mode" would still be faster than this.  But at least this would alleviate some of the problems with slow mode, because the slowness would be happening in the background.

  30. Norman Diamond says:

    > I think if you look more closely, your “unsorted” view is still sorted.

    Um…

    > Alphabetical order? (Sorted according to which language?)

    Windows Explorer is showing 4 filenames in a folder:

    downloading_vista_0.png

    downloading_vista_3.png

    downloading_vista_221.png

    downloading_vista_-3.png

    Windows Explorer is showing a small ascending triangle next to the word for filename.  Most likely if I want to I can find the statement of how this is sorted, but is this alphabetical order?  Sorted according to which language?

    Sure sorting isn’t a bad thing and I wouldn’t mind having more options.  I’d like to set things so that Windows Explorer will sort lists in the same order on local drives as it does in ftp views.  Lower priority, I might want to set Outlook Express 6 to sort things in the same order as Outlook Express 6.  But sometimes maybe it would be OK to say it’s listing things as it finds them while showing that the search is still under way, like a dog does.

    (By the way I didn’t really capture a screenshot of Internet Explorer downloading Vista RC2 at the 3% point, I only created a file with that name to check how Windows Explorer would sort it.  At the 0%, -3%, and 221% points, I captured screenshots.)

    [Michael Kaplan covered this topic just a few days ago. As for the hyphen, I covered this myself. -Raymond]
  31. TCLIU says:

    Guys, if you don’t think fast mode is faster in enough cases to make it useful then you don’t have to use it.

    Personally I think Raymond is spot-on here with his estimate, but that’ll only affect code that *I* write.

    Now I need to get back and benchmark whether cout or printf is the faster way to display the help text for my app. Does anyone here have data on whether processor architecture / SMP affects the performance of stdout? :P

  32. James says:

    “[I think if you look more closely, your “unsorted” view is still sorted. -Raymond]”

    I know it *is* sorted – my point is that I didn’t *ask* for it to be sorted, the relevant menu indicates no sorting order, and that sorting it is causing problems (a non-trivial delay opening the folder, to achieve a result I didn’t request in the first place). Explorer not indicating that it is sorted is, arguably, a bug, the fact the implicit ‘no sorting’ option doesn’t actually deliver is just an irritating drawback.

    FWIW Unix’s (GNU’s?) ‘ls’ behaves as I’d want: for ‘small’ numbers of files they’re sorted by name, anything higher it would take too long, so they’re unsorted. Explorer could go one better and display results as they come in, as it does already for search operations…

    TCLIU: For real performance gains, just #define them both as noops :-)

    [I can see the bug reports already. “Explorer shows files in scrambled order… sometimes.” If you don’t pick an explicit sort order, then the folder’s “natural” sort order is used. -Raymond]
  33. James says:

    [I can see the bug reports already. “Explorer shows files in scrambled order… sometimes.” If you don’t pick an explicit sort order, then the folder’s “natural” sort order is used. -Raymond]

    I can’t really see anyone regarding ‘Explorer shows files in the order I requested’ (besides, if ‘natural’ order actually matters to anyone, why is there no option to specify it?!) as more of a bug than ‘Explorer sits twiddling its thumbs for long periods of time’ – and displaying incrementally rather than waiting for the whole lot to arrive would preserve whatever order is being imposed, at the expense of a bit of CPU time. Sort and display the results as they arrive, everyone’s happy – make them wait as long as possible, nobody gains except perhaps the CPU temperature or battery life.

    Web browsers have been doing it for years, for faster operations than that!

    (On a related note, it would be really nice if XP’s Notepad didn’t stop the message pump while trying to open files, too…)

    [This creates the different problem of “I was about to click on something and it moved and I ended up clicking on the wrong thing.” As for the related note, edit controls (like all window controls) are single-threaded. There is no special treatment for the WM_SETTEXT message. -Raymond]
  34. Norman Diamond says:

    > This creates the different problem of "I was about to click on something and it moved and I ended up clicking on the wrong thing."

    Windows Explorer already does that, in Vista, and it is by design.

    However I think James’ idea was that Windows Explorer would list files as they’re found (and my idea was that Windows Explorer would show its status as still in the middle of reading), adding to the list as they come in.  The user would click to sort when the user is ready for the display to be reordered.

Comments are closed.
