Bad Analysis Worse Than None


Once again I'll begin by saying that I'm simplifying what follows in the interest of not writing a novel, so please take this article as only approximately correct. I try to make the principles more important than the details.

I've said before that if you make no measurements, you're pretty much doomed to miserable performance. If you can get good performance without measuring, then I think you've missed your calling in life and should probably change careers to "Professional Lottery Ticket Picker" (and please call me up :))

So the only thing worse than no data is bad data, or badly interpreted data, or both. And now, without further ado, here is Rico's free advice on bad data.

1) If you’re using an invasive profiler, don’t believe the times it gives you

When I say an invasive profiler, I mean one that instruments every call site in the program that’s being monitored.  And when I say don’t believe the times, I mean you might be better off just pretending it isn’t even giving you times at all.  There are several reasons why this is usually a good idea.

A typical invasive profiler might hook the prolog and epilog of every function. It measures the time spent in each call and then, in digital defiance of the Uncertainty Principle, it attempts to subtract out its own influence. Actually the problem here isn't really quantum mechanical at all; it has to do with numerically unstable computations.

The function being measured is going to take some amount of time; let's call it x. And performing the measurement is going to take some amount of time too; in a fit of originality we'll call that y. When we hook the function and note its entry time and exit time, we'll actually have measured x+2y (one y for the entry probe and one for the exit probe). We only want x, so we're going to have to subtract out the 2y. The way this is usually done is that at the beginning of the run we carefully measure how long it takes to take a measurement, thereby getting a high precision estimate of y. And armed with that number we bravely multiply it by two and prepare to subtract out all the interference.
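
To make the arithmetic concrete, here's a minimal sketch in C# of the calibrate-and-subtract scheme such a profiler effectively uses. It isn't any particular product's code; the class and method names are invented for the illustration.

    using System;
    using System.Diagnostics;

    static class InstrumentationSketch
    {
        // Estimate y once at startup: the average cost of one timing probe.
        static double CalibrateProbeCost(int reps)
        {
            long sink = 0;
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < reps; i++)
                sink += Stopwatch.GetTimestamp();      // the probe whose cost we want to know
            sw.Stop();
            GC.KeepAlive(sink);
            return (double)sw.ElapsedTicks / reps;     // average ticks per probe
        }

        // What the profiler reports for one call: it observed x + 2y
        // (entry probe + body + exit probe) and subtracts twice its startup estimate of y.
        static double CorrectedTime(long entryTicks, long exitTicks, double probeCostEstimate)
        {
            double observed = exitTicks - entryTicks;     // this is x + 2y
            return observed - 2.0 * probeCostEstimate;    // hoping the startup y still applies
        }

        static void Main()
        {
            double y = CalibrateProbeCost(100000);
            Console.WriteLine($"estimated probe cost: {y} ticks");
        }
    }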

Several things go wrong at this point.

First, y is very small, and if our function is a small hot thing being called very often then x is also very small.  The worst thing you can do to kill your significant digits is try to subtract two things that are close in magnitude in order to get a (much smaller) difference.  That’s exactly what we’d be doing here for small functions, because x+2y is actually quite close to 2y.  The next thing we’ll do is repeat this operation, perhaps tens of thousands of times for frequently called functions, multiplying the potential error.  I can feel my Numerical Analysis professor shuddering.
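
To put made-up but plausible numbers on it (these magnitudes are illustrative, not measurements of any real profiler):

    using System;

    // Illustrative magnitudes only: suppose the function body costs x = 50ns
    // and each timing probe costs about y = 200ns.
    double x = 50, y = 200;                        // nanoseconds
    double observed = x + 2 * y;                   // the probes actually see 450ns
    double yEstimate = y * 1.05;                   // startup calibration off by just 5%
    double corrected = observed - 2 * yEstimate;   // 450 - 420 = 30ns reported
    Console.WriteLine($"reported {corrected}ns vs. actual {x}ns");
    // A 5% error in y became a 40% error in x.  Repeat that over 10,000 calls and
    // roughly 0.2ms gets attributed to the wrong place.  And if the real per-call
    // overhead is enough lower than the startup estimate, "corrected" goes negative.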

Second, y has significant variability from function to function. This is what makes the first problem truly deadly. You see, on modern processors the cost of doing pretty much anything is deeply dependent on what happened previously, and in some cases on what is to come. Even simple-looking instructions, like the ones that record a time into some kind of buffer, can be and are deeply affected by the processor's cache. As a result, the average measurement of y at program startup could be significantly different from what is observed in any given function. That spells trouble. I remember a friend who was working on one of these found that in some cases he was getting negative times for functions, because the overhead was much less in those cases and the 2y measured at startup was more than the entire observed x+2y in the run. Ouch.

But all is not lost: the bigger the function, or more generally the bigger the time period being measured, the greater the accuracy. But remember, if the function calls other functions, all those nested overheads will need to be subtracted out as part of the process, so that isn't helping you at all. Only the big meaty worker functions are truly getting the best measurements. For small functions, put little trust in the number; for large functions, put more. Don't compare the time spent calling a small function thousands of times to the time spent in fewer calls to a big function, as the error margin is likely to be huge (I've seen times that were off by a factor of 5).

What these profilers excel at is giving you a picture of what is being called and how often. In fact, the count data is definitive. So do trust the counts and where they came from, and use those to guide your analysis. In my experience, these kinds of runs are the most costly and the least reproducible from run to run (time-wise), which makes it hard to know whether you're actually making progress. Take the times with a big grain of salt. Milk the counts.

2) If you’re using a sampling profiler, make sure you have plenty of samples

Would you believe that 4 out of 5 dentists really prefer Brand X toothpaste if only 5 dentists were actually surveyed?  Of course not.  Yet perfectly sane engineers believe that 10% of their startup execution is in Function X because 1 of their 10 samples happened to land in that function.

Sampling profilers work just as the name implies. They sample the processor's IP (and sometimes a whole callstack) periodically when some event occurs, usually the passage of a fairly small amount of time, say every 5 milliseconds. I like sampling profilers because they are the least invasive, so you can mostly trust the times. I say mostly because you must have a statistically valid sample size to draw any conclusions at all.
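
A rough sanity check you can apply to any sampled profile: treat "did this sample land in function F?" as an independent coin flip and compute a margin of error. The independence assumption is a simplification and the helper below is invented for illustration, but it's enough to show why tiny sample counts prove nothing.

    using System;

    static class SampleSanityCheck
    {
        // Approximate 95% margin of error for the fraction of samples in one function,
        // using the normal approximation to the binomial (which is itself shaky at
        // tiny counts -- that's rather the point).
        static double MarginOfError(int hits, int totalSamples)
        {
            double p = (double)hits / totalSamples;
            return 1.96 * Math.Sqrt(p * (1.0 - p) / totalSamples);
        }

        static void Main()
        {
            // 1 hit in 10 samples: "10%", give or take about 19 percentage points.  Useless.
            Console.WriteLine(MarginOfError(1, 10));
            // 100 hits in 1000 samples: 10%, give or take about 1.9 points.  Now we can talk.
            Console.WriteLine(MarginOfError(100, 1000));
        }
    }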

Where sampling profilers really fall down is when you're trying to measure, in detail, a brief event that occurs infrequently. Actually, even getting detail on a long event that occurs infrequently is problematic, because you'll miss so much of the fine detail between samples. You really need the event to be repeated so that the sampler has plenty of chances to land on each of the interesting points.
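
The usual workaround is to drive the scenario repeatedly under the profiler so the sampler gets many chances to land inside it. A minimal sketch, where DoRareScenario is a hypothetical stand-in for whatever brief operation you care about:

    using System;

    static class RepetitionHarness
    {
        static void DoRareScenario()
        {
            // stand-in for the brief, infrequently-run code path under study
        }

        static void Main()
        {
            // Repeating the scenario gives a sampling profiler many chances to observe it.
            // Keep in mind the repetition itself changes cache and allocator state, so this
            // tells you about the steady-state cost, not the cold first run.
            for (int i = 0; i < 10000; i++)
                DoRareScenario();
        }
    }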

Even with these limitations, for my money, sampled data is the best.

3) Fear the interference

I sort of hinted at this earlier but it bears discussion on its own merits. 

When measuring a large program it's tempting to isolate the part that's interesting to you and measure just that, without running the rest. You might even be able to boil the relevant part down into a so-called micro-benchmark. It's very tempting, considering the time savings and productivity gains you can get by reducing the analysis. Don't do it. Or at least, don't do it without being very careful.

A big part of how your program runs and where the costs are has to do with how it interferes with itself.  A simple example is where two algorithms both tuned to have good caching behavior might play very poorly together because collectively they create a very bad caching pattern.  But even algorithms that are playing comparatively well together are placing somewhat of a burden on each other. If you were to run only part of the program and measure that, you might not be able to conclude very much about how changing the part you measured would affect the whole.
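
Here's a sketch of that flavor of interference. The working-set sizes and the cache behavior described in the comments are assumptions about typical hardware, not measurements; the point is only that timing a routine by itself can hide the memory traffic it suffers, and causes, once the rest of the program is running too.

    using System;
    using System.Diagnostics;

    static class InterferenceSketch
    {
        // Two working sets of roughly 8 MB each (sizes are illustrative).  Either one alone
        // may stay resident in a large last-level cache across repeated passes; interleaved,
        // each pass tends to evict the other's data, so both "cache friendly" routines pay
        // memory costs that neither showed when measured by itself.
        const int N = 1024 * 1024;
        static readonly long[] WorkingSetA = new long[N];
        static readonly long[] WorkingSetB = new long[N];

        static long PassOver(long[] data)
        {
            long total = 0;
            for (int i = 0; i < data.Length; i++) total += data[i];
            return total;
        }

        static void Main()
        {
            long sink = PassOver(WorkingSetA) + PassOver(WorkingSetB);   // warm both once

            // Cost of 100 passes over A when nothing else runs in between.
            var sw = Stopwatch.StartNew();
            for (int rep = 0; rep < 100; rep++) sink += PassOver(WorkingSetA);
            Console.WriteLine($"A alone:       {sw.ElapsedMilliseconds} ms");

            // Cost of the same 100 passes over A when another routine runs in between.
            long aTicks = 0;
            for (int rep = 0; rep < 100; rep++)
            {
                long start = Stopwatch.GetTimestamp();
                sink += PassOver(WorkingSetA);
                aTicks += Stopwatch.GetTimestamp() - start;
                sink += PassOver(WorkingSetB);   // not counted, but it disturbs A's cached data
            }
            Console.WriteLine($"A interleaved: {aTicks * 1000 / Stopwatch.Frequency} ms");
            GC.KeepAlive(sink);
        }
    }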

The very best measurements, and the only ones that really count, are the ones that time what the user will experience.  Anything that isn’t the user experience is something we’ve created for our own productivity, and those are useful only to the extent that they are probative. Do they help us to find real user problems more quickly?  Do they predict the user experience at all?  What correction factor must be applied (if any)?

I think about four times a year I see a new implementation of memcpy that is purportedly better than all previously known memcpys. Now, the thing about memcpy is that people have been thinking about this problem for a long time, so you already have to treat a newcomer with some suspicion. But the real kicker is that the proof that is usually offered comes in the form of side-by-side micro-benchmarks of the standard memcpy and the super-memcpy. This is kind of interesting but it may not be very probative. What really matters is: "Is the new thing faster in the context of actual programs making calls to memcpy?" Is it as frugal with cache lines? Are there other side effects? Does the benchmark accurately reflect the typical alignments and lengths seen in practice, or is it all "easy cases"?
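
To illustrate the "easy cases" trap (not to benchmark anything real), here's a sketch with Buffer.BlockCopy standing in for memcpy. The first loop has the flattering shape these proofs usually take: one big, hot, zero-offset copy repeated over and over. The second is a little closer to how programs actually call it: lots of short copies at varied lengths and offsets. Neither loop touches cold memory or competes with the rest of a real program's work, which is exactly the further interference an honest comparison would also have to account for.

    using System;
    using System.Diagnostics;

    static class CopyBenchmarkSketch
    {
        static void Main()
        {
            var src = new byte[1 << 20];
            var dst = new byte[1 << 20];

            // "Easy case" micro-benchmark: one large, hot, zero-offset copy, repeated.
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < 1000; i++)
                Buffer.BlockCopy(src, 0, dst, 0, src.Length);
            Console.WriteLine($"hot 1MB copies:     {sw.ElapsedMilliseconds} ms");

            // Closer to real call sites: many short copies with varied lengths and offsets,
            // so the routine's handling of small sizes and awkward alignments actually matters.
            var rng = new Random(1234);    // fixed seed so runs are comparable
            sw.Restart();
            for (int i = 0; i < 1000000; i++)
            {
                int len = rng.Next(1, 256);
                int off = rng.Next(0, 64);
                Buffer.BlockCopy(src, off, dst, off, len);
            }
            Console.WriteLine($"small mixed copies: {sw.ElapsedMilliseconds} ms");
        }
    }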

I think it’s fair to say that memcpy is not changed four times a year, notwithstanding the availability of new implementations at that rate.  Thorough analysis on even so small a thing is after all quite difficult.

Comments (31)

  1. Frans Bouma says:

    Good article.

    I recently profiled code using NProf (http://nprof.sourceforge.net/Site/SiteHomeNews.html), which is an open source profiler for .NET (in alpha, so it dies sometimes). It keeps its own calls in the list of calls made, which allows you to just see the facts measured, and not numbers after subtracting anything.

    What I did find with NProf was really interesting and it did make my code 200% faster in some areas (measured with lots of loops running non-profiled). For example:

    myStringBuilder.AppendFormat("[{0}].[{1}].[{2}]", a, b, c);

    is much slower than

    myStringBuilder.Append("[" + a + "].[" + b + "].[" + c + "]");

    even though you do string concatenation. Also, performing a List.Contains(newObject) in the Add() method of a class derived from CollectionBase is very expensive as the collection gets more and more objects. Furthermore, a SortedList is also very expensive; if you can, use a Hashtable.

    Small things which don’t look that obvious at first but will pop up when profiling code.

    One nice thing I found with profiling (and then decompiling, but if I hadn’t profiled it, I wouldn’t have decompiled it) was that if you have this:

    switch(myStringVariable)
    {
        case "Foo":
            // some code
            break;
        case "Bar":
            // some other code
            break;
        // etc.
    }

    it will be optimized by the compiler (C#) into a static hashtable which has the string as key and an int as value. That int is then used in the switch 🙂

  2. .. says:

    What is the best profiler that shows memory leaks and exceptions?

  3. David Levine says:

    Good article. I noted that you did not discuss instrumenting code with analysis probes, such as performance counters, to help get a handle on performance issues. Even though it is manual and invasive I’ve found that to be a useful technique. What is your take on that?

  4. Stephane Rodriguez says:

    a better memcpy?

    Funny, this reminds me of the Amiga and its Blitter DSP. The blitter would do all those kinds of things, so why would anyone with common sense try to write code for that instead? A stupid waste of time.

    How does this translate today? Simple: display cards have built-in blitters. I don't see anyone anxious about performance issues trying to reinvent memcpy, ignorance set aside, when the display card can do that.

    Of course, by analogy, texture tiling can serve some other purposes as well.

  5. .. says:

    Problem is, how do you access the blitter on any gfx card in a standard way.

  6. theCoach says:

    Rico,

    It would be very helpful if you could include the names of products that you recommend. In general, my feeling is that Microsoft’s best practice development environment is not well distributed to Microsoft shops. It would be very helpful if your products included full lifecycle tools, or recommendations.

    Thanks.

  7. Mark Alexander says:

    This reminds me of the best profiling quote ever:

    "Do you know which country has the greatest crime rate per capita? Vatican. Profiling is like statistics: numbers are meaningless without a good interpretation."

  8. Stephane Rodriguez says:

    "Problem is, how do you access the blitter on any gfx card in a standard way."

    IDirectDrawSurface->Blt(…)

    After all a direct draw surface is just a block.

  9. .. says:

    How do I do that on other .NET runtimes without DirectX? That is a proprietary method, not standard.

  10. Stephane Rodriguez says:

    DirectX is part of the OS, just in case.

  11. Rico Mariani says:

    I think I would move extremely cautiously on adopting DirectX to do my memory moving for me.

    Amiga’s Blitter was a sweet little device, and it could move memory every clock. But even so we had to be careful using it because it had significant setup time. Ah the days of comp.sys.amiga…

    These kinds of issues are often the problem with "better" memcpy’s — it’s hard to know what will work best on what sizes and alignments of copies.

    Now on a modern processor, I don’t think there is any issue with the CPU keeping up with the memory bus at all. Indeed the comparative slowness of the standard memory bus is what makes graphic chips seem so much faster when they operate on their own "wired for speed" internal memory busses.

    I would be surprised if blitting to main memory turned out to be faster than having the CPU do the work. Even in the days of the Amiga it was a horse-race, and the Blitter did have advantages.

    I’ve no data, so all I can do is say to proceed with caution.

    Another of the comments asked if I had specific recommendations for profiling technology. I’m afraid I have to decline to answer that one. I think I could get myself into a heap of trouble by endorsing one profiler over another. However you, my gentle readers, might have no such constraints. So please do share your experiences if you are so inclined.

    Lastly, there was a comment about the usefulness of adding specific probes for analysis. I totally recommend this technique as a final step in getting definitive data. But use with care. If you do this too much then you’re too invasive. If you do this too soon then you’ll be spending a lot of time instrumenting the wrong areas. Start with something broader first and then narrow down as needed.

  12. Jeff Clark says:

    Good article. In my experience using Quantify I find the ratios of how much time was spent in one method versus another to be accurate; however, the actual times reported are off, as you say. For example, if in reality MethodA takes 10 seconds and MethodB takes 20 seconds, Quantify may report that MethodA takes 20 seconds and MethodB takes 40 seconds. However, when I'm profiling, all I really care about is the relative weights of the times in relation to each other, not benchmarks, so as long as the ratios are right, that's all that matters.

  13. Ivonna Vackoff [MSFT] says:

    Yeah, but the second I hook into DX I am no longer cross-platform, so I may as well not use C#.

  14. Rico Mariani says:

    In my experience, even the relative times can be off. In particular small functions tend to get overweighed.

    The trick is to just be careful. Certainly don’t take the times as gospel.

  15. Ken Brubaker says:

    Wise words on profiling from Rico

  16. Frans Bouma says:

    "How does this translate today? Simple, display cards have built-in blitters. I don’t know see anyone anxious about performance issues try to reinvent memcpy, ignorance set aside, while the display card can do that.

    Of course, by analogy, texture tiling can serve some other purposes as well. "

    No, this will not work 🙂

    The Amiga blitter worked on the same memory as the 680x0 chips did, using the uneven cycles of the CPU to access the RAM via the bus so it wouldn't stall any system activity (pretty clever system actually :)). In a PC, the GPU works on its own memory, so it can't copy memblocks around in mainmem, because it doesn't have access to that RAM. Moving data back and forth across the AGP bus is IMHO slower than having a CPU-based memcopy. Nevertheless, a DMA-based chip somewhere in the system which does have mainmem access could do what you suggest.

  17. Frans Bouma says:

    "Even in the days of the Amiga it was a horse-race, and Blitter did have advantages."

    Indeed. Although the idea is nice, I don’t think it will make much sense in a common system. Most of the time the blitter was used on the amiga to do non-graphical work, it was blitting values in a copperlist for example or values onto a set of asm instructions which then would get different meanings (so you could move bob-balls with solely a copper list and a blitter action :D), i.o.w.: self modifying code.

    You could run the blitter in forced mode which would steal bus cycles away from the CPU, (and practically stall the CPU), however this wasn’t a big advantage. If I remember correctly (but I have to dig up my Amiga Internals book from the basement to get the facts ;)) the blitter was just 4 times faster than the CPU, and the speed gain was created by having it doing things in parallel (eventually with a copper list doing other things in parallel) with the CPU.

    On modern PCs this is absolutely not possible, as the bus is so much slower than the CPU that every action you do on the bus will stall the CPU, and things aren't possible to do in parallel unless the processors performing these parallel actions have DMA access to the memory. In theory, a drive chip could therefore do memcopies 🙂

  18. .. says:

    Optical Hypertransport bus not fast enough for you?

