Unintended consequences of adding APIs to the system


Yesterday, I wrote about a trick to reduce the number of bits in a number by one.

It turns out that I've only ever had one opportunity to use this trick (although I ran into an instance of it when code reviewing some stuff the other day), back when I was writing the NT 3.1 network filesystem.

One of the things that makes writing a network client so hard is that you need to map legacy versions of your network protocol to the semantics that your customers require.  In this case, I needed to map between the "how much disk space is free on the drive" SMB request and the NT filesystem query that retrieves the amount of disk space on the drive.

It turns out that the legacy servers returned this information in terms of bytes per sector, sectors per cluster and clusters per disk (because that was how MS-DOS maintained that information).

NT maintained the information in a different form, and the conversion between the two forms required that I divide the interim results by bytes per sector.

Well, if the bytes per sector was a power of two (which it almost always was), I could do the division trivially (by shifting the interim result right by the base2 logarithm of the sector size). 

Unfortunately, if the bytes per sector was NOT a power of two, I was stuck with a division.  Since the interim value was a 64bit number, that meant that I had to do a 64bit by 32bit division.  At the time I was writing the code, the compilers didn't have support for 64bit arithmetic, for 64bit math, we had a number of internal "RTL" functions that would do the math.  We had functions for addition, subtraction and multiplication, but nobody had ever written a version that did division.

I'd never done numeric processing before, but I needed the function, so...  I went to Dave Cutler (since he knows everything) and asked him how to do the math, he told me and I trotted off to implement his algorithm.

Since the code was only used in a special case that was likely never to occur outside of the test labs, I didn't particularly spend a lot of time optimizing the function - I faithfully coded the algorithm in C - two versions, one which did a 64x64 division, another that did a 64x32 division.  I could have written an assembly language version of the 64x32 version (since the processor supported 64x32 division natively), but we were REALLY concerned about portability and since it wasn't going to be called that often, I went with the more portable solution (and I'd been told that if I ever wrote stuff in assembly language, I'd have my thumbs slowly roasted over a red hot hibachi).

So I wrote the API, wrote enough unit tests to confirm to myself that it worked, checked it in as a new RTL API and went away satisfied that I'd done my first piece of numerical processing.  I was SO happy 🙂

At this point, anyone reading this post who was on the NT team at the time is starting to see where this is going.

 

About three years later (long after I'd left NT and moved onto Exchange), I heard a story about something that happened some point after I'd left the team.  You see, Dave spent a fair amount of time during the NT 3.5 to NT 4 period doing analysis of the bottlenecks in the system and looking for ways of making the system faster.

At some point, his investigations brought him to (you guessed it) my little division functions.  It turns out that one of the core components in the system (I think it was GDI, but I'm not sure) decided that they needed to divide a 64bit numbers by a 32bit number. 

Because the compiler didn't have native support for 64bit arithmetic, they went looking through the same RTL functions I did, and they discovered my 64x32 division routine.  So they started using the routine.

In their innermost loop.

All of a sudden, the silly routine I'd written to solve a problem that would never be seen in real life all of a sudden became a part of the inner loop of a performance critical function inside the OS.

 

When Cutler discovered this bottleneck, he looked at the code in my little routine and he went through the roof.  I understand he went through the roof and started cursing my name repeatedly before he rewrote it in assembly language.

 

So the moral of the story is: When you're writing code that's going into a system library, make darned sure that it's written to be as performant as humanly possible, because you never know if someone's going to find your one-off piece of code and put it in their innermost loop.

Comments (31)

  1. Anonymous says:

    You should have added a comment to the code:

    /* Don’t use this function */

    🙂

  2. Anonymous says:

    Larry,

    I know this is really off topic, but I use to work with VAXs and I went to the PDC where NT was launched. I saw Dave and a few of the other DEC/Microsoft guys. (I think the guy who worked on device drivers was called Mark L.)

    How many of the original 13-14 Ex-DEC people that Dave brought to Microsoft are still there?

  3. Anonymous says:

    Now on topic…

    One of my key phrases I use to judge ideas are "Wouldn’t it be cool if…" NO IT WOULDN’T!!!

    Your story reminds me of another phrase I should add to my list. "But nobody would ever use it for/like that."

    Right…

    Many of the problems I run into are those new and creative ways people find to use existing software.

  4. Anonymous says:

    That’s a great story.

    Dave Cutler has quite a reputation. Did people on the NT team – even MS veterans such as yourself – in any way feel intimidated about going to Dave to ask for help?

  5. Anonymous says:

    I suppose Dave didn’t remember telling you to NOT right it in assembly, when he was cursing your name.

  6. Anonymous says:

    If you really need a division and don’t test that it’s fast enough, then _you_ as a coder are broken. I mean, sure that not everyone has to know about pure-speed-optimizing, but just knowing that dividing can be costly would be fine 😉

  7. Anonymous says:

    Oops…

    It seems that I’ve written a lot of function in that fashion – Written something to solve one problem, thinking it’s not used frequently never be too serious about it and leave it in the library as is.

    The fortunate thing is because I’m writting in C#, I can always add XML comment to say this is experimental…

  8. michkap says:

    Glad you finally got this one in. I guess I have to wait a few years for some of mine….

  9. Anonymous says:

    Of all the artificial pauses I’ve seen, this was definitely top-ten material…

    "In their innermost loop.".

    I think this story illuminates two very important things.

    1. No matter what you think, your code will often outlive both its purpose and even you. Many, perhaps even most of us, daily use code written by designers and developers now dead. Larry expected, assumed, this code to only be used in test labs. As most should be aware by now, assumption is the mother of all fsckups. 🙂

    2. This also display a lack of proficiency of the GDI developer(s?) (at that time). Modifying or adding new functionality to performance-critical areas requires a certain amount of experience, insight, vigilance and willingness to research. Adding such a division (a division at all is bad enough) in a performance-critical piece of code without looking up at least the most common implementation (x86) seems to me like grounds for "We need to talk".

    I mean, for the user-land audio stuff Larry & co (finally) are making for Vista (and LH?), you don’t do one (or heaven forbids, more) division/sample/audio-stream. Do you? 🙂

    There is a third thing it says, implicit but still:

    Division is slow, really slow. 🙂

  10. Mike, we made the entire audio pipeline floating point. ANY manipulation of the audio samples is slow 🙂

    On the other hand, it turns out that by moving to floating point, even though individual operations may be somewhat slower, you need to do fewer of them, and you can do them more accurately (for instance, adjusting the volume on an audio stream is done by multiplying each sample by a floating point amplitude scalar – to do the same operation with integral math requires at least a division ).

    And there were many things that were done wrong by the people who wrote NT 3.1. IMHO, the biggest one was that one significant team chose to write all their code in C++. Unfortunately, the compiler didn’t support C++ natively, so instead they use the cfront C++-to-C transmogrifier and fed the output of cfront to the C compiler. Needless to say, the quality of the resulting compiled code wasn’t the highest.

  11. Anonymous says:

    Larry, I appreciate the use of FPU. Especially if properly used, and especially when the developer knows to move the constants "up" when the compiler is dumb, such as:

    Don’t (two divisions):

    x = x1 / z;

    y = y1 / z:

    Do (one division, two much faster multiplications):

    z1 = 1.0 / z;

    x = x1 * z1;

    y = y1 * z1;

    While I’m sure you already knew this, there might be developers not aware of this obvious optimization.

    There’s also the non-obvious thing that nowadays even Intel have learned to run the FPU in parallel with the ALU (previously AMD kicked Intel’s virtual ass in this area, from the K6 and onwards IIRC). Combine this with previous Microsoft compilers and you had a recipie for a n FPU performance disaster. 🙂

    Why you made the whole audio pipeline FP I can understand. However, even that I as an audio dude loves it, I still question the speed of it for (todays) common hardware. I wonder if an interface accepting float, ushort and short might not have been more sane (considering 16-bit shorts have so far ruled e.g. CD’s since 1985, and they haven’t been seriously challenged).

    Not only that FPU’s are still slow (much faster than what we grew up to, but still), but that the scalings we do are still basically 0-2^16 (for samples) times 0-2^8 (volume). Even with a 16-bit volume (not likely), we could easily put this into a plain ALU multiplication (after you-know-what). Heck, for many areas we might even get away with lookup tables.

    Add Screaming Cindy of any version, or even MMX (anno 1995?!), and I’d expect this to become 3-6 times faster.

    Assuming, of course, Microsoft isn’t aiming to become a serious player in audio… where going float’s flat out could be one option. But then, with your OS latencies…

    Looking forward to your elaboration.

  12. Anonymous says:

    Mike, Why go as low as MMX?

    SSE, SSE2 and SSE3 are especially there to replace the FPU functionality which should for a long time be considered deprecated nowdays.

    Even the Microsoft 2003+ compiler generates all the floating point code in SSE when compiled with the correct flags.

    When taking this into account it would be a shame NOT to write the complete pipeline in complete floating point in the year 2005.

    Floating point processing and variables makes everything much more sane; especially for audio I believe fix point should be put to rest.

    I would hate to think that Larry and his crew aren’t implementing specific optimized code for SSE and its variants.. but I’m sure they do, right Larry? 🙂

  13. 90% of the engine just pumps bits from place to place, so it doesn’t change anything.

    The 10% that does look at the values is optimized.

  14. Anonymous says:

    SSE is not there to replace scalar FPU functions but only vectorial ones. SSE2 is there to replace scalar FPU, mainly to compensate for the sucky FPU of the Pentium 4. I really hope noone will consider FPU deprecated in 32bit OS, since a great number of processors are still sold without SSE2 support (Athlon XPs, Socket 462 Semprons).

    In x64 there is no such problem – x87 code is deprecated and SSE2 is the (only) way to go.

    That said, I’m on the opinion you’ve done the correct thing. At the time of writing that RTL function, there was no need to optimize it, and so it was the correct choice not to do so until it’s really needed (that means : when a profiler cries about that). Delaying optimization of functions only when really needed improves both code readability AND deveopment times (every day you spend optimizing a function that doesn’t need to be optimized, is a lost day).

    So Cutler has little to go to the roof. Or better, he should have gone to the roof because the GDI developers were using an RTL function in an inner loop of a performance critical code without testing its performance beforehand.

  15. Anonymous says:

    What greenlight said, but without the smilie. I mean, if you anticipated someone else needing to use it, optimizing it would be a good use of time, but every function can’t be optimized for speed.

  16. Anonymous says:

    Great story. In the past, you mentioned that Gordon Letwin was the OS guru at MS. Did he participate in the NT development ?

  17. Anonymous says:

    > I could have written an assembly language

    > version of the 64×32 version (since the

    > processor supported 64×32 division

    > natively),

    The native support of 64×32 division in i386 is incomplete: if the quotient doesn’t fit in 32 bits, the CPU generates an exception.

    I mean, you’d have to implement some maths even if you were allowed to code in assembly language.

  18. Anonymous says:

    "I went to Dave Cutler (since he knows everything)"

    Larry: I’m getting to the party a little late, but I hope you’ll see this comment.

    Was that written sincerely? Or were your eyes pointed a little bit skyward when you wrote that? 😉

    Also, is "performant" microsoftie-speak for "good performance"? Is that an adjective or a noun?

  19. Anonymous says:

    Tim Smith: Mark Lucovsky came w/ Cutler from DEC. If you’ve never read Showstopper!–a pretty good book despite its faults–there are a few pages in there about him.

    I hope this link works. Start at the bottom of the page and read the next few pages: http://www.amazon.com/gp/reader/0029356717/ref=sib_vae_pg_95/002-8335644-9960825?%5Fencoding=UTF8&keywords=lucovsky&p=S030&twc=23&checkSum=0C5LiET6qE83Is95N7Th3BDa6O%2Fifzyb48gcLnlRLEE%3D#reader-page

  20. Anonymous says:

    Vy: Letwin was the OS/2 architect at Microsoft and didn’t get along with Cutler very well since their two products (initially) competed with each other. I’m sure Larry could answer better than me, but I don’t think he was involved in the NT project.

  21. Anonymous says:

    Larry: It probably doesn’t hurt to mention that the "C++ team" was Chuck Whitmer’s graphics group. I didn’t know that about the cfront transmogrifier, though. Very interesting!

    Weren’t there any native x86 C++ compilers you could’ve used in the late 80’s? Surely Intel had one?…

  22. Anonymous says:

    Mike: Nice tip, thanks! You can count me as someone who didn’t know that. Then again, I don’t write very many tight inner loops where speed is an issue.

    A quick question: since (my) compiler naturally promotes real numbers to doubles, is there a difference between

    float invRange = (float)1.0 / m_range;

    and

    float invRange = 1 / (float)m_range;

    (where m_range is an int) Is one better than the other?

    (I think I’d better stop now. 4 or 5 posts in the same topic is more than I deserve.)

  23. Anonymous says:

    Purplet, I disagree.

    Every FPU function can be reproduced with SSE [1] and it would be much more high performance and scaleable, I’m talking from programming experience.

    With FPU you are being limited to the weird pairing functionality of FXCH and likes, and don’t forget it is sitting on the MMX registers pipeline so no MMX unless you EMMS.

    SSE is NOT just for vectorize operations the SS (MOVSS, ADDSS etc) instruction set is a fairly comprehensive one which is only extended in the SSE2 and next generations, this instruction set is enough to implement all that the FPU does.

    Ofcourse, I did not mean that FPU should be considered deprecated for OS development, an OS should ofcourse support it, but as an 2005 application writer (dealing with medical 3d simulations) I hardly see the need to optimize for the FPU instead of just writing it in SSE.

    Daniel.

  24. Dan, it’s not my place to lay blame or point fingers (except at my own foibles), I’ve made it a point not to name names when a team in the past has made mistakes.

    I try very hard not criticize other people for the work they’ve done, especially if I wasn’t on the relevant team.

    Similarly, I’n not about to discuss the politics between Gordon and Dave. I have too much respect for both of them to do that.

  25. Anonymous says:

    Tuesday, October 18, 2005 12:36 PM by Dan McCarty

    > A quick question: since (my) compiler

    > naturally promotes real numbers to doubles,

    > is there a difference between

    > float invRange = (float)1.0 / m_range;

    > and

    > float invRange = 1 / (float)m_range;

    > (where m_range is an int) Is one better than

    > the other?

    In both cases there will be a floating-point division. In classic C and in your compiler, the division will be done on doubles. In more recent C the division can be done on floats if the compiler is coded to do it that way, but just like yours it can still do doubles.

    A compiler with zero optimization (or pessimization for bug testing etc.) might do conversions from integers to floating-point at execution time, in which case the second version of your code has to do two conversions. A compiler with even the slightest minimal amount of optimization will generate an appropriate floating-point 1.0 for your constant 1 in either case, so only m_range will be converted at execution time.

  26. Anonymous says:

    Dan: NT wasn’t being written for x86 in the late 80’s, so it didn’t matter if there was a C++ compiler from Intel. I think MIPS (Jazz machines?) was the big NT dev platform then.

  27. Gabe, not true.

    MIPS was certainly the first platform on which NT booted, but x86 was VERY close behind. I was self hosting NT in the very late 80’s running on x86 machines.

  28. Anonymous says:

    Just nitpickin’, but was MIPS really the first platform on which NT booted? The following seems to suggest there was real i860 h/w, even that it didn’t live up to expecteations, that booted NT before MIPS.

    http://www.winsupersite.com/reviews/winserver2k3_gold1.asp

  29. Mike, I’m pretty sure we had the R4000 version booting before the i860 version (but I’m not totally sure). We had an i860 emulator that ran (very slowly) under OS/2, most of the initial development was done on that, but we had realized that the i860 was a dead end before we got silicon.

  30. Anonymous says:

    >> Purplet, I disagree.

    Every FPU function can be reproduced with SSE [1] and it would be much more high performance and scaleable, I’m talking from programming experience.

    With FPU you are being limited to the weird pairing functionality of FXCH and likes, and don’t forget it is sitting on the MMX registers pipeline so no MMX unless you EMMS.

    SSE is NOT just for vectorize operations the SS (MOVSS, ADDSS etc) instruction set is a fairly comprehensive one which is only extended in the SSE2 and next generations, this instruction set is enough to implement all that the FPU does.

    ——–

    IIRC, SSE1 supports only single precision FP operations.

    This alone is enough to rule out its use in most applications except those where precision errors are not important (like 3d graphics visualization).

  31. Anonymous says:

    "…When you’re writing code that’s going into a system library, make darned sure that it’s written to be as performant as humanly possible.."

    Or use OO languange (C++, for example) and make your function(s) "private" thus explicitely telling others you did not intend anybody else use your functions. It may be faster (a lot faster, perhaps) than making evey function in question as perfomant as possible.. 🙂

Skip to main content