What assembly language DOES your code generate?

Pat Niemeyer had a fascinating comment in my article about programmers knowing roughly what assembly language their code generates:

Your example serves to demonstrate that developers should, under normal circumstances, *not* care about low level code or performance issues... They should just do the "right thing" where that is usually the simplest and most straightforward thing. If developers had followed this rule in Java all along then they wouldn't have been bitten by all these stupid optimization rules that ended up being anti-optimizations when the VMs got smarter.

Trust the compiler people... trust the VM people and then when your code is working trust the profilers to tell you what is actually going on. People's intuition about performance issues is wrong more often than it's right. You don't know anything unless you profile it - zip, nada, nothing. You may think you're a hotshot Java programmer (we all do) but you're wrong most of what you think about optimization. That's the real rule.

At some level, Pat’s absolutely right: the VM and compiler people know far better than you do how to optimize your code. But please note my use of the word “roughly”. You don’t need to know the exact instructions that are generated. In fact, as I mentioned in the article, above Pat’s comment, in the face of modern optimizing compilers it’s a bad idea to write your code in assembly unless you REALLY know what you’re doing. (I’d be willing to trust someone like Mike Abrash, or maybe a couple of others, to write assembly language code for modern processors, but I certainly wouldn’t attempt it myself, even though I spent the first 10 years of my career at Microsoft programming exclusively in x86 assembly language.)

When I started writing x86 assembly language, the rule of thumb for performance was pretty simple: as long as you don’t use multiply or divide, the smaller your code, the faster it is. That was all you needed to know. With the Pentium, that stopped being true for Intel processors (it had never been that simple for RISC processors). The infamous Appendix H described in full how to write optimal code for the Pentium series processors.

All of a sudden, it became quite difficult to actually write efficient code. And with every release of Intel’s processors, it’s become harder and harder to write the fastest code possible in assembly; typically compilers can do a much better job of it than mere mortals.

But I still stand by my statement. I disagree with Pat, especially if you’re writing systems code. If you’re introducing several thousand extra instructions inside your POP3 server because you chose to do error processing by throwing exceptions (no, I didn’t do that in the Exchange POP3 server), or if your I/O subsystem is twice as large as it should be because you used macros for critical sections, you may in fact be doing the most “straightforward thing”. But unless you know what the performance implications of doing the “straightforward thing” are, it’s highly likely that the “straightforward thing” is what stops your POP3 server from being able to support 50,000 users on a single Exchange server. Or it’ll be the difference between serving 100 million hits a day and 50 million hits a day.
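
To make the exceptions point a bit more concrete, here is a minimal C++ sketch of the same check written both ways. It is not the Exchange code; every name in it (ParseStatus, kMaxCommandLength, check_command and so on) is made up for illustration, and the 512-byte limit is an assumption. The two versions behave identically from the caller's point of view, which is exactly why you can't tell them apart without looking at what the compiler emits.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical status codes for this sketch; not taken from any real server.
enum class ParseStatus { Ok, Empty, TooLong };

// Assumed limit, chosen only for illustration.
constexpr std::size_t kMaxCommandLength = 512;

// Style 1: report failure through a return code.
ParseStatus check_command(const std::string& line) noexcept {
    if (line.empty()) return ParseStatus::Empty;
    if (line.size() > kMaxCommandLength) return ParseStatus::TooLong;
    return ParseStatus::Ok;
}

// Style 2: report failure by throwing.
void check_command_or_throw(const std::string& line) {
    if (line.empty()) throw std::invalid_argument("empty command");
    if (line.size() > kMaxCommandLength) throw std::length_error("command too long");
}

// Caller using return codes: a plain comparison.
bool accept_with_codes(const std::string& line) noexcept {
    return check_command(line) == ParseStatus::Ok;
}

// Caller using exceptions: same answer, but the compiler also has to
// emit the throw sites and the handler for the catch block.
bool accept_with_exceptions(const std::string& line) {
    try {
        check_command_or_throw(line);
        return true;
    } catch (const std::exception&) {
        return false;
    }
}
```

If you want to know roughly what each style costs, ask the compiler for its assembly output (for example with the -S option to g++ or clang++, at whatever optimization level you ship with) and compare the two callers. The exact numbers will depend on your compiler, its options and the target, so treat this as a way to look, not as a measurement.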