FPO

I was chatting with one of the perf guys last week and he mentioned something that surprised me greatly.  Apparently he's having perf issues that appear to be associated with a 3rd party driver.  Unfortunately, he's having problems figuring out what's going wrong because the vendor wrote the driver used FPO (and hasn't provided symbols), so the perf guy can't track the root cause of the problem.

The reason I was surprised was that I didn't realize that ANYONE was using FPO any more.

What's FPO?

To know the answer, you have to go way back into prehistory.

Intel's 8088 processor had an extremely limited set of registers (I'm ignoring the segment registers), they were:

AX BX CX DX IP
SI DI BP SP FLAGS

With such a limited set of registers, the registers were all assigned specific purposes.  AX, BX, CX, and DX were the "General Purpose" registers, SI and DI were "Index" registers, SP was the "Stack Pointer", BP was the "Frame Pointer", IP was the "Instruction Pointer", and FLAGS was a read-only register that contained several bits that were indicated information about the processors' current state (whether the result of the previous arithmetic or logical instruction was 0, for instance).

The BX, SI, DI and BP registers were special because they could be used as "Index" registers.  Index registers are critically important to a compiler, because they are used to access memory through a pointer.  In other words, if you have a structure that's located at offset 0x1234 in memory, you can set an index register to the value 0x1234 and access values relative to that location.  For example:

MOV    BX, [Structure]
MOV    AX, [BX]+4

Will set the BX register to the value of the memory pointed to by [Structure] and set the value of AX to the WORD located at the 4th byte relative to the start of that structure.

One thing to note is that the SP register wasn't an index register.  That meant that to access variables on the stack, you needed to use a different register, that's where the BP register came from - the BP register was dedicated to accessing values on the stack.

When the 386 came out, they stretched the various registers to 32bits, and they fixed the restrictions that only BX, SI, DI and BP could be used as index registers.

EAX EBX ECX EDX EIP
ESI EDI EBP ESP FLAGS

This was a good thing, all of a sudden, instead of being constrained to 3 index registers, the compiler could use 6 of them.

Since index registers are used for structure access, to a compiler they're like gold - more of them is a good thing, and it's worth almost any amount of effort to gain more of them.

Some extraordinarily clever person realized that since ESP was now an index register the EBP register no longer had to be dedicated for accessing variables on the stack.  In other words, instead of:

MyFunction:
    PUSH    EBP
    MOV     EBP, ESP
    SUB      ESP, <LocalVariableStorage>
    MOV     EAX, [EBP+8]
      :
      :
    MOV     ESP, EBP
    POP      EBP
    RETD

to access the 1st parameter on the stack (EBP+0 is the old value of EBP, EBP+4 is the return address), you can instead do:

MyFunction:
    SUB      SP, <LocalVariableStorage>
    MOV     EAX, [ESP+4+<LocalVariableStorage>]
      :
      :
    ADD     SP, <LocalVariableStorage>
    RETD

This works GREAT - all of a sudden, EBP can be repurposed and used as another general purpose register!  The compiler folks called this optimization "Frame Pointer Omission", and it went by the acronym FPO.

But there's one small problem with FPO.

If you look at the pre-FPO example for MyFunction, you'd notice that the first instruction in the routine was PUSH EBP followed by a MOV EBP, ESP.  That had an interesting and extremely useful side effect.  It essentially created a singly linked list that linked the frame pointer for each of the callers to a function.  From the EBP for a routine, you could recover the entire call stack for a function.  This was unbelievably useful for debuggers - it meant that call stacks were quite reliable, even if you didn't have symbols for all the modules being debugged.  Unfortunately, when FPO was enabled, that list of stack frames was lost - the information simply wasn't being tracked.

To solve the is problem, the compiler guys put the information that was lost when FPO was enabled into the PDB file for the binary.  Thus, when you had symbols for the modules, you could recover all the stack information.

FPO was enabled for all Windows binaries in NT 3.51, but was turned off for Windows binaries in Vista because it was no longer necessary - machines got sufficiently faster since 1995 that the performance improvements that were achieved by FPO weren't sufficient to counter the pain in debugging and analysis that FPO caused.

 

Edit: Clarified what I meant by "FPO was enabled in NT 3.51" and "was turned off in Vista", thanks Steve for pointing this out.