How did real-mode Windows patch up return addresses to discarded code segments?


Last week, I described how real-mode Windows fixed up jumps to functions that got discarded. But what about return addresses to functions that got discarded?

The naïve solution would be to allocate a special "return address recovery" function for each return address you found, but that idea comes with its own problems: You are patching addresses on the stack because you are trying to free up memory. It would be a bad idea to try to allocate memory while you're trying to free memory! What if in order to satisfy the allocation, you had to discard still more memory? You would start moving and patching stacks before they were fully patched from the previous round of memory motion. The stack patcher would get re-entered and see an inconsistent stack because the previous stack patcher didn't get a chance to finish. The result would be rampant memory corruption.

Therefore, you have to preallocate your "return address recovery" functions. But you don't know how many return addresses you're going to need until you walk the stack (at which point it's too late), and you definitely don't want to preallocate the worst-case scenario since each stack can be up to 64KB in size, and can hold up to 16384 return addresses. You'd end up allocating nearly all your available memory just for return address recovery stubs!

How did real-mode Windows solve this problem?

It magically found a way to put ten pounds of flour in a five-pound bag.

For each segment, there was a special "return thunk" that was shared by all return addresses which returned back into that segment. Since there is only one per segment, you can preallocate it as part of the segment bookkeeping data. To patch the return address, the original return address was overwritten by this shared return thunk. Okay, so you have 32 bits of information you need to save (the original return address, which consists of 16 bits for the segment and 16 bits for the offset), and you have a return thunk that captures the segment information. But you still have 16 bits left over. Where do you put the offset?

We saw some time ago that the BP register associated with far calls was incremented before being pushed on the stack. This was necessary so that the stack patcher knew whether to decode the frame as a near frame or a far frame. But that wasn't the only rule associated with far stack frames. On entry to a far function,

  • The first thing you do is increment the BP register and push it onto the stack.
  • The second thing you do is push the DS register. (DS is the data segment register, which holds a segment containing data the caller function wanted to be able to access.)

Every far call therefore looks like this on the stack:

xxxx+6 IP of return address
xxxx+4 CS of return address
xxxx+2 pointer to next stack frame (bottom bit set)
xxxx+0 DS of caller

The stack patcher overwrote the saved CS:IP with the address of the return thunk. The return thunk describes the segment that got discarded, so the CS is implied by your choice of return thunk, but the stack patcher still needed to save the IP somewhere. So it saved it where the DS used to be.

Wait, you've just traded one problem for another. Sure, you found a place to put the IP, but now you have to find a place to put the DS.

Here comes the magic.

Recall that on the 8086, the combination segment:offset corresponds to the physical address segment×16 + offset. For example, the address 1234:0005 refers to physical byte 0x1234 * 16 + 0x0005 = 0x12345.

Since the segment and offset were both 16-bit values, there were multiple ways to refer to the same physical address. For example, 1000:2345 would also resolve to physical address 0x12345. But there are other ways to refer to the same byte, like the not entirely obvious 0FFF:2355. In fact, there's a whole range of values you can use, starting from 0235:FFF5 and ending at 1234:0005. In general, there are 4096 different ways of referring to the same address.

There's a bit of a problem with very low addresses, though. To access byte 0x00400, for example, you could use 0000:0400 through 0040:0000, but that's as far as you could go, so these very low addresses do not have the full complement of aliases.

Aha, but they do if you take advantage of wraparound. Since the 8086 had only 20 address lines, any overflow in the calculations was simply taken mod 0x100000. Therefore, you could also use F041:FFF0 to refer to the address, because 0xF041 × 16 + 0xFFF0 = 0x100400 ≡ 0x00400. Wraparound allowed the full range of 4096 aliases since you could use F041:FFF0 to FFFF:0410, and then 0000:0400 to 0040:0000.

Related reading: The story of the mysterious WINA20.386 file.

Okay, back to stack patching.

Once you consider aliasing, you realize that the 32 bits of flour actually had a lot of air in it. By rewriting the address of the return thunk into the form XXXX:000Y, you can see the 12 bits of air, and to stash the 12-bit value N into that air pocket, you would set the segment to XXXX−N and the offset to N×16+Y.

Recall that we were looking for a place to put that saved DS value, which is a 16-bit value, and we have found 12 bits of air in which to save it. We need to find 4 more bits of air somewhere.

The next trick is realizing that DS is not an arbitrary 16-bit value. It's a 16-bit segment value that was obtained from the Windows memory manager. Therefore, if the Windows memory manager imposed a generous artificial limit of 4096 segments, it could convert the DS segment value into an integer segment index.

That index got saved in the upper 12 bits of the offset.

Okay, let's see what happens when the code tries to unwind to the patched return address.

The function whose return address got patched goes through the usual function epilogue. It pops what it thinks is the original DS off the stack, even though that DS has been secretly replaced by the original return address's IP. The epilogue then pops the old BP, decrements it, and returns to the return thunk. The return thunk now knows where the real return address is (it knows which segment it is responsible for, and it can figure out the IP from the incoming DS register). It can also study its own IP to extract the segment index and from that recover the original DS value. Now that it knows what the original code was trying to do, it can reload the segment, restore the registers to their proper values, and jump to the original return address inside the newly-loaded segment.

I continue to be amazed at how real-mode Windows managed to get so much done with so little.

Exercise: The arbitrary limit of 4096 segments was quite generous, seeing as the maximum number of selectors in protected mode was defined by the processor to be 8191. What small change could you make to expand the segment limit in real mode to match that of protected mode?

Comments (19)
  1. Adam Rosenfield says:

    Could you require all functions to be 2-byte aligned so that you could repurpose the bottom bit of the IP for another bit of selector?

    [Wouldn't be much help here because we are patching return addresses, not function entry points. You would have to require all return addresses to be 2-byte aligned, which given the 8086 instruction set, would be expensive to enforce. ("Why is there a NOP immediately before half of my function calls?") -Raymond]
  2. Mijzelf says:

    The stack is 16 bits wide, so the pointer to the next stackframe has one bit left. (Actually more, I suppose. With a 64kB stack I suppose the frames were never 32kB from each other, so the most significant bit could be stripped either.

    [The bottom bit of the pointer to the next stack frame was already being used: "We saw some time ago that the BP register associated with far calls was incremented before being pushed on the stack." -Raymond]
  3. Joshua says:
    • The second thing you do is push the DS register. (DS is the data segment register, which holds a segment containing data the caller function wanted to be able to access.)

    Wow I dodged a bullet by not supporting Real Mode Windows. I had no idea you had to do that when programming targeting 16 bit Windows. Win 3.1 didn't seem to care.

  4. Mijzelf says:

    [quote][The bottom bit of the pointer to the next stack frame was already being used: "We saw some time ago that the BP register associated with far calls was incremented before being pushed on the stack." -Raymond][/quote]

    True, but you already know it was a far call, right? For a near call the stack wouldn't have to be patched, as the code segment wouldn't be swapped away with running code in it.

    [The way we knew it was a far call was by checking the bottom bit! -Raymond]
  5. Henning Makholm says:

    You can have two different entry points to the return thunk for almost the price of one, if you prefix the actual code with a single-byte INC BP instruction. The called function's epilogue will already have reset the low bit of BP to 0, so now it can be used as a local temporary in the return-fixup code, remembering whether the INC BP was executed or not while we save the user code's register values. Voilà! we then have freedom to choose 13 bits of the return offset.

    [That is brilliant. -Raymond]
  6. SMW says:

    I can't tell whether Raymond's reply to Henning is sincere or sarcastic.

    [Sincere. He found another bit of air that could be squeezed out. -Raymond]
  7. zwieback says:

    Real-mode Windows was the real innovation, once we switched over to protected mode we just used what Intel gave us. Not that I wish those day back…

  8. Jeff says:

    I can see a possible problem with this system: Suppose the DS index is 0xFFF and the return thunk's physical address is 0x1234F (or anything else that ends with a high nybble). The thunk will be invoked by returning to 0235:FFFF, but this will likely crash because code execution wraps around to 0235:0000.

    This could be fixed either by limiting the number of segments to 4095 instead of 4096, or by aligning return thunks so they never need to cross from one 16-byte paragraph to the next.

    [I glossed over many tiny details. This is one of them. But +1 on you for noticing. -Raymond]
  9. Rick C says:

    @Joshua, when Raymond says "you" he didn't mean you per se.  The compiler took care of a lot of this for you; IIRC the main thing you had to do was make sure no functions or individual data items (like an array) were bigger than 64KB.  In particular, the compiler inserted all that function prolog/epilog stuff for you.

  10. Adam Rosenfield says:

    [Wouldn't be much help here because we are patching return addresses, not function entry points. You would have to require all return addresses to be 2-byte aligned, which given the 8086 instruction set, would be expensive to enforce. ("Why is there a NOP immediately before half of my function calls?") -Raymond]

    I was obviously not thinking clearly when I wrote that.  If we were dealing with a fixed-width instruction set (like many RISC ISAs), we could be sure of instruction alignment for all return addresses, but yeah, imposing that on x86 would be a bad idea.

  11. Joshua says:

    @Rick C: I dabbled in 16 bit assembly language.

  12. Rick C says:

    @Joshua, so did I, but I targetted DOS not Windows, to avoid that stuff.

  13. I don't know how "small" this change is, but…

    As I understand it, *after* being patched, the pushed values look like this:

    DS = original return IP

    CS:IP = Address of return thunk, directly encoding 12 bits of the original DS, and indirectly encoding the original CS (by way of pointing to the return thunk for that CS).

    Therefore, the remaining 4 bits of DS must be derivable from the preserved 12 bits (the easiest way is probably to make them all 0 or 1).

    To push up to 8192 values, I would have two copies of each return thunk (or at least, two entry points for each thunk.) If DS is even, use thunk #1; if odd, use thunk #2:

    mov bx, dword ptr [thunk_entry_point]

    mov ax, ds

    ; not sure if this is better or worse than a test-jz

    and ax,  1

    not ax

    inc ax

    and ax, THUNK2_OFFSET

    add bx, ax

    [Having two return thunks is the solution I had in mind. -Raymond]
  14. Joshua says:

    [self] Wow I dodged a bullet by not supporting Real Mode Windows.

    I went back and checked. None of my assembly functions ever yielded so I was safe anyway.

  15. Neil says:

    @Joshua: Including not calling far functions in discardable segments?

  16. Joshua says:

    @Neil: As it happened, yes. (I'm assuming kernel is not discardable.)

  17. Jeff says:

    @Myria: Windows doesn't need to fix it – this is a call, not a return. The function has its own dedicated INT 3F thunk (as described in blogs.msdn.com/…/10322767.aspx) which "meow" will point to.

  18. Myria says:

    How were far function pointers stored as local variables on the stack handled?

    void (__far *meow)() = SomeFunction;

    SomeOtherFunction();

    (*meow)();

    How could the Windows kernel fix the copy of "meow" saved to the stack during the SomeOtherFunction call?  Specifically, how did the kernel know about it in the stack trace?

  19. Gabe says:

    Dare I ask what happens when somebody does a longjmp into a discardable segment?

    [Sure. -Raymond]

Comments are closed.