If I create multiple selectors each of size 4GB, do I get a combined address space larger than 4GB?


Every so often, someone comes up with the clever idea of extending the address space of the x86 processor beyond 4GB by creating multiple selectors, each of size 4GB. For example, if you created a 4GB selector for code, another 4GB selector for stack, and another 4GB selector for data, and assigned them distinct memory ranges, then you could load up each selector into the corresponding register (CS, SS, DS) and be able to access 12GB of memory.

Profit!

Well, except that it doesn't actually work.

Segment descriptors on the x86 contain the following pieces of information:

  • Various control bits not relevant to this discussion.
  • A segment base address (32 bits).
  • A segment limit (32 bits, encoded as a 20-bit value and an optional scale; details not important).

In practice, what happens is that the base address is set to zero and the limit is set to 0xFFFFFFFF, which gives each segment a range of 4GB. Segments create views into the linear address space. When you access memory by doing, say, mov al, ds:[ebx], what happens is the following:

  • The selector in the ds register is consulted to obtain its base address and limit. If ds references an invalid selector, then a fault occurs.
  • The value in ebx is checked against the segment limit of the selector held in ds. If it is greater than the limit, then a fault occurs.
  • The value in ebx is added to the selector's base address, producing a linear address.
  • That linear address is used to access the underlying memory.

The mechanism by which linear addresses map to physical addresses is not relevant to the discussion. (This is where page tables come in.) I'm also ignoring expand-down selectors and other details not related to addressing.
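
To make those steps concrete, here is a minimal sketch in C of the limit check and base addition. The struct and function below are illustrative models only, not the hardware's actual descriptor format:

```c
#include <stdint.h>
#include <stdio.h>

/* Just the two descriptor fields that matter for addressing. */
struct selector {
    uint32_t base;   /* where the segment's window starts in linear space */
    uint32_t limit;  /* highest valid offset within the segment */
};

/* Model of "mov al, ds:[ebx]": check ebx against the limit, then add the
   base to form the linear address.  Returns 0 to represent a fault. */
static int translate(struct selector ds, uint32_t ebx, uint32_t *linear)
{
    if (ebx > ds.limit) return 0;   /* offset beyond the segment limit */
    *linear = ds.base + ebx;        /* linear address used for the access */
    return 1;
}

int main(void)
{
    struct selector flat = { 0x00000000, 0xFFFFFFFF };  /* base 0, 4GB limit */
    uint32_t linear;
    if (translate(flat, 0x00401000, &linear))
        printf("flat:00401000 -> linear %08X\n", linear);  /* 00401000 */
    return 0;
}
```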

In other words, selectors don't reference memory directly. They are merely a window into the linear address space. If you create a selector whose base address is inside the [base address, base address + limit] range of another selector, then both selectors are accessing the same underlying memory.

[Diagram: Selector X and Selector Y as two overlapping windows onto the linear address space.]

In the above example, we created Selector X with a base address of 0x50000000 and a limit of 0x1FFFFFFF. This gives selector X a reach of [0x50000000, 0x6FFFFFFF]: An access to X:0 refers to linear address 0x50000000, and an access to X:1FFFFFFF refers to linear address 0x6FFFFFFF. Higher offsets from selector X are invalid.

We also created Selector Y with a base address of 0x60000000 and a limit of 0x7FFFFFFF, giving selector Y a reach of [0x60000000, 0xDFFFFFFF].

Observe that the two selectors overlap. The addresses X:10000000 and Y:00000000 refer to the same underlying linear address space. Write a value to X:10000000 and you can read it back from Y:00000000.
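
The same base-plus-offset arithmetic, applied to these two hypothetical selectors, shows the collision directly (a minimal sketch):

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t x_base = 0x50000000;            /* Selector X (limit 0x1FFFFFFF) */
    uint32_t y_base = 0x60000000;            /* Selector Y (limit 0x7FFFFFFF) */

    uint32_t from_x = x_base + 0x10000000;   /* X:10000000 */
    uint32_t from_y = y_base + 0x00000000;   /* Y:00000000 */

    /* Both print 60000000: the two selectors reach the same linear address. */
    printf("X:10000000 -> %08X\nY:00000000 -> %08X\n", from_x, from_y);
    return 0;
}
```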

Indeed, this behavior on overlap is relied upon constantly. To use the x86 in flat mode, you create a code selector and a data selector, both of which have a base of 0x00000000 and a limit of 0xFFFFFFFF. You put the code selector in the cs register and the data selector in the ss, ds, and es registers. The fact that the ranges perfectly overlap means that reading data from a code address reads the same bytes that the CPU would have executed. Conversely, the fact that they overlap means that you can generate code by writing to the data segment.
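
For reference, here is a sketch of what such a flat GDT conventionally looks like. The two encodings below are the usual textbook values for 32-bit ring-0 flat segments, not values taken from any particular operating system:

```c
#include <stdint.h>
#include <stdio.h>

/* Conventional flat-mode GDT: base 0x00000000, limit 0xFFFFF with 4K
   granularity (i.e. 4GB), 32-bit, ring 0.  The code and data descriptors
   differ only in their type bits, so cs:EIP and ds:offset reach exactly
   the same bytes. */
static const uint64_t flat_gdt[] = {
    0x0000000000000000ull,  /* required null descriptor                       */
    0x00CF9A000000FFFFull,  /* selector 0x08: execute/read code, base 0, 4GB  */
    0x00CF92000000FFFFull,  /* selector 0x10: read/write data,   base 0, 4GB  */
};

int main(void)
{
    for (int i = 0; i < 3; i++)
        printf("GDT[%d] = %016llX\n", i, (unsigned long long)flat_gdt[i]);
    return 0;
}
```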

Okay, you sigh, I can't give each selector its own 4GB of address space. The fact that the base address of the selector is a 32-bit value means that the best I can do is to create a selector whose base is 0xFFFFFFF0 and whose limit is 0xFFFFFFFF; that at least gives me linear addresses as high as 0xFFFFFFF0 + 0xFFFFFFFF, or a smidge under 8GB. Still, 8GB is better than 4GB, right?

Well, you don't even get 8GB.

3.3.5 32-Bit and 16-Bit Address and Operand Sizes

With 32-bit address and operand sizes, the maximum linear address or segment offset is FFFFFFFFH (2³² − 1).

"The maximum linear address is FFFFFFFFH."

This means that segments whose base + limit is greater than 0xFFFFFFFF are illegal. All of your selectors have to fit inside [0x00000000, 0xFFFFFFFF].

Now, maybe you could pull some super sneaky tricks: keep all pages mapped not-present, and when a page fault occurs, determine which selector was the source of the faulting linear address and map in the appropriate page at fault time, then set the trap flag so that the kernel regains control after the instruction has executed and can unmap the page immediately. But faulting at every instruction is going to make things ridiculously slow, and besides, it won't help you if somebody performs a block memory copy between two different "pseudo address spaces" that happen to map to the same linear address. I guess at that point you would change the selector base addresses so that the source and destination no longer land on the same page, but by now you are doing so much work at every instruction that you may as well give up trying to execute code natively and just write a p-code interpreter.

Comments (39)
  1. Falcon says:

    At this point, one might be tempted to mention Physical Address Extensions. Even that won't help you, though - linear addresses are still 32 bits wide, but with paging and PAE enabled, they can map to larger physical addresses (36+ bits).

    1. Antonio Rodríguez says:

      PAE does not create larger address spaces, but allows you to use more than 4 GB of physical RAM in order to run several 4 GB processes without swapping. You can get the same with a 64-bit OS running on a 64-bit processor, so PAE is now obsolete.

  2. Martin Bonner says:

    If Microsoft had gone for a Harvard architecture when they started off writing 32-bit software, I suspect that Intel might well have made the linear address space > 4G. Harvard architecture would have made buffer overflow bugs a much less attractive target too. (I know there is return-oriented-programming, but I wonder if that would have been developed if hacking had *started out* so much harder.)

    1. Antonio Rodríguez says:

      "Harvard architecture" is an attribute of the processor, not the OS. So it should have been implemented by Intel *before* Microsoft could use it. And even in that case, Microsoft OSes maintain a great level of back compatibility, which would be broken by switching to a Harvard architecture. Many great OSes of the 90s (for example, BeOS) didn't succeed because they didn't offer backwards compatibility and didn't have enough native software to be useful.

      1. Evan says:

        > “Harvard architecture” is an attribute of the processor, not the OS.

        This *is* true, but you could also view it as a programming model -- and in that sense, it's potentially under the control of the OS and compiler to an extent. For example, consider a compiler that produced an x86 program that only accesses data through ds and code through cs or something; in effect, you have a Harvard-architecture-style program because it's impossible to make a data pointer point at code or vice versa. (This is probably a bit wrong, but my overall point is that a system could impose more "restrictive" requirements on programs than the architecture allows, and hence provide a Harvard *programming model* on a von Neumann machine.)

        This also provides a way to provide backwards compatibility, if Intel or AMD comes along and restores the functionality of the segment registers -- programs that are compiled to support the Harvard model would get it, while programs that aren't, wouldn't.

        1. smf says:

          The benefit of accessing data through the cs segment is that read-only tables used within a subroutine can be paged in at the same time as the code. Under a split model you would need a separate read-only data page instead, which is slower for no clear advantage.

          The advantage of Harvard is that you can have separate code and data busses attached to completely different memory, so you can access both at the same time. It's only really used in DSPs and microcontrollers. There is no way that Intel would even consider it. It would require new motherboards and all new software.

          1. Bob says:

            And, since most Intel processors have separate I- and D-caches, they get this benefit of the Harvard architecture where it matters the most (where the bandwidth requirement per CPU is highest).

          2. smf says:

            @Bob

            Separate I and D caches are very common; even the Motorola 68030 from 1987 has them. Intel actually caches the decoded opcodes, though. Looking back, it's strange that the RISC revolution just led to faster Intel CISC chips.

    2. Evan says:

      > I wonder if that would have been developed if hacking had *started out* so much harder.

      I suspect it would have. Even if you think that ROP is too complicated and subtle for someone to realistically think of it from first principles, remember that ROP wasn't the first code-reuse attack; remember return-to-libc, at the least. And return-to-libc attacks *are* simple enough that I strongly believe they'd have been developed even if return-to-stack attacks were never possible. (In some ways, return-to-libc attacks are simpler than return-to-stack, though I suspect they were developed later.) And I think ROP is as natural a development from return-to-libc as it is from return-to-stack attacks.

      So I don't think it'd have made an appreciable difference, to be honest, in terms of what non-return-to-stack attacks are known.

    3. ROP is basically applying the techniques of threaded code (in the sense of https://en.wikipedia.org/wiki/Threaded_code, not the parallel/concurrent meaning of the term) to attacks. I suspect that if we had common Harvard machines, we'd have seen more interpreters written in a threaded style, and thus attackers would have had the inspiration they needed to get to ROP earlier.

  3. Martin Bonner says:

    PAE allows you to have a 32-bit operating system running 16 processes, *each* of which has 4G of physical memory. (Or rather more processes with rather less memory each.) It would be very useful, and would make the 32-bit consumer operating systems much more attractive (even today, a desktop/laptop with 64G memory is a bit of a beast).

    1. Darran Rowe says:

      Two things. First, that calculation is based upon all 64GB of physical RAM and all of the process's virtual address space being available for user code. Without the user VA being set to 3GB and the process being marked as large-address-aware, each process would have 2GB of address space, which would mean 32 processes. With 3GB, that would be 21 and 1/3.
      Secondly, the reason why 32-bit consumer operating systems are less attractive isn't to do with PAE; it is to do with the virtual address limitation. Dealing with a lot of data in a limited amount of space is possible but awkward. It is also harder and slower compared to being able to deal with it in a larger amount of space.
      As an addition, the PAE figure you have is also a bit dodgy. Yes, the original PAE gave 2^36, but it didn't actually need to stop there, and it didn't. On 64-bit hardware, PAE uses all of the available physical address lines. This can be verified by looking at the Intel Software Developer's Manual. The reason why we didn't get any more extensions on 32-bit hardware was most likely because the virtual address space was being pushed a bit too far and the move towards 64-bit processors was already being made.

      1. Evan says:

        > Dealing with a lot of data in a limited amount of space is possible but awkward. It is also harder and slower compared to being able to deal with it in a larger amount of space.

        It also reduces dramatically the entropy available to ASLR implementations, and means that if you *do* have even one memory-hungry program, your choice to go with a 32-bit OS was probably bad.

    2. smf says:

      "PAE allows you to have a 32-bit operating system running 16 processes, *each* of which has 4G of physical memory.

      @Martin Bonner That statement is pretty much all wrong. On 32-bit Windows, each application has 2GB of address space; you can increase this to 3GB, but you shouldn't do this with PAE because it puts too much address-space pressure on the kernel. Apps that are not PAE-aware can only have as much physical RAM as will fit in their address space (so 2GB). If, however, an app is PAE-aware, then it can allocate as much RAM as it wants, but it has to take care of mapping the RAM into the 2GB address space itself (see the sketch after this comment). So in a way it's similar to EMS expanded memory. You avoid the overhead of paging from hard disk, but extra instructions have to be written/debugged/executed.

      When Itanium failed and AMD brought 64-bit processors to the market, PAE was dead. There are too many advantages to going 64-bit.
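
      The comment doesn't name it, but on Windows the mechanism for that manual mapping is Address Windowing Extensions (AWE). A minimal sketch, assuming an arbitrary page count and eliding most error handling (the process also needs the "Lock pages in memory" privilege):

      ```c
      #include <windows.h>
      #include <string.h>
      #include <stdio.h>

      int main(void)
      {
          SYSTEM_INFO si;
          GetSystemInfo(&si);

          /* Grab 64 physical pages; the count is arbitrary for illustration. */
          ULONG_PTR pageCount = 64;
          ULONG_PTR pfns[64];
          if (!AllocateUserPhysicalPages(GetCurrentProcess(), &pageCount, pfns)) {
              printf("AllocateUserPhysicalPages failed: %lu\n", GetLastError());
              return 1;
          }

          /* Reserve a window in the (small) virtual address space. */
          void *window = VirtualAlloc(NULL, pageCount * si.dwPageSize,
                                      MEM_RESERVE | MEM_PHYSICAL, PAGE_READWRITE);
          if (!window) return 1;

          /* Map the physical pages into the window, touch them, then unmap so
             the same window can be pointed at a different set of pages later. */
          MapUserPhysicalPages(window, pageCount, pfns);
          memset(window, 0, pageCount * si.dwPageSize);
          MapUserPhysicalPages(window, pageCount, NULL);   /* NULL array = unmap */
          return 0;
      }
      ```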

  4. Stefan says:

    You actually can extend the address space, sort-of. Make each descriptor describe a gigabyte, but mark it not-present. An access traps, giving the operating system the chance to swap in that segment (and discard another). 100 descriptors -> 100 gigabytes virtual address space. This is pretty much what Windows-286 did with the swap file, but it had to deal with 64k segments only, not gigabyte segments.

    I have actually seen the 286 processor being described as having a virtual address space of 16 terabytes (16k tasks with a local descriptor table x 16k descriptors in each table x 64k memory per descriptor), although it had only 24 address bits (16 megabytes address space).

  5. Archibald says:

    > This means that segments whose base + limit is greater than 0xFFFFFFFF are illegal.

    Not quite. It's perfectly valid to have a segment whose base+limit is more than 4GB; it's just that the linear addresses will wrap when they overflow. Linux actually uses that to implement thread-local storage: the TLS segment has a limit of 4GB and a base that depends on the thread, with the program's own thread locals above the base and libc's immediately underneath it. That means they can both access thread-local variables without an extra indirection and without each needing to know the size of the other's thread-local area.
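
    A minimal sketch of the wrapping arithmetic being described, assuming base + offset is truncated to 32 bits (the base value here is made up):

    ```c
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical per-thread segment base; the real base is chosen by
           the OS for each thread. */
        uint32_t base = 0x40000000;

        /* With a 4GB limit, base + offset wraps modulo 2^32, so a huge
           offset reaches just below the base. */
        uint32_t own_tls  = base + 0x00000008u;   /* program's TLS: above the base */
        uint32_t libc_tls = base + 0xFFFFFFFCu;   /* wraps to base - 4             */

        printf("%08X %08X\n", own_tls, libc_tls); /* prints 40000008 3FFFFFFC */
        return 0;
    }
    ```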

    1. I can't find any text that documents the wrapping behavior. Volume 3A section 3.4 "Logical and linear addresses" says that step 3 is "Adds the base address of the segment ... to the offset to form a linear address," but says nothing about what happens if this addition overflows. (Implementation-specific behavior is explicitly called out if you try to do a DWORD read at offset FFFFFFFFh.)

      1. Myria says:

        I can't, either. There are a lot of statements that vaguely imply that all segment base + effective addresses are ANDed with 0xFFFFFFFF in legacy and compatibility modes (32-bit modes), but there is no direct statement of this fact. However, I know from experience that wrapping segments are in fact legal. In compatibility mode (as opposed to legacy mode), it still wraps around 4 GB even though the virtual memory space is much larger in long mode.

        That wrapping around 0xFFFFFFFF in an effective address is implementation-defined turned out to be very important in the security breakdown of the original Xbox. Intel CPUs don't care if EIP hits the end; they just continue executing at 0.

        1. smf says:

          @Myria The Xbox exploit refers to the CS:IP in real mode wrapping from 0xffff:ffff to 0x0000:0000. Intel CPUs do (or did) this to allow boot ROMs at either end of the address space; AMD instead throws an exception. The Xbox started out with an AMD CPU and only switched to Intel quite late on. Masking of physical addresses in protected mode is completely different.

          https://events.ccc.de/congress/2005/fahrplan/attachments/591-paper_xbox.pdf

          "That wrapping around 0xFFFFFFFF in an effective address is implementation-defined turned out to be very important in the security breakdown of the original Xbox. Intel CPUs don’t care if EIP hits the end; they just continue executing at 0."

          1. smf says:

            And of course, the lack of an exception on wrap-around on Intel CPUs wouldn't have been a problem. Except that Microsoft tried to stop the system when it detected a security problem by purposely triggering the AMD exception, and never went back to test it when the CPU changed to Intel.

  6. DonH says:

    It's been a long time since I had to look at this, but (at least on some x86's) couldn't you turn on the Page Size Extensions and get 4MB pages in a 36-bit linear address space instead of the normal 4kB pages in a 32-bit linear address space? I seem to recall that we considered doing this before rejecting it for a number of reasons (I/O costs would be too high for 4MB pages, and we thought programmers would be happier with flat 32-bit virtual addresses than with segmented 48-bit virtual addresses).

    1. Nope, PSE merely avoids allocating a page for the PTEs (and puts everything in the PDE). The linear address space doesn't change; only the way pages get mapped into it.

  7. MarcK4096 says:

    It seems like it would be much easier to recompile in x64 and just change your requirements to require a 64-bit OS. :)

  8. Yuhong Bao says:

    This reminds me of how the Morris worm was ignored when OS/2 2.x and NT OS/2 were being designed. Yes, I am talking about the decision to go with a flat instead of a segmented address space.

    1. Seeing as x86 is the only major processor that supports segments, that basically says "We will never run on anything other than an x86."

      1. Yuhong Bao says:

        I know. But I don't think portable programs that can handle both segmented and flat address spaces would be hard to create. On malloc for example, you could just increase the data segment size.

        1. I'm sure people would have loved working with far pointers again. ("What do you mean there is no integer type large enough to hold a pointer? And there is no way to atomically exchange a pointer?")

          1. Yuhong Bao says:

            Yeah, I have been thinking, for example, about whether the data and stack would be in the same segment or in separate segments.

  9. cheong00 says:

    This reminds me of people who spread the rumor that on 32-bit WinXP with 8GB of RAM, you can install some ramdisk driver to turn the inaccessible additional 4GB of RAM into a ramdisk and use it.

    //sigh

    1. Yuhong Bao says:

      I think most of these used PSE36.

      1. cheong00 says:

        Not aware of that. Btw, if the information found on the Wikipedia PSE-36 page is correct, Win2k and above only support PAE, and therefore it should not work on WinXP systems.

    2. smf says:

      It's not a rumour; you can do it. XP SP1 allows you to access more than 4GB; later service packs blocked it. You can either patch the binary, or there are other ways for the ramdisk to access the extra memory. I don't know why you'd want to, but you can.

      1. cheong00 says:

        This "news" started to spread around the time of WinXP SP2, when 4GB of RAM become affordable and 4GB per bank memory modules enter the market. There was lots of people talking about it in 2008-2010 when people upgrade their machine but still want to use WinXP, and want to have more memory installed so that they can make use of it if they decided to upgrade to Win7 x64 later.

        AFAIK the ramdisk is still a Windows driver, and its choice of mode to access memory is still governed by the OS. If the OS chooses not to support it, you cannot enable it with a driver. That's why I think it's a rumour.

        1. Yuhong Bao says:

          Actually, I think the DRAM price per bit fell by about half between 2004 and 2005. One year after XP SP2 was released, with 512Mbit DDR/DDR2 chips being available, 4GB of RAM probably cost around $400-500. At that time, "XP Professional x64 Edition" existed, but driver issues were not the only problem. Though I admit that on the Intel desktop side only the Intel 955X chipset had 36-bit addressing.

    3. ender says:

      ImDisk claims to be able to use RAM above 4GB on 32-bit OSes with an additional driver that uses AWE.

  10. MV says:

    I worked out a trick once to launch a child process, then use the debugging APIs to read/write the memory of the child process. I couldn't use >4GB directly, but I could copy large blocks of data into my address space when needed, and push them back to the child process afterwards.

    Don't worry, this wasn't part of a real app - just an experiment to see if it would work (it does).

    1. I think you tried too hard. If you just wanted to stash memory somewhere, you can create a memory mapping and then unmap it. The memory is still there; you just need to map it back in.
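
       In case the mechanism is unclear, here is a minimal sketch of that stash-and-unmap idea using a pagefile-backed section (the size and contents are made up, and error handling is mostly elided):

       ```c
       #include <windows.h>
       #include <string.h>
       #include <stdio.h>

       int main(void)
       {
           /* A 64MB pagefile-backed section; the size is arbitrary. */
           HANDLE section = CreateFileMapping(INVALID_HANDLE_VALUE, NULL,
                                              PAGE_READWRITE, 0, 64 * 1024 * 1024,
                                              NULL);
           if (!section) return 1;

           /* Map a view, stash some data, then unmap.  The section keeps its
              contents for as long as the handle is open, but occupies no
              address space while unmapped. */
           char *view = MapViewOfFile(section, FILE_MAP_ALL_ACCESS, 0, 0, 0);
           if (!view) return 1;
           strcpy(view, "stashed");
           UnmapViewOfFile(view);

           /* Later: map it back in (possibly at a different address) and read it. */
           view = MapViewOfFile(section, FILE_MAP_ALL_ACCESS, 0, 0, 0);
           printf("%s\n", view);   /* prints "stashed" */
           UnmapViewOfFile(view);
           CloseHandle(section);
           return 0;
       }
       ```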

      1. MV says:

        Yeah, that would have been easier. At the time I wasn't aware of the distinction between "virtual memory" and "address space". And the fact that the functions all say "File" in their names doesn't exactly broadcast the fact that they're all about memory tricks, not files. Probably some deep historical reason for the naming.

  11. Joshua says:

    I think this trick does work to get just under 8GB of address space, but not like that. You have to use the 32-bit mode that results from turning on PE but not PG, so it never tries a page-table lookup. But do you *really* want to deal with physical memory mapping like you have a 32-bit DOS? I don't think so.

Comments are closed.
