How exactly are page tables allocated on demand for large reserved regions?


Commenter pm100 asked to go into more detail into why reserving a large amount of address space used to be expensive, and now it's not.

I sort of answered it in my reply, but let's draw some pictures.

Recall that the paging structure is hierarchical. To describe a page of memory, you first select a page directory pointer from the page directory pointer table (PDPT). That points to a page directory (PD). You then select a then a page directory entry (PDE) from the page directory. This points to a page table (PT). You then select a page table entry (PTE) from the page table. The page table entry tells you where the memory for the page can be found.

PDPT
PDP PD
PDP ● PDE
PDP PDE PT
PDP PDE ● PTE
PDE PTE
PTE Physical memory
PTE ● Page N: 00 01 02 03 04 05 …
PTE

For pages that are reserved, the page table entry doesn't point to physical memory. Instead, it has a sad-face that says "If you try to access this memory, you will get a page fault."

PDPT
PDP PD
PDP ● PDE
PDP PDE PT
PDP PDE ● 😢
PDE PTE
😢 Physical memory
PTE ● Page N: 00 01 02 03 04 05 …
PTE

The above diagram shows a page table with some pages reserved (the sad faces), and some pages committed and present (the ones with a filled-in PTE).

If you reserved a huge chunk of address space, the traditional way of representing this was a page table full of sad-faces.

PDPT
PDP PD
PDP ● PDE
PDP PDE PT
PDP PDE ● 😢
PDE 😢
😢
😢
😢

But starting in Windows 8.1 and Windows Server 2012 R2, the memory manager optimizes this out, and instead of creating an entire page table filled with sad faces, it puts a sad face in the page directory:

PDPT
PDP PD
PDP ● PDE
PDP PDE
PDP 😢 ●
PDE

If you try to access memory through a sad-face page directory entry, the CPU raises a page fault.

Looking back at the original issue: The customer was reserving 100GB of address space. Each page table maps 512 × 4KB = 2MB of address spaces, so that's 51,200 page tables filled with sad faces. The memory manager optimizes this out and instead of creating 51,200 page tables filled with sad faces, it creates 51,200 page directory entries with sad faces.

But wait, that's still proportional to the size of the memory reserve, albeit with a much lower constant factor. How do we get the memory usage to be constant?

Page directory entries are grouped together into page directories. Each page directory holds 512 page directory entries, and if the page directory is filled with sad faces, we can replace it with a sad face in the page directory pointer table. Our 51,200 sad-face page directory entries become 100 sad-face page directory pointer. (On x64, there's another hierarchy level above the page directory pointer table, known as the page map level 4, but let's ignore that for now.)

A reservation of 100GB of address space turns into this:

PDPT
PDP
😢
😢
😢
😢
PDP
PDP

Okay, so the constant factor is now a lot lower. And if you filled the entire page directory page table with sad faces, then it can be replaced by a single sad face in the page map level 4.

So now we have an extremely low constant factor, but it's still a constant factor. How do you get the whole thing to be a constant?

Because eventually you will hit the top of the virtual memory hierarchy (currently page map level 4), and the madness stops.

If you think about it another way, you would have realized that reserving address space would never have required creating page tables, page directories, or page directory pointer tables in the first place: The address space was already invalid (namely, in the free state). Reserving address space doesn't make the pages valid; they're just invalid for a different reason (namely, in the reserved state). They were sad faces before, and they remain sad faces. Either way, they're still sad. You didn't have to do anything with the page tables to change sad faces to sad faces.

There is therefore no additional cost in terms of page tables, page directories, or page directory pointer tables. The cost is in the memory manager's internal bookkeeping, which is constant.

Comments (11)
  1. Mc says:

    Does it actually use a sentinal value of 😢 (Unicode crying face) or something more boring like -1?

    1. This is actually an insightful question disguised as a joke. I’ll answer it in another post.

      1. Piotr Siódmak says:

        lemme guess: 0x5ADDFACE? :)

    2. alegr1 says:

      The only requirement that the “valid” (the least significant) bit is zero. The rest of PTE can be used to point to OS structures that describe this region. It’s unlikely that it’s wasted on a joke bitmask.

  2. DWalker says:

    Are these sad-faces visible in memory dumps? That’s cute!

  3. Matt G says:

    I’m guessing that it uses the value 0xdeadbeef

  4. Nawak says:

    As a side note, not all architectures uses a hierarchical page table. x86 and ARM do, but PowerPC for instance doesn’t. They mostly use a hash-table for PTE (but on some PowerPC there isn’t even a page table, only TLB that are loaded by software).
    On these kinds of CPU, not doing anything special for “reserved but not commited” memory is the “obvious” way. I wonder if the x86 memory manager would have been “optimized” from the start if Windows had been *first* designed for that kind of CPU. Sometimes you don’t realize that there are shortcuts when you have always thought of a problem in a certain way, and it’s only when you (or someone else) approach it from a different angle that the question “Wait, is this intermediate step really necessary?” does pop up.
    Or maybe the shortcut was always known but wasn’t deemed worthy of the extra lines?

    1. Simon Farnsworth says:

      I suspect that it’s just that programming models have changed; in DOS days, address space and physical memory were the same thing, and thus the early reservers of address space were people who expected to back it with something already available to the system (mirrors of RAM or disk contents).

      Modern users reserve as much address space as they could possibly need in the worst case. These people don’t expect to back most of the space with anything, it’s just reserved and empty.

      1. Jan Ringoš says:

        It was probably also a case of not optimizing for the case that nobody has used. The approach of using continuous memory is quite recent. The good old days used to be all about linked lists and quite a lot of things in Windows are still implemented that way.

        1. Also worth noting that when the algorithm was first chosen, Win16 and Win32s apps would have been the norm, at 16 MiB of address space per app, and that restriction would still have mattered in developer thinking. While Win32 increased the address space to 2 GiB, it took a while for people to get the idea that they could actually use that, and even then, it wouldn’t be uncommon for the maximum possible need to be beyond 2 GiB (even though the typical case is closer to 8 MiB).

          It’s not until 64-bit CPUs become the norm (address space of 8 TiB in early 64-bit Windows, 128 TiB in current 64-bit Windows) that most processes had sufficient address space that they could simply reserve the largest continuous block that they could possibly need instead of worrying about running out of space.

    2. smf says:

      @nawak Windows NT was designed for Intel i860XR first and then ported to MIPS and then later to the 386 and dec alpha. MIPS only has a TLB, so I doubt that throwing PowerPC into the mix would have made much difference.

Comments are closed.

Skip to main content