Why is there a 64KB no-man’s land near the end of the user-mode address space?


We learned some time ago that there is a 64KB no-man's land near the 2GB boundary to accommodate a quirk of the Alpha AXP processor architecture. But that's not the only reason why it's there.

The no-man's land near the 2GB boundary is useful even on x86 processors because it simplifies parameter validation at the boundary between user mode and kernel mode by taking out a special case. If the 64KB zone did not exist, then somebody could pass a buffer that straddles the 2GB boundary, and the kernel mode validation layer would have to detect that unusual condition and reject the buffer.

By having a guaranteed invalid region, the kernel mode buffer validation code can simply validate that the starting address is below the 2GB boundary, then walk through the buffer checking each page. If somebody tries to straddle the boundary, the validation code will hit the permanently-invalid region and fail.
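
To make the shape of that check concrete, here is a minimal sketch of the idea. It is not the actual Windows code: the constant and the probe helper are hypothetical stand-ins (the real kernel exposes the cutoff as MmUserProbeAddress), and the stub "probe" simply treats everything at or above the no-man's land as inaccessible, which is exactly the property the real no-access region provides.

    /* Hypothetical sketch of the validation shape described above.
       MM_USER_PROBE_ADDRESS stands in for the start of the no-man's
       land: 0x7FFF0000, i.e. 2GB minus 64KB, in the default layout. */
    #define PAGE_SIZE             0x1000UL
    #define MM_USER_PROBE_ADDRESS 0x7FFF0000UL

    /* Hypothetical stand-in for the memory manager's page probe; here,
       any page at or beyond the no-man's land simply isn't accessible. */
    static int probe_page(unsigned long page)
    {
        return page < MM_USER_PROBE_ADDRESS;
    }

    int validate_user_buffer(unsigned long start, unsigned long length)
    {
        /* One range check on the starting address... */
        if (start >= MM_USER_PROBE_ADDRESS)
            return 0;                       /* not a user-mode address */

        /* ...then touch each page. A buffer that straddles the 2GB
           boundary necessarily covers a page inside the no-man's land,
           so the walk faults there and validation fails automatically. */
        for (unsigned long page = start & ~(PAGE_SIZE - 1);
             page < start + length;
             page += PAGE_SIZE) {
            if (!probe_page(page))
                return 0;
        }
        return 1;
    }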

Yes, this sounds like a micro-optimization, but I suspect this was not so much for optimization purposes as it was to remove weird boundary conditions, because weird boundary conditions are where the bugs tend to be.

(Obviously, the no-man's land moves if you set the /3GB switch.)

Comments (29)
  1. Henke37 says:

    Some time indeed.

  2. Joshua says:

    I suppose it's possible to map it with a device driver but if somebody pulls that they get what they deserve.

    (Device drivers can access memory above the licensed memory limit so this is trivial in comparison.)

  3. smf says:

    "If the 64KB zone did not exist, then somebody could pass a buffer that straddles the 2GB boundary, and the kernel mode validation layer would have to detect that unusual condition and reject the buffer."

    How does it help? If I pass a 128KB buffer that starts 127KB below the 2GB boundary, then it will still straddle the boundary.

    If you can't write that kind of validation and test it properly then you're probably not supposed to be writing operating systems.

    [If you have a no-man's land, then the straddled buffer will hit a no-access page, and validation will fail automatically. -Raymond]
  4. Mordachai says:

    @smf – right, the processor will catch the attempt.  No software need check this.

  5. Mike Dimmick says:

    @smf: The Windows kernel checks that every page of every buffer parameter passed to a system call is accessible before starting work on the request.

    "When a user-mode application calls the Nt or Zw version of a native system services routine, the routine always treats the parameters that it receives as values that come from a user-mode source that is not trusted. The routine thoroughly validates the parameter values before it uses the parameters. In particular, the routine probes any caller-supplied buffers to verify that the buffers are located in valid user-mode memory and are aligned properly."

    msdn.microsoft.com/…/ff565438(v=vs.85).aspx

    Drivers are supposed to call the Zw version of the routine if the data is already in kernel space and/or has been validated already, or the Nt version if the data originated from user-mode and needs to be validated.
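
    For illustration, this is roughly what that probing looks like on the driver side: a minimal sketch using the documented ProbeForRead and ProbeForWrite routines (the routine name and parameters here are hypothetical, and ULONG alignment is just an assumption for the example):

        #include <ntddk.h>

        /* Hypothetical validation fragment: probe caller-supplied
           buffers the way the quoted documentation describes. The
           probe routines raise an exception if any part of a range
           is not properly aligned, valid user-mode memory. */
        NTSTATUS ValidateUserBuffers(PVOID InBuf, SIZE_T InLen,
                                     PVOID OutBuf, SIZE_T OutLen)
        {
            __try {
                ProbeForRead(InBuf, InLen, sizeof(ULONG));
                ProbeForWrite(OutBuf, OutLen, sizeof(ULONG));
            } __except (EXCEPTION_EXECUTE_HANDLER) {
                return GetExceptionCode();  /* e.g. STATUS_ACCESS_VIOLATION */
            }
            return STATUS_SUCCESS;
        }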

  6. s says:

    @smf – since you need to do the page accessibility check anyway, getting the check for straddling the address space for free is nice, though. You always get the fewest bugs in the code you didn't need to write.

  7. smf says:

    "[If you have a no-man's land, then the straddled buffer will hit a no-access page, and validation will fail automatically. -Raymond]"

    Is the memory above 2GB marked as valid and therefore excluded because of a different check?

    [Since the check is made in kernel mode, the memory above 2GB is valid to kernel mode. The validation code checks that the buffer does not end above the 2GB boundary. So technically, the no-man's land isn't required, but it's nice to have defense in depth. -Raymond]
  8. @smf: Raymond's last sentence should hint at the answer to this:

    "(Obviously, the no-man's land moves if you set the /3GB switch.)"

    In other words, this is in reference to a 32-bit process's address space on a 32-bit OS.  In that context, the memory above 2GB isn't accessible to the process anyway and would fail regardless.  If you're on a 64-bit OS or use the /3GB switch on a 32-bit OS, then the no-man's land moves to right before the 4GB or 3GB barrier, respectively.  I would presume that a 64-bit process would have a similar no-man's land at the end of its address space, though it would be a lot harder to hit up against (well, barring intentionally doing so, of course).
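
    As a concrete check, a user-mode program can ask where its own address space ends. This small sketch uses only documented SYSTEM_INFO fields; on a 32-bit system without /3GB it typically reports a maximum application address of 0x7FFEFFFF, i.e. 64KB short of 2GB:

        #include <windows.h>
        #include <stdio.h>

        int main(void)
        {
            SYSTEM_INFO si;
            GetSystemInfo(&si);
            /* On 32-bit without /3GB: min is typically 0x00010000 (a
               64KB no-man's land at the bottom too), max is 0x7FFEFFFF. */
            printf("min application address: %p\n", si.lpMinimumApplicationAddress);
            printf("max application address: %p\n", si.lpMaximumApplicationAddress);
            return 0;
        }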

  9. Gabe says:

    Wouldn't it work just as well to have the no-man's land be the page at 2GB (or at 3GB under /3GB)? If you tried to access every page of a buffer and hit the one at 2GB, it would still fault before accessing any actual kernel memory.

    Of course you can't change it now, because there's probably some important app that crashes if you give it that chunk of addresses.

    [One page would have been enough for this purpose, but allocation granularity is 64KB, so taking one page off the table is the same as taking 64KB off the table because there is no way to allocate the other 60KB. -Raymond]
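
    To see the granularity Raymond mentions in action, here is a short user-mode sketch: reservations come back aligned to the allocation granularity (64KB on typical systems, as reported by GetSystemInfo), so reserving a single page really does consume a whole 64KB slot of address space. The hard-coded 0x10000 assumes a 64KB granularity:

        #include <windows.h>
        #include <stdio.h>

        int main(void)
        {
            SYSTEM_INFO si;
            GetSystemInfo(&si);
            printf("allocation granularity: %lu\n", si.dwAllocationGranularity);

            /* Reserve a single page; the base address is nonetheless
               rounded to the 64KB allocation granularity. */
            void *p = VirtualAlloc(NULL, 4096, MEM_RESERVE, PAGE_NOACCESS);
            printf("reserved at %p (64KB-aligned: %s)\n", p,
                   ((ULONG_PTR)p % 0x10000 == 0) ? "yes" : "no");
            if (p) VirtualFree(p, 0, MEM_RELEASE);
            return 0;
        }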
  10. alegr1 says:

    Simply probing user-mode pages is not good enough. You have to copy the data from user-mode buffers to your internal buffers while under a try/except block. This also makes probing unnecessary.

    If you only probe pages, they can become invalid in the meantime if another thread unmaps them.

    Probing only makes sense for the MmProbeAndLockPages function on an MDL, to prepare the MDL for an I/O operation.

    [Yes, if you are capturing the entire buffer, then probing is redundant. If you're capturing only part of it, you still want to ensure the whole buffer is valid. Or maybe you're not capturing it at all but you plan on writing to it later, so you want to fail up front if possible. -Raymond]
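
    A minimal sketch of the capture pattern alegr1 describes (the routine name is hypothetical, and error handling is reduced to the essentials). Note that the probe still earns its keep as the address-range check: a plain copy from a kernel-mode address would otherwise succeed silently.

        #include <ntddk.h>

        /* Capture: snapshot the user buffer into kernel memory under
           __try/__except, so a page that goes invalid mid-operation (or
           a buffer straddling the no-man's land) raises an exception
           instead of corrupting kernel state. */
        NTSTATUS CaptureUserBuffer(PVOID UserBuffer, SIZE_T Length,
                                   PVOID KernelCopy)
        {
            __try {
                ProbeForRead(UserBuffer, Length, 1);     /* range check */
                RtlCopyMemory(KernelCopy, UserBuffer, Length);
            } __except (EXCEPTION_EXECUTE_HANDLER) {
                return GetExceptionCode();
            }
            /* Work only on KernelCopy from here on. */
            return STATUS_SUCCESS;
        }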
  11. Evan says:

    @Gabe: "Wouldn't it work just as well to have the no-man's land be page at 2GB (or /3GB)?"

    No, because of the Alpha quirk.

  12. Muzer_ says:

    @Henke37 "Some time indeed." – I was nine years old when that linked article was published.

  13. j b says:

    @muzer…

    " I was nine years old when that linked article was published."

    So when I was your age, I was halfway through my IT studies, two and a half years to go. Not that THAT necessarily indicates anything significant. I would just like to mention it :-)

  14. Muzer_ says:

    @j b: Well, I'm now halfway through my undergraduate Master's degree in Computer Science (I'm in the third year of a four-year course), so yeah, sounds about right :)

  15. Dave says:

    Does the 64K VirtualAlloc() granularity discussed in the referenced article still exist in modern versions of Windows? I've been trying to find references to it but the current VirtualAlloc() docs, msdn.microsoft.com/…/aa366887%28v=vs.85%29.aspx, only talk about page granularity, while older (1990s) MSDN discussions were quite explicit about it.

  16. foo says:

    @Dave. The documentation you link to says to call the GetSystemInfo() function to determine the allocation granularity of the host system. On my test machine it reports 64KB. 64-bit Windows Server 2008 R2 (Standard) with an Intel processor.

  17. Gabe says:

    Evan: What I'm saying is that now that Alpha is no longer supported, you could move the no-man's land to the first page of kernel memory without affecting the buffer validation code.

    Obviously there's no reason to do that — if you're that desperate for address space, just use 64 bits — just that it is possible.

  18. Evan says:

    @Gabe:

    Raymond's "One page would have been enough for this purpose" response made me realize I may have misread your question. If you're asking why the dead space isn't a single page, the 64KB allocation granularity answers that. If you're asking why the dead space isn't *at* 2GB instead of just below it (regardless of size), that's because of the Alpha quirk: forming an address in the last 32KB below 2GB takes an extra instruction, and allocation granularity rounds the dead zone out to 64KB.
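
    For the curious, the quirk can be simulated in a few lines. The Alpha's two-instruction constant load (LDAH, then LDA) builds a value from two sign-extended 16-bit halves, and the reconstruction breaks down for addresses in the last 32KB below 2GB. This is a simulation of the arithmetic, not actual Alpha code:

        #include <stdio.h>
        #include <stdint.h>

        /* Simulate LDAH/LDA: both displacements are sign-extended
           16-bit fields. */
        static int64_t two_insn_load(uint32_t target)
        {
            int16_t low  = (int16_t)(target & 0xFFFF);
            int16_t high = (int16_t)((target - (int64_t)low) >> 16);
            return ((int64_t)high << 16) + low;
        }

        int main(void)
        {
            /* 0x7FFE0000 reconstructs exactly; 0x7FFF8000 would need a
               high part of 0x8000, which doesn't fit in a signed 16-bit
               field, so the result comes out sign-extended and wrong. */
            printf("%llx\n", (unsigned long long)two_insn_load(0x7FFE0000));
            printf("%llx\n", (unsigned long long)two_insn_load(0x7FFF8000));
            return 0;
        }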

  19. Engywuck says:

    @Gabe: especially if some programs rely on this implementation detail (and it's quite probable at least some LOB ones do). At least it would take quite an effort to ensure that no one is inconvenienced too much by it, so the -100 points probably never get to a point where they see the 0 line :-)

    [#define INVALID_POINTER_VALUE 0x7FFFFFFF -Raymond]
  20. smf says:

    "[Since the check is made in kernel mode, the memory above 2GB is valid to kernel mode. The validation code checks that the buffer does not end above the 2GB boundary. So technically, the no-man's land isn't required, but it's nice to have defense in depth. -Raymond]"

    The problem I see with defence in depth is that if you need it, how much defence do you actually have? Unless the exploits for each defence are mutually exclusive (so both defences cannot possibly be breached at the same time), it's no better than security through obscurity.

    I can't argue that the extra 64KB, or the cycles taken by the check, would be noticeable if you took one of the defences down, but I am worried that you think either of them is breachable at all.

    If the page check can be breached below 2GB then there are privacy issues.

  21. Crescens2k says:

    @smf:

    The question here is how much defence do you think they have at the kernel/user boundary? Given how security-critical this boundary is, do you think they would do something like just use the buffer unchecked? There are more checks than you think, but the no-man's land adds an extra level that can't be ignored, because it ends up as a page fault generated by the processor.

    So no, the situation is nowhere near as bleak as you imagine. This gives even more defence in a situation where you can never have enough. It is also independent of the code that can be messed up, since it would be a hardware-generated event.

  22. Joshua says:

    [If you have a no-man's land, then the straddled buffer will hit a no-access page, and validation will fail automatically. -Raymond]

    WriteFileGather has the unique ability to write to a buffer out of order. I sure hope I'm not disclosing a zero-day here.

  23. Gabe says:

    Joshua: The buffer validation happens before the system call even begins executing.

    I'm not sure where WriteFileGather comes into play, though, because a write operation reads from a buffer. Perhaps you meant ReadFileScatter? Either way, the problem doesn't even apply here, because the scatter/gather functions take a list of pages — not a single contiguous buffer. That is, if you wanted it to use a 40KB buffer, you would have to pass in a list of 10 pointers (one for each page).
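
    To make the 40KB example concrete, a sketch of how such a read is issued. Assumptions: hFile was opened with FILE_FLAG_NO_BUFFERING | FILE_FLAG_OVERLAPPED (scatter/gather requires both), the page size is 4KB, and error handling is omitted:

        #include <windows.h>

        #define PAGES 10   /* 40KB / 4KB: ten separate page pointers */

        void ScatterRead40KB(HANDLE hFile, OVERLAPPED *ov)
        {
            /* One page-aligned page per element; the pages need not be
               adjacent. The extra element stays NULL as a terminator. */
            FILE_SEGMENT_ELEMENT seg[PAGES + 1] = {0};
            for (int i = 0; i < PAGES; i++) {
                seg[i].Buffer = PtrToPtr64(VirtualAlloc(NULL, 4096,
                    MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE));
            }
            ReadFileScatter(hFile, seg, PAGES * 4096, NULL, ov);
        }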

  24. smf says:

    @Crescens2k

    If there is code that can be messed up, then it could just as easily be wrong when the buffer doesn't straddle the no-man's land.

  25. Dave says:

    @foo: Thanks, somehow I glossed over that while looking for an explicit mention of a value.

  26. smf says:

    @Paul

    It doesn't appear to improve efficiency, as the code that tests for the buffer ending in kernel space still runs. And if the buffer straddles the no-man's land, taking the CPU exception is far less efficient.

    Better add some if (1==1) checks to it, just in case.

  27. Paul says:

    @smf

    Code that is wrong with a buffer that doesn't straddle the no-man's land is indeed possible; however, this is a very easy and efficient way of preventing problems from code that produces a buffer that *does* straddle this 64KB. There will certainly be other checks in place to prevent invalid / malicious code from doing other things, but a single 64KB block used this way allows the hardware to trap the edge cases very efficiently and in a way that cannot be avoided at the software level (Raymond says this in the article / comment responses).

  28. Paul says:

    @smf

    The whole point of the article is that this is "defence in depth": this is an area where a potential exploit *could* exist, and this 64KB buffer means that if the executing code were compromised, the hardware itself would provide an additional level of protection. As Raymond is pointing out in the article, 64KB is deemed a worthwhile trade-off for the additional protection it offers. If you think being extra safe on a critical boundary is the same as doing a 1 == 1 check, then you are missing the point of the article.

    This is an *additional* level of protection that protects in a different way to whatever checks already exist.

  29. Simon Farnsworth says:

    @smf

    The point is that the kernel does three things with a userspace buffer before starting a syscall:

    1. Check that it starts in user address space.

    2. Check that it ends in user address space.

    3. Check that each page within the buffer is accessible, as confirmed by the CPU.

    Checks 1 and 2 should be sufficient. However, if there's a fencepost error in either 1 or 2 that allows a buffer to extend one page into non-user address space, check 3 will catch it, as long as there's at least one page of black-hole space between user address space and kernel address space. As Windows has an existing 16-page black hole (due to the Alpha limitation), and there's no significant gain from removing it, it might as well stay in place and catch the rare cases where the first two checks fail.
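
    As a footnote on where check 2 can go wrong: computing the buffer's end as start + length can wrap around, which is exactly the kind of fencepost the black hole backstops. A small sketch of an overflow-safe form of checks 1 and 2 (the limit constant is a stand-in for the real cutoff):

        #include <stdint.h>

        #define USER_LIMIT 0x7FFF0000UL   /* stand-in: start of no-man's land */

        /* Checks 1 and 2, written so that start + length overflowing
           32 bits cannot sneak a buffer past the end test. */
        int range_is_user(uint32_t start, uint32_t length)
        {
            if (start >= USER_LIMIT)          /* check 1: starts in user space */
                return 0;
            if (length > USER_LIMIT - start)  /* check 2, without computing start+length */
                return 0;
            return 1;
        }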
