The C language specification describes an abstract computer, not a real one


If a null pointer is zero, how do you access the memory whose address is zero? And if C allows you to take the address one past the end of an array, how do you make an array that ends at 0xFFFFFFFF, since adding one to that value would wrap around?

First of all, who says that there is a byte zero? Or a byte 0xFFFFFFFF?

The C language does not describe an actual computer. It describes a theoretical one. On this theoretical computer, it must be possible to do certain things, like generate the address of one item past the end of an array, and that address must compare greater than the address of any member of the array.

But how these theoretical operations map to actual operations is left to the discretion of the C language implementation.

Now, most implementations will do the "obvious" thing and say, "Well, a pointer is represented as a numerical value which is equal to the low-level memory address." But they are not required to do so. For example, you might have an implementation that says, "You know what? I'm just going to mess with you, and every pointer is represented as a numerical value which is equal to the low-level memory address minus 4194304." In other words, if you try to dereference a pointer whose numeric value is 4096, you actually access the memory at 4194304 + 4096 = 4198400. On such a system, you could have an array that goes all the way to 0xFFFFFFFF, because the numeric value of the pointer to that address is 0xFFBFFFFF, and the pointer to one past the end of the array is therefore a perfectly happy 0xFFC00000.
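
To make the arithmetic concrete, here is a minimal sketch of how such a mischievous implementation might turn pointer values back into hardware addresses when you dereference them. The BIAS constant and the raw_read function are invented for illustration; a real implementation would bake this into its code generator (or, as noted below, into the hardware).

    #define BIAS 4194304UL  /* 0x400000, the offset this imaginary implementation applies */

    unsigned char raw_read(unsigned long hardware_address);  /* imaginary hardware access */

    /* A pointer's numeric value is the hardware address minus BIAS,
       so dereferencing adds the bias back: pointer value 4096 reads
       hardware address 4194304 + 4096 = 4198400. */
    unsigned char deref(unsigned long pointer_value)
    {
        return raw_read(pointer_value + BIAS);
    }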

Before you scoff and say "That's a stupid example because nobody would actually do that," think again. Win32s did exactly this. (The 4194304-byte offset was done in hardware by manipulating the base address of the flat selectors.) This technique was important because byte 0 was the start of the MS-DOS interrupt table, and corrupting that memory was a sure way to mess up your system pretty badly. Shifting all the pointers meant that a Win32s program which dereferenced a null pointer ended up accessing byte 4194304 rather than byte 0, and Win32s made sure that there was no memory mapped there, so that the program took an access violation rather than corrupting your system.

But let's set aside implementations which play games with pointer representations and limit ourselves to implementations which map pointers to memory addresses directly.

"A 32-bit processor allegedly can access up to 2³² memory locations. But if zero and 0xFFFFFFFF can't be used, then shouldn't we say that a 32-bit processor can access only 2³² − 2 memory locations? Is everybody getting ripped off by two bytes? (And if so, then who is pocketing all those lost bytes?)"

A 32-bit processor can address 2³² memory locations. There are no "off-limits" addresses from the processor's point of view. The guy that made addresses zero and 0xFFFFFFFF off-limits was the C language specification, not the processor. That a language fails to expose the full capabilities of the underlying processor shouldn't be a surprise. For example, you probably would have difficulty accessing the byte at 0xFFFFFFFF from JavaScript.

There is no rule in the C language specification that the language must permit you to access any byte of memory in the computer. Implementations typically leave certain portions of the address space intentionally unused so that they have wiggle room to do the things the C language specification requires them to do. For example, the implementation can arrange never to allocate an object at address zero, so that it can conform to the requirement that the address of an object never compares equal to the null pointer. It also can arrange never to allocate an object that goes all the way to 0xFFFFFFFF, so that it can safely generate a pointer one past the end of the object which behaves as required with respect to comparison.
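
Here is a small sketch of both guarantees; a conforming implementation must make these comparisons come out true no matter how it represents pointers or where it places objects:

    #include <stddef.h>

    int a[10];

    int guarantees_hold(void)
    {
        int *end = a + 10;   /* one past the end: always valid to form */
        int *first = &a[0];

        /* Both comparisons must be true in every conforming implementation. */
        return (end > &a[9]) && (first != NULL);
    }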

So you're not getting ripped off. Those bytes are still addressable in general. But you cannot get to them in C without leaving the C abstract machine.

A related assertion turns this argument around. "It is impossible to write a conforming C compiler for MS-DOS because the C language demands that the address of a valid object cannot be zero, but in MS-DOS, the interrupt table has address zero."

There is a step missing from this logical argument: It assumes that the interrupt table is a C object. But there is no requirement that the C language provide access to the interrupt table. (Indeed, there is no mention of the interrupt table anywhere in the C language specification.) All a conforming implementation needs to do is say, "The interrupt table is not part of the standard-conforming portion of this implementation."

"Aha, so you admit that a conforming implementation cannot provide access to the interrupt table."

Well, certainly a conforming implementation can provide language extensions which permit access to the interrupt table. It may even decide that dereferencing a null pointer grants you access to the interrupt table. This is permitted because dereferencing a null pointer invokes undefined behavior, and one legal interpretation of undefined behavior is "grants access to the interrupt table."
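
For instance, a real-mode MS-DOS compiler might expose the interrupt table through its far-pointer extension, as in this sketch. The far keyword and the MK_FP macro from DOS-era <dos.h> headers are compiler extensions, not standard C, which is exactly the point: the standard-conforming part of the language never has to touch address zero.

    #include <dos.h>  /* DOS-era compiler extension header; not standard C */

    unsigned long get_int21_vector(void)
    {
        /* The real-mode interrupt vector table lives at segment 0, offset 0. */
        unsigned long far *ivt = (unsigned long far *)MK_FP(0, 0);
        return ivt[0x21];  /* the vector for INT 21h, DOS services */
    }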

Comments (21)
  1. Nathan_works says:

    Ah, it's been a long time since I worked on a project that involved public published specs and RFCs, with the various ways people would interpret should/must/will/always/may etc.

  2. Random832 says:

    Also, a null pointer doesn't have to have a numeric value of zero at all. Converting a null pointer to an integer could yield whatever number the compiler feels like, and vice versa (only integer constant expressions equal to 0 must be converted to null pointers – an integer variable that is 0 could become some other pointer).

    Generally this sort of thing was done on architectures that had a pre-existing "invalid pointer value" convention. The x86 just uses general registers for pointers, and 0 is the easiest value to compare an integer to, so null pointer is 0. This was also true on the PDP-11, which is how NULL == 0 became the convention in C in the first place.
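
    In code, the distinction looks something like this (the comments describe what the standard guarantees, not what typical implementations do):

    void *p1 = 0;             /* integer constant expression: guaranteed to be a null pointer */
    int zero = 0;
    void *p2 = (void *)zero;  /* runtime zero: implementation-defined, need not be null */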

  3. Joshua says:

    nullptr != 0 is murderous on most modern code that assumes memset() on a struct sets the contained pointers to NULL.
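
    For example (a sketch; the breakage only shows up on an implementation where null is not all-bits-zero):

    #include <string.h>

    struct widget { int id; char *name; };

    void init(struct widget *w)
    {
        memset(w, 0, sizeof *w);  /* sets every bit of *w to zero */
        /* w->name is a null pointer only if this implementation represents
           null as all-bits-zero; the standard makes no such promise. */
    }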

    As for the problem of address 0 being invalid, I encountered a system that put the heap descriptor there. It was amusing because the heap was written in C, and so special care had to be taken to avoid using NULL inside the heap manager itself.

    The 0xFFFFFFFF problem is trivially avoided on all systems with arbitrary mapping by placing the top of the stack there. In C, you can't walk the stack without invoking undefined behavior, so this reduces that to a non-issue.

    [The sort of person who asks this question would then say, "So how do you make an array that ends at 0xFFFFFFFF?" -Raymond]
  4. Ah, I missed the "let's set aside implementations which play games with pointer representations" clause.

  5. Henning Makholm says:

    Virtual memory in general is arguably a case of "implementations which play games with pointer representations". Whether it's done with segment descriptors (as Win32s apparently did), or with paging, the effect is more or less the same from the viewpoint of user code.

  6. Mark Y says:

    Dereferencing null is undefined?  Cool!  I thought it was guaranteed to crash, just like a false assertion or something.  So crashing is the OS guarantee, not the language guarantee, apparently.

  7. 12BitSlab says:

    S/360 was one of the first systems to do virtual-to-physical address mapping in hardware.  Even if you think you are addressing a particular byte — even if you write in BAL (that's Basic Assembler Language, for you youngsters) — you are not, unless you have told the OS during configuration that you are running in a Virtual=Real environment.

    I am not crazy about the C convention of NULL=0.  However, if one looks back at the first implementations and the hardware limitations that they dealt with, it is very understandable why that convention exists in the form it does.

    The bottom line is that no matter how far forward we move, we must always deal with decisions from the past.

  8. Cesar says:

    Obligatory link: "What Every C Programmer Should Know About Undefined Behavior" (3 part series) blog.llvm.org/…/what-every-c-programmer-should-know.html

  9. GregM says:

    "Virtual memory in general is arguably a case of "implementations which play games with pointer representations"."

    In general, yes, but in this case, not really.  The same principles apply to accessing bytes 0 and 0xFFFFFFFF of the process's address space whether or not there is a virtual memory system between the address space and the actual RAM.

  10. Adam Rosenfield says:

    On Linux, you can use mmap(2) to map the 0 page and dereference null pointers, but you first need to write to /proc/sys/vm/mmap_min_addr (as root) to enable that.  Of course, if you do that inside the kernel, that's extra-bad, and it can be easily exploited, as demonstrated here: blogs.oracle.com/…/much_ado_about_null_exploiting1 .
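
    Something like this sketch (run after setting mmap_min_addr to 0 as root):

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        /* Requires /proc/sys/vm/mmap_min_addr to be 0. */
        void *p = mmap((void *)0, 4096, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
        if (p == MAP_FAILED)
            perror("mmap");
        else
            printf("page zero mapped; *(char *)p is %d\n", *(char *)p);
        return 0;
    }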

  11. 640k says:

    The whole concept of a null pointer is an anti-pattern, in the same way any magic number is. Actually, most of the C standard is an anti-pattern in itself.

  12. Myria says:

    Prior to Windows 8, you could allocate the first 64k of address space by calling VirtualAlloc with a pointer of 1.  It would round down your pointer to 0 for the allocation, but not interpret your request as NULL = allocate anywhere.

    Windows 8 blocks off such allocations as a kernel exploit mitigation, for the same reason Linux does.  But, unlike Linux, there's no direct way to disable that feature, and no way at all on 64-bit.  Virtual DOS machines in 32-bit Windows 8 can still allocate the first 64k so that they can have an emulated real mode interrupt table.
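
    Something along these lines (a sketch; pre-Windows 8 only):

    #include <windows.h>

    /* Asking for address 1 gets rounded down to address 0, which is
       different from passing NULL (NULL means "allocate anywhere"). */
    void *map_first_64k(void)
    {
        return VirtualAlloc((LPVOID)1, 0x10000,
                            MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    }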

  13. Myria says:

    On a related note to the original topic, it should be noted that casting to uintptr_t to do pointer arithmetic and casting back to pointer type is not portable, though it will work on any flat-addressing implementation.  This is because C does not require the uintptr_t to be any meaningful value – all that C requires is that casting to uintptr_t and casting back to the same pointer type survives the round trip.  A silly but compliant C implementation could convert to uintptr_t by DES-encrypting the pointer, then DES-decrypting on the way back, and it could even use different encryption keys for each type of pointer.
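
    A sketch of the round-trip guarantee:

    #include <stdint.h>

    char buf[16];

    void round_trip(void)
    {
        char *p = &buf[1];
        uintptr_t u = (uintptr_t)p;  /* some implementation-defined value */
        char *q = (char *)u;         /* guaranteed: q == p; the round trip
                                        is all the standard promises */
        (void)q;
        /* (char *)(u + 1) is NOT guaranteed to be &buf[2]; that shortcut
           only works on flat-addressing implementations. */
    }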

    I think that it would be interesting to have a C implementation that is the worst-case scenario, breaking on as many undefined operations as possible.  The compiler would implement C in the most ridiculous ways possible while adhering to the Standard.

  14. I don't follow the argument that people are being shortchanged by two bytes. Is there a rule that a pointer has to actually refer to a memory address? Or can a conforming C implementation do something like:

    Assume a machine has exactly 4 GB of memory, so memory locations 0x00000000 through 0xffffffff are valid.

    Use a 64-bit type to store pointer values.

    Map nullptr == (void*)0 to a bogus memory location… for example, 0x12345678`9abcdef0

    Map (void*)p to the memory location at (p – 1) for all other values of p
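
    A sketch of that mapping (the names are invented):

    /* 64-bit pointer representations for a machine with exactly 4 GB of memory. */
    #define NULL_REP 0x123456789abcdef0ULL  /* bogus location reserved for null */

    unsigned long long hardware_location(unsigned long long rep)
    {
        /* Every non-null representation p refers to location p - 1, so
           rep 1 -> byte 0x00000000 and rep 0x100000000 -> byte 0xFFFFFFFF. */
        return rep - 1;
    }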

    [That sounds familiar. Oh right, I wrote that. See the paragraph "For example, you might have an implementation that says…" -Raymond]
  15. Mark Y says:

    @Myria: That sounds very useful: think lint on steroids!

  16. JM says:

    @Myria: You're talking about the DeathStation 9000, which has never really sold well.

    The problem with this approach is that almost no program that's not a toy would even run under the most unreasonable interpretation of the Standard, and even those that do would probably find themselves unable to produce desirable output, and even those that could would be running as slow as molasses. It would be an exercise in constrained writing that does very little for improving the quality of code that actually needs to get things done.

    There is a huge gap between "break on as many undefined operations as possible" and "implement C in the most ridiculous way possible". The former is actually useful, the latter is at best exercise material for language lawyers. There are research projects for compilers/analyzers that try to nail down undefined behavior as much as possible — i.e., catch the stuff that actually matters. Unfortunately UB is UB for a reason, namely that it would be very hard for a compiler to detect that it is actually UB, in some cases being as hard as solving the halting problem. Even so, those attempts are more useful than an actual DS9K.

    In the end, if you really need to get rid of undefined behavior, you have to seriously consider not using C (at least not in its pure form). It places tremendous and in most cases unwarranted trust in the ability of the programmer to keep track of sets of fairly arcane rules that exist primarily to allow compiler writers to optimize the tar out of C code, something which is almost completely in opposition to getting reliable code.

  17. JM says:

    For what it's worth, this code is an honest attempt at constructing a pointer for the memory at location 0 provided your machine has such a location (the details of which, as Raymond points out, are not guaranteed by the Standard):

    int x = 0;

    void* p = (void*) x;

    This invokes implementation-defined behavior. The reason you need to get tricky is to avoid constructing a null pointer — which, as Random832 points out, is *not* necessarily zero. And even then, as Raymond points out, "memory location 0" need not correspond to what you would like it to be, but your implementation is ethically obliged to document that (technically, it need only document how pointers convert to integers and back, but leaving out a discussion of actual memory locations would be chicanery of the most deplorable kind).

    The confusion between 0 and the null pointer is one of the bigger warts of C, IMO. NULL helps pragmatically, but does nothing for the semantics.

  18. Matt says:

    @MarkY "Dereferencing null is undefined?  Cool!  I thought it was guaranteed to crash, just like a false assertion or something.  So crashing is the OS guarantee, not the language guarantee apparently."

    Nope. It's not an OS guarantee either. The OS won't normally allocate memory at address zero, but there's nothing to stop you telling it to. Try doing "VirtualAlloc((LPVOID)1, 4096, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE)" on your pre-Windows 8 machine.

    In fact, this is the reason why null-dereferences in kernel mode are often exploitable as elevation of privilege attacks. The null-page is mappable and within the user-addressable region of memory, so if the kernel dereferences a null pointer, it reads attacker controllable data.

    And btw, this is the reason why on Linux and Windows 8+ you can't map the null-page.

  19. Joshua says:

    @Matt: Sometimes you have to reconfigure to allow mapping the NULL page on Linux to run Wine (for 16-bit Windows apps).

    My only gripe about the LLVM compiler was that there was no way to disable the optimization that removes null pointer checks without disabling optimizations entirely. The Linux team actually made a policy decision to never remove NULL pointer checks again after the last NULL exploit, which would have been prevented except that the optimizer removed the check. Dereferencing *(NULL + some sufficiently large x) is still exploitable, and there's no way around that.

  20. Wilmer E. Henao says:

    Those addresses beyond 0xFFFFFFFF sound very dangerous.  So no standard way to catch this exception, huh?

  21. Yuhong Bao says:

    @Joshua: AFAIK Win8 handles this by disabling NTVDM by default.

Comments are closed.