What possible use are those extra bits in kernel handles? Part 3: New object types


Last time, we saw how those extra bits can be used to multiplex HANDLE with other values. That was a specific case of a more general scenario: Expanding the handle namespace to include things that aren't handles. (You can also view today's example as a generalization of the sentinel value problem, where we need to generate an arbitrary number of sentinel values dynamically. Actually, multiplexing HANDLE with HRESULT is also just another special case: We expanded the handle namespace to include error codes too.)

As I noted in the base article, the people who are most interested in this sort of thing are people writing low-level class libraries or wrapping kernel objects inside a larger framework.

Suppose, for example, you are writing a library that manipulates kernel objects, but you also have private types of objects (say, a handle to a remote computer) that you want this library to be able to manipulate. One way of doing this is to wrap everything inside some base class that virtualizes away what type of handle you're working on:

class ExtendedHandle {
public:
  virtual ExtendedHandleType GetType() = 0;
  virtual ~ExtendedHandle() { }
};

class KernelExtendedHandle : public ExtendedHandle {
public:
  static KernelExtendedHandle *Create(...);
  ExtendedHandleType GetType() { return KernelHandleType; }
  HANDLE GetHandle() { return Handle; }
private:
  KernelExtendedHandle(...);
  HANDLE Handle;
};

class RemoteComputerExtendedHandle : public ExtendedHandle {
public:
  static RemoteComputerExtendedHandle *Create(...);
  ExtendedHandleType GetType() { return RemoteComputer; }
  LPCTSTR GetComputerName() { ... }
  ... other remote computer methods ...
private:
  RemoteComputerExtendedHandle(...);
  ... stuff necessary for tracking remote computers ...
};

Now, your library spends only 1% of its time manipulating these private object types; most of the time, it's dealing with regular kernel handles. In other words, 99% of your objects are of type KernelExtendedHandle. What used to be just a HANDLE (4 bytes) is now an EHANDLE that in turn points to an 8-byte structure (4 bytes for the vtable and 4 bytes for the HANDLE). Your memory requirements have tripled and you added another level of indirection (costing you locality), just for that 1% case.

But you can do better if you have those extra bits to play with. Since 99% of the objects you're wrapping are just plain old kernel handles, you can say that if the bottom bit is clear, then it's just a kernel handle, and if the bottom bit is set, then the upper bits tell us what we are really operating on.

typedef void *EHANDLE;

BOOL IsKernelHandle(EHANDLE Handle)
{
  return (!((INT_PTR)Handle & 1));
}

EHANDLE CreateHandleFromKernelHandle(HANDLE Handle)
{
  // if this assertion fires, then somebody tried to
  // use an invalid kernel handle!
  assert(!((INT_PTR)Handle & 1));
  return (EHANDLE)Handle;
}

EHANDLE CreateHandleFromExtendedHandle(ExtendedHandle *Handle)
{
  // if this assertion fires, then somebody allocated
  // a misaligned ExtendedHandle!
  assert(!((INT_PTR)Handle & 1));
  return (EHANDLE)((INT_PTR)Handle | 1);
}

ExtendedHandle *GetExtendedHandle(EHANDLE Handle)
{
  assert(!IsKernelHandle(Handle));
  return (ExtendedHandle*)((INT_PTR)Handle & ~1);
}

ExtendedHandleType GetType(EHANDLE Handle)
{
  if (IsKernelHandle(Handle)) {
    return KernelHandleType;
  } else {
    return GetExtendedHandle(Handle)->GetType();
  }
}

HANDLE GetKernelHandle(EHANDLE Handle)
{
  assert(IsKernelHandle(Handle));
  return (HANDLE)Handle;
}

void ECloseHandle(EHANDLE Handle)
{
  if (IsKernelHandle(Handle)) {
    CloseHandle(GetKernelHandle(Handle));
  } else {
    delete GetExtendedHandle(Handle);
  }
}

Now the cost of a wrapped kernel handle is just 4 bytes: 4 bytes for the EHANDLE, which also doubles as the actual kernel handle. The cost of wrapped pseudo-handles is the same as before (4 bytes for the EHANDLE, plus the size of the corresponding XxxExtendedHandle class). We used the trick from last time in order to pack 33 bits into only 32 bits: Since we know that the bottom bit of both kernel HANDLEs and ExtendedHandle pointers is zero, we can use it as a discriminator.

If you are not confident that your ExtendedHandle classes all use aligned pointers, you can use a different packing mechanism by using your own handle translation table:

ExtendedHandle **ExtendedHandleTable;

ExtendedHandle *GetExtendedHandle(EHANDLE Handle)
{
  assert(!IsKernelHandle(Handle));
  return ExtendedHandleTable[(INT_PTR)Handle >> 1];
}

Using this technique, the upper 31 bits of an EHANDLE that refers to an ExtendedHandle are an index into a privately managed table of ExtendedHandle objects.

This secondary handle table technique is entirely analogous to the trick that one POSIX library uses to distinguish "real" process IDs from "virtual" process IDs, except that that library is relying on undocumented behavior, because the bottom bits of process IDs are not guaranteed to be zero!

So there you have it, three scenarios where you can take advantage of knowing that the bottom bits of kernel handles are always zero.

Comments (23)
  1. Adam says:

    Not a big deal, but near the end of the snippet:

    delete GetExtendedHandle();

    should be

    delete GetExtendedHandle(Handle);

    [Fixed, thanks, as well as another place I forgot to give GetExtendedHandle() a parameter. -Raymond]
  2. An Hero says:

    So, how much of a performance difference between the two methods do your extensive benchmarks reveal?

  3. Eddie says:

    @An Hero

    Yeah, but he is not talking about today’s computers; he is talking about yesterday’s computers, the ones this code would originally have been written for.  I am sure benchmarks would reveal that the difference is small.

    I think he is simply trying to justify why the design was made the way it was

  4. Leo Davidson says:

    4 bytes can make a huge difference, even on today’s computers.

    I worked on a system where someone added a single int member to an object without thinking. An easy thing to do since it’s just 4 bytes and we do that all the time. Problem was, there were a lot of those objects in memory during large operations and that 4 bytes per object pushed the process to the limits of the 2gig memory space for 32-bit apps (this stuff had to run inside Excel so even if we had 64-bit CPUs it wouldn’t matter).

    Okay, you might say that 32-bit is "yesterday’s machines" but that still isn’t a reality for the majority of code out there and using up several hundred meg of additional memory is a big deal even on a 64-bit machine.

    I don’t generally like having overloaded values (e.g. the outright wrongness that is atoi() returning zero on failure) but sometimes they are better than the alternative (though atoi() is still wrong!).

  5. Reginald Wellington III says:

    "So, how much of a performance difference between the two methods do your extensive benchmarks reveal?"

    Uh oh, here comes the "NO PREMATURE OPTIMIZATION!!!!!" crowd.  There’s nothing wrong with cutting memory usage of an object by 1/3 and getting rid of an indirection, especially if the implementation difficulty of the better method is roughly the same.

    Not all of us have to profile our code extensively to figure out what’s better.  We just do it the better way the first time.

  6. DrkMatter says:

    The important difference between atoi and the current discussion is that 0 is part of the set of results you would expect from atoi, whereas you should never see a HANDLE with the bottom bits set: it’s part of the HANDLE definition.
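    The atoi ambiguity DrkMatter describes can be shown in a few lines (a minimal editor's illustration using only standard C library calls): atoi collapses failure onto the legitimate result 0, while strtol reports "no digits consumed" out of band through its end pointer.

    ```cpp
    #include <cassert>
    #include <cstdlib>

    int main() {
      assert(atoi("0") == 0);     // legitimate zero
      assert(atoi("junk") == 0);  // failure, indistinguishable from "0"

      // strtol disambiguates: if no digits were consumed,
      // the end pointer still equals the start pointer.
      const char *s = "junk";
      char *end;
      long v = strtol(s, &end, 10);
      assert(v == 0 && end == s);  // a real failure, not a parsed zero
      return 0;
    }
    ```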

  7. An Hero says:

    Leo:  I don’t mean this in a derogatory way; I am genuinely curious.  What the hell does your Excel plug-in do that it can use up almost the entire process address space?  I think I am going to have nightmares tonight.

    Reginald:  So you’ve never used a profiler, or even a simple GetTickCount() measurement in your entire life?  I guess you must be God.  Truly, I am in awe and blessed to be in your presence.

  8. Reginald Wellington III says:

    "Reginald:  So you’ve never used a profiler, or even a simple GetTickCount() measurement in your entire life?  I guess you must be God.  Truly, I am in awe and blessed to be in your presence."

    And the reward for best melodramatic response to a comment goes to…. YOU!  Bravo!  I look forward to more ridiculous conclusions from you in the future.

  9. Pierre B. says:

    Hero: don’t confuse the issues. That bit of wisdom about not doing early optimizations has too often been abused in exactly the way Reginald says. And it’s not only me and Reginald saying this, although I won’t go googling for references.

    Not doing early optimization is not a ticket to write bad code, choose bad data structures, skip up-front design, or pick your implementation wildly from the get-go. It’s about not writing obscure and complex code without scientific measurement justifying the added complexity and increased maintenance and debugging costs.

    Often, the right and good and simple design is also efficient. You just have to think about your code before writing it.

  10. Hero:

    Premature optimization for time can be bad (though it isn’t always – inner loops expected to run on millions of items, for example should be optimized, and profilers/perf counters are one tool to help), but premature optimization for space without a time or coding readability impact is usually smart, and Raymond is pointing out a specific instance where you save time as well as space.

    There is a space/time tradeoff when you need to redo calculations instead of lookups, but that mostly has to do with caching complex calculations, not bit manipulation of the kind Raymond is espousing.

    A profiler or even calls to QueryPerformanceCounter (GetTickCount is NOT a high-precision timer) is all well and good when measuring cycles, but working set can be a critical factor in performance. Once your program’s working set becomes large enough that cache line fills start happening, their impact on performance is measurable.

    The cost of allocating one 4-byte integer is more than the size of the integer, and it costs more on 64-bit than on 32-bit. Small improvements here and there can make a huge difference overall.

    The overall theme of the message is: Don’t waste space needlessly – you have bits, so use them wisely.

  11. Ulric says:

    it could be said the memory cost has grown by a lot more than triple with the C++ approach.

    I haven’t checked whether each object in fact has two vtable pointers rather than one, but in any case, to use that approach you’d have to ‘new’ an object of the right class and use pointers.  So you’ve got multiple other pointers (probably two or three DWORD_PTRs in the heap block around your object).  Now, if you actually did need to store or associate data with the handle (and if you’re making a framework/library, that’s likely), then this may indeed be a needless early optimization.  But the article was just an example.

  12. Ulric says:

    to the guys above:  this whole ‘memory has tripled’ thing is an oversimplification.  I understand why you think it’s a big optimization benefit if you see it that way.  

    But you need to know how much RAM a handle actually costs. It probably links to a structure in the OS; there may be stuff associated with it, like a security descriptor, heap block header/footer, etc.  Saving a DWORD in your code may in fact reduce the memory per handle by just 5%, not 200%.

    Wasn’t it in these comments a few weeks ago that someone was talking about lotteries and chances?  A statement had been made that you had a 50% chance: either you win or you lose.  Same thing here; the problem is more complex.

  13. Pierre B. says:

    I’ll admit right away that I didn’t read *all* the comments on each part of this series.

    My question is: the fact that the bottom two bits are available is something at the NT-kernel level (in ntdef.h). How can we be sure those bits really are available when we are writing code at the Win32 level?

    Some APIs (the GetQueuedCompletionStatus function being the initial example) might use them now. Some APIs or sub-system might use them in the future.

    Is this trick reserved for the Windows team?

  14. Mark Sowul says:

    I propose yet another new tagline for Raymond’s blog – "You will never find a more wretched hive of carping and pedantry."

  15. KJK::Hyperion says:

    Pierre: to the best of my knowledge, the only tagged handles you’ll ever come across are console handles (console input and console screen buffers), which aren’t kernel handles in the first place. Win32 also special-cases STD_INPUT_HANDLE, STD_OUTPUT_HANDLE and STD_ERROR_HANDLE, which, zero-extended to a handle with UlongToHandle, count as an implicit GetStdHandle call – but this is an undocumented feature you most definitely didn’t hear about from me and should forget for your own good.

    Curiously enough, handle tags are in fact used by the NT layer in cases where adding a new system call for a single flag was presumably deemed too expensive.

  16. Sys64738 says:

    Thanks for sharing these low-level optimization techniques. While for business apps all the "managed" stuff (.NET, CLR, Java, etc.) can be good, for low-level stuff like operating systems, device drivers, and in general for high-performance applications these techniques can save lots of memory and increase speed.

  17. Medinoc says:

    @Ulric: You’re right on one point: adding four bytes to a dynamically allocated structure is usually cheap. Adding a dynamically allocated structure where there wasn’t one in the first place, however, isn’t.

  18. Richard says:

    The thing which has most struck me about this whole series is (apologies to Raymond) /what if two people did this?/.

    Suppose some team maintaining some lower-layer code decides they need a flag with their handle, and add one. Further suppose some team maintaining some higher-layer code decides the same thing. Suppose further that the release schedules of the two teams are unrelated (possibly they’re in different companies altogether?). Result: assertions if you’re lucky/careful, all kinds of failure if not.

    At most one group can own the spare bits here, even if they’re not used by anyone at the moment. Fortunately, in my experience, the documentation’s been reasonably good in identifying who that is, but the thought of applications doing this still fills me with dread.

  19. SuperKoko says:

    "How can we be sure those bits really are available when we are writing code at the Win32 level?"

    The two lower bits are documented to be zero.

    "Some APIs (the GetQueuedCompletionStatus function being the initial example) might use them now."

    I don’t care if it uses them internally. I just need to pass a proper HANDLE, and it has to work. If I store data in the two lower bits, I have to mask them off (myhandle_with_data & ~3) so that it becomes a proper HANDLE, of course!

  20. 640k says:

    In the old days a true handle was defined as "a pointer to a pointer"; how can any old application still work now that this concept has been broken by the kernel manipulating the addresses?

  21. Pierre B. says:

    SuperKoko: obviously, you didn’t understand my comment. The fact that the two lower bits are zero is a contract at the NT kernel level. Win32 lives above the NT kernel, it’s a separate layer, and is not necessarily bound by the promises made by the kernel.

    KJK::Hyperion already provided an example of a Win32 call (GetStdHandle) which will return a HANDLE with low bits set. You can’t "mask them out" when you make Win32 calls later on; it will not work!

  22. SuperKoko says:

    "Win32 lives above the NT kernel, it’s a separate layer, and is not necessarily bound by the promises made by the kernel."

    I thought it was guaranteed at the Win32 level too, but it looks like I’m wrong, as the documentation is in ntdef.h. Consequently, Raymond’s hack may not be portable to Windows 9x/Me or future non-NT-based versions of Windows, and should never be used by Win32 programs intending to be portable. Basically, it should never be used.

  23. Good Point says:

    I guess Raymond doesn’t update his mistakes anymore.

    As Adam pointed out, ECloseHandle() is wrong.  Also I don’t think CreateHandleFromExtendedHandle() was meant to accept an ExtendedHandle by value.

    [I guess Raymond is too busy working (y’know that thing that he gets paid to do) to update mistakes immediately after they are reported, preferring to wait a few days to let all the corrections settle out. -Raymond]

Comments are closed.