cbKeyMost when indexing needs additional overhead

I recently helped someone track down why they were getting JET_errKeyTruncated. They declared an index over a JET_coltypLongBinary column with cbMax = 0x20c, and when they created the index, they specified cbKeyMost = 0x20c, and also passed in VistaGrbits.IndexDisallowTruncation (they were using ManagedEsent).

But then they were getting JET_errKeyTruncated when setting a value with length 0x204, which is obviously less than 0x20c. But why?

If the column was JET_coltypLongText and sorted as Unicode, the answer would be obvious. I'll back up a bit though:

When ESE creates indices, it normalizes the keys first so that 'memcmp' is able to work on the raw memory. That way, the B-Tree code doesn't have to know what type a particular key is; it just knows raw bytes and lengths:

  • For unsigned integers, this is easy. (Although depending on the endian-ness of the platform, the bytes may need to be shuffled around first.)
  • For signed integers, they need to be biased first, and then can be sorted as signed numbers. (e.g. for bytes, add 128 to them first, so that instead of [-128 .. 127] we end up storing [0 .. 255], which sort naturally.) When retrieving from the key, this can be trivially reversed. (Useful with RetrieveColumnGrbit.RetrieveFromIndex)
  • Date/times are normalized to a floating point number.
  • ASCII/ANSI text is all converted to upper case. This is the first of the 'lossy' normalizations, since we don't know what the initial case was.
  • Unicode is more complicated. We use the OS's LCMapStringEx function to normalize the text. Because the world is a big place, it has a very hard problem to solve. 'U' and 'u' and 'u-with-umlaut' should all be sorted together (depending on the flags that are passed in). This ends up expanding a string to many times its original length. A 20-character string takes up 40 bytes, and the normalized binary form could be twice or even up to five times that size (200 bytes -- although that's pretty rare).

Now for LongBinary. Internally we break up the LongBinary values in to much smaller chunks, and add a bit of overhead. Every 8 bytes we store another header byte.

For that 0x204-length LongBinary column, we do:

(0x204 + (8-1) ) / 8 = 65. This is the number of 'chunks'. Each of which takes up 9 bytes. That's 585 bytes. And then add another header byte, and that's 586 bytes (0x24a). That's 70 extra bytes on a 516-byte column, or about 14% more.

Summary: You may need a fudge factor when calculating cbKeyMost. You can't just assume it's the same as the column's cbMax.