cbKeyMost when indexing needs additional overhead

Article
10/08/2013

I recently helped someone track down why they were getting JET_errKeyTruncated. They declared an index over a JET_coltypLongBinary column with cbMax = 0x20c, and when they created the index, they specified cbKeyMost = 0x20c, and also passed in VistaGrbits.IndexDisallowTruncation (they were using ManagedEsent).

But then they were getting JET_errKeyTruncated when setting a value with length 0x204, which is obviously less than 0x20c. But why?

If the column was JET_coltypLongText and sorted as Unicode, the answer would be obvious. I'll back up a bit though:

When ESE creates indices, it normalizes the keys first so that 'memcmp' is able to work on the raw memory. That way, the B-Tree code doesn't have to know what type a particular key is; it just knows raw bytes and lengths:

For unsigned integers, this is easy. (Although depending on the endian-ness of the platform, the bytes may need to be shuffled around first.)
For signed integers, they need to be biased first, and then can be sorted as signed numbers. (e.g. for bytes, add 128 to them first, so that instead of [-128 .. 127] we end up storing [0 .. 255], which sort naturally.) When retrieving from the key, this can be trivially reversed. (Useful with RetrieveColumnGrbit.RetrieveFromIndex)
Date/times are normalized to a floating point number.
ASCII/ANSI text is all converted to upper case. This is the first of the 'lossy' normalizations, since we don't know what the initial case was.
Unicode is more complicated. We use the OS's LCMapStringEx function to normalize the text. Because the world is a big place, it has a very hard problem to solve. 'U' and 'u' and 'u-with-umlaut' should all be sorted together (depending on the flags that are passed in). This ends up expanding a string to many times its original length. A 20-character string takes up 40 bytes, and the normalized binary form could be twice or even up to five times that size (200 bytes -- although that's pretty rare).

Now for LongBinary. Internally we break up the LongBinary values in to much smaller chunks, and add a bit of overhead. Every 8 bytes we store another header byte.

For that 0x204-length LongBinary column, we do:

(0x204 + (8-1) ) / 8 = 65. This is the number of 'chunks'. Each of which takes up 9 bytes. That's 585 bytes. And then add another header byte, and that's 586 bytes (0x24a). That's 70 extra bytes on a 516-byte column, or about 14% more.

Summary: You may need a fudge factor when calculating cbKeyMost. You can't just assume it's the same as the column's cbMax.

cbKeyMost when indexing needs additional overhead

Additional resources