Serialization and Ordering: An Integral Example


Some people peeked (and poked) under the covers and have been surprised at how SQL Server is twiddling their UDT bits during serialization.  Let’s peek together.


Consider the following fragment of a UDT:



[Microsoft.SqlServer.Server.SqlUserDefinedType(Format.Native)]
public struct Simple : INullable
{
  private bool isNull;
  public int val;


  public static Simple FromInt(SqlInt32 i)
  {
    if (i.IsNull)
      return Null;
    Simple u = new Simple();
    u.val = i.Value;
    return u;
  }


  …
}


Notice that we’re using native serialization in this UDT.


Now consider the following use of the UDT from T-SQL:



declare @s Simple
set @s = Simple::FromInt(3)
print convert(varbinary(100), @s)
set @s = Simple::FromInt(-3)
print convert(varbinary(100), @s)


What do we expect?  In other words, how do we expect SQL Server to serialize this UDT?  Here’s the result:



0x0080000003
0x007FFFFFFD


How do we explain this?  The leading “00” in each result looks like (and is) the stored form of the Boolean, but the next four bytes are a bit strange at first glance: they look like big endian two’s complement representations of the numbers, but with the first bit flipped.


In fact, that’s exactly right.  SQL Server ensures that the bytes are serialized in big endian order and then flips the first bit.


Why?  Sorting.  If the numbers are stored in this form, then the order of the binary representation agrees with the normal ordering of the integers, so comparisons can be done by comparing the bytes.


Why do we care about that?  It means that if we call the UDT byte-ordered, we (a) aren’t lying, and (b) can perform equality and range queries over it.


Cheers,
-Isaac K. Kunen, Microsoft SQL Server


Comments (5)

  1. That’s certainly interesting — and eye-opening!

    What advice can you give for handling this in user-defined serialization situations? If I switch this type to user-defined and serialize the bool followed by the int, it does not sort properly (as expected). Can this be easily resolved? I expect that the majority of developers working with this will be shocked when they find out that integers do not sort properly when serialized in this way.

  2. Isaac Kunen says:

    What advice can I give? Serialize things in an order-preserving way! 🙂

    Really, user-defined serialization is completely opaque to SQL Server. Since the server has no insight into what the user is doing, we can’t do any magic when it comes to ordering the bits.

    One thing we could consider doing—but really anyone could do—is to provide serialize/deserialize routines that do this for the user. The only advantage we have here is that we’ve already written this once. Generally, as you can see from the integer example, it’s not that tricky a transformation.

    Cheers,

    -Isaac

  3. Okay, well, what other types do we need to worry about? Numerics, obviously. Strings, I don’t think so. How about DateTime? Any trickiness there? And GUIDs sort by the last 6 bytes in SQL Server, so we should probably do something about that…

  4. Isaac Kunen says:

    I think you need to look at all of the types supported by BinaryWriter/Reader.

    Depending on what you think a bool should serialize as, it should be fine. (Flase maps to 0x00, true to 0x01.)

    Any integer type with more than one byte (short, ushort, int, uint, long, ulong) will need its bytes reversed to make them big endian.

    Any integer type that handles negative numbers (sbyte, short, int, long) is in two’s compliment, and therefore will need its first bit flipped.

    The integer type that is one byte and unsigned (byte) is fine.

    Floating point types (float, double, decimal) are not okay—how to handle them is left as an exercise for the reader.

    Arrays of bytes should be fine, assuming you want the sort order to match the order of the array—the bytes are fine individually, and the array of them are serialized in order.

    Character types (char, char[], and string) are very sensitive to the encoding. Strings are additionally length-prefixed.

    These types seem okay to me: bool, byte, byte[].

    These types look like they need munging: sbyte, short, ushort, int, uint, long, ulong, float, double, decimal, char, char[], string.

    Any other type would be handled by the user in entirety, and is completely user dependent. To serialize a DateTime, for example, I’d probably serialize the long ticks quantity which it wraps. That would have to be serialized like any long.

    Hope this helps.

    Cheers,

    -Isaac

  5. staceyw says:

    Don’t your need "IsByteOrdered=true" in the attribute of the type to get the compare operators?