Binary Encoding, Part 5

The problem we saw last time was that structural reduction of message fragments does not produce significant savings when the message is small. Although we are shaving a few bytes off each element (closing an element saves 2 bytes plus the length of the element name), the number of elements is small. On the other hand, a SOAP message contains a lot of boilerplate content, which ends up dominating the size of the message. We can use what we know about the structure of the message to make the encoding more efficient.

Let's take another look at the first record we saw last time, the s:Envelope element record.

0x41, 0x01, 0x73, 0x08, 0x45, 0x6E, 0x76, 0x65, 0x6C, 0x6F, 0x70, 0x65
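As a reminder of how those twelve bytes break down, here is a minimal Python sketch that reads the record back. It assumes that 0x41 is followed by a length-prefixed prefix string and then a length-prefixed name string, that each length fits in one byte, and that the text is UTF-8. The function names are made up for illustration.

```python
# Sketch only: decodes the single record shown above.
RECORD = bytes([0x41, 0x01, 0x73, 0x08, 0x45, 0x6E, 0x76, 0x65, 0x6C, 0x6F, 0x70, 0x65])

def read_string(data, pos):
    """Read a length-prefixed string; return (text, position after it)."""
    length = data[pos]  # assumes a one-byte length, i.e. a value below 128
    start = pos + 1
    return data[start:start + length].decode("utf-8"), start + length

def decode_element(data):
    """Decode an element record of type 0x41 into (prefix, name, bytes used)."""
    if data[0] != 0x41:
        raise ValueError("this sketch only handles record type 0x41")
    prefix, pos = read_string(data, 1)
    name, pos = read_string(data, pos)
    return prefix, name, pos

prefix, name, used = decode_element(RECORD)
print(f"{prefix}:{name} uses {used} bytes")
```

The record spends one byte on the record type, two on the prefix, and nine on the name.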

This element record has three common patterns in it that we can try to exploit.

The first common pattern is something that we're already doing without my having explained it. You'll notice that we've represented the length of each string in this example with a single byte. We'll frequently have very short strings like these, but we will occasionally need longer strings as well. We don't want to pay for the maximum size of a value in every record, though, as we'll have a lot of these size fields throughout the message.

We are using a common encoding trick here, the variable-sized integer. We start without assuming how long the integer is. If the value is between 0 and 127, then we'll store it as a single byte as you'd expect. If the value is between 128 and 16383, then we'll set the high bit of the first byte to 1 and fill its other seven bits with the low seven bits of the value. The second byte will have the remainder of the bits from the value, with its high bit left at 0. We can keep doing this expansion for more and more bytes by always using the high bit of each byte to say whether more bytes are coming or whether this is the last byte. For example, 128 is stored as 0x80, 0x01.
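The scheme can be sketched in a few lines of Python. This is an illustration rather than the actual implementation: the function names are made up, and it assumes the low-order seven bits are written first, with the high bit of each byte acting as the "more bytes follow" flag.

```python
def encode_varint(value):
    """Encode a non-negative integer, seven bits per byte, low bits first."""
    if value < 0:
        raise ValueError("only non-negative values can be encoded")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes are coming
        else:
            out.append(low7)         # high bit clear: this is the last byte
            return bytes(out)

def decode_varint(data, pos=0):
    """Decode one integer starting at pos; return (value, position after it)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

for n in (8, 127, 128, 16383, 16384):
    print(n, encode_varint(n).hex(" "))
```

Values up to 127 take one byte, values up to 16383 take two, and 16384 is the first value to need three.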

The second common pattern is that elements very commonly have a short prefix. Instead of storing that prefix as even a short string, we can use some of our record types to precompose the element record with the prefix. We don't have an unlimited number of record types, so we can't do this for every prefix, but we have done it for the lowercase letters "a" through "z" since these single-letter prefixes are used so often. These are the record types 0x5E through 0x77. The prefix "s" is the 19th letter, so its record type is 0x5E + 18 = 0x70. For example, the same s:Envelope element record as above can shave off another two bytes by writing it as:

0x70, 0x08, 0x45, 0x6E, 0x76, 0x65, 0x6C, 0x6F, 0x70, 0x65
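Here is a rough Python sketch of how a writer might choose between the two forms. The constant and function names are invented for this example, and it assumes one-byte string lengths and UTF-8 text. Only the record type values 0x41 and 0x5E through 0x77 come from the discussion above.

```python
ELEMENT = 0x41           # general form: prefix string, then name string
PREFIX_ELEMENT_A = 0x5E  # first of 26 record types, one per prefix "a" to "z"

def encode_string(text):
    """Length-prefixed string; this sketch supports one-byte lengths only."""
    raw = text.encode("utf-8")
    if len(raw) > 127:
        raise ValueError("this sketch only handles strings under 128 bytes")
    return bytes([len(raw)]) + raw

def encode_element(prefix, name, use_shortcut=True):
    """Encode an element record, folding an a-z prefix into the record type."""
    if use_shortcut and len(prefix) == 1 and "a" <= prefix <= "z":
        record_type = PREFIX_ELEMENT_A + (ord(prefix) - ord("a"))
        return bytes([record_type]) + encode_string(name)
    return bytes([ELEMENT]) + encode_string(prefix) + encode_string(name)

long_form = encode_element("s", "Envelope", use_shortcut=False)
short_form = encode_element("s", "Envelope")
print(len(long_form), long_form.hex(" "))
print(len(short_form), short_form.hex(" "))
```

The two bytes saved are the prefix length and the prefix character itself, since the record type now carries that information.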

The third and final common pattern is that almost every SOAP message will have an element called Envelope in it. We'll spend the remainder of the series looking at static and dynamic string tables to intern these common names.