Binary Encoding, Part 6

09/28/2009

Last time we looked at some of the patterns used in the binary format for reducing the size of a document. So far we've managed to trim about 12 bytes off the Envelope element that I've been using as an example. However, we can still do better given that we'll likely be sending that same element again and again.

String tables are a way of compressing repetitive text. The table associates a token with each string. As the message is being output, the writer replaces each occurrence of a string in the table with its token. The reader of the message receives both the encoded text and the string table, which allows the reader to reverse the process.
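As a rough illustration of the idea, here is a minimal Python sketch of a writer and reader sharing a string table. The table contents, the token values, and the function names are all made up for this example; they are not taken from the binary format itself.

```python
# Hypothetical string table: token -> string. The tokens are arbitrary
# small integers chosen for this sketch only.
STRING_TABLE = {0: "Envelope", 1: "Header", 2: "Body"}
REVERSE_TABLE = {text: token for token, text in STRING_TABLE.items()}


def encode(names):
    """Writer side: replace each known string with its token."""
    return [REVERSE_TABLE.get(name, name) for name in names]


def decode(items):
    """Reader side: replace each token with the string it stands for."""
    return [STRING_TABLE[item] if isinstance(item, int) else item
            for item in items]


encoded = encode(["Envelope", "Custom", "Body"])
decoded = decode(encoded)
```

Strings that are not in the table pass through unchanged, so the reader recovers the original sequence as long as it holds the same table as the writer.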

There are many ways that the string table might be communicated. For example, the table might be sent along with the text in a header, as is typical for compression programs. The simplest mechanism, though, is to assume that the reader and writer have exchanged the string table ahead of time through some out-of-band exchange. Then there's no need to represent the string table or any information about it during the message exchange.

We sometimes use the binary format together with a signal that such an implied string table is in use. This table is called the static string table. Our static string table is populated with a variety of strings related to SOAP messages. For example, here are the first eight strings in the table (out of many dozens):

mustUnderstand
Envelope
www.w3.org/2005/08/addressing
www.w3.org/2003/05/soap-envelope
Header
Action
To
Body

The strings in the static string table are given even-numbered tokens, starting from zero. Therefore, the list I gave you represents the strings for the tokens 0, 2, 4, 6, and so on up to 14.
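The mapping between the listed strings and their tokens can be sketched in a few lines of Python. This is only an illustration built from the eight strings shown above; the names STATIC_STRINGS, token_for, and string_for are hypothetical, and the real table is much longer.

```python
# The first eight static string table entries, exactly as listed above.
STATIC_STRINGS = [
    "mustUnderstand",
    "Envelope",
    "www.w3.org/2005/08/addressing",
    "www.w3.org/2003/05/soap-envelope",
    "Header",
    "Action",
    "To",
    "Body",
]


def token_for(text):
    """Return the even token for a string: position in the list times two."""
    return 2 * STATIC_STRINGS.index(text)


def string_for(token):
    """Return the string for an even token from this partial table."""
    if token < 0 or token % 2 != 0:
        raise ValueError("this sketch only handles even, non-negative tokens")
    return STATIC_STRINGS[token // 2]
```

For example, token_for("Envelope") gives 2 and string_for(14) gives "Body".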

Now I can introduce a third (and fourth) kind of record for declaring an element. A dictionary element record is similar to an element record except that the length of the element name and the bytes for the element name are replaced by the value of the token. Similarly, there are prefix dictionary element records, analogous to the prefix element records I introduced last time, that let us skip both the prefix string and the element name string.

The record types 0x44 through 0x5D represent prefix dictionary elements for the prefixes that are the lowercase letters "a" through "z". Counting from 0x44 for "a", the prefix "s" corresponds to record type 0x56. We can now shrink the element record I've been using as an example even further. The string Envelope has the token 0x02, so we can write the s:Envelope element record in merely two bytes:

0x56, 0x02
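The arithmetic behind those two bytes can be sketched in Python as follows. The function name is hypothetical, and the sketch assumes the token is small enough to fit in a single byte, as it does for 0x02; how larger tokens are written is not covered here.

```python
def prefix_dictionary_element(prefix, token):
    """Build the two-byte record for a single-letter prefix and a token.

    Assumption for this sketch: the token fits in one byte (below 0x80).
    """
    if len(prefix) != 1 or not ("a" <= prefix <= "z"):
        raise ValueError("prefix must be a single lowercase letter a-z")
    if not 0 <= token < 0x80:
        raise ValueError("this sketch only handles single-byte tokens")
    # Record types 0x44 through 0x5D map to the prefixes "a" through "z".
    record_type = 0x44 + (ord(prefix) - ord("a"))
    return bytes([record_type, token])


record = prefix_dictionary_element("s", 0x02)
```

Here record holds the bytes 0x56 and 0x02, matching the s:Envelope record above.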
