SOA and BLOBs -- using SOA principles for block-oriented data transfer (Updated)

Abstract: What happens when a business transaction, in a Service Oriented Architecture, is too big to fit into a simple SOAP transaction?  This (updated) article describes a problem of this nature and the solution that allows block-oriented data transfer to work in an SOA-based application.

Introduction

Some things should simply not be done. 

If I have a large batch file, with ten thousand records in it, and I want to transfer it from point A to point B, using an SOA model to transfer each record, one at a time, is really dumb.  The folks who believe that "all things must be done in XML" will not gain any points with me on this.

On the other hand, sometimes a single record (a single business transaction) is big... big enough to consider block-oriented data transfer.  This article is about one such situation and how I am proposing to address it.

The big business transaction

I deal with documents.  Some of them are in "source" format (Word documents, PowerPoint presentations, InfoPath forms, even PDF documents), while others are simply scanned images of a document (in TIFF, mostly).  These documents, the metadata that describes them, and the relationships that bind them, can form the basis for a "set of documents" that the business can understand.  A good example would be the papers you have to sign when you buy a house.  There are deeds and warranty trusts and loan papers and all kinds of stuff.  I don't know what they all are, but I do remember that it took hours for my wife and me to sign them all.

Together, these documents make a package.

And now for the problem: we want someone to be able to submit all or part of a package of documents, from one party to another, over the web.  

Doesn't sound so hard, does it? Surely, we aren't the only folks to deal with something like this, but I haven't seen many examples of how this is done in other XML-based solutions.  Not even in legal e-filing, where this seems like a natural requirement.  Perhaps I just missed it. 

A "business document package" contains header information and many documents.  The list of documents changes over time.  In other words, I can create a set with four documents, add a fifth, then replace the third.  Each document can be a TIFF or another large-format document file (too big to fit in a SOAP message on HTTP). 

The SOA Mismatch

Service oriented architectures usually present the notion of a "business document" or "business transaction."  For the sake of clarity, I will use "business transaction" since my transactions themselves contain binary objects that just happen to contain documents... it would be too confusing to describe any other way.

So we have a business transaction.  This can be implemented in many ways.  SOA says that a document is self-contained and self-defining.  Therefore, the document set must be self-contained and self-defining.

Normally, in the SOA world, if a business transaction is updated, we could simply replace the entire transaction with entirely new values.  So, if a transaction is an invoice, we would find the existing invoice header, delete all the rows associated with it, replace the values in the header, and add in the rows from the document.  All this is done as "data on the inside." 

The problem is that the entire contents of the business transaction are huge.  Our self-contained transaction contains the header information and all of the scanned documents.  If each document is 2 MB, and we have 14 of them, then a 28 MB SOAP message starts to seriously stretch the capabilities of the protocol.  It is simply too big to fit into a SOAP message without serious risk of HTTP timeouts. 

So, we need the concept of an "incomplete" sub-transaction... and that's where the solution lies. 

(Note from Nick: we decided to go a different direction.  I've added details at the end of this posting.)

The SOA solution

In our interaction, we have two computers: the sending side, where the transaction originates, and the receiving side, which needs to end up with all the data.  Both sides are computer applications with database support underneath. 

The new transaction is created by the sending side.  It will send the document header and enough information for the receiver to know which documents survive into the final form.  Any existing documents that changed will be deleted from the receiving side.  All documents that don't exist on the receiving side, when this process is done, are represented as "incomplete" records in the receiving end's database, along with some size data.

Now, the sending side asks the receiving side for the id of a document that is marked as "incomplete".  The receiving side responds with a message stating that "SubDocument 14332 in document set AB44F is incomplete.  We have block 9 of 12".

The sending side will then go to the database and extract enough data to send just one block... in this case block 10.  That could be simply 100K in size.  Wrap that up in a SOAP message and send it.  The receiving side will get the message, which contains a complete header document and the contents of this block.  The interaction is done, and will start over with the sending side asking for the id of a document that is marked as incomplete.
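To make that loop concrete, here is a minimal sketch of the sending side in Python.  All of the names here (receiver.get_incomplete_document, database.read_bytes, receiver.send_block) are hypothetical stand-ins for the web service calls and database access described above, not a real API:

import base64

BLOCK_SIZE = 100 * 1024  # roughly 100K per block, as described above

def run_sender(receiver, database, document_set_id):
    """Keep asking the receiver what it is missing, and send one block
    at a time, until the receiver reports that everything is complete."""
    while True:
        # Ask: "what document is incomplete, and how many blocks do you have?"
        status = receiver.get_incomplete_document(document_set_id)
        if status is None:
            break  # "everything is complete" -- the loop terminates

        doc_id, blocks_received, total_blocks = status
        next_block = blocks_received + 1  # "we have block 9 of 12" -> send 10

        # Pull just enough bytes from the database for this one block.
        offset = blocks_received * BLOCK_SIZE
        raw = database.read_bytes(doc_id, offset, BLOCK_SIZE)

        # Wrap the block in a SOAP message (Base64 text inside the XML).
        receiver.send_block(
            document_set_id,
            doc_id,
            this_block=next_block,
            total_blocks=total_blocks,
            data=base64.b64encode(raw).decode("ascii"),
        )

Note that the sender keeps no conversation state of its own; everything it needs comes back in the receiver's answer.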

The conversation

So, it looks like this:

Sender sends:

<MyDocumentSet id="849C751C-FF5C-4438-A3F0-055B9EE786E3" >
   <Metadata Filer="Nick Malik" CaseNumber="ABC123" ---other stuff --- />
      <Contents>
         <Document id="EBDE445D-5C26-43da-A142-E12A350EC1B6" name="MyDocument1.pdf" --- other header info --- />
         <Document id="9E4F8C83-B2D1-4aee-8C53-B235D026CD1E" name="Document2.doc" --- other header info --- />
         <Document id="05B10DAA-2A01-406b-AAB0-6BAEEF98F7A8" name="MyDocument3.ppt" --- other header info --- />
         <Document id="7135612A-CE48-4371-ABFC-F8EF70DF76CF" name="MyDocument4.pdf" --- other header info --- />
      </Contents>
   </MyDocumentSet>

The receiver gets the message and checks to see if that document set already exists.  If it does not, it simply creates the document set on the receiver side with four incomplete documents.  A much more interesting case happens if the document set already exists on the receiver side... so let's look at that.

The receiver looks up document set 849C751C-FF5C-4438-A3F0-055B9EE786E3 and sees that it currently contains five documents.  The first three documents in the existing document set are named in the list above.  The fourth document above doesn't exist in the existing document set, so it is an addition.  The other two documents in the destination document set must be deletions.

So we delete the two extra documents on the receiver side and add a document for MyDocument4.pdf, and flag it as incomplete.
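That reconciliation step is just set arithmetic between the incoming document list and what the receiver already holds.  A minimal sketch, with invented store methods standing in for the receiver's database:

def reconcile_document_set(store, set_id, incoming_docs):
    """Compare the document list in the incoming header against what we
    already have, then delete, add, and flag as described above.
    `incoming_docs` maps document id -> header info; `store` is a
    stand-in for the receiver's database."""
    existing_ids = set(store.document_ids(set_id))
    incoming_ids = set(incoming_docs)

    # Documents we hold that are not named in the header are deletions.
    for doc_id in existing_ids - incoming_ids:
        store.delete_document(set_id, doc_id)

    # Documents named in the header that we do not hold are additions:
    # create an "incomplete" record so the block transfer can begin.
    # (A changed document is handled the same way: its old record is
    # deleted, and it reappears here as a new incomplete record.)
    for doc_id in incoming_ids - existing_ids:
        store.create_incomplete_document(set_id, doc_id, incoming_docs[doc_id])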

Now, the sender asks the receiver for the id of any incomplete documents.  The receiver replies with the id of the fourth row above: 7135612A-CE48-4371-ABFC-F8EF70DF76CF and the fact that no blocks of data have been successfully stored.

The sending side gets this response and decides to send block one of that document.  It goes to the database, gets the first 50,000 bytes of data, encodes them with Base64 encoding, and sends them to the receiver as the following:

<MyDocumentSet id="849C751C-FF5C-4438-A3F0-055B9EE786E3" >
   <Metadata Filer="Nick Malik" CaseNumber="ABC123" ---other stuff --- />
      <DocumentBlock id="7135612A-CE48-4371-ABFC-F8EF70DF76CF" name="MyDocument4.pdf" --- other header info --- >
           <Block totalblocks=12 thisblock=1 size=50000>
FZGl0OzI0NTk2MDs+Pjs+Ozs+O3...a really long string of base64 characters ... 
           </Block>
      </DocumentBlock>
   </MyDocumentSet>

The receiver now appends this data to the current document on the receiving end.  Note that the receiver "knows" that, even though this message is complete, the document is not complete, because this is block 1 of 12 (see the <Block> tag above).
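The receiving end's block handler is little more than decode, append, and a completeness check.  Again a sketch, with hypothetical store methods:

import base64

def handle_document_block(store, set_id, doc_id, this_block, total_blocks, data):
    """Decode one block and append it to the partially received document.
    When the last block arrives, mark the document complete so the
    sender's next "what is incomplete?" query comes back empty."""
    raw = base64.b64decode(data)
    store.append_bytes(set_id, doc_id, raw)
    store.set_blocks_received(set_id, doc_id, this_block)

    if this_block == total_blocks:
        store.mark_complete(set_id, doc_id)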

The sender then asks again: what documents are not complete.

The receiver responds again: 
Document 7135612A-CE48-4371-ABFC-F8EF70DF76CF is not complete... we only have one block of 12. 

The sender sends block 2... and on it goes until the last block is sent.  At this point, the receiver gets the final block, appends the last set of data to the database, and marks the document as complete.  The next time the sender asks "what is not complete," the receiver responds "everything is complete."

The loop terminates.

The motivation for doing block-oriented data transfer this way

Certainly, we could use FTP or some other mechanism for file transfer.  This method, though, has some interesting characteristics.  First off, this protocol is stateless: each request and response stands on its own.  That means that, at any time, the sender could stop asking about the status of documents on the receiver side, and nothing is lost.  The sender can go offline, go to sleep, or lose connectivity, and nothing bad happens.

Secondly, because the block sizes are relatively small, SOAP doesn't time out.  We can handle extraordinarily large files this way (theoretically in the terabyte range).

Thirdly, the sender doesn't have to know much about the receiver.  It doesn't have to know if the document set already exists in the database on the receiver side, because the header data is sent with every block.  Therefore, no Commands are being sent.  (See my previous blog on "commandless" documents).

Pros and Cons (updated)

At the time of my first posting, this idea was being floated to our development team.  There are pros and cons to this solution that I can discuss in more detail now.

The advantage of this model is that the receiving side never gets data that it doesn't want or doesn't know what to do with.  The sending side asks "what do you need," and the receiving side responds with "file X, block 10."  However, this is still a communication protocol.  If the sending side decides not to ask, the receiving side has no option but to leave the content of its database incomplete. 

This is (a) counter-intuitive, and therefore hard to explain to business users and the development team alike (as I have discovered), and (b) a mixing of the details of data transmission with the details of data representation.  I hadn't thought carefully about this when I first wrote it but, in hindsight, it's a bad idea.

An SOA transaction should be complete, self-describing, and self-contained.  The process above saves us from sending the same bits more than once over a wire.  That's its biggest advantage.  But that's not our biggest cost.  All of the wires that I care about, in my application, are owned by my company, and they are utilized at a fairly low rate.  Therefore, we don't save any measurable dollars by making the data transfer process efficient.

On the other hand, if we separate the data transmission from the data representation, then we can test each separately.  I can test data transmission of a 20 GB file by transmitting any 20 GB file and comparing the result with the original.  I can test data representation by creating the business document on one end and copying it to the other using sneaker-net (walking it over from one dev machine to another).  This test isolation is important for reducing complexity, and that will save measurable dollars... real money from my bottom line.

The forces that led us to SOA still exist: we want to decouple the sides from each other and we must transfer these large transactions over HTTP or HTTPS connections.

The new interim solution

We decided to separate the data transmission from the data representation.  Therefore, we will create an envelope schema that simply provides a transaction id, the current block number, the total number of blocks, and a data field. 

So a transmission could look like this:

<Transmission id="39B2A4DD-AD68-4ae9-AA68-FCC6A48A0FFA">
           <Block totalblocks=12 thisblock=1 size=50000>
FZGl0OzI0NTk2MDs+Pjs+Ozs+O3...a really long string of base64 characters ... 
           </Block>
</Transmission>

What goes in the Block?  A Base64-encoded form of the entire business transaction itself (possibly compressed).

The receiving side will collect all the blocks, assemble the actual stream, decode it, and load it into an XML object.  From that, we can extract the embedded documents.
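A sketch of both ends of that envelope, assuming the serialized business transaction is already in hand as a byte string (compression, discussed below, would slot in just before the split).  The function names are mine, for illustration only:

import base64

BLOCK_SIZE = 50000  # matches the size attribute in the example above

def make_transmission_blocks(payload: bytes):
    """Split the serialized business transaction into numbered blocks,
    each carrying its own Base64 data field."""
    total = (len(payload) + BLOCK_SIZE - 1) // BLOCK_SIZE
    for i in range(total):
        chunk = payload[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
        yield {
            "thisblock": i + 1,
            "totalblocks": total,
            "data": base64.b64encode(chunk).decode("ascii"),
        }

def assemble_transmission(blocks):
    """Receiver side: put the blocks back in order, decode, and return
    the original byte stream, ready to load into an XML object."""
    ordered = sorted(blocks, key=lambda b: b["thisblock"])
    return b"".join(base64.b64decode(b["data"]) for b in ordered)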

This data is not optimized for transmission

We get a lot of inefficiency in this data format.  If we haven't thought seriously about compression before, it's starting to become important now.  Here's why:

Say the uploaded document, in PDF form, is a page of text.  Notepad would represent it as about 1K.  In PDF, it would be about 5K, because PDF includes things like fonts and formatting.  That's fine. 

In our business document, that 5K becomes 6.7K, because we are embedding it as Base64 text.  Base64 is a format that represents three bytes (24 bits) as four characters of six bits each (24 bits), a 33% size increase.  Add about 2K of header information (to make our document complete) and the business transaction hits 8.7K.  At this point, we take that 8.7K transaction and encode it, again, as Base64 for the sake of block transfer.  We now get 11.6K.

Our PDF went from 5K to 11.6K.  That's more than double its original size, and that's assuming UTF-8 encoding in the XML.  If we go with UTF-16 encoding, the XML files can hit 20K. 

On the other hand, if we compress just before we pack the data into blocks for transmission, we can take that 8.7K document and compress it down to just over 5K.  (The Base64 text compresses well, but the PDF content inside it is already effectively random, so compression can't shrink it much below its original size.)  When we take that 5K result and encode it in Base64, we go back up to 6.7K.  Now, that is efficient for data transmission.

The receiving side has to decompress, of course, but this may be worth it.
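The arithmetic above is easy to check.  Here is a sketch of the compress-then-encode pipeline using Python's standard zlib; the exact ratios depend on the document, so the numbers it prints are illustrative only:

import base64
import zlib

def pack_for_transmission(transaction_xml: bytes) -> bytes:
    """Compress the business transaction first, then Base64-encode the
    result for block transfer.  Compression squeezes the Base64 bloat
    back out of the embedded documents; the final encode costs ~33%."""
    return base64.b64encode(zlib.compress(transaction_xml))

def unpack_from_transmission(packed: bytes) -> bytes:
    """Receiver side: decode, then decompress, recovering the original
    business transaction."""
    return zlib.decompress(base64.b64decode(packed))

# Illustrative round-trip check (real documents will compress differently):
if __name__ == "__main__":
    doc = b"example business transaction " * 300  # stand-in payload
    packed = pack_for_transmission(doc)
    assert unpack_from_transmission(packed) == doc
    print(len(doc), "bytes before,", len(packed), "bytes after")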

Conclusion

After reviewing the initial proposal to embed the data transmission mechanism directly into the data representation structure, we rejected the idea in favor of a mechanism that wraps the data representation structure with a data transmission structure.  This allows us to test data transmission separately from data representation.  It also allows us to stick to the original idea of keeping all of the business data together in a single business transaction, regardless of how large it grows.