Using Service Oriented Architecture concepts for database replication

Is SOA really useful for database replication?  Yes and no.  This posting will discuss a dilemma that a project of mine faced, and some of the principles that I needed to share with the developers before they could see (a) the problem, and (b) the solution using SOA.

The problem we faced

Here is a specific problem.  Imagine that you have two separate web sites, hosted in their own environments behind firewalls (in their own DMZs).  Machines in area A cannot route packets to machines in area B and vice versa.  No outbound communication, period.  There is a database in area A that needs to be synchronized with another database in area B.  These databases are NOT identical.  They are, in fact, parts of two different applications.  However, there are some tables where rows originate in area A and have to get to the database in area B, and some tables where rows originate in area B and have to get to the database in area A.

(This is not as bizarre as it may sound.  In one of the dot.com companies where I was a co-founder, we had two hosting environments: one for public users and one for business partners.  They were hosted in different physical locations.  The systems had different SLA requirements and different security requirements... and they couldn't route packets between themselves either.  Deja vu.)

So, how to do it?

Well, as you can imagine, there has to be a way for someone to communicate with the hosting server, and there is.  From the corporate network, I can access area A over port 80.  I can access area B over many ports, including port 80.  The traffic between my desk and these areas is entirely secure... it does not leave the corporate network and travel on the open internet at any time.  While the servers in area A cannot initiate traffic to me, they can respond to a request that I send in.  Same for area B.

For shorthand, and to keep this discussion brief, I will pretend that there is only one machine in each area, and I will call them Server A and Server B.  The concepts are the same.  Also, for the sake of simplicity, let's pretend that each machine is both an application server and a database server.  They are NOT in the same Windows domain, nor does the domain on either side trust the domain on the other.

Our first solution

We decided to create a little Windows service that would run in the secure corporate network.  The service would make a request to Server A, get a bunch of data, and then pass it, as a request, to Server B.  Server B adds it to the database.  The service would then go in reverse, and make a request to Server B.  The service would pass the response to a web service on Server A, which would store the data.

Seems pretty straightforward, and we weren't planning to use SOA at all.  The request for data going from A to B would start by asking a web service on Server A for all of the rows from a specific table that were older than a specific time.  The response would contain a small number of rows (less than 10,000).  We would pass these transactions to Server B one at a time.  We always ran the tables in a specific order (to get around any particular dependencies between the tables).
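To make the problems easier to spot, here is a minimal sketch of that first design.  The interfaces, types, and names below are mine, invented for illustration, not the project's actual code: pull whole tables from Server A in one shot, then push the rows to Server B one at a time, in a fixed table order.

using System;

public interface ISourceService
{
    // Returns every row from the named table that is older than the cutoff.
    RowData[] GetRowsOlderThan(string table, DateTime cutoff);
}

public interface IDestinationService
{
    // Inserts a single row into the named table.
    void InsertRow(string table, RowData row);
}

public class RowData { /* column values for one row */ }

public static class FirstPassSync
{
    public static void SyncAToB(ISourceService serverA, IDestinationService serverB, DateTime cutoff)
    {
        // Fixed table order so parent rows land before their children.
        foreach (string table in new[] { "InvoiceHeader", "InvoiceLine", "InvoiceLineDetail" })
        {
            RowData[] rows = serverA.GetRowsOlderThan(table, cutoff);   // one big extract per table
            foreach (RowData row in rows)
                serverB.InsertRow(table, row);                          // pushed one row at a time
        }
    }
}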

Let's see a show of hands for everyone who can see the problems with this.

  • First off, all the data coming from Side A and going to Side B has to be extracted in a single transaction.  This is because there are relations between the tables, and we can't run the risk that, in between extracting data from a header table and extracting data from a details table, someone inserts data into both.  That would make the data in the details table newer than the data we just extracted for the header table.  We'd get foreign key errors on the destination side.
  • Secondly, it scales badly.  Since the data feed has to occur as a large pile of data (see the first problem), we now have a batch process.  If the amount of data that has to be moved starts to really grow, we have to be concerned with the system impact that occurs when this process kicks off.  Will the sites slow down?  Will they run at all?  We start going to 4am processes... which means our data can become many hours old (and usually will).
  • Thirdly, if we increase the number of servers on either end, coordination gets hairy almost instantly, because a destination has either received the latest batch of updates or it has not... which means that the results of a query on one destination are not going to differ from another destination by only one record... they are going to differ by many records, possibly thousands of records.

Here's where I say: SOA fixes everything.  Well, it won't give us world peace, but it does fix these flaws in a reasonable manner.

First off, to make life easier, we put a web service both on Server A and on Server B.  The web service would handle the job of talking with the local database server.  Now, we can communicate over port 80 from "replication central" on the corporate network.  This makes the security guys happy, because we don't need other ports.  Also, the code in each web service is very similar, so it can be derived from the same code base, thus reducing the amount of code to test and maintain.

OK, so our infrastructure now has a database and a web service on Server A, and the same on Server B.  We have a windows service in the corporate network that can access both web services and make requests.  We will call this "replication central" or the RC.  One more important detail: the values used for primary keys, for the tables on BOTH sides, are GUID values.  We do not need to match against some arcane query on the destination side to determine if a row is already there.  This is all just component driven architecture.

Here's where it becomes SOA. 

When the RC calls up Server A, we won't ask for a table.  We will ask for a business document.  The document will contain one independent header row and all of its dependent detail rows (and more... read on), in a single .NET object serialized to XML.  I can't describe the content of this object just yet... not until we work out a few problems.
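As a rough illustration, here is what such a document might look like as a serializable .NET type.  The class and property names are mine, invented for this example, not the project's actual types; the point is that the header, its detail rows, and (as described later) any supporting domain data all travel together in one object.

using System;
using System.Xml.Serialization;

[XmlRoot("Invoice")]
public class InvoiceDocument
{
    [XmlAttribute("ID")]
    public Guid InvoiceId;            // header GUID; also used later as the transaction id

    public string CustomerName;
    public DateTime InvoiceDate;

    [XmlElement("Line")]
    public InvoiceLine[] Lines;       // every dependent detail row rides along with the header
}

public class InvoiceLine
{
    [XmlAttribute("ID")]
    public Guid LineId;

    public string Description;
    public int Quantity;
}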

What rows do we send?

First problem is this: "how do we know what header row to send?" 

I don't like dates for this kind of thing, so we are putting in "dirty flags" for all of the header rows.  The logic works like this: If the application updates any row in a table that gets replicated, we will find a header row (in that or another table) that should get its "dirty" field set to true.

For example: let's say we have an invoice header, and invoice lines, and each line can contain further detail lines (an invoice of clothing, where a line is "Docker Slacks" with sublines for a quantity broken out for each size and style).  Now, someone updates the quantity of plaid golf pants in Men's size 42 from 60 pairs to 80 pairs.  The invoice header will get the dirty flag set, because when we send the business document from one place to the other, it is the business document that decides what detail rows to pull in, including this one.
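In code, the rule is simply "whoever touches a replicated row also marks the owning header dirty."  A minimal sketch, where the table and column names (InvoiceLineDetail, InvoiceHeader, DirtyFlag, DirtyFlagChangedAt) are assumptions of mine rather than the project's schema:

using System;
using System.Data.SqlClient;

public static class InvoiceData
{
    // Saves a detail row and marks its owning header dirty so the whole
    // business document gets picked up on the next replication pass.
    public static void SaveLineDetail(SqlConnection conn, Guid invoiceId, Guid detailId, int quantity)
    {
        using (var update = new SqlCommand(
            "UPDATE InvoiceLineDetail SET Quantity = @qty WHERE DetailId = @detailId", conn))
        {
            update.Parameters.AddWithValue("@qty", quantity);
            update.Parameters.AddWithValue("@detailId", detailId);
            update.ExecuteNonQuery();
        }

        using (var markDirty = new SqlCommand(
            @"UPDATE InvoiceHeader
              SET DirtyFlag = 'dirty', DirtyFlagChangedAt = GETUTCDATE()   -- timestamp explained later
              WHERE InvoiceId = @invoiceId", conn))
        {
            markDirty.Parameters.AddWithValue("@invoiceId", invoiceId);
            markDirty.ExecuteNonQuery();
        }
    }
}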

Domain tables

Another problem comes from domain tables.  Let's say we have a "country" table, and a user enters a new country on Server A, and then adds a customer who lives in that new country.  When we send the new customer over the wire, we have to make sure that the new country gets registered first.  Should we have a separate transaction for countries?  After all, they are independent values... or are they?

I would state that the country table is not a business document.  It is a convenience: a data standard that all customers must adhere to, but not a business document.  Therefore, when we create transactions, in this case, we would not have a transaction called "country."  Instead, we would send the customer record from Server A to Server B, and in the document, we would have all of the information needed to create a country record for this user in the event that this is a new country.

So a snippet from the XML could look like this:

<Customer>
    <Address>
        <Line1>123 Main Street</Line1>
        <Line2>Apt C</Line2>
        <Country ID="{738A7639-C24D-42a6-8DB7-52C07B62781F}"
            Code="USA" Desc="United States">USA</Country>
    </Address>
</Customer>

Notice that the Country tag has attributes that completely define the country of the customer.  That way, when the customer information is passed from one machine to another, we can check to see if the country "USA" already exists.  If it does not, we add it using the attribute data.  Otherwise, we simply use the key value in our customer record on the destination side.
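On the destination side, handling the embedded domain data might look something like the sketch below.  The table and column names are assumptions for illustration: check for the GUID carried in the document's attributes, register the country if it is missing, and then let the customer row reference it.

using System;
using System.Data.SqlClient;

public static class DomainData
{
    // Makes sure the country referenced by an incoming customer document exists
    // locally, using the ID/Code/Desc attributes carried inside the document.
    public static void EnsureCountryExists(SqlConnection conn, Guid countryId, string code, string description)
    {
        using (var check = new SqlCommand(
            "SELECT COUNT(*) FROM Country WHERE CountryId = @id", conn))
        {
            check.Parameters.AddWithValue("@id", countryId);
            if ((int)check.ExecuteScalar() > 0)
                return;   // already registered; the customer row just uses the key value
        }

        using (var insert = new SqlCommand(
            "INSERT INTO Country (CountryId, Code, Description) VALUES (@id, @code, @desc)", conn))
        {
            insert.Parameters.AddWithValue("@id", countryId);
            insert.Parameters.AddWithValue("@code", code);
            insert.Parameters.AddWithValue("@desc", description);
            insert.ExecuteNonQuery();
        }
    }
}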

One caveat to this: this idea applies to simple domain tables. It is not useful when an otherwise small transaction refers to a much larger one.  For example, if a transaction represents a request for the status of a shipment, the shipment itself should not (normally) be contained in the request.  In this case, the list of possible responses would include the response: "never heard of that shipment id." 

Is it an update or an insert?

One common problem is that the data that comes over in a feed has an indicator that says "is this new data or updated data?"  My answer is "who cares."  Really.  If a document exists on one side, it must exist, in exactly the same format, on the other side.  There's no variation in this concept.  The source side has a single representation for an object, and it knows that some element, somewhere in it, has changed.  Given the concept of a single dirty flag on the header record, the source system is not actually able to say which field changed to cause the need for an update.  The document is simply dirty and needs to be sent again. 
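In other words, the receiver treats every document as an upsert keyed on the GUID.  A minimal sketch, again with assumed table and column names:

using System;
using System.Data.SqlClient;

public static class DocumentWriter
{
    // Applies a header row from an incoming document: update if the GUID is
    // already known, insert if it is not. No insert/update flag is needed.
    public static void UpsertInvoiceHeader(SqlConnection conn, Guid invoiceId, string customerName, DateTime invoiceDate)
    {
        using (var update = new SqlCommand(
            "UPDATE InvoiceHeader SET CustomerName = @name, InvoiceDate = @date WHERE InvoiceId = @id", conn))
        {
            update.Parameters.AddWithValue("@id", invoiceId);
            update.Parameters.AddWithValue("@name", customerName);
            update.Parameters.AddWithValue("@date", invoiceDate);

            if (update.ExecuteNonQuery() > 0)
                return;   // the row existed and has been brought up to date
        }

        using (var insert = new SqlCommand(
            "INSERT INTO InvoiceHeader (InvoiceId, CustomerName, InvoiceDate) VALUES (@id, @name, @date)", conn))
        {
            insert.Parameters.AddWithValue("@id", invoiceId);
            insert.Parameters.AddWithValue("@name", customerName);
            insert.Parameters.AddWithValue("@date", invoiceDate);
            insert.ExecuteNonQuery();
        }
    }
}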

An important notion is that the document does NOT contain any instructions for how the receiving end is supposed to process it.  That is up to the receiver.  The sender has no say in the matter.  For the sender to be aware of this creates coupling between the sender and the receiver, and that is not healthy.

How do we handle transactions?

In our situation, we can't create a distributed transaction.  We have web services.  We have data in a document.  So, how do we know if a transaction has been successfully applied?  Go back to old EDI principles for this one: the receiver sends back an acknowledgement.  So the Replication Central (RC) service asks Server A for a document (just one).  In a stateless response, Server A gives up a document.  RC sends that document to Server B, on a synchronous call.  Server B applies logic to decide if rows need to be added or updated (this is why the GUID keys are so important).  If it works, Server B replies with "Thank you"; if not, it replies with "That didn't work... and here's why".  (Architecturally, this could be done asynchronously, but doing it synchronously avoids the issue of handling callbacks when Server B cannot send unsolicited messages to the RC.)

Now, the RC knows that the transaction was applied, but Server A does not.  So the RC makes another call to Server A, this time just to acknowledge the document and to turn off the "dirty" flag. 
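Put together, one round trip through the RC might look like the sketch below.  The interfaces stand in for the web service proxies, the method names are mine, and InvoiceDocument is the document type sketched earlier; none of this is the project's actual code.

using System;

public interface ISourceEndpoint
{
    InvoiceDocument GetNextDocument();          // returns null when nothing is dirty
    void AcknowledgeDocument(Guid documentId);  // clears the dirty flag at the source
}

public interface IDestinationEndpoint
{
    AckResult ApplyDocument(InvoiceDocument doc);   // synchronous apply on the destination
}

public class AckResult
{
    public bool Succeeded;
    public string ErrorMessage;
}

public static class ReplicationCentral
{
    // Returns true if a document was moved, false if Server A had nothing to send.
    public static bool ReplicateOneDocument(ISourceEndpoint serverA, IDestinationEndpoint serverB)
    {
        InvoiceDocument doc = serverA.GetNextDocument();
        if (doc == null)
            return false;

        AckResult result = serverB.ApplyDocument(doc);
        if (result.Succeeded)
            serverA.AcknowledgeDocument(doc.InvoiceId);   // the header GUID doubles as the transaction id
        else
            Console.Error.WriteLine($"Document {doc.InvoiceId} rejected: {result.ErrorMessage}");

        return true;
    }
}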

The logic of the dirty flag is not Boolean.  An item is not just "clean" or "dirty"; there are three states: "clean", "dirty", or "in process."  When the RC asks for a record, the web service on the database server needs to get one record that is marked as dirty.  It will compile a document.  Then, before sending the response, it will mark that record as "in process".  This is still dirty, but there's a catch.

If the user modifies a record while it is "in process", we set the dirty flag back to "dirty".  (The logic here is simple, since we are already setting the dirty flag when we modify a record... we don't care if it was "clean" or if it was "in process".)  Note that we also need a date field recording when the dirty flag value was changed (the reason will be clear in a moment).

The distinction comes when the acknowledgement is received.  If we acknowledge a record that is "in process", we should set it to "clean".  However, if we acknowledge a record that has been updated after the transaction was sent out (and is therefore "dirty"), it stays "dirty".  We have to send it again with the new updates.
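The acknowledgement handler on the source side can express that rule in a single conditional update.  A sketch, with assumed table and column names: only a row that is still "in process" goes back to "clean"; a row edited after the document was sent is "dirty" again, so the WHERE clause leaves it alone and it gets sent again later.

using System;
using System.Data.SqlClient;

public static class SourceAcknowledgement
{
    // Clears the dirty flag only if the row has not been modified since the
    // document was sent out; a re-dirtied row stays dirty and will be re-sent.
    public static void AcknowledgeDocument(SqlConnection conn, Guid headerId)
    {
        using (var cmd = new SqlCommand(
            @"UPDATE HeaderRow
              SET DirtyFlag = 'clean', DirtyFlagChangedAt = GETUTCDATE()
              WHERE HeaderId = @id AND DirtyFlag = 'in process'", conn))
        {
            cmd.Parameters.AddWithValue("@id", headerId);
            cmd.ExecuteNonQuery();
        }
    }
}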

Why do this one at a time?

You may have noticed already that we are only moving one document at a time.  The logic that the web service should use when retrieving a record is best described in Structured Query Language: SELECT TOP 1 * FROM HeaderRow WHERE DirtyFlag = 'dirty'

If that query returns nothing, then (and only then) try this: SELECT TOP 1 * FROM HeaderRow WHERE DirtyFlag = 'in process' and (in English now) it has been more than an hour since the dirty flag was last modified. 

This gives you one record that "really should be sent" to the other side.  Why wait an hour?  You could make it a minute if you want.  I like leaving a window.  That way, if you start up four copies of your Replication Central service, they won't conflict with each other.  In other words, it scales nicely.
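Here is a sketch of how the source web service might combine those two queries and then claim the chosen row, using the HeaderRow table from the queries above; the HeaderId and DirtyFlagChangedAt columns are assumptions of mine.

using System;
using System.Data.SqlClient;

public static class SourcePicker
{
    // Picks the next header row to send: prefer plainly dirty rows, fall back to
    // "in process" rows whose flag has not changed in over an hour (a stalled
    // earlier attempt), then mark the chosen row "in process" before building
    // the document so parallel RC instances will skip it.
    public static Guid? PickNextHeader(SqlConnection conn)
    {
        Guid? id = RunPick(conn,
            "SELECT TOP 1 HeaderId FROM HeaderRow WHERE DirtyFlag = 'dirty'");

        if (id == null)
            id = RunPick(conn,
                @"SELECT TOP 1 HeaderId FROM HeaderRow
                  WHERE DirtyFlag = 'in process'
                    AND DirtyFlagChangedAt < DATEADD(hour, -1, GETUTCDATE())");

        if (id != null)
        {
            using (var claim = new SqlCommand(
                @"UPDATE HeaderRow
                  SET DirtyFlag = 'in process', DirtyFlagChangedAt = GETUTCDATE()
                  WHERE HeaderId = @id", conn))
            {
                claim.Parameters.AddWithValue("@id", id.Value);
                claim.ExecuteNonQuery();
            }
        }
        return id;
    }

    private static Guid? RunPick(SqlConnection conn, string sql)
    {
        using (var cmd = new SqlCommand(sql, conn))
        {
            object result = cmd.ExecuteScalar();   // null when the query returns no rows
            return result == null || result == DBNull.Value ? (Guid?)null : (Guid)result;
        }
    }
}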

Also, since we are only sending one transaction at a time, the RC can run fairly continuously.  We can query the database continuously until it returns with "no changes" and then sleep for 10 minutes before we ask again.  We don't have to worry about "processing windows" or slowing down the server with the overhead of replication.
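That outer loop is about as simple as it sounds.  A sketch, reusing the ReplicateOneDocument round trip from the earlier sketch (all names mine):

using System;
using System.Threading;

public static class ReplicationCentralHost
{
    // Drain one document at a time; only sleep once the source reports nothing to send.
    public static void Run(ISourceEndpoint serverA, IDestinationEndpoint serverB)
    {
        while (true)
        {
            bool movedOne = ReplicationCentral.ReplicateOneDocument(serverA, serverB);
            if (!movedOne)
                Thread.Sleep(TimeSpan.FromMinutes(10));   // "no changes" -> back off before asking again
        }
    }
}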

Final notes

A detail that should be obvious is that each transaction has to be uniquely identifiable.  Since we are using GUIDs for our keys, we can use the GUID of the header row as the transaction id.  That way, when the acknowledgement comes back, we know what row to apply it to.

One more note: I had said that data needs to go in both directions: from A to B and from B to A.  In our system, we are lucky... We never have the same table that can be modified on both servers at the same time.  I do not claim that this process will work with that situation, because data merging comes into play.  That said, the concepts can probably be expanded to cover data merging... but I don't have time to think about that right now.

Conclusion

Service oriented architecture isn't about web services, or HTTP, or XML.  It is about data.  How data is managed.  How it is transmitted.  And how it is replicated.  This document describes one, very tangible, mechanism that we are using to transfer data between servers in a service oriented manner.