There are multiple ways to store data with a cloud provider, and Windows and SQL Azure in particular offer several. As part of a “Data First” architecture design, one decision vector (assuming you’ve already done a data classification of the elements you want to store) is the transaction level you need for each datum. Once you’ve decided what level of transactional commitment you need, you can make intelligent decisions about the storage engine, method of access and storage, speed and other requirements.
Although the list below is neither original nor exhaustive, these are the general considerations I use for a given data set. It’s important to note that in many on-premises systems the engine at hand overrides these concerns. If you have a large Relational Database Management System (RDBMS), for instance, you might simply place all data there without further consideration. In a Platform as a Service (PaaS) like Windows and SQL Azure, however, selecting the proper engine for a particular dataset has implications ranging from cost to performance, and it is critical when you want to leverage the data across “Big Data” analysis tools like Hadoop or other constructs.
Monolithic Consistent Transactional
The first selection is analogous to a local RDBMS system. The dataset is retrieved in a functionally single, monolithic transaction, that is, kept together with ACID properties in mind. This is the most reliable type of data design for datasets that require a high degree of safety in the read/write pattern. As an example, a bank ATM transaction should be modeled in a monolithic way. If I make a transfer of funds from one account to another, I want the money to be subtracted from one account if and only if it is successfully added to the other. The bank, on the other hand, wants the money added to the second account if and only if it is subtracted from the first. This is a prime example of a Monolithic (single atomic transaction), Consistent (if and only if) and Transactional (as a unit, with provision for roll-back and reporting if unsuccessful) data requirement.
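The transfer described above can be sketched in a few lines. This is only an illustration: Python's built-in sqlite3 module stands in for SQL Azure, and the accounts table, the account names and the transfer function are hypothetical names chosen for the example.

```python
import sqlite3

def transfer(conn, source, target, amount):
    """Move `amount` from `source` to `target` as one atomic unit."""
    try:
        # Using the connection as a context manager commits on success
        # and rolls back if an exception escapes the block.
        with conn:
            cur = conn.execute(
                "UPDATE accounts SET balance = balance - ? "
                "WHERE id = ? AND balance >= ?",
                (amount, source, amount),
            )
            if cur.rowcount != 1:
                raise ValueError("insufficient funds or unknown source account")
            cur = conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, target),
            )
            if cur.rowcount != 1:
                raise ValueError("unknown target account")
        return True
    except ValueError:
        # The debit (if any) has already been rolled back at this point.
        return False

def make_demo_db():
    """Build a small in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)",
        [("checking", 100), ("savings", 50)],
    )
    conn.commit()
    return conn

def balances(conn):
    """Return all balances as a dict keyed by account id."""
    return dict(conn.execute("SELECT id, balance FROM accounts"))
```

If the credit step fails, the debit is undone as well, which is the "if and only if" behavior both the customer and the bank are asking for.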
The primary engine used for this type of data is often SQL Azure, an RDBMS in the same datacenters as Windows Azure. Placing both the calling application (whether that is a Data Access Layer-based code widget or a direct call from a Web or Worker Role) and the database in the same datacenter means that data is retrieved quickly and in a monolithic way. The cost of this method is based on overall database size. One consideration is how much data you can store this way. Database sizes have limits, although there are ways of overcoming size issues using technologies such as sharding or SQL Azure Federations. There is also the consideration of performance. In an RDBMS that conforms to ACID properties, locking and other safety overhead are in conflict with the highest possible read performance. But in some cases the ACID properties are worth the cost, as in the banking example.
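To make the sharding idea a little more concrete, here is a minimal sketch of routing a record to one of several databases by hashing its key. The shard names and the shard_for function are hypothetical, and this is plain hash-modulo routing, not how SQL Azure Federations assigns data.

```python
import hashlib

# Hypothetical shard names; in practice each would map to a connection string.
SHARDS = ["customers_db_0", "customers_db_1", "customers_db_2", "customers_db_3"]

def shard_for(key, shards=SHARDS):
    """Pick a shard deterministically from a stable hash of the key."""
    # A cryptographic hash is used only because it is stable across runs
    # and machines, unlike Python's built-in hash() for strings.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]
```

The same key always lands on the same shard, so a single customer's rows stay together. The trade-off of simple modulo routing is that changing the number of shards moves most keys, so it suits cases where the shard count is fairly stable.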
You are not limited to SQL Azure in this model. Windows Azure Table storage, while similar to NoSQL offerings, is different in that it is immediately consistent across all three replicated copies of data, offering a higher degree of safety. And while Table storage does not offer the general-purpose transaction support of an RDBMS, there are ways to achieve certain transaction levels, such as batching changes to entities that share a partition key.
If consistency can be relaxed (meaning that a guaranteed read/write pattern is not essential) then more options arise in Windows and SQL Azure. You can still use SQL Azure for this type of storage, with either automatic or programmatic hints allowing for “dirty reads”. Windows Azure Table storage is still consistent, but you can choose how the data is queried, for example by keeping separate copies for reads and writes. Because of the relaxed transactional nature, higher speeds are possible by querying cached or separate datasets.
An example here is the same bank transaction, but followed by a statement inquiry. Just after the money is deposited, the user queries the current balance. The balance as of the last refresh, which may not yet include the transaction that just occurred, is retrieved and shown to the user. The application might even combine that amount with the latest transaction, perhaps saved as a locally cached object, and show the result with a caveat to the user.
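A rough sketch of that balance inquiry follows. The BalanceReader class, its parameters and the 30-second freshness window are all assumptions made for illustration; load_balance stands for whatever call reaches the transactional store.

```python
import time
from dataclasses import dataclass

@dataclass
class CachedBalance:
    amount: int        # balance in cents at the time it was cached
    fetched_at: float  # clock reading when the cache entry was filled

class BalanceReader:
    """Serve balance inquiries from a cache, refreshing only when stale."""

    def __init__(self, load_balance, max_age_seconds=30.0, clock=time.monotonic):
        self._load = load_balance  # callable that hits the transactional store
        self._max_age = max_age_seconds
        self._clock = clock
        self._cache = {}

    def read(self, account_id, pending=0):
        """Return the cached balance plus any locally known pending amount."""
        now = self._clock()
        entry = self._cache.get(account_id)
        if entry is None or now - entry.fetched_at > self._max_age:
            entry = CachedBalance(self._load(account_id), now)
            self._cache[account_id] = entry
        age = now - entry.fetched_at
        return {
            "balance": entry.amount + pending,
            "as_of_seconds_ago": age,
            # The caveat shown to the user: the figure is not authoritative.
            "may_be_stale": age > 0 or pending != 0,
        }
```

The store is consulted only when the cached copy is older than the freshness window, and every answer carries enough information for the user interface to display the caveat.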
At some point, the data becomes too large to fit inside a single processing session, and parallelism is used. In this case, either separate databases in SQL Azure or Windows Azure Tables, local data storage on the Web or Worker Role, or a combination of all of these with caches is the right approach for the data design.
The biggest implication in this type of system is speed – a higher degree of data separation is essential, and so the dataset selection must fit the pattern. It is unacceptable to force an ACID-properties type workload into this environment. Typical examples here are the actual data asset payload for streaming video or music, read-only documents and so on. This pattern is often separated from the metadata, which is kept in more of a transactional model.
As an example, assume you log on to a website to watch a movie or listen to music. The provider needs to verify your identity and account balance, which are transactional data loads. After that process is complete, the workload shifts to a copy – perhaps one of several – of the asset to stream to your location.
In this case, Windows Azure Blob storage and the Content Delivery Network (CDN, a series of servers closer to the user) serve the payload, while the metadata keeps its transactional, realtime requirements.
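The split between the two workloads can be sketched as follows. Everything here is hypothetical: the Replica type, the account dictionary, the example URLs and the region-matching rule are placeholders, and a real CDN chooses the nearest server on its own rather than through application code like this.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Replica:
    url: str     # location of one read-only copy of the asset
    region: str  # coarse label for where that copy is served from

def choose_replica(replicas, user_region):
    """Prefer a copy in the user's region; otherwise fall back to the first copy."""
    for replica in replicas:
        if replica.region == user_region:
            return replica
    return replicas[0]

def start_stream(account, asset_replicas, user_region, price):
    """Return the URL to stream from, or None if the account check fails."""
    # Step 1: transactional check against the account metadata.
    if not account["verified"] or account["balance"] < price:
        return None
    # Step 2: the payload itself comes from any read-only copy of the asset.
    return choose_replica(asset_replicas, user_region).url
```

Only the first step needs the transactional store. Once it passes, any copy of the asset will do, which is what lets the payload scale independently of the metadata.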
At the furthest end of the data scale are large datasets that need deeper analysis, but not necessarily in realtime. Examples here are terabytes of data requiring a Business Intelligence view, but with a tolerance for latency of a few seconds to minutes or hours. In this case, storage, processing and query methods such as the Hadoop offering in Windows Azure, or perhaps the High Performance Computing (HPC) Windows Server in Windows Azure, fit well. Here, the design of the data is often dictated by the source, and more emphasis is placed on the algorithms around processing and re-assembling the data.
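The process-then-reassemble emphasis can be illustrated with a toy map and reduce job. This is not the Hadoop API: a local thread pool stands in for the cluster, and the "region,amount" line format and all function names are assumptions made for the example.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_chunk(lines):
    """Map step: total sales per region for one chunk of 'region,amount' lines."""
    totals = defaultdict(int)
    for line in lines:
        region, amount = line.split(",")
        totals[region.strip()] += int(amount)
    return dict(totals)

def reduce_totals(partials):
    """Reduce step: merge the per-chunk totals into one result."""
    merged = defaultdict(int)
    for partial in partials:
        for region, total in partial.items():
            merged[region] += total
    return dict(merged)

def run_job(chunks, workers=4):
    """Process each chunk independently, then reassemble the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce_totals(partials)
```

Each chunk is processed without knowledge of the others, so the input can stay in whatever shape the source delivers it. The design effort goes into the map and reduce functions rather than into a schema.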
There are, of course, other patterns. In many cases a single dataset may have needs in one or more of these categories – in fact, sitting at 30,000 feet typing this entry, I’m having that very design discussion with a gentleman sitting next to me. The key is to design data-first, and fit the technology to the requirement for each datum. Allow each function and engine to handle the data in the most efficient, effective way for cost, performance and utility.