Data Architecture in a Cloudy World

Like many Microsoft bloggers, I post somewhat erratically, partly because of my own workload and partly because of company decisions about what product information can be blogged about before public release. Anyway, I'm back (for a while, at least).

I am writing a white paper, tentatively titled "Data Architecture in a Cloudy World" (feel free to critique that title), and I'm going to experiment with posting parts of it and collecting feedback before it is published.

Here's the intro to the paper:

"This paper provides best practice advice for two questions:

  • Logical data architecture: which kinds of data stores should I use? A relational database and a data warehouse? A NoSQL data store? Big Data? Windows Azure Service Bus? Etc.
  • Physical architecture: where should the data stores be located? On premises? In the cloud? In a hybrid scenario?

Until the last decade, this was a straightforward problem: the data store would most likely be a relational database (with some content going into a file system), and the physical architecture would depend on your scaling and availability requirements, involving clusters, replication, hot backups, etc. While these were complex and deep subjects, they were at least well understood, and the best practices were fairly widely known and uncontroversial.

Today all this has changed. There are many new storage options, driven by the emergence of Internet-scale applications: not only relational databases and data warehouses, but a host of “NoSQL” data stores, “Big Data”, and more. And the rise of the “Cloud” has led to several new, evolving physical architectures.

Choosing the right architecture is a critical decision for your application. In some cases the choice may be self-evident and easy; in other cases you may have to trade off conflicting factors in making your decision.

This paper provides current best practices for choosing both data and physical architectures. Once you have made your decision, there are additional platform-specific architecture decisions to make; we describe these and, where available, provide best practices. We also provide links to other resources that go into more detail about each platform than we can here."

The paper goes through each of the major architectures and why you might pick one over another, and for each architecture it covers the secondary architectural decisions you may need to make. For example, in the case of Windows Azure SQL Database, if you need to scale out horizontally, then you will need to use the Federation feature, and the paper describes some of the design decisions involved in using Federation.

Some of your architectural decisions are well understood: for example, how to build a well-normalized relational database on premises. In such cases, I do not go into detail and just link to some resources on the subject.

This was a pretty abstract overview of the paper. We've had some indication that the wide variety of data architecture choices available today has caused some customer confusion. I'd love to hear your feedback about whether you've found this with your own customers.

My next post will be the draft of a section tentatively titled "Why Everything is Different with the Cloud and the Internet". Its main focus is the "CAP Theorem", how that affects the ability to do 2-phase commits of database transactions, and the architectural implications of that.

Normally we circulate papers inside Microsoft before they are published, but the subject I'm writing about has many experts on the outside as well, so I decided to experiment with soliciting feedback pre-publication. I'm interested in technical accuracy, but also in whether the paper addresses your questions as a customer, so feel free to wade in even if you don't consider yourself an expert.

An English/American folk proverb says that if you like to eat sausage, you probably don't want to watch how it's made. Well, this will be an experiment to see whether that's true of writing as well; hopefully not! I've been getting about 1700 hits per blog posting (a few fewer than Ballmer or Sinofsky!), so if only 1% of you make comments, that will be a nice contribution!