SQL Data Services goes full relational

A few days ago we announced the big news about SQL Data Services (SDS) switching to being a full relational database on the cloud.

I’ve been a strong supporter of this path for a number of reasons. Relational databases are very well understood and there is a large base of expertise for them in the market. Also, a lot of the existing applications and libraries out there are ready to run against a relational database, so SDS is enabling them to be ported to the cloud with minimum (or perhaps sometimes, no) effort.

With SDS going relational not only you get to reuse all your knowledge and codebase in the cloud, but you also get all the benefits of a cloud-based infrastructure: high availability, piece-of-cake provisioning, pay-as-you-go growth, etc.

One of the concerns I read about is the impact on scalability. My observation is that when you look at most of the storage systems in the cloud, they don’t have some magic formula for scalability, the trick is partitioning. Some systems are smarter than others in how they partition data and how dynamic the partitioning scheme is to adapt to varying system workloads. But in the end, you need to partition your data such that it’s spread across a bunch of nodes; if across your system you never (or rarely) depend on cross-partition operations, then you have a sustainable scalability path. That is independent of the actual organization of the data (e.g. relational, flexible entities, etc.) The ACE model on top of SDS had partitioning embedded in the model through scale units that surfaced as “containers”. In the new SDS world you can just partition your data across nodes, where each node has full relational capabilities. So it’s similar (partitioning), but each node gives you very rich ways of organizing and interacting with your data (full SQL!).

The other concern I heard is around TDS, the SQL Server client-server protocol, and how it would play in the Internet. In many cases the actual application that connects to SDS will be running in Azure as “web” or “worker” roles, and things should go smoothly. For the scenarios where the client is connecting to SDS from across the web, there are two challenges: firewalls and latency.

The server side of TDS by default listens in TCP port 1433, which a lot of firewalls will just block; furthermore, TDS is not HTTP, so a packet-inspecting intermediary could choose not to let the traffic through, regardless of the port number. This could certainly create some trouble that will need to be addressed at some point.

From the latency perspective, the short story is that I think it’s fine. TDS follows a simple request/response model, so interactions between clients and servers are straightforward and not chatty at all (things are more complicated when MARS is enabled, but that’s another story). We have experience tuning TDS for large WANs with high latency and things work out well as long as you optimize for those scenarios (e.g. batch queries together, etc.).

As a final note, there is the question about the SOAP/REST interfaces. In my opinion whenever you’re building the kind of rich applications that needs full SQL, rarely the data in the database can stand alone for direct access by consumers. Most of the time there is code on front (in the form of a middle tier) that manages access control, shaping, and even application-level constraints that don’t belong to the database. If you need a REST head on top of an SDS database, you can add ADO.NET Data Services to the equation, which will let you add all that logic fronting your data.

All in all, I’m really exited to see this happening. This gives Azure a whole spectrum of storage services, from blobs in Azure Blob Storage, to schema-less tables in Azure Table Storage, now to full relational with SQL Data Services.

-pablo