A data service is not a database

Since we put out the new version of the experimental Astoria online service there has been a broad set of questions coming, which is great because it was the goal of the thing...we're learning a bunch from it.

A class of questions that we are getting are "database questions". Things such as "how can I enforce a constraint" or "can I create an index?" or "what if I want to use hints in my queries".

I think that a key thing to understand is that a data service is not a database. There are many aspects that are fundamentally different when you design, secure, deploy, do capacity planning and such between these two.

Schema

Database schemas are typically heavily optimized for performance. At the logical level tables are often arranged to balance grouping important data together with row-density to make common scans over a subset of the columns as fast as possible; normalization and de-normalization decisions are made so that the database remains reasonably queryable and maintainable while avoiding too many joins, etc. Beyond that, in a database you have access to the physical level, and there it's common to tweak indexes, create materialized views, create covering indexes, organize data in file-groups, and so on.

In a data service, your schema becomes part of your public interface. When you publish an Astoria data service (or something similar), the URI space is dictated by the set of top-level resource containers, the members of the types it can contain and related constructs. The shape of the payloads will also be based on the various types contained in the various resource containers exposed by the service.

So in a service you'll want the schema to be optimized for its target use, so semantics tend to drive it. In a database schema will be about data organization and performance.

The gap between the two, in the case of Astoria, is filled by the ADO.NET Entity Framework mapping engine, which you can use if you have a service schema that's different than your database schema, which I think will be the common case for anything that's not a quick sample.

Physical aspects

As I mentioned when discussing "schema", when you create and manage a database you worry a lot about the physical level. This ranges from indexes to file-groups to spreading the data across disks in a smart way.

A service is a different beast in this sense as well. In a service you just put the data "up there". The service will choose the appropriate physical organization for the data, regardless of the visible service schema. (for example, in the Astoria experimental online service you describe your data as entities and associations and the system figures out a logical/physical schema to support it, along with a mapping to translate between them). The system supporting the service may or may not use a relational database, or even a database at all (there are many large-scale storage systems that use other models instead of traditional relational to avoid the impact of global metadata and the complexity of partitioning highly structured schemas for scale-out).

Interaction model

Even the way clients interact is different. Databases have a concept of a "session" which has a bunch of state in it; you can even do transactions, which carry even more state. In a service that has an HTTP interface it would be quite unnatural to have sessions, and certainly you don't want to have something like a transaction that spans more than a single HTTP request if you want to scale to large concurrent workloads.

Also, particularly in larger databases, access to the database is often restricted to middle-tier systems that are "trusted" by the database. In some cases those systems may even be allowed to do certain level of dynamic SQL (usually the larger the system or the scalability requirements, the less likely to allow dynamic SQL). In the service side of things, while you want to allow certain level of flexibility on how the data is obtained and shaped, you may not want to take arbitrary programs in a language as expressive as SQL and execute them in your systems.

Semantics and business logic

While sometimes some business logic is pushed to the database for practical reasons such as performance, typically most of it happens in the middle-tier, and then you connect all your client components to the middle-tier.

When you have a data service exposed to the web you cannot assume anything about the client. You cannot trust that the client is legitimate (there are ways to do this...it depends on your target scenario if you want to be restrictive or not), you may need to handle varying versions and you may need to handle clients written by other people that may or may not follow your requirements. All that means is that the data service does need to do some validation/business logic execution before it allows you to see data or to manipulate it. In some form, it acts as a middle tier that sits across the web from the client.

 

There are several other aspects I did not elaborate on here that differentiate these things. The bottom line is that while there are some similarities, in general it is important to avoid the tendency to assume that these things are the same.

-pablo