Intro to Azure DocumentDB

12/07/2014

In the enterprise-scale era, data systems were dominated by the relational database and the focus was on making ACID semantics performant over a normalized relational schema. Although data sharding has been discussed for many years, vertical scaling became the primary way to improve application performance. In the web-scale era, it turns out that horizontal scaling is far more cost-effective than vertical scaling – and this led to a search for databases where the data could be distributed across many compute nodes.

Distribution of data brings its own challenges, best described through the CAP theorem which provides a convenient way to discuss the interaction of Consistency, Availability and network Partitions. As initially introduced (and proved) the CAP theorem stated that a distributed system could achieve only two of C, A, or P simultaneously. In a recent update, Eric Brewer indicated that more realistically network partitions were inevitable (even if rare) so that the only practical choices for a distributed database are CP or AP – or somewhere in between where consistency can be sacrificed for increased availability or vice versa.

The need for high-scale data storage also led to the development of non-relational databases with different data models, often referred to as NoSQL databases (where “No” stands for “not only”). The web-scale era brought many different distributed databases that can be loosely classified into a few models:

document databases – storing JSON objects (documents)
key-value stores – storing key/value pairs
column stores – storing large numbers of columns for a particular key

These NoSQL databases are characterized by supporting high scale through the use of:

sharding to distribute data among many storage nodes
schema-less data models

NoSQL databases have proven to be a powerful way to achieve high scale, with some installations having hundreds of compute nodes. Managing these at scale poses its own challenges, however, particularly in the cloud where the underlying compute fabric is built on commodity hardware with a higher failure rate than would be typical for an on-premises datacenter.

The Microsoft Azure platform provides several fully-managed, Database-as-a-Service options:

Azure SQL Database – fully managed SQL database providing most of the SQL Server functionality
Azure Tables – high-scale, schema-less, key-value store
Azure DocumentDB – document database

Azure SQL Database exposes a TDS endpoint that can be used just like any other SQL Server endpoint to manage databases and the data they contain. It exposes most of the functionality of SQL Server, including point-in-time restore and automated geo-replication of data. Like many Azure services, SQL Database provides Basic, Standard and Premium pricing levels exposing a wide range of performance levels.

Azure Tables is an auto-sharded, key-value store that provides a cost-effective way to store high-scale structured data. A single table can contain up to 500TB at a cost of $0.07 per GB/month. Azure Tables has a scalability target of up to 20,000 operations per second.

Azure DocumentDB is a document database supporting the fully-managed storage of JSON “documents” in a highly-performant manner. It is currently in preview, so it provides only a single performance level. Conceptually, it is broadly comparable with MongoDB, an extremely popular NoSQL database providing the storage layer for the MEAN development model for websites. However, DocumentDB was built from the ground up to be a fully-managed distributed database.

The entry point for information on DocumentDB is here. DocumentDB is extensively documented on MSDN. Ryan CrawCour (@ryancrawcour), a Program Manager on the DocumentDB team, describes it in this Cloud Cover show. The DocumentDB team has a frequently-updated blog describing new features as they are added. There is a collection of DocumentDB code samples on MSDN. Vincent-Philippe Lauzon has released DocumentDB Studio, an application to manage and use DocumentDB.

This document provides a .NET-focused perspective on DocumentDB and is, to some extent, an extended description of this deck.

Resource Model

DocumentDB exposes database resources through the following logical hierarchy:

Database Account
    Database
        Collection
            Document
                Attachment
            Stored Procedure
            Trigger
            User-defined function
        User
            Permission
    Media

The Database Account provides a billing and security boundary for access to DocumentDB. Each Database Account may host zero or more Databases, which in turn contain collections of documents and users. A document is the persistent representation of a JSON document. Each document may be associated with zero or more attachments, which provide metadata for associated files stored either as Media resources inside DocumentDB or as an external resource in the Azure Blob Service. Stored Procedures, Triggers and User-Defined Functions provide server-side JavaScript functionality. To enhance security, users can be created within a Database and associated with Permissions that can be used to restrict the user to read-only access to specific resources in the Database. Files can be uploaded as Media to DocumentDB and then associated with a document as an Attachment.

Consistency Levels

DocumentDB is a distributed database with individual resources stored on multiple storage nodes. As such, it is subject to the constraints of the CAP theorem. These are exposed to the user in terms of consistency levels: each operation on a DocumentDB resource has an associated consistency level that determines how subsequent read requests on that resource are handled. A database account has a default consistency level (Session), which may be weakened for individual requests. Some consistency levels use quorum writes or reads, which means that a majority of the underlying storage nodes must respond affirmatively for a write, or with the same data for a read.

DocumentDB supports the following consistency levels:

Strong – quorum writes and quorum reads. This means that requests are fully consistent.
Bounded Staleness – write order is guaranteed. Quorum reads may be behind by a specified number of operations (or time in seconds).
Session – write order is guaranteed within a client session. Reads are consistent within a session. This is the default for a new database account since it is deemed “usually sufficient” for an application.
Eventual – reads may be out of sequence, i.e., some reads may not see the latest changes.
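
The quorum idea behind the stronger consistency levels can be illustrated with a minimal in-memory sketch. It is not how the service is implemented; the class and method names are hypothetical, and replicas are modeled simply as a list of the values they currently hold.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class QuorumSketch
{
    // A quorum is a strict majority of the replicas.
    public static bool HasQuorum(int agreeing, int replicaCount)
    {
        return agreeing > replicaCount / 2;
    }

    // Returns the value reported by a majority of replicas,
    // or null when no single value reaches a quorum.
    public static string QuorumRead(IList<string> replicaValues)
    {
        if (replicaValues.Count == 0) return null;

        var mostCommon = replicaValues
            .GroupBy(value => value)
            .OrderByDescending(group => group.Count())
            .First();

        return HasQuorum(mostCommon.Count(), replicaValues.Count)
            ? mostCommon.Key
            : null;
    }

    public static void Main()
    {
        // Two of three replicas acknowledged a write: a majority.
        Console.WriteLine(HasQuorum(2, 3));
        // Two of four replicas is only half, which is not a majority.
        Console.WriteLine(HasQuorum(2, 4));
        // One replica is stale, but the majority agrees on "v2".
        Console.WriteLine(QuorumRead(new[] { "v2", "v2", "v1" }));
    }
}
```

With three replicas, a read can tolerate one stale replica; with an even split there is no majority and the sketch returns null rather than guessing.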

Indexing Policy

DocumentDB provides an indexing policy for document collections to specify the indexing of documents in them. The default indexing policy is that all properties of a document are automatically indexed, which allows any property to be used in the specification of a query filter. This is convenient, but has implications for the size of a database and the performance of operations on a document. The indexing policy may be configured in various ways:

Index tuning – automatic indexing can be tuned for individual documents and paths within them – either including or excluding a property path from the index.

DocumentDB supports the following indexing modes:

Consistent – by default, indexes are synchronously updated on insert, replace or delete.
Lazy – indexes are asynchronously updated. This is targeted at bulk-ingestion scenarios.

Stored Procedures, Triggers and User-Defined Functions

DocumentDB supports the use of server-side JavaScript, in a manner that mimics traditional functionality in a relational database:

Stored procedures – invoked by the client
Triggers – invoked automatically by specific document operations
User-defined functions – invoked in the context of a single query

Each document collection can have a small number of associated stored procedures. A stored procedure is a JavaScript function that is uploaded to DocumentDB and compiled automatically. It can then be invoked from a client and allow sophisticated functionality to be provided for the manipulation of one or more documents in the collection. The script of a stored procedure may invoke create, update, delete and query operations against documents. An invocation of a stored procedure is wrapped automatically in a transaction on the primary replica, so that all operations in it are rolled back in the event of an error.
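
The all-or-nothing behavior described above can be sketched in a few lines. This is only an illustration of the semantics, assuming a hypothetical in-memory dictionary in place of a collection and a C# delegate in place of the JavaScript script; it says nothing about how the service implements its transactions.

```csharp
using System;
using System.Collections.Generic;

public static class TransactionSketch
{
    // Runs the script against a working copy and swaps the copy in only on
    // success, so a thrown exception leaves the original documents untouched.
    public static bool TryExecute(
        ref Dictionary<string, string> documents,
        Action<Dictionary<string, string>> script)
    {
        var workingCopy = new Dictionary<string, string>(documents);
        try
        {
            script(workingCopy);
        }
        catch (Exception)
        {
            return false; // roll back: the working copy is discarded
        }

        documents = workingCopy; // commit
        return true;
    }

    public static void Main()
    {
        var documents = new Dictionary<string, string> { { "a1", "OK Computer" } };

        // The script creates a document and then fails, so nothing is kept.
        bool committed = TryExecute(ref documents, docs =>
        {
            docs["a2"] = "Kid A";
            throw new InvalidOperationException("simulated script failure");
        });

        Console.WriteLine("Committed: {0}, documents: {1}", committed, documents.Count);
    }
}
```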

Triggers are also managed at the document collection level. A trigger is a JavaScript function that is invoked automatically before or after the invocation of create, replace or delete operations on a document. A trigger is of pre or post type, indicating whether it is invoked before or after the associated operation. Once deployed to a document collection, a trigger is invoked automatically whenever the associated operation is invoked on a document in the collection. The execution context of a trigger is automatically contained in a transaction on the primary replica, so that all operations are rolled back in the event of an error.

A user-defined function is a side-effect free JavaScript function that returns a scalar value inside a query. Once deployed to a document collection, a user-defined function can be invoked to modify the value of a property returned by a query.

Note that the invocation of stored procedures, triggers and user-defined functions is subject to resource usage constraints. Consequently, an invocation may be throttled if it uses too many resources.

DocumentDB SQL

DocumentDB uses a SQL dialect for queries. This has the benefit of providing a familiar query model, even if some of the underlying concepts differ somewhat from traditional SQL. The differences arise because the data model consists of hierarchical JSON documents rather than relational tables. There is a detailed description of how to use the SQL dialect here.

Management

DocumentDB is supported only in the Azure Preview Management Portal. The portal support is being revved frequently, and currently covers:

Management of database accounts, collections, users, etc.
View of consumption statistics

The various client APIs provide management libraries that can be used to manage DocumentDB resources.

Note that you should be aware of the resource limits in DocumentDB, which are documented here.

Client Libraries

The core programming interface to DocumentDB is a RESTful API. However, DocumentDB also supports client libraries in the following environments and languages built on top of the RESTful interface:

.NET
Node.js
JavaScript client
JavaScript server
Python

RESTful API

The core API for Azure DocumentDB is RESTful, and the various client libraries are built on top of this API. All DocumentDB operations happen within the context of a DocumentDB account, and the account name is used as a prefix for the URL. The URL for operations against a DocumentDB account named oban is:

https://oban.documents.azure.com

The DocumentDB resource model is exposed through the URL path. Individual resources in that path are identified by their internal ID, which is a base-64 encoded unique ID. For example the following shows the full path used for a RESTful operation against a document resource:

https://oban.documents.azure.com/dbs/YxM9AA==/colls/YxM9ANCZIwA=/docs/YxM9ANCZIwABAAAAAAAAAA==

This shows the resource hierarchy from the DocumentDB account (oban) through the Database (YxM9AA==) and Collection (YxM9ANCZIwA=) to the Document (YxM9ANCZIwABAAAAAAAAAA==), along with the resource ID of each resource.
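
Because the path alternates between a resource type and a resource ID, it can be split into pairs mechanically. The following is a minimal sketch; the class and method names are hypothetical, and it assumes that resource IDs never contain a '/' character.

```csharp
using System;
using System.Collections.Generic;

public static class ResourceLinkSketch
{
    // Splits "dbs/{id}/colls/{id}/docs/{id}" into (resource type, resource ID)
    // pairs. Assumes that IDs themselves contain no '/' characters.
    public static List<KeyValuePair<string, string>> Parse(string link)
    {
        string[] parts = link.Trim('/').Split('/');
        if (parts.Length % 2 != 0)
            throw new FormatException("Expected alternating type/ID segments.");

        var segments = new List<KeyValuePair<string, string>>();
        for (int i = 0; i < parts.Length; i += 2)
            segments.Add(new KeyValuePair<string, string>(parts[i], parts[i + 1]));
        return segments;
    }

    public static void Main()
    {
        string link = "dbs/YxM9AA==/colls/YxM9ANCZIwA=/docs/YxM9ANCZIwABAAAAAAAAAA==";
        foreach (var segment in Parse(link))
            Console.WriteLine("{0,-5} {1}", segment.Key, segment.Value);
    }
}
```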

The RESTful API supports the following standard operations against all DocumentDB resources:

CREATE – create a resource (issued as an HTTP POST)
DELETE – delete a resource
PUT – replace a resource
GET – retrieve a resource (or a feed containing a collection of resources)
POST – perform a query returning a set of resources

The distinction between GET and POST is that GET is a point read of a specified resource while POST returns a set of resources satisfying a query contained in the request body. Queries in Azure DocumentDB are written in a SQL dialect, but queries can only be performed against indexed properties in a document. By default, all properties are indexed.

All operations against DocumentDB must be authenticated using HMAC authentication. A DocumentDB account has an account name and two management keys. There are two management keys to support key rollover, and either management key may be used. It is also possible to authenticate with resource tokens, which are essentially shared-access signatures that allow time-limited access to a set of resources for a specific user.
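
The signing step can be sketched with the HMACSHA256 class from the .NET base class library. This article only states that requests are authenticated with HMAC; the string-to-sign layout, the "type=master&ver=1.0&sig=" header format and all class and method names below are assumptions that should be checked against the REST API documentation before use.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class AuthTokenSketch
{
    // ASSUMED layout: verb, resource type, resource ID and date, each
    // terminated by a newline, followed by one empty line.
    public static string BuildStringToSign(
        string verb, string resourceType, string resourceId, string utcDate)
    {
        return verb.ToLowerInvariant() + "\n" +
               resourceType.ToLowerInvariant() + "\n" +
               resourceId + "\n" +
               utcDate.ToLowerInvariant() + "\n" +
               "\n";
    }

    // Computes a base-64 HMAC-SHA256 signature using the base-64 encoded key.
    public static string Sign(string base64Key, string stringToSign)
    {
        using (var hmac = new HMACSHA256(Convert.FromBase64String(base64Key)))
        {
            byte[] hash = hmac.ComputeHash(Encoding.UTF8.GetBytes(stringToSign));
            return Convert.ToBase64String(hash);
        }
    }

    // ASSUMED header format; the result is URL-encoded for use in a header.
    public static string BuildAuthorizationHeader(
        string base64Key, string verb, string resourceType,
        string resourceId, string utcDate)
    {
        string signature = Sign(
            base64Key, BuildStringToSign(verb, resourceType, resourceId, utcDate));
        return Uri.EscapeDataString("type=master&ver=1.0&sig=" + signature);
    }

    public static void Main()
    {
        // Placeholder key for illustration only; a real key comes from the portal.
        string key = Convert.ToBase64String(Encoding.UTF8.GetBytes("not-a-real-key"));
        string date = DateTime.UtcNow.ToString("r"); // RFC 1123 format

        Console.WriteLine(BuildAuthorizationHeader(key, "GET", "dbs", "YxM9AA==", date));
    }
}
```

Because the date is part of the signed payload, a signature computed for one request cannot simply be replayed later with a different date.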

.NET API

The DocumentDB .NET API is a thin wrapper on top of the RESTful API. It uses the Newtonsoft.Json library for JSON serialization. The .NET Client Library can be installed as a pre-release NuGet package; the current version is 0.9.1-preview. The Newtonsoft.Json package is installed automatically as a NuGet prerequisite.

Class: Resource

The .NET API contains a class for each DocumentDB resource, and these classes are all derived from a Resource class declared as follows:

public abstract class Resource : JsonSerializable {
    protected Resource();

    public void SetPropertyValue(string propertyName,
        object propertyValue);
    public T GetPropertyValue<T>(string propertyName);

    [JsonProperty(PropertyName = "id")]
    public virtual string Id { get; set; }
    [JsonProperty(PropertyName = "_rid")]
    public virtual string ResourceId { get; set; }

    [JsonProperty(PropertyName = "_self")]
    public string SelfLink { get; internal set; }
    [JsonConverter(typeof (UnixDateTimeConverter))]
    [JsonProperty(PropertyName = "_ts")]
    public virtual DateTime Timestamp { get; internal set; }
    [JsonProperty(PropertyName = "_etag")]
    public string ETag { get; internal set; }
}

JsonSerializable provides methods supporting the serialization of resources, as well as the setting and getting of resource values. The properties of the Resource class are common to all DocumentDB resources. The API serializes all public properties of Resource and derived classes using Newtonsoft.Json, and its JsonProperty attribute allows the conversion of the property name between the conventional naming styles of .NET and JSON.

SetPropertyValue() and GetPropertyValue() set and get the value of a named property of the resource. Id is the application-provided name for the resource, while ResourceId is the permanent unique name the DocumentDB service provides for the resource. In line with the style favored by RESTful APIs, DocumentDB makes extensive use of resource links, and the SelfLink property specifies the full path to the resource. ETag is used to provide optimistic concurrency.

The following are examples of a SelfLink and ResourceId for a document:

SelfLink: "dbs/YxM9AA==/colls/YxM9ANCZIwA=/docs/YxM9ANCZIwABAAAAAAAAAA==/"
ResourceId: "YxM9ANCZIwABAAAAAAAAAA=="

Classes Derived from Resource

The following classes are derived from Resource:

Attachment
Conflict
Database
Document
DocumentCollection
Error
Permission
StoredProcedure
Trigger
User
UserDefinedFunction

These classes provide minimal additional functionality to their base class:

Attachment: represents a media attachment associated with a document. Adds a ContentType property, to specify the MIME type, and a MediaLink property to specify the self-link of the attachment location.

Conflict: represents some conflict associated with a resource. Adds a ResourceId and ResourceType (system type) of the conflict as well as an OperationKind property to indicate the kind of operation that led to the conflict. It also exposes a GetResource<T>() method to retrieve the conflicting resource.

Database: represents a DocumentDB database. Adds a CollectionsLink property containing the self-link for the document collections in the database, as well as a UsersLink property containing the self-link for the database users.

Document: represents a document in a collection. Adds an AttachmentsLink property exposing a self-link for any attachments associated with the document.

DocumentCollection: represents a collection of documents in a DocumentDB database. DocumentCollection adds the following properties:

ConflictsLink: the self-link for any conflicts in the document collection
DocumentsLink: the self-link for documents in the document collection
IndexingPolicy: the indexing policy for the document collection
StoredProceduresLink: the self-link for the stored procedures associated with the document collection
TriggersLink: the self-link for the triggers associated with the document collection
UserDefinedFunctionsLink: the self-link for the user-defined functions associated with the document collection

Error: represents an error associated with a resource. Adds Code and Message properties to contain the code and message associated with an error.

Permission: represents a permission used to restrict access to a resource for a specific user. Adds the following properties:

PermissionMode: property that specifies a permission as ReadOnly or All.
ResourceLink: the self-link of the resource with which the permission is associated
Token: contains the access token that can be used to authenticate access conforming to this permission for a specific user

StoredProcedure: represents a stored procedure associated with a document collection. Adds a Body property containing the JavaScript text of the stored procedure.

Trigger: represents a trigger associated with a document collection. Adds the following properties:

Body: contains the JavaScript text of the trigger
TriggerOperation: specifies which type of operation the trigger is associated with (All, Create, Delete, Replace, Update)
TriggerType: specifies whether the trigger is invoked before (Pre) or after (Post) the operation

User: represents a user associated with the document database. Adds a PermissionsLink property containing the self-link of the permissions associated with the user.

UserDefinedFunction: represents a user-defined function associated with a document collection. Adds a Body property containing the JavaScript text of the user-defined function and a Type property specifying the type of the user-defined function (currently JavaScript is the only supported value).

DocumentClient

The DocumentClient class provides access to the Azure DocumentDB endpoints. The following fragment of the class declaration shows only those methods associated with a Document resource. There are similar methods for other resource types.

public sealed class DocumentClient : IDisposable, IAuthorizationTokenProvider {
    …
    public DocumentClient(Uri serviceEndpoint, string authKey,
        ConnectionPolicy connectionPolicy = null,
        ConsistencyLevel? desiredConsistencyLevel = null);
    public Task OpenAsync();
    public Task<ResourceResponse<Document>> CreateDocumentAsync(
        string documentCollectionLink, object document,
        RequestOptions options = null,
        bool disableAutomaticIdGeneration = false);
    public Task<ResourceResponse<Document>> DeleteDocumentAsync(
        string documentLink, RequestOptions options = null);
    public Task<ResourceResponse<Document>> ReplaceDocumentAsync(
        string documentSelfLink, object document, RequestOptions options = null);
    public Task<ResourceResponse<Document>> ReplaceDocumentAsync(
        Document document, RequestOptions options = null);
    public Task<ResourceResponse<Document>> ReadDocumentAsync(
        string documentLink, RequestOptions options = null);
    public Task<FeedResponse<dynamic>> ReadDocumentFeedAsync(
        string documentsLink, FeedOptions options = null);
    public Task<StoredProcedureResponse<TValue>> ExecuteStoredProcedureAsync<TValue>(
        string storedProcedureLink, params dynamic[] procedureParams);
    public Task<DatabaseAccount> GetDatabaseAccountAsync();
    public object Session { get; set; }
    public ConsistencyLevel ConsistencyLevel { get; }
    …
}

The DocumentClient constructors take the service endpoint together with one of various forms of authentication credential: either a management key or some form of user-specific resource authentication string. The class provides asynchronous methods supporting CRUD operations on the various DocumentDB resources. ConsistencyLevel provides the consistency level for the DocumentClient instance. Session provides access to the underlying session, providing support for session consistency.

In the DocumentClient constructor, the account name is provided as part of the serviceEndpoint. The ConnectionPolicy provides configuration for the connection to DocumentDB, and the desiredConsistencyLevel provides the ability to reduce the consistency level from the default configured for the database account. OpenAsync() is a utility method that should be called after the DocumentClient is created to ensure that it has been created successfully (with an Exception being thrown otherwise).

The XDocumentAsync() methods typify the methods exposed in the DocumentClient class for each type of Resource. The resource-specification parameter is typically either the permanent SelfLink to the resource or a specific resource object. The RequestOptions parameter allows the specification of various options including support for optimistic concurrency through RequestOptions.AccessCondition.

The XDocumentAsync() methods return either a Task<ResourceResponse> or Task<FeedResponse> depending on whether a resource or a resource feed is returned. The ResourceResponse and FeedResponse provide access to the HTTP response headers and response code as well as the actual resource or resource feed, for which they expose an implicit conversion.

The following example demonstrates the creation of a DocumentClient instance:

Uri documentDbUri = new Uri(
    "https://ACCOUNT.documents.azure.com");
String authorizationKey = "KEY==";

DocumentClient documentClient =
    new DocumentClient(documentDbUri, authorizationKey);

DatabaseAccount

A Database Account can contain zero or more Databases, which provide a logical container for the collection of documents. The following example shows the creation of a Database resource and the subsequent access to various properties of the Database class:

Database database = new Database {
    Id = databaseId
};

ResourceResponse<Database> response =
    await documentClient.CreateDatabaseAsync(database);
// Implicit conversion from ResourceResponse<Database> to Database
database = response;

String selfLink = database.SelfLink;
String collections = database.CollectionsLink;
String users = database.UsersLink;

Class: ResourceResponse

The ResourceResponse class encapsulates the response from a DocumentDB resource operation. It exposes resource-dependent quota and usage information for the operation. The ResourceResponse class contains the response headers for the operation, including the HTTP StatusCode. It implicitly exposes the typed resource from the response. The ResourceResponse class is declared as follows:

public sealed class ResourceResponse<T>
    where T : Resource, new() {
    public long DatabaseQuota { get; }
    public long DatabaseUsage { get; }
    public long CollectionQuota { get; }
    public long CollectionUsage { get; }
    public long UserQuota { get; }
    public long UserUsage { get; }
    public long PermissionQuota { get; }
    public long PermissionUsage { get; }
    public long DocumentSizeQuota { get; }
    public long DocumentSizeUsage { get; }
    public long StoredProceduresQuota { get; }
    public long StoredProceduresUsage { get; }
    public long TriggersQuota { get; }
    public long TriggersUsage { get; }
    public long UserDefinedFunctionsQuota { get; }
    public long UserDefinedFunctionsUsage { get; }
    public string ActivityId { get; }
    public string SessionToken { get; }
    public HttpStatusCode StatusCode { get; }
    public string MaxResourceQuota { get; }
    public string CurrentResourceQuotaUsage { get; }
    public T Resource { get; }
    public double RequestCharge { get; }
    public NameValueCollection ResponseHeaders { get; }
    public static implicit
        operator T(ResourceResponse<T> source);
}

The FeedResponse class serves a similar purpose for operations which return a feed.
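
The implicit conversion is what allows a response to be assigned directly to a variable of the resource type. The mechanism can be shown with a small stand-alone wrapper; the Response<T> class below is hypothetical and much simpler than the real ResourceResponse<T>.

```csharp
using System;

// Hypothetical wrapper that carries a resource plus response metadata.
public sealed class Response<T> where T : class
{
    public Response(T resource, int statusCode)
    {
        Resource = resource;
        StatusCode = statusCode;
    }

    public T Resource { get; private set; }
    public int StatusCode { get; private set; }

    // Lets a Response<T> be used wherever a T is expected.
    public static implicit operator T(Response<T> source)
    {
        return source.Resource;
    }
}

public static class ImplicitConversionSketch
{
    public static void Main()
    {
        var response = new Response<string>("album-1", 201);

        // No cast and no ".Resource" needed: the implicit operator is applied.
        string resource = response;

        // The metadata is still available on the response object itself.
        Console.WriteLine("{0} ({1})", resource, response.StatusCode);
    }
}
```

The trade-off is that the conversion silently discards the metadata, so code that needs the status code or quota information has to keep the response object around.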

Data Model

DocumentDB uses the Newtonsoft.Json library for serialization, which was developed by James Newton-King (@JamesNK). The data model is a simple class, with no special base class, for which all public properties are serialized into JSON. The serialization library performs an obvious mapping from .NET to JSON. For example:

IList, etc. -> Array
Int32, etc. -> Integer
Float, etc. -> Float
DateTime -> String
Byte[] -> String

The following is a simple example of a model class:

class Album {
    [JsonProperty(PropertyName = "id")]
    public String ID { get; set; }
    [JsonProperty(PropertyName = "albumName")]
    public String AlbumName { get; set; }
    [JsonProperty(PropertyName = "bandName")]
    public String BandName { get; set; }
    [JsonProperty(PropertyName = "releaseYear")]
    public String ReleaseYear { get; set; }
}

The JsonProperty attribute (from the JSON.NET library) allows the automatic conversion between the upper-camel case naming convention for .NET and the lower-camel case convention for JSON documents.
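
The effect of the attribute can be seen by serializing a model object directly with Newtonsoft.Json, outside of DocumentDB. The sketch assumes the Newtonsoft.Json package is referenced; the SerializationSketch class and the sample values are only for illustration.

```csharp
using System;
using Newtonsoft.Json;

public class Album
{
    [JsonProperty(PropertyName = "id")]
    public string ID { get; set; }
    [JsonProperty(PropertyName = "albumName")]
    public string AlbumName { get; set; }
    [JsonProperty(PropertyName = "bandName")]
    public string BandName { get; set; }
    [JsonProperty(PropertyName = "releaseYear")]
    public string ReleaseYear { get; set; }
}

public static class SerializationSketch
{
    public static void Main()
    {
        var album = new Album
        {
            ID = "1",
            AlbumName = "OK Computer",
            BandName = "Radiohead",
            ReleaseYear = "1997"
        };

        // Property names follow the JsonProperty attributes, not the .NET names.
        string json = JsonConvert.SerializeObject(album);
        Console.WriteLine(json);
        // {"id":"1","albumName":"OK Computer","bandName":"Radiohead","releaseYear":"1997"}

        // Deserialization applies the same mapping in reverse.
        Album roundTripped = JsonConvert.DeserializeObject<Album>(json);
        Console.WriteLine(roundTripped.AlbumName);
    }
}
```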

Create a Resource

A create operation is used to create a resource.

The following example shows the creation of a document collection in a database identified by its self-link:

DocumentCollection documentCollection =
    new DocumentCollection {
        Id = "SomeId"
    };
ResourceResponse<DocumentCollection> response =
    await documentClient.CreateDocumentCollectionAsync(
        databaseSelfLink, documentCollection);
documentCollection = response;

The creation of other types of resource is similar except that the appropriate CreateXAsync method must be invoked instead.

Read a Resource

A point read of a DocumentDB resource returns a single resource identified by its self-link.

The following example shows the retrieval of a document identified by its self-link, and its subsequent deserialization into a strongly-typed object:

ResourceResponse<Document> response =
    await documentClient.ReadDocumentAsync(documentSelfLink);

Album album = JsonConvert.DeserializeObject<Album>(
    response.Resource.ToString());

Delete a Resource

A delete operation is used to delete a single resource.

The following example shows the deletion of a resource identified by its self-link:

ResourceResponse<Document> response =
    await documentClient.DeleteDocumentAsync(documentSelfLink);

Replace a Resource

DocumentDB currently supports only a complete replacement of a document resource (the HTTP PATCH operation is not supported).

The following example shows the point read of a document identified by its self-link, followed by its modification and subsequent replacement. The example also shows the use of RequestOptions object to implement optimistic concurrency.

dynamic readResponse =
    await documentClient.ReadDocumentAsync(documentLink);

// Replace only if the document still has the ETag that was read
RequestOptions requestOptions = new RequestOptions() {
    AccessCondition = new AccessCondition() {
        Type = AccessConditionType.IfMatch,
        Condition = readResponse.Resource.ETag
    }
};

Album album = (Album)readResponse.Resource;
album.ReleaseYear = "1990";

ResourceResponse<Document> replaceResponse =
    await documentClient.ReplaceDocumentAsync(
        documentLink, album, requestOptions);
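
The If-Match check that makes this safe can be sketched with an in-memory store. The classes below are hypothetical stand-ins, not part of the .NET API; they only illustrate why the second of two writers that read the same version is rejected.

```csharp
using System;
using System.Collections.Generic;

public sealed class StoredDocument
{
    public StoredDocument(string body)
    {
        Body = body;
        ETag = Guid.NewGuid().ToString(); // a new ETag for every version
    }

    public string Body { get; private set; }
    public string ETag { get; private set; }
}

public sealed class InMemoryStore
{
    private readonly Dictionary<string, StoredDocument> documents =
        new Dictionary<string, StoredDocument>();

    public StoredDocument Create(string id, string body)
    {
        var document = new StoredDocument(body);
        documents.Add(id, document);
        return document;
    }

    public StoredDocument Read(string id)
    {
        return documents[id];
    }

    // Replaces the document only if the caller's ETag matches the stored one.
    public bool TryReplace(string id, string body, string ifMatchETag)
    {
        StoredDocument current;
        if (!documents.TryGetValue(id, out current) || current.ETag != ifMatchETag)
            return false; // someone else changed the document in the meantime

        documents[id] = new StoredDocument(body);
        return true;
    }
}

public static class OptimisticConcurrencySketch
{
    public static void Main()
    {
        var store = new InMemoryStore();
        store.Create("album-1", "releaseYear=1997");

        // Two clients read the same version of the document.
        StoredDocument first = store.Read("album-1");
        StoredDocument second = store.Read("album-1");

        // The first replace succeeds and changes the stored ETag.
        Console.WriteLine(store.TryReplace("album-1", "releaseYear=1990", first.ETag));
        // The second replace carries a stale ETag and is rejected.
        Console.WriteLine(store.TryReplace("album-1", "releaseYear=1991", second.ETag));
    }
}
```

A rejected writer would typically read the document again, reapply its change and retry with the fresh ETag.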

Read From a Feed

The .NET API can return all the resources in a collection as a paged "feed."

The following example shows the use of a FeedOptions object to control the paged read of a feed:

String continuation = String.Empty;

do {
    FeedOptions feedOptions = new FeedOptions {
        MaxItemCount = 10,
        RequestContinuation = continuation
    };

    FeedResponse<dynamic> response =
        await documentClient.ReadDocumentFeedAsync(
            documentCollectionLink, feedOptions);
    // An empty continuation token means there are no more pages
    continuation = response.ResponseContinuation;
} while (!String.IsNullOrEmpty(continuation));

DocumentDB Queries

DocumentDB supports queries at all resource levels, including Database, DocumentCollection and Document. The .NET API supports the following kinds of queries:

SQL
LINQ SQL
LINQ Lambda

The DocumentQueryable class exposes helper extension methods to create various types of query.

SQL Queries

The .NET API supports the use of the DocumentDB SQL dialect in queries.

The following example shows a simple query against a document collection, returning a set of documents:

var albums = documentClient.CreateDocumentQuery<Album>(
    documentCollectionSelfLink,
    "SELECT * FROM albums a WHERE a.bandName = 'Radiohead'"
);

foreach (var album in albums) {
    Console.WriteLine("Album name: {0}", album.AlbumName);
}

Note that albums is the name of the DocumentDB collection.

LINQ Queries

DocumentDB supports the use of LINQ queries.

The following example shows a simple LINQ query against a document collection, returning a set of documents:

IQueryable<Album> albums =
    from a in documentClient.CreateDocumentQuery<Album>(
        documentCollectionSelfLink)
    where a.BandName == "Radiohead"
    select a;

foreach (var album in albums) {
    Console.WriteLine("Album name: {0}", album.AlbumName);
}

LINQ Lambda With Paging

DocumentDB supports the use of lambda syntax for LINQ queries.

The following example shows how to use a FeedOptions object to control the paging of query results returned in response to a LINQ query presented in lambda syntax:

FeedOptions feedOptions = new FeedOptions() {
    MaxItemCount = 10
};

IDocumentQuery<Album> query;

do {
    // Re-issue the query, resuming from the last continuation token
    query = documentClient.CreateDocumentQuery<Album>(
            documentCollection.SelfLink, feedOptions)
        .Where(a => a.BandName == "Radiohead")
        .AsDocumentQuery();

    FeedResponse<Album> pagedAlbums = await query.ExecuteNextAsync<Album>();

    foreach (Album album in pagedAlbums) {
        Console.WriteLine("Album name: {0}", album.AlbumName);
    }

    feedOptions.RequestContinuation = pagedAlbums.ResponseContinuation;
} while (query.HasMoreResults);

In the example, server-side paging is controlled by the feed options: MaxItemCount indicates how many documents the server should return with each request, and the RequestContinuation token indicates where the next page should resume. Client-side paging is controlled by the HasMoreResults property of the query.
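
The continuation-token pattern itself is independent of DocumentDB and can be sketched against an in-memory list. Everything in the sketch is hypothetical: in particular, the token is simply the index of the next item, whereas the real service treats its token as an opaque string.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class Page
{
    public List<string> Items;
    public string Continuation; // null when there are no further pages
}

public static class PagingSketch
{
    // Stand-in for the service: returns at most maxItemCount items starting
    // at the position encoded in the continuation token.
    public static Page ReadPage(IList<string> source, int maxItemCount, string continuation)
    {
        if (maxItemCount <= 0)
            throw new ArgumentOutOfRangeException("maxItemCount");

        int start = String.IsNullOrEmpty(continuation) ? 0 : int.Parse(continuation);
        List<string> items = source.Skip(start).Take(maxItemCount).ToList();
        int next = start + items.Count;

        return new Page
        {
            Items = items,
            Continuation = next < source.Count ? next.ToString() : null
        };
    }

    // Client loop: keep requesting pages until no token is returned.
    public static List<string> ReadAll(IList<string> source, int maxItemCount)
    {
        var all = new List<string>();
        string continuation = null;
        do
        {
            Page page = ReadPage(source, maxItemCount, continuation);
            all.AddRange(page.Items);
            continuation = page.Continuation;
        } while (!String.IsNullOrEmpty(continuation));
        return all;
    }

    public static void Main()
    {
        var albums = new List<string> { "Pablo Honey", "The Bends", "OK Computer", "Kid A", "Amnesiac" };

        // Five items with a page size of two are fetched in three requests.
        foreach (string album in ReadAll(albums, 2))
            Console.WriteLine(album);
    }
}
```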

Summary

Azure DocumentDB is a fully-managed document database hosted in Microsoft Azure. It is designed to provide rich NoSQL capability to applications, which would otherwise have to deploy and manage some self-hosted NoSQL database. DocumentDB is currently a preview service.