Azure Cosmos DB (NoSQL) – How it works (Partitioning, Indexing, Replication, etc)


Azure Cosmos DB includes a schema-less NoSQL database, which also supports the MongoDB wire protocol and tools such as mongoose, mongochef, and others.

One of the key questions about Azure Cosmos DB is what capabilities it exposes. In this post, I focus on the value proposition of Azure Cosmos DB and show you how it works.
You will find that it's a globally distributed and elastically scalable database, and that you can benefit a great deal from its reliability.

In this post I show the samples as raw RESTful HTTP requests to aid your understanding. In practice you can use the DocumentDB API SDKs (Node.js, .NET, Java, etc.), which hide many of the difficulties (locating partitions, parallelism, etc.).

Guarantee - SLA and Reserved Throughput

Cosmos DB offers an SLA of 99.99% availability, plus reserved throughput with latency under 10 ms on reads and 15 ms on writes. These service levels are completely transparent. (You don't need to manage the details; the database does.)

The performance level (reserved throughput) is determined when the collection is provisioned. Let's see the following example.

Get endpoints from Cosmos DB account

GET https://myaccount01.documents.azure.com/
x-ms-date: Tue, 06 Dec 2016 12:18:29 GMT
authorization: type%3dmas...

HTTP/1.1 200 OK
Content-Type: application/json

{
  "_self": "",
  "id": "myaccount01",
  "_rid": "myaccount01.documents.azure.com",
  "media": "//media/",
  "addresses": "//addresses/",
  "_dbs": "//dbs/",
  "writableLocations": [
    {
      "name": "East US",
      "databaseAccountEndpoint": "https://myaccount01-eastus.documents.azure.com:443/"
    }
  ],
  "readableLocations": [
    {
      "name": "East US",
      "databaseAccountEndpoint": "https://myaccount01-eastus.documents.azure.com:443/"
    }
  ],
  ...

}

Create a database using endpoint

POST https://myaccount01-eastus.documents.azure.com/dbs
x-ms-date: Tue, 06 Dec 2016 12:18:29 GMT
authorization: type%3dmas...
Accept: application/json

{"id":"db01"}

HTTP/1.1 201 Created
Content-Type: application/json

{
  "id": "db01",
  "_rid": "4Gt4AA==",
  "_self": "dbs\/4Gt4AA==\/",
  "_etag": "\"00000900-0000-0000-0000-583698a10000\"",
  "_colls": "colls\/",
  "_users": "users\/",
  "_ts": 1479973022
}

Create a collection in database

POST https://myaccount01-eastus.documents.azure.com/dbs/db01/colls
x-ms-offer-throughput: 10000
x-ms-date: Tue, 06 Dec 2016 12:18:29 GMT
authorization: type%3dmas...
Accept: application/json

{
  "id": "test01"
}

HTTP/1.1 201 Created
Content-Type: application/json

{
  "id": "test01",
  "indexingPolicy": {
    "indexingMode": "consistent",
    "automatic": true,
    "includedPaths": [
      {
        "path": "\/*",
        "indexes": [
          {
            "kind": "Range",
            "dataType": "Number",
            "precision": -1
          },
          {
            "kind": "Hash",
            "dataType": "String",
            "precision": 3
          }
        ]
      }
    ],
    "excludedPaths": [
      
    ]
  },
  "_rid": "6DpzAIPfiQA=",
  "_ts": 1480590914,
  "_self": "dbs\/6DpzAA==\/colls\/6DpzAIPfiQA=\/",
  "_etag": "\"0000dc00-0000-0000-0000-5840064b0000\"",
  "_docs": "docs\/",
  "_sprocs": "sprocs\/",
  "_triggers": "triggers\/",
  "_udfs": "udfs\/",
  "_conflicts": "conflicts\/"
}

Note : The authorization request header value is the URL-encoded form of the string "type=master&ver=1.0&sig={signature}". The steps for getting this value are :
1) Construct your payload string. This string depends on the x-ms-date or Date header value, so it expires soon.
2) Compute the HMAC of this payload with the SHA256 algorithm, and base64-encode the result.
3) Set this encoded string as the signature (sig).
Please see the official document "Access Control in the DocumentDB API" for details. (If you're using PHP, please refer to my previous post "How to use Azure Storage without SDK" too.)
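The three steps above can be sketched in Python as follows. This is only a minimal sketch, not the official implementation: the function name build_auth_header and its parameters are my own, the payload layout (lowercased verb, resource type, resource link, lowercased date, then an empty line) is what I understand from the access control document, and the key below is a dummy value.

```python
import base64
import hashlib
import hmac
import urllib.parse


def build_auth_header(verb, resource_type, resource_link, date, master_key):
    """Return the URL-encoded authorization header value for a master key."""
    # Step 1: the payload. Each field is followed by "\n"; the last field is empty.
    # (Assumed layout; check it against the official access control document.)
    payload = "{}\n{}\n{}\n{}\n\n".format(
        verb.lower(), resource_type.lower(), resource_link, date.lower())
    # Step 2: HMAC-SHA256 with the decoded key, then base64.
    key = base64.b64decode(master_key)
    digest = hmac.new(key, payload.encode("utf-8"), hashlib.sha256).digest()
    signature = base64.b64encode(digest).decode("ascii")
    # Step 3: put the signature into the token and URL-encode the whole string.
    token = "type=master&ver=1.0&sig=" + signature
    return urllib.parse.quote(token, safe="")


# Dummy key for illustration only; a real key comes from your account.
dummy_key = base64.b64encode(b"not-a-real-key").decode("ascii")
header = build_auth_header(
    "POST", "dbs", "", "Tue, 06 Dec 2016 12:18:29 GMT", dummy_key)
print(header)
```

Python emits upper-case hex digits ("%3D"), while the traces in this post show lower-case ones ("%3d"). Percent-encoding is case-insensitive, so both forms decode to the same string.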

As you can see (see the x-ms-offer-throughput request header), this example assigns 10,000 request units (RUs) per second to this collection (named "test01"). A request unit (RU) is the measure of the cost of processing an API request, and Cosmos DB decides how much memory, CPU, and other resources to reserve from the number of RUs. It also affects your Cosmos DB bill.

Here I don't describe the details of how request units (RUs) are calculated, but the number depends on the number of requests and the read/write transactions they require. You can estimate the required RUs using the Request Units Calculator site.
But remember that the result from this site is just reference data for your cost estimation, and it's not perfect. For instance, the complexity of a query affects RU consumption. If you need to send very complex queries, you might need more RUs.

In such a case, you can measure RUs from the actual query. The x-ms-request-charge response header tells you how many RUs a CRUD or query operation consumed. For instance, the following request consumes 3.08 RUs for this single query.

POST https://myaccount01-eastus.documents.azure.com/dbs/db01/colls/test01/docs
x-ms-date: Tue, 06 Dec 2016 03:29:35 GMT
authorization: type%3dmas...
Accept: application/json
Content-Type: application/query+json

{"query":"SELECT * FROM root ... "}

HTTP/1.1 200 Ok
Content-Type: application/json
x-ms-request-charge: 3.08
...

{
  "_rid": "GZsc...",
  "Documents": [
    ...

  ],
  "_count": ...
}

Estimating RUs is important, because if you don't, you will either waste money or suffer performance degradation from frequent throttling errors.
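As a rough illustration of such an estimate, the arithmetic is simply the measured charge multiplied by the expected request rate. The sketch below is my own: the function name, the 500 queries per second, and the 20% headroom are all made-up assumptions, not values from Cosmos DB.

```python
import math


def required_rus_per_second(charge_per_request, requests_per_second, headroom=1.2):
    """Rough RU/s to reserve: measured charge x request rate x safety margin."""
    raw = charge_per_request * requests_per_second * headroom
    # Round away floating-point noise first, then round up to a whole RU.
    return math.ceil(round(raw, 6))


# 3.08 RUs per query (the measured charge) at a hypothetical 500 queries/s
print(required_rus_per_second(3.08, 500))  # prints 1848
```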

Note : If you exceed the reserved throughput per second, the error "Request rate is large" (status 429) occurs. In such a case, check the x-ms-retry-after-ms response header and retry after the specified time. (The SDK does this internally.)
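If you call the REST API yourself, the retry could look like the following Python sketch. It is not how any SDK implements it: send is a hypothetical callable that performs one HTTP request and returns (status, headers, body), and the 1000 ms fallback and the limit of 5 attempts are my own choices.

```python
import time


def send_with_retry(send, max_attempts=5, sleep=time.sleep):
    """Call send() until the response is no longer throttled (status 429)."""
    for _ in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, headers, body
        # The service reports how long to wait, in milliseconds.
        wait_ms = float(headers.get("x-ms-retry-after-ms", "1000"))
        sleep(wait_ms / 1000.0)
    raise RuntimeError("still throttled after %d attempts" % max_attempts)


# Usage with a fake transport: throttled once, then successful.
responses = iter([
    (429, {"x-ms-retry-after-ms": "250"}, ""),
    (200, {"x-ms-request-charge": "3.08"}, "{}"),
])
waits = []
result = send_with_retry(lambda: next(responses), sleep=waits.append)
print(result[0], waits)
```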

Note : Cosmos DB now supports not only per-second RUs (RU/s), but also per-minute RUs (RU/m).

Partitioning

Cosmos DB has a partitioning capability. One collection can span multiple physical partitions, and documents are distributed by the partition key.

Let's say that we set /divisionid as the partition key of the collection. The following creates the collection (named "test01") with this partition key.

POST https://myaccount01-eastus.documents.azure.com/dbs/db01/colls HTTP/1.1
authorization: type%3dmas...
Accept: application/json

{
  "id": "test01",
  "partitionKey": {
    "paths": [
      "/divisionid"
    ],
    "kind": "Hash"
  }
}

HTTP/1.1 201 Created
Content-Type: application/json

{
  "id": "test01",
  "indexingPolicy": {
    "indexingMode": "consistent",
    "automatic": true,
    "includedPaths": [
      {
        "path": "\/*",
        "indexes": [
          {
            "kind": "Range",
            "dataType": "Number",
            "precision": -1
          },
          {
            "kind": "Hash",
            "dataType": "String",
            "precision": 3
          }
        ]
      }
    ],
    "excludedPaths": [
      
    ]
  },
  "partitionKey": {
    "paths": [
      "\/divisionid"
    ],
    "kind": "Hash"
  },
  "_rid": "GZscAJ56rgA=",
  "_ts": 1480675514,
  "_self": "dbs\/GZscAA==\/colls\/GZscAJ56rgA=\/",
  "_etag": "\"00001202-0000-0000-0000-584150c60000\"",
  "_docs": "docs\/",
  "_sprocs": "sprocs\/",
  "_triggers": "triggers\/",
  "_udfs": "udfs\/",
  "_conflicts": "conflicts\/"
}

In this collection, { "divisionid" : "div1", "name" : "engineering" } and { "divisionid" : "div1", "revenue" : 3000000 } reside in the same partition. (Documents with the same key are stored in the same partition.)

If you frequently retrieve documents by /divisionid, you can route your query to the appropriate partition.
Let's see how to route to the partition. (If you're using the SDK, it does this for you and you don't need to do anything.)

First, get the key ranges of the partitions from https://{your endpoint domain}.documents.azure.com/{your collection's uri fragment}pkranges . In this example, "dbs/GZscAA==/colls/GZscAJ56rgA=/" (see the _self property above) is the uri fragment of the collection.

GET https://myaccount01-eastus.documents.azure.com/dbs/GZscAA==/colls/GZscAJ56rgA=/pkranges
authorization: type%3dmas...
Accept: application/json

HTTP/1.1 200 Ok
Content-Type: application/json

{
  "_rid": "GZscAJ56rgA=",
  "PartitionKeyRanges": [
    {
      "_rid": "GZscAJ56rgACAAAAAAAAUA==",
      "id": "0",
      "_etag": "\"00001402-0000-0000-0000-584150c60000\"",
      "minInclusive": "",
      "maxExclusive": "05C1A53DB92960",
      "_self": "dbs\/GZscAA==\/colls\/GZscAJ56rgA=\/pkranges\/GZscAJ56rgACAAAAAAAAUA==\/",
      "throughputFraction": 0.04,
      "_ts": 1480675514
    },
    {
      "_rid": "GZscAJ56rgADAAAAAAAAUA==",
      "id": "1",
      "_etag": "\"00001502-0000-0000-0000-584150c60000\"",
      "minInclusive": "05C1A53DB92960",
      "maxExclusive": "05C1B53DB92960",
      "_self": "dbs\/GZscAA==\/colls\/GZscAJ56rgA=\/pkranges\/GZscAJ56rgADAAAAAAAAUA==\/",
      "throughputFraction": 0.04,
      "_ts": 1480675514
    },
    {
      "_rid": "GZscAJ56rgAEAAAAAAAAUA==",
      "id": "2",
      "_etag": "\"00001602-0000-0000-0000-584150c60000\"",
      "minInclusive": "05C1B53DB92960",
      "maxExclusive": "05C1BF5D153D90",
      "_self": "dbs\/GZscAA==\/colls\/GZscAJ56rgA=\/pkranges\/GZscAJ56rgAEAAAAAAAAUA==\/",
      "throughputFraction": 0.04,
      "_ts": 1480675514
    },
    {
      "_rid": "GZscAJ56rgAFAAAAAAAAUA==",
      "id": "3",
      "_etag": "\"00001702-0000-0000-0000-584150c60000\"",
      "minInclusive": "05C1BF5D153D90",
      "maxExclusive": "05C1C53DB92960",
      "_self": "dbs\/GZscAA==\/colls\/GZscAJ56rgA=\/pkranges\/GZscAJ56rgAFAAAAAAAAUA==\/",
      "throughputFraction": 0.04,
      "_ts": 1480675514
    },
    {
      "_rid": "GZscAJ56rgAGAAAAAAAAUA==",
      "id": "4",
      "_etag": "\"00001802-0000-0000-0000-584150c60000\"",
      "minInclusive": "05C1C53DB92960",
      "maxExclusive": "05C1C9CD673378",
      "_self": "dbs\/GZscAA==\/colls\/GZscAJ56rgA=\/pkranges\/GZscAJ56rgAGAAAAAAAAUA==\/",
      "throughputFraction": 0.04,
      "_ts": 1480675514
    },
    {
      "_rid": "GZscAJ56rgAHAAAAAAAAUA==",
      "id": "5",
      "_etag": "\"00001902-0000-0000-0000-584150c60000\"",
      "minInclusive": "05C1C9CD673378",
      "maxExclusive": "05C1CF5D153D90",
      "_self": "dbs\/GZscAA==\/colls\/GZscAJ56rgA=\/pkranges\/GZscAJ56rgAHAAAAAAAAUA==\/",
      "throughputFraction": 0.04,
      "_ts": 1480675514
    },
    {
      "_rid": "GZscAJ56rgAIAAAAAAAAUA==",
      "id": "6",
      "_etag": "\"00001a02-0000-0000-0000-584150c60000\"",
      "minInclusive": "05C1CF5D153D90",
      "maxExclusive": "05C1D1F5E1A3D4",
      "_self": "dbs\/GZscAA==\/colls\/GZscAJ56rgA=\/pkranges\/GZscAJ56rgAIAAAAAAAAUA==\/",
      "throughputFraction": 0.04,
      "_ts": 1480675515
    },
    {
      "_rid": "GZscAJ56rgAJAAAAAAAAUA==",
      "id": "7",
      "_etag": "\"00001b02-0000-0000-0000-584150c60000\"",
      "minInclusive": "05C1D1F5E1A3D4",
      "maxExclusive": "05C1D53DB92960",
      "_self": "dbs\/GZscAA==\/colls\/GZscAJ56rgA=\/pkranges\/GZscAJ56rgAJAAAAAAAAUA==\/",
      "throughputFraction": 0.04,
      "_ts": 1480675515
    },
    {
      "_rid": "GZscAJ56rgAKAAAAAAAAUA==",
      "id": "8",
      "_etag": "\"00001c02-0000-0000-0000-584150c60000\"",
      "minInclusive": "05C1D53DB92960",
      "maxExclusive": "05C1D7858FADEC",
      "_self": "dbs\/GZscAA==\/colls\/GZscAJ56rgA=\/pkranges\/GZscAJ56rgAKAAAAAAAAUA==\/",
      "throughputFraction": 0.04,
      "_ts": 1480675515
    },
    {
      "_rid": "GZscAJ56rgALAAAAAAAAUA==",
      "id": "9",
      "_etag": "\"00001d02-0000-0000-0000-584150c60000\"",
      "minInclusive": "05C1D7858FADEC",
      "maxExclusive": "05C1D9CD673378",
      "_self": "dbs\/GZscAA==\/colls\/GZscAJ56rgA=\/pkranges\/GZscAJ56rgALAAAAAAAAUA==\/",
      "throughputFraction": 0.04,
      "_ts": 1480675515
    },
    {
      "_rid": "GZscAJ56rgAMAAAAAAAAUA==",
      "id": "10",
      "_etag": "\"00001e02-0000-0000-0000-584150c70000\"",
      "minInclusive": "05C1D9CD673378",
      "maxExclusive": "05C1DD153DB904",
      "_self": "dbs\/GZscAA==\/colls\/GZscAJ56rgA=\/pkranges\/GZscAJ56rgAMAAAAAAAAUA==\/",
      "throughputFraction": 0.04,
      "_ts": 1480675515
    },
    {
      "_rid": "GZscAJ56rgANAAAAAAAAUA==",
      "id": "11",
      "_etag": "\"00001f02-0000-0000-0000-584150c70000\"",
      "minInclusive": "05C1DD153DB904",
      "maxExclusive": "05C1DF5D153D90",
      "_self": "dbs\/GZscAA==\/colls\/GZscAJ56rgA=\/pkranges\/GZscAJ56rgANAAAAAAAAUA==\/",
      "throughputFraction": 0.04,
      "_ts": 1480675515
    },
    {
      "_rid": "GZscAJ56rgAOAAAAAAAAUA==",
      "id": "12",
      "_etag": "\"00002002-0000-0000-0000-584150c70000\"",
      "minInclusive": "05C1DF5D153D90",
      "maxExclusive": "05C1E151F5E18E",
      "_self": "dbs\/GZscAA==\/colls\/GZscAJ56rgA=\/pkranges\/GZscAJ56rgAOAAAAAAAAUA==\/",
      "throughputFraction": 0.04,
      "_ts": 1480675515
    },
    {
      "_rid": "GZscAJ56rgAPAAAAAAAAUA==",
      "id": "13",
      "_etag": "\"00002102-0000-0000-0000-584150c70000\"",
      "minInclusive": "05C1E151F5E18E",
      "maxExclusive": "05C1E1F5E1A3D4",
      "_self": "dbs\/GZscAA==\/colls\/GZscAJ56rgA=\/pkranges\/GZscAJ56rgAPAAAAAAAAUA==\/",
      "throughputFraction": 0.04,
      "_ts": 1480675515
    },
    {
      "_rid": "GZscAJ56rgAQAAAAAAAAUA==",
      "id": "14",
      "_etag": "\"00002202-0000-0000-0000-584150c70000\"",
      "minInclusive": "05C1E1F5E1A3D4",
      "maxExclusive": "05C1E399CD671A",
      "_self": "dbs\/GZscAA==\/colls\/GZscAJ56rgA=\/pkranges\/GZscAJ56rgAQAAAAAAAAUA==\/",
      "throughputFraction": 0.04,
      "_ts": 1480675515
    },
    {
      "_rid": "GZscAJ56rgARAAAAAAAAUA==",
      "id": "15",
      "_etag": "\"00002302-0000-0000-0000-584150c70000\"",
      "minInclusive": "05C1E399CD671A",
      "maxExclusive": "05C1E53DB92960",
      "_self": "dbs\/GZscAA==\/colls\/GZscAJ56rgA=\/pkranges\/GZscAJ56rgARAAAAAAAAUA==\/",
      "throughputFraction": 0.04,
      "_ts": 1480675515
    },
    {
      "_rid": "GZscAJ56rgASAAAAAAAAUA==",
      "id": "16",
      "_etag": "\"00002402-0000-0000-0000-584150c70000\"",
      "minInclusive": "05C1E53DB92960",
      "maxExclusive": "05C1E5E1A3EBA6",
      "_self": "dbs\/GZscAA==\/colls\/GZscAJ56rgA=\/pkranges\/GZscAJ56rgASAAAAAAAAUA==\/",
      "throughputFraction": 0.04,
      "_ts": 1480675515
    },
    {
      "_rid": "GZscAJ56rgATAAAAAAAAUA==",
      "id": "17",
      "_etag": "\"00002502-0000-0000-0000-584150c70000\"",
      "minInclusive": "05C1E5E1A3EBA6",
      "maxExclusive": "05C1E7858FADEC",
      "_self": "dbs\/GZscAA==\/colls\/GZscAJ56rgA=\/pkranges\/GZscAJ56rgATAAAAAAAAUA==\/",
      "throughputFraction": 0.04,
      "_ts": 1480675515
    },
    {
      "_rid": "GZscAJ56rgAUAAAAAAAAUA==",
      "id": "18",
      "_etag": "\"00002602-0000-0000-0000-584150c70000\"",
      "minInclusive": "05C1E7858FADEC",
      "maxExclusive": "05C1E9297B7132",
      "_self": "dbs\/GZscAA==\/colls\/GZscAJ56rgA=\/pkranges\/GZscAJ56rgAUAAAAAAAAUA==\/",
      "throughputFraction": 0.04,
      "_ts": 1480675515
    },
    {
      "_rid": "GZscAJ56rgAVAAAAAAAAUA==",
      "id": "19",
      "_etag": "\"00002702-0000-0000-0000-584150c70000\"",
      "minInclusive": "05C1E9297B7132",
      "maxExclusive": "05C1E9CD673378",
      "_self": "dbs\/GZscAA==\/colls\/GZscAJ56rgA=\/pkranges\/GZscAJ56rgAVAAAAAAAAUA==\/",
      "throughputFraction": 0.04,
      "_ts": 1480675515
    },
    {
      "_rid": "GZscAJ56rgAWAAAAAAAAUA==",
      "id": "20",
      "_etag": "\"00002802-0000-0000-0000-584150c70000\"",
      "minInclusive": "05C1E9CD673378",
      "maxExclusive": "05C1EB7151F5BE",
      "_self": "dbs\/GZscAA==\/colls\/GZscAJ56rgA=\/pkranges\/GZscAJ56rgAWAAAAAAAAUA==\/",
      "throughputFraction": 0.04,
      "_ts": 1480675515
    },
    {
      "_rid": "GZscAJ56rgAXAAAAAAAAUA==",
      "id": "21",
      "_etag": "\"00002902-0000-0000-0000-584150c70000\"",
      "minInclusive": "05C1EB7151F5BE",
      "maxExclusive": "05C1ED153DB904",
      "_self": "dbs\/GZscAA==\/colls\/GZscAJ56rgA=\/pkranges\/GZscAJ56rgAXAAAAAAAAUA==\/",
      "throughputFraction": 0.04,
      "_ts": 1480675515
    },
    {
      "_rid": "GZscAJ56rgAYAAAAAAAAUA==",
      "id": "22",
      "_etag": "\"00002a02-0000-0000-0000-584150c70000\"",
      "minInclusive": "05C1ED153DB904",
      "maxExclusive": "05C1EDB9297B4A",
      "_self": "dbs\/GZscAA==\/colls\/GZscAJ56rgA=\/pkranges\/GZscAJ56rgAYAAAAAAAAUA==\/",
      "throughputFraction": 0.04,
      "_ts": 1480675515
    },
    {
      "_rid": "GZscAJ56rgAZAAAAAAAAUA==",
      "id": "23",
      "_etag": "\"00002b02-0000-0000-0000-584150c70000\"",
      "minInclusive": "05C1EDB9297B4A",
      "maxExclusive": "05C1EF5D153D90",
      "_self": "dbs\/GZscAA==\/colls\/GZscAJ56rgA=\/pkranges\/GZscAJ56rgAZAAAAAAAAUA==\/",
      "throughputFraction": 0.04,
      "_ts": 1480675515
    },
    {
      "_rid": "GZscAJ56rgAaAAAAAAAAUA==",
      "id": "24",
      "_etag": "\"00002c02-0000-0000-0000-584150c70000\"",
      "minInclusive": "05C1EF5D153D90",
      "maxExclusive": "FF",
      "_self": "dbs\/GZscAA==\/colls\/GZscAJ56rgA=\/pkranges\/GZscAJ56rgAaAAAAAAAAUA==\/",
      "throughputFraction": 0.04,
      "_ts": 1480675515
    }
  ],
  "_count": 25
}

Here there are 25 partitions. (The number of partitions is determined automatically from the storage size and RUs of the collection, and you don't manage it yourself.)
The minInclusive and maxExclusive properties are hex strings that give each partition's range of hash values. On the server side, Cosmos DB uses MurmurHash as the partitioning hash algorithm. For example, if the key value is "div200", the computed partition key hash is "05C1EDBFC1A70A". (I uploaded the code for this hash partitioning to "Github - tsmatsuz/DocumentDbPartitionResolveSample".)
That is :

"05C1EDB9297B4A" <= "05C1EDBFC1A70A" < "05C1EF5D153D90"

As a result, this document (which has the key "div200") is placed in the partition with id="23".
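The range lookup itself can be sketched in a few lines of Python. This sketch assumes that plain string comparison of the hex values is enough, as in the comparison above, and it takes the hash as given (it does not compute MurmurHash). The function name is my own, and only three of the 25 ranges are copied here.

```python
def find_partition_range(hashed_key, ranges):
    """Return the pkranges entry whose [minInclusive, maxExclusive) holds the hash."""
    for key_range in ranges:
        if key_range["minInclusive"] <= hashed_key < key_range["maxExclusive"]:
            return key_range
    return None


# Three of the ranges from the pkranges response above.
ranges = [
    {"id": "22", "minInclusive": "05C1ED153DB904", "maxExclusive": "05C1EDB9297B4A"},
    {"id": "23", "minInclusive": "05C1EDB9297B4A", "maxExclusive": "05C1EF5D153D90"},
    {"id": "24", "minInclusive": "05C1EF5D153D90", "maxExclusive": "FF"},
]
match = find_partition_range("05C1EDBFC1A70A", ranges)  # hash of "div200"
print(match["id"])  # prints 23
```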

When you search documents using the partition key "div200", you send the query only to the partition with id 23 as follows, and you don't need to send it to the others. The "GZscAJ56rgA=,23" in the following http request header refers to this partition.

Note : When you send a query to a partitioned collection, the "x-ms-documentdb-query-enablecrosspartition: True" request header shown below is needed.

POST https://myaccount01-eastus.documents.azure.com/dbs/db01/colls/test01/docs
x-ms-continuation: 
x-ms-documentdb-isquery: True
x-ms-documentdb-query-enablecrosspartition: True
x-ms-documentdb-partitionkeyrangeid: GZscAJ56rgA=,23
authorization: type%3dmas...
Accept: application/json
Content-Type: application/query+json

{"query":"SELECT * FROM root WHERE (root[\"divisionid\"] = \"div200\") "}

HTTP/1.1 200 Ok
Content-Type: application/json
x-ms-item-count: 3

{
  "_rid": "GZscAJ56rgA=",
  "Documents": [
    {
      "divisionid": "div200",
      "content": "cont200",
      "id": "75233879-4ebf-4c37-ac9d-a3b373ac4441",
      "_rid": "GZscAJ56rgAOAAAAAACADg==",
      "_self": "dbs\/GZscAA==\/colls\/GZscAJ56rgA=\/docs\/GZscAJ56rgAOAAAAAACADg==\/",
      "_etag": "\"0001ee0b-0000-0000-0000-584152af0000\"",
      "_attachments": "attachments\/",
      "_ts": 1480676011
    },
    {
      "divisionid": "div200",
      ...

    },
    {
      "divisionid": "div200",
      ...

    }
  ],

  "_count": 3
}

Note that a partitioned collection has not only pros (benefits) but also cons (caveats).
Consider the case where you search documents without the partition key. In this case, you must send the query to all the partitions (25 partitions here), and there is significant overhead. So if your application handles various kinds of documents, it's better to divide these documents into different collections, each of which might be a single-partition collection or a partitioned collection, and set the appropriate key for each partitioned collection. (You must choose the partition key carefully.)

If you need to use a partitioned collection and need to search without the partition key, it's better to send the queries in parallel. When using the .NET SDK, you must specify the MaxDegreeOfParallelism property as follows. (Of course, parallel execution needs more RUs.)
Otherwise, you fall into a sequential search across many partitions, and it takes a tremendous amount of time to return the results even if you set a higher performance level (RU). (Internally, each call meets the performance level, but the overhead of the total number of calls dominates.)
In particular, the .NET SDK automatically determines from the LINQ query expression whether multiple calls are required, but it doesn't automatically switch to parallel calls. Remember that you must enable parallel calls yourself.

...
using System;
using System.Linq;
using Microsoft.Azure.Documents.Client;
...

var client = new DocumentClient(
  new Uri("https://myaccount01.documents.azure.com:443/"),
  "RQCgm3a...");  // key

// Cross-partition query, allowing up to 10 concurrent client-side operations
var query = client.CreateDocumentQuery<MyDoc>(
  UriFactory.CreateDocumentCollectionUri("db01", "test01"),
  new FeedOptions
  {
    EnableCrossPartitionQuery = true,
    MaxDegreeOfParallelism = 10
  })
  .Where(p => p.UserName == "Tsuyoshi Matsuzaki");
// The query is executed when the results are enumerated
var tsmatz = query.AsEnumerable().FirstOrDefault();
...
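If you are calling the REST API directly instead of the .NET SDK, the same fan-out idea can be sketched in Python as follows. This is only an illustration, not what the SDK does internally: query_partition is a hypothetical function that sends the query to one partition key range and returns its documents, and max_parallel plays a role similar to MaxDegreeOfParallelism.

```python
from concurrent.futures import ThreadPoolExecutor


def query_all_partitions(query_partition, range_ids, max_parallel=10):
    """Run query_partition(range_id) for every range and merge the documents."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # map() keeps the input order, so results come back range by range.
        per_range = list(pool.map(query_partition, range_ids))
    return [doc for docs in per_range for doc in docs]


# Fake transport for illustration: only range "23" holds matching documents.
def fake_query(range_id):
    return [{"divisionid": "div200"}] * 3 if range_id == "23" else []


docs = query_all_partitions(fake_query, [str(i) for i in range(25)])
print(len(docs))  # prints 3
```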

Note : You can also do this partitioning in client-side code. (In this case, you provision multiple collections instead of server-side partitions.) This is called client-side partitioning.
It leaves you free to customize and apply any partitioning policy (range partitioning, lookup partitioning, etc.) to suit your strategy, and the SDK has several helper classes for this client-side implementation. (By default, the SDK uses a consistent hash ring based on the MD5 hash.)
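To give an idea of what a consistent hash ring does, here is a minimal Python sketch. It is not the SDK's helper class: the HashRing name, the 16 points per collection, and the collection names are all my own assumptions, and only the use of MD5 follows the description above.

```python
import bisect
import hashlib


class HashRing:
    """A minimal consistent hash ring that maps keys to collection names."""

    def __init__(self, collections, points_per_collection=16):
        self._ring = []  # (hash value, collection name), sorted by hash
        for name in collections:
            for i in range(points_per_collection):
                self._ring.append((self._hash("%s#%d" % (name, i)), name))
        self._ring.sort()
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(text):
        return int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16)

    def resolve(self, partition_key):
        """Pick the first ring point at or after the key's hash, wrapping around."""
        index = bisect.bisect_left(self._hashes, self._hash(partition_key))
        return self._ring[index % len(self._ring)][1]


ring = HashRing(["coll0", "coll1", "coll2"])
print(ring.resolve("div200"))
```

The point of the ring is that adding a collection moves only the keys that now belong to the new collection; every other key stays where it was.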

Replication

Another aspect of Cosmos DB scalability is replication. Cosmos DB is not only distributed, but globally distributed.

Let's look at Cosmos DB management in the Azure Portal.
If you click the global map, you can see the "Replicate data globally" blade in the portal. To add a replication region, you just click the location in this blade and save. You can add or remove regions even while your collection is running!

Suppose you set "East US" as the write region and add "Japan East" and "West Europe" as read regions. Then you can read (or query) documents from any of the 3 regions, whichever the client chooses, but you can write documents only to the write region ("East US").
You can get these endpoints from the RESTful API (or SDK) as follows.

GET https://myaccount01.documents.azure.com/
authorization: type%3dmas...

HTTP/1.1 200 Ok
Content-Type: application/json

{
  "_self": "",
  "id": "myaccount01",
  "_rid": "myaccount01.documents.azure.com",
  "media": "//media/",
  "addresses": "//addresses/",
  "_dbs": "//dbs/",
  "writableLocations": [
    {
      "name": "East US",
      "databaseAccountEndpoint": "https://myaccount01-eastus.documents.azure.com:443/"
    }
  ],
  "readableLocations": [
    {
      "name": "East US",
      "databaseAccountEndpoint": "https://myaccount01-eastus.documents.azure.com:443/"
    },
    {
      "name": "Japan East",
      "databaseAccountEndpoint": "https://myaccount01-japaneast.documents.azure.com:443/"
    },
    {
      "name": "West Europe",
      "databaseAccountEndpoint": "https://myaccount01-westeurope.documents.azure.com:443/"
    }
  ],
  "userReplicationPolicy": {
    "asyncReplication": false,
    "minReplicaSetSize": 3,
    "maxReplicasetSize": 4
  },
  "userConsistencyPolicy": {
    "defaultConsistencyLevel": "Session"
  },
  "systemReplicationPolicy": {
    "minReplicaSetSize": 3,
    "maxReplicasetSize": 4
  },
  "readPolicy": {
    "primaryReadCoefficient": 1,
    "secondaryReadCoefficient": 1
  },
  "queryEngineConfiguration": "{\"maxSqlQueryInputLength\":30720,...}"
}

These regions are prioritized, and if you don't specify a region in the SDK, the region with the first priority (in this case, "East US") is selected.
You can change this priority by drag-and-drop in the Azure Portal. (You can also change the write region with a manual failover operation.)
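When you use the REST API without an SDK, you have to pick the read endpoint yourself. The following Python sketch shows one way to do it, assuming that readableLocations is listed in priority order (as in the response above, where "East US" comes first). The function name and the preferred parameter are my own.

```python
def pick_read_endpoint(account, preferred=()):
    """Return the endpoint of the first preferred region that is readable."""
    locations = account["readableLocations"]
    by_name = {loc["name"]: loc["databaseAccountEndpoint"] for loc in locations}
    for name in preferred:
        if name in by_name:
            return by_name[name]
    # No preference matched: fall back to the first (highest priority) region.
    return locations[0]["databaseAccountEndpoint"]


# The readable locations from the account response above.
account = {
    "readableLocations": [
        {"name": "East US",
         "databaseAccountEndpoint": "https://myaccount01-eastus.documents.azure.com:443/"},
        {"name": "Japan East",
         "databaseAccountEndpoint": "https://myaccount01-japaneast.documents.azure.com:443/"},
        {"name": "West Europe",
         "databaseAccountEndpoint": "https://myaccount01-westeurope.documents.azure.com:443/"},
    ]
}
print(pick_read_endpoint(account, ["Japan East"]))
```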

This global multi-region setting affects availability and performance.

From the availability perspective, Cosmos DB can fail over automatically (transparently) between these regions. Cosmos DB also has local replication (4 replicas by default), and these replicas keep the data available through failures inside a region.

From the performance perspective, consider serving applications or services worldwide. All the partitions can be replicated to every region, so your customers can read data from a nearby region with low network latency.

Note that if you consume 20,000 RUs in each of 3 regions, a total of 60,000 RUs is needed for the performance level. But if read operations are spread across the other regions, the RUs needed in each region may eventually go down. When using replication, you must estimate the costs carefully.

Note : This replication mechanism is not a database backup (the same as with other databases). For instance, if you overwrite data by mistake, the wrong data is soon replicated too.
If you want backups, Cosmos DB automatically keeps backups (every 4 hours) in Azure Blob storage. For more details, please see "Automatic online backup and restore with Azure Cosmos DB". (To restore a database, it seems you need to file a support request.)

Indexing (Query optimization)

Cosmos DB provides rich SQL query expressions, including joins, string functions (contains, concat, ...), spatial functions, user-defined functions (custom functions), etc. You can also use LINQ in C#; the query is extracted from the expression tree and executed on the server side.

Cosmos DB automatically indexes documents, and this gives high query throughput in most cases. But remember that not every query is covered by the default index settings, and index design is still important in specific cases. Here I explain how indexing works and how to design it.

Now let's see the following REST example. This example creates a collection with an index policy.

After this index policy is provisioned, when you insert the document { "divisionid" : "div001", "divisioninfo" : { "membercount" : 5, "name" : "engineering" }, "divisionloc" : "Japan" }, the 3 indexes for /divisionid, /divisioninfo/membercount, and /divisioninfo/name are created. When you search documents by these properties, these indexes are used and improve the query performance.

Note that the path property is the document path. When you use "?" (question mark) at the end of a path, it means the exact path. When you use "*" (asterisk), all paths under the specified path are included.

POST https://myaccount01-eastus.documents.azure.com/dbs/db01/colls
authorization: type%3dmas...
Accept: application/json

{
  "id": "test01",
  "indexingPolicy": {
    "automatic": true,
    "indexingMode": "Consistent",
    "includedPaths": [
      {
        "path": "/divisionid/?",
        "indexes": [
          {
            "dataType": "String",
            "precision": -1,
            "kind": "Hash"
          }
        ]
      },
      {
        "path": "/divisioninfo/*",
        "indexes": [
          {
            "dataType": "String",
            "precision": -1,
            "kind": "Hash"
          },
          {
            "dataType": "Number",
            "precision": -1,
            "kind": "Range"
          }
        ]
      }
    ],
    "excludedPaths": [
      {
        "path": "/*"
      }
    ]
  }
}

HTTP/1.1 201 Created
Content-Type: application/json

{
  "id": "test01",
  "indexingPolicy": {
    "indexingMode": "consistent",
    "automatic": true,
    "includedPaths": [
      {
        "path": "\/divisionid\/?",
        "indexes": [
          {
            "kind": "Hash",
            "dataType": "String",
            "precision": -1
          },
          {
            "kind": "Range",
            "dataType": "Number",
            "precision": -1
          }
        ]
      },
      {
        "path": "\/divisioninfo\/*",
        "indexes": [
          {
            "kind": "Hash",
            "dataType": "String",
            "precision": -1
          },
          {
            "kind": "Range",
            "dataType": "Number",
            "precision": -1
          }
        ]
      }
    ],
    "excludedPaths": [
      {
        "path": "\/*"
      }
    ]
  },
  "_rid": "GZscAKjHAQ0=",
  "_ts": 1480945705,
  "_self": "dbs\/GZscAA==\/colls\/GZscAKjHAQ0=\/",
  "_etag": "\"0000e300-0000-0000-0000-5845702d0000\"",
  "_docs": "docs\/",
  "_sprocs": "sprocs\/",
  "_triggers": "triggers\/",
  "_udfs": "udfs\/",
  "_conflicts": "conflicts\/"
}
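To make the "?" and "*" path rules concrete, here is a small Python sketch that lists which leaf paths of the sample document fall under the included paths of this policy. The helper names (leaf_paths, is_included) are my own, arrays are ignored, and the matching is a simplified reading of the rules described in this post, not the service's real algorithm.

```python
def leaf_paths(value, prefix=""):
    """Yield (path, leaf value) pairs such as ("/divisioninfo/name", "engineering")."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from leaf_paths(child, prefix + "/" + key)
    else:
        yield prefix, value


def is_included(path, included_paths):
    """Check a leaf path against included paths that end in "/?" or "/*"."""
    for pattern in included_paths:
        if pattern.endswith("/?") and path == pattern[:-2]:
            return True  # "?" marks one exact path
        if pattern.endswith("/*") and path.startswith(pattern[:-1]):
            return True  # "*" covers everything under the prefix
    return False


doc = {
    "divisionid": "div001",
    "divisioninfo": {"membercount": 5, "name": "engineering"},
    "divisionloc": "Japan",
}
included = ["/divisionid/?", "/divisioninfo/*"]
indexed = [path for path, _ in leaf_paths(doc) if is_included(path, included)]
print(indexed)
# prints ['/divisionid', '/divisioninfo/membercount', '/divisioninfo/name']
```

Here /divisionloc matches neither included path, so it is covered only by the excluded path "/*" and gets no index.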

The index kind must be "Hash" (hash index), "Range" (range index), or "Spatial" (spatial index). The hash index serves equality queries (=). When you want to improve the performance of queries with comparisons (<, >, <=, !=, ...) or sorting (order-by), use the range index.

Note : In Azure Cosmos DB, every document has an id property (which is automatically assigned to all documents). This id property is unique within a single partition and has a hash index.

In the previous example I defined the index policy manually, but by default Cosmos DB indexes every path (/*) in the document tree, with string properties indexed by a hash index and numeric properties by a range index.
The following is the index policy provisioned by default.

{
  "id": "test01",
  "indexingPolicy": {
    "indexingMode": "consistent",
    "automatic": true,
    "includedPaths": [
      {
        "path": "\/*",
        "indexes": [
          {
            "kind": "Range",
            "dataType": "Number",
            "precision": -1
          },
          {
            "kind": "Hash",
            "dataType": "String",
            "precision": 3
          }
        ]
      }
    ],
    "excludedPaths": [
      
    ]
  },
  "_rid": "GZscAIxwGQ0=",
  "_ts": 1480993718,
  "_self": "dbs\/GZscAA==\/colls\/GZscAIxwGQ0=\/",
  "_etag": "\"0000e900-0000-0000-0000-58462bbc0000\"",
  "_docs": "docs\/",
  "_sprocs": "sprocs\/",
  "_triggers": "triggers\/",
  "_udfs": "udfs\/",
  "_conflicts": "conflicts\/"
}

When you set "lazy" as the index mode (indexingMode), the index is updated asynchronously, so write operations (create/update/delete) respond sooner. The default is "consistent".

Note : You can also use etag to avoid conflicts and keep the consistency of data.

The index precision is the number of bytes of precision, and "-1" means the maximum precision.
For example, if you specify "7" bytes as the index precision for a number property (8 bytes), you reduce the index storage, but queries consume more I/O.

As I described before, understanding this is important for real applications.
For example, let's say that you have a date/time property and use it as the query parameter for a timeline search. Cosmos DB is a JavaScript (JSON) based database and has no native date/time type. In such a case, it's better to store this value as epoch time (a numeric time format) and set a range index on this numeric property.
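For instance, a timeline search over epoch values could be prepared like the following Python sketch. The property name createdat and the date range are made up for illustration; the query body uses the parameterized form ("query" plus "parameters") of the REST query API.

```python
from datetime import datetime, timezone


def to_epoch(moment):
    """Convert a timezone-aware datetime to whole seconds since 1970-01-01 UTC."""
    return int(moment.timestamp())


start = to_epoch(datetime(2016, 12, 1, tzinfo=timezone.utc))
end = to_epoch(datetime(2017, 1, 1, tzinfo=timezone.utc))

# A range query on the numeric property, which a range index can serve.
query = {
    "query": "SELECT * FROM root WHERE root.createdat >= @start AND root.createdat < @end",
    "parameters": [
        {"name": "@start", "value": start},
        {"name": "@end", "value": end},
    ],
}
print(start, end)  # prints 1480550400 1483228800
```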

Indexing is transparent in most cases, but index design still matters in specific cases.

Consistency

When data is updated, the update is applied to all replicas. As you know, waiting for all these updates affects the performance of Cosmos DB, so there is a trade-off between consistency and performance.
To meet your business needs, Cosmos DB offers several levels of consistency.

You can choose among 5 consistency levels: "Strong", "Bounded staleness", "Session", "Consistent prefix", and "Eventual" (from left to right, increasingly relaxed). By default, Session consistency is used.
You can specify the x-ms-session-token request header (whose value is returned in an earlier response header) in each RESTful request, and Cosmos DB identifies the specific client session by this header value. (When using the SDK, this is done automatically.)
Session consistency keeps the data consistent within this client session.
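If you don't use an SDK, carrying the session token from one response to the next request could look like this Python sketch. The SessionHeaders class is my own helper, and the token value "0:1234" is a made-up placeholder, not a real token.

```python
class SessionHeaders:
    """Remember the latest session token and attach it to later requests."""

    def __init__(self):
        self.token = None

    def outgoing(self, headers):
        """Return a copy of the request headers with the session token added."""
        merged = dict(headers)
        if self.token is not None:
            merged["x-ms-session-token"] = self.token
        return merged

    def record(self, response_headers):
        """Keep the token returned by the service, if there is one."""
        token = response_headers.get("x-ms-session-token")
        if token:
            self.token = token


session = SessionHeaders()
first = session.outgoing({"Accept": "application/json"})   # no token yet
session.record({"x-ms-session-token": "0:1234"})           # from a write response
second = session.outgoing({"Accept": "application/json"})  # token attached
print(second)
```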

On the other hand, the "Strong" and "Bounded staleness" levels concern the global consistency of replicas. I don't describe the details of the other consistency levels here; please refer to the following resource for more details.

Consistency levels in Azure Cosmos DB
https://docs.microsoft.com/en-us/azure/documentdb/documentdb-consistency-levels

The consistency level is defined at the scope of the database account, but you can override it for specific requests using the x-ms-consistency-level request header. (You can change the consistency level for each request.)

Note : When you handle a batch of operations, you can also use consistent transactional operations with server-side JavaScript stored procedures or triggers.

 

Change logs :

2017/05  renamed "DocumentDB" with "Cosmos DB", added about RU/m and 5 consistency levels (changed from 4 levels)

 
