Designing a Cosmos DB database

Premier Developer consultant Julien Oudot brings us this blog with some considerations when designing a Cosmos DB database.

The intent of this article is to summarize some of the key aspects to keep in mind while designing a database using Cosmos DB.

Azure Cosmos DB is Microsoft's globally distributed, multi-model database. Sometimes referred to as a server-less database, the promise is the ability to transparently and indefinitely scale your data with high throughput, low latency and high reliability.

To achieve this goal, a few best practices need to be applied. There is no need to have advanced DBA skills to design your storage layer in Cosmos DB, but a deep understanding of the data and how it will grow is important.

Choosing the right partition

The partition key may be the most important thing to choose when creating a Cosmos DB collection. It is a boundary for the transaction capabilities and drives the elastic scale capabilities provided by Cosmos DB. Changing its format basically involves migrating the existing data from the previous partition schema to the new one, which is time consuming and might involve down time.

The partition key has to be chosen when creating the collection and this is a mandatory field for multi-partition collections. When choosing the partition key, a few things have to be considered:

  • Read should minimize cross-partition look ups
  • Write should be evenly distributed across different partition key values

More generally, for better performance, the partition key should be a known field when querying the database.

Single Collection versus Multi-Collections

In terms of collection granularity, three strategies are usually observed:

  • One document type per collection
  • One tenant per collection. When Cosmos DB users need to store data for different customers, there would be one collection per customer.
  • All document types and all tenants in a single collection

The first option allows faster querying of a single document type because users know which collection they need to target depending on the query. It also gives the ability to better isolate the throughput consumption per type of queries and to do more fine-grained throughput provisioning. Because server-side functions (stored procedure and trigger) are tied to a collection, this option prevents users from running cross document type transactions, which is probably something needed in the case of a hierarchical document structure (see next section for more details about hierarchical documents).

The second option would be ideal to be able to bill throughput consumption to the tenants hosted on the platform. However, depending on the domain, it might not be relevant to assign some of the document types to a single tenant.

Finally, the third option is to have all tenants and all document types in a single collection. In this case, cross document type transactions are easy and there is a single stored procedure, trigger and user defined function repository, while they would need to be duplicated in the different collections for the previous options. One down side of this approach is the isolation between queries and tenants which is not as good as in the previous solutions. It is also more challenging to monitor the resource consumption per tenant.

Document types - flat structure versus hierarchical

When working in Cosmos DB, making sure that the document size is bounded is important. Indeed, every write request will be performed on the entire document, so its size has to be controlled for better performance.

When defining the document structure, there are two opposing strategies:

  • A flat structure with only 1 level of document containing all sublevels. Also known as data modeling with denormalization
    This approach is good for:

    • “Contains” relationships between entities or one-to-few relationship
    • When the sub-level type doesn’t change frequently and has a limited size
    • Better read performance
      Example of Music Track document:
  • A hierarchical structure where top level documents reference lower level documents. Also known as data modeling with normalization
    This approach is good for:

    • One-to-many or many-to-many relationship
    • Frequent changes of related data
    • Write performance of a single document type


A flat structure is more aligned with the NoSQL approach and allows to read or write a single document per query. However, because of the denormalization it involves, the quantity of data in such structure can become problematic. Due to the data duplication, there are also some challenges in terms of consistency. In practice, a mix of the two approaches is often chosen. For each document type, it is recommended to estimate:

  • How big the document can be?
  • For array types – how many items will there be?
  • What type of query will need to be performed on this type of document? What does it mean from a performance and reliability standpoint (multiple partition access, transactions, …)?

Here is an example of how the hybrid approach would look:


Here, the Author contains a list of book ids represented as independent documents. In the same way, the book documents reference a list of authors. On the contrary, images are embedded inside the author document.

Trigger versus stored procedure

Along with automatic indexing, another great feature of Cosmos DB is the ability to define server-side functions called triggers and stored procedures. Although they have the same name as in the SQL world, triggers are slightly different since they cannot be enforced at the database level. The name of the invoked trigger has to be used when working with documents.



This is actually the same when invoking a stored procedure except that users have to provide a URI or link to the procedure.


In essence, triggers and stored procedures are close concepts in Cosmos DB. With triggers, application code is usually simpler while stored procedures require arguments to be prepared prior to the call, which can also give more flexibility.

Real world use case for trigger

Recently, some customers reached out to us with a simple constraint. They wanted to be able to detect when a child document type (meaning that it depended on a parent document) was created pointing to a parent that did not exist. This was a perfect use case for pre-trigger because they already provided all parameters necessary to perform this check on the server side.

The trigger logic would query the existing database looking for other documents with the same parent id (this field being the partition key for the parent, it should be a fast query). If at least one document is found, then the document can be created. Otherwise an exception would be thrown from the trigger code, which would abort the operation.

Users would need to explicitly set the trigger every time they want to use it (as shown below).



The few design choices described in this article can drastically change the performance and scalability of the database. Keeping this guidance in mind will help you leverage the power of Cosmos DB to build low latency and massively scalable applications anywhere in the world.

Comments (3)

  1. Hey Pam,
    What is the better approach to organize data across documents? To have tons of smaller documents or the have few bigger documents?

    1. Andy Welch says:

      Hi Andrew,
      CosmosDb is slow to write but fast to read. Each write has to propagate to (by default) three copies of the target node. This is, I understand, one of the main governing factors around “eventual consistency”- within the consistency latency it is possible to update or write a document and then read back either old data or fail to read the new document. On top of this, indexes may be updated. This is the same as for RDMBS – an unmanaged index overhead can slow writes down while making reads more convenient. In other words, there is a tradeoff.
      In terms of document size – I suggest you run a simple experiment. Compare a “large” and “small” document size – say 1000 bytes vs 100 – for write, read and update. Perhaps a larger relative difference will be necessary to get a clear result. Obviously you’d expect more data to incur a larger overhead.
      In systems I’ve worked on we’ve typically been driven by cost concerns to use a single collection for multiple document types. Where documents have grown in size we’ve revisited what is included in each with reference to patterns of read and write activity. For instance, a document containing user details included a log of user activity. Each time the user logged in, we read the user document, checked password, and read permissions, we then updated the list of log events and updated the document. This meant writing several kilobytes of data to achieve the addition of an entry approximated 20 bytes in size. An inefficient model we decided. Hence we decided to move these records out of the user document into individual documents. Using id & partition key we used the timestamp of the event as the id and made the partition key a combination of the user’s email and the record event type – for example “|UserEventLogEntry”
      This meant each login event required a write of only 50 bytes compared to several kilobytes. As a result we could also easily query the user event log based on their email address.

    2. Andrew – there’s no hard and fast rule. A large document takes longer to write and read.
      In my experience, often writing small updates to large documents requires needlessly rewriting most of the data in the document. If those writes could be separated from the larger main document the savings are significant. Hence put them in a related document.
      I suggest doing some experiments of your own.

Skip to main content