Recently I had to give an internal talk on why our team uses MongoDB as its database.
Relational Databases: I had experience with SQL Server of course, and when I worked on Azure Service Bus I learned how cool SQL Azure can be (or Microsoft Azure SQL Database, as it's called now). It’s really a relational database in the cloud! I really liked the fact that you could have an application that worked on top of SQL Server and then use pretty much the same application in Azure. As some people know, Service Bus can be installed on your own servers, and you do that by installing it on top of SQL Server. It was amazing to me that you could use the same profiler and the same stored procedures in both. It was almost perfect. Of course, after working with it for a while I got to experience firsthand the disadvantages it can have for a large project. At some point I felt as if we were trying to fit a square peg in a round hole. But I still remember how cool it felt that it was even possible. Overall that was a good learning experience.
Key/Value Storage: Of course when I worked in Azure I also got to work a lot with Azure Table Storage. It’s very interesting too: it’s really scalable and it forces you to understand how your data is going to be accessed so that you can partition it efficiently. And once you do that you don’t have to worry about having too much data. It helps you focus on the data, without having to worry about constraints, schemas, indices, relationships or stored procedures. All you need to worry about is the data. Very simple and very powerful. However, to be honest, I felt it was too much of a jump from the concept of a database I had learned in school. Having to put all the business logic in the application layer felt to me closer to an Excel file than to a database.
And then I got to work with MongoDB.
So what is Mongo?
It’s a Document database. This means that we don’t have the concept of schemas. Your data is persisted in collections of documents.
However, unlike Azure Table Storage, documents can hold nested data. You can have documents inside your documents, whatever makes sense for your application. So there is no need to flatten your data to match a “row”. You can think of a document almost as an object in a programming language.
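To make the "documents inside documents" idea concrete, here is a minimal sketch in plain Python. The order document, its field names and the get_path helper are all made up for illustration; the helper only imitates the dotted-path style that MongoDB uses to address nested fields and array positions, it is not MongoDB's own implementation.

```python
# A hypothetical "order" document, written as a plain Python dict.
# Field names and values are made up for illustration.
order = {
    "_id": 1,
    "customer": {"name": "Ada", "address": {"city": "Seattle"}},
    "items": [
        {"sku": "A-100", "qty": 2},
        {"sku": "B-200", "qty": 1},
    ],
}


def get_path(document, dotted_path):
    """Follow a dotted path such as 'customer.address.city' through nested data.

    Numeric segments index into lists, e.g. 'items.0.sku'.
    Returns None when any part of the path is missing.
    """
    current = document
    for segment in dotted_path.split("."):
        if isinstance(current, list) and segment.isdigit():
            index = int(segment)
            current = current[index] if index < len(current) else None
        elif isinstance(current, dict):
            current = current.get(segment)
        else:
            return None
        if current is None:
            return None
    return current


city = get_path(order, "customer.address.city")
first_sku = get_path(order, "items.0.sku")
```

Nothing had to be flattened into columns here: the address and the list of items live inside the order itself.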
- It allows querying the nested data inside your documents. You can query values in arrays, etc. So the document is not just a blob.
- As I said there is no schema: some documents might have fields that other documents don’t have. This is very handy when adding new fields. You can have old and new data living in the same database without having to do data migration.
- You can add indexes on the nested data! This can make queries really fast.
- It’s super versatile. It allows changing the actual DB engine to one that fits your needs (more on this later)
- It’s really scalable
  - With a relational database you typically need to scale up if you want to improve performance. That means that you need to add more memory or a better disk or even change to a more powerful machine. This gets expensive very quickly and there is a physical limit on how powerful a single machine can get.
  - Mongo instead supports scaling out. That is: you can add extra servers to your existing database servers and use them to increase availability (replication) or performance (sharding).
    - Sharding means splitting your collection into pieces. This can help with throughput because different servers will be serving the different pieces.
    - Replicating means copying data to other servers. This is done through replica sets in MongoDB and they give us data consistency and partition tolerance. If a server goes down we can keep working, as long as a majority of the set's members can still reach each other.
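As a rough illustration of what "splitting your collection into pieces" means, here is a minimal sketch in plain Python. It is a simplification and not how MongoDB routes data: a real cluster assigns ranges of (optionally hashed) shard key values to chunks, while this sketch just hashes the key and takes a modulo. The shard names, the user_id key and both helper functions are assumptions made for the example.

```python
import hashlib

# Hypothetical shard names; a real cluster would have its own topology.
SHARDS = ["shard-a", "shard-b", "shard-c"]


def shard_for(shard_key_value, shards=SHARDS):
    """Pick a shard by hashing the shard key value (simplified routing)."""
    digest = hashlib.sha256(str(shard_key_value).encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]


def split_collection(documents, key, shards=SHARDS):
    """Group documents by the shard that would own them."""
    pieces = {name: [] for name in shards}
    for doc in documents:
        pieces[shard_for(doc[key], shards)].append(doc)
    return pieces


# Made-up user documents, keyed by user_id.
users = [{"user_id": n, "name": f"user{n}"} for n in range(1000)]
pieces = split_collection(users, "user_id")
```

The same key always lands on the same shard, so a lookup by user_id only has to touch one server, while the collection as a whole is spread over all of them.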
MongoDB supports pluggable storage engines. This means that if your application has a particular access pattern (write-heavy or read-heavy) you can select a storage engine that suits it. This avoids the “one size fits all” approach that makes a DB a poor fit for certain use cases.
- MMAPv1 is the original engine and has collection-level concurrency control. In recent versions of Mongo it has been superseded by WiredTiger as the default.
- WiredTiger provides more granular, document-level concurrency control. For some benchmarks it shows better performance (up to 10x) than MMAPv1.
- Facebook released a write-optimized storage engine based on RocksDB.
- TokuMX (open source) uses data compression and Fractal Tree indexing to improve performance.
A profiler inside Mongo?
Unlike some other databases, the profiler is built into Mongo itself. You can easily enable it for your application with just one line:
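In the mongo shell the usual call for this is db.setProfilingLevel(1, 100). As a minimal Python sketch of the same thing, the helper below builds the equivalent "profile" database command. The helper name and the "mydb" database name are assumptions for illustration, and the pymongo lines are left as comments because they need a running server.

```python
def enable_profiling_command(slow_ms=100):
    """Build the `profile` database command for level 1 (slow operations only).

    Level 0 turns the profiler off, level 1 records operations slower than
    `slowms` milliseconds, and level 2 records every operation.
    """
    return {"profile": 1, "slowms": slow_ms}


command = enable_profiling_command(100)

# With pymongo installed and a server running, this could be sent as:
#   from pymongo import MongoClient
#   db = MongoClient()["mydb"]        # "mydb" is a placeholder name
#   db.command(command)
```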
This will profile every operation that takes longer than 100 ms.
Then you can sort to find the slowest ones like this:
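Profiler output is stored in the system.profile collection, and each entry carries a millis field with the operation's duration. In the mongo shell the sort is typically written as db.system.profile.find().sort({millis: -1}). The sketch below shows the same ordering in plain Python over made-up entries; the sample data and the slowest helper are assumptions for illustration.

```python
# Made-up entries shaped like documents in the system.profile collection.
profile_entries = [
    {"op": "query", "ns": "shop.orders", "millis": 180},
    {"op": "update", "ns": "shop.users", "millis": 950},
    {"op": "query", "ns": "shop.items", "millis": 320},
]


def slowest(entries, limit=10):
    """Return entries ordered from slowest to fastest, like sort({millis: -1})."""
    ordered = sorted(entries, key=lambda entry: entry["millis"], reverse=True)
    return ordered[:limit]


top = slowest(profile_entries, limit=2)

# Against a live server, pymongo can ask the database to do the sorting:
#   db["system.profile"].find().sort("millis", -1).limit(2)
```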
In addition to looking at the profiles, you can also check the logs, where slow operations are logged by default.
MapReduce and Aggregation framework
Once you have your data it’s pretty typical that you want to extract information from it. Instead of pulling all the data and doing the processing in your application layer, you can take advantage of the MongoDB server to do some of this.
It has full support for the MapReduce paradigm. This allows you to do grouping operations on your collections such as counts, sums, projections, etc. The best part is that it can run on your shards, so you can process large amounts of data in parallel. And this works both for input and output, so your results can go to sharded collections as well.
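To show the shape of the paradigm, here is a minimal sketch in plain Python. In MongoDB itself the map and reduce functions are written in JavaScript and run on the server; the documents, the two "shards" and all function names below are made up for illustration.

```python
from collections import defaultdict
from functools import reduce


def map_order(doc):
    """Map step: emit one (key, value) pair per document."""
    yield doc["customer"], doc["total"]


def reduce_totals(values):
    """Reduce step: fold the values emitted for one key into a single value."""
    return reduce(lambda a, b: a + b, values)


def map_reduce(documents):
    """Run map then reduce over one batch of documents (e.g. one shard)."""
    emitted = defaultdict(list)
    for doc in documents:
        for key, value in map_order(doc):
            emitted[key].append(value)
    return {key: reduce_totals(values) for key, values in emitted.items()}


# Two made-up shards; each one can be processed independently.
shard_1 = [{"customer": "ada", "total": 30}, {"customer": "bob", "total": 20}]
shard_2 = [{"customer": "ada", "total": 15}]

partials = [map_reduce(shard_1), map_reduce(shard_2)]

# Merge step: reduce the per-shard partial results for each key again.
merged = defaultdict(list)
for partial in partials:
    for key, value in partial.items():
        merged[key].append(value)
result = {key: reduce_totals(values) for key, values in merged.items()}
```

The merge step works because the reduce function can be applied again to its own output, which is also what MongoDB expects of a reduce function.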
MapReduce is good but overkill for most reporting needs. Mongo also has the Aggregation framework, which supports in-memory aggregation operations. These are really fast and almost as powerful as MapReduce.
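An aggregation is written as a pipeline: a list of stages that each document flows through in order. The sketch below shows a small pipeline using the $match, $group and $sort stages next to a plain-Python function that computes the same report locally. The order documents, field names and the revenue_by_customer helper are assumptions made for the example.

```python
from collections import defaultdict

# Made-up order documents for illustration.
orders = [
    {"customer": "ada", "status": "paid", "total": 30},
    {"customer": "bob", "status": "paid", "total": 20},
    {"customer": "ada", "status": "paid", "total": 15},
    {"customer": "bob", "status": "cancelled", "total": 99},
]

# The report written as an aggregation pipeline (stages run in order).
pipeline = [
    {"$match": {"status": "paid"}},
    {"$group": {"_id": "$customer", "revenue": {"$sum": "$total"}}},
    {"$sort": {"revenue": -1}},
]


def revenue_by_customer(documents):
    """Local equivalent of the pipeline above: filter, group and sum, sort."""
    totals = defaultdict(int)
    for doc in documents:
        if doc["status"] == "paid":                  # $match
            totals[doc["customer"]] += doc["total"]  # $group with $sum
    rows = [{"_id": name, "revenue": amount} for name, amount in totals.items()]
    return sorted(rows, key=lambda row: row["revenue"], reverse=True)  # $sort


report = revenue_by_customer(orders)

# With pymongo and a live server: db["orders"].aggregate(pipeline)
```

The point of the pipeline form is that the filtering, grouping and sorting all happen on the server, so only the small report travels back to the application.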
So what do you give up with Mongo?
- No joins (sometimes this means that you need multiple trips to the DB)
- No stored procedures (but you get a powerful Aggregation pipeline and MapReduce support instead)
- No multi-document transactions (you need to implement two-phase commit yourself)
But you get:
- Document DB
- Very powerful Indexes
- Aggregation pipeline and MapReduce support
- Replica sets and Sharding
- Multiple engines
It’s fast, it’s scalable and it’s flexible.