Why Pay for Sorting If You Don't Need It? (1 of 3)

Log-structured Merge-trees (or LSM trees) are used by many famous data stores, such as BigTable, HBase, Rocks DB, Cassandra, etc. LSM trees provide desirable features such as memory efficiency, quick updating, and range query, i.e. access to all the records in a key range, according to one predefined total order of the keys.

Unfortunately, LSM trees can be costly. They require periodical data sorting, which causes high write amplification that shortens SSD life.  Write Amplification means that in order to make room to store k bytes of data, the system moves k * n bytes of other data, resulting an amplification factor of n+1. A SSD’s lifetime is dictated by the number of write operations it can endure. As a result, higher write amplification leads to shorter replacement cycle.

Sorting may amplify write by 40 times when space is tight. At the same time, the amplified SSD writes are accompanied by roughly same number of SSD reads, leading to total amplification of 80, leaving only 1.25% of the I/O bandwidth to clients. Over-provisioning in SSD space and I/O bandwidth is required to avoid the problem.

Furthermore, according to a Samsung study, write patterns of the LSM tree causes SSD fragmentation, leading to more write amplification and performance degradation in the SSD device level.

The cost of data sorting is high. On the other hand, some may argue that sorting is necessary to achieve high memory efficiency. After all, LSM trees do not have to keep all keys in memory because data are stored in order. It seems that we have to take memory efficiency and data sorting as an all or nothing package.

Or do we?

Introducing Venger Index, sorting not required, yet memory efficient. It is a hash based index that maps a key to the corresponding record’s location on SSD. It uses less than 3 bytes of DRAM per key. More importantly, since data records do not have to be stored in order, we can experiment with different data layouts, and reduce write amplification by 70% comparing to LSM trees with similar space over provisioning. Not having to worry about record ordering, we have more freedom in taking advantage of open channel SSD or multi-stream SSD interfaces, reducing write amplification to 1 to 2, and potentially removing SSD fragmentation.

Venger Index is best suited for distributed flash stores require low latency and low cost, with average record size 1KB or bigger. It does not support range query, which can be implemented by maintaining indexing data structure, e.g. B-tree, separately. However, this feature is better to be implemented by an upper layer who understands the semantics of the data, so as to avoid load balancing problems in distributed store, and to avoid imposing the cost on data sets that do not need this feature.

In our next posts we explain why LSM tree compaction is costly, and how we build Venger Index to achieve low cost and high performance.