Alex Chi Z.

Key-Value Separation in LSM Storage Engines


Introduction

General-purpose LSM storage engines, such as LevelDB and RocksDB, store a set of user-written key-value pairs together in an ordered SST (Sorted String Table). During the compaction process, the engine merges the upper level SSTs with the lower level SSTs to generate a new SST file. In this process, both the keys and values inside the SST are rewritten, resulting in significant write amplification. If the size of the values is much larger than the keys, the write amplification caused by compaction can introduce substantial overhead.

In WiscKey (FAST ‘16), the authors proposed an SSD-friendly design for LSM tree-based storage engines. It reduces write amplification by separating keys and values. Key-value separation involves storing large values elsewhere and storing a value pointer (vptr) in the LSM tree pointing to the location of the value. In WiscKey, this place where the value is stored is called the Value Log (vLog). Consequently, during LSM tree compaction, there is no need to rewrite the values, only the organization of the key indexes needs to be rearranged. This greatly reduces write amplification and mitigates SSD wear out.
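
As a rough illustration (not any particular engine's on-disk format), the LSM tree then stores entries like the following, where `ValuePointer` and its fields are names made up for this sketch:

```rust
/// Illustrative only: a value pointer recording where a large value lives
/// in the value log, so the LSM tree stores `key -> ValuePointer` instead
/// of `key -> value`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ValuePointer {
    file_no: u32, // which vLog file the value was appended to
    offset: u64,  // byte offset of the value within that file
    len: u32,     // length of the value, so it can be read back in one I/O
}

/// What the LSM tree stores for one user key after key-value separation.
enum LsmValue {
    /// Small values stay inline in the SST, as in a normal LSM engine.
    Inline(Vec<u8>),
    /// Large values are replaced by a pointer into the value log.
    Separated(ValuePointer),
}
```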

The idea of key-value separation sounds straightforward, but a practical implementation has to answer several questions: where and in what format to store the separated values, how to garbage-collect stale values after updates and deletes, and how the read path changes.

Now, it has been seven years since the publication of the WiscKey paper. In the industry, a number of key-value separation LSM storage engine implementations have emerged. In this article, I will introduce four open-source implementations of key-value separation storage engines, compare their differences, and thereby highlight the tradeoffs in implementing key-value separation, providing readers with more insights for selecting a key-value separation storage engine in scenarios with large values.

If you are reading this blog post, you might also be interested in mini-lsm, a tutorial on building an LSM-tree storage engine written by me, based on my previous experience with RocksDB and other key-value storage engines.

Write Path and Storage Layout

In this article, unless otherwise stated, we assume that the storage engine has enabled Write-Ahead Logging (WAL) and uses leveled compaction for LSM trees.

Without Key-Value Separation

Writing to LSM

First, let’s take a look at how normal LSM storage engines handle write requests. When a write request from a user is received, the engine directly writes the key-value pair to both the memtable and the WAL (Write-Ahead Log). The memtable typically uses a skip list to store key-value pairs. The WAL is used for persistence to recover the information in the memtable after a server failure. At this point, the user can be notified that the write operation has been persisted.
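
A minimal sketch of this foreground write path, with hypothetical `Wal` and `MemTable` types standing in for the real structures:

```rust
use std::collections::BTreeMap;

/// Hypothetical write-ahead log: appends records and syncs them to disk.
struct Wal {
    records: Vec<(Vec<u8>, Vec<u8>)>, // stand-in for an append-only log file
}

impl Wal {
    fn append_and_sync(&mut self, key: &[u8], value: &[u8]) {
        // A real engine appends to a log file and calls fsync here.
        self.records.push((key.to_vec(), value.to_vec()));
    }
}

/// Hypothetical memtable; real engines usually use a skip list, but a
/// BTreeMap gives the same ordered-map semantics for this sketch.
struct MemTable {
    map: BTreeMap<Vec<u8>, Vec<u8>>,
}

fn put(wal: &mut Wal, memtable: &mut MemTable, key: &[u8], value: &[u8]) {
    // 1. Persist the write in the WAL so it survives a crash.
    wal.append_and_sync(key, value);
    // 2. Apply it to the memtable so it is visible to reads.
    memtable.map.insert(key.to_vec(), value.to_vec());
    // 3. Only now can the engine acknowledge the write to the user.
}
```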

The background operations of the LSM engine mainly include two tasks: flushing memtables to disk and compacting the LSM tree. At any given time, only one memtable is actively being written to in memory, while the other memtables are immutable. When the active memtable exceeds its size limit, the engine converts its contents into an SST (Sorted String Table), flushes it to disk, and deletes the corresponding WAL. An SST is a file that stores ordered key-value pairs. The data in an SST is stored in blocks and compressed block by block. After the key-value data, the SST ends with several index blocks, which record the first key of each data block so that a key can be located quickly. An SST may also include other structures, such as Bloom filters, to accelerate queries.
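
A simplified sketch of the flush step: sorted pairs are grouped into blocks, and an index records the first key of each block. Real SSTs cut blocks by byte size and add compression, checksums, and Bloom filters, none of which are modeled here:

```rust
/// One data block: a run of consecutive sorted key-value pairs.
struct Block {
    entries: Vec<(Vec<u8>, Vec<u8>)>,
}

/// A toy SST: data blocks plus an index of (first key, block number),
/// which lets a reader binary-search for the block that may hold a key.
struct Sst {
    blocks: Vec<Block>,
    index: Vec<(Vec<u8>, usize)>,
}

/// Build a toy SST from already-sorted pairs (as produced by a memtable scan).
/// Real engines cut blocks by byte size; this sketch cuts by entry count.
fn flush_to_sst(sorted_pairs: Vec<(Vec<u8>, Vec<u8>)>, entries_per_block: usize) -> Sst {
    let mut sst = Sst { blocks: Vec::new(), index: Vec::new() };
    for chunk in sorted_pairs.chunks(entries_per_block) {
        // The index records the first key of every data block.
        sst.index.push((chunk[0].0.clone(), sst.blocks.len()));
        sst.blocks.push(Block { entries: chunk.to_vec() });
    }
    sst
}
```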

Level 0 of the LSM tree consists of multiple SST files whose key ranges may overlap. From level 1 onwards, the key ranges of SST files within the same level do not overlap. If a level exceeds its size limit, the storage engine compacts its files with the overlapping files in the next lower level and writes the results to that lower level. As a result, the keys and values written by the user gradually move toward the lowest level of the engine over time.

When the LSM tree is initially empty, compacting SST files level by level from level 0 to L1, …, down to the last level introduces unnecessary write amplification. Therefore, most LSM storage engines can compact directly to the base level (Lbase). In this mode, the size that each level can accommodate is adjusted dynamically, and SST files may be compacted directly from level 0 to the bottommost level (for example, level 6), reducing write amplification while the LSM tree is initially being populated.

BadgerDB

In BadgerDB, large values are directly written to the value log (vLog) when the user requests to write them. The entire process is shown in the following diagram.

Writing to Badger

When a user requests to write a key-value pair to Badger, if the value is smaller than a threshold, the entire process is the same as in a normal LSM engine. If the value is greater than or equal to the threshold, the write proceeds as follows: the large value is first appended to the vLog, and the key, together with a value pointer (<fileno, offset>) into the vLog, is then written into the memtable; the value itself is persisted only once, in the vLog, so the foreground write does not pay extra write amplification for it.
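
The following is a minimal sketch of that branching decision, with illustrative Rust types rather than Badger's actual Go API; `value_threshold`, `ValueLog`, and `LsmEntry` are names made up for this example:

```rust
/// Illustrative value-log handle: appends a value and reports where it landed.
struct ValueLog {
    file_no: u32,
    data: Vec<u8>, // stand-in for the current vLog file
}

struct ValuePointer {
    file_no: u32,
    offset: u64,
    len: u32,
}

enum LsmEntry {
    Inline(Vec<u8>),
    Pointer(ValuePointer),
}

/// Sketch of the Badger-style write decision: small values go through the
/// normal path, large values are appended to the vLog first and only a
/// pointer enters the LSM tree.
fn write(vlog: &mut ValueLog, value_threshold: usize, value: &[u8]) -> LsmEntry {
    if value.len() < value_threshold {
        // Same as a normal LSM engine: the value itself is stored in the tree.
        LsmEntry::Inline(value.to_vec())
    } else {
        // Append the value to the vLog, then store only the pointer
        // in the memtable / SSTs.
        let offset = vlog.data.len() as u64;
        vlog.data.extend_from_slice(value);
        LsmEntry::Pointer(ValuePointer {
            file_no: vlog.file_no,
            offset,
            len: value.len() as u32,
        })
    }
}
```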

TerarkDB

Writing to TerarkDB

Since TerarkDB is based on RocksDB, the foreground write process is the same as in RocksDB: key-value pairs are written to the memtable and the WAL, and the memtable is later flushed to disk in the background. Compared to Badger, a large value is written to the WAL in the foreground and then written again to a v-SST (Value SST) during the flush, which introduces an additional 1x write amplification.

During the background flush, TerarkDB splits the memtable based on the size of the values. The small values are bundled together with their keys into a regular SST, following the same process as a normal LSM engine. The large values and their corresponding keys are written to a v-SST, which is itself in SST format. Since the v-SST contains index blocks, a value can be located quickly by its key during queries. For each large value, the LSM tree stores only the key and the file number of the v-SST that holds the value, i.e., <key, fileno>, without the offset of the large value.
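
As a rough sketch of that flush-time split (illustrative types, not TerarkDB's real code), note how the pointer recorded in the regular SST is only a file number, matching the <key, fileno> layout above:

```rust
/// Output of flushing one memtable in a TerarkDB-style engine:
/// small values go to a regular SST, large values go to a v-SST,
/// and the regular SST keeps <key, fileno> entries for the large ones.
struct FlushOutput {
    sst: Vec<(Vec<u8>, SstValue)>,  // sorted key -> inline value or file number
    v_sst: Vec<(Vec<u8>, Vec<u8>)>, // sorted key -> large value
}

enum SstValue {
    Inline(Vec<u8>),
    LargeValueIn { file_no: u64 }, // no offset: the v-SST's own index locates the key
}

fn flush_split(
    sorted_pairs: Vec<(Vec<u8>, Vec<u8>)>,
    threshold: usize,
    v_sst_file_no: u64,
) -> FlushOutput {
    let mut out = FlushOutput { sst: Vec::new(), v_sst: Vec::new() };
    for (key, value) in sorted_pairs {
        if value.len() < threshold {
            out.sst.push((key, SstValue::Inline(value)));
        } else {
            // The large value keeps its key, so the v-SST stays sorted by key
            // and can be indexed like any other SST.
            out.v_sst.push((key.clone(), value));
            out.sst.push((key, SstValue::LargeValueIn { file_no: v_sst_file_no }));
        }
    }
    out
}
```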

Because large values are stored in SST format, the large values within a single v-SST are sorted by key. In addition, by reusing existing RocksDB machinery, TerarkDB can apply block-level compression to v-SSTs. For this reason, TerarkDB’s default key-value separation threshold is relatively small, around 512B, whereas BadgerDB’s default is 4K.

Titan and BlobDB

Writing to Titan / BlobDB

Titan is a plugin for RocksDB, so the overall foreground process is almost the same as RocksDB. The background process is similar to TerarkDB, the only difference being how large values are stored. Titan stores large values in a special format called a blob file. A blob file contains key-value pairs ordered by key, with each pair compressed individually. Therefore, after a flush, a large value is represented in the LSM tree as <key, <fileno, offset>>.

BlobDB uses the same storage layout and implements a similar write path to Titan. The blob file also contains ordered key-value pairs which are compressed per-record.
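
Below is a minimal sketch of appending a record to such a blob file and handing back the (fileno, offset) pointer. The types and record layout are illustrative only; the real blob formats also carry headers, checksums, and metadata, and use a real compressor instead of the placeholder here:

```rust
/// Illustrative blob file: records sorted by key, each record compressed
/// on its own (per-record), unlike TerarkDB's per-block compression.
struct BlobFile {
    file_no: u32,
    data: Vec<u8>,
}

struct BlobPointer {
    file_no: u32,
    offset: u64,
    len: u32,
}

/// Placeholder for a per-record compressor (real engines use e.g. LZ4/Snappy).
fn compress_record(key: &[u8], value: &[u8]) -> Vec<u8> {
    // For this sketch, just concatenate a length-prefixed key and the value.
    let mut buf = Vec::new();
    buf.extend_from_slice(&(key.len() as u32).to_le_bytes());
    buf.extend_from_slice(key);
    buf.extend_from_slice(value);
    buf
}

/// Append one record and return the pointer the LSM tree will store as
/// <key, <fileno, offset>>.
fn append_blob(file: &mut BlobFile, key: &[u8], value: &[u8]) -> BlobPointer {
    let record = compress_record(key, value);
    let offset = file.data.len() as u64;
    file.data.extend_from_slice(&record);
    BlobPointer {
        file_no: file.file_no,
        offset,
        len: record.len() as u32,
    }
}
```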

Comparison of Write Paths

| Storage Engine | BadgerDB | TerarkDB | Titan / BlobDB |
|---|---|---|---|
| Foreground writing of large values (affects write amplification) | Written directly to vLog | Written to WAL, then flushed to v-SST | Written to WAL, then flushed to blob file |
| Value pointer contents | <fileno, offset> | <fileno> | <fileno, offset> |
| Sorting of vLog (affects scan performance) | In the order of writes | Sorted by key | Sorted by key |
| Compression granularity of vLog (affects space amplification) | Per-record | Per-block | Per-record |
| Whether vLog includes indexes | No | Yes | No |
| Default key-value separation threshold | 4K | 512B | 4K |

Garbage Collection

After users delete or update keys, a normal LSM engine removes the old records during compaction, as shown in the following figure: if the compaction process finds different versions of a key-value pair in the upper and lower levels, or delete tombstones in the upper level, the engine drops the keys in the lower level and keeps only one copy of the key-value pair in the new SST.

Compaction in normal LSM
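
As a sketch of that merge rule, assuming the newer level simply shadows the older one for each key (real compactions also have to honor snapshots and only drop tombstones at the bottommost level):

```rust
use std::collections::BTreeMap;

#[derive(Clone)]
enum Entry {
    Put(Vec<u8>),
    Tombstone, // a delete marker written by the user
}

/// Merge upper-level entries over lower-level entries: for each key keep only
/// the newest version, and drop keys whose newest version is a tombstone.
fn merge_levels(
    upper: BTreeMap<Vec<u8>, Entry>, // newer data
    lower: BTreeMap<Vec<u8>, Entry>, // older data
) -> BTreeMap<Vec<u8>, Vec<u8>> {
    let mut merged = lower;
    // Upper-level entries shadow lower-level ones for the same key.
    for (key, entry) in upper {
        merged.insert(key, entry);
    }
    merged
        .into_iter()
        .filter_map(|(key, entry)| match entry {
            Entry::Put(value) => Some((key, value)),
            Entry::Tombstone => None, // the key was deleted; drop it
        })
        .collect()
}
```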

After separating large values from the LSM tree, we need to handle garbage collection of the large values to reduce the amount of garbage data on the disk and minimize space amplification caused by key-value separation.

Garbage Collection in BadgerDB

Garbage collection in Badger

As shown in the above figure, BadgerDB stores a garbage estimate for each vLog file in the DISCARD file. When a user updates or deletes a key, the counter of the corresponding vLog file is incremented. When the counter reaches a certain threshold, the vLog file is rewritten.

During the vLog rewrite, BadgerDB checks whether each key being processed still exists in the LSM tree. If the key no longer exists or has been updated, it is skipped.
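
A hedged sketch of this rewrite loop is shown below; the types and the `lsm_lookup` map are illustrative stand-ins, not Badger's real Go structures:

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq)]
struct ValuePointer {
    file_no: u32,
    offset: u64,
}

/// Rewrite one vLog file: keep only entries whose pointer in the LSM tree
/// still points at this file, append them to a new vLog, and collect the
/// updated pointers that must be written back to the LSM tree.
fn rewrite_vlog(
    old_file_no: u32,
    old_entries: Vec<(Vec<u8>, Vec<u8>, ValuePointer)>, // (key, value, old vptr)
    lsm_lookup: &HashMap<Vec<u8>, ValuePointer>,        // stand-in for an LSM read
    new_file_no: u32,
) -> Vec<(Vec<u8>, ValuePointer)> {
    let mut new_offset = 0u64;
    let mut write_back = Vec::new();
    for (key, value, old_vptr) in old_entries {
        match lsm_lookup.get(&key) {
            // Still live and still pointing at the old file: relocate it.
            Some(current) if *current == old_vptr && old_vptr.file_no == old_file_no => {
                let new_vptr = ValuePointer { file_no: new_file_no, offset: new_offset };
                new_offset += value.len() as u64;
                // The new pointer must be written back into the LSM tree,
                // which is why this GC competes with foreground writes.
                write_back.push((key, new_vptr));
            }
            // Deleted or updated since it was written: skip it (it is garbage).
            _ => {}
        }
    }
    write_back
}
```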

After the new vLog is generated, BadgerDB writes the new positions of these key-value pairs back to the LSM tree. Therefore, the GC process of BadgerDB will impact the user write throughput of the LSM tree. This leads to a question: if a user has already deleted a key, but during GC, the old value corresponding to this key is written back to the LSM tree, will it cause correctness issues when reading?

Garbage collection consistency in Badger

As shown in the above figure, BadgerDB internally stores keys as a combination of key + timestamp. When writing back to the LSM tree, BadgerDB therefore does not need to check whether the current key has been deleted or updated by the user. This does mean that a newer value written by the user (its key-vptr pair) can end up below the older key-vptr pair written back by GC. However, during reads, BadgerDB scans all levels and resolves versions by timestamp, which avoids this potential correctness issue. I will explain this in more detail in the read path section below.

Compaction and Garbage Collection in TerarkDB

Garbage collection in TerarkDB

In TerarkDB, v-SST garbage collection does not require writing back to the LSM tree, reducing the impact of GC on user foreground traffic.

As shown in the above figure, TerarkDB selects the v-SSTs to garbage-collect in the background based on the amount of garbage in each v-SST. The GC process iterates through each key to check whether it has been deleted or updated in the LSM tree. After GC, new v-SSTs are produced, and TerarkDB records the dependency relationships between v-SSTs in the MANIFEST. If the key accessed by the user points to an older v-SST, TerarkDB follows the dependency relationships to the latest v-SST and reads the value from it. Overall, the distinguishing feature of TerarkDB is that garbage collection does not affect foreground write traffic.
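
The dependency lookup can be pictured as a small map from replaced v-SST file numbers to their successors, as in the illustrative sketch below (TerarkDB actually records these edges in the MANIFEST):

```rust
use std::collections::HashMap;

/// Illustrative dependency map: an old v-SST file number points to the
/// v-SST that replaced it after garbage collection.
struct VsstDependencies {
    replaced_by: HashMap<u64, u64>,
}

impl VsstDependencies {
    /// Follow the dependency chain from the file number stored in the LSM
    /// tree to the v-SST that currently holds the value.
    fn resolve(&self, mut file_no: u64) -> u64 {
        while let Some(&newer) = self.replaced_by.get(&file_no) {
            file_no = newer;
        }
        file_no
    }
}
```

The longer these chains grow, the more hops a read needs, which is what the rebuild compaction described below keeps in check.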

Compaction in TerarkDB

During the LSM tree compaction process, TerarkDB also updates the file number. As shown in the figure above, if the engine detects that the file corresponding to a large value has been deleted during compaction, it writes the new file number into the SST generated by compaction based on the dependency relationships.

In practical production environments, this GC method may cause the bottom layer of the LSM tree to depend on almost all v-SSTs, and a particular v-SST may be referenced by many SSTs, resulting in performance degradation. TerarkDB performs a special rebuild compaction for these SSTs that have complex dependency relationships. During the rebuild process, the engine simplifies the dependency chain as much as possible to ensure the normal operation of the system.

Garbage Collection in Titan

Titan’s regular garbage collection (regular GC) adopts a similar strategy to BadgerDB, using statistics to determine which blob files to collect, and then rewriting the corresponding blob files and writing the new vptr back to the LSM tree. The entire process is shown in the following figure.

Regular garbage collection in Titan

Since Titan is developed as a plugin of RocksDB, it does not have access to the internal version number of each key. Therefore, when writing back to the LSM tree, it needs to register a WriteCallback to avoid race conditions where the GC process overwrites later user updates. This greatly impacts the user write throughput during the engine’s GC process.
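
The idea behind that callback can be sketched as a compare-before-write: the pointer is written back only if the key still points at the record GC read, otherwise a newer user write has already replaced it. The Rust below illustrates this check and is not RocksDB's actual WriteCallback API:

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq)]
struct BlobPointer {
    file_no: u64,
    offset: u64,
}

/// Write the relocated pointer back only if the key still points at the old
/// blob record, so GC never clobbers a later user update.
fn write_back_if_unchanged(
    lsm: &mut HashMap<Vec<u8>, BlobPointer>, // stand-in for the LSM tree
    key: &[u8],
    expected_old: BlobPointer,
    relocated: BlobPointer,
) -> bool {
    match lsm.get(key) {
        Some(current) if *current == expected_old => {
            // In the real engine this check and the write must be atomic with
            // respect to user writes, which is why it costs write throughput.
            lsm.insert(key.to_vec(), relocated);
            true
        }
        _ => false, // the user updated or deleted the key; drop the GC write
    }
}
```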

To solve this problem, Titan introduces another GC scheme called level merge, as shown in the figure below. During the background compaction process of the LSM tree, Titan rewrites the corresponding blob files and updates the vptr in the SST, without interfering with foreground user writes.

Level-merge garbage collection in Titan

Titan’s Level Merge is only enabled in the last two levels of the LSM tree.

Garbage Collection in BlobDB

BlobDB uses a relatively simple garbage collection strategy controlled by two parameters: the GC age cutoff and the GC threshold. When a blob file is old enough to fall within the age cutoff, the engine triggers a compaction task that rewrites the blob file as well as the LSM tree SSTs linking to it. Likewise, when the garbage ratio of a blob file exceeds the GC threshold, the engine triggers such a compaction.

Compaction in BlobDB

BlobDB does not support directly compacting/garbage-collecting blob files. Blob files are only compacted when SSTs linking to the blob files are compacted.
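
A sketch of the resulting rewrite decision, using `age_rank` and `garbage_ratio` as illustrative stand-ins for the statistics BlobDB tracks (the parameter names below are not BlobDB's exact option names):

```rust
/// Illustrative view of one blob file's bookkeeping.
struct BlobFileStats {
    age_rank: f64,      // 0.0 = newest blob file, 1.0 = oldest blob file
    garbage_ratio: f64, // fraction of its bytes that are stale
}

/// Decide whether an SST compaction touching this blob file should also
/// rewrite it, following the two knobs described above.
fn should_rewrite_blob(stats: &BlobFileStats, age_cutoff: f64, garbage_threshold: f64) -> bool {
    // Old enough to fall within the age cutoff, or carrying too much garbage.
    stats.age_rank >= 1.0 - age_cutoff || stats.garbage_ratio >= garbage_threshold
}
```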

Comparison of Garbage Collection Strategies

| Storage Engine | BadgerDB | TerarkDB | Titan | BlobDB |
|---|---|---|---|---|
| vLog GC task | Rewrite vLog, write back to LSM | Merge v-SSTs + rebuild | Merge blob files, write back to LSM (regular GC) | N/A |
| Compaction task | N/A | Update file numbers to the latest | Rewrite blob files (Level Merge) | Rewrite blob files |

Read Path of Large Values

As shown in the figure below, in a normal LSM storage engine, reading a single key requires checking the memtables, all of the (possibly overlapping) L0 SSTs, and at most one SST in each subsequent level. As soon as the key is found, the result can be returned to the user.

Get in normal LSM
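
A toy version of this lookup order, with in-memory maps standing in for the memtable and the SSTs of each level:

```rust
use std::collections::BTreeMap;

/// A toy view of the read path: one memtable plus a list of levels,
/// where L0 may hold several overlapping tables and L1+ hold one sorted run.
struct ToyLsm {
    memtable: BTreeMap<Vec<u8>, Vec<u8>>,
    levels: Vec<Vec<BTreeMap<Vec<u8>, Vec<u8>>>>, // levels[0] = L0 tables, newest first
}

fn get(lsm: &ToyLsm, key: &[u8]) -> Option<Vec<u8>> {
    // 1. Newest data first: the memtable.
    if let Some(v) = lsm.memtable.get(key) {
        return Some(v.clone());
    }
    // 2. Then each level from top to bottom; the first hit is the newest
    //    version, so the search can stop immediately.
    for level in &lsm.levels {
        for table in level {
            if let Some(v) = table.get(key) {
                return Some(v.clone());
            }
        }
    }
    None
}
```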

In the scenario of key-value separation, the reading path differs from the normal LSM tree.

BadgerDB

Get in BadgerDB

As shown in the above figure, in BadgerDB, because newly-written data may be placed below previously-written data, each read operation needs to scan all levels. After traversing the memtables, L0, and one SST in each subsequent level, the latest version of the key is found, and its vptr is then used to read the value from the vLog.

During a scan, since values in the vLog are stored in write order, a sequential scan over keys tends to turn into random reads on the vLog. To avoid this performance degradation, BadgerDB prefetches the value of the next entry from disk before the user advances the iterator, which makes better use of SSD bandwidth and improves scan performance.
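
The prefetching idea can be sketched as follows: while the caller consumes the current entry, the iterator has already issued the read for the next value pointer. This is an illustration using one thread per prefetch, not Badger's actual Go implementation:

```rust
use std::thread::{self, JoinHandle};

#[derive(Clone, Copy)]
struct ValuePointer {
    file_no: u32,
    offset: u64,
}

/// Stand-in for a random read into the vLog.
fn read_value(vptr: ValuePointer) -> Vec<u8> {
    // A real implementation would pread() from the vLog file at vptr.offset.
    format!("value@{}:{}", vptr.file_no, vptr.offset).into_bytes()
}

/// Iterate over key/vptr pairs from the LSM tree, prefetching the next value
/// in the background so the random vLog reads overlap with user processing.
fn scan_with_prefetch(entries: Vec<(Vec<u8>, ValuePointer)>) -> Vec<(Vec<u8>, Vec<u8>)> {
    let mut results = Vec::new();
    let mut pending: Option<JoinHandle<Vec<u8>>> = None;
    for i in 0..entries.len() {
        // Kick off the read for the *next* entry before handling this one.
        let next = entries.get(i + 1).map(|(_, vptr)| *vptr);
        let next_handle = next.map(|vptr| thread::spawn(move || read_value(vptr)));
        // Value for the current entry: either already prefetched, or read now.
        let value = match pending.take() {
            Some(handle) => handle.join().expect("prefetch thread panicked"),
            None => read_value(entries[i].1),
        };
        results.push((entries[i].0.clone(), value));
        pending = next_handle;
    }
    results
}
```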

TerarkDB

Get in TerarkDB

As shown in the above figure, TerarkDB first looks up the key in the LSM tree to find the file number of the v-SST that holds the value. It then follows the dependency relationships to the latest v-SST, locates the key in that v-SST’s index, and reads the value.

Titan and BlobDB

Get in Titan

When reading, Titan and BlobDB look up the vptr in the LSM tree and then access the corresponding blob file at the recorded offset.

Comparison of Read Paths

| Storage Engine | BadgerDB | TerarkDB | Titan / BlobDB |
|---|---|---|---|
| Scan performance | Poor, requires prefetching | Good | Good |
| Point read (key) | Needs to access all levels | Stops at the first match | Stops at the first match |
| Reading the value | Reads directly using the offset | Needs to access the v-SST index | Reads directly using the offset |

Alternative Implementation — Neon Page Server

Besides implementing a full key-value separation strategy in a general-purpose storage engine, it is also possible to build a specialized engine that separates keys from large values for one specific kind of workload.

Neon’s page server, the underlying storage engine of Neon’s serverless Postgres service, implements an LSM-like structure that separates Postgres pages from redo log entries. The storage engine stores each Postgres redo log entry and each Postgres page at a given timestamp as a key-value pair. It leverages the property that redo log entries are generally small while pages are 8-kilobyte values. Managing these two kinds of data separately can reduce write amplification compared with using a general-purpose LSM-tree storage engine.

Neon page server

As shown in the above figure, the WAL service streams logs to the page server. The page server first stores incoming WAL records in in-memory layers, which are then flushed to disk as delta/image layers: redo logs are stored in delta layers, and 8K pages are stored in image layers. While the compaction process merges or repartitions the delta layers, the image layers can be left untouched and be processed or garbage-collected separately.

Summary

Key-value separation in LSM trees can reduce the write amplification of storage engines. At the same time, it brings additional costs elsewhere: space amplification, impact on foreground write performance, impact on compression ratio, a more complex read path, and possibly the need to scan all levels of the LSM structure. Each key-value separation design makes its own tradeoffs, and developers need to choose a suitable key-value separation engine based on their workloads.

Reference

This blog post was originally published on 08/07/2021 in Simplified Chinese. It was translated with ChatGPT, with extra content added on RocksDB’s BlobDB and Neon’s page server.

Feel free to comment and share your thoughts on the corresponding GitHub Discussion for this blog post.

#LSM #Log-Structured #Storage Engine #Key-Value Separation