
RocksDB notes


Introduction

RocksDB is an LSM-tree-based key-value store written in C++17 (as of 2022). This page contains various notes on RocksDB, as it is not a trivial piece of software once you go a little more in-depth. We cover some installation issues, RocksDB options, RocksDB plugin functionality, the RocksDB file layout, and how to work with db_bench.

Installation

If you intend to develop with or seriously use RocksDB, please do not use your package manager: you lose control over the RocksDB configuration and version, and you cannot alter the code. Instead, clone RocksDB and check out a version. For example:

git clone https://github.com/facebook/rocksdb/
cd rocksdb
git checkout v<version_nr>

RocksDB has many build dependencies; look at the dependencies in INSTALL.md and be sure to install them all. RocksDB can be installed with either plain UNIX Makefiles or CMake. Plugins behave differently between CMake and Makefiles. If you intend to use ZenFS, you need to use the UNIX Makefile (the CMake path is more restrictive with plugins and .pc files). The explanation in the Stosys guide also follows the Make installation. Therefore, try to use the Makefile if possible.

Install RocksDB with CMake

You can build RocksDB applications with:

mkdir -p build
cd build
cmake ..
make -j <application_name>

Note that for performance measurements with e.g. db_bench, you need to build in release mode. In this case, remove CMakeCache.txt first to ensure there is no stale state.

rm CMakeCache.txt
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j db_bench

To get a release-mode static library, use make static_lib. This has been true up until June 2022 at least; check INSTALL.md in your cloned version of RocksDB to be sure.

Some things that can go wrong are:

  • If you get errors like undefined reference to google::FlagRegisterer::Flag..., something probably went wrong with your gflags installation (you should already have gflags installed, as otherwise RocksDB would not compile). A good guide to solving this issue is https://github.com/gflags/gflags/issues/203; the answer by EricOops is a life saver. EricOops recommends building and installing gflags and glog manually. However, do not forget to first purge the current gflags installations, just to be sure.

Install RocksDB with UNIX Makefiles

The Stosys handbook (at https://animeshtrivedi.github.io/course-stosys/) should be sufficient. It largely follows the plain Makefile installation: you set all required flags in front of the make command, i.e. <FLAGS> make <target>. To install RocksDB in release mode and build db_bench in release mode, use: DEBUG_LEVEL=0 make install db_bench.

Important build flags for RocksDB

You can also set extra build flags; some interesting ones are listed below (with the CMake build, ccmake makes it easy to set such flags, but you can also use -D). A combined Makefile example is sketched after the list:

  • USE_RTTI=1, see https://en.wikipedia.org/wiki/Run-time_type_information. It is enabled by default in debug and disabled in release. When working on a plugin, it can be useful to temporarily enable this.
  • PREFIX=, determines the installation path when installing RocksDB with make install (does not work with the CMake installation; alter CMAKE_INSTALL_LIBDIR instead).
  • DISABLE_JEMALLOC=1, jemalloc is a very nice and fast malloc implementation, but it sometimes makes development harder. It is better to turn it off during initial development (but by all means set it to 0 to get higher performance in some cases).
  • DISABLE_WARNING_AS_ERROR=1, stops -Wall and a few other warning flags from being treated as errors.
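
For instance, a release build of db_bench with several of the flags above could look like the following one-liner (which flags you actually want depends on your use case; this is only a sketch):

DEBUG_LEVEL=0 USE_RTTI=1 DISABLE_JEMALLOC=1 DISABLE_WARNING_AS_ERROR=1 make -j db_bench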

Build file system plugin

File system plugins can be added by moving/copying the directory of the plugin to rocksdb/plugin. The name of the directory should be equal to the plugin name. The plugin can have either a UNIX Makefile or a CMake file; the name of this file should again be equal to the plugin name. In this build file a few variables can be set. On this page we only cover Makefiles: CMake plugins are more restrictive (they cannot specify install directories or run scripts, and have some other quirks). The variables in the Makefile should all be prefixed by the plugin name:

  • plugin_HEADERS-y: the headers of the plugin to include
  • plugin_SOURCES-y: the source files of the plugin to include
  • plugin_CXXFLAGS: extra build flags needed (e.g. -u)
  • plugin_LDFLAGS: extra linking flags

ZenFS has a great example of a proper Makefile for plugins. After copying a plugin, you should rebuild RocksDB. With Makefile plugins, do this as follows (a full example is sketched below):

<FLAGS> ROCKSDB_PLUGINS="plugin_name" make <target>
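
For illustration, the complete flow for a hypothetical Makefile plugin named "myplugin" (the name and source path are made up) could look like:

cp -r /path/to/myplugin rocksdb/plugin/myplugin
cd rocksdb
DEBUG_LEVEL=0 ROCKSDB_PLUGINS="myplugin" make -j db_bench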

Some common issues that can occur are:

  • When linking against RocksDB (for e.g. file system plugins), it might not be in the expected location. In that case a symlink can suffice. For example: sudo ln /home/user/local/lib/librocksdb.a /usr/local/lib/librocksdb.a.

Build ZenFS file system plugin

To install ZenFS, do the following (a rough command sketch follows the list):

  1. Clone LibZBD (https://github.com/westerndigitalcorporation/libzbd).
  2. Check out a stable LibZBD version.
  3. Follow the LibZBD installation instructions.
  4. Clone ZenFS (https://github.com/westerndigitalcorporation/zenfs).
  5. Check out a stable version.
  6. Follow the installation README of that version.
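
The clone and checkout steps boil down to something like the sketch below; <stable_tag> is a placeholder, and the actual build and install commands are in each repository's README:

git clone https://github.com/westerndigitalcorporation/libzbd
git -C libzbd checkout <stable_tag> # then build and install per the LibZBD README
git clone https://github.com/westerndigitalcorporation/zenfs
git -C zenfs checkout <stable_tag> # then follow that version's installation README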

RocksDB file layout

RocksDB makes use of a file system to store its data. When altering or debugging RocksDB, it is invaluable to know what the files mean. When you develop a file system plugin, you can also use this knowledge to mirror existing file systems. The following files are essential to "default" RocksDB (see rocksdb/file/filename.cc for the exact names); an example directory listing follows the list:

  • "[0-9]+.log": log files used by the WAL. These are used to back changes in the LSM-tree memtable. This file only holds a few entries and a new one is created after each flush. It is the first I/O structure holding key-value pairs. This one should be the important one to debug to verify initial persistence issues. Try with different value sizes and different entries (bounded by memtable size). The number represents what WAL it is. Each new WAL gets a higher number.
  • "[0-9]+.sst": holds an SSTable. These files are write once, but should be randomly readable. A new SSTable is created after each flush and can be created after each compaction. The number represents the version number.
  • "MANIFEST-[0-9]+": Holds the Manifest, that is to say all index data of the LSM-tree. It knows the names of relevant SSTables, what files need to be deleted etc. The file is regularly appended (not a fixed size). Only one MANIFEST file is used at the same time. When a RocksDB database is created, this is one of the first files that is created. It is essential that this file is written without errors to guarantee persistence. However, it is generally not read again after startup. Meaning that all of your data can be read correctly (memtable, SSTables...), but the MANIFEST can be completely incorrect. This can give the false illusion that if your file system states that a test passes, the data is correct. Then we restarting the database it is all corrupt. When this happens, this file is the likely cause.
  • "CURRENT": pointer to the current manifest file. There is only one "CURRENT" and this file is very easy to debug. It should contain the name of the current manifest file in plain text. Always created on database creation.
  • "LOG": contains info to use. For example, when compactions or flushes occurred. This file is essential for debugging.
  • "OPTIONS-[0-9]+": RocksDB unique; contains the options used when the database was created. This is used for options that need to be known on database creation or startup. For example, the layout of data of SSTables.
  • "LOCK": used to guarantee that the database is not opened multiple times. When a database is opened, the "LOCK" file is created. Then, when another database tries to open the database it checks if there is a "LOCK" file, it errors. Prevents concurrency issues.
  • "IDENTITY": Used to identify the database. It is a unique string id.

Massaging RocksDB

A common way of forcing heap memory into a desired state is "heap massaging": setting certain configurations or workloads to bring the memory into a certain state. We want to be able to do similar things with RocksDB: massage it into a form that is easy to test (for security experts, we use a broader definition of massaging here than heap massaging). For example, we want to force RocksDB to flush or compact. In this section, we discuss a few such approaches.

How to test flushes

In order to test flushes, we must be able both to trigger flushes and to measure their effects. There are multiple ways to force flushes. One is to flush manually with:

ColumnFamilyHandle* cf = db->DefaultColumnFamily();
FlushOptions opts;
db->Flush(opts, cf);

This can be done in a simple test that writes "n" key-value pairs and then flushes. This allows us to test flushes in isolation. However, it always requires manual interaction. It might be great to see what flushes do, but it tells us nothing about how flushes are scheduled or issued by the database itself. Therefore, we use another simple approach: set all size variables really low and only write a few key-value pairs. For example:

options.max_bytes_for_level_base = 4 * 1024; // 4 KiB
options.write_buffer_size = 1024;            // 1 KiB
options.target_file_size_base = 1024;        // 1 KiB

This sets the constraints of flushes low. After writing just a few key-value pairs, we will see flushes occurring.

Flushes are issued by a background thread and can write multiple ".sst" files to storage. Observe these files to verify that the flushes behave as expected.
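
To watch database-scheduled flushes end to end, a minimal stand-alone sketch could look as follows (the path, sizes and key count are arbitrary choices; it assumes the RocksDB headers and library are installed and linked):

#include <cassert>
#include <string>
#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.write_buffer_size = 1024;             // 1 KiB memtable: flushes quickly
  options.max_bytes_for_level_base = 4 * 1024;  // 4 KiB
  options.target_file_size_base = 1024;         // ~1 KiB SSTables

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/flush_test", &db);
  assert(s.ok());
  // Write enough small key-value pairs to exceed the tiny write buffer several times.
  for (int i = 0; i < 100; i++) {
    s = db->Put(rocksdb::WriteOptions(), "key" + std::to_string(i),
                std::string(128, 'x'));
    assert(s.ok());
  }
  // Optionally force a final flush of whatever is left in the memtable.
  db->Flush(rocksdb::FlushOptions(), db->DefaultColumnFamily());
  delete db;
  return 0;
}

The ".sst" files in /tmp/flush_test and the flush entries in its "LOG" file then show what the background flushes did.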

RocksDB, RocksDB options and db_bench

RocksDB can be run and benchmarked in various configurations. The data for this section is currently a mirror of https://krien.github.io/notes/2022-07-17-benchmarking-rocksdb.html.

Interesting parameters for RocksDB

RocksDB allows setting various options. To see all options, look at include/rocksdb/options.h, include/rocksdb/advanced_options.h and include/rocksdb/convenience.h. There are also "extra" internal options defined in options/db_options.h and options/cf_options.h. These two should not be altered by hand, but they provide more in-depth knowledge about RocksDB (in other words, alter at your own risk). We note down a few of the most interesting options.

Options for opening/creation:

Set these options to determine how to open the database, for example when/how to create it. A small sketch follows the list.

  • options.create_if_missing: creates db if missing. Needs to be true on database creation. Can be used for persistency tests. If this is set to false, the database can not be recreated and must exist.
  • options.error_if_exists: Set to true if creating a NEW database. This prevents reloading if the database already exists.
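
A minimal open-or-create fragment using these options might look like this (the path is illustrative):

rocksdb::Options options;
options.create_if_missing = true;  // allow creation on the first run
options.error_if_exists = false;   // set to true when the database must be new
rocksdb::DB* db = nullptr;
rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/testdb", &db);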

Options for paths/mounted devices:

Set these options to force using a file system.

  • options.wal_dir: specify directory where WALs will be stored. Can be used to force database to use a file system. For example pointing to a directory mounted with F2FS.
  • options.db_log_dir: specifies the directory where info logs will be written. Can be used to avoid interference of info logs with the rest of the system.

Options essential for testing I/O:

Please consider these options when testing storage; a combined sketch follows the list.

  • options.write_buffer_size: size of the buffer that is used for the memtable. The bigger it is, the more data is kept in RAM and the fewer flushes are needed (albeit larger flushes are done in the end). Set it low to force a lot of I/O, which makes debugging your system easier.
  • options.target_file_size_base: the approximate size to use for files. This can be used to for example get SSTables to be approximately the size of an SSD zone.
  • options.target_file_size_multiplier: multiplier in file size for SSTables between levels.
  • options.max_bytes_for_level_multiplier: The difference in max bytes for each level. Can be used to tweak the step between levels.
  • options.compression: allows setting the compression type. Set it to kNoCompression in most cases, unless you want to test with compression on. There is compression support for Snappy, Zlib, BZip2, LZ4, LZ4HC, XPRESS and ZSTD.
  • options.bottommost_compression: Set compression for last level. Disabled by default.
  • options.max_bytes_for_level_base: maximum total size in bytes of level 1; the sizes of higher levels follow from max_bytes_for_level_multiplier.
  • options.use_direct_io_for_flush_and_compaction: force flush and compaction to use direct I/O. No strange buffering when testing I/O.
  • options.use_direct_reads: please enable to force all reads (user and compaction) to go through DIRECT I/O. This is what you want for testing I/O.
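
For example, an options fragment aimed purely at exercising storage could look like this (the sizes are arbitrary; tune them to your device):

options.compression = rocksdb::kNoCompression;
options.use_direct_reads = true;
options.use_direct_io_for_flush_and_compaction = true;
options.write_buffer_size = 1 << 20;           // 1 MiB memtable: flush often
options.target_file_size_base = 64 << 20;      // e.g. approximate a zone size
options.max_bytes_for_level_base = 256 << 20;  // 256 MiB for level 1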

Options for reliability/sync:

These options determine the reliability. Lower reliability generally means more data kept in memory and less storage I/O.

  • options.use_fsync: Force all I/O to fsync. This prevents excessive buffering in WAL, which means lower performance, higher reliability and more storage I/O!

Options for controlling flushes/compactions:

These options can be used to tweak background I/O.

  • options.level0_file_num_compaction_trigger: maximum number of Level0 SSTables before triggering compaction.
  • options.max_background_jobs: increment to allow for more flush/compaction threads. The default is 2, which is a bit low in 2022...
  • options.level0_slowdown_writes_trigger: slow down writes when the number of L0 tables reaches this number. Prevents flooding L0.
  • options.level0_stop_writes_trigger: stop all writes when the number of L0 tables reaches this number, to prevent flooding L0.

Options for WAL

These options can be used to control WAL logic.

  • options.track_and_verify_wals_in_manifest: sets more checks for WALs. For example, sane erroring if a WAL misses or has incorrect size. Can be used for debugging.
  • options.max_total_wal_size: force flush memtables after WAL exceeds this size (default is sum[write_buffer * max_write_buffer_number] * 4).
  • options.WAL_ttl_seconds and options.WAL_size_limit_MB: these can be used to control WAL deletion. By default archived WAL logs are deleted asap. Setting these allows only deleting after a combined size is reached or a certain time has passed (ttl has precedence over size limit).
  • options.manual_wal_flush: ONLY allow manual WAL flushes. Can be used to investigate certain WAL patterns.

Options for Manifest

These options can be used to control Manifest logic.

  • options.max_manifest_file_size: when the manifest reaches this size, a new one is created and the old one is deleted. Set this low to test the manifest file (default is 1G, which it will not reach unless you do large I/O).
  • manifest_preallocation_size: preallocated file size for the manifest. By default this is 4 MB, which can be either too little or too much depending on the test case.

Misc options

These options might be useful.

  • options.info_log_level: alter what is written to the info log (.LOG file).
  • options.max_open_files: allows increasing the number of open files. This is useful if you want to test RocksDB with a lot of small files. This is not ideal behaviour for RocksDB, but sometimes we want to test the irrational :). For example, try setting target_file_size_base to 512 bytes and writing 100 GB; you will see that this is the option to increase.
  • options.enable_pipelined_write: allow writing to the WAL and the memtable at the same time (pipelined). Ordinarily, all writes go through the WAL and THEN through the memtable; this option removes that inefficiency.
  • options.unordered_write, two_write_queues, allow_2pc: complicated options that allow for complicated persistency. They reorder some I/O and reduce consistency, but can increase speed.
  • options.allow_concurrent_memtable_write and options.enable_write_thread_adaptive_yield: concurrent writes to memtables.
  • options.num_levels: allows using a different number of levels. This might be a good option to set if the device is either very small or the ZNS zones are very large.

Interesting parameters for db_bench

  • histogram: very important. Shows a histogram at the end of each benchmark, that represents latencies. This is a good tool to get tail latencies of the key-value store.
  • num: number of key-value pairs. This is the parameter to set when testing for large I/O (e.g. TBs of data)
  • value_size: sets the value size in bytes for each key-value pair. Can be used to check how well the store performs with different types of data. It can also be interesting to check the relation between e.g. block size and key-value pair size.
  • key_size: sets the key size in bytes for each key-value pair. This is interesting for similar reasons as the value_size, but also because keys are used in the metadata of e.g. SSTables. When the key size becomes very large, so does the metadata that comes along with it. This can be used to test whether the key-value store supports large keys and at what scale.
  • threads: dependent on the benchmark. Generally means the client threads. For example for write benchmarks, it refers to concurrent client writes.
  • use_existing_db: if set, uses an already existing db. Some benchmarks require an existing db, others require a new db. This parameter is good to set if multiple benchmarks will be run in series and should continue from a dirty state.
  • compression_type: see options.compression. set it to none when you do not want compression.
  • write_buffer_size: see options.write_buffer_size.
  • use_direct_reads: see options.use_direct_reads. uses direct I/O for reads. A good parameter to set when you want to test storage.
  • use_direct_io_for_flush_and_compaction: see options.use_direct_io_for_flush_and_compaction. uses direct I/O for background operations. A good parameter to set when you want to test storage.
  • max_background_jobs: see options.max_background_jobs.
  • target_file_size_base: See options.target_file_size_base.
  • level0_file_num_compaction_trigger: See options.level0_file_num_compaction_trigger.
  • max_bytes_for_level_base: See options.max_bytes_for_level_base.
  • max_bytes_for_level_multiplier: See options.max_bytes_for_level_multiplier.
  • target_file_size_multiplier: See options.target_file_size_multiplier.
  • num_levels: see options.num_levels.
  • wal_dir: see options.wal_dir.
  • db_log_dir: see options.db_log_dir.

Interesting workloads for db_bench

db_bench comes with a lot of different benchmarks. It is possible to run multiple by separating them with "," in the argument, for example benchmarks="fillseq,fillrandom". Some benchmarks that I find interesting for testing the underlying storage:

  • fillseq: fills key-value pairs in sequential order. Ideal to test a key-value store in alpha stage and to test random vs non-random I/O.
  • fillrandom: fills key-value pairs in random order. Some keys MIGHT occur multiple times, requiring overwrite functionality.
  • overwrite: keeps overwriting a set of the same keys. This is a good stress test. Especially to test garbage collection of the LSM-tree.
  • readwhilewriting: tests resource contention. It uses x readers that all try to read, while another process keeps writing. It can be used to test for excessive locking and scaling issues.
  • stats: returns the stats of the last run.
  • disable_auto_compactions: set to true if you only want manual compactions. Ideal for full control.

To test big I/O (for example more than 1 TB) with the intent of testing garbage collection, we recommend checking out the benchmark from ZenFS: it writes 1 TB of fillseq, followed by 1 TB of fillrandom, and then does 1 hour of readwhilewriting.

Example benchmark

There are quite a few benchmark parameters to keep track of. Therefore, we present an example benchmark. This is a stress test for filling a ZNS SSD. In this benchmark we fill 1 TB of randomly generated key-value pairs of 1016 bytes each (1016 * 1_000_000_000 is more than enough for 1 TB). We want to stress I/O, so we disable compression and use direct I/O. Additionally, we set the file size to approximate the zone size of a ZNS SSD and set the write buffer size to a few GB to prevent excessive flushing. max_bytes_for_level_multiplier is 10 by default, but that is too large when testing on ordinary SSDs (we will not see high levels). We increase concurrency by increasing the number of background jobs to 8 (1 for each level and 1 flush thread). Lastly, we fix the seed to make the job reproducible. This leads to:

ZONE_CAP=<calculate zone cap here>
./db_bench --num=1000000000 --compression_type=none --value_size=1000 --key_size=16 --use_direct_io_for_flush_and_compaction \
--use_direct_reads --max_bytes_for_level_multiplier=4 --max_background_jobs=8                                                \
--target_file_size_base=$ZONE_CAP --write_buffer_size=$((1024*1024*1024*2))  --histogram                                     \
--benchmarks=fillrandom --seed=42

Benchmark with ZenFS

When benchmarking with ZenFS we need to do a few things. First pick an appropriate ZNS SSD and note its device name: use the <name> part of /dev/<name>, as ZenFS prepends /dev/ automatically. Then do the following to set up ZenFS:

# In all lines, $dev holds the ZNS device name
echo deadline | sudo tee "/sys/block/$dev/queue/scheduler" # ZenFS requires the deadline scheduler (named mq-deadline on recent kernels)
rm -rf /tmp/zenfs-aux # ZenFS requires a temporary aux path for its LOG file, but it is not allowed to already exist!
cd $ZENFS_DIR # ZENFS_DIR should be the utils directory of ZenFS
./zenfs mkfs -zbd=$dev -aux_path=/tmp/zenfs-aux

When this succeeds, you should see a message such as: “ZenFS file system created. Free space: 3968745 MB”. Otherwise, assume that it has failed.

Now benchmarks can be run on ZenFS. For some good examples go to https://github.com/westerndigitalcorporation/zenfs/tree/master/tests; in particular, look at get_good_dbbench_params_for_zenfs.sh. What is immediately noticeable is that using ZenFS requires different db_bench arguments. You should point db_bench to the ZenFS file system with --fs_uri=zenfs://dev:$dev, with $dev the device name. Then it should already work, but it is not yet optimal. In addition, you should set the target file size to approximately the size of a zone, via --target_file_size_base. The write buffer size, set with --write_buffer_size, should also approximate this size.
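
A minimal ZenFS-backed run could then look like the sketch below ($dev and $ZONE_CAP are placeholders as before, and the remaining values are illustrative):

./db_bench --fs_uri=zenfs://dev:$dev --benchmarks=fillrandom --num=1000000 \
--compression_type=none --use_direct_io_for_flush_and_compaction          \
--target_file_size_base=$ZONE_CAP --write_buffer_size=$ZONE_CAP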

Once you are done, remove the file system (it is not mounted anyway :)) with:

nvme zns reset-zone /dev/$dev -a

Benchmarking with F2FS

F2FS supports ZNS SSDs out of the box, provided a recent version of F2FS is used. However, it does require some additional setup and a few things to keep track of. The first idiosyncrasy is that F2FS stores most of its data in sequential zones, but at least part of its metadata needs randomly writable space. ZNS allows a device to expose a few zones that can be written to randomly, but does not require it. Furthermore, such zones may not be large enough to hold all metadata. Whenever the amount of randomly writable space is not enough, F2FS should warn you by default. For example, 100 GB requires at least 4 GB of random space and 7 TB requires at least 16 GB of random space. When there is not enough random space, we have to use an additional device, as there is no other way. This does hinder benchmarks, as F2FS "cheats" in this regard. To keep side effects to a minimum, try to use an NVMe device with similar performance.

Install F2FS

When using F2FS with f2fs-tools in 2022 and using the default kernel, ZNS is not supported out of the box. In that case, f2fs-tools needs to be built manually, and we have to be careful. Do NOT use the version on GitHub as it does not seem to be maintained; instead clone from git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs-tools.git. Then it should be as simple as following the instructions from the repository. One thing to be aware of is whether you have all dependencies properly installed and set. During the configuration phase (./configure) you should see a list of capabilities with "yes" or "no" next to each. If "blkzoned.capacity" is "no", you can create a ZNS file system (at least the command completes without errors or warnings), but you can not actually use it… In this case, be sure you have a modern kernel, the headers are installed and the kernel is built with all of the required configs. In my case I also had to update a few libraries, such as BPF.

Create the file system

Nick Tehrany has an excellent paper on how to use F2FS on ZNS at https://arxiv.org/abs/2206.01547. Before creating the file system on ZNS, first ensure that the ZNS device is actually empty, as at the moment (June 2022) F2FS makes no attempt to reset the device itself. For example:

nvme zns reset-zone /dev/$dev -a # replace dev with the ZNS device and do this for every namespace used by the filesystem.

After this it is sufficient to say:

mkfs.f2fs -f -m -c /dev/$devzns /dev/$devnvme # With $devzns the sequential-only ZNS namespace of a ZNS device and $devnvme the randomly writable namespace of a ZNS device (or another ordinary device)

Then mount the filesystem at the preferred mount point, such as /mnt/f2fs with:

mount -t f2fs /dev/$devnvme /mnt/f2fs # with $devnvme the randomly writable device from the previous command

Running db_bench with F2FS

Be sure to use only the mount point backed by F2FS, by specifying the db and WAL directories: set --db and --wal_dir to directories within the mounted file system.
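
A minimal sketch (the paths assume the /mnt/f2fs mount point from the previous section; the other values are illustrative):

./db_bench --db=/mnt/f2fs/db --wal_dir=/mnt/f2fs/wal --benchmarks=fillrandom \
--num=1000000 --compression_type=none --use_direct_reads                     \
--use_direct_io_for_flush_and_compaction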