Write Stalls
RocksDB has an extensive system to slow down writes when flush or compaction can't keep up with the incoming write rate. Without such a system, if users keep writing more than the hardware can handle, the database will:
- Increase space amplification, which could lead to running out of disk space;
- Increase read amplification, significantly degrading read performance.
The idea is to slow incoming writes down to a speed the database can handle. However, the database can sometimes be too sensitive to a temporary write burst, or underestimate what the hardware can handle, so you may see unexpected slowness or query timeouts.
To find out whether your DB is suffering from write stalls, you can look at:
- The LOG file, which contains info log entries written when write stalls are triggered;
- The compaction stats found in the LOG file.
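Beyond grepping the LOG for "Stalling writes" / "Stopping writes" lines, you can also query stall-related DB properties programmatically. A minimal sketch, assuming the `rocksdb.is-write-stopped` and `rocksdb.actual-delayed-write-rate` properties are available in your RocksDB version:

```cpp
#include <rocksdb/db.h>

#include <iostream>
#include <string>

// Sketch: print a couple of stall-related DB properties.
void PrintStallState(rocksdb::DB* db) {
  std::string stopped, delayed_rate;
  if (db->GetProperty("rocksdb.is-write-stopped", &stopped)) {
    std::cout << "write stopped: " << stopped << "\n";  // "1" while stopped
  }
  if (db->GetProperty("rocksdb.actual-delayed-write-rate", &delayed_rate)) {
    // "0" when no write delay is currently in effect.
    std::cout << "delayed write rate: " << delayed_rate << " bytes/s\n";
  }
}
```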
Stalls may be triggered for the following reasons:
- Too many memtables. When the number of memtables waiting to flush is greater than or equal to `max_write_buffer_number`, writes are fully stopped until flushes finish. In addition, if `max_write_buffer_number` is greater than 3 and the number of memtables waiting to flush is greater than or equal to `max_write_buffer_number - 1`, writes are stalled. In these cases, you will see info logs in the LOG file similar to:

  ```
  Stopping writes because we have 5 immutable memtables (waiting for flush), max_write_buffer_number is set to 5
  Stalling writes because we have 4 immutable memtables (waiting for flush), max_write_buffer_number is set to 5
  ```
- Too many level-0 SST files. When the number of level-0 SST files reaches `level0_slowdown_writes_trigger`, writes are stalled. When the number of level-0 SST files reaches `level0_stop_writes_trigger`, writes are fully stopped until a level-0 to level-1 compaction reduces the number of level-0 files. In these cases, you will see info logs in the LOG file similar to:

  ```
  Stalling writes because we have 4 level-0 files
  Stopping writes because we have 20 level-0 files
  ```
- Too many pending compaction bytes. When the estimated bytes pending for compaction reach `soft_pending_compaction_bytes_limit`, writes are stalled. When the estimated bytes pending for compaction reach `hard_pending_compaction_bytes_limit`, writes are fully stopped until compaction catches up. In these cases, you will see info logs in the LOG file similar to:

  ```
  Stalling writes because of estimated pending compaction bytes 500000000
  Stopping writes because of estimated pending compaction bytes 1000000000
  ```
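For reference, all of the thresholds above live in the column family options. The sketch below shows where each trigger is set; the values are purely illustrative and the helper function is ours, not a recommended configuration:

```cpp
#include <rocksdb/options.h>

// Sketch: the options behind each stall trigger. Illustrative values only.
rocksdb::Options StallTriggerExample() {
  rocksdb::Options options;

  // Memtable-count triggers: stall near max_write_buffer_number - 1, stop at
  // max_write_buffer_number immutable memtables waiting for flush.
  options.max_write_buffer_number = 5;

  // Level-0 file-count triggers.
  options.level0_slowdown_writes_trigger = 20;
  options.level0_stop_writes_trigger = 36;

  // Pending-compaction-bytes triggers (soft = stall, hard = stop).
  options.soft_pending_compaction_bytes_limit = 64ull << 30;   // 64 GB
  options.hard_pending_compaction_bytes_limit = 256ull << 30;  // 256 GB

  return options;
}
```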
Whenever stall conditions are triggered, RocksDB reduces the write rate to `delayed_write_rate`, and could reduce it even further below `delayed_write_rate` if the estimated pending compaction bytes keep accumulating. One thing worth noting is that the slowdown/stop triggers and pending compaction bytes limits are per column family, while write stalls apply to the whole DB: if one column family triggers a write stall, the whole DB is stalled.
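As a sketch, assuming `delayed_write_rate` is a dynamically changeable DB option in your RocksDB version, it can also be raised at runtime through `SetDBOptions` instead of only at open time:

```cpp
#include <rocksdb/db.h>

// Sketch: raise the rate used while writes are being delayed.
// delayed_write_rate is a DB-wide option in bytes per second; 0 means RocksDB
// derives it from the rate limiter or falls back to its internal default.
rocksdb::Status RaiseDelayedWriteRate(rocksdb::DB* db) {
  // Dynamically changeable DB options can be set from string key/value pairs.
  return db->SetDBOptions({{"delayed_write_rate", "67108864"}});  // 64 MB/s
}
```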
If a write slowdown/stop is triggered, application threads doing Put/Merge/Delete etc. will block. If a slowdown is in effect, each write sleeps for some time (typically 1ms) before proceeding. If writes are fully stopped, the thread can be blocked indefinitely. If blocking the thread is not desirable, applications can avoid it by setting `no_slowdown = true` in `WriteOptions`. Writes with this option will return immediately with `Status::Incomplete()` if they could not be completed due to a slowdown/stall.
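A minimal sketch of a non-blocking write using `no_slowdown` (the helper name and error handling are ours):

```cpp
#include <rocksdb/db.h>

#include <string>

// Sketch: fail fast instead of blocking when a write stall is in effect.
rocksdb::Status TryPut(rocksdb::DB* db, const std::string& key,
                       const std::string& value) {
  rocksdb::WriteOptions write_options;
  write_options.no_slowdown = true;  // never block on a write slowdown/stop

  rocksdb::Status s = db->Put(write_options, key, value);
  if (s.IsIncomplete()) {
    // A slowdown/stop was in effect; the write was rejected immediately.
    // The caller can retry later or shed load here.
  }
  return s;
}
```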
Internally, RocksDB tries to batch write requests from different threads together before writing to the WAL in order to increase performance. However, writes with `no_slowdown` set will not be batched with writes that don't have it, which might result in a slight performance hit.
There are multiple options you can tune to mitigate write stalls. If some of your workloads can tolerate write stalls and some can't, you can mark the stall-tolerant writes as Low Priority Writes so that the latency-critical writes are not stalled (see the sketch below).
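A minimal sketch, assuming the `low_pri` flag in `WriteOptions` is available in your RocksDB version:

```cpp
#include <rocksdb/db.h>

#include <string>

// Sketch: mark stall-tolerant background writes as low priority so RocksDB
// throttles them first, before regular writes would be stalled.
rocksdb::Status BackgroundPut(rocksdb::DB* db, const std::string& key,
                              const std::string& value) {
  rocksdb::WriteOptions write_options;
  write_options.low_pri = true;  // this write may be throttled aggressively
  return db->Put(write_options, key, value);
}
```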
If write stalls are triggered by pending flushes, you can try the options below (see the sketch after this list):
- Increase `max_background_jobs` to have more flush threads.
- Increase `max_write_buffer_number` so that more memtables can accumulate before the stall/stop triggers are hit.
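A sketch of those two knobs, with purely illustrative values:

```cpp
#include <rocksdb/options.h>

// Sketch: give flushes more headroom. Illustrative values only.
rocksdb::Options FlushTuning(rocksdb::Options options) {
  options.max_background_jobs = 8;      // shared pool for flushes + compactions
  options.max_write_buffer_number = 6;  // more memtables may wait for flush
  return options;
}
```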
If write stalls are triggered by too many level-0 files or too many pending compaction bytes, compaction is not fast enough to catch up with writes. Note that anything that reduces write amplification also reduces the bytes compactions need to write, and thus speeds compaction up. Options to try (see the sketch after this list):
- Increase `max_background_jobs` to have more compaction threads.
- Increase `write_buffer_size` to have larger memtables, which reduces write amplification.
- Increase `min_write_buffer_number_to_merge`.
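A sketch of those options, again with illustrative values only (whether a given value helps depends on your workload and RocksDB version):

```cpp
#include <rocksdb/options.h>

// Sketch: reduce write amplification / add compaction throughput.
// Illustrative values only.
rocksdb::Options CompactionTuning(rocksdb::Options options) {
  options.max_background_jobs = 8;               // more compaction threads
  options.write_buffer_size = 256 << 20;         // 256 MB memtables
  options.min_write_buffer_number_to_merge = 2;  // merge memtables at flush
  return options;
}
```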
You can also set the stop/slowdown triggers and pending compaction bytes limits to huge numbers to avoid hitting write stalls at all. Also take a look at "What's the fastest way to load data into RocksDB?" in our FAQ if you are bulk loading data into RocksDB.
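A hedged sketch of pushing every trigger out of reach (the helper name is ours; this trades stalls for potentially unbounded space and read amplification, so it is mainly useful for temporary bulk loads):

```cpp
#include <rocksdb/options.h>

#include <cstdint>
#include <limits>

// Sketch: set all stall triggers to huge values so they are never hit.
// Only do this if you accept the space/read amplification consequences.
rocksdb::Options DisableWriteStalls(rocksdb::Options options) {
  options.level0_slowdown_writes_trigger = std::numeric_limits<int>::max();
  options.level0_stop_writes_trigger = std::numeric_limits<int>::max();
  options.soft_pending_compaction_bytes_limit =
      std::numeric_limits<uint64_t>::max();
  options.hard_pending_compaction_bytes_limit =
      std::numeric_limits<uint64_t>::max();
  return options;
}
```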
`WriteBufferManager` provides an `allow_stall` option that can be passed to the `WriteBufferManager` constructor. If set to true, it enables stalling of all writers when memory usage exceeds `buffer_size` (the soft limit), waiting for flushes to complete and memory usage to drop. Applications can avoid this by setting `no_slowdown = true` in `WriteOptions`.
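A minimal sketch, assuming your RocksDB version's `WriteBufferManager` constructor accepts the `allow_stall` argument:

```cpp
#include <rocksdb/options.h>
#include <rocksdb/write_buffer_manager.h>

#include <memory>

// Sketch: cap total memtable memory at 1 GB and stall writers when the cap
// is exceeded, instead of letting memory usage keep growing.
rocksdb::Options WithWriteBufferManager(rocksdb::Options options) {
  options.write_buffer_manager = std::make_shared<rocksdb::WriteBufferManager>(
      1ull << 30 /* buffer_size */, nullptr /* cache */,
      true /* allow_stall */);
  return options;
}
```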