Ceph Disk Benchmarks

In a reasonable Ceph setup, transactions on the block devices of a Ceph OSD are likely the one bottleneck you'll have. When you place the OSD journal (block.wal) or the database (block.db) on an SSD, that SSD's performance and durability are particularly important.

"Normal" device benchmarks won't typically help you, as Ceph accesses the block devices differently than usual filesystems: it synchronizes every write and waits until the drive confirms the write operation. In particular this means that any write cache is always flushed directly. Other benchmarks usually do not consider this special drive access mode.

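To get a feeling for how much this access mode changes the numbers, you can run the same 4k write workload once with and once without a sync after every write. This is only a sketch using the standard fio options from the benchmark explained further below; /dev/device is a placeholder and the job names are arbitrary.

# beware: both runs write to /dev/device and destroy existing data
# every write waits for the drive to confirm it is persistent (what Ceph does)
fio --filename /dev/device --numjobs=1 --direct=1 --fdatasync=1 --ioengine=pvsync --iodepth=1 --runtime=20 --time_based --rw=write --bs=4k --group_reporting --name=sync-writes

# writes may linger in the drive's write cache (what "normal" benchmarks measure)
fio --filename /dev/device --numjobs=1 --direct=1 --ioengine=pvsync --iodepth=1 --runtime=20 --time_based --rw=write --bs=4k --group_reporting --name=plain-writes
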
List of Devices

Sorted by 1-job IOPS, since they're the most relevant metric for Ceph (but the max IOPS matter, too).

  • IOPS: write-sync operations per second for one job
  • max IOPS: sum of parallel write-sync operations for multiple jobs
  • cache: write cache activation status (hdparm -W)

The more 1-job IOPS a drive can do in sync mode, the more transactions can be committed on a Bluestore OSD by its bstore_kv_sync thread, which is usually the bottleneck (it calls fdatasync on the block device).

In addition to the metadata kv_sync, writes of pool data happen in other threads: the number of shards (osd_op_num_shards_ssd = 8 by default on SSDs) determines the number of additional IO jobs needed. In the benchmark, the peak #jobs should therefore be at least 8.

In this list we ignore CPU/RAM/... of the host since we assume the storage device is much slower than the server. If you get different results, please open an issue so we can either update the benchmark or figure out what went wrong.

SSDs

| ID | Size | Proto | 1job IOPS | max IOPS | peak #jobs | cache | Notes |
|----|------|-------|-----------|----------|------------|-------|-------|
| Intel SSD 750 PCIe | 400GB | NVMe | 64235 | 192440 | 8 | - | asymptotic |
| Samsung MZQLW960HMJP-00003 | 960GB | NVMe | 34090 | 268030 | 16 | - | linear up to ~8 jobs, then asymptotic |
| Samsung PM1643a | 960GB | SAS | 18545 | 93229 | 16 | - | asymptotic |
| Samsung PM863a | 240GB | SATA | 17983 | 58876 | 10 | off | asymptotic |
| Samsung PM883 | 7.68TB | SATA3.2 6G | 12680 | 59338 | 16 | off | asymptotic; cache on: 5094 @ 1 job, 27521 @ 16 |
| Pliant LB206S MS04 | 200GB | SAS | 5028 | 5028 | 1 | - | more jobs slow down. 2: 2651, 6: 1088, 8: 745, 10: 784 |
| Samsung 983DCT | 960GB | NVMe | 4000 | 22570 | 8 | - | asymptotic |
| WD Blue WDS100T2B0A-00SM50 | 1TB | SATA | 1751 | 2225 | 2 | off | 2 jobs already saturate |
| Intel SSD S4510 | 480GB | SATA | 1600 | 48409 | 15 | off | linear until capped |
| Intel SSD 545 | 512GB | SATA | 1500 | 6460 | 8 | - | asymptotic |
| Samsung PM961 | 128GB | NVMe | 1480 | 1480 | 1 | - | more jobs slow down. 2: 818, 3: 1092, 4: 525, 5: 569 |
| Transcend SSD 220s | 1TB | NVMe | 1420 | 5760 | 8 | - | asymptotic |
| LENSE20512GMSP34MEAT2TA | 512GB | NVMe | 1150 | 3164 | 4 | - | asymptotic |
| Samsung SSD 860 PRO | 512GB | SATA | 1033 | 5915 | 15 | - | asymptotic |
| Sandisk Extreme Pro | 960GB | SATA | 860 | 3400 | 8 | - | asymptotic |
| Sandisk Ultra II | 960GB | SATA | 600 | 3640 | 8 | - | asymptotic |
| Samsung SSD 860 EVO | 1TB | SATA | 490 | 1728 | 14 | - | asymptotic |
| Samsung SSD 970 PRO | 512GB | NVMe | 456 | 840 | 2 | - | 2 jobs already saturate |
| Samsung MZVLB512HAJQ-000L7 | 512GB | NVMe | 384 | 1164 | 10 | - | asymptotic |

Entries are sorted by 1-job IOPS.

Create a Benchmark

Please add your benchmark (or validate existing entries) and submit your changes as a pull request!

Device model and link

Get the device model number (Device Model:) and the connection link version and speed (SATA Version is:, etc.):

smartctl -a /dev/device
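
If you only need the fields for the table above, a filter like the following works. This is just a sketch: the field names differ between device types (NVMe devices report Model Number: instead of Device Model:, for example), so adjust the pattern to your device.

# pick out model, capacity and link information from the smartctl report
smartctl -a /dev/device | grep -E 'Device Model|Model Number|User Capacity|SATA Version'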

Write cache

As the Ceph journal is written with fdatasync, each IO operation waits until the drive confirms that the data was written permanently. This means the write cache is written to and then flushed for every operation; turning the write cache off, so that writes bypass it, can therefore make these commits faster:

# see current status
hdparm -W /dev/device

# disable the write cache
hdparm -W 0 /dev/device
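
Remember to restore the original cache setting once you are done. If you want to fill in the cache column of the table, simply run the fio benchmark below once per setting.

# re-enable the write cache after benchmarking
hdparm -W 1 /dev/device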

fio Benchmark

Beware, the device will be written to and existing OSD data will be corrupted!

Each OSD with Bluestore has one bstore_kv_sync thread, which writes with pwritev and invokes fdatasync after each transaction. This is what we try to benchmark.

fio --filename /dev/device --numjobs=1 --direct=1 --fdatasync=1 --ioengine=pvsync --iodepth=1 --runtime=20 --time_based --rw=write --bs=4k --group_reporting --name=ceph-iops

In the output, look for write: IOPS=XXXXX in the ceph-iops summary.

  • Increase numjobs (e.g. by doubling the value) to find out the performance behavior for parallel transactions and figure out the upper IOPS limit.
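
A sweep over the job count could look like the sketch below; the job counts are arbitrary choices, so pick values that bracket your drive's saturation point, and /dev/device is again a placeholder.

# run the benchmark with increasing parallelism and print only the IOPS summary lines
for jobs in 1 2 4 8 12 16; do
    echo "numjobs=${jobs}"
    fio --filename /dev/device --numjobs=${jobs} --direct=1 --fdatasync=1 --ioengine=pvsync --iodepth=1 --runtime=20 --time_based --rw=write --bs=4k --group_reporting --name=ceph-iops | grep 'IOPS='
done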

Contributing

This list is intended to be expanded by YOU! Just run the test and submit a pull request!

Corrections and verifications of listed benchmarks would be very helpful, too!

Contact

If you want to reach out, join #sfttech:matrix.org on Matrix.

License

This information is released under CC0.
