Papers and Blogposts

What Every Programmer Should Know About SSDs

(link)

Drives not Disks

  • don't think of SSDs as fast disks
  • no moving head -> random reads are about 100 times faster
    • ~10 ms vs ~100 µs

Parallelism

  • hard disks have a single disk head -> only good for sequential accesses
  • an SSD has dozens or hundreds of flash chips which can be accessed concurrently
    • SSDs stripe files across the chips at page granularity
      • schedule hundreds of random IO requests concurrently to keep all flash chips busy
      • use multithreading or asynchronous IO (see the sketch below)
    • with enough concurrency, SSDs achieve almost the full bandwidth even for random page reads
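A minimal sketch of such concurrent random reads on Linux, assuming a large file data.bin and 4 KiB pages (the file name, thread count, and request count are made-up illustrative values):

    // Keep many random 4 KiB reads in flight by running one reader per thread.
    // O_DIRECT bypasses the page cache so every read actually hits the drive.
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define PAGE 4096
    #define THREADS 32
    #define READS_PER_THREAD 1000

    static int fd;
    static off_t npages;

    static void *reader(void *arg) {
        void *buf;
        posix_memalign(&buf, PAGE, PAGE);        // O_DIRECT needs aligned buffers
        unsigned seed = (unsigned)(uintptr_t)arg;
        for (int i = 0; i < READS_PER_THREAD; i++) {
            off_t page = rand_r(&seed) % npages; // pick a random page
            pread(fd, buf, PAGE, page * PAGE);   // independent reads run concurrently
        }
        free(buf);
        return NULL;
    }

    int main(void) {
        fd = open("data.bin", O_RDONLY | O_DIRECT);
        npages = lseek(fd, 0, SEEK_END) / PAGE;
        pthread_t t[THREADS];
        for (long i = 0; i < THREADS; i++) pthread_create(&t[i], NULL, reader, (void *)i);
        for (int i = 0; i < THREADS; i++) pthread_join(t[i], NULL);
        close(fd);
    }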

Writing

  • high write bandwidth because multiple flash chips are accessed concurrently
  • writes are about 10 times slower than reads
    • ~100 µs (read) vs ~1 ms (write)
  • writes are cached in RAM
    • need to call sync/flush for durability (sketched below)
    • often a no-op in data centers because of UPSs/battery-backed power
  • a write occupies a flash chip about 10 times longer than a read
    • a read right after a write therefore has significantly higher latency
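A minimal sketch of the sync/flush point, assuming a hypothetical log file log.bin: the write() completes once the data is in the RAM cache, and only fsync() makes it durable on flash.

    #include <fcntl.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("log.bin", O_WRONLY | O_CREAT, 0644);
        const char rec[] = "record";
        write(fd, rec, sizeof rec);  // buffered: returns before the data is on flash
        fsync(fd);                   // blocks until the drive reports the data durable
        close(fd);
    }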

Out-Of-Place Writes

  • SSD blocks consist of hundreds of pages
  • NAND flash pages can't be overwritten
    • append only
    • page updates are written to a new location, the old page is invalidated
      • logical and physical page addresses are decoupled
      • the mapping table is stored on the SSD and maintained by the Flash Translation Layer (FTL), see the toy model below
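A toy model of what a page-level FTL mapping conceptually does (real FTLs are drive firmware and far more involved; all names here are made up):

    #include <stdint.h>

    #define NPAGES (1u << 20)

    static uint32_t l2p[NPAGES];    // mapping table: logical page -> physical page
    static uint32_t next_free = 0;  // toy free-space management: plain append

    // An update never overwrites in place: it appends to a free physical page,
    // repoints the mapping, and leaves the old physical page behind as garbage.
    uint32_t ftl_write(uint32_t logical) {
        uint32_t old = l2p[logical];
        uint32_t fresh = next_free++;  // append-only allocation
        l2p[logical] = fresh;          // the logical address stays stable
        (void)old;                     // the old page is now invalid, GC reclaims it
        return fresh;
    }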

Garbage Collection

  • because of out-of-place writes, the SSD would eventually run out of free blocks
    • the space of old/invalidated pages must be reclaimed
    • valid pages are copied to a fresh block first, then the whole block gets erased (sketched below)
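Continuing the toy model from above, one garbage-collection step could look like this (relocate() is a stand-in for copying the page and fixing the mapping):

    #include <stdbool.h>

    #define PAGES_PER_BLOCK 256

    typedef struct { bool valid[PAGES_PER_BLOCK]; } block_t;

    static void relocate(block_t *b, int p) {  // stand-in: copy page, update l2p[]
        b->valid[p] = false;
    }

    void gc_step(block_t *victim) {
        for (int p = 0; p < PAGES_PER_BLOCK; p++)
            if (victim->valid[p])
                relocate(victim, p);  // live data must be moved before the erase
        // block erase: the only granularity at which pages become writable again
        for (int p = 0; p < PAGES_PER_BLOCK; p++)
            victim->valid[p] = false;
    }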

Write Amplification and Overprovisioning

  • because of garbage collection, a logical (software) write can trigger multiple physical (flash) writes
    • the ratio is called write amplification (see the formula below)
    • if 3 logical writes require 4 physical writes, the write amplification is 1.33
  • high write amplification decreases performance and lifetime
  • random writes are worse than sequential ones
  • the fuller the SSD, the higher the write amplification
    • most SSDs reserve some spare capacity (overprovisioning) to ease this
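The ratio spelled out, using the 4-vs-3 example from the note:

    // write amplification = physical (flash) writes / logical (software) writes
    #include <stdint.h>

    double write_amplification(uint64_t physical, uint64_t logical) {
        return (double)physical / (double)logical;  // e.g. 4 / 3 ≈ 1.33
    }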

Why mmap is faster than system calls

(link)

  • quick background on OS, system calls, buffers, IO etc.
  • lists reasons why mmap() should be faster
    • no page faults when re-accessing data
    • no additional memory copy necessary when the data is in mapped memory
  • experiments compare syscall-based IO and mmap() (both access paths are sketched below)
  • mmap is way faster
  • explains this behavior by comparing CPU operations
  • mentions the faster (AVX/SIMD) copy function used by mmap() as the main reason
  • source
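A minimal sketch of the two access paths being compared, assuming an existing non-empty file data.bin (name and sizes illustrative):

    // read(): a syscall copies data from the page cache into a user buffer.
    // mmap(): the page-cache pages are mapped into the address space; after the
    // first fault, access is a plain memory load with no syscall and no extra copy.
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("data.bin", O_RDONLY);
        struct stat st;
        fstat(fd, &st);

        char buf[4096];
        read(fd, buf, sizeof buf);  // path 1: explicit syscall + copy

        // path 2: map once, then read via pointer dereference
        char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        long sum = 0;
        for (off_t i = 0; i < st.st_size; i++)
            sum += p[i];  // page faults only on first touch of each page
        printf("%ld\n", sum);

        munmap(p, st.st_size);
        close(fd);
    }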

Are you sure you want to use MMAP in your database management system?

(link)

  • blog post from one of the authors
  • not really something new
  • mentions OS evolution has fallen behind hardware
  • admits OS controlled IO has some advantages
    • no overhead when a page is mapped
  • mentions pointer swizzling as a fast alternative (sketched below)
    • done by LeanStore and Umbra
    • difficult to implement
    • supports only tree-like data structures
  • argues there should be an mmap-like interface with more control and better performance
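A rough sketch of the pointer-swizzling idea (illustrative only, not LeanStore's or Umbra's actual code; load_page() is a stand-in for the real buffer manager):

    #include <stdint.h>

    // A 64-bit slot holds either a raw pointer to an in-memory page (swizzled)
    // or a tagged on-disk page id (unswizzled). The low bit is the tag.
    typedef uint64_t swip_t;

    #define TAG_UNSWIZZLED 1ull

    static int is_swizzled(swip_t s) { return (s & TAG_UNSWIZZLED) == 0; }
    static uint64_t page_id(swip_t s) { return s >> 1; }

    static void *load_page(uint64_t id) {  // stand-in: would read the page into a frame
        (void)id;
        return 0;
    }

    // Hot path: a plain pointer chase, no indirection table at all.
    // Cold path: load the page once and replace the slot with a real pointer.
    void *resolve(swip_t *slot) {
        if (is_swizzled(*slot))
            return (void *)(uintptr_t)*slot;
        void *page = load_page(page_id(*slot));
        *slot = (swip_t)(uintptr_t)page;  // swizzle in place
        return page;
    }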

re: Are You Sure You Want to Use MMAP in Your Database Management System?

(link)

  • answer from the CEO of RavenDB, a database that uses mmap()
  • argues using mmap() saves a lot of time for the bigger tasks
  • points out the things the OS will take care of when using mmap()
    • including tracking of dirty pages
  • doing everything manually is also a difficult task
  • using fio to benchmark a buffer pool is pretty irrelevant
  • a buffer pool also brings considerable overhead
    • atomic reference counting can have extremely high costs
  • database data access is not random in practice
  • transaction safety is also a concern when using buffer pools
    • RavenDB modifies pages outside the mapped memory, not because of transaction safety but because of MVCC
    • a single writer, as in LMDB, is pretty common in embedded databases
  • I/O stalls are the biggest issue when using mmap()
    • other asynchronous I/O APIs also block occasionally, io_uring is better
    • it's possible to tell mmap() which memory is interesting using madvise(MADV_WILLNEED), see the sketch after this list
    • the overhead can be the same when allocating directly
  • RavenDB of course does error handling
    • validates data on first access
    • when using read() there is also no guarantee the data is not from a cache
    • when dealing with I/O errors, crash and restore is the only answer
  • page table contention was an OS bug which is now fixed
    • single-threaded page eviction happens rarely because pages rarely get dirty
    • TLB shootdowns rarely occur because there is plenty of RAM, and the time spent working with the data dominates the time spent on I/O
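A minimal sketch of the madvise(MADV_WILLNEED) hint, assuming a mapped file data.bin (illustrative): the kernel can read the pages ahead asynchronously instead of stalling on a major fault later.

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("data.bin", O_RDONLY);
        struct stat st;
        fstat(fd, &st);
        char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);

        madvise(p, st.st_size, MADV_WILLNEED);  // async readahead hint, does not block

        // ... later accesses to p[] are likely to hit already-resident pages ...
        munmap(p, st.st_size);
        close(fd);
    }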

More about mmap() file access

(link)

  • detailed information about mmap() and OS behavior
  • differentiates between minor and major page faults
    • a minor fault needs no disk access, the data comes from the page cache
    • a major fault triggers a disk IO operation
  • posix_fadvise(POSIX_FADV_DONTNEED) can be used to tell the OS which data is not needed anymore (both calls are sketched below)
    • the OS decides on its own whether it will evict the pages
    • in case of memory pressure, the kernel will start reclaiming memory
    • to force page eviction, madvise(MADV_DONTNEED) is used
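A minimal sketch of the two eviction knobs, again assuming a mapped file data.bin (illustrative):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("data.bin", O_RDONLY);
        struct stat st;
        fstat(fd, &st);
        char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);

        // hint on the page cache for a file range; the kernel may evict the pages
        posix_fadvise(fd, 0, st.st_size, POSIX_FADV_DONTNEED);

        // force-drop the pages of the mapping; the next access faults them back in
        madvise(p, st.st_size, MADV_DONTNEED);

        munmap(p, st.st_size);
        close(fd);
    }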