Papers and Blogposts

What Every Programmer Should Know About SSDs

(link)

Drives not Disks

  • don't think of SSDs as fast disks
  • no moving head -> random reads are about 100 times faster
    • ~10 ms vs ~100 µs

Parallelism

  • hard disks have a single disk head -> only good for sequential accesses
  • an SSD has dozens or hundreds of flash chips which can be accessed concurrently
    • SSDs stripe files across the chips at page granularity
      • schedule hundreds of random IO requests concurrently to keep all flash chips busy
      • use multithreading or asynchronous IO (see the sketch below)
    • with enough concurrency, SSDs achieve almost the full bandwidth even for random page reads
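A minimal sketch of such concurrent random reads on Linux, assuming a large file data.bin and 4 KiB pages (the file name, thread count, and request count are made-up illustrative values):

    // Keep many random 4 KiB reads in flight by running one reader per thread.
    // O_DIRECT bypasses the page cache so every read actually hits the drive.
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define PAGE 4096
    #define THREADS 32
    #define READS_PER_THREAD 1000

    static int fd;
    static off_t npages;

    static void *reader(void *arg) {
        void *buf;
        posix_memalign(&buf, PAGE, PAGE);        // O_DIRECT needs aligned buffers
        unsigned seed = (unsigned)(uintptr_t)arg;
        for (int i = 0; i < READS_PER_THREAD; i++) {
            off_t page = rand_r(&seed) % npages; // pick a random page
            pread(fd, buf, PAGE, page * PAGE);   // independent reads run concurrently
        }
        free(buf);
        return NULL;
    }

    int main(void) {
        fd = open("data.bin", O_RDONLY | O_DIRECT);
        npages = lseek(fd, 0, SEEK_END) / PAGE;
        pthread_t t[THREADS];
        for (long i = 0; i < THREADS; i++) pthread_create(&t[i], NULL, reader, (void *)i);
        for (int i = 0; i < THREADS; i++) pthread_join(t[i], NULL);
        close(fd);
    }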

Writing

  • high write bandwidth because multiple flash chips are accessed concurrently
  • writes are about 10 times slower than reads
    • ~100 µs (read) vs ~1 ms (write)
  • writes are cached in RAM
    • need to call sync/flush for durability (sketched below)
    • often a no-op in data centers because of UPSs/battery-backed power
  • a write occupies a flash chip about 10 times longer than a read
    • a read right after a write therefore has significantly higher latency
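A minimal sketch of the sync/flush point, assuming a hypothetical log file log.bin: the write() completes once the data is in the RAM cache, and only fsync() makes it durable on flash.

    #include <fcntl.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("log.bin", O_WRONLY | O_CREAT, 0644);
        const char rec[] = "record";
        write(fd, rec, sizeof rec);  // buffered: returns before the data is on flash
        fsync(fd);                   // blocks until the drive reports the data durable
        close(fd);
    }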

Out-Of-Place Writes

  • SSD blocks consist of hundreds of pages
  • NAND flash pages can't be overwritten
    • append only
    • page updates are written to a new location, the old page is invalidated
      • logical and physical page addresses are decoupled
      • the mapping table is stored on the SSD and maintained by the Flash Translation Layer (FTL), see the toy model below
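A toy model of what a page-level FTL mapping conceptually does (real FTLs are drive firmware and far more involved; all names here are made up):

    #include <stdint.h>

    #define NPAGES (1u << 20)

    static uint32_t l2p[NPAGES];    // mapping table: logical page -> physical page
    static uint32_t next_free = 0;  // toy free-space management: plain append

    // An update never overwrites in place: it appends to a free physical page,
    // repoints the mapping, and leaves the old physical page behind as garbage.
    uint32_t ftl_write(uint32_t logical) {
        uint32_t old = l2p[logical];
        uint32_t fresh = next_free++;  // append-only allocation
        l2p[logical] = fresh;          // the logical address stays stable
        (void)old;                     // the old page is now invalid, GC reclaims it
        return fresh;
    }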

Garbage Collection

  • because of out-of-place writes, the SSD would eventually run out of free blocks
    • the space of old/invalidated pages must be reclaimed
    • valid pages are copied to a fresh block first, then the whole block gets erased (sketched below)
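Continuing the toy model from above, one garbage-collection step could look like this (relocate() is a stand-in for copying the page and fixing the mapping):

    #include <stdbool.h>

    #define PAGES_PER_BLOCK 256

    typedef struct { bool valid[PAGES_PER_BLOCK]; } block_t;

    static void relocate(block_t *b, int p) {  // stand-in: copy page, update l2p[]
        b->valid[p] = false;
    }

    void gc_step(block_t *victim) {
        for (int p = 0; p < PAGES_PER_BLOCK; p++)
            if (victim->valid[p])
                relocate(victim, p);  // live data must be moved before the erase
        // block erase: the only granularity at which pages become writable again
        for (int p = 0; p < PAGES_PER_BLOCK; p++)
            victim->valid[p] = false;
    }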

Write Amplification and Overprovisioning

  • because of garbage collection, a logical (software) write can trigger multiple physical (flash) writes
    • the ratio is called write amplification (see the formula below)
    • if 3 logical writes require 4 physical writes, the write amplification is 1.33
  • high write amplification decreases performance and lifetime
  • random writes are worse than sequential ones
  • the fuller the SSD, the higher the write amplification
    • most SSDs reserve some spare capacity (overprovisioning) to ease this
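The ratio spelled out, using the 4-vs-3 example from the note:

    // write amplification = physical (flash) writes / logical (software) writes
    #include <stdint.h>

    double write_amplification(uint64_t physical, uint64_t logical) {
        return (double)physical / (double)logical;  // e.g. 4 / 3 ≈ 1.33
    }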

Why mmap is faster than system calls

(link)

  • quick background on OS, system calls, buffers, IO etc.
  • lists reasons why mmap() should be faster
    • no page faults when re-accessing data
    • no additional memory copy necessary when the data is in mapped memory
  • experiments compare syscall-based IO and mmap() (both access paths are sketched below)
  • mmap is way faster
  • explains this behavior by comparing CPU operations
  • mentions the faster (AVX/SIMD) copy function used by mmap() as the main reason
  • source
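A minimal sketch of the two access paths being compared, assuming an existing non-empty file data.bin (name and sizes illustrative):

    // read(): a syscall copies data from the page cache into a user buffer.
    // mmap(): the page-cache pages are mapped into the address space; after the
    // first fault, access is a plain memory load with no syscall and no extra copy.
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("data.bin", O_RDONLY);
        struct stat st;
        fstat(fd, &st);

        char buf[4096];
        read(fd, buf, sizeof buf);  // path 1: explicit syscall + copy

        // path 2: map once, then read via pointer dereference
        char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        long sum = 0;
        for (off_t i = 0; i < st.st_size; i++)
            sum += p[i];  // page faults only on first touch of each page
        printf("%ld\n", sum);

        munmap(p, st.st_size);
        close(fd);
    }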

Are you sure you want to use MMAP in your database management system?

(link)

  • blog post from one of the authors
  • not really something new
  • mentions OS evolution has fallen behind hardware
  • admits OS controlled IO has some advantages
    • no overhead when a page is mapped
  • mentions pointer swizzling as a fast alternative (sketched below)
    • done by LeanStore and Umbra
    • difficult to implement
    • supports only tree-like data structures
  • argues there should be an mmap-like interface with more control and better performance
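A rough sketch of the pointer-swizzling idea (illustrative only, not LeanStore's or Umbra's actual code; load_page() is a stand-in for the real buffer manager):

    #include <stdint.h>

    // A 64-bit slot holds either a raw pointer to an in-memory page (swizzled)
    // or a tagged on-disk page id (unswizzled). The low bit is the tag.
    typedef uint64_t swip_t;

    #define TAG_UNSWIZZLED 1ull

    static int is_swizzled(swip_t s) { return (s & TAG_UNSWIZZLED) == 0; }
    static uint64_t page_id(swip_t s) { return s >> 1; }

    static void *load_page(uint64_t id) {  // stand-in: would read the page into a frame
        (void)id;
        return 0;
    }

    // Hot path: a plain pointer chase, no indirection table at all.
    // Cold path: load the page once and replace the slot with a real pointer.
    void *resolve(swip_t *slot) {
        if (is_swizzled(*slot))
            return (void *)(uintptr_t)*slot;
        void *page = load_page(page_id(*slot));
        *slot = (swip_t)(uintptr_t)page;  // swizzle in place
        return page;
    }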

re: Are You Sure You Want to Use MMAP in Your Database Management System?

(link)

  • answer from the CEO of RavenDB, a database that uses mmap()
  • argues using mmap() saves a lot of time for the bigger tasks
  • points out the things the OS will take care of when using mmap()
    • including tracking of dirty pages
  • doing everything manually is also a difficult task
  • using fio to benchmark a buffer pool is pretty irrelevant
  • a buffer pool also brings considerable overhead
    • atomic reference counting can have extremely high costs
  • database data access is not random in practice
  • transaction safety is also a concern when using buffer pools
    • RavenDB modifies pages outside the mapped memory, not because of transaction safety but because of MVCC
    • a single writer, as in LMDB, is pretty common in embedded databases
  • I/O stalls are the biggest issue when using mmap()
    • other asynchronous I/O APIs also block occasionally, io_uring is better
    • it's possible to tell mmap() which memory is interesting using madvise(MADV_WILLNEED), see the sketch after this list
    • the overhead can be the same when allocating directly
  • RavenDB of course does error handling
    • validates data on first access
    • when using read() there is also no guarantee the data is not from a cache
    • when dealing with I/O errors, crash and restore is the only answer
  • page table contention was an OS bug which is now fixed
    • single-threaded page eviction happens rarely because pages rarely get dirty
    • TLB shootdowns rarely occur because there is plenty of RAM, and the time spent working with the data dominates the time spent on I/O
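A minimal sketch of the madvise(MADV_WILLNEED) hint, assuming a mapped file data.bin (illustrative): the kernel can read the pages ahead asynchronously instead of stalling on a major fault later.

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("data.bin", O_RDONLY);
        struct stat st;
        fstat(fd, &st);
        char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);

        madvise(p, st.st_size, MADV_WILLNEED);  // async readahead hint, does not block

        // ... later accesses to p[] are likely to hit already-resident pages ...
        munmap(p, st.st_size);
        close(fd);
    }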

More about mmap() file access

(link)

  • detailed information about mmap() and OS behavior
  • differentiates between minor and major page faults
    • a minor fault needs no disk access, the data comes from the page cache
    • a major fault triggers a disk IO operation
  • posix_fadvise(POSIX_FADV_DONTNEED) can be used to tell the OS which data is not needed anymore (both calls are sketched below)
    • the OS decides on its own whether it will evict the pages
    • in case of memory pressure, the kernel will start reclaiming memory
    • to force page eviction, madvise(MADV_DONTNEED) is used
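A minimal sketch of the two eviction knobs, again assuming a mapped file data.bin (illustrative):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("data.bin", O_RDONLY);
        struct stat st;
        fstat(fd, &st);
        char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);

        // hint on the page cache for a file range; the kernel may evict the pages
        posix_fadvise(fd, 0, st.st_size, POSIX_FADV_DONTNEED);

        // force-drop the pages of the mapping; the next access faults them back in
        madvise(p, st.st_size, MADV_DONTNEED);

        munmap(p, st.st_size);
        close(fd);
    }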