Papers and Blogposts
- don't think of SSDs as fast disks
- no moving head -> random read 100 times faster
- 10 ms vs 100 µs
- hard disks have one disk head -> good for sequential accesses
- an SSD has dozens or hundreds of flash chips which can be accessed concurrently
- SSDs stripe files across the chips at page granularity
- schedule hundreds of random IO requests concurrently to keep all flash chips busy
- use multithreading or asynchronous IO (see the sketch after these notes)
- SSDs achieve almost the full bandwidth with random page reads
- SSDs stripe files across the chips at page granularity
- high bandwidth because multiple flash chips are accessed concurrently
- writes 10 times slower than reads
- 100 µs vs 1 ms
- writes cached in RAM
- need to call sync/flush
- doesn't do anything in datacenters because of UPSs
- write occupies a flash chip 10 times longer than read
- read after write has significant latency
- SSD blocks consist of hundreds of pages
- NAND flash pages can't be overwritten
- append only
- page updates are written to new location, old page is invalidated
- logical and physical page addresses decoupled
- mapping table stored on the SSD, managed by the Flash Translation Layer (FTL)
- because of out-of-place writes, the SSD will run out of free blocks
- space of old/invalidated pages must be reclaimed
- valid pages are copied to another block, then the whole block gets erased
- because of garbage collection, a logical (software) write can trigger multiple physical (flash) writes
- ratio is called write amplification
- if 3 logical writes require 4 physical writes, write amplification is 1.33
- high write amplification decreases performance and lifetime
- random writes worse than sequential
- higher space utilization => higher write amplification
- most SSDs reserve some spare capacity (over-provisioning) to ease this
- because of fast SSDs, the operating system I/O stack is often the performance bottleneck
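A minimal sketch of the "keep all flash chips busy" advice above, assuming Linux/POSIX; the file name `data.bin`, the thread count, and the request count are made up for illustration:

```cpp
// Minimal sketch: keep the SSD's flash chips busy by issuing many random
// 4 KiB reads from multiple threads with pread(). Assumes Linux/POSIX;
// "data.bin", the thread count, and the request count are made up.
#include <fcntl.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <random>
#include <thread>
#include <vector>

int main() {
  constexpr size_t kPageSize = 4096;
  constexpr int kThreads = 32;              // many requests in flight per SSD
  constexpr size_t kReadsPerThread = 10000;

  const int fd = open("data.bin", O_RDONLY | O_DIRECT);  // bypass the page cache
  if (fd < 0) { perror("open"); return 1; }
  const off_t file_size = lseek(fd, 0, SEEK_END);
  const uint64_t num_pages = static_cast<uint64_t>(file_size) / kPageSize;
  if (num_pages == 0) { close(fd); return 1; }

  std::vector<std::thread> threads;
  for (int t = 0; t < kThreads; ++t) {
    threads.emplace_back([&, t] {
      // O_DIRECT requires page-aligned buffers.
      void* buf = std::aligned_alloc(kPageSize, kPageSize);
      std::mt19937_64 rng(t);
      for (size_t i = 0; i < kReadsPerThread; ++i) {
        const uint64_t page = rng() % num_pages;
        if (pread(fd, buf, kPageSize, static_cast<off_t>(page * kPageSize)) !=
            static_cast<ssize_t>(kPageSize)) {
          perror("pread");
          break;
        }
      }
      std::free(buf);
    });
  }
  for (auto& th : threads) th.join();
  close(fd);
}
```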
- tutorial on coding for SSDs
- experiments
- quick background on OS, system calls, buffers, IO etc.
- lists reasons why mmap() should be faster
- no page faults when reaccessing data
- no additional memory copy necessary when data is in mapped memory
- did experiments comparing syscall and mmap
- mmap is way faster
- explains this behavior by comparing CPU operations
- mentions the faster (AVX/SIMD) copy function used by mmap() as the main reason (see the sketch after these notes)
- source
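A minimal sketch of the two access paths such experiments compare, not the original benchmark; it assumes Linux/POSIX, and the file name `data.bin` is made up:

```cpp
// Minimal sketch: copying a file into a user buffer with pread() vs.
// mapping it with mmap() and touching the bytes in place.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <vector>

static uint64_t sum_bytes(const unsigned char* p, size_t n) {
  uint64_t s = 0;
  for (size_t i = 0; i < n; ++i) s += p[i];
  return s;
}

int main() {
  const int fd = open("data.bin", O_RDONLY);
  if (fd < 0) { perror("open"); return 1; }
  struct stat st;
  fstat(fd, &st);
  const size_t size = static_cast<size_t>(st.st_size);

  // Path 1: pread() copies the data from the kernel page cache into a
  // user-space buffer before we can touch it.
  std::vector<unsigned char> buf(size);
  size_t off = 0;
  while (off < size) {
    const ssize_t n = pread(fd, buf.data() + off, size - off, static_cast<off_t>(off));
    if (n <= 0) { perror("pread"); return 1; }
    off += static_cast<size_t>(n);
  }
  const uint64_t sum1 = sum_bytes(buf.data(), size);

  // Path 2: mmap() exposes the page cache directly, so no extra copy is
  // needed, but the first touch of each page triggers a page fault.
  void* map = mmap(nullptr, size, PROT_READ, MAP_SHARED, fd, 0);
  if (map == MAP_FAILED) { perror("mmap"); return 1; }
  const uint64_t sum2 = sum_bytes(static_cast<const unsigned char*>(map), size);

  std::printf("pread(): %llu  mmap(): %llu\n",
              static_cast<unsigned long long>(sum1),
              static_cast<unsigned long long>(sum2));
  munmap(map, size);
  close(fd);
}
```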
- blog post from one of the authors
- not really something new
- mentions OS evolution has fallen behind hardware
- admits OS controlled IO has some advantages
- no overhead when a page is mapped
- mentions pointer swizzling as a fast alternative (a minimal sketch follows these notes)
- done by LeanStore and Umbra
- difficult to implement
- supports only tree-like data structures
- there should be a mmap-like interface with more control and better performance
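A minimal sketch of the pointer-swizzling idea only; LeanStore and Umbra implement it differently, and the tag-bit encoding and the name `Swip` are assumptions for illustration:

```cpp
// Sketch: a child reference inside a tree node is a 64-bit word holding
// either an in-memory pointer ("swizzled") or an on-disk page id
// ("unswizzled"), distinguished by the lowest bit.
#include <cassert>
#include <cstdint>

struct Page;  // an in-memory page, details irrelevant here

class Swip {
  uint64_t word_ = 0;
  static constexpr uint64_t kUnswizzledBit = 1ull;

 public:
  bool is_swizzled() const { return (word_ & kUnswizzledBit) == 0; }

  // Only valid while the child page is resident in memory: the word is a
  // plain pointer, so following it costs nothing extra.
  Page* pointer() const {
    assert(is_swizzled());
    return reinterpret_cast<Page*>(word_);
  }

  // Only valid while the child page is on disk: the word stores the page
  // id, tagged in the lowest bit.
  uint64_t page_id() const {
    assert(!is_swizzled());
    return word_ >> 1;
  }

  void swizzle(Page* p) {            // child page was loaded into memory
    word_ = reinterpret_cast<uint64_t>(p);  // pointers are at least 2-byte aligned
  }
  void unswizzle(uint64_t pid) {     // child page was evicted to disk
    word_ = (pid << 1) | kUnswizzledBit;
  }
};
```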
- answer from the CEO of RavenDB, a database which uses mmap()
- argues using mmap() saves a lot of time for the bigger tasks
- points out the things the OS will take care of when using mmap()
- including tracking of dirty pages
- doing everything manually is also a difficult task
- using fio to benchmark a buffer pool is pretty irrelevant
- a buffer pool also brings significant overhead
- atomic reference counting can have extremely high costs
- database data access is not random in practice
- transaction safety is also a concern when using buffer pools
- RavenDB modifies pages outside the mapped memory not because of mmap() but because of MVCC
- a single writer like in LMDB is pretty common in embedded databases
- I/O stalls are the biggest issue when using mmap()
- other asynchronous I/O is also occasionally blocking, io_uring is better
- it's possible to tell the OS which memory is interesting, using madvise(MADV_WILLNEED) (see the sketch after these notes)
- overhead can be the same as when allocating directly
- RavenDB of course does error handling
- validates data on first access
- when using read() there is also no guarantee the data is not from a cache
- when dealing with I/O errors, crash and restore is the only answer
- page table contention was an OS bug which is now fixed
- single threaded page eviction happens rarely because pages rarely get dirty
- TLB shootdowns rarely occur because there is plenty of RAM and the time spent working with the data dominates the time used for I/O
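A minimal sketch of the madvise(MADV_WILLNEED) hint mentioned above, assuming Linux/POSIX; the file name `data.bin` and the prefetch size are made up:

```cpp
// Sketch: hint to the kernel which part of a mapping will be needed soon.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <algorithm>
#include <cstdio>

int main() {
  const int fd = open("data.bin", O_RDONLY);
  if (fd < 0) { perror("open"); return 1; }
  struct stat st;
  fstat(fd, &st);
  const size_t size = static_cast<size_t>(st.st_size);

  void* m = mmap(nullptr, size, PROT_READ, MAP_SHARED, fd, 0);
  if (m == MAP_FAILED) { perror("mmap"); return 1; }

  // Ask the kernel to start reading the first 64 MiB in the background,
  // so later accesses hit memory instead of stalling on major page faults.
  const size_t prefetch = std::min<size_t>(64ul << 20, size);
  if (madvise(m, prefetch, MADV_WILLNEED) != 0) perror("madvise");

  // ... access the mapped data here ...

  munmap(m, size);
  close(fd);
}
```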
- detailed information about mmap() and OS behavior
- differentiates between minor and major page faults
- minor needs no disk access, data comes from page cache
- major creates disk IO operation
- posix_fadvise(POSIX_FADV_DONTNEED) can be used to tell the OS which data is not needed anymore
- OS will decide on its own if it will evict the pages
- in case of memory pressure, kernel will start reclaiming memory
- to force page eviction, madvise(MADV_DONTNEED) is used (see the sketch below)
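A minimal sketch of both eviction hints from the notes above, assuming Linux/POSIX; the file name `data.bin` is made up for illustration:

```cpp
// Sketch: advisory page-cache eviction via posix_fadvise() vs. forced
// eviction of mapped pages via madvise(MADV_DONTNEED).
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main() {
  const int fd = open("data.bin", O_RDONLY);
  if (fd < 0) { perror("open"); return 1; }
  struct stat st;
  fstat(fd, &st);
  const size_t size = static_cast<size_t>(st.st_size);

  // Advisory: tell the OS that the cached file data is no longer needed.
  // The kernel decides on its own whether and when to drop the pages.
  posix_fadvise(fd, 0, st.st_size, POSIX_FADV_DONTNEED);

  // Forcing eviction from a mapping: after MADV_DONTNEED the pages in the
  // range are dropped and are faulted back in from the file on next access.
  void* m = mmap(nullptr, size, PROT_READ, MAP_SHARED, fd, 0);
  if (m == MAP_FAILED) { perror("mmap"); return 1; }
  // ... work with the mapping ...
  if (madvise(m, size, MADV_DONTNEED) != 0) perror("madvise");

  munmap(m, size);
  close(fd);
}
```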