
Meeting Notes

Mar 24th, 2023

Wiebe

Project: Implemented sampling support for my PMEM tracing tool (pmemtrace) in the kernel. On top of that, developed an easy-to-use CLI application that interacts with pmemtrace through added IOCTL calls, e.g. usage: sudo pmemtrace --freq 60 --device /dev/ndctl0 "head -c 1M </dev/urandom >/mnt/pmem_emul/randfile.txt". Had a quick discussion with Finn (ETH Zurich) about a problem with UFS and got it working. Next step is to replay traces in user space on a bare-metal DAX-mapped device; already compiled a (preliminary) list of metrics to investigate.

Thesis: introduction chapter written (~80% done), still working on the background chapter (~50% done).

Mar 10th, 2023

Niels

Presented the proposal for the thesis; need to discuss some questions. Wednesday meeting with the climate group; did some research on data formats. Not sure where to continue with the indexing work. The data is high-dimensional grid data (coordinates x, y plus n dimensions of time data). Question: how to generalize to this domain; is it possible to optimize that data structure? The data is stored next to metadata, and the metadata is the entry point. Not sure whether the whole KV index needs to be loaded; they use chunks. Some students from the systems seminar are working on the file format. Project idea: performance benchmarking of climate data formats. Download data from the Dutch academy and use it for benchmarking. Do they provide the raw data or processed data?

Peter

Talked to NVIDIA and met with them; limited early-access technical preview. Cleaned up some code, added an email address. Thesis proposal early in the week. When the queue is full, the driver delays for 1 ms and retries, but each request needs sub-ms time. Changed the kernel to submit when a slot is available: 17 s down to 1 s. The driver is fully interrupt-based, so polling is not possible. virtio-fs driver for NVIDIA. Proposal by next week; set 2-3 key features to investigate, such as workload-specific optimization. Climate data? (JSON processing on the DPU)

With file type + traffic information, the DPU can differentiate between files. Do prioritization in the DPU; QoS in the DPU. What API should be used to pass the hint, fcntl? (look at RocksDB). File-level differentiated service.

Wiebe

Presentation: tracing PM in the kernel. Patch the MMU tracing, e.g. for huge pages. Trace accesses from virtual address to physical address. 1 MB file, sequential -> 15 sec? Can track all reads/writes. Tracing and optimization are two different processes; working but not optimal. Worst-case solution: only trace small file writes. Do the trace on a small workload and loop over small workloads. Check the page allocation policy. Next week: collect some traces, then do benchmarking without running the whole workload. Start a new Google doc and put all updates in it. The MMU is hardware and cannot easily be traced. Use the MMU to lock writes? Flush the TLB entry every time. Disable all cores, or there will be consistency issues. In-memory databases, MonoDB, 10 years ago: find papers and PhD theses on in-memory databases, and a paper on how memory is allocated. Target: end of April; May for optimizations, June for writing etc. How to collect micro-architectural properties?

Nick

msF2FS: better data grouping, avoids GC, no sudden performance drop. Some performance drops are caused by checkpointing; more checkpointing because of segment summaries. Always buffered I/O. Heatmap: msF2FS's individual zones are in use for more of the time. If a file is deleted, continue using the zone; zones that hold no valid data are reset at checkpoint time. With RocksDB overwrite performance, when there is lots of GC, msF2FS is much better. It is possible to plot these things as a timeline.

Plan until next Thursday: finish writing and redo the results for some F2FS experiments.

Mar 3rd, 2023

Nick

Still running RocksDB benchmarks (hoping to finish today); will continue over the weekend on node1 and node6. Add one ZNS SSD to node3. Will keep runs going over the weekend. Tracing metadata for msF2FS is the most important part (BPF). Do you already have the numbers? Small 10 GiB run (frequency of updates with 7 streams and 1 F2FS log). Have to do a normalisation (logical metadata per file operation); raw rate numbers probably cannot be compared. We can help with the writing part.

Niels

Climate/AI modelling. We can use their traces for the timeseries database. They want to filter part of the data and look at simple filters. You have traces; write a small data generator. You have to write a benchmark for this to evaluate Frogfish. Two options:

  1. if you run out of space, run aggregation.
  2. else, move to a slower medium.

Should work on both flash and ZNS. We need to discuss how to set up the MSc for this. Add questions: Q0: What are domain-specific optimisations? LLVM-Xtrace

Wiebe

Write-heavy workload for PMEM traces: Filebench. Wiebe looked into a heatmap and the granularity of the request sizes. MMIO-trace is used to check the number of traces; the average request size is 4 KiB. Could not get ctFS working. Did step 1 of his thesis: understanding traces, without Varmail. Why run it on the filesystem? We need to look at MMU translation. Need to trace page-level caching as well? ctFS would be nicer than uFS. In 6 months Wiebe will tell us if it was all worth it.
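For reference, mmiotrace is driven through the kernel's tracing interface in debugfs; a minimal capture sketch, assuming debugfs is mounted at the usual path and the region of interest is already ioremapped:

    # enable the MMIO tracer and stream its events to a log file
    sudo sh -c 'echo mmiotrace > /sys/kernel/debug/tracing/current_tracer'
    sudo cat /sys/kernel/debug/tracing/trace_pipe > mmio.log &
    # ... run the workload, then switch the tracer off again
    sudo sh -c 'echo nop > /sys/kernel/debug/tracing/current_tracer'

Counting and bucketing the logged accesses is one way to arrive at numbers like the 4 KiB average request size mentioned above.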

Peter-Jan

Has done magic with DPUs. Multiqueue for the virtio-fs FUSE driver is messed up, as it only supports one queue; as a result only one core and one queue can be used on the DPU.

Zebin FAST:

Andy Warfield keynote: we should all watch this and know him. Send the AI and ML papers to Sacheen. Shade for AI/climate. The latency paper is important. We should know all FS and all KV papers. Fisc is interesting for Peter-Jan, PM for Wiebe, remote memory for Animesh. We should know all I/O stack papers. The SSD smartphone papers are potentially relevant for GC.

Feb 24th, 2023

Zebin

Sanity check the CHEOPS paper.

Nicola

BSc thesis proposal is close to done. Will start (and has already started) on the implementation on Monday; finish the proposal by then. For the proposal, also think about 2 questions on design: is the design trivial? How do we quantify the design? Add a question on Rust. In general, add to a node (node4). Look into xNVMe. The bachelor sync will happen mid-March.

Niels

Wednesday discussed his proposal. In 2 weeks he will be presenting in the group meeting. Yesterday started porting from Common Lisp (tell me how to do Lisp at some point). Will look into append with passthrough. What are the options for append? Try to get it working on io_uring. Name discussion: Sinkhole. Come up with names: OctopusDB, SquidDB. Sanity check needs to be done against SIGMOD. FEMU: switched to 2/3 zones (multiple streams is nice to have). Wrap up the proposal by next week and finish the broad research questions.

Wiebe

Spent 50/50 on proposal/practical; spent time on fixing SplitFS and ctFS, and both "work" now. Is updating SplitFS to a new kernel, but it still crashes under load. He can reproduce numbers (great progress). For next week, try to fix it (not a priority), else move on. "Noah?Nomad?", the on-kernel FS, does work. Process the feedback on the proposal before the next meeting. Will get access to people with an account with root access. Big questions remain on how to do correct traces (on a page basis). Capture the raw accesses; need to check the micro-architectural side with sanity checks.

Peter-Jan

Submitted a draft to an "RDMA alliance". The multiqueue problem is still present (might be a bug in the firmware); however, it will not be fixed before the deadline, which is next Thursday. Mentioned that reads are slow because there are 2 operations for every read (FUSE); this needs to be turned off. Fixed bs to 128K (from 4k) (DPU limit). Reads are buffered, writes are immediate; test with block sizes soon. All experiments are finished; finish the graphs today. The nulldev bottleneck is queue/thread bound. Write the paper backwards before Thursday: eval first, intro second. Why is write more expensive than read? Check CPU utilisation. Zebin will help Peter-Jan with the breakdown. NFS latency is not consistent? Animesh said for now just plot the stddev. The DPU is the bottleneck. In general, show the range where you are winning. Perf ramp over time (today). Tests take 7 hours, so try to parallelize.

Nick

Benchmarking the file system; a week to finish the thesis (it is about to happen). No longer runs FEMU, back to real ZNS because performance was bad. Went with the 512 format. Nice results on fragmentation. Try to wrap up the experiments by next week.

Feb 10th, 2023

Wiebe (master thesis)

Progress

  • Set up a VM with a custom kernel. Working environment on DRAM (uFS uses DRAM).
  • Research proposal setup.

Comments from Animesh

  • Keep independence from HTC.
  • More high-level with ctFS.
  • Check if ctFS can be made to work with a newer kernel.
  • We do not know if uFS is better than ctFS; investigate.
  • Next week: move to Optane and code.
  • What is the state of the benchmarks? TLB misses and cache stuff, PMX.
  • Narrow down in the next week/10 days.
  • Investigate micro-architectural things (how does page allocation happen? prefetching? caching? hugepages? TLB misses? CPU stalled?); see the perf sketch after this list.
  • Goal: literature study and investigation; make a framework for applications.
  • Do not use SplitFS. Be ready very soon with the toolchain. Run quickly on Optane. If results do not match, then the finding is that the emulation is incorrect.
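A minimal sketch of how such micro-architectural counters could be collected with perf; the generic event aliases below exist on most x86 setups, but the workload command and file path are placeholders, not the actual toolchain:

    # hypothetical workload command; substitute the real benchmark
    perf stat -e dTLB-load-misses,dTLB-store-misses,cache-references,cache-misses,page-faults,cycles,instructions \
        ./pmem_workload --file /mnt/pmem_emul/testfile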

Niels (master thesis)

Progress

  • proposal: introduction, context
  • look into benchmarks, hard (30 MiB/s)
    • look into CPU util, etc.
    • local TCP socket (loopback device)

TODO/comments

  • run fio for a baseline (see the sketch after this list)
  • (Krijn) Add Niels to the cluster
  • Try FEMU (read the FAST paper) and try to use it. Ask Nick.
  • Finish the proposal
  • Add an analysis for the decision to start from scratch.
  • Run fio after FEMU.
  • What are the choices? What are the benchmarks? What are the implementations? Have a soft deadline every 2 weeks; every month a decision/progress point.
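One possible fio baseline to compare the 30 MiB/s ingest number against; a sketch only, where the device path, block size and queue depth are assumptions rather than anything agreed in the meeting:

    # raw random-write baseline straight against the NVMe device (destructive!)
    sudo fio --name=baseline --filename=/dev/nvme0n1 --ioengine=io_uring --direct=1 \
        --rw=randwrite --bs=4k --iodepth=32 --runtime=60 --time_based --group_reporting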

Peter-Jan (Master thesis)

Progress and comments intermixed

  • Has mailed the Linux kernel maintainers for the VFIO driver patch. Very nice!
  • Goal: if cat works on a single core, then it works.
  • Hot-plugging support needed. Needs a patch for multiple cores. Made a kernel-development environment.
  • Needs Ubuntu in QEMU; Niels and Wiebe can help.
  • ___ will give an update at the end of the week.
  • Crazy emulator stuff with two NUMA nodes for disaggregated memory. MQSim emulator. Do not want a dependency that breaks the thesis.
  • CXL thesis.
  • Discussion later today.
  • If going further into the file-system-on-DPU direction, look into the ARM cluster stuff.
  • Merge the introduction with ___. Should freeze; need a story in 3 weeks.
  • Cannot wait for FAST to publish. Mail the second author of the DPU paper ASAP (VU mail, CC Animesh).

Nicola (Bachelor thesis)

  • Project proposal, still struggling with it. Problems with the "why", and which part should be the focus of the proposal.
  • Research direction needs to be finalised. Implementation is done. Fix after the meeting; Animesh and Krijn will help. Key papers to bootstrap.
  • Final revision and then done.
  • Needs a VM on the cluster, but can also use the desktop at home.
  • Needs to work on the environment.

Jan 13th, 2023

Wiebe

Progress: received feedback & the grade for the survey on Monday. Sent Animesh the final (slightly) revised version on Thursday (only minor cleanup/clarification tweaks).

Nick

Progress: 70/80 pages of content added; 5 contributions. Target: ICT.OPEN abstract summary on 19 Jan (2 pages). Should contain: introduction, F2FS before, F2FS after, one visualisation of the tool, graphs of results. Add the GitHub links. Send a message to Animesh for Alex. Systor master thesis submission; share the draft. Add Nick's tool for RocksDB/FS.

Zebin

SPDK performs better than io_uring, but is still limited to 1 million IOPS with 8 devices. Tasks:

  • Get more organised, take notes on the wiki. Try to keep track.
  • If by mid-February there is no draft or key experiment, then move on to the next project. By next week: use perf and ftrace to measure the latency/overhead of each subpart of the storage stack. Will try a better classification; a flamegraph is better. Some metrics: interrupts (how many?), how many syscalls, how many locks. Don't wait, work ahead. Related work as quickly as possible. Look at the FAST proceedings when they come out. Do not forget to ask for help in 2/3 weeks.

TODO before next Friday:

  • Find out why the maximum throughput can't reach 4500k IOPS
  • The current throughput of SPDK is only 1M IOPS (it should be able to reach 10M IOPS); find out why
  • Revise the introduction; finish the background, related work and part of the motivation.
  • Latency and work breakdown: where does the Linux storage stack spend its time and work?
  • Some guesses: interrupts, system calls, locks (where and how much); measure them (see the tracing sketch after this list)
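A hedged starting point for measuring those guesses: one bpftrace invocation that counts syscalls, interrupts and block-layer requests system-wide. The tracepoints exist in mainline kernels, but the actual breakdown methodology (perf, ftrace, flamegraphs) is still to be decided:

    # count syscalls, IRQs and block requests for 10 seconds, then print the maps
    sudo bpftrace -e '
      tracepoint:raw_syscalls:sys_enter { @syscalls[comm] = count(); }
      tracepoint:irq:irq_handler_entry  { @irqs[args->irq] = count(); }
      tracepoint:block:block_rq_issue   { @block_reqs[comm] = count(); }
      interval:s:10                     { exit(); }'

Lock contention would still need a separate tool (e.g. perf lock or lock-related tracepoints).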

Krijn

A selective slice of TropoDB, 5 pages, ready in February. Help Nick with RocksDB and experiments. Helping with teaching in February. Look at Canvas/next week's preparation for the course. Priorities in February:

  1. course
  2. TropoDB (add more related work).
  3. Systor (help)

March is CSUR.

Dec 9th, 2022

Wiebe

Niels

Zebin

Nick

Discussion points/progress:

  • There is one lock for inodes, and one lock for streams (only 1 file on a stream at a time).
  • Discussed and implemented multiple schedulers: round-robin vs stream (affinity)? Could also have done stride round-robin (4k on one stream, then the next).
  • The current result of the scheduler: 3 files on 3 streams leads to 3x perf and can be benchmarked.

Tasks:

  • Write a discussion on the API of F2FS and the configuration used. Don't make policies in the kernel; pass everything to user space instead.
  • How to translate this in an application? Use RocksDB.
  • How will the fcntl look? Bitmap, flag. <- How do you pass all this information from the kernel? RQ
  • Include your survey, but don't go to 200 pages.
  • Systor for the thesis next year, CSUR for the survey.

Peter-Jan

Discussion points/progress:

  • 1.5 weeks on NVIDIA bugs, because a part of an NVIDIA driver is NOT thread-safe.
  • virtio also has queue polling.
  • Uses 1 thread for NFS requests and 1 thread for polls on the socket.
  • Now only 1 thread works on the virtio buffer, and one thread does NFS polling. XLIO, no memcpy. Went down from around 100 us to 80 us (120 -> 80 us). Seq QD=16: write 180 MB/s, read 20 MB/s.

Krijn

Progress:

  • NVMe/ZNS benchmark report is close to finished

Tasks:

  • SIGMETRICS for ZNS
  • CSur for KV-store paper
  • Arxiv for TropoDB
  • Help Nick with Systor paper (next year)
  • TA next year in February/March

Nov 25th, 2022

Wiebe

  • covered many papers; missing the section on atomicity and failure (some papers missing here)
  • todo: write the conclusion, mostly done
  • next week Tue/Wed: first complete draft
  • presentation in 2 weeks' time: rough sketch of the presentation

Niels

  • completed the paper reading that needs to be included (stream processing needs more work)
  • next mon/tue - first complete draft
  • next fri - presentation draft

The following week, Wednesday the 7th, is the final presentation.

Nick

  • Single-file parallelism does not work. MDTS is ~1 MB; mq-deadline merges consecutive blocks on a single stream into one MDTS-sized request, but with 2 streams it drops by 1/4th, and the 1/4th of MDTS remains for any further stream parallelism (see the sketch after this list for checking MDTS and the scheduler).
  • Pinning a file to a stream works: fcntl, fadvise.
  • Going back to RocksDB: how do the files from RocksDB now map onto this new, nicer interface? => (goal): a microbenchmark with fio is OK, but RocksDB needs to be targeted again so that we move beyond file-system-level-only configuration.
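A quick, hedged way to sanity-check the MDTS and merging limits referred to above (the device names are placeholders for the actual ZNS drive):

    # MDTS as reported by the controller (a power-of-two multiple of the minimum page size)
    sudo nvme id-ctrl /dev/nvme0 | grep -i mdts
    # maximum request size and I/O scheduler the block layer will actually use
    cat /sys/block/nvme0n1/queue/max_sectors_kb
    cat /sys/block/nvme0n1/queue/scheduler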

Krijn

  • benchmarked the new and old devices, [google drive link sent]

Zebin

  • did instruction splitting
  • 2x Optane working, 1+ million IOPS

Sep 23rd, 2022

Wiebe

Use the page table to accelerate metadata lookup for small tables.

Filesystem metadata can be stored on NVM storage, so metadata lookups do not need to access the disk.

Niels

Benchmarking time-series databases.

Time-series database benchmarking: SciTS: A Benchmark for Time-Series Databases in Scientific Experiments and Industrial Internet of Things

Google's in memory time-series database: Monarch: Google’s Planet-Scale In-Memory Time Series Database

Nick

F2FS on ZNS

Track the location of files on the ZNS device (custom printk to log file locations etc.).

File classifications:

  • Static: initial write (warm)
  • Dynamic: after GC (if not updated, cold)
  • Use hints (fcntl)

Q: What if a bad classification is chosen?

With RocksDB:

  • What hints are used, and at what level are the hints passed?
  • Classification of DB files.
  • Is other info passed down, such as flags?

Q: What is the classification before BG GC and after BG GC?

Q: What does the layout look like? where do the different files end up?

Q: Overall goal to answer: when does passing better hints at file creation time benefit performance?

Q: What other applications can be used to test the file system's performance?

Oct 7th, 2022

Niels

InfluxDB: https://cs.ulb.ac.be/public/_media/teaching/influxdb_2017.pdf. Stores arrays in rows.

ClickHouse DB has better performance.

Facebook 2015: HBase

Use B+ tree? Where to build the tree?

Deploy the databases mentioned above and run benchmarks.

Wiebe

(TODO: Find the paper) Use a tree structure to find a file.

User-space file systems are easier to implement and debug, so try that.

uFS exposes a set of calls to user space, and a large part of it is implemented in user space, such as page tables. A possible solution is to expose a kernel API to user space.

Implement fs in user space as much as possible?

TODO: Make a slide in two weeks, one or two pages.

Nick

File allocation for RocksDB: check the wiki page for a detailed description.

There are 6 lifetime hints, and only 3 are used: short, medium and extreme for hot, warm and cold.

The SST structures end up in different segments (cold or warm segments). Check what happens and where (in RocksDB or in F2FS?).

Hints:

  • Hotness classification
  • Lifecycle

Krijn

Test SPDK performance on ZNS; use a callback for tracing.

Use BPF: check the wiki doc for more details.

Oct 14th, 2022

Niels

The paper's ingest is 30 MB/sec. What causes the performance issues? (Ask for the paper.)

SW/HW problem: use an in-memory DB. Is it a hardware or a software problem? What's the bottleneck?

Rerun the benchmark; a couple of GB is to be expected.

(TODO for Z: read the paper)

Wiebe

Paper: (ask later)

Page table? Most papers are software based.

Hardware optimization:

Pay attention to the general class of data structures, caching, prefetching.

Most use b+ trees. (https://tolia.org/files/pubs/fast2011.pdf)

B tree for metadata lookup.

Plan: Extend the list of papers.

For next week's meeting at 1pm, present an overview.

me

SPDK and io_uring. I/O co-processor. I/O completion on a single core.

Krijn

Benchmark SPDK and io_uring.

Compare ZNS and a normal NVMe device.

(250k at queue depth 4)

OPT1: One job, qdepth 1-8

The ZNS device has a normal 4 GB part; do benchmarking on that part?

libzbc: want to show that SPDK delivers better performance.

Problem: does it deliver better latency?

Better latency for a single read, write, append?

Latency of each operation: qd = 1, 1 job, 4 KB.

Increase #jobs with qd = 1.

Single job, multiple qd, mq-deadline, use append (no SPDK).

(SPDK cannot do multiple-write buffering?)

Exp1: 1 reader at loc 0 and 1 reader at loc 1/2. Exp2: 1 job, qd = 2, control where they are reading from.

(SPDK is a single process library)

libzbc/libzbd ->

fio uses io_uring for ZNS interaction -> write this up on the wiki
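A sketch of what that wiki entry could contain, i.e. fio driving a ZNS namespace through io_uring in zoned mode; the device path and zone settings are assumptions:

    # sequential write to one zone; zonemode=zbd makes fio respect zone write pointers
    sudo fio --name=zns-seqwrite --filename=/dev/nvme0n2 --zonemode=zbd --direct=1 \
        --ioengine=io_uring --rw=write --bs=4k --iodepth=1 --max_open_zones=1 --size=1z

With iodepth=1 no reordering can violate the sequential-write constraint; higher queue depths need either mq-deadline in front of the device or zone append.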

Nick

If there is a single open zone for each classification, with multiple files ->

Two-zone write performance.

How many concurrent writers?

How to decide how many zones are open at the same time.

Design the mechanism rather than the policy.

TODO: reconcile the performance of single- and two-zone writes.

Oct 21st, 2022

Wiebe

Fast FS metadata lookup (check the literature study)

More general questions:

  • What kinds of PM file systems exist?

What to improve (3 in total, ask later):

TLB, page table

Metadata

Paper to read:

ctFS, FAST 2022

Niels

Paper: Storing ?? data

TimescaleDB and Kafka

Oct 28th, 2022

Nick

Checkpoints fail sometimes; check why.

When checkpointing, the I/O ops before it must be finished.

Krijn

SPDK and io_uring

Sequential zones and random zones. 4k LBA, the DB does not scale. (Try to reduce padding?)

Assume append will improve performance; redo the exercise on the new devices.

Try to write the intro: why SPDK, why ZFS, why ...?

Background: ZNS, ZFS, what is the difference, one page, incl. figures etc.

Wiebe

SPDK:

ctFS:

Following: read more papers. Ten by next week. USENIX, SIGMOD, etc.

write a list of conferences

key papers in detail.

Nick

BTrDB: a special tree structure to index time-series data

4 nodes, 500 MB, hard disk

Nov 7th, 2022

Wiebe

Read more papers

Write the discussed section

MMU for translation. SCMFS

Logging structure,

High-level design of metadata lookup

Deadline: beginning of December? Last week of November, the 26th?

Structure: this Friday

In-memory, on-disk and some benchmarking; stream processing

Key problems and how they are addressed.

wed 4.30

Zebin

fio instruction counter

Nick

6 types (hot, warm, cold for data and node).

Multiple for each part: more than one stream (zone) per type.

Round-robin first (scheduling).

FS virtualization with DPU (PCIe)

FUSE on DPU

now: NFS on DPU

  1. NFS maps easily onto FUSE
  2. cloud providers use NFS a lot

host to guest VM

Krijn

ZNS device, KV store on ZNS

TropoDB, user space with SPDK

Larger blocks, higher throughput.

Disable compaction?

Nov 11th, 2022

Niels

Draft for intro, background.

Find more papers on stream processing.

Nick

Two more hot data sections, round-robin by block, implemented.

Round-robin by number, no good results.

512 qd, 4k, 4 GB file

Wiebe

Separation of the FS design.

Hash table, logging structures, kernel-based.

Some papers about hashing structures.

Next week: finish the design section, more papers about metadata, atomicity.

How to ensure mmap in the kernel.

Paper about: failure atomicity.

HotStorage 2022, "bye block and hello byte"? (Check the exact name later)

Peter

Implement

Latency test: host -> DPU 187 us, DPU-NFS 133 us.

Git repo: micro-arch benchmark.

Search for PCI latency.

Design Guidelines for High Performance RDMA Systems ATC16

Krijn

Test the lower framework: how many writes are happening.

Something is wrong.

Without buffering.

Maybe inefficient buffering.

Setup of BPF and SPDK.

Zebin

Related work

Nov 18th, 2022

Become I came

Wiebe and Niels presentation before Christmas, on December 7th, 10:00-12:00.

  • Animesh TODO: reserve and announce the slot during the meeting.
  • Wiebe: deadline for the first draft is the 29th of November; references and story are going well.
  • Niels: at ~65% progress.
  • Nicola: BSc thesis comparing a B+-tree on an FS vs direct-to-device. Stuck on the writing, hard to give.
  • Animesh proposed the idea of putting virtio-fs over the wire, implementing the protocol on a server, and doing the mapping to an FS on that server. Peter-Jan is sceptical in terms of development time and argumentation; he will work it out further and discuss with Jonas.
  • Krijn and Nick installed various NVMe and ZNS devices in the datacenter; works fine.
  • Nick has resolved his performance problem when increasing the number of streams: the bottleneck was the I/O manager that was contended for by all streams. There are now multiple I/O managers, removing the need for striping too. The flushing of inodes is still problematic for performance, but not for correctness.
  • Animesh wants to know if single-file performance over 9 streams is the same as 9 files on 9 streams: for next week. Nick will build a benchmark for the worst-case performance hit: for in two weeks.

Nick

Different files for different streams.

Problem: have to flush all the inodes when flushing a file.

Implement a bitmap for the inodes and see if it is worth it.

Zebin

Goal

  1. perf tutorial
  2. split of work among the storage stack, hopefully
  3. storage diagram

Krijn

Each flush went from 7 sec to 5, still slow. Because of memory merging?

Profile it and see what happens. Where is the time spent? The serialization is inefficient.

Is the problem compression? Where is the bottleneck, CPU or I/O?

Flush is a memory barrier, no reordering. Does the flush prevent the device from doing I/O in parallel? Does flush force the device to flush all of its cache?
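A hedged way to check that last question from the host side, i.e. whether the device even advertises a volatile write cache that a flush would have to drain (device names are placeholders):

    # how the block layer treats the device's cache ("write back" vs "write through")
    cat /sys/block/nvme0n1/queue/write_cache
    # query the NVMe Volatile Write Cache feature directly (feature id 0x06)
    sudo nvme get-feature /dev/nvme0 -f 0x06 -H

If no volatile write cache is present, a flush cannot be forcing a cache drain, and the cost must come from the barrier/ordering semantics instead.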