Persist AOF file by io_uring #750

Wenwen-Chen · 2024-07-05T07:23:15Z

Description
Persisting write commands to the AOF file is how Valkey ensures high reliability. When a user turns on AOF and sets appendfsync always, the speed of writing data to disk is critical to Valkey, because the write operation is synchronous and the Valkey server does not respond to other client requests until it completes.
io_uring is a powerful asynchronous I/O API for Linux. This patch optimizes Valkey's performance by replacing the traditional write interface with io_uring when persisting the AOF file to disk.
We tested the performance with the valkey-benchmark tool. The patch improves performance by 29.24% in the single-threaded benchmark.
Baseline: 48,847.20 QPS -> Optimized: 63,130.57 QPS

Test Environment
OPERATING SYSTEM: Ubuntu
Kernel: 6.5.0
DISK: SATA SSD
PROCESSOR: Intel(R) Xeon(R) Gold 6152 CPU (total 88 Threads, 2 Sockets, 22 Cores per socket, 2 Threads per Core)
NUMA info of the processor
NUMA node(s): 2
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,...,86
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,...,87
Base: #741
The server and valkey-benchmark run on the same socket.

Server config
port 9876
bind 127.0.0.1
appendonly yes
appendfsync always
no-appendfsync-on-rewrite no
aof-use-rdb-preamble no
daemonize no
protected-mode no
databases 16
latency-monitor-threshold 1
repl-diskless-sync-delay 0
save
io-uring-enabled yes

Test steps

Start the server: taskset -c 12,14,16,18 src/valkey-server valkey.conf
Start the benchmark with a single thread: taskset -c 20,22,24,26 src/valkey-benchmark -p 9876 -t set -d 100 -r 1000000 -n 5000000 -q
Start the benchmark with multiple threads: taskset -c 20,22,24,26 src/valkey-benchmark -p 9876 -t set -d 100 -r 1000000 -n 5000000 -q --threads 4
For both the single-threaded and multi-threaded cases, I ran each test 3 times. The average performance is summarized in the following table:

Mode	Baseline (QPS)	Optimized (QPS)	Performance Improvement
Single Thread	48847.2	63130.57	29.24%
Multiple Threads	59992.36	72723.67	21.22%
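
For anyone re-deriving the improvement column: it appears to be computed as optimized / baseline - 1, expressed as a percentage. A minimal sketch in C follows; the function name pct_improvement is only illustrative and is not part of the patch.

```c
#include <assert.h>

/* Relative change of `optimized` versus `baseline`, in percent.
 * A positive result means the optimized run achieved higher throughput.
 * Illustrative helper only; not taken from the Valkey source. */
double pct_improvement(double baseline, double optimized) {
    assert(baseline > 0.0); /* a zero baseline would make the ratio undefined */
    return (optimized / baseline - 1.0) * 100.0;
}
```

For example, pct_improvement(48847.20, 63130.57) comes out at about 29.24 and pct_improvement(59992.36, 72723.67) at about 21.22, which matches the two rows above.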

Signed-off-by: Wenwen Chen <wenwen.chen@samsung.com>

codecov · 2024-07-05T07:35:08Z

Codecov Report

Attention: Patch coverage is 19.04762% with 17 lines in your changes missing coverage. Please review.

Project coverage is 70.34%. Comparing base (b728e41) to head (c87f7de).

Files	Patch %	Lines
src/io_uring.c	0.00%	11 Missing ⚠️
src/server.c	20.00%	4 Missing ⚠️
src/aof.c	60.00%	2 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable     #750      +/-   ##
============================================
- Coverage     70.40%   70.34%   -0.06%     
============================================
  Files           112      113       +1     
  Lines         61467    61487      +20     
============================================
- Hits          43275    43253      -22     
- Misses        18192    18234      +42

Files	Coverage Δ
src/config.c	`78.69% <ø> (ø)`
src/server.h	`100.00% <ø> (ø)`
src/aof.c	`79.97% <60.00%> (-0.17%)`	⬇️
src/server.c	`88.45% <20.00%> (-0.11%)`	⬇️
src/io_uring.c	`0.00% <0.00%> (ø)`

... and 10 files with indirect coverage changes

zuiderkwast

Hello. Are you working with @lipzhu? If we do the write with io_uring, we could also do the fsync in the same ring without an extra syscall?

29% improved throughput is impressive. I wonder how this can be achieved, because we still wait for the write and then do fsync before we process the next command. I guess it is just doing fewer syscalls?

Without io_uring we do write in a while loop. I wonder if the same improved performance could be achieved with writev instead of the loop. Have you tried that?

Wenwen-Chen · 2024-07-09T08:32:29Z

> Hello. Are you working with @lipzhu? If we do the write with io_uring, we could also do the fsync in the same ring without an extra syscall?

No, I am not working with @lipzhu, but I have been following #599 for a long time.
In my opinion, AOF write and fsync can share the same io_uring instance through time-sharing multiplexing.

> 29% improved throughput is impressive. I wonder how this can be achieved, because we still wait for the write and then do fsync before we process the next command. I guess it is just doing fewer syscalls?
> Without io_uring we do write in a while loop. I wonder if the same improved performance could be achieved with writev instead of the loop. Have you tried that?

Yes, the performance improvement comes from io_uring making fewer syscalls.
I ran an extra experiment for the scenario without io_uring. To count how many times write is called in each aofWrite call, I added some logging to aofWrite and ran Valkey with the same test case. I found that each aofWrite call invokes write only once. Therefore, I did not replace write with writev.
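
For reference, the writev alternative raised in the question gathers several buffers into a single syscall instead of looping over write. The sketch below is only an illustration of that idea under stated assumptions: write_all_iov is a hypothetical helper, it is not part of Valkey or of this patch, and it assumes iovcnt does not exceed IOV_MAX.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper, not part of Valkey: write every byte described by
 * iov[0..iovcnt) to fd with as few writev() calls as possible.
 * Returns the total number of bytes written, or -1 on error (errno is set).
 * The iov array is modified in place while handling short writes.
 * Assumes iovcnt <= IOV_MAX. */
ssize_t write_all_iov(int fd, struct iovec *iov, int iovcnt) {
    ssize_t total = 0;
    assert(iov != NULL && iovcnt >= 0);

    while (iovcnt > 0) {
        /* Skip empty buffers so that a 0 return below always means "stuck". */
        if (iov->iov_len == 0) {
            iov++;
            iovcnt--;
            continue;
        }
        ssize_t n = writev(fd, iov, iovcnt);
        if (n < 0) {
            if (errno == EINTR) continue; /* interrupted: retry */
            return -1;
        }
        if (n == 0) {
            errno = EIO; /* no progress on non-empty data */
            return -1;
        }
        total += n;

        /* Drop the buffers that were written completely. */
        size_t left = (size_t)n;
        while (iovcnt > 0 && left >= iov->iov_len) {
            left -= iov->iov_len;
            iov++;
            iovcnt--;
        }
        /* Advance into a partially written buffer, if any. */
        if (iovcnt > 0 && left > 0) {
            iov->iov_base = (char *)iov->iov_base + left;
            iov->iov_len -= left;
        }
    }
    return total;
}
```

If each aofWrite already issues a single write, as measured above, a helper like this would not be expected to reduce the syscall count further.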

lipzhu · 2024-07-10T00:33:37Z

> 29% improved throughput is impressive. I wonder how this can be achieved, because we still wait for the write and then do fsync before we process the next command. I guess it is just doing fewer syscalls?

> Without io_uring we do write in a while loop. I wonder if the same improved performance could be achieved with writev instead of the loop. Have you tried that?

Echoing @zuiderkwast, I am also curious why io_uring would boost performance in this kind of case.
@Wenwen-Chen Do you mind taking a look at the before/after CPU utilization and IPC? To keep things simpler, let's also disable the rewrite process when AOF is enabled.

Wenwen-Chen · 2024-07-10T05:28:56Z

> @Wenwen-Chen Do you mind taking a look at the before/after CPU utilization, IPC.

OK, I will run these tests ASAP.

> To be simpler, let's also disable the rewrite process when enable AOF?

I am sorry, I don't know how to disable the rewrite process.
Should I set a config item or change the source code?

lipzhu · 2024-07-10T07:18:33Z

> > To be simpler, let's also disable the rewrite process when enable AOF?

> I am sorry, I don't know how to disable the rewrite process. Should I set a config item or change the source code?

Through the config: auto-aof-rewrite-min-size 64gb.

Wenwen-Chen · 2024-07-10T09:28:29Z

> Through config auto-aof-rewrite-min-size 64gb.

Thank you very much.

I did some extra experiments.

Persisting the AOF file with io_uring adds a bit of CPU overhead compared with the traditional write system call.
Why does performance improve when using io_uring? It is related to the rewrite feature, but I don't know the root cause yet. The detailed test results are shown below.

1. Performance comparison
I compared the performance with the rewrite feature enabled and disabled.
Test command: taskset -c 20,22,24,26 src/valkey-benchmark -p 5432 -t set -d 100 -r 1000000 -n 5000000 -q --threads 4
The results show that enabling io_uring together with rewrite gives the best performance.

Rewrite	Baseline (write syscall, QPS)	Optimized (io_uring, QPS)	Performance Improvement
Disable	61722.51	60336.46	-2.25%
Enable	59576.85	72835.51	22.25%

2. CPU utilization comparison
perf stat -p 'pid of valkey-server' sleep 10

Disable Rewrite

Index	Baseline (write syscall)	Optimized (io_uring)	Optimized / Baseline - 1
Cycles	21,496,813,799	22,242,600,805	3.47%
Instructions	21,470,082,059	24,364,695,157	13.48%
Insn Per Cycle	1	1.1	-17.29%
CPU utilized	0.653	0.683	4.59%

Enable Rewrite

Index	Baseline (write syscall)	Optimized (io_uring)	Optimized / Baseline - 1
Cycles	24,055,924,761	27,149,142,327	12.86%
Instructions	23,769,267,973	30,960,818,308	30.26%
Insn Per Cycle	0.99	1.14	15.42%
CPU utilized	0.732	0.859	17.35%
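
The "Insn Per Cycle" rows can be cross-checked from the raw counters, since IPC is instructions divided by cycles. Below is a minimal sketch in C under that assumption; ipc_change_pct is an illustrative name, not something perf or Valkey provides.

```c
#include <assert.h>

/* Relative change in instructions-per-cycle between two perf-stat runs,
 * in percent ("Optimized / Baseline - 1").
 * Illustrative helper only; the inputs are the raw Instructions and Cycles
 * counters of the baseline run and of the optimized run. */
double ipc_change_pct(double base_insn, double base_cycles,
                      double opt_insn, double opt_cycles) {
    assert(base_insn > 0.0 && base_cycles > 0.0 && opt_cycles > 0.0);
    double base_ipc = base_insn / base_cycles; /* baseline IPC */
    double opt_ipc = opt_insn / opt_cycles;    /* optimized IPC */
    return (opt_ipc / base_ipc - 1.0) * 100.0;
}
```

With the rewrite-enabled counters this gives roughly 15.4%, in line with the table. With the rewrite-disabled counters it appears to give roughly +9.7% rather than -17.29%, so that entry may deserve a second look.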

zuiderkwast · 2024-07-10T11:40:45Z

With io_uring, the kernel can use kernel threads? Maybe that's why it's faster but uses more CPU?

Are these cycles and instructions numbers for the full benchmark or for a fixed duration like one second?

With higher throughput, we can handle more traffic. It's OK to use more CPU for more traffic, but for the same traffic I hope we don't use very much more CPU.

Wenwen-Chen · 2024-07-11T07:17:11Z

> With io_uring, the kernel can use kernel threads?

It is determined by the I/O traffic.
If the application's I/O traffic is low, the kernel will not use kernel threads. Otherwise, the kernel obtains I/O worker threads (kernel threads) from io_uring's worker pool to process the I/O.

> Maybe that's why it's faster but uses more CPU?

Yes.
However, why does enabling io_uring together with rewrite give the best performance? I don't know the root cause.
I will dig deeper into Valkey's rewrite feature to find out.
I would be very grateful if someone could suggest a way to track this down.

> Are these cycles and instructions numbers for the full benchmark or for a fixed duration like one second?

For a fixed duration (10 s).
They were measured with the command perf stat -p 'pid of valkey-server' sleep 10

Wenwen-Chen · 2024-07-19T09:24:00Z

> I am also curious why io_uring could help perf boost on such kind of case. @Wenwen-Chen Do you mind taking a look at the before/after CPU utilization, IPC. To be simpler, let's also disable the rewrite process when enable AOF?

@lipzhu
I have posted the test results. Do you have any comments on them?
I am wondering why enabling io_uring together with rewrite gives the best performance.
According to my understanding, the rewrite feature forks a child process which stores the latest KV data to disk.
Is there anything special about the rewrite feature that correlates strongly with io_uring?

lipzhu · 2024-07-31T01:31:24Z

Hi @Wenwen-Chen, sorry for the late response. Per my understanding, io_uring should not benefit this kind of case, and your result with rewrite disabled has proven that.

> I am wondering why enable io_uring + enable rewrite get best performance.

This is also my question. Can you help analyze why enabling rewrite helps performance? What is the cost? Or could you try assigning just a single CPU to the server?

Wenwen-Chen · 2024-08-09T08:17:20Z

@lipzhu

> This is also my question, can you help to do some analysis of why enable rewrite could help on performance. what is the cost, or just assign a single CPU to the server?

The function flushAppendOnlyFile() persists aof_buf to disk using either the write system call or aofWriteByIOUring (the io_uring wrapper added in this patch). I analyzed the execution time of flushAppendOnlyFile() under 3 scenarios using the uftrace tool: https://github.com/namhyung/uftrace

Optimized: enable io_uring + enable rewrite,
Baseline 1: disable io_uring + enable rewrite,
Baseline 2: enable io_uring + disable rewrite.

The optimized scenario gets the best performance because it spends the shortest time in flushAppendOnlyFile().
The execution time of flushAppendOnlyFile() mainly comes from fdatasync(). The optimized scenario reduces the time spent in fdatasync() by 50.3% and 44.5% compared with Baseline 1 and Baseline 2, respectively.

Type	io-uring-enabled	Rewrite enabled	Performance (QPS)	Time of flushAppendOnlyFile (s)	Time of fdatasync (s)
Optimized	Yes	Yes	38,963.27	15.113	10.775
Baseline 1	No	Yes	33,057.41	22.759	21.66
Baseline 2	Yes	No	32,701.11	25.82	19.43
		Optimized vs Baseline 1	17.9%	-33.6%	-50.3%
		Optimized vs Baseline 2	19.1%	-41.5%	-44.5%

Test steps

Start the server: uftrace record -F flushAppendOnlyFile -F fdatasync src/valkey-server valkey.conf
Start valkey-benchmark: taskset -c 20,22,24,26 src/valkey-benchmark -p 5432 -t set -d 100 -r 1000000 -n 2500000 -q --threads 4
Analyze the execution time: uftrace graph flushAppendOnlyFile
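
To make the write/fdatasync split measured above easier to picture, here is a small user-space sketch that times the two phases of an "appendfsync always" style flush separately. It is an assumption-laden stand-in, not Valkey code: open_append and timed_flush are hypothetical names, and the real flushAppendOnlyFile() does considerably more.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <time.h>
#include <unistd.h>

/* Open (or create) a file for appending, similar in spirit to an AOF file. */
int open_append(const char *path) {
    return open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
}

/* Seconds elapsed between two CLOCK_MONOTONIC timestamps. */
static double elapsed_s(const struct timespec *a, const struct timespec *b) {
    return (double)(b->tv_sec - a->tv_sec) +
           (double)(b->tv_nsec - a->tv_nsec) / 1e9;
}

/* Hypothetical stand-in for one synchronous flush: write the buffer, then
 * force it to stable storage with fdatasync(). The time spent in each phase
 * is added to *write_s and *sync_s so callers can accumulate over many calls.
 * Returns 0 on success, -1 on error (errno is set by the failing call). */
int timed_flush(int fd, const char *buf, size_t len,
                double *write_s, double *sync_s) {
    struct timespec t0, t1, t2;
    assert(buf != NULL && write_s != NULL && sync_s != NULL);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR) continue; /* interrupted: retry */
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    if (fdatasync(fd) != 0) return -1;
    clock_gettime(CLOCK_MONOTONIC, &t2);

    *write_s += elapsed_s(&t0, &t1);
    *sync_s += elapsed_s(&t1, &t2);
    return 0;
}
```

Calling timed_flush in a loop and comparing the two accumulators gives a rough picture of how much of each flush is spent waiting on fdatasync; it does not reproduce the io_uring path or the interaction with the rewrite child process.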

Wenwen-Chen · 2024-08-20T02:36:20Z

> Hi @Wenwen-Chen , sorry for the late response, per my understanding, io_uring should not have benefit for such kind of case, your result of disable Rewrite has proven that.

io_uring doesn't bring a performance improvement in the 'disable rewrite' scenario compared with the write syscall (io_uring: 60336.46 vs write: 61722.51, #750 (comment)).
However, we should focus on the 'enable rewrite' scenario instead of 'disable rewrite'. Users usually enable the rewrite feature in production when they turn on AOF!
io_uring improves performance significantly in the 'enable rewrite' scenario (io_uring: 72835.51 vs write: 59576.85, #750 (comment)). According to #750 (comment), Valkey gets the best performance because io_uring reduces the execution time of flushAppendOnlyFile.

egbaydarov · 2024-10-15T13:13:44Z

@Wenwen-Chen

I'm completely mediocre on the io_uring internals and fsync as well. But did you try setting affinity for background processes and AOF rewrite while doing your benchmark? Will it get the same boost with correctly configured affinity (I mean different physical cores for the main and background threads, not virtual ones like HT or SMT)?

https://github.com/valkey-io/valkey/blob/e30ae762a8ec7f531005fab90edd275dfa98f72f/valkey.conf#L2374C1-L2388C25

# server-cpulist 0-7:2
# bio-cpulist 1,3
# aof-rewrite-cpulist 8-11
# bgsave-cpulist 1,10-11

xbasel · 2024-12-17T13:11:53Z

@Wenwen-Chen do you plan to work on this?

Wenwen-Chen · 2024-12-25T13:42:16Z

> @Wenwen-Chen do you plan to work on this?

Hi @xbasel, I really want to move this patch forward. However, I am not a Valkey expert. I have not found the root cause of why enabling io_uring together with rewrite reduces the time spent in fdatasync. Do you have any suggestions?

Wenwen-Chen · 2024-12-26T01:30:21Z

> @Wenwen-Chen

> I'm completely mediocre on the io_uring internals and fsync as well. But did you try setting affinity for background processes and AOF rewrite while doing your benchmark? Will it get the same boost with correctly configured affinity (I mean different physical cores for the main and background threads, not virtual ones like HT or SMT)?

Hi @egbaydarov, thank you very much for your suggestion.
It got the same boost with correctly configured affinity.

Persist AOF file by io_uring

6ec4e08

Signed-off-by: Wenwen Chen <wenwen.chen@samsung.com>

fix clang-format

6c13708

Signed-off-by: Wenwen Chen <wenwen.chen@samsung.com>

Wenwen-Chen force-pushed the aof_io_uring branch from f95c131 to 6c13708 Compare July 5, 2024 07:59

fix minor compiling warning and spelling error

0c6ed8c

Signed-off-by: Wenwen Chen <wenwen.chen@samsung.com>

zuiderkwast reviewed Jul 9, 2024

Wenwen-Chen force-pushed the aof_io_uring branch 3 times, most recently from c26c4f2 to 0c6ed8c Compare August 1, 2024 08:30

Wenwen-Chen added 2 commits August 1, 2024 16:34

Reap io_uring completion one by one

8cc8454

Signed-off-by: Wenwen Chen <wenwen.chen@samsung.com>

Sync to upstream

c87f7de

Signed-off-by: Wenwen Chen <wenwen.chen@samsung.com>
