Persist AOF file by io_uring #750

Open
wants to merge 5 commits into
base: unstable

Wenwen-Chen
Contributor

@Wenwen-Chen Wenwen-Chen commented Jul 5, 2024

Description
Persisting write commands to the AOF file is how Valkey ensures high reliability. When a user turns on AOF and sets appendfsync always, the speed of writing data to disk is critical to Valkey, because the write operation is synchronous and the Valkey server does not respond to other client requests while it is in progress.
io_uring is a powerful asynchronous I/O API for Linux. This patch optimizes Valkey's performance by replacing the traditional write interface with io_uring when persisting the AOF file to disk.
We tested the performance with the valkey-benchmark tool. The patch improves performance by 29.24%.
Baseline: 48,847.20 QPS -> Optimized: 63,130.57 QPS

Test Environment
OPERATING SYSTEM: Ubuntu
Kernel: 6.5.0
DISK: SATA SSD
PROCESSOR: Intel(R) Xeon(R) Gold 6152 CPU (total 88 Threads, 2 Sockets, 22 Cores per socket, 2 Threads per Core)
NUMA info of the processor
NUMA node(s): 2
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,...,86
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,...,87
Base: #741
Server and valkey-benchmark run on the same socket.

Server config
port 9876
bind 127.0.0.1
appendonly yes
appendfsync always
no-appendfsync-on-rewrite no
aof-use-rdb-preamble no
daemonize no
protected-mode no
databases 16
latency-monitor-threshold 1
repl-diskless-sync-delay 0
save
io-uring-enabled yes

Test step

  1. Start the server with taskset -c 12,14,16,18 src/valkey-server valkey.conf
  2. Start the benchmark using a single thread: taskset -c 20,22,24,26 src/valkey-benchmark -p 9876 -t set -d 100 -r 1000000 -n 5000000 -q
  3. Start the benchmark using multiple threads: taskset -c 20,22,24,26 src/valkey-benchmark -p 9876 -t set -d 100 -r 1000000 -n 5000000 -q --threads 4
    For both the single-thread and the multi-thread case, I ran each test 3 times. The average results are summarized in the following table:
| Mode | Baseline (QPS) | Optimized (QPS) | Performance Improvement |
|---|---|---|---|
| Single Thread | 48847.2 | 63130.57 | 29.24% |
| Multiple Threads | 59992.36 | 72723.67 | 21.22% |
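
The improvement column can be re-derived from the averaged QPS figures in the table. The short Python check below only does that arithmetic; the function name `improvement_pct` is made up for illustration and is not part of the patch.

```python
def improvement_pct(baseline_qps: float, optimized_qps: float) -> float:
    """Relative throughput gain of optimized over baseline, in percent."""
    return (optimized_qps / baseline_qps - 1.0) * 100.0


# Averaged QPS values copied from the table above: (baseline, optimized).
results = {
    "Single Thread": (48847.20, 63130.57),
    "Multiple Threads": (59992.36, 72723.67),
}

for mode, (baseline, optimized) in results.items():
    print(f"{mode}: {improvement_pct(baseline, optimized):.2f}%")
```

Both printed values should agree with the percentages reported in the table when rounded to two decimals.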

Signed-off-by: Wenwen Chen <wenwen.chen@samsung.com>

codecov bot commented Jul 5, 2024

Codecov Report

Attention: Patch coverage is 19.04762% with 17 lines in your changes missing coverage. Please review.

Project coverage is 70.34%. Comparing base (b728e41) to head (c87f7de).

| Files | Patch % | Lines |
|---|---|---|
| src/io_uring.c | 0.00% | 11 Missing ⚠️ |
| src/server.c | 20.00% | 4 Missing ⚠️ |
| src/aof.c | 60.00% | 2 Missing ⚠️ |

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable     #750      +/-   ##
============================================
- Coverage     70.40%   70.34%   -0.06%     
============================================
  Files           112      113       +1     
  Lines         61467    61487      +20     
============================================
- Hits          43275    43253      -22     
- Misses        18192    18234      +42     

| Files | Coverage Δ |
|---|---|
| src/config.c | 78.69% <ø> (ø) |
| src/server.h | 100.00% <ø> (ø) |
| src/aof.c | 79.97% <60.00%> (-0.17%) ⬇️ |
| src/server.c | 88.45% <20.00%> (-0.11%) ⬇️ |
| src/io_uring.c | 0.00% <0.00%> (ø) |

... and 10 files with indirect coverage changes

Signed-off-by: Wenwen Chen <wenwen.chen@samsung.com>
Signed-off-by: Wenwen Chen <wenwen.chen@samsung.com>
Contributor

@zuiderkwast zuiderkwast left a comment

Hello. Are you working with @lipzhu? If we do the write with io_uring, we could also do the fsync in the same ring without an extra syscall?

29% improved throughput is impressive. I wonder how this can be achieved, because we still wait for the write and then do fsync before we process the next command. I guess it is just doing less syscalls?

Without io_uring we do write in a while loop. I wonder if the same improved performance could be achieved with writev instead of the loop. Have you tried that?
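
For readers unfamiliar with the two approaches being compared here, the following is a minimal Python sketch (not Valkey's C code) of writing one buffer with a retry loop around write() versus submitting several buffers with a single writev(). The helper names `write_all`, `writev_all` and `demo` are hypothetical, the RESP-encoded SET command is only sample data, and `os.writev` / `os.fdatasync` assume a Linux host.

```python
import os
import tempfile


def write_all(fd: int, data: bytes) -> int:
    """Write one buffer with a loop of write() calls; return the number of calls."""
    calls = 0
    view = memoryview(data)
    while view:
        written = os.write(fd, view)  # may write fewer bytes than requested
        view = view[written:]
        calls += 1
    return calls


def writev_all(fd: int, chunks: list) -> int:
    """Write several buffers with writev(), retrying on short writes; return the number of calls."""
    calls = 0
    pending = [memoryview(c) for c in chunks if c]
    while pending:
        written = os.writev(fd, pending)
        calls += 1
        # Drop buffers that were written completely, then trim a partially written one.
        while pending and written >= len(pending[0]):
            written -= len(pending[0])
            pending.pop(0)
        if pending and written:
            pending[0] = pending[0][written:]
    return calls


def demo():
    """Write four sample commands both ways and report (loop calls, writev calls, file size)."""
    commands = [b"*3\r\n$3\r\nSET\r\n$1\r\nk\r\n$1\r\nv\r\n"] * 4
    with tempfile.TemporaryFile() as f:
        fd = f.fileno()
        loop_calls = sum(write_all(fd, c) for c in commands)
        vec_calls = writev_all(fd, commands)
        os.fdatasync(fd)  # flush file data to disk, as appendfsync always would require
        size = os.fstat(fd).st_size
    return loop_calls, vec_calls, size


if __name__ == "__main__":
    print(demo())
```

writev() only saves syscalls when there are several separate buffers to submit. If the data is already one contiguous buffer and the loop finishes after a single write(), there is nothing left to batch.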

@Wenwen-Chen
Contributor Author

Wenwen-Chen commented Jul 9, 2024

Hello. Are you working with @lipzhu? If we do the write with io_uring, we could also do the fsync in the same ring without an extra syscall?

No, I am not working with @lipzhu, but I have been following #599 for a long time.
In my opinion, the AOF write and the fsync can share the same io_uring instance by time-multiplexing it.

29% improved throughput is impressive. I wonder how this can be achieved, because we still wait for the write and then do fsync before we process the next command. I guess it is just doing less syscalls?
Without io_uring we do write in a while loop. I wonder if the same improved performance could be achieved with writev instead of the loop. Have you tried that?

Yes, the performance improvement comes from io_uring making fewer syscalls.
I ran an extra experiment for the scenario without io_uring. To count how many times 'write' is called in each aofWrite call, I added some logging to aofWrite and ran Valkey with the same test case. I found that each aofWrite calls 'write' only once. Therefore, I didn't replace write with writev.

@lipzhu
Contributor

lipzhu commented Jul 10, 2024

29% improved throughput is impressive. I wonder how this can be achieved, because we still wait for the write and then do fsync before we process the next command. I guess it is just doing less syscalls?

Without io_uring we do write in a while loop. I wonder if the same improved performance could be achieved with writev instead of the loop. Have you tried that?

Echo @zuiderkwast , I am also curious why io_uring could help perf boost on such kind of case.
@Wenwen-Chen Do you mind taking a look at the before/after CPU utilization, IPC. To be simpler, let's also disable the rewrite process when enable AOF?

@Wenwen-Chen
Contributor Author

@Wenwen-Chen Do you mind taking a look at the before/after CPU utilization, IPC.

OK, I will do these tests ASAP.

To be simpler, let's also disable the rewrite process when enable AOF?

I am sorry, I don't know how to disable the rewrite process.
Should I set a config item, or change the source code?

@lipzhu
Contributor

lipzhu commented Jul 10, 2024

To be simpler, let's also disable the rewrite process when enable AOF?

I am sorry, I don't know how to disable the rewrite process. Should I set a config item, or change the source code?

Through config auto-aof-rewrite-min-size 64gb.
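
As a rough illustration of why raising auto-aof-rewrite-min-size suppresses automatic rewrites: the server only considers an automatic rewrite once the AOF is larger than that minimum and has also grown by auto-aof-rewrite-percentage relative to its size after the last rewrite. The Python sketch below models that check. It is a simplification based on the documented behaviour of these two settings, not a copy of Valkey's code, and the function and parameter names are hypothetical.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def should_auto_rewrite(current_size: int, base_size: int,
                        min_size: int, percentage: int) -> bool:
    """Simplified model of the automatic AOF rewrite trigger."""
    if percentage == 0:
        # A percentage of 0 disables automatic rewrites altogether.
        return False
    if current_size <= min_size:
        # The file is still at or below auto-aof-rewrite-min-size.
        return False
    base = base_size if base_size else 1
    growth = current_size * 100 // base - 100  # growth since the last rewrite, in percent
    return growth >= percentage


# A 200 MB AOF that was 64 MB after its last rewrite:
print(should_auto_rewrite(200 * MB, 64 * MB, min_size=64 * MB, percentage=100))
print(should_auto_rewrite(200 * MB, 64 * MB, min_size=64 * GB, percentage=100))
```

With a 64 MB minimum the first call reports that a rewrite would be triggered. With the 64gb minimum suggested above, the same file stays far below the threshold, so no automatic rewrite starts during the benchmark.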

@Wenwen-Chen
Contributor Author

Wenwen-Chen commented Jul 10, 2024

Through config auto-aof-rewrite-min-size 64gb.

Thank you very much.

I did some extra experiments.

  • Persisting the AOF file with io_uring brings a small CPU overhead compared with the traditional write system call.
  • Why does performance improve when using io_uring? It is related to the rewrite feature, but I don't know the root cause. The detailed test results are shown below.

1. Performance comparison
I compared the performance with the rewrite feature enabled and disabled.
Test command: taskset -c 20,22,24,26 src/valkey-benchmark -p 5432 -t set -d 100 -r 1000000 -n 5000000 -q --threads 4
The results show that enabling io_uring together with rewrite gives the best performance.

| Rewrite | Baseline (write syscall), QPS | Optimized (io_uring), QPS | Performance Improvement |
|---|---|---|---|
| Disable | 61722.51 | 60336.46 | -2.25% |
| Enable | 59576.85 | 72835.51 | 22.25% |

2. CPU utilization comparison
perf stat -p 'pid of valkey-server' sleep 10

  • Disable Rewrite

| Index | Baseline (write syscall) | Optimized (io_uring) | Optimized / Baseline - 1 |
|---|---|---|---|
| Cycles | 21,496,813,799 | 22,242,600,805 | 3.47% |
| Instructions | 21,470,082,059 | 24,364,695,157 | 13.48% |
| Insn Per Cycle | 1 | 1.1 | -17.29% |
| CPU utilized | 0.653 | 0.683 | 4.59% |

  • Enable Rewrite

| Index | Baseline (write syscall) | Optimized (io_uring) | Optimized / Baseline - 1 |
|---|---|---|---|
| Cycles | 24,055,924,761 | 27,149,142,327 | 12.86% |
| Instructions | 23,769,267,973 | 30,960,818,308 | 30.26% |
| Insn Per Cycle | 0.99 | 1.14 | 15.42% |
| CPU utilized | 0.732 | 0.859 | 17.35% |
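
The instructions-per-cycle rows can be cross-checked against the raw cycle and instruction counters in the same tables. The Python sketch below recomputes IPC and its relative change; the names `samples`, `ipc` and `ipc_change_pct` are illustrative only. Recomputed this way, the Enable Rewrite case comes out at roughly +15.4%, in line with the reported value, while the Disable Rewrite case comes out at roughly +9.7%, which does not match the reported -17.29%, so that one cell may be a transcription error.

```python
# perf stat counters copied from the tables above: (cycles, instructions).
samples = {
    "Disable Rewrite": {
        "baseline": (21_496_813_799, 21_470_082_059),
        "io_uring": (22_242_600_805, 24_364_695_157),
    },
    "Enable Rewrite": {
        "baseline": (24_055_924_761, 23_769_267_973),
        "io_uring": (27_149_142_327, 30_960_818_308),
    },
}


def ipc(cycles: int, instructions: int) -> float:
    """Instructions retired per CPU cycle."""
    return instructions / cycles


def ipc_change_pct(case: dict) -> float:
    """Relative IPC change of io_uring over the write-syscall baseline, in percent."""
    base = ipc(*case["baseline"])
    opt = ipc(*case["io_uring"])
    return (opt / base - 1.0) * 100.0


for name, case in samples.items():
    print(f"{name}: IPC {ipc(*case['baseline']):.2f} -> {ipc(*case['io_uring']):.2f} "
          f"({ipc_change_pct(case):+.2f}%)")
```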

@zuiderkwast
Contributor

With io_uring, the kernel can use kernel threads? Maybe that's why it's faster but uses more CPU?

Are these cycles and instructions numbers for the full benchmark or for a fixed duration like one second?

With higher throughput, we can handle more traffic. It's OK to use more CPU for more traffic, but for the same traffic I hope we don't use very much more CPU.

@Wenwen-Chen
Contributor Author

Wenwen-Chen commented Jul 11, 2024

With io_uring, the kernel can use kernel threads?

It depends on the I/O traffic.
If the application's I/O traffic is low, the kernel does not use kernel threads. Otherwise, the kernel takes I/O worker threads (kernel threads) from io_uring's worker pool to process the I/O.

Maybe that's why it's faster but uses more CPU?

Yes.
However, why does enabling io_uring together with rewrite give the best performance? I don't know the root cause.
I will dig deeper into Valkey's rewrite feature to find it.
I would be very grateful if someone could suggest a way to investigate this.

Are these cycles and instructions numbers for the full benchmark or for a fixed duration like one second?

For a fixed duration (10 s).
They were measured with the command perf stat -p 'pid of valkey-server' sleep 10
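
Because the counters cover a fixed 10-second window while throughput differs between runs, one way to compare CPU cost for the same amount of traffic is to normalize instructions by the number of requests served in that window. The Python sketch below does this under a loud assumption: it pairs the perf stat instruction counts with the QPS figures from the earlier performance comparison table, even though the thread does not say they came from the same run. Treat the result as a rough estimate only. The names `runs` and `instructions_per_request` are hypothetical.

```python
WINDOW_S = 10  # duration of: perf stat -p <pid> sleep 10

# (instructions counted in the 10 s window, QPS from the performance comparison table).
# Assumption: both measurements describe comparable runs.
runs = {
    ("rewrite disabled", "write"): (21_470_082_059, 61722.51),
    ("rewrite disabled", "io_uring"): (24_364_695_157, 60336.46),
    ("rewrite enabled", "write"): (23_769_267_973, 59576.85),
    ("rewrite enabled", "io_uring"): (30_960_818_308, 72835.51),
}


def instructions_per_request(instructions: int, qps: float,
                             window_s: int = WINDOW_S) -> float:
    """Estimated instructions spent by the server per SET request."""
    return instructions / (qps * window_s)


for (rewrite, mode), (instructions, qps) in runs.items():
    per_req = instructions_per_request(instructions, qps)
    print(f"{rewrite:17s} {mode:9s} {per_req:10.0f} instructions/request")
```

If the io_uring rows show a noticeably higher per-request figure than the write rows, that would suggest the extra CPU is not explained by higher throughput alone.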

@Wenwen-Chen
Contributor Author

I am also curious why io_uring could help perf boost on such kind of case. @Wenwen-Chen Do you mind taking a look at the before/after CPU utilization, IPC. To be simpler, let's also disable the rewrite process when enable AOF?

@lipzhu
I have posted the test results. Do you have any comments on them?
I am wondering why enabling io_uring together with rewrite gives the best performance.
As I understand it, the rewrite feature forks a child process that stores the latest KV data to disk.
Is there anything special about the rewrite feature that correlates strongly with io_uring?

@lipzhu
Contributor

lipzhu commented Jul 31, 2024

Hi @Wenwen-Chen , sorry for the late response, per my understanding, io_uring should not have benefit for such kind of case, your result of disable Rewrite has proven that.

I am wondering why enable io_uring + enable rewrite get best performance.

This is also my question, can you help to do some analysis of why enable rewrite could help on performance. what is the cost, or just assign a single CPU to the server?

@Wenwen-Chen Wenwen-Chen force-pushed the aof_io_uring branch 3 times, most recently from c26c4f2 to 0c6ed8c, August 1, 2024 08:30
Signed-off-by: Wenwen Chen <wenwen.chen@samsung.com>
Signed-off-by: Wenwen Chen <wenwen.chen@samsung.com>
@Wenwen-Chen
Contributor Author

Wenwen-Chen commented Aug 9, 2024

@lipzhu

This is also my question, can you help to do some analysis of why enable rewrite could help on performance. what is the cost, or just assign a single CPU to the server?

  1. The function flushAppendOnlyFile() persists aof_buf to disk using either the write system call or aofWriteByIOUring (the io_uring wrapper added in this patch). I analyzed the execution time of flushAppendOnlyFile() under 3 scenarios using the uftrace tool, https://github.com/namhyung/uftrace
  • Optimized: enable io_uring + enable rewrite,
  • Baseline 1: disable io_uring + enable rewrite,
  • Baseline 2: enable io_uring + disable rewrite.
  2. The optimized scenario gets the best performance because it spends the least time in flushAppendOnlyFile().
    The execution time of flushAppendOnlyFile() mainly comes from fdatasync(). The optimized scenario spends 50.3% and 44.5% less time in fdatasync() than Baseline 1 and Baseline 2, respectively.

| Type | io-uring-enabled | Rewrite enabled | Performance (QPS) | Time of flushAppendOnlyFile (s) | Time of fdatasync (s) |
|---|---|---|---|---|---|
| Optimized | Yes | Yes | 38,963.27 | 15.113 | 10.775 |
| Baseline 1 | No | Yes | 33,057.41 | 22.759 | 21.66 |
| Baseline 2 | Yes | No | 32,701.11 | 25.82 | 19.43 |
| Optimized vs Baseline 1 | | | 17.9% | -33.6% | -50.3% |
| Optimized vs Baseline 2 | | | 19.1% | -41.5% | -44.5% |

  3. Test steps
  • Start the server: uftrace record -F flushAppendOnlyFile -F fdatasync src/valkey-server valkey.conf
  • Start valkey-benchmark: taskset -c 20,22,24,26 src/valkey-benchmark -p 5432 -t set -d 100 -r 1000000 -n 2500000 -q --threads 4
  • Analyze the execution time: uftrace graph flushAppendOnlyFile.
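
The "Optimized vs Baseline" rows follow directly from the three measured rows. The Python sketch below recomputes them as a sanity check; `scenarios`, `pct_change` and `compare` are illustrative names, and the numbers are copied from the uftrace results above.

```python
# (QPS, seconds in flushAppendOnlyFile, seconds in fdatasync), copied from the results above.
scenarios = {
    "Optimized": (38963.27, 15.113, 10.775),
    "Baseline 1": (33057.41, 22.759, 21.66),
    "Baseline 2": (32701.11, 25.82, 19.43),
}


def pct_change(new: float, old: float) -> float:
    """Relative change of new versus old, in percent."""
    return (new / old - 1.0) * 100.0


def compare(reference: str) -> tuple:
    """Percent change of the Optimized scenario versus a reference, per column."""
    optimized = scenarios["Optimized"]
    ref = scenarios[reference]
    return tuple(round(pct_change(o, r), 1) for o, r in zip(optimized, ref))


for name in ("Baseline 1", "Baseline 2"):
    qps, flush, fdatasync = compare(name)
    print(f"Optimized vs {name}: QPS {qps:+.1f}%, "
          f"flushAppendOnlyFile {flush:+.1f}%, fdatasync {fdatasync:+.1f}%")
```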

@Wenwen-Chen
Contributor Author

Hi @Wenwen-Chen , sorry for the late response, per my understanding, io_uring should not have benefit for such kind of case, your result of disable Rewrite has proven that.

io_uring doesn't bring a performance improvement in the 'disable Rewrite' scenario compared with the write syscall (io_uring: 60336.46 vs write: 61722.51, #750 (comment)).
However, we should focus on the 'enable Rewrite' scenario rather than 'disable Rewrite'. Users usually enable the rewrite feature in production when they turn on AOF!
io_uring improves performance significantly in the 'enable Rewrite' scenario (io_uring: 72835.51 vs write: 59576.85, #750 (comment)). According to #750 (comment), Valkey gets the best performance because io_uring reduces the execution time of flushAppendOnlyFile.

@egbaydarov

egbaydarov commented Oct 15, 2024

@Wenwen-Chen

I'm completely mediocre on the io_uring internals and fsync as well. But did you try setting affinity for background processes and AOF rewrite while doing your benchmark? Will it get the same boost with correctly configured affinity (I mean different physical cores for the main and background threads, not virtual ones like HT or SMT)?

https://github.com/valkey-io/valkey/blob/e30ae762a8ec7f531005fab90edd275dfa98f72f/valkey.conf#L2374C1-L2388C25

# server-cpulist 0-7:2
# bio-cpulist 1,3
# aof-rewrite-cpulist 8-11
# bgsave-cpulist 1,10-11
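
To make the list syntax in these settings concrete, here is a small Python sketch that expands a cpulist string into CPU numbers. It assumes the syntax described in the comments of valkey.conf: comma-separated entries, each either a single CPU, a range `a-b`, or a range with a stride `a-b:s`. `parse_cpulist` is a hypothetical helper for illustration, not Valkey's parser.

```python
def parse_cpulist(spec: str) -> list:
    """Expand a cpulist such as '0-7:2' or '1,10-11' into a list of CPU numbers."""
    cpus = []
    for item in spec.split(","):
        cpu_range, _, stride = item.strip().partition(":")
        start, _, end = cpu_range.partition("-")
        first = int(start)
        last = int(end) if end else first  # a single CPU is a range of length one
        step = int(stride) if stride else 1
        cpus.extend(range(first, last + 1, step))
    return cpus


for spec in ("0-7:2", "1,3", "8-11", "1,10-11"):
    print(f"{spec:8s} -> {parse_cpulist(spec)}")
```

Under this reading, the four commented examples place the server threads, the bio threads, the AOF rewrite child and the bgsave child on different CPU sets, which is the kind of separation the question above is asking about.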

@xbasel
Member

xbasel commented Dec 17, 2024

@Wenwen-Chen do you plan to work on this?

@Wenwen-Chen
Contributor Author

@Wenwen-Chen do you plan to work on this?

Hi @xbasel, I really want to move this patch forward. However, I am not a Valkey expert, and I have not found the root cause of why enabling io_uring together with rewrite reduces the time spent in fdatasync. Do you have any suggestions?

@Wenwen-Chen
Contributor Author

@Wenwen-Chen

I'm completely mediocre on the io_uring internals and fsync as well. But did you try setting affinity for background processes and AOF rewrite while doing your benchmark? Will it get the same boost with correctly configured affinity (I mean different physical cores for the main and background threads, not virtual ones like HT or SMT)?

Hi @egbaydarov, thank you very much for your suggestion.
It got the same boost with correctly configured affinity.
