
Zstd CPU overhead optimizations #11709

Open - adamdmoss wants to merge 12 commits into base: master

Conversation

adamdmoss
Contributor

@adamdmoss adamdmoss commented Mar 8, 2021

Motivation and Context

This PR reduces CPU overhead in Zstd compression and (to a lesser extent) decompression, especially at high (zstd-5..zstd-9+) compression levels and smaller recordsizes.

Description

I replaced the zzstd module's memory pool with an object pool, to permit recycle+reset of Zstd cctx objects. This in turn allows partial-reinitialization of these context objects; fully initializing these objects can become a huge overhead (i.e. up to ~50% of the entire CPU cost at high compression levels and low block sizes!). Then I applied the same optimization to decompression, though the overhead/benefit is much smaller there - still measurable.
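As a rough illustration of the recycle+reset idea (a minimal userland sketch against the stock libzstd API - not the module's actual pool code, and the helper name is made up):

```c
#include <zstd.h>

/* Hypothetical single-slot "pool": keep one cctx alive and reuse it. */
static ZSTD_CCtx *recycled_cctx;

static size_t
compress_block(void *dst, size_t dst_cap, const void *src, size_t src_len,
    int level)
{
    if (recycled_cctx == NULL) {
        /* Full construction - the expensive path, taken only once. */
        recycled_cctx = ZSTD_createCCtx();
    } else {
        /* Cheap partial re-init; the allocated workspace is kept. */
        ZSTD_CCtx_reset(recycled_cctx, ZSTD_reset_session_and_parameters);
    }
    ZSTD_CCtx_setParameter(recycled_cctx, ZSTD_c_compressionLevel, level);
    return (ZSTD_compress2(recycled_cctx, dst, dst_cap, src, src_len));
}
```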

I also bulk-replaced memcpy/memmove/memset calls inside the Zstd library itself, permitting the compiler to inline some important inner-loop calls (which it wouldn't otherwise know how to do for non-libc string functions). This may look like a pretty invasive change to what should be pristine upstream code, but upstream Zstd has made the same changes for the same reasons since we introduced this code, so any future upgrade won't need to reproduce this change.
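For reference, this is roughly the shape of the indirection (illustrative only; upstream's zstd_deps.h has the canonical version, and the exact compiler conditions may differ):

```c
/*
 * Map the string helpers to compiler builtins where available, so the
 * compiler can inline small constant-size copies in hot loops.
 */
#if defined(__GNUC__) || defined(__clang__)
#define ZSTD_memcpy(d, s, n)  __builtin_memcpy((d), (s), (n))
#define ZSTD_memmove(d, s, n) __builtin_memmove((d), (s), (n))
#define ZSTD_memset(p, v, n)  __builtin_memset((p), (v), (n))
#else
#define ZSTD_memcpy(d, s, n)  memcpy((d), (s), (n))
#define ZSTD_memmove(d, s, n) memmove((d), (s), (n))
#define ZSTD_memset(p, v, n)  memset((p), (v), (n))
#endif
```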

Please note, the compressed data should remain 100% binary-identical, so this doesn't affect dedup/l2arc/encryption-validation etc.

How Has This Been Tested?

ZTS and lots of local data shunting.

Various duct-taped benchmarks copying data of various compressibility to and from datasets with varying recordsizes (mostly 4k, 8k, 16k and 1M) and compression levels (mostly zstd-3, zstd-8, zstd-19). Not chosen to demonstrate PR wins, just seemed like a fairly objective mix.

Raw timing results are here: sort.xls. After a few hours I failed to get LibreOffice to graph this in a readable format, so maybe someone else can. :) Below is a table of the average performance per test and the percent improvement.

  • This has timings for compression=zstd-1, zstd-5, and zstd-9 across master and this PR at various block sizes, for writes and (16-way parallel) reads. Timings for compression=lz4, gzip-1 and 'off' are also included (as measured on master) for comparison.
  • Corpus was an even split of highly compressible, moderately compressible and uncompressible data.
  • note: systime for writes is not very useful since the benchmark uses async writes which mostly do their compression outside of the process' accounting (though final sync completion is included in the wallclock timings)
  • (interesting to see a significant systime/wallclock read-performance dropoff going from 128k->1m recordsize across all compression types; wonder if it's a L2/L3 residency cliff or something more sinister)
| Test | Master - wallclock (secs) | PR11709 - wallclock (secs) | Percentage Decrease |
| --- | --- | --- | --- |
| read zstd-1 128k | 3.745 | 3.381 | -9.72% |
| read zstd-1 1m | 5.572 | 5.589 | 0.31% |
| read zstd-1 8k | 8.77 | 8.392 | -4.31% |
| read zstd-5 128k | 3.84 | 3.535 | -7.94% |
| read zstd-5 1m | 5.559 | 5.651 | 1.65% |
| read zstd-5 8k | 8.654 | 8.564 | -1.04% |
| read zstd-9 128k | 3.679 | 3.395 | -7.72% |
| read zstd-9 1m | 5.487 | 5.567 | 1.46% |
| read zstd-9 8k | 8.797 | 8.624 | -1.97% |
| write zstd-1 128k | 1.953 | 1.861 | -4.71% |
| write zstd-1 1m | 1.739 | 1.813 | 4.26% |
| write zstd-1 8k | 3.966 | 3.789 | -4.46% |
| write zstd-5 128k | 3.632 | 3.565 | -1.84% |
| write zstd-5 1m | 5.172 | 4.939 | -4.51% |
| write zstd-5 8k | 6.441 | 6.17 | -4.21% |
| write zstd-9 128k | 6.494 | 6.126 | -5.67% |
| write zstd-9 1m | 9.088 | 8.511 | -6.35% |
| write zstd-9 8k | 10.46 | 10.012 | -4.28% |

FWIW I also simulated NOSLEEP allocations failing 50% of the time, to make sure the code is robust against that case. (It was so useful that it'd be cool if there were ZFS-wide random NOSLEEP failure injection as part of (fuzz?) testing. Or maybe there is already. :) )
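(For the curious, the fault injection was roughly this shape - a debug-only sketch, not part of the PR; the guard macro and helper name are made up, and it leans on the SPL kmem/random interfaces:)

```c
#ifdef ZSTD_DEBUG_FAKE_NOSLEEP_FAILURES
/* Fail roughly half of the NOSLEEP allocations so fallback paths run. */
static void *
zstd_debug_alloc_nosleep(size_t size)
{
    uint8_t coin;

    random_get_pseudo_bytes(&coin, sizeof (coin));
    if (coin & 1)
        return (NULL);
    return (kmem_alloc(size, KM_NOSLEEP));
}
#endif
```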

[edit] Added performance table from raw data.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@adamdmoss
Contributor Author

Compression results should be binary-identical, for the purposes of dedupe, decompressed ARC, L2ARC, encryption validation, etc. But I'm not sure how to check/demonstrate that; suggestions welcome.
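(One cheap userland sanity check would be a sketch like the following against stock libzstd - it exercises the reuse pattern rather than the module code itself: compress the same buffer with a fresh context and with a recycled/reset one, then memcmp the outputs.)

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zstd.h>

int
main(void)
{
    char src[4096];
    memset(src, 'x', sizeof (src));

    size_t cap = ZSTD_compressBound(sizeof (src));
    void *a = malloc(cap), *b = malloc(cap);

    /* Fresh context, one-shot compression. */
    ZSTD_CCtx *c1 = ZSTD_createCCtx();
    size_t alen = ZSTD_compressCCtx(c1, a, cap, src, sizeof (src), 9);
    ZSTD_freeCCtx(c1);

    /* Recycled context: compress something else, reset, then compress
     * the same input again. */
    ZSTD_CCtx *c2 = ZSTD_createCCtx();
    ZSTD_compressCCtx(c2, b, cap, "warm-up", 7, 3);
    ZSTD_CCtx_reset(c2, ZSTD_reset_session_and_parameters);
    size_t blen = ZSTD_compressCCtx(c2, b, cap, src, sizeof (src), 9);
    ZSTD_freeCCtx(c2);

    printf("identical: %s\n",
        (alen == blen && memcmp(a, b, alen) == 0) ? "yes" : "no");
    free(a);
    free(b);
    return (0);
}
```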

@adamdmoss
Contributor Author

Oh yeah, it probably only builds on Linux right now because of some dumb, unnecessary Linux-specific debugging stuff.

@beren12
Contributor

beren12 commented Mar 9, 2021

This looks interesting! I've been having performance issues with my gaming VM if I'm writing to a separate test pool with zstd-3, even though it only uses 5-10% of the CPU.

@drjohnnyfever1

I'm going to try to get this building on FreeBSD tonight or tomorrow if I get some time.

@adamdmoss
Contributor Author

adamdmoss commented Mar 10, 2021

> I'm going to try to get this building on FreeBSD tonight or tomorrow if I get some time.

Cool - should work if you stub-out my stupid XASSERT implementation, probably.

@adamdmoss
Contributor Author

> This looks interesting! I've been having performance issues with my gaming VM if I'm writing to a separate test pool with zstd-3, even though it only uses 5-10% of the CPU.

I look forward to seeing if it helps! I have other patches that help with similar use-cases but they're probably not great for a server workload.

@drjohnnyfever1

> I'm going to try to get this building on FreeBSD tonight or tomorrow if I get some time.
>
> Cool - should work if you stub-out my stupid XASSERT implementation, probably.

Successfully built and booted 8b00ce6 on FreeBSD. I'm going to try your latest changes, but I think I'll probably still need to work around printk.

@adamdmoss
Contributor Author

> I'm going to try to get this building on FreeBSD tonight or tomorrow if I get some time.
>
> Cool - should work if you stub-out my stupid XASSERT implementation, probably.
>
> Successfully built and booted 8b00ce6 on FreeBSD. I'm going to try your latest changes, but I think I'll probably still need to work around printk.

Shouldn't matter any more - it'll just use the usual VERIFY() macros mostly, and the less-important logging is stubbed-out.

@IvanVolosyuk

It is nice to see ZSTD performance receiving some love :)

I had terrible interactive performance with ZFS for my VM gaming until I enabled CONFIG_PREEMPT=y. Now I have perfect responsiveness of the Windows VM running with RT priority.
I wonder if putting cond_resched() every 4k is necessary for a preemptive kernel. If you can make the gulp size configurable, this can be a way to make the same code work for those who already use a preemptive kernel...

@adamdmoss
Contributor Author

FWIW, cond_resched() is compiled or patched away to nothing in CONFIG_PREEMPTION kernels.
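(Simplified sketch of the kernel-side definitions; the exact macros vary between kernel versions, but this is the gist:)

```c
#ifdef CONFIG_PREEMPTION
/* Fully preemptible kernel: explicit rescheduling points aren't needed. */
static inline int _cond_resched(void) { return 0; }
#else
/* Voluntary-preemption kernels: this may actually reschedule. */
extern int _cond_resched(void);
#endif
#define cond_resched() _cond_resched()
```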

> If you can make the gulp size configurable, this can be a way to make the same code work for those who already use a preemptive kernel

What do you expect won't work? :)

@behlendorf added the "Status: Code Review Needed" (Ready for review and testing) label on Mar 12, 2021
@PrivatePuffin
Contributor

@adamdmoss Interesting changes! :)

That being said: Would this affect the checksumming? If it does, this would hit the same roadblocks as updating ZSTD currently does, with no one actively (outside of empty promises) working on solving those.

@adamdmoss
Contributor Author

> @adamdmoss Interesting changes! :)
>
> That being said: Would this affect the checksumming? If it does, this would hit the same roadblocks as updating ZSTD currently does, with no one actively (outside of empty promises) working on solving those.

It should be bit-identical unless I buggered up the flushing. Not really sure how to verify that though.

@PrivatePuffin
Contributor

> @adamdmoss Interesting changes! :)
> That being said: Would this affect the checksumming? If it does, this would hit the same roadblocks as updating ZSTD currently does, with no one actively (outside of empty promises) working on solving those.
>
> It should be bit-identical unless I buggered up the flushing. Not really sure how to verify that though.

That's super awesome - getting some improvements in without needing versioning on the ZSTD stack (assuming you didn't bugger up ;-) )

@PrivatePuffin
Contributor

@c0d3z3r0 You might want to check this great work out!

@adamdmoss
Contributor Author

adamdmoss commented Mar 21, 2021

Squeezed a few more % of decompression speed out of this branch by approximately cherrypicking some upstream zstd optimizations. Probably not something I'd want to submit with this PR but I thought that anyone who was testing this branch might enjoy it anyway. :)

(edit: I have included this as part of this PR.)

@behlendorf
Contributor

@adamdmoss those are some impressive improvements! To make this easier to get reviewed and tested it would be helpful if you could propose a patch stack with each of the major logical changes broken into its own commit. For example, one commit which makes the recycle+reset ZSTD cctx objects change, another which switches to the streaming API, etc.

@adamdmoss
Contributor Author

> @adamdmoss those are some impressive improvements! To make this easier to get reviewed and tested it would be helpful if you could propose a patch stack with each of the major logical changes broken into its own commit. For example, one commit which makes the recycle+reset ZSTD cctx objects change, another which switches to the streaming API, etc.

Definitely - was actually thinking of splitting it into separate PRs where possible but that's probably overkill.

@PrivatePuffin
Contributor

@adamdmoss Small note on one of your statements:

> I did not apply the same optimization to decompression since I couldn't really measure decompression of recordsizes<=1M being an appreciable issue for interactive performance.

This might be due to the fact ZSTD decompression is already a few times faster than compression per thread, so it is significantly less likely to run into performance issues.

@behlendorf
Contributor

> Definitely - was actually thinking of splitting it into separate PRs where possible but that's probably overkill.

Separate PRs would be fine. In fact, they may be preferable since it'll let us more easily merge this work in chunks. Smaller changes are always easier to review and integrate!

@PrivatePuffin
Contributor

PrivatePuffin commented Mar 25, 2021

Also: it would be nice to have some actual speed numbers we can compare ourselves, to get an eye on the expected performance impact of the respective changes :)

Basically something like we did with the original PR and which is also done by ZSTD upstream themselves: "These numbers are for reference only and do not reflect real-world performance"

@adamdmoss
Contributor Author

I'm trying to whip up the enthusiasm to make a benchmark that anyone else could reproduce. I could do some dumb shell-script stuff but I'd love to get fancy graphs and what-have-you; do we have any recipes for such a thing?

@adamdmoss
Contributor Author

adamdmoss commented Mar 28, 2021

Yeah, I guess I'll script something local unless someone already has a recipe, it can spit out numbers which I can just worry about graphing later.

@PrivatePuffin
Contributor

My original benchmark script (co-authored with Allan), which is linked on the original ZSTD PR, should still work for raw test output :)
(The repo is also still on my profile.)

Those graphs, however, required some manual labor, though the Excel format is also included in the original PR ^^

@adamdmoss changed the title from "WIP ZSTD compression/decompression optimizations" to "Zstd CPU overhead optimizations" on Apr 11, 2021
@adamdmoss
Contributor Author

adamdmoss commented Apr 11, 2021

I reckon this is ready for review. I can squash when it's accepted, or sooner.

@adamdmoss adamdmoss marked this pull request as ready for review April 11, 2021 01:54
@adamdmoss
Contributor Author

I took out the streaming API change from the original versions of this PR; it needs more iteration and can be added later.

Signed-off-by: Adam Moss <c@yotes.com>
@BrainSlayer
Contributor

Has the compression ratio comparison been tested in this latest version? I haven't seen any update on this topic.

@gmelikov
Member

> if allocation fails, the compression for the block is disabled

Side note: this behavior is problematic with the paradigm of "idempotent compression results" in some ways; see #12840.

@PrivatePuffin
Contributor

> Has the compression ratio comparison been tested in this latest version? I haven't seen any update on this topic.

Yeah, this is kinda the big "if"; we've seen better performance before in the PoC code by Allan, which in fact just bypassed the compressor for significant amounts of data.

@BrainSlayer
Contributor

> Has the compression ratio comparison been tested in this latest version? I haven't seen any update on this topic.
>
> Yeah, this is kinda the big "if"; we've seen better performance before in the PoC code by Allan, which in fact just bypassed the compressor for significant amounts of data.

That's my concern, because if the compression results have a lower ratio, it means we have a higher number of allocation failures, which renders the whole patch pointless since it basically does not optimize anything useful.

@rincebrain
Contributor

It's everyone's favorite time again, benchmark o'clock.

zstd optimization comparison

You can find many numbers here, but over 5 runs with each of (git master as of a few hours ago, this branch as of a few hours ago):

  • Modest performance gains on average
  • No change in compression ratio

(See discussion in #12978 for explanations about what numbers are what. If you're wondering why lz4 is in there, I was curious what its variance across runs would be, nothing else.)

@BrainSlayer
Contributor

BrainSlayer commented Feb 5, 2022

> It's everyone's favorite time again, benchmark o'clock.
>
> zstd optimization comparison
>
> You can find many numbers here, but over 5 runs with each of (git master as of a few hours ago, this branch as of a few hours ago):
>
> • Modest performance gains on average
> • No change in compression ratio
>
> (See discussion in #12978 for explanations about what numbers are what. If you're wondering why lz4 is in there, I was curious what its variance across runs would be, nothing else.)

I can't see the compression ratio in your sheet; I just see performance values and performance deltas.

@rincebrain
Contributor

Counterintuitively, it's the one labeled "ratio".

On that tab, it's always 0 because there was no difference, but if you look in the tabs it sources data from, the actual values are present.

@BrainSlayer
Contributor

So the ratio column is a delta value. I mean, your left tab normally shows values and the right tab deltas, but for the ratio column this makes no sense since both sides are zero.

@rincebrain
Contributor

No, that's not what my previous or current charts were.

The left table is avg(one source) - avg(other source); the right is (avg(one source) - avg(other source)) / avg(other source).
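(Worked with illustrative numbers only: if one source averaged 3.70 s and the other 4.00 s, the left table would show 3.70 - 4.00 = -0.30 s and the right table -0.30 / 4.00 = -7.5%.)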

@BrainSlayer
Contributor

Hmm, if I just read the zstd values, we see that write 6, 9 and 12 have identical performance, so I assume all other value differences are just background noise and the performance gain is zero, or not measurable in a significant way. Anyway, I will run the same test using my bare-metal 16-core Ryzen and on a Core i7 quad-core, so we can maybe get some more accurate results and also check whether the CPU (AMD vs. Intel) makes a difference here.

@rincebrain
Contributor

...what chart are you reading?

6, 9, and 12 all had > 5% performance differences between the current code and this PR, on average.

@BrainSlayer
Contributor

Using the real data from your Dropbox - just the write speeds of dd and fio.

@BrainSlayer
Contributor

A lot of the values are identical or very similar, while others are dramatically different; that doesn't fit together. I'm currently setting up a test in the background to verify it on a zero-load, non-VM system.

@rincebrain
Contributor

This was on a baremetal system that was otherwise idle, my 8700k.

@BrainSlayer
Contributor

BrainSlayer commented Feb 5, 2022

Then it doesn't explain the very noisy results. I'll just do a retest and see what comes out on my system, but I will only test the zstd algorithms.

@BrainSlayer
Contributor

Okay, this PR cannot be merged using git apply - too many merge errors. I'll skip it and follow up with your PR; no time for this.

@rincebrain
Contributor

You can just clone people's branches from Github, rather than trying to re-apply it, if that's the primary difficulty.

@BrainSlayer
Contributor

> You can just clone people's branches from Github, rather than trying to re-apply it, if that's the primary difficulty.

And you think it's clever to compare an old codebase with the current upstream master? It's about comparing the PR, not about comparing different OpenZFS versions. For your PR I manually patched it into the upstream version (just one small merge error occurred).

@adamdmoss
Contributor Author

adamdmoss commented Feb 5, 2022

> It's everyone's favorite time again, benchmark o'clock.

Very cool, thanks for testing!

Note, I don't expect significant improvements to throughput unless you're mostly-CPU-limited rather than I/O limited, since this is just a code optimization and compression usually pipelines perfectly with async I/O (ZFS FTW). In this benchmark I'm guessing seq zstd-9,12,15 were pretty CPU-limited. Seq at smaller recordsizes would probably also benefit particularly.

@rincebrain
Contributor

Comparing upstream at the most recent rebase to the current branch seems like a reasonable thing to do if your goal is to test the changes in behavior versus unmodified master.

Testing rebased against latest master doesn't have to be the same step, particularly when your goal seemed to be to verify whether the changes did more than noise.

@BrainSlayer
Contributor

> Comparing upstream at the most recent rebase to the current branch seems like a reasonable thing to do if your goal is to test the changes in behavior versus unmodified master.
>
> Testing rebased against latest master doesn't have to be the same step, particularly when your goal seemed to be to verify whether the changes did more than noise.

I just call it noise. A difference of +/- 50-100 Mbit/s, depending on which mood your system is in at the time, is not just noise.

I can run a test with the original code vs. your version and the original code is faster than yours, and then I do it again and the effect is vice versa. What's faster now?

@BrainSlayer
Contributor

> It's everyone's favorite time again, benchmark o'clock.
>
> Very cool, thanks for testing!
>
> Note, I don't expect significant improvements to throughput unless you're mostly-CPU-limited rather than I/O limited, since this is just a code optimization and compression usually pipelines perfectly with async I/O (ZFS FTW). In this benchmark I'm guessing seq zstd-9,12,15 were pretty CPU-limited. Seq at smaller recordsizes would probably also benefit particularly.

The benchmark test is running on a ramdisk; I/O isn't the bottleneck here. No real HDD or SSD was involved.

@rincebrain
Contributor

> Comparing upstream at the most recent rebase to the current branch seems like a reasonable thing to do if your goal is to test the changes in behavior versus unmodified master.
> Testing rebased against latest master doesn't have to be the same step, particularly when your goal seemed to be to verify whether the changes did more than noise.
>
> I just call it noise. A difference of +/- 50-100 Mbit/s, depending on which mood your system is in at the time, is not just noise.
>
> I can run a test with the original code vs. your version and the original code is faster than yours, and then I do it again and the effect is vice versa. What's faster now?

I wasn't claiming I thought it was just noise, just that that was my impression of your assertion - that you thought the benchmarking we were doing was unreliable, and that you would go run benchmarks with lower variance and see whether any differences arose there.

@behlendorf
Contributor

The performance results above are encouraging, more benchmarks are of course welcome, but this sure looks like something we should try and move forward with.

@adamdmoss if you have the time, I think breaking this into multiple commits (or PRs) would be the way to proceed. It looks like what still needs to be done here is to sort out the FreeBSD build issue and the comment pertaining to the use of vmem_alloc(.., KM_NOSLEEP).

@PrivatePuffin
Contributor

> The performance results above are encouraging, more benchmarks are of course welcome, but this sure looks like something we should try and move forward with.
>
> @adamdmoss if you have the time, I think breaking this into multiple commits (or PRs) would be the way to proceed. It looks like what still needs to be done here is to sort out the FreeBSD build issue and the comment pertaining to the use of vmem_alloc(.., KM_NOSLEEP).

I agree here - performance gains without ratio loss seem enough to put some pressure behind this to get this merged :)

@PrivatePuffin
Contributor

@adamdmoss Any chance this is going to be made into something that's mergeable?

#define ZSTD_memmove(d, s, l) memmove((d), (s), (l))
#define ZSTD_memset(p, v, l) memset((p), (v), (l))
#endif

Contributor

@mjguzik mjguzik Oct 8, 2022

I have to ask what was done here to confirm compiler builtins are not used as is. Note general kernel code in both FreeBSD and Linux is using them, so something special would have to be happening in zstd code for that to not happen.

I did a quick check by adding this on FreeBSD:

```diff
diff --git a/sys/contrib/openzfs/module/zstd/lib/common/xxhash.c b/sys/contrib/openzfs/module/zstd/lib/common/xxhash.c
index 597de18fc895..c3254a1ce646 100644
--- a/sys/contrib/openzfs/module/zstd/lib/common/xxhash.c
+++ b/sys/contrib/openzfs/module/zstd/lib/common/xxhash.c
@@ -86,6 +86,9 @@ static void XXH_free (void* p) { free(p); }
 #include <string.h>
 static void* XXH_memcpy(void* dest, const void* src, size_t size) { return memcpy(dest,src,size); }
 
+void foomemset(char *buf);
+void foomemset(char *buf) { memset(buf, 0, 20); }
+
```

... and confirmed foomemset is using the builtin, at least on that platform; I don't have a handy way to check Linux right now.

Regardless of what's going on, replacing memcpy et al. uses with ZSTD_-prefixed variants is in my opinion too problematic -- for one, one will have to keep a lookout for it on each update.

In the absolute worst case this can redefine these routines, but as noted previously this should not even be necessary. If things fail on Linux, there is probably a missing include.

Contributor Author

@adamdmoss adamdmoss Oct 9, 2022

> I have to ask what was done here to confirm compiler builtins are not used as is.

The kernel's out-of-line memcpy() etc. were hot in perf top; their being visible at all there means they're non-builtins.

> Replacing memcpy et al. uses with ZSTD_-prefixed variants is in my opinion too problematic -- for one, one will have to keep a lookout for it on each update.

Upstream zstd already did this.
