-
Notifications
You must be signed in to change notification settings - Fork 325
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
zstd: x86 assembler implementation of sequenceDecs.executeSimple #531
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few initial thoughts.
I fixed the code, so the tests pass. Now going to prepare an |
9d30996
to
7643f8e
Compare
I'll fix the CI error once pushing to github will be working. The fresh results from IceLake:
I also tried copying via AVX registers, but there is no big difference. IIRC I observed better speedup in my early decodeSync implementation. So, maybe for that function, we'll see a nicer AVX impact.
|
Re: performing all memcpy inside asm -- overall, it's better, few minimal regressions. Comparison with the version with a threshold.
|
@WojciechMula Looks good 👍🏼 |
@klauspost sorry, I copied a wrong file - but the actual results are still quite good IMHO |
Re: copying in 32-byte chunks - 2xSSE reg. There are some nice speed-ups, but I feel like there are more regressions.
|
Yeah, it will over-copy quite a bit, since most matches/lits will be <16 bytes. We could add some logic that uses one or the other, based on |
I fully agree. So, I'm going to revert this change, squash the commits and it'll be ready to merge. I'd like to add the history support in another PR, as I see there's some more work and likely we'd need another specialisation. Does it sound fine? |
BTW this is on my TODO list: check if having a separate path for |
Method sequenceDecs.executeSimple is simplified sequenceDecs.execute for cases when both the history and dictionary are empty. These cases are 83% of all calls to `execute` when run `go test`. In benchmarks it is 99%.
2cb26c5
to
3295600
Compare
- allocate padded out buffer - remove not needed Go fallback code - add missing checks
I checked this (see: https://github.com/WojciechMula/compress/tree/asm-seqdec-execute-small-ll-ml), but the results are not too promising.
|
Took this out, since it isn't true, since non-stream benchmarks hits decodeSync, not decode+execute. The code is largely unused for now, but a foundation we can build on. |
This is plain x86 and x86 with BMI2 implementation of
sequenceDecs.executeSimple
. Part of #515.I extracted function
executeSimple
to handle cases when no history nor dictionary is used. My quick check showed that forgo test
such cases is 83% of all calls, while forgo test -bench .
it's 99%. Thus, it's the vast majority of cases. Of course, we may consider handling all cases in another PR (but after completing #529).As always, I'm marking it as a draft, as some tests fail. I will figure out what's wrong, likely as usual I missed something silly.[fixed (I was right, it was silly)]Below are preliminary benchmark results from IceLake machine: it's noasm vs GOARM64=v3. Currently, the branch is built on top of #528, thus we see the combined performance boost from x86 BMI use in both
decode
andexecute
.