Support for Signed Types Integrated in the CODECs #28
Yes. I expect it would vectorize very well.
Zigzag support for int32 is implemented in this branch, here specifically:
@aqrit Is there a clean way to bring this into the main streamvbyte repo? Want to issue a PR? Or do we do something dirtier?
Not really, I touched everything. It currently lacks the ARM routines. My original plan was to implement an AVX2 table-less version and then package it up... but I never got around to it.
What about this then... As a stopgap measure, we add https://gist.github.com/lemire/b6437fbd193395d8e4ccac1a5b2e50cc This is such simple code that the autovectorizer should do a fine job (with GCC use -O3, with clang use -O2). Then you can call these functions as needed, to convert your signed integers into unsigned integers and back. It is not going to be as fast as having zigzag encoding integrated into the main functions, but it should be a lot faster than doing the conversion in Python or whatever else. We would leave the issue open, to be resolved with better code... but you'd have an 'ok' fix meanwhile. Thoughts?
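As a rough sketch of what such conversion helpers might look like (the names here are illustrative, not necessarily the gist's): plain loops over the arrays, written with unsigned arithmetic so the autovectorizer can chew through them:

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative sketch, not necessarily the gist's code. Simple loops like
 * these are what GCC (-O3) and clang (-O2) should autovectorize. All the
 * arithmetic is done on unsigned values, so there is no signed-overflow UB. */
void zigzag_encode(const int32_t *in, uint32_t *out, size_t count) {
    for (size_t i = 0; i < count; i++) {
        uint32_t u = (uint32_t)in[i];          /* two's-complement bit pattern */
        out[i] = (u << 1) ^ (0u - (u >> 31));  /* move the sign bit to bit 0   */
    }
}

void zigzag_decode(const uint32_t *in, int32_t *out, size_t count) {
    for (size_t i = 0; i < count; i++) {
        uint32_t z = in[i];
        out[i] = (int32_t)((z >> 1) ^ (0u - (z & 1)));
    }
}
```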
Sounds like a plan 👍
Implemented in the latest release. Make sure to compile with aggressive optimization flags and it should be quite fast.
Testing some implementations we have internally, I've found the following code to perform quite well: https://gist.github.com/jorj1988/a2f2f14719d46074900dd563dc31a1c7 It does delta >> zigzag >> expand to 32 bit >> streamvbyte encode.
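A hedged sketch of that pipeline for int16 input, using the stock `streamvbyte_encode(const uint32_t *, uint32_t, uint8_t *)` entry point; the helper name and the scratch-buffer convention are mine, not from the linked gist:

```c
#include <stdint.h>
#include "streamvbyte.h"

/* Sketch: delta -> zigzag -> widen to 32 bits -> streamvbyte encode.
 * `scratch` must hold `count` uint32_t values; returns the compressed size. */
size_t encode_int16_delta_zigzag(const int16_t *in, uint32_t count,
                                 uint32_t *scratch, uint8_t *out) {
    uint16_t prev = 0;
    for (uint32_t i = 0; i < count; i++) {
        uint16_t d = (uint16_t)((uint16_t)in[i] - prev);  /* delta, wraps */
        prev = (uint16_t)in[i];
        /* 16-bit zigzag, zero-extended into the 32-bit scratch buffer */
        scratch[i] = (uint16_t)((d << 1) ^ (0u - (d >> 15)));
    }
    return streamvbyte_encode(scratch, count, out);
}
```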
Thanks. That is a useful reference.
With delta zigzag as per #29, I see about a 40% improvement over my Python implementation with int32 arrays and 20% with int16 (the extra cast is expensive).
@jorj1988 thanks, I'll have a go at benchmarking your implementation and see what kind of performance improvements it yields. @aqrit I would certainly benefit from an AVX2 implementation if you're still up for it.
A look at the performance across types: iiSeymour/pystreamvbyte#1 (comment). I had a quick go at modifying the tests/perf utility to reproduce it in C.
I understood the 'int32' perf as in "you need to apply zigzag and then compress". However, I am puzzled regarding 'int16'. What do you do exactly?
My primary use case is working with int16 data and that means an expensive copy to cast up in Python before every encode/decode.
For encoding 16-bit words, a shuffle table wouldn't even be needed.
I've experimented with a prefix sum for the shuffle and didn't see much of a performance increase for my test set. I'd be interested in seeing a more competent implementation of this.
I was unable to find any shortcuts for encoding using prefix sums. My apologies. Zigzag is overkill for int16.
Hmm, I sort of see what you're saying, but I'm having trouble putting together code to implement it - do you have a pseudocode example of what you're suggesting?
It is easier to check bit_7 than it is to check bit_0.
I don't think anything is worth doing while they are being piped through the int32 functions. I've been thinking about doing some AVX2 code for shorts...
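One possible reading of the bit_7 remark above (my assumption; the thread doesn't spell it out): on x86, `_mm_movemask_epi8` grabs bit 7 of every byte in a single instruction, while collecting bit 0 needs a mask and shift first:

```c
#include <emmintrin.h> /* SSE2 */

/* Bit 7 of each byte: a single PMOVMSKB. */
int gather_bit7(__m128i v) {
    return _mm_movemask_epi8(v);
}

/* Bit 0 of each byte: isolate it and shift it up to bit 7 first. */
int gather_bit0(__m128i v) {
    __m128i lo = _mm_and_si128(v, _mm_set1_epi8(1));
    return _mm_movemask_epi8(_mm_slli_epi16(lo, 7));
}
```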
thanks - I'll take a look
attempt at uint16 support using SSSE3
@aqrit here are some performance results against using
[chart: array size vs throughput] Any further thoughts on
Someone want to prepare a PR based on @aqrit's work?
Not sure what you're looking for... Here, a simple RLE of the HIBYTEs was used. uint16 could be the same speed as uint32 if one wanted to use a 4 KiB dictionary. The decoder's inner loop could be fully unrolled. There is a lot of "room for improvement" in the encoder. Signed integer support is just a couple of extra instructions. ramblings:
Encoding speed is more important than decoding for my use case. Do you think AVX2 would result in a significant improvement?
Could you push a version? I'm more comfortable in Python, and C with intrinsics scares me 😄
I've found a pass of zstd (level 1) after svb is very effective but quite costly in comparison. For the data I'm primarily interested in, the range is actually only 11 bits, and with delta encoding ~6.
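For reference, the second pass can be a one-shot zstd call over the streamvbyte output; a sketch using the standard zstd API (the function and buffer names are mine):

```c
#include <zstd.h>

/* Second pass: zstd level 1 over the already-streamvbyte-encoded bytes.
 * Returns the compressed size (or a zstd error code). */
size_t compress_svb_output(const void *svb, size_t svb_len,
                           void *dst, size_t dst_capacity) {
    return ZSTD_compress(dst, dst_capacity, svb, svb_len, 1);
}
```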
I'll post something this weekend. If optimizing for encoding speed, the pseudo zigzag operation only requires 1 instruction (to encode).
|
If a 16-bit value is in the range How does one instead encode the range I can't decide how to optimize the encoder...
@aqrit I think I am missing something. I can see the delta coding, but I'm confused about the signed integer support, as the functions still expect `uint16_t`. Anyway, tests seem to pass and here are some performance numbers -
|
It is the exact same operation for signed or unsigned numbers (AFAIK). If the function signature was changed to
|
I spent some time looking at
Using your latest
|
back on the topic of signed 32-bit integers:
Zigzag and VariableSignExtend -- using SSE4.1 -- both require 4 instructions to decode.

Zigzag:

VariableSignExtend:

However,

So decoding could just be:

The sign bit could be flipped (XORed) during encoding instead of during decoding. The decode shuffle index is known during encoding.

For scalar decoding, with fast unaligned memory access, VariableSignExtend encoding is 1 instruction cheaper to encode than Zigzag, in the general case.
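A hedged sketch of the Zigzag decode side -- the standard `(z >> 1) ^ -(z & 1)` formulation in 4 instructions, as an illustration rather than aqrit's exact code (the VariableSignExtend variant is not reconstructed here):

```c
#include <emmintrin.h> /* SSE2 suffices for these particular intrinsics */

/* Zigzag decode of 4x int32 in 4 instructions, ignoring constant setup:
 * (z >> 1) ^ -(z & 1). */
__m128i zigzag_decode_epi32(__m128i z) {
    __m128i lo   = _mm_and_si128(z, _mm_set1_epi32(1));     /* z & 1    */
    __m128i mask = _mm_sub_epi32(_mm_setzero_si128(), lo);  /* -(z & 1) */
    return _mm_xor_si128(_mm_srli_epi32(z, 1), mask);       /* combine  */
}
```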
Raising a question related to supporting signed integers in this library: how should the library handle overflows/underflows in signed integer arithmetic (undefined behavior)? Here are some examples where we don't handle potential signed overflows at the moment:
For example, the zig-zag encode function does `(val + val) ^ (val >> 31)`, where `val + val` can overflow for large `val`. If we support signed integers in the future, we should maybe think about raising an error for the user instead of running into undefined behavior. With unsigned integers, at the moment, we get wrap-around and the behavior we want "for free".
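To make the overflow concrete, a worked example using the unsigned formulation (which gives the wrap-around "for free"): at `val = INT32_MAX` the signed `val + val` would overflow, while the unsigned arithmetic wraps to exactly the codeword zigzag wants:

```c
#include <assert.h>
#include <stdint.h>

int main(void) {
    uint32_t u = (uint32_t)INT32_MAX;          /* 0x7FFFFFFF            */
    uint32_t z = (u << 1) ^ (0u - (u >> 31));  /* sign bit 0 -> mask 0  */
    assert(z == 0xFFFFFFFEu);                  /* largest even codeword */
    return 0;
}
```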
@daniel-j-h There is no need to report an error. All processors use two's complement, so addition instructions are mostly sign unaware (except for the flags). A SIMD zigzag would be written with intrinsics and would thus sidestep UB. Otherwise, we can provide qualifiers to discourage sanitizers from producing false positives.
The C and C++ languages we code against specify signed integer overflow as undefined behavior; it's unclear to me whether we can just "sidestep" this UB, or whether compilers can do whatever they want in the case of UB. I just wanted to bring this up because, when I was looking at the pystreamvbyte wrapper, it was not clear to me that we encode ints in zigzag fashion and then have these potential undefined behavior issues in the current zigzag implementation.
SIMD intrinsics are not covered by the standard at all, so there is no UB. So yes, you can sidestep it entirely. Regarding the larger issue... Before C++20, I do not think you could sensibly implement zigzag encoding in a well-defined C/C++ way as per the standards. The zigzag encoding trick assumes that you have two's complement, something that you were not allowed to assume prior to C++20.
Let me elaborate... This...

```c
uint32_t _zigzag_encode_32 (int32_t val) {
    return (val + val) ^ (val >> 31);
}
```

should become, in ARM NEON...

```c
veorq_s32(vaddq_s32(val, val), vshrq_n_s32(val, 31))
```

The latter will never trigger UB. Notice how easy it is.
Or @aqrit's x64 intrinsics...

```c
__m128i _mm_zigzag_encode_epi32 (__m128i v) {
    __m128i signmask = _mm_cmpgt_epi32(_mm_setzero_si128(), v);
    return _mm_xor_si128(_mm_add_epi32(v, v), signmask);
}
```

Notice that we expect the compiler to autovectorize zigzag to equivalent code. However, we could do better with a tight integration, as we would avoid the two passes over the data.
@daniel-j-h More critically, the UB is skin deep. You can cast to unsigned integers, do the addition, and cast back. Prior to C++20, the result was implementation-defined, but with C++20, you will get consistent standard-defined results. That is not to say that signed integer overflow is not an issue. It can be. But it is nothing like 'well, we can't do this at all'. We have to keep in mind that, underneath, the processor has no issue whatsoever with these overflows.
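A sketch of that cast, add, cast-back idea; the function name is illustrative, not part of the library:

```c
#include <stdint.h>

/* Illustrative, not streamvbyte's API: all arithmetic is on unsigned values,
 * so the doubling wraps instead of triggering signed-overflow UB. */
uint32_t zigzag_encode_32_unsigned(int32_t val) {
    uint32_t u = (uint32_t)val;      /* cast: keeps the two's-complement bits */
    uint32_t doubled = u + u;        /* defined wrap-around                   */
    uint32_t mask = 0u - (u >> 31);  /* 0 or 0xFFFFFFFF, from the sign bit    */
    return doubled ^ mask;
}
```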
We now run undefined behavior sanitizers in CI. See #36
I have an interest in using streamvbyte with signed types (specifically int16) and have created a Python wrapper which supports this; however, the overhead is quite large, as you would imagine: iiSeymour/pystreamvbyte#1.
Native support for an efficient zigzag encoding/decoding for int16 and int32 would be great. I imagine this is something that would vectorise well?