-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimal huff depth speed improvements #3302
Optimal huff depth speed improvements #3302
Conversation
In the comparison above, what is |
Apologies, |
While the compression losses are small, they are nonetheless present, in large enough quantity to question the "speed benefit at no compression loss" initial objective. My understanding is that this PR starts by "guessing" what is likely a good My concern is that managing the left/right regression logic may be a bit more complex than it initially seems. Of course, the resulting format is always correct, the concern is about finding the best choice, as the brute-force method does. A recommendation here would be to simplify the logic by doing only a single direction : from smallest to larger. See if this method results in any loss of compression (if it does, then there is more to look into). See if it improves speed measurably. The intuition is that it will help speed for small data, which is where it matters most because the cost is perceptible. It will probably not be efficient for large data, but also, the relative cost is much lower in this case. Finally, presuming it works (to be proven), it would be a good step forward, a reference point that could still be improved upon. |
…it/zstd into optimal-huff-depth-speed
After many rounds of experimentation, I unfortunately could not find a solution that fits the initial objective to achieve speed improvements with no loss in ratio. Any potential speed improvements that deviate from brute force seem to at least somewhat regress compression ratio for 1KB block sizes (even if only by a couple hundred bytes). The solution proposed by @Cyan4973 appears to be the best solution for now, though still not quite optimal. Speed-Optimized Optimal Log vs. Brute Force Optimal LogIn the following tables, Default block size-B1KB-B16KBThere remains some loss in compression ratio for smaller block sizes, though the loss is fairly negligible. |
for (huffLog = HUF_minTableLog(symbolCardinality); huffLog <= maxTableLog; huffLog++) { | ||
maxBits = HUF_buildCTable_wksp(table, count, | ||
maxSymbolValue, huffLog, | ||
workSpace, wkspSize); | ||
if (ERR_isError(maxBits)) continue; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor optimization :
if (maxBits < optLogGuess) break;
If the tree builder is unable to use all allowed bits anyway, it means we have already reached the optimal huffman distribution at previous attempt. We can immediately stop the loop, as it will bring no further benefit.
Tested on silesia.tar
with 1 KB blocks : this seems to improve compression speed by ~+1%, at no impact on compression ratio.
lib/compress/huf_compress.c
Outdated
} | ||
optSize = newSize; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a difference here in how you handle size-equal solutions.
In previous PR, when 2 distributions result in equal size, you keep the first one, with smaller maxBits
.
In this new PR, when 2 distributions result in equal size, you update the solution, and keep the larger maxBits
.
It turns out that this difference is where the compression ratio regression happen on the 1KB block test.
If I change the logic to keep the smaller nb of bits, I'm getting most the original size back.
Actually try this variant :
- break if newSize > optSize + 1
- update solution only if newSize < optSize
and it should give you exactly the same result as the brute-force variant, while preserving most of the speed benefits of this PR.
Now, why is there a difference between 2 maxBits
variants if size estimation tells us that they should lead to the same compressed size ? Initially, I thought there might be a side effect with statistics re-use in following blocks, but thinking about it, that can't be the case : the -B1K
test cut the input into independent chunks of 1 KB, so there is no follow-up block, hence no statistics re-use. It can't be the reason for this difference.
Now, what could explain it ? Let's investigate...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that these last differences we observe between solutions which are expected to be equal is that HUF_estimateCompressedSize()
is just an estimation.
At first sight, it looks exact, but there are actually a nb of minor details to be added for a more accurate picture :
- there is a last-bit marker at end of the bitstream
- the estimation doesn't distinguish 1 stream vs 4 streams, which would lead to different size estimations
- 4 streams cannot be estimated "as is", it would require to know the exact histogram of each stream (we just have the global histogram of the block to work with).
At the end of the day, it's not that these shortcomings make a lot of difference : the estimation is still mostly correct. But it does explain potential off-by-1 estimation errors, which then produce to a certain level of "noise", below which it's not possible to make progresses.
Making HUF_estimateCompressedSize()
more accurate might be useful, not least because it's also used in other parts of the compression library, where such differences might matter more. But it's definitely a different effort, not to be confused with this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed offline: the last-bit marker at end of the bitstream is actually accounted for in HUF_estimateCompressedSize()
. It currently returns nbBits >> 3
which returns the correct number of bytes - 1, when considering the closing bit. Since we only care about the relative sizes, this is not the cause in the disparity we observe between final compressed size and the estimated local compressed size.
I implemented the suggestions made by @Cyan4973 and observed favorable results. Note: The losses we see in decompression ratio appear to be random noise. I ran the benchmarks additional times and did not see the loss in speed observed here. Speed-Optimized Optimal Log vs. Brute Force Optimal LogIn the following tables, Default block size-B1KB-B16KB |
TLDR
This PR is a followup to a previous PR that, for high compression levels, brute force tests all valid Huffman table depths and chooses the optimal based on minimizing encoded + header size. This PR introduces some speed optimizations that could (potentially) allow this feature to be used at lower compression levels.
Note: This does cause us to lose a bit in terms of compression ratio, but offers a significant speed improvement. This could mean that we want more fine-grained
depth modes
, but that could be overkill.Benchmarking
I benchmarked on an Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz machine with core isolation and turbo disabled. I measured compression time and compression ratio for silesia.tar (212M), compiling with clang15. I experimented with various combinations of compression level and chunk size. I ran each scenario 5 times and chose the maximum speed value.
Speed-Optimized Optimal Log vs. No Optimal Log
In the following tables,
ctrl
uses the original huffman log method with no speed or compression optimizations, andtest
uses the speed-optimized huffman log method.Default block size
-B1KB
-B16KB
Speed-Optimized Optimal Log vs. Brute Force Optimal Log
In the following tables,
ctrl
uses the brute force optimal log method, andtest
uses the speed-optimized optimal log method.Default block size
-B1KB
-B16KB