-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Significantly worse compression than ZLIB on DWARF debug info #2832
Comments
I'm getting good compression ratio on this file at level 18 specifically (<130KB). This is uncommon, though it does unfortunately happen from time to time. The explanation is something like a "trapped into local minima" scenario, which prevents the algorithm to "find" the globally shortest path, due to minuscule evaluation differences at the beginning of the file. This is a known problem for all these algorithms, and finding a way to not miss the "better" opportunity is non-trivial, often involving compromises. Anyway, your file is proving a great training ground to analyze this situation. This could prove useful in finding a solution. |
Thanks for the analysis. Unfortunately for this example, it's also the case that ZLIB outperforms ZSTD at the default compression level as well. (138KB vs 234KB) It also appears that there's something specific about the data patterns in these |
The files seem to generate a ton of short matches of length 3, following a regular 4-bytes pattern. For files > 256 KB, only levels 18+ can hope to handle that properly. Even then, it's difficult, and while generally "okay", there are cases where the algorithm leans on the wrong side. I presume we have enough data to learn from the provided sample already. Once it gets improved, it will be interesting to see if the tentative update survives more examples. |
I would expect most of the data in this file to be LEB128 variable-width integers. I haven't checked, but it's plausible that they're sorted in some way that would cause the 3-byte ones to come first. |
ah, no, they aren't LEB128 encoded in .debug_str_offsets - they're regular sized for faster lookup (by index): https://dwarfstd.org/doc/DWARF5.pdf bottom of page 240: "This header is followed by a series of string table offsets that have the same representation as DW_FORM_strp. For the 32-bit DWARF format, each offset is 4 bytes long; for the 64-bit DWARF format, each offset is 8 bytes long." |
Thanks for that clarification. Looking at the specification and the raw bytes, the data consists of a 12-byte header followed by a sequence of 4-byte little-endian unsigned integers, sorted in increasing order. In this particular file, the high byte of the values is always zero. I'm not sure where specifically 3-byte matches would be coming from, though there's plenty of redundancy in general. |
I am attaching two more examples of other debug info sections that compress worse with ZSTD than with ZLIB at default compression levels. debug_loclists.bin.gz - ZSTD 6.4MB vs ZLIB 5.9MB Both of these examples are prefixes of the actual section (due to GitHub attachment limits). |
For applications in which compression speed is basically irrelevant, would it be possible to try expensive options, such as trying many different paths and picking the best one? I would not at all be surprised if optimum compression is NP-hard or worse, but I imagine there are options if one has compute power to spare. |
Yes. |
Would using a dictionary help? What about preprocessing? |
All these options are possible. |
FYI, I just found some benchmark on debuginfo done by MaskRay: Looking at the number the gist seems that using a level around |
Except that analysis doesn't seem to have data on .debug_str_offsets in particular - which was an outlier in @resistor's analysis. So this is still an interesting bug - though, yes, DWARF producers can adopt zstd for a win overall - it's just a bit awkward when one section regresses (we could technically keep compressing that one section with zlib for a win there, which is an unfortunate incentive). |
We're having similar issue when compressing our parquet files. Here is a sample of parquet sizes compressed with
and here is size of same data compressed with
The data cannot be posted on public domain, but I can provide a sample parquet file if that helps. Just contact me at nickaein.i@gmail.com |
These files are rather large. Presuming you are using the
and see how compression ratio reacts to this parameter ? |
I should clarify that lists I posted are size of directories and each one usually contains three parquet files. So the size of files can roughly be estimated by dividing those numbers by 3. Sorry for misleading stats. I've selected one of the parquet files for testing that is 62 MB in uncompressed format and With
Version info: zstd v1.4.8 |
OK, so these parquet files seem very responsive to Unfortunately, As a work around, note that strategy is also a parameter that can be enforced manually, like
Eventually, combining low levels with slow strategies can lead to weird combinations that doesn't always work well. Within Supporting |
Hi @Cyan4973, we're also seeing that zstd with some datasets requires exploration of fine-tuning the parameters which for the casual user can be very daunting. Has the zstd team explored the possibility of integrating the concept of paramgrill into zstd itself? An "auto-paramater-tuning" capability for zstd would really set it apart from other compression tools. It could alleviate the roadblocks when attempting to integrate zstd but running into glassjaws as compared to gzip. It sounds like the heuristic search that paramgrill performs is more than adequate for reaching optimal combinations? or is an ML based re-enforcement learning approach worth exploring? Or something like OpenTuner? Documenting example usages for how paramgrill can be used would be greatly appreciated I guess in general what should our approach be to parameter tuning for different usage models? Thank you for the outstanding technology that is zstandard |
Finding better heuristics to determine parameters is a great topic. One difficulty though is to achieve that goal in a variety of speed / efficiency trade off, in order to remain compatible with |
A few words on this topic: Starting Specifically, on the sample provided in this issue, compressed size at level 19 is improved, from 179,814 bytes to 131,410 bytes (a +36.8% improvement). Critically, it is now more compact than reported gzip/zlib result at maximum level. This outcome seems to resolve this issue. |
Describe the bug
ELF executables are able to store DWARF debug info in a compressed format, using ZLIB. I have been experimenting with using ZSTD as a replacement for it, and am generally seeing good results. However, on the ".debug_str_offsets" section, I am seeing significantly worse compression ratios with ZSTD than with ZLIB, even at max compression levels (176KB vs 138KB, from 290KB uncompressed).
To Reproduce
I have attached uncompressed.bin.gz, which is an extracted
.debug_str_offsets
section from a debug build of the LLVM compiler. It has been compressed withgzip -9
. You can reproduce my results above by extracting it to obtainuncompressed.bin
, and recompressing that withzstd -19
.Expected behavior
I would generally expect ZSTD's compression ratio to be no worse than that of ZLIB.
Desktop (please complete the following information):
zstd command line interface 64-bits v1.5.0
gzip 1.11
The text was updated successfully, but these errors were encountered: