deflate_compress: Significant performance drop + increase in ratio compared to zlib on levels 8 and 9 with some data #85
Comments
The algorithm changes at level 8 and above, so that's why there can be a discontinuity between level 7 and 8. The algorithm at levels 8 and above is slower than the algorithm at levels 7 and below; however, on most inputs it produces a significantly better compression ratio, and it's usually still slightly faster than zlib. Compression is all about heuristics, so it will always be possible to find some inputs where the algorithms don't work as well as expected. So I'm not sure it's fair to call it a "bug".

But when I have a chance, I'll try to add some more heuristics to improve on this input as well as the input given in #57, on which libdeflate also does poorly: probably something along the lines of dynamically selecting the algorithm and minimum match length based on a quick analysis of the input data. The file you gave does well with the "lazy" algorithm and a minimum match length of 4 bytes.
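To make the idea concrete, here is a minimal sketch of what such a pre-scan heuristic could look like. Everything in it is invented for illustration (the names, the 4 KiB sample size, the distinct-byte threshold of 64); it is not libdeflate's code, just one plausible shape for "pick the algorithm and minimum match length from a quick look at the data":

```c
/*
 * Illustrative sketch only: sample a prefix of the input and choose
 * an algorithm and minimum match length from simple statistics.
 * Names and thresholds are invented; this is not libdeflate code.
 */
#include <stddef.h>

enum algo { ALGO_LAZY, ALGO_NEAR_OPTIMAL };

struct choice {
	enum algo algo;
	unsigned min_match_len;
};

static struct choice
choose_params(const unsigned char *in, size_t in_nbytes)
{
	size_t counts[256] = { 0 };
	size_t sample = in_nbytes < 4096 ? in_nbytes : 4096;
	size_t distinct = 0;

	/* Count distinct byte values in a small prefix of the input. */
	for (size_t i = 0; i < sample; i++)
		if (counts[in[i]]++ == 0)
			distinct++;

	/*
	 * A small alphabet (structured, record-like data) tends to reward
	 * short matches and the cheaper lazy parser; a large alphabet
	 * tends to reward the near-optimal parser.
	 */
	if (distinct < 64)
		return (struct choice){ ALGO_LAZY, 4 };
	return (struct choice){ ALGO_NEAR_OPTIMAL, 3 };
}
```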
Sorry about that, misclicked on my phone...
Instead of switching directly from the lazy compressor at level 7 to the near-optimal compressor at level 8, use the lazy2 compressor at levels 8-9 and don't switch to near-optimal until level 10. This avoids poor compression ratio and bad performance (both significantly worse than level 7, and significantly worse than zlib) at levels 8-9 on data where the near-optimal compressor doesn't do well until the parameters are cranked up. On data where the near-optimal compressor *does* do well, this change worsens the compression ratio of levels 8-9, but also speeds them up a lot, thus positioning them similarly vs. zlib as the lower levels (i.e. much faster and slightly stronger, rather than slightly faster and much stronger). The difference between levels 9 and 10 is increased, but that's perhaps the least bad place to have a discontinuity. Resolves #85
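The resulting level-to-algorithm mapping can be pictured as a simple dispatch. The sketch below only paraphrases the commit message (lazy up through level 7, lazy2 at 8-9, near-optimal from 10); the function names and empty bodies are stand-ins, not libdeflate's internal symbols:

```c
/*
 * Dispatch sketch of the level mapping described above.
 * Function names are invented stand-ins, not libdeflate's symbols.
 */
typedef void (*compress_fn)(void);  /* placeholder signature */

static void compress_lazy(void)         { /* up through level 7, as before */ }
static void compress_lazy2(void)        { /* levels 8-9, new */ }
static void compress_near_optimal(void) { /* level 10 and up */ }

static compress_fn
select_compressor(int level)
{
	if (level <= 7)
		return compress_lazy;         /* unchanged behaviour */
	if (level <= 9)
		return compress_lazy2;        /* avoids the 8-9 cliff */
	return compress_near_optimal;         /* deferred to level 10+ */
}
```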
f7d3a70 fixed this by changing the algorithm used at levels 8 and 9. Other commits made further improvements to the compression algorithms as well. Here are the results of DEFLATE compression on required_block_states.nbt after those changes: level 10 is now the odd one out (worse than level 9), but not by very much. It's hard to handle this file perfectly, as it's a bit unusual.
Previously, for read-name compression at levels between 4 and 7, zlib considerably beat libdeflate due to libdeflate's poor selection of minimum match length. This was raised as issue ebiggers/libdeflate#85. It's now been resolved in libdeflate 1.9 (along with some general improvements elsewhere), and in all cases libdeflate is a better choice.

Also fixed the mapping of levels 1..9 (standard zlib) to 1..12 (libdeflate). The maths in the comment was incorrect, as it's an integer calculation, not a floating-point one.

Figures from converting 1 million NovaSeq records from BAM to CRAM 3.0:

                       Time       Size       RN
Libdeflate 1.9+PR -7   0m43.816s  204732408  48381374   (RN=libdeflate)
Libdeflate 1.8    -7   0m45.379s  206626451  50580708   (RN=zlib)
Libdeflate 1.8    -7   1m1.431s   210172035  *54126292  (RN=libdeflate, forced)
Zlib only         -7   0m48.531s  207189920  50580708   (RN=zlib)

(Default level)
Libdeflate 1.9+PR -5   0m30.323s  207793059  51023626   (RN=libdeflate)
Libdeflate 1.8    -5   0m33.265s  208714328  51612215   (RN=zlib, as devel)
Libdeflate 1.8    -5   0m29.753s  213024792  *55922679  (RN=libdeflate, forced)
Zlib only         -5   0m40.353s  208499406  51612215   (RN=zlib)

We can clearly see the problem (*) with using libdeflate for read names in 1.8, how it's fixed in 1.9, and how it is now smaller/faster than zlib again.

At level 9 it was already using libdeflate for RN, but we see improvements to both RN and elsewhere which are simply down to other changes in the library:

                       Time       Size       RN
Libdeflate 1.9+PR -9   2m21.757s  202890458  47327597   (RN=libdeflate)
Libdeflate 1.8    -9   2m6.304s   204292448  48541687   (RN=libdeflate)
Zlib only         -9   1m20.966s  206482425  49988310   (RN=zlib)

Finally, the impact of switching level 9 from the old mapping (11) to the new one (12; "9+"), along with a more complete table for curiosity's sake:

                       Time       Size       RN
Libdeflate 1.9+PR -9+  2m54.664s  202315823  46783148
Libdeflate 1.9+PR -9   2m21.757s  202890458  47327597
Libdeflate 1.9+PR -8   1m39.040s  202934405  47247996
Libdeflate 1.9+PR -7   0m43.816s  204732408  48381374
Libdeflate 1.9+PR -6   0m31.521s  207437149  50768595
Libdeflate 1.9+PR -5   0m30.323s  207793059  51023626   (default level)
Libdeflate 1.9+PR -4   0m29.478s  210425588  52946850
Libdeflate 1.9+PR -1   0m27.460s  215975209  57142706   (no change)

(Note: "1.9" here is actually master, which is a few commits on from the tag, but the main gist of it is the same.)
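The integer-arithmetic point is easy to demonstrate. The proportional formula below (level * 12 / 9) is an assumption for illustration, not necessarily htslib's actual expression; it shows how C's truncating integer division and a rounded floating-point version assign several levels differently (e.g. zlib level 5 maps to 6 rather than 7):

```c
/*
 * Demonstration of the integer-vs-floating-point pitfall mentioned
 * above. The formula level * 12 / 9 is an assumed example only.
 */
#include <math.h>
#include <stdio.h>

int main(void)
{
	for (int level = 1; level <= 9; level++) {
		int as_int = level * 12 / 9;                  /* truncates */
		int as_float = (int)lround(level * 12 / 9.0); /* rounds */

		printf("zlib %d -> libdeflate %2d (integer) vs %2d (float)\n",
		       level, as_int, as_float);
	}
	return 0;
}
```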
I'm currently working on a PHP binding for libdeflate.
I'm compressing the following file using zlib vs libdeflate: https://github.com/pmmp/BedrockData/blob/afc885cccae38048d309911f2c5ddcdcb6af8152/required_block_states.nbt
From levels 1 to 7, zlib and libdeflate achieve comparable compression ratios, and libdeflate is consistently taking 40-60% of the time that zlib takes.
However, at levels 8 and 9 performance is significantly worse and the compressed output actually gets larger, not just compared to zlib but also compared to libdeflate's own lower levels.
I wouldn't have thought much of it, except that performance falls off a cliff at those levels, enough to make me wonder whether there's a bug in the code running there; it also seems peculiar that the compressed size would grow with the level.
The following Callgrind profile indicates that the bulk of the CPU time in my benchmark is being spent in deflate_compress_near_optimal:

![image](https://user-images.githubusercontent.com/14214667/95663711-51e5fb00-0b39-11eb-99df-691fbf13015b.png)

OS is Ubuntu 20.04.
Attached are some compression numbers from a quick and dirty impl.
libdeflate-test.txt
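For reference, a minimal per-level driver against libdeflate's documented API (libdeflate.h) can reproduce this kind of measurement. The file I/O and clock()-based timing below are simplifications for the sketch, not the exact code behind the attached figures:

```c
/*
 * Minimal per-level measurement sketch using libdeflate's public API.
 * Error handling and timing are deliberately simple; illustrative only.
 */
#include <libdeflate.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}

	/* Slurp the whole input file into memory. */
	FILE *f = fopen(argv[1], "rb");
	if (!f)
		return 1;
	fseek(f, 0, SEEK_END);
	long in_nbytes = ftell(f);
	rewind(f);
	unsigned char *in = malloc(in_nbytes);
	if (fread(in, 1, in_nbytes, f) != (size_t)in_nbytes)
		return 1;
	fclose(f);

	for (int level = 1; level <= 12; level++) {
		struct libdeflate_compressor *c =
			libdeflate_alloc_compressor(level);
		size_t bound =
			libdeflate_deflate_compress_bound(c, in_nbytes);
		unsigned char *out = malloc(bound);

		clock_t t0 = clock();
		size_t out_nbytes = libdeflate_deflate_compress(
				c, in, in_nbytes, out, bound);
		double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

		printf("level %2d: %ld -> %zu bytes, %.3fs\n",
		       level, in_nbytes, out_nbytes, secs);

		free(out);
		libdeflate_free_compressor(c);
	}
	free(in);
	return 0;
}
```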
Is this a bug or should I be expecting this kind of behaviour?