
deflate_compress: Significant performance drop + increase in ratio compared to zlib on levels 8 and 9 with some data #85

Closed
dktapps opened this issue Oct 10, 2020 · 3 comments


@dktapps
Contributor

dktapps commented Oct 10, 2020

I'm currently working on a PHP binding for libdeflate.

I'm compressing the following file using zlib vs libdeflate: https://github.com/pmmp/BedrockData/blob/afc885cccae38048d309911f2c5ddcdcb6af8152/required_block_states.nbt

From levels 1 to 7, zlib and libdeflate achieve comparable compression ratios, and libdeflate consistently takes 40-60% of the time that zlib does.

However, levels 8 and 9 show significantly worse performance, and the compressed output actually gets larger, not just compared to zlib but also compared to lower levels of libdeflate.

I wouldn't have thought much of it, except that performance falls off a cliff at those high levels, enough to make me wonder whether there's a bug in the code paths used there; it also seems peculiar that the output would grow.

The following Callgrind profile indicates that the bulk of the CPU time in my benchmark is being spent in deflate_compress_near_optimal:
[Callgrind profile screenshot]

OS is Ubuntu 20.04.

Attached are some compression numbers from a quick and dirty impl.
libdeflate-test.txt
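
For reference, here is a minimal C sketch of the kind of per-level sweep being measured, using libdeflate's public API; it is illustrative only, not the harness that produced the attached numbers.

```c
/* Hypothetical sketch: compress one file at each libdeflate level and
 * report compressed size and elapsed time. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <libdeflate.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s FILE\n", argv[0]);
        return 1;
    }

    /* Read the whole input file into memory. */
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }
    fseek(f, 0, SEEK_END);
    long in_size = ftell(f);
    rewind(f);
    unsigned char *in = malloc(in_size);
    if (!in || fread(in, 1, in_size, f) != (size_t)in_size) {
        fprintf(stderr, "read error\n");
        return 1;
    }
    fclose(f);

    for (int level = 1; level <= 12; level++) {
        struct libdeflate_compressor *c = libdeflate_alloc_compressor(level);
        size_t bound = libdeflate_deflate_compress_bound(c, in_size);
        unsigned char *out = malloc(bound);

        clock_t start = clock();
        size_t out_size = libdeflate_deflate_compress(c, in, in_size, out, bound);
        double ms = 1000.0 * (clock() - start) / CLOCKS_PER_SEC;

        printf("level %2d: %zu bytes, %.1f ms\n", level, out_size, ms);

        free(out);
        libdeflate_free_compressor(c);
    }
    free(in);
    return 0;
}
```

Building would look something like `cc -O2 bench.c -ldeflate`, assuming libdeflate is installed where the compiler can find it.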

Is this a bug or should I be expecting this kind of behaviour?

@ebiggers
Owner

The algorithm changes at level 8 and above, which is why there can be a discontinuity between levels 7 and 8. The algorithm used at levels 8 and above is slower than the one used at levels 7 and below, but on most inputs it produces a significantly better compression ratio, and it's usually still slightly faster than zlib.

Compression is all about heuristics, so it will always be possible to find inputs where the algorithms don't work as well as expected, which is why I'm not sure it's fair to call this a "bug". But when I have a chance, I'll try to add some more heuristics to improve on this input, as well as the input given in #57 on which libdeflate also does poorly -- probably something along the lines of dynamically selecting the algorithm and minimum match length based on a quick analysis of the input data. The file you gave does well with the "lazy" algorithm and a minimum match length of 4 bytes.
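
As a sketch of what such a quick analysis could look like (purely illustrative; this is not libdeflate's actual heuristic, and the statistic and thresholds below are made up):

```c
/* Illustration only: derive a cheap statistic from a prefix of the input and
 * use it to pick the parser's minimum match length.  The mapping from the
 * statistic to a length is invented here; the point is just that a quick
 * scan can drive the choice. */
static unsigned choose_min_match_len(const unsigned char *in, size_t n)
{
    unsigned char seen[256] = {0};
    unsigned distinct = 0;
    size_t sample = n < 4096 ? n : 4096;   /* only scan a short prefix */

    for (size_t i = 0; i < sample; i++) {
        if (!seen[in[i]]) {
            seen[in[i]] = 1;
            distinct++;
        }
    }

    /* Hypothetical thresholds mapping literal diversity to a match length. */
    if (distinct > 128)
        return 3;
    if (distinct > 64)
        return 4;
    return 5;
}
```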

@dktapps
Contributor Author

dktapps commented Oct 22, 2020

Sorry about that, misclicked on my phone...

ebiggers added a commit that referenced this issue Dec 31, 2021
Instead of switching directly from the lazy compressor at level 7 to the
near-optimal compressor at level 8, use the lazy2 compressor at levels
8-9 and don't switch to near-optimal until level 10.

This avoids poor compression ratio and bad performance (both
significantly worse than level 7, and significantly worse than zlib) at
levels 8-9 on data where the near-optimal compressor doesn't do well
until the parameters are cranked up.

On data where the near-optimal compressor *does* do well, this change
worsens the compression ratio of levels 8-9, but also speeds them up a
lot, thus positioning them similarly vs. zlib as the lower levels (i.e.
much faster and slightly stronger, rather than slightly faster and much
stronger).  The difference between levels 9 and 10 is increased, but
that's perhaps the least bad place to have a discontinuity.

Resolves #85
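
In pseudocode terms, the mapping described above looks roughly like the sketch below. The names and the handling of the lowest levels are illustrative assumptions; only the 7 / 8-9 / 10+ boundaries come from the commit message.

```c
/* Rough sketch of the level-to-compressor mapping the commit describes. */
enum deflate_algo { ALGO_GREEDY, ALGO_LAZY, ALGO_LAZY2, ALGO_NEAR_OPTIMAL };

static enum deflate_algo algo_for_level(int level)
{
    if (level <= 2)
        return ALGO_GREEDY;        /* assumed: fastest parser at the bottom */
    if (level <= 7)
        return ALGO_LAZY;          /* unchanged: lazy up to level 7 */
    if (level <= 9)
        return ALGO_LAZY2;         /* new: lazy2 at levels 8-9 */
    return ALGO_NEAR_OPTIMAL;      /* near-optimal only from level 10 up */
}
```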
@ebiggers
Owner

ebiggers commented Jan 1, 2022

f7d3a70 fixed this by changing the algorithm used at levels 8 and 9; other commits made further improvements to the compression algorithms as well.

Here are the results of DEFLATE compression on required_block_states.nbt. Table entries contain compressed size in bytes / compression time in milliseconds; "new" is commit 3dca7de and "old" is commit 047aa84:

Level  libdeflate (new)  libdeflate (old)  zlib
 1      118528 /   3      119349 /   3     118729 /  3
 2      106640 /   3      107242 /   3     110383 /  4
 3       99915 /   3      100406 /   3     102669 /  4
 4       92241 /   4       93719 /   4      97863 /  6
 5       91444 /   4       90841 /   4      90502 /  7
 6       81281 /   5       80877 /   5      74639 / 10
 7       74085 /   8       74058 /   8      73230 / 13
 8       71513 /  17       91419 /  36      71121 / 21
 9       70995 /  22       88240 /  46      71129 / 23
10       72525 /  92       77829 /  72      N/A
11       69871 / 184       72562 / 115      N/A
12       68442 / 397       70696 / 190      N/A

Note: level 10 is the odd one out now (worse than level 9), but not by very much. It's hard to handle this file perfectly, as it's a bit unusual.

Edit: changed table to be easier to read

jkbonfield added a commit to jkbonfield/htslib that referenced this issue Jan 20, 2022
Previously, for read-name compression at levels between 4 and 7, zlib
considerably beat libdeflate due to libdeflate's poor selection of
minimum match length. This was raised as issue ebiggers/libdeflate#85.
It has now been resolved in libdeflate 1.9 (along with some general
improvements elsewhere), and in all cases libdeflate is the better
choice.

Also fixed the mapping of levels 1..9 (standard zlib) to 1..12
(libdeflate).  The maths in the comment was incorrect, as it's an
integer calculation, not a floating-point one.

Figures from converting 1 million NovaSeq records from BAM to CRAM 3.0:

                     Time       Size        RN
Libdeflate 1.9+PR-7  0m43.816s  204732408   48381374 (RN=libdeflate)
Libdeflate 1.8-7     0m45.379s  206626451   50580708 (RN=zlib)
Libdeflate 1.8-7     1m1.431s   210172035  *54126292 (RN=libdeflate, forced)
Zlib only -7         0m48.531s  207189920   50580708 (RN=zlib)

(Default level)
                     Time       Size        RN
Libdeflate 1.9+PR-5  0m30.323s  207793059   51023626 (RN=libdeflate)
Libdeflate 1.8-5     0m33.265s  208714328   51612215 (RN=zlib, as devel)
Libdeflate 1.8-5     0m29.753s  213024792  *55922679 (RN=libdeflate, forced)
Zlib only -5         0m40.353s  208499406   51612215 (RN=zlib)

We can clearly see the problem(*) in using libdeflate for read-names in
1.8, how it's fixed in 1.9, and how it is now smaller/faster than zlib
again.

At level 9 it was using libdeflate for RN already, but we see
improvements to both RN and elsewhere which are simply down to other
changes in the library:

                     Time       Size        RN
Libdeflate 1.9+PR-9  2m21.757s  202890458   47327597 (RN=libdeflate)
Libdeflate 1.8-9     2m6.304s   204292448   48541687 (RN=libdeflate)
Zlib only -9         1m20.966s  206482425   49988310 (RN=zlib)

Finally, the impact of switching level 9 from the old mapping (11) to
new (12; "9+"), along with a more complete table for curiosity's sake:

                     Time       Size        RN
Libdeflate 1.9+PR-9+ 2m54.664s  202315823   46783148
Libdeflate 1.9+PR-9  2m21.757s  202890458   47327597
Libdeflate 1.9+PR-8  1m39.040s  202934405   47247996
Libdeflate 1.9+PR-7  0m43.816s  204732408   48381374
Libdeflate 1.9+PR-6  0m31.521s  207437149   50768595
Libdeflate 1.9+PR-5  0m30.323s  207793059   51023626 (default level)
Libdeflate 1.9+PR-4  0m29.478s  210425588   52946850
Libdeflate 1.9+PR-1  0m27.460s  215975209   57142706 (no change)

(Note: "1.9" here is actually master, which is a few commits on from
the tag, but the main gist of it is the same.)
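
As a toy illustration of the integer-vs-floating-point point above (not necessarily the exact expression htslib uses for its 1..9 to 1..12 mapping), the following shows how truncating integer division drifts away from the floating-point version of the same formula:

```c
#include <stdio.h>

/* Toy illustration only: compare an integer level mapping with the
 * floating-point version of the same expression. */
int main(void)
{
    for (int level = 1; level <= 9; level++) {
        int    as_int   = level * 12 / 9;       /* what the code computes */
        double as_float = level * 12.0 / 9.0;   /* what a naive comment assumes */
        printf("zlib %d -> libdeflate %d (float: %.2f)\n",
               level, as_int, as_float);
    }
    return 0;
}
```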