-
Notifications
You must be signed in to change notification settings - Fork 442
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Detect libdeflate >= 1.9 and adjust CRAM RN encoding. #1383
Conversation
For idle curiosity, here are some specific benchmarks of various Deflate implementations, plus lzma and bzip2 for comparison. Below, "zlib" is plain old gzip binary (so not quite zlib, but giving very similar numbers) and Z_FILT is actual zlib with Z_FILTERED strategy enabled. Then we see libdeflate (libdf) 1.8 and 1.9, 7zip's deflate (-tgzip -mx=level), and xz and bzip2. It's clear on the highly structured data that Zlib's Z_FILTERED is a huge leap over Z_DEFAULT_STRATEGY. What this does is to favour huffman over LZ steps by limiting the smallest LZ match. It's also what the optimisations to libdeflate are (automatically) doing. The normal english text (Wikipedia) shows the reverse - Z_FILTERED being detrimental, and no major difference between libdeflate versions, showing the auto-tuning is working well. (It's also this data where bzip2 and lzma really start pulling ahead of the Deflate.) 9827_2#49.bam, 10 x 10k read names
XA:Z tags (novaseq)
10MB of Wikipedia
Some curious observations here though. Bzip2 level 1 is actually smaller than level 9 (this is 100k vs 900k block size) for XA:Z tags. Also on this same data, both xz -1 and bzip2 -1 are faster than libdeflate's gzip -9, while also being smaller. So libdeflate works OK (albeit large) at standard levels, but arguably it's not worth using at high compression levels without also enabling xz/bzip2 to compare against. It also seems maybe having a lzma level 1 mode is worth while on faster CRAM profiles, as it's potentially good compression / time tradeoff (0.3s at -1, 1.5 at -5 and 1.7 at -9). Zstd and Brotli also can't match that speed/size on the XA:Z data. Although long term we need a way to tokenise these structure aux tags and just get better compression through being smart. |
cram/cram_io.c
Outdated
@@ -1213,6 +1213,7 @@ char *zlib_mem_inflate(char *cdata, size_t csize, size_t *size) { | |||
} | |||
#endif | |||
|
|||
#if !defined(HAVE_LIBDEFLATE) || !(LIBDEFLATE_VERSION_MAJOR > 1 || LIBDEFLATE_VERSION_MINOR > 8) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strictly, shouldn't the last bit of this be ( ... || (LIBDEFLATE_VERSION_MAJOR == 1 && LIBDEFLATE_VERSION_MINOR > 8))
? Although it may work as-is, as I don't think there was a 0.9 release. Better to be safe, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair comment, although wrong correction I think. I'll fix.
Previously for read names compression levels between 4 and 7 zlib considerably beat libdeflate due to libdeflate's poor selection of minimum match length. This was raised as issue ebiggers/libdeflate#85. It's now been resolved in libdeflate 1.9 (along with some general improvements elsewhere), and in all cases libdeflate is a better choice. Also fixed the mapping of levels 1..9 (standard zlib) to 1..12 (libdeflate). The maths in the comment was incorrect as it's an integer calculation not a floating point one. Figures from converting 1 million NovaSeq records from BAM to CRAM 3.0: Libdeflate 1.9+PR-7 0m43.816s 204732408 48381374 (RN=libdeflate) Libdeflate 1.8-7 0m45.379s 206626451 50580708 (RN=zlib) Libdeflate 1.8-7 1m1.431s 210172035 *54126292 (RN=libdeflate, forced) Zlib only -7 0m48.531s 207189920 50580708 (RN=zlib) (Default level) Libdeflate 1.9+PR-5 0m30.323s 207793059 51023626 (RN=libdeflate) Libdeflate 1.8-5 0m33.265s 208714328 51612215 (RN=zlib, as devel) Libdeflate 1.8-5 0m29.753s 213024792 *55922679 (RN=libdeflate, forced) Zlib only -5 0m40.353s 208499406 51612215 (RN=zlib) We can clearly see the problem(*) in using libdeflate for read-names in 1.8, how it's fixed in 1.9, and how it is now smaller/faster than zlib again. At level 9 it was using libdeflate for RN already, but we see improvements to both RN and elsewhere which are simply down to other changes in the library: Time Size RN Libdeflate 1.9+PR-9 2m21.757s 202890458 47327597 (RN=libdeflate) Libdeflate 1.8-9 2m6.304s 204292448 48541687 (RN=libdeflate) Zlib only -9 1m20.966s 206482425 49988310 (RN=zlib) Finally, the impact of switching level 9 from the old mapping (11) to new (12; "9+"), along with a more complete table for curiosities sake: Time Size RN Libdeflate 1.9+PR-9+ 2m54.664s 202315823 46783148 Libdeflate 1.9+PR-9 2m21.757s 202890458 47327597 Libdeflate 1.9+PR-8 1m39.040s 202934405 47247996 Libdeflate 1.9+PR-7 0m43.816s 204732408 48381374 Libdeflate 1.9+PR-6 0m31.521s 207437149 50768595 Libdeflate 1.9+PR-5 0m30.323s 207793059 51023626 (default level) Libdeflate 1.9+PR-4 0m29.478s 210425588 52946850 Libdeflate 1.9+PR-1 0m27.460s 215975209 57142706 (no change) (Note: "1.9" here is actually master, which is a few commits on from the tag, but the main gist of it is the same.)
4b70605
to
e2d1f25
Compare
Previously for read names compression levels between 4 and 7 zlib
considerably beat libdeflate due to libdeflate's poor selection of
minimum match length. This was raised as issue ebiggers/libdeflate#85.
It's now been resolved in libdeflate 1.9 (along with some general
improvements elsewhere), and in all cases libdeflate is a better
choice.
Also fixed the mapping of levels 1..9 (standard zlib) to 1..12
(libdeflate). The maths in the comment was incorrect as it's an
integer calculation not a floating point one.
Figures from converting 1 million NovaSeq records from BAM to CRAM 3.0:
(Default level)
We can clearly see the problem(*) in using libdeflate for read-names in
1.8, how it's fixed in 1.9, and how it is now smaller/faster than zlib
again.
At level 9 it was using libdeflate for RN already, but we see
improvements to both RN and elsewhere which are simply down to other
changes in the library:
Finally, the impact of switching level 9 from the old mapping (11) to
new (12; "9+"), along with a more complete table for curiosities sake:
(Note: "1.9" here is actually master, which is a few commits on from
the tag, but the main gist of it is the same.)