Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Detect libdeflate >= 1.9 and adjust CRAM RN encoding. #1383

Merged
merged 1 commit into from
Feb 1, 2022

Conversation

jkbonfield
Copy link
Contributor

@jkbonfield jkbonfield commented Jan 20, 2022

Previously for read names compression levels between 4 and 7 zlib
considerably beat libdeflate due to libdeflate's poor selection of
minimum match length. This was raised as issue ebiggers/libdeflate#85.
It's now been resolved in libdeflate 1.9 (along with some general
improvements elsewhere), and in all cases libdeflate is a better
choice.

Also fixed the mapping of levels 1..9 (standard zlib) to 1..12
(libdeflate). The maths in the comment was incorrect as it's an
integer calculation not a floating point one.

Figures from converting 1 million NovaSeq records from BAM to CRAM 3.0:

                     Time       Size        RN
Libdeflate 1.9+PR-7  0m43.816s  204732408   48381374 (RN=libdeflate)
Libdeflate 1.8-7     0m45.379s  206626451   50580708 (RN=zlib)
Libdeflate 1.8-7     1m1.431s	210172035  *54126292 (RN=libdeflate, forced)
Zlib only -7         0m48.531s  207189920   50580708 (RN=zlib)

(Default level)

                     Time       Size        RN
Libdeflate 1.9+PR-5  0m30.323s  207793059   51023626 (RN=libdeflate)
Libdeflate 1.8-5     0m33.265s  208714328   51612215 (RN=zlib, as devel)
Libdeflate 1.8-5     0m29.753s	213024792  *55922679 (RN=libdeflate, forced)
Zlib only -5         0m40.353s  208499406   51612215 (RN=zlib)

We can clearly see the problem(*) in using libdeflate for read-names in
1.8, how it's fixed in 1.9, and how it is now smaller/faster than zlib
again.

At level 9 it was using libdeflate for RN already, but we see
improvements to both RN and elsewhere which are simply down to other
changes in the library:

                     Time       Size        RN
Libdeflate 1.9+PR-9  2m21.757s  202890458   47327597 (RN=libdeflate)
Libdeflate 1.8-9     2m6.304s   204292448   48541687 (RN=libdeflate)
Zlib only -9         1m20.966s  206482425   49988310 (RN=zlib)

Finally, the impact of switching level 9 from the old mapping (11) to
new (12; "9+"), along with a more complete table for curiosities sake:

                     Time       Size        RN
Libdeflate 1.9+PR-9+ 2m54.664s  202315823   46783148
Libdeflate 1.9+PR-9  2m21.757s  202890458   47327597
Libdeflate 1.9+PR-8  1m39.040s  202934405   47247996
Libdeflate 1.9+PR-7  0m43.816s  204732408   48381374
Libdeflate 1.9+PR-6  0m31.521s  207437149   50768595
Libdeflate 1.9+PR-5  0m30.323s  207793059   51023626 (default level)
Libdeflate 1.9+PR-4  0m29.478s  210425588   52946850
Libdeflate 1.9+PR-1  0m27.460s  215975209   57142706 (no change)

(Note: "1.9" here is actually master, which is a few commits on from
the tag, but the main gist of it is the same.)

@jkbonfield
Copy link
Contributor Author

For idle curiosity, here are some specific benchmarks of various Deflate implementations, plus lzma and bzip2 for comparison.

Below, "zlib" is plain old gzip binary (so not quite zlib, but giving very similar numbers) and Z_FILT is actual zlib with Z_FILTERED strategy enabled. Then we see libdeflate (libdf) 1.8 and 1.9, 7zip's deflate (-tgzip -mx=level), and xz and bzip2.

It's clear on the highly structured data that Zlib's Z_FILTERED is a huge leap over Z_DEFAULT_STRATEGY. What this does is to favour huffman over LZ steps by limiting the smallest LZ match. It's also what the optimisations to libdeflate are (automatically) doing.

The normal english text (Wikipedia) shows the reverse - Z_FILTERED being detrimental, and no major difference between libdeflate versions, showing the auto-tuning is working well. (It's also this data where bzip2 and lzma really start pulling ahead of the Deflate.)

9827_2#49.bam, 10 x 10k read names

        zlib   Z_FILT    libdf-1.8  libdf-1.9  7z(gz)     xz        bzip2
1       650346 650282    587516     549997     568562     526396   498639
2       637592 637357    581077     523136     568562     517600   490275
3       618339 618059    578561     523061     568562     513380   487290
4       563650 511955    560152     513929     568562     493608   483866
5       546331 503513    543610     494442     489402     434832   483866
6       542634 503452    539033     489260     489402     429644   483866
7       522038 481199    533182     483155     454894     429644   483866
8       516396 476045    543334     462102     454894     429644   483866
9       516396 476045    476809     461233     449958     429644   483866
10                       476074     444499
11                       451635     443796
12                       443525     442113

XA:Z tags (novaseq)

        zlib    Z_FILT    libdf-1.8  libdf-1.9  7z(gz)    xz        bzip2
1       1290177 1288543   1160106    1134752    1105805   917088   889448
2       1263868 1262245   1116145    1082768    1105805   917760   891053
3       1244700 1242553   1105258    1076247    1105805   917348   896766
4       1094990 1078961   1100050    1074293    1105805   870692   896822
5       1027036 1016811   1020026     992339     906550   740880   899782
6       1017441 1003017   1013265     981532     906550   740328   901856
7       1001234  982690   1006705     969078     893837   740328   903642
8        997529  974867    940742     941165     893837   740328   903193
9        997551  974817    897998     939582     893623   740328   903814
                           895733     891178                     
                           891221     889934                     
                           890062     889813                        

10MB of Wikipedia

        zlib    Z_FILT    libdf-1.8  libdf-1.9   7z(gz)     xz         bzip2
1       4271856 4275128   4005720    3958154     3835412    3341704    3352773
2       4115469 4119248   3889591    3865680     3835412    3214848    3191585
3       3994919 3998716   3843672    3821119     3835412    3145864    3115854
4       3838904 3972744   3820032    3805628     3835412    2833760    3053777
5       3727743 3889512   3725578    3722118     3561923    2737228    3016628
6       3693800 3845945   3707559    3702375     3561923    2723036    2982881
7       3688055 3838821   3698946    3692170     3553337    2721280    2957052
8       3685300 3834888   3600300    3666673     3553337    2721280    2932684
9       3685289 3834881   3561433    3665677     3552437    2721280    2916026
10                        3555186    3553378
11                        3550775    3550629
12                        3550558    3550809

Some curious observations here though. Bzip2 level 1 is actually smaller than level 9 (this is 100k vs 900k block size) for XA:Z tags. Also on this same data, both xz -1 and bzip2 -1 are faster than libdeflate's gzip -9, while also being smaller. So libdeflate works OK (albeit large) at standard levels, but arguably it's not worth using at high compression levels without also enabling xz/bzip2 to compare against.

It also seems maybe having a lzma level 1 mode is worth while on faster CRAM profiles, as it's potentially good compression / time tradeoff (0.3s at -1, 1.5 at -5 and 1.7 at -9). Zstd and Brotli also can't match that speed/size on the XA:Z data. Although long term we need a way to tokenise these structure aux tags and just get better compression through being smart.

@daviesrob daviesrob self-assigned this Jan 25, 2022
cram/cram_io.c Outdated
@@ -1213,6 +1213,7 @@ char *zlib_mem_inflate(char *cdata, size_t csize, size_t *size) {
}
#endif

#if !defined(HAVE_LIBDEFLATE) || !(LIBDEFLATE_VERSION_MAJOR > 1 || LIBDEFLATE_VERSION_MINOR > 8)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strictly, shouldn't the last bit of this be ( ... || (LIBDEFLATE_VERSION_MAJOR == 1 && LIBDEFLATE_VERSION_MINOR > 8))? Although it may work as-is, as I don't think there was a 0.9 release. Better to be safe, though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair comment, although wrong correction I think. I'll fix.

Previously for read names compression levels between 4 and 7 zlib
considerably beat libdeflate due to libdeflate's poor selection of
minimum match length. This was raised as issue ebiggers/libdeflate#85.
It's now been resolved in libdeflate 1.9 (along with some general
improvements elsewhere), and in all cases libdeflate is a better
choice.

Also fixed the mapping of levels 1..9 (standard zlib) to 1..12
(libdeflate).  The maths in the comment was incorrect as it's an
integer calculation not a floating point one.

Figures from converting 1 million NovaSeq records from BAM to CRAM 3.0:

Libdeflate 1.9+PR-7  0m43.816s  204732408   48381374 (RN=libdeflate)
Libdeflate 1.8-7     0m45.379s  206626451   50580708 (RN=zlib)
Libdeflate 1.8-7     1m1.431s	210172035  *54126292 (RN=libdeflate, forced)
Zlib only -7         0m48.531s  207189920   50580708 (RN=zlib)

(Default level)
Libdeflate 1.9+PR-5  0m30.323s  207793059   51023626 (RN=libdeflate)
Libdeflate 1.8-5     0m33.265s  208714328   51612215 (RN=zlib, as devel)
Libdeflate 1.8-5     0m29.753s	213024792  *55922679 (RN=libdeflate, forced)
Zlib only -5         0m40.353s  208499406   51612215 (RN=zlib)

We can clearly see the problem(*) in using libdeflate for read-names in
1.8, how it's fixed in 1.9, and how it is now smaller/faster than zlib
again.

At level 9 it was using libdeflate for RN already, but we see
improvements to both RN and elsewhere which are simply down to other
changes in the library:

                     Time       Size        RN
Libdeflate 1.9+PR-9  2m21.757s  202890458   47327597 (RN=libdeflate)
Libdeflate 1.8-9     2m6.304s   204292448   48541687 (RN=libdeflate)
Zlib only -9         1m20.966s  206482425   49988310 (RN=zlib)

Finally, the impact of switching level 9 from the old mapping (11) to
new (12; "9+"), along with a more complete table for curiosities sake:

                     Time       Size        RN
Libdeflate 1.9+PR-9+ 2m54.664s  202315823   46783148
Libdeflate 1.9+PR-9  2m21.757s  202890458   47327597
Libdeflate 1.9+PR-8  1m39.040s  202934405   47247996
Libdeflate 1.9+PR-7  0m43.816s  204732408   48381374
Libdeflate 1.9+PR-6  0m31.521s  207437149   50768595
Libdeflate 1.9+PR-5  0m30.323s  207793059   51023626 (default level)
Libdeflate 1.9+PR-4  0m29.478s  210425588   52946850
Libdeflate 1.9+PR-1  0m27.460s  215975209   57142706 (no change)

(Note: "1.9" here is actually master, which is a few commits on from
the tag, but the main gist of it is the same.)
@daviesrob daviesrob merged commit e2d1f25 into samtools:develop Feb 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants