bgzip performance drop from v1.16 #1767
Did you build all of them with libdeflate? I can't see much difference here. That said, it's not gaining much beyond 32 threads (this is a 64-core EPYC).
Thank you for the reply. I can confirm the peak performance using 32 threads on our EPYC, too. In the meantime I was investigating this issue further. I compiled htslib with and without libdeflate, turned off NUMA and SMT, set CPU frequencies constant, and tested from an NVMe SSD and /dev/shm. Unfortunately, under all test cases I made the same observation as stated above.
I haven't tried rapidgzip yet, but you could try adding mbuffer in there too (i.e. uncompress | mbuffer | compress) as that will provide a bigger buffer and help decouple the threading. Otherwise you're bound by the small pipe buffer size.
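For example, a pipeline of the shape being suggested might look like this (file names, buffer size, and thread count are illustrative, not taken from the benchmarks in this thread):

```sh
# Put a large user-space buffer between the decompressor and bgzip so the two
# are not lock-stepped by the (typically 64KB) kernel pipe buffer.
rapidgzip -d -c input.fastq.gz | mbuffer -m 1G | bgzip -@ 32 -c > output.fastq.bgz
```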
I can confirm I see a difference, but only when reading from a pipe. Reading directly from a file it's the same speed. I'm not sure yet what would have affected that. Maybe some additional error checking somewhere. I'd need to do some bisecting to figure it out.
So it's down to e4957188734d8d977f3b494983e28156af9a258e, which added autodetection of text data and ensured bgzip blocks ended on newlines for text-based formats. This improves random-access capabilities for certain file types. Even so, I am a bit surprised to see such a performance difference with hfile.

Commit 2ff03d3:

Commit e495718:

Commit e495718 with …:

Obviously some variability, but it basically rescues the impact.
Commit e495718 changed bgzip from raw POSIX read() calls to hread(). Unfortunately hread gets its buffer size from a stat of the input file descriptor, which can be 4KB for a pipe. We're reading 0xff00 bytes, so this mostly ends up being split over two reads, with one or both involving additional memcpys. This makes the buffered I/O perform worse than the unbuffered I/O. In the most extreme cases (cat data | bgzip -l0 > /dev/null) this is a two-fold slowdown. The easy solution is simply to increase the buffer size to something sensible. It's a little messy, as we have to use hfile_internal.h to get hfile_set_blksize, but it works. I'm not sure why we didn't elect to make that API more public; probably simply out of caution. Fixes samtools#1767
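As a rough illustration of the fix described above, not the actual patch, here is a sketch assuming the hfile_set_blksize() prototype from hfile_internal.h (its exact signature and return convention are an assumption here):

```c
#include <stdio.h>
#include "htslib/hfile.h"
#include "hfile_internal.h"   /* internal header: declares hfile_set_blksize() */

/* Hypothetical helper: wrap an already-open descriptor in an hFILE and
 * enlarge its buffer, so a 0xff00-byte hread() is served by one underlying
 * read() instead of being split when stat reports a 4KB pipe block size. */
static hFILE *hdopen_with_buffer(int fd, size_t bufsize)
{
    hFILE *fp = hdopen(fd, "r");
    if (fp == NULL) return NULL;

    if (hfile_set_blksize(fp, bufsize) < 0)
        fprintf(stderr, "warning: could not enlarge hfile buffer\n");

    return fp;
}
```

In bgzip itself the buffer would presumably be set on the input handle it already opens; the helper above only isolates the idea.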
Thank you for figuring this out! Does your pull request, which increases the block size to 1MB, affect the binary/legacy mode only?
No, it modifies the hfile buffer regardless of binary / text mode. The speed difference between the two isn't that significant compared to the overheads of the buffer.

Further diagnosis shows the problem is due to pipes. Hfile does a stat of the input file descriptor and adopts its reported block size. Htslib maybe should just have a bigger minimum buffer size so it doesn't reduce the size to something so tiny. 4KB block sizes are a real anachronism, harking back to the bygone era of early unixen.

The other side of this is that even when the filesystem recommends a 4MB buffer size (as lustre does), there is a hard cutoff of 32KB in htslib. This limit was added over concerns of high memory usage when doing many-way file merges with lots of file handles open. I have some sympathy for this; however, having the maximum default detected block size be smaller than a typical single bgzf block seems like an inefficiency, as we just end up doing multiple reads anyway so we can uncompress an entire bgzf block. Our in-flight I/O based memory is therefore probably around 400KB for bam/vcf.gz, making the hfile layer largely an irrelevance. I'm tempted to say minimum 64KB, maximum 1MB, with the option of being able to selectively increase it as desired (e.g. bgzip). Maybe @jmarshall has views on this, as most of this code originated in his era, perhaps mostly from himself.

Edit: oops, I'm wrong on my maths. 64KB is the uncompressed size of a bgzf block, not compressed, so it's not as stark as it looks for current memory usage. Nevertheless, I think honouring the system stat return of 4KB for a pipe seems way too small.
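A minimal sketch of the clamping policy being floated (minimum 64KB, maximum 1MB); the constants and the function are illustrative and not part of htslib's API:

```c
#include <stddef.h>

/* Hypothetical bounds for the buffer size hfile derives from st_blksize:
 * never drop below 64KB (pipes report 4KB) and never exceed 1MB (lustre can
 * recommend 4MB), with callers such as bgzip able to raise it further. */
enum {
    HFILE_MIN_BLKSIZE = 64 * 1024,
    HFILE_MAX_BLKSIZE = 1024 * 1024
};

static size_t clamp_blksize(size_t stat_blksize)
{
    if (stat_blksize < HFILE_MIN_BLKSIZE) return HFILE_MIN_BLKSIZE;
    if (stat_blksize > HFILE_MAX_BLKSIZE) return HFILE_MAX_BLKSIZE;
    return stat_blksize;
}
```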
Commit e495718 changed bgzip from raw POSIX read() calls to hread(). Unfortunately hread gets its buffer size from a stat of the input file descriptor, which can be 4KB for a pipe. We're reading 0xff00 bytes, so this mostly ends up being split over two reads, with one or both involving additional memcpys. This makes the buffered I/O perform worse than the unbuffered I/O. In the most extreme cases (cat data | bgzip -l0 > /dev/null) this is a two-fold slowdown. The easy solution is simply to increase the buffer size to something sensible. Currently we play it cautiously and only do this on pipes and fifos. Fixes samtools#1767
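A sketch of the cautious gating the commit message describes, assuming the check is a simple fstat() on the input descriptor; the helper name is made up for illustration:

```c
#include <sys/stat.h>

/* Hypothetical predicate: only pipes and fifos (S_ISFIFO) get the enlarged
 * buffer, since those are the cases where st_blksize is a tiny 4KB and the
 * small hfile buffer demonstrably hurts; regular files keep the default. */
static int input_is_pipe_or_fifo(int fd)
{
    struct stat st;
    if (fstat(fd, &st) != 0) return 0;   /* on error, keep default behaviour */
    return S_ISFIFO(st.st_mode);
}
```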
Hi,
I stumbled across an issue when testing multiple versions of htslib/bgzip. With the release of a blazingly fast decompression tool in 2023, https://github.com/mxmlnkn/rapidgzip, which achieves a throughput of 1.5GB/s on my machine, I observed a massive performance difference in data compression between bgzip versions <= 1.16 and >= 1.17, not attributable to the change of default compression level (from v1.16) or cold-cache effects. (A sketch of the kind of pipeline compared follows the version list below.)
v1.13 - v1.15
v1.16 (with adapted compression level)
v1.17 - v1.19 (with adapted compression level)
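A sketch of the kind of comparison run for each version listed above, with paths, thread count, and compression level chosen for illustration only:

```sh
# Decompress with rapidgzip and re-compress with the bgzip build under test,
# pinning the compression level so the v1.16 default change does not skew results.
time rapidgzip -d -c data.fastq.gz | /path/to/bgzip-1.19 -@ 16 -l 6 -c > /dev/null
```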