Change gzip compression level from 9 to 6 in mergetrigs #3428
Conversation
@josh-willis I have no objections to this. I think I only chose 9 because, at the time, the computing cost wasn't significant, but I was worried about getting the file size as small as possible, so I just took it to the extreme.
I'll give this approval, but I'm also happy for this to be exposed as a command-line argument for better control!
I ran a few tests with different compression levels; this can be merged.
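For reference, a comparison along these lines can be sketched as follows. This is only an illustration of the kind of test described, not the actual test that was run; the input file path and dataset key are placeholders:

```python
# Time the write of one dataset at several gzip levels and record file size.
# 'merged_triggers.hdf' and 'H1/snr' are placeholder names, not real paths.
import os
import time
import h5py

with h5py.File('merged_triggers.hdf', 'r') as f:
    data = f['H1/snr'][:]  # load one column of triggers into memory

for level in (1, 6, 9):
    outname = 'test_level_%d.hdf' % level
    start = time.time()
    with h5py.File(outname, 'w') as f:
        f.create_dataset('snr', data=data, compression='gzip',
                         compression_opts=level)
    elapsed = time.time() - start
    size = os.path.getsize(outname)
    print('level %d: %.1f s, %d bytes' % (level, elapsed, size))
```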
Have you tried zstd? You'll get good compression ratios much faster.
@lpsinger That's a good compressor, but not supported by default in HDF5 as far as I can tell (h5py/h5py#611), i.e. it requires rebuilding HDF5 from source. Is that included in most HDF5 packages?
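For anyone who wants to experiment, one route that avoids rebuilding HDF5 is the third-party hdf5plugin package, which registers a Zstandard filter with h5py at runtime. This is only a sketch under that assumption, and readers of the resulting file would also need the plugin installed, which is exactly the portability concern raised above:

```python
# Sketch: write a zstd-compressed dataset using the hdf5plugin filter.
# Requires the hdf5plugin package; file and dataset names are illustrative.
import h5py
import hdf5plugin
import numpy

data = numpy.random.normal(size=10**6)

with h5py.File('zstd_test.hdf', 'w') as f:
    # hdf5plugin.Zstd expands to the compression/compression_opts kwargs.
    f.create_dataset('snr', data=data, **hdf5plugin.Zstd(clevel=3))
```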
* Change default gzip compression level from 9 to 6 in mergetrigs
* Expose compression as CLI option

Co-authored-by: Tito Dal Canton <dalcanton@lal.in2p3.fr>
We have recently had several workflows that spend an inordinate amount of time in the `hdf_mergetrigs` jobs on full data. In some cases they ran for 24 hours and got evicted. @titodalcanton noticed that these jobs were spending much more time writing some of the outputs than others, and that if he `condor_ssh`'d to the job, it was actually in the `R` state using 100% of the CPU, so it seemed that the time was spent in actual computation and not in file-system issues. On further investigation of an example file, if we ignore the datasets which correspond to statistics not actually calculated in a given workflow (e.g., bank chi-squared), then the greater the compression factor achieved, the longer it took to write that dataset. This strongly suggests that for datasets which were compressible (but not trivially compressible like bank chi-squared, which ran quickly) the time was being spent in the compression.

Some reading around on the web suggested that our current compression level of 9 in `pycbc_coinc_mergetrigs` is very aggressive. A more typical default is 6, which is what you get if you call `gzip` from the command line without specifying a compression level. I ran a test last night with that change, on the largest merged trigger file from several recent workflows I'd run. With a level of 6, its size on disk was 24 GB, whereas with 9 it was 23 GB; however, the job completed in an order of magnitude less time.

This is marked as work in progress because, after some discussion with Tito, he's going to add a follow-up commit to actually create a command-line option to specify this level and default it to 6. But we want to write it up so @ahnitz has a chance to give us feedback.
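As a rough illustration of the change being discussed, a gzip-compressed dataset write with the level exposed as a command-line option might look like the sketch below. The option name, file name, and dataset name are illustrative only, not the actual `pycbc_coinc_mergetrigs` code:

```python
# Minimal sketch of exposing the gzip level as a CLI option when writing
# an HDF5 dataset with h5py. Names here are placeholders for illustration.
import argparse
import h5py
import numpy

parser = argparse.ArgumentParser()
parser.add_argument('--compression-level', type=int, default=6,
                    help='gzip compression level for output datasets (1-9)')
args = parser.parse_args()

data = numpy.random.normal(size=10**6)  # stand-in for a merged trigger column

with h5py.File('merged_triggers.hdf', 'w') as f:
    # The equivalent call previously hard-coded compression_opts=9; level 6
    # gives nearly the same file size at a fraction of the CPU cost.
    f.create_dataset('H1/snr', data=data,
                     compression='gzip',
                     compression_opts=args.compression_level)
```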