
Change gzip compression level from 9 to 6 in mergetrigs #3428

Merged

Conversation

josh-willis
Contributor

We have recently had several workflows that spend an inordinate amount of time in the hdf_mergetrigs jobs on full data. In some cases they ran for 24 hours and got evicted. @titodalcanton noticed that these jobs were spending much more time writing some of the outputs than others, and that if he condor_ssh'd to the job, it was in the R state using 100% of the CPU, so the time was being spent in actual computation rather than file system issues. On further investigation of an example file, if we ignore the datasets corresponding to statistics not actually calculated in a given workflow (e.g., bank chi-squared), then the greater the compression factor achieved, the longer it took to write that dataset. This strongly suggests that for datasets which were compressible (but not trivially compressible like bank chi-squared, which ran quickly), the time was being spent in the compression itself.

Some reading around on the web suggested that our current compression level of 9 in pycbc_coinc_mergetrigs is very aggressive. A more typical default is 6, which is what you get if you call gzip from the command line without specifying a compression level. I ran a test last night with that change, on the largest merged trigger file from several recent workflows I'd run. With a level of 6, its size on disk was 24 GB, whereas with 9 it was 23 GB; however, the job completed in an order of magnitude less time.
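For context, here is a minimal sketch of what this setting looks like at the h5py level; the file and dataset names below are placeholders, not the actual pycbc_coinc_mergetrigs code:

```python
# Minimal sketch (placeholder names, not the real mergetrigs code) of how the
# gzip level is chosen when writing an HDF5 dataset with h5py.
import numpy as np
import h5py

data = np.random.normal(size=1_000_000)  # stand-in for a merged trigger column

with h5py.File('merged_triggers.hdf', 'w') as f:
    # compression_opts is the gzip level: this PR changes it from 9 to 6,
    # trading a slightly larger file for a much faster write.
    f.create_dataset('H1/snr', data=data,
                     compression='gzip', compression_opts=6)
```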

This is marked as work in progress because, after some discussion with Tito, he is going to add a follow-up commit that creates a command-line option to specify this level, defaulting to 6. But we wanted to write it up so @ahnitz has a chance to give us feedback.
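Roughly what the follow-up could look like, as a sketch only; the option name --compression-level and its exact wiring are assumptions, not the final implementation:

```python
# Hypothetical sketch of exposing the gzip level as a CLI option; the option
# name --compression-level is an assumption, not necessarily what the
# follow-up commit uses.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--compression-level', type=int, default=6,
                    choices=range(1, 10),
                    help='gzip compression level for output datasets '
                         '(1 = fastest, 9 = smallest; default 6)')
args = parser.parse_args()

# ... later, when writing each dataset:
# f.create_dataset(name, data=data, compression='gzip',
#                  compression_opts=args.compression_level)
```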

@ahnitz
Member

ahnitz commented Aug 20, 2020

@josh-willis I have no objections to this. I think I only chose 9 because at the time the computing cost wasn't significant, but I was worried about getting the file size as small as possible, so I just took it to the extreme.

Contributor

@spxiwh spxiwh left a comment


I'll give this approval, but I'd also be happy for this to be exposed as a command-line argument for better control!

@titodalcanton
Contributor

I ran a few tests with different compression levels; this can be merged.

@josh-willis josh-willis merged commit afa358b into gwastro:master Aug 20, 2020
@titodalcanton
Contributor

Here is a quick evaluation of all compression levels, done by merging just a couple of files, which is probably not enough to see big differences. Reading was from NFS, writing to /local. The timing comes from a single run, so take it with a grain of salt.

[Attached image: results of the compression-level evaluation described above]

@lpsinger
Contributor

Have you tried zstd? You'll get good compression ratios much faster.

@ahnitz
Member

ahnitz commented Aug 20, 2020

@lpsinger That's a good compressor, but not supported by default in HDF5 as far as I can tell: h5py/h5py#611.

I.e., it requires rebuilding HDF5 from source. Is it included in most HDF5 packages?
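For reference, one way zstd can be used from h5py without rebuilding HDF5 is the third-party hdf5plugin package, which registers extra filters at import time. A hedged sketch, assuming hdf5plugin is installed and acceptable as a dependency (anyone reading the file also needs the filter available):

```python
# Sketch only: hdf5plugin registers extra HDF5 filters (including zstd) with
# h5py when imported. The clevel value here is an arbitrary example, and
# readers of the file need the same filter installed.
import numpy as np
import h5py
import hdf5plugin  # pip install hdf5plugin

data = np.random.normal(size=1_000_000)

with h5py.File('merged_triggers_zstd.hdf', 'w') as f:
    f.create_dataset('H1/snr', data=data, **hdf5plugin.Zstd(clevel=3))
```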

lenona pushed a commit to lenona/pycbc that referenced this pull request Sep 14, 2020
* Change default gzip compression level from 9 to 6 in mergetrigs

* Expose compression as CLI option

Co-authored-by: Tito Dal Canton <dalcanton@lal.in2p3.fr>
OliverEdy pushed a commit to OliverEdy/pycbc that referenced this pull request Apr 3, 2023
* Change default gzip compression level from 9 to 6 in mergetrigs

* Expose compression as CLI option

Co-authored-by: Tito Dal Canton <dalcanton@lal.in2p3.fr>