I'm trying to compress datasets in an automated way, using code that, heavily reduced, looks like this:
import xarray

dataset = xarray.open_dataset("./dataset.nc")

# This is the main body that is automated and that I want to make work
comp = dict(
    zlib=True,
    complevel=4,
    contiguous=False,
    shuffle=True,
)
keys_to_keep = {
    "scale_factor",
    "add_offset",
}
encoding = {
    name: {
        **{
            key: value
            for key, value in var.encoding.items()
            if key in keys_to_keep
        },
        **comp,
    }
    for name, var in dataset.data_vars.items()
}
# Earlier version that only passed the compression options (it would
# overwrite the dict above, so it is commented out here):
# encoding = {name: comp for name in dataset.data_vars}
dataset.to_netcdf(
    "./dataset_compressed.nc",  # example output path
    mode="w",
    encoding=encoding,
)
Before, I didn't have keys_to_keep and I just passed the compression options. But then it looked like, in some cases (specific datasets that had add_offset and scale_factor), the compression wasn't working (see Issue #9783).
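For context, my understanding of what those two keys do (a toy illustration of CF-style packing, not xarray's actual internals) is that scale_factor and add_offset describe how float values are stored as small integers on disk, and readers reverse the transform on load:

```python
# Toy illustration of CF-convention packing (my understanding, not xarray
# internals): packed = round((value - add_offset) / scale_factor), stored
# as a small integer dtype that compresses well; unpacking reverses it.
def pack(values, scale_factor, add_offset):
    return [round((v - add_offset) / scale_factor) for v in values]

def unpack(packed, scale_factor, add_offset):
    return [p * scale_factor + add_offset for p in packed]

temps = [20.0, 20.5, 21.0]
packed = pack(temps, scale_factor=0.5, add_offset=20.0)
print(packed)                     # small integers, cheap to compress
print(unpack(packed, 0.5, 20.0))  # round-trips back to the originals
```

Which is why I'd expect dropping those keys from the encoding to change what ends up on disk.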
As an intermediate step, I created keys_to_keep with a different set of keys (the set that shows up in the error message if you actually pass invalid encodings to the function).
There was one case, though, where var.encoding.items() only contained dtype and _FillValue, and the whole output dataset was wrong.
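My guess about that case (an assumption on my part, not something I've verified in xarray's source) is that keeping an integer dtype in the encoding while dropping the packing keys makes the float data get cast straight to that dtype, which would garble it. A toy sketch of that failure mode:

```python
# Toy sketch (hypothetical, not xarray internals): casting floats straight
# to an integer dtype without first applying scale_factor/add_offset drops
# the fractional parts, garbling any values that needed the transform.
def cast_to_int(values):
    return [int(v) for v in values]  # mimics a raw astype to an int dtype

print(cast_to_int([20.7, 0.3, -1.9]))  # fractional parts simply vanish
```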
With the code I have now, keeping add_offset and scale_factor, all the datasets seem fine (which is also weird, because those two keys didn't appear in the error I got when passing the wrong encoding), but I'm losing a little bit of compression power.
Question: is there a standard way to know which attributes to pass through the encoding? I've seen this was a big issue back in the day, but I couldn't find anything in the current documentation that addresses it: #1614.
Question: am I supposed to keep the whole var.encoding dictionary, or only a specific set of keys (like add_offset, scale_factor, and maybe some others)?
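In the meantime, the workaround I'm converging on can be sketched as a small whitelist helper (the key set here is just my current guess, not an official xarray list):

```python
# Sketch of the whitelist approach (the key set is my guess, not an
# official list from xarray): keep only "meaningful" per-variable encoding
# keys and merge the compression options on top of them.
COMP = {"zlib": True, "complevel": 4, "contiguous": False, "shuffle": True}
KEYS_TO_KEEP = {"scale_factor", "add_offset", "dtype", "_FillValue"}

def build_encoding(var_encoding, keep=KEYS_TO_KEEP, comp=COMP):
    kept = {k: v for k, v in var_encoding.items() if k in keep}
    return {**kept, **comp}

# Unlisted keys (e.g. chunksizes) are dropped; listed ones pass through.
print(build_encoding({"scale_factor": 0.1, "chunksizes": (100,)}))
```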
Thank you very much in advance!