
dataset.to_netcdf not compressing correctly(?) #9783

Closed
3 of 5 tasks
uriii3 opened this issue Nov 15, 2024 · 6 comments

Comments


uriii3 commented Nov 15, 2024

What happened?

I tried different compression levels, and sometimes writing without compression produces a smaller file than writing with compression enabled.

What did you expect to happen?

That enabling compression would always produce a smaller file.

(You can obtain the dataset complevel0.nc through a library called copernicusmarine: copernicusmarine subset --dataset-id cmems_mod_glo_phy_my_0.083deg_P1D-m -v thetao -t 1993-01-01T00:00:00 -T 2020-12-31T23:59:59 -x -90 -X -85 -y -35 -Y -30 -z 0.49 -Z 1 -f complevel0.nc; the zip file is too big to attach here.)

Minimal Complete Verifiable Example

import xarray

dataset = xarray.open_dataset("./complevel0.nc")
netcdf_compression_level = 1
netcdf3_compatible = False
if netcdf_compression_level > 0:
    print(f"NetCDF compression enabled with level {netcdf_compression_level}")
    comp = dict(
        zlib=True,
        complevel=netcdf_compression_level,
        contiguous=False,
        shuffle=True,
    )
    encoding = {var: comp for var in dataset.data_vars}
else:
    encoding = None

xarray_download_format = None
engine = "h5netcdf" if not netcdf3_compatible else "netcdf4"
output_path = f"./my_dataset_with_compression_{netcdf_compression_level}.nc"
dataset.to_netcdf(
    output_path,
    mode="w",
    encoding=encoding,
    format=xarray_download_format,
    engine=engine,
)

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

The sizes of the datasets are:

-rw-r--r--@  1 oricart  staff   57141803 Nov 15 13:35 my_dataset_with_compression_0.nc
-rw-r--r--@  1 oricart  staff  105577838 Nov 15 13:34 my_dataset_with_compression_1.nc
-rw-r--r--@  1 oricart  staff  104547392 Nov 15 13:35 my_dataset_with_compression_4.nc
-rw-r--r--@  1 oricart  staff  104132169 Nov 15 13:35 my_dataset_with_compression_9.nc

The file written without compression is roughly half the size of the compressed ones.



Anything else we need to know?

_No response_

Environment

<details>

INSTALLED VERSIONS
------------------
commit: None
python: 3.9.18 (main, Apr 10 2024, 12:33:50) 
[Clang 15.0.0 (clang-1500.3.9.4)]
python-bits: 64
OS: Darwin
OS-release: 23.6.0
machine: arm64
processor: arm
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.9.3-development

xarray: 2024.3.0
pandas: 2.2.2
numpy: 1.26.4
scipy: 1.13.1
netCDF4: 1.6.5
pydap: None
h5netcdf: 1.4.0
h5py: 3.12.1
Nio: None
zarr: 2.17.2
cftime: 1.6.3
nc_time_axis: None
iris: None
bottleneck: None
dask: 2024.4.1
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: 2024.3.1
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 69.2.0
pip: 24.3.1
conda: None
pytest: 8.1.1
mypy: None
IPython: 8.18.1
sphinx: 7.4.7

</details>
@uriii3 uriii3 added bug needs triage Issue that has not been reviewed by xarray team member labels Nov 15, 2024
@kmuehlbauer
Contributor

@uriii3

Please add the output of ncdump -h filename. I suspect that your data is already stored in a much more compact form (e.g. packed to integers via scale_factor/add_offset), which zlib compression of the decoded floating-point values cannot compensate for.

This might work (untested!): keep the current encoding and add zlib on top.

encoding = {name: {**var.encoding, **comp} for name, var in dataset.data_vars.items()}
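Expanding on that one-liner, here is a hedged, self-contained sketch of the merge; the dataset, variable name, and encoding values below are synthetic stand-ins for the real complevel0.nc file, used only to illustrate the dict-merge semantics:

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for complevel0.nc (names and values are illustrative)
dataset = xr.Dataset(
    {"thetao": (("time", "y"), np.zeros((4, 3), dtype="float32"))}
)
# Pretend the variable carried packed-integer encoding from the source file
dataset["thetao"].encoding = {
    "dtype": "int16",
    "scale_factor": 0.001,
    "add_offset": 21.0,
    "_FillValue": -32767,
}

comp = dict(zlib=True, complevel=1, contiguous=False, shuffle=True)

# Merge: keep each variable's on-disk encoding and layer zlib on top;
# keys in comp win on conflicts.
encoding = {
    name: {**var.encoding, **comp} for name, var in dataset.data_vars.items()
}
```

Because comp appears last in the merge, its zlib settings override any same-named keys from the stored encoding while the packing keys survive.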

@kmuehlbauer kmuehlbauer added usage question and removed bug needs triage Issue that has not been reviewed by xarray team member labels Nov 15, 2024

uriii3 commented Nov 15, 2024

With ncdump I checked and yes, the original dataset contains both scale_factor and add_offset. (Not uploading the full output because I assume that is enough.)

About your proposal: it seems to work, but I got an error when carrying the encoding over; it looks like some of the **var.encoding keys are not accepted by the h5netcdf engine.
The encoding your snippet created:

{'thetao': {'dtype': dtype('int16'), 'zlib': True, 'szip': False, 'zstd': False, 'bzip2': False, 'blosc': False, 'shuffle': True, 'complevel': 4, 'fletcher32': False, 'contiguous': False, 'chunksizes': None, 'original_shape': (7671, 1, 61, 61), '_FillValue': -32767, 'scale_factor': 0.0007324442267417908, 'add_offset': 21.0}}

The encoding I was creating:

{'thetao': {'zlib': True, 'complevel': 4, 'contiguous': False, 'shuffle': True}}

But now I need to remove the keys the backend rejects: ValueError: unexpected encoding parameters for 'h5netcdf' backend: ['szip', 'zstd', 'bzip2', 'blosc']. Although it is a bit strange that the original encoding cannot simply be kept, no?

Thank you so much for the quick reply, very much appreciated!


uriii3 commented Nov 15, 2024

I finally added a small workaround to make the code work; I hope it doesn't bother anyone (I pop the offending dict keys myself, right after creating the encoding):

for var in dataset.data_vars:
    for key in ["szip", "zstd", "bzip2", "blosc"]:
        encoding[var].pop(key, None)

It seems to work, thank you very much again and hope that you have a nice weekend 🤗
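Putting the merge and the pop together, a hedged sketch of the whole workaround (the helper name `build_encoding` and the synthetic dataset are illustrative, not part of the original code):

```python
import numpy as np
import xarray as xr

# Keys the h5netcdf backend rejected in the ValueError above
UNSUPPORTED_KEYS = ("szip", "zstd", "bzip2", "blosc")

def build_encoding(dataset, comp):
    """Merge each variable's existing encoding with comp, dropping the
    compression-selector keys that h5netcdf does not accept."""
    encoding = {}
    for name, var in dataset.data_vars.items():
        merged = {**var.encoding, **comp}
        for key in UNSUPPORTED_KEYS:
            merged.pop(key, None)
        encoding[name] = merged
    return encoding

# Illustrative usage with a synthetic dataset
ds = xr.Dataset({"thetao": (("t",), np.zeros(3, dtype="float32"))})
ds["thetao"].encoding = {"szip": False, "zstd": False, "scale_factor": 0.5}
comp = {"zlib": True, "complevel": 4, "shuffle": True, "contiguous": False}
enc = build_encoding(ds, comp)
```

The resulting dict can then be passed straight to dataset.to_netcdf(..., encoding=enc, engine="h5netcdf").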

@kmuehlbauer
Contributor

@uriii3 Glad it helped and you figured out a working solution.


uriii3 commented Nov 15, 2024

If someone else finds this issue (not really an issue but a usage question): keep in mind that, in the end, the only keys that will be accepted are the ones the backend supports. So it is safer to build a dict from the keys you want to keep than to exclude the ones you know fail (you might miss some).

This code might be more robust:

keys_to_keep = [
    "dtype",
    "blosc_shuffle",
    "shuffle",
    "chunksizes",
    "szip_coding",
    "zlib",
    "_FillValue",
    "significant_digits",
    "fletcher32",
    "contiguous",
    "szip_pixels_per_block",
    "complevel",
    "compression_opts",
    "compression",
    "quantize_mode",
    "endian",
]
encoding = {}
for var in dataset.data_vars:
    encoding[var] = {
        key: value
        for key, value in long_encoding[var].items()
        if key in keys_to_keep
    }


uriii3 commented Dec 5, 2024

Okay, one last check on my side (maybe nobody else will run into this, but):

  • It looks like the keys_to_keep list I was using is not that reliable either (the dtype in my data can be incorrect, and the file is then compressed with some loss).
  • What I finally did was keep only scale_factor and add_offset in that keys_to_keep list, which now seems to work, although it might lose some compression power (without the exact dtype or _FillValue, I guess), but at least I don't lose information in the data at any point.

Hope it helps someone if needed!
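For reference, a hedged sketch of that final, lossless-first approach (the helper name `safe_encoding` and the synthetic dataset are illustrative, not from the original code):

```python
import numpy as np
import xarray as xr

# Only the packing keys are carried over, so no data is lost on write;
# everything else (dtype, _FillValue, ...) is left for xarray to handle.
SAFE_KEYS = ("scale_factor", "add_offset")

def safe_encoding(dataset, comp):
    encoding = {}
    for name, var in dataset.data_vars.items():
        kept = {k: v for k, v in var.encoding.items() if k in SAFE_KEYS}
        encoding[name] = {**kept, **comp}
    return encoding

# Illustrative usage with a synthetic dataset
ds = xr.Dataset({"thetao": (("t",), np.zeros(3, dtype="float32"))})
ds["thetao"].encoding = {
    "dtype": "int16",
    "scale_factor": 0.001,
    "add_offset": 21.0,
}
enc = safe_encoding(ds, {"zlib": True, "complevel": 4})
```

As noted above, dropping dtype and _FillValue may cost some compression, but it avoids any risk of lossy re-packing.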
