
dataset.to_netcdf not compressing correctly(?) #9783

Closed
3 of 5 tasks
uriii3 opened this issue Nov 15, 2024 · 6 comments

Comments


uriii3 commented Nov 15, 2024

What happened?

I tried different compression levels, and sometimes writing without compression produces a smaller file than writing with compression enabled.

What did you expect to happen?

That enabling compression would always produce a smaller file.

(You can obtain the dataset complevel0.nc through a library called copernicusmarine: copernicusmarine subset --dataset-id cmems_mod_glo_phy_my_0.083deg_P1D-m -v thetao -t 1993-01-01T00:00:00 -T 2020-12-31T23:59:59 -x -90 -X -85 -y -35 -Y -30 -z 0.49 -Z 1 -f complevel0.nc; the zip file is too big to attach here.)

Minimal Complete Verifiable Example

import xarray

dataset = xarray.open_dataset("./complevel0.nc")
netcdf_compression_level = 1
netcdf3_compatible = False
if netcdf_compression_level > 0:
    print(f"NetCDF compression enabled with level {netcdf_compression_level}")
    comp = dict(
        zlib=True,
        complevel=netcdf_compression_level,
        contiguous=False,
        shuffle=True,
    )
    encoding = {var: comp for var in dataset.data_vars}
else:
    encoding = None

xarray_download_format = None
engine = "h5netcdf" if not netcdf3_compatible else "netcdf4"
output_path = f"./my_dataset_with_compression_{netcdf_compression_level}.nc"
dataset.to_netcdf(
    output_path,
    mode="w",
    encoding=encoding,
    format=xarray_download_format,
    engine=engine,
)

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

The sizes of the datasets are:

-rw-r--r--@  1 oricart  staff   57141803 Nov 15 13:35 my_dataset_with_compression_0.nc
-rw-r--r--@  1 oricart  staff  105577838 Nov 15 13:34 my_dataset_with_compression_1.nc
-rw-r--r--@  1 oricart  staff  104547392 Nov 15 13:35 my_dataset_with_compression_4.nc
-rw-r--r--@  1 oricart  staff  104132169 Nov 15 13:35 my_dataset_with_compression_9.nc

The file written without compression is roughly half the size of the compressed ones.



Anything else we need to know?

_No response_

Environment

<details>

INSTALLED VERSIONS
------------------
commit: None
python: 3.9.18 (main, Apr 10 2024, 12:33:50) 
[Clang 15.0.0 (clang-1500.3.9.4)]
python-bits: 64
OS: Darwin
OS-release: 23.6.0
machine: arm64
processor: arm
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.9.3-development

xarray: 2024.3.0
pandas: 2.2.2
numpy: 1.26.4
scipy: 1.13.1
netCDF4: 1.6.5
pydap: None
h5netcdf: 1.4.0
h5py: 3.12.1
Nio: None
zarr: 2.17.2
cftime: 1.6.3
nc_time_axis: None
iris: None
bottleneck: None
dask: 2024.4.1
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: 2024.3.1
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 69.2.0
pip: 24.3.1
conda: None
pytest: 8.1.1
mypy: None
IPython: 8.18.1
sphinx: 7.4.7

</details>
@uriii3 uriii3 added bug needs triage Issue that has not been reviewed by xarray team member labels Nov 15, 2024
@kmuehlbauer
Contributor

@uriii3

Please add the output of ncdump -h filename. I suspect that your data is already stored in a much more compact form (e.g. packed to integers via scale_factor/add_offset), which zlib compression of the decoded floating-point values cannot compensate for.

This might work (untested!): keep the current encoding and add zlib on top.

encoding = {name: {**var.encoding, **comp} for name, var in dataset.data_vars.items()}
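Expanding on that one-liner, here is a hedged, self-contained sketch of the merge; the dataset, variable name, and encoding values below are synthetic stand-ins for the real complevel0.nc file, used only to illustrate the dict-merge semantics:

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for complevel0.nc (names and values are illustrative)
dataset = xr.Dataset(
    {"thetao": (("time", "y"), np.zeros((4, 3), dtype="float32"))}
)
# Pretend the variable carried packed-integer encoding from the source file
dataset["thetao"].encoding = {
    "dtype": "int16",
    "scale_factor": 0.001,
    "add_offset": 21.0,
    "_FillValue": -32767,
}

comp = dict(zlib=True, complevel=1, contiguous=False, shuffle=True)

# Merge: keep each variable's on-disk encoding and layer zlib on top;
# keys in comp win on conflicts.
encoding = {
    name: {**var.encoding, **comp} for name, var in dataset.data_vars.items()
}
```

Because comp appears last in the merge, its zlib settings override any same-named keys from the stored encoding while the packing keys survive.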

@kmuehlbauer kmuehlbauer added usage question and removed bug needs triage Issue that has not been reviewed by xarray team member labels Nov 15, 2024

uriii3 commented Nov 15, 2024

With ncdump I checked and yes, the original dataset contains both scale_factor and add_offset. (Not uploading the full output because I assume that is enough.)

About your proposal: it seems to work, but I got an error when carrying the encoding over; it looks like some of the **var.encoding keys are not accepted by the h5netcdf engine.
The encoding your snippet created:

{'thetao': {'dtype': dtype('int16'), 'zlib': True, 'szip': False, 'zstd': False, 'bzip2': False, 'blosc': False, 'shuffle': True, 'complevel': 4, 'fletcher32': False, 'contiguous': False, 'chunksizes': None, 'original_shape': (7671, 1, 61, 61), '_FillValue': -32767, 'scale_factor': 0.0007324442267417908, 'add_offset': 21.0}}

The encoding I was creating:

{'thetao': {'zlib': True, 'complevel': 4, 'contiguous': False, 'shuffle': True}}

But now I need to remove the keys the backend rejects: ValueError: unexpected encoding parameters for 'h5netcdf' backend: ['szip', 'zstd', 'bzip2', 'blosc']. Although it is a bit strange that the original encoding cannot simply be kept, no?

Thank you so much for the quick reply, very much appreciated!


uriii3 commented Nov 15, 2024

I finally added a small workaround to make the code work; I hope it doesn't bother anyone (I pop the offending dict keys myself, right after creating the encoding):

for var in dataset.data_vars:
    for key in ["szip", "zstd", "bzip2", "blosc"]:
        encoding[var].pop(key, None)

It seems to work, thank you very much again and hope that you have a nice weekend 🤗
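Putting the merge and the pop together, a hedged sketch of the whole workaround (the helper name `build_encoding` and the synthetic dataset are illustrative, not part of the original code):

```python
import numpy as np
import xarray as xr

# Keys the h5netcdf backend rejected in the ValueError above
UNSUPPORTED_KEYS = ("szip", "zstd", "bzip2", "blosc")

def build_encoding(dataset, comp):
    """Merge each variable's existing encoding with comp, dropping the
    compression-selector keys that h5netcdf does not accept."""
    encoding = {}
    for name, var in dataset.data_vars.items():
        merged = {**var.encoding, **comp}
        for key in UNSUPPORTED_KEYS:
            merged.pop(key, None)
        encoding[name] = merged
    return encoding

# Illustrative usage with a synthetic dataset
ds = xr.Dataset({"thetao": (("t",), np.zeros(3, dtype="float32"))})
ds["thetao"].encoding = {"szip": False, "zstd": False, "scale_factor": 0.5}
comp = {"zlib": True, "complevel": 4, "shuffle": True, "contiguous": False}
enc = build_encoding(ds, comp)
```

The resulting dict can then be passed straight to dataset.to_netcdf(..., encoding=enc, engine="h5netcdf").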

@kmuehlbauer
Contributor

@uriii3 Glad it helped and you figured out a working solution.


uriii3 commented Nov 15, 2024

If someone else finds this issue (not really an issue but a usage question): keep in mind that, in the end, the only keys that will be accepted are the ones the backend supports. So it is safer to build a dict from the keys you want to keep than to exclude the ones you know fail (you might miss some).

This code might be more robust:

keys_to_keep = [
    "dtype",
    "blosc_shuffle",
    "shuffle",
    "chunksizes",
    "szip_coding",
    "zlib",
    "_FillValue",
    "significant_digits",
    "fletcher32",
    "contiguous",
    "szip_pixels_per_block",
    "complevel",
    "compression_opts",
    "compression",
    "quantize_mode",
    "endian",
]
encoding = {}
for var in dataset.data_vars:
    encoding[var] = {
        key: value
        for key, value in long_encoding[var].items()
        if key in keys_to_keep
    }


uriii3 commented Dec 5, 2024

Okay, one last check on my side (maybe nobody else will run into this, but):

  • It looks like the keys_to_keep list I was using is not that reliable either (the dtype in my data can be incorrect, and the file is then compressed with some loss).
  • What I finally did was keep only scale_factor and add_offset in that keys_to_keep list, which now seems to work, although it might lose some compression power (without the exact dtype or _FillValue, I guess), but at least I don't lose information in the data at any point.

Hope it helps someone if needed!
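For reference, a hedged sketch of that final, lossless-first approach (the helper name `safe_encoding` and the synthetic dataset are illustrative, not from the original code):

```python
import numpy as np
import xarray as xr

# Only the packing keys are carried over, so no data is lost on write;
# everything else (dtype, _FillValue, ...) is left for xarray to handle.
SAFE_KEYS = ("scale_factor", "add_offset")

def safe_encoding(dataset, comp):
    encoding = {}
    for name, var in dataset.data_vars.items():
        kept = {k: v for k, v in var.encoding.items() if k in SAFE_KEYS}
        encoding[name] = {**kept, **comp}
    return encoding

# Illustrative usage with a synthetic dataset
ds = xr.Dataset({"thetao": (("t",), np.zeros(3, dtype="float32"))})
ds["thetao"].encoding = {
    "dtype": "int16",
    "scale_factor": 0.001,
    "add_offset": 21.0,
}
enc = safe_encoding(ds, {"zlib": True, "complevel": 4})
```

As noted above, dropping dtype and _FillValue may cost some compression, but it avoids any risk of lossy re-packing.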
