I have had this issue too. I've found it very difficult to debug; my sense is that it's a dask memory leak, for which there are many issues. If anyone has other insight or reproducible examples, that would be useful for making progress on these.
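Not a fix for the underlying behaviour, but one workaround that may be worth trying is to split the write into several smaller files with `xr.save_mfdataset`, so that no single `to_netcdf` call has to push the whole dataset through at once. A rough sketch, assuming the dask-backed dataset and dimension name from the question below (`all_ds`, `new_dim`); I have not verified that this avoids the memory growth reported here:

```python
import xarray as xr

# all_ds: the stacked, dask-backed Dataset from the question below.
# Split it into slabs along the concatenation dimension; the slab size
# (10 here) is arbitrary and should be tuned to the available memory.
step = 10
slabs = [
    all_ds.isel(new_dim=slice(i, i + step))
    for i in range(0, all_ds.sizes["new_dim"], step)
]
paths = [f"huge_dataset_part_{k:03d}.nc" for k in range(len(slabs))]

# Write each slab to its own file; each save only has to realise the
# chunks belonging to that slab rather than the full dataset.
xr.save_mfdataset(slabs, paths)
```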
Hi,
This is related to thread #5367.
I have a bunch of `.nc` files inside my local directory. I can read them easily, and this loads the data perfectly: all of the `.nc` files are stacked along `new_dim`. Now I want to write this stacked dataset back to the local directory using dask chunking to avoid memory issues, something like this: `all_ds.to_netcdf("huge_dataset.nc")`.
However, my memory usage increases steadily and I get a MemoryError. Any idea why dask is not doing its magic here?
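For concreteness, a minimal sketch of the workflow described above; the reading code from the original post is not shown here, so the glob pattern and the exact `open_mfdataset` arguments are assumptions:

```python
import xarray as xr

# Assumption: the .nc files are opened lazily (one dask chunk per file)
# and concatenated along a new dimension called "new_dim".
all_ds = xr.open_mfdataset(
    "my_local_dir/*.nc",   # hypothetical directory/glob
    combine="nested",
    concat_dim="new_dim",
)

# Writing the stacked dataset back out; this is where memory grows
# steadily until a MemoryError is raised, instead of the write being
# streamed chunk by chunk.
all_ds.to_netcdf("huge_dataset.nc")
```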
INSTALLED VERSIONS
commit: None
python: 3.7.10 (default, Feb 26 2021, 18:47:35)
[GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-64-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.10.6
libnetcdf: 4.7.4
xarray: 0.18.2
pandas: 1.2.4
numpy: 1.20.2
scipy: 1.6.3
netCDF4: 1.5.6
pydap: None
h5netcdf: None
h5py: 3.2.1
Nio: None
zarr: 2.8.1
cftime: 1.4.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.2.2
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2021.04.1
distributed: 2021.04.1
matplotlib: 3.4.1
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 52.0.0.post20210125
pip: 21.0.1
conda: 4.10.1
pytest: 6.2.4
IPython: 7.22.0
sphinx: None