Hello,
I am reading a Zarr store from an S3 bucket into an xr.Dataset. I rechunk the dataset with the chunk size for two of its dimensions set to 1:
dask_chunks = {'mid_date': 11117, 'x': 1, 'y': 1}
and want to save it to a local file. It seems that to_zarr() is extremely slow when the chunk size is small. Did anybody run into a similar issue?
It takes an extremely long time (4.5 hours so far to write 5.8 MB out of 5.4 GB) to write such a chunked xr.Dataset to Zarr.
It is faster if I set
dask_chunks = {'mid_date': 11117, 'x': 10, 'y': 10}
which wrote 980 MB out of 5.4 GB in 1.5 hours, but it still took more than 8 hours to write the whole dataset to disk.
It has a reasonable runtime (tens of minutes) if I change the chunk size to a larger value:
dask_chunks = {'mid_date': 11117, 'x': 250, 'y': 250}
I am just wondering if it's an xarray, dask, or zarr issue.
Also, does anybody know whether the selected chunk size affects the access speed of the data? We are only interested in the time series of one spatial point (x, y) of our dataset at a time. Would accessing such a single spatial point's time series be affected by the chunk size the dataset is stored with on the file system?
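For reference, here is roughly the code path I am using (the S3 path, the output path, and the anonymous-access setting below are placeholders, not my exact setup):

import s3fs
import xarray as xr

# Open the Zarr store from S3 (placeholder bucket/path)
store = s3fs.S3Map('s3://my-bucket/my-store.zarr', s3=s3fs.S3FileSystem(anon=True))
ds = xr.open_zarr(store, consolidated=True)

# Rechunk so that each chunk holds the full time series of a single (x, y) point
ds = ds.chunk({'mid_date': 11117, 'x': 1, 'y': 1})

# Write to a local Zarr store -- this is the step that is extremely slow
ds.to_zarr('local_store.zarr', mode='w', consolidated=True)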
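For context, this is the kind of access pattern I mean (the variable name and coordinates are placeholders):

import xarray as xr

ds = xr.open_zarr('local_store.zarr')

# Pull the full time series for one spatial point; with chunks of
# {'mid_date': 11117, 'x': 1, 'y': 1} this point maps to a single chunk
# per variable, whereas with 250x250 chunks it sits inside a much larger block.
ts = ds['v'].sel(x=200000.0, y=-2500000.0, method='nearest').load()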
xr.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.8.8 | packaged by conda-forge | (default, Feb 20 2021, 16:12:38)
[Clang 11.0.1 ]
python-bits: 64
OS: Darwin
OS-release: 19.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.10.6
libnetcdf: 4.7.4
xarray: 0.19.0
pandas: 1.1.4
numpy: 1.20.2
scipy: None
netCDF4: 1.5.4
pydap: None
h5netcdf: 0.8.1
h5py: 3.1.0
Nio: None
zarr: 2.6.1
cftime: 1.4.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.30.0
distributed: 2.30.1
matplotlib: 3.3.2
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 52.0.0.post20210125
pip: 20.2.4
conda: None
pytest: None
IPython: 7.22.0
sphinx: None