
Error when rechunking from Zarr store #4380

Closed

eric-czech opened this issue Aug 26, 2020 · 5 comments

Labels: plan to close (May be closeable, needs more eyeballs), topic-zarr (Related to zarr storage library)

Comments


eric-czech commented Aug 26, 2020

My assumption for this is that it should be possible to:

  1. Write to a zarr store with some chunk size along a dimension
  2. Load from that zarr store and rechunk to a multiple of that chunk size
  3. Write that result to another zarr store

However, I see this behavior instead:

import xarray as xr
import dask.array as da

ds = xr.Dataset(dict(
    x=xr.DataArray(da.random.random(size=100, chunks=10), dims='d1')
))

# Write the store
ds.to_zarr('/tmp/ds1.zarr', mode='w') 

# Read it out, rechunk it, and attempt to write it again
xr.open_zarr('/tmp/ds1.zarr').chunk(chunks=dict(d1=20)).to_zarr('/tmp/ds2.zarr', mode='w')

ValueError: Final chunk of Zarr array must be the same size or smaller than the first. 
Specified Zarr chunk encoding['chunks']=(10,), for variable named 'x' but (20, 20, 20, 20, 20) 
in the variable's Dask chunks ((20, 20, 20, 20, 20),) is incompatible with this encoding. 
Consider either rechunking using `chunk()` or instead deleting or modifying `encoding['chunks']`.
Full trace
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-...> in <module>
----> 1 xr.open_zarr('/tmp/ds1.zarr').chunk(chunks=dict(d1=20)).to_zarr('/tmp/ds2.zarr', mode='w')

/opt/conda/lib/python3.7/site-packages/xarray/core/dataset.py in to_zarr(self, store, mode, synchronizer, group, encoding, compute, consolidated, append_dim)
1656 compute=compute,
1657 consolidated=consolidated,
-> 1658 append_dim=append_dim,
1659 )
1660

/opt/conda/lib/python3.7/site-packages/xarray/backends/api.py in to_zarr(dataset, store, mode, synchronizer, group, encoding, compute, consolidated, append_dim)
1351 writer = ArrayWriter()
1352 # TODO: figure out how to properly handle unlimited_dims
-> 1353 dump_to_store(dataset, zstore, writer, encoding=encoding)
1354 writes = writer.sync(compute=compute)
1355

/opt/conda/lib/python3.7/site-packages/xarray/backends/api.py in dump_to_store(dataset, store, writer, encoder, encoding, unlimited_dims)
1126 variables, attrs = encoder(variables, attrs)
1127
-> 1128 store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
1129
1130

/opt/conda/lib/python3.7/site-packages/xarray/backends/zarr.py in store(self, variables, attributes, check_encoding_set, writer, unlimited_dims)
411 self.set_dimensions(variables_encoded, unlimited_dims=unlimited_dims)
412 self.set_variables(
--> 413 variables_encoded, check_encoding_set, writer, unlimited_dims=unlimited_dims
414 )
415

/opt/conda/lib/python3.7/site-packages/xarray/backends/zarr.py in set_variables(self, variables, check_encoding_set, writer, unlimited_dims)
466 # new variable
467 encoding = extract_zarr_variable_encoding(
--> 468 v, raise_on_invalid=check, name=vn
469 )
470 encoded_attrs = {}

/opt/conda/lib/python3.7/site-packages/xarray/backends/zarr.py in extract_zarr_variable_encoding(variable, raise_on_invalid, name)
214
215 chunks = _determine_zarr_chunks(
--> 216 encoding.get("chunks"), variable.chunks, variable.ndim, name
217 )
218 encoding["chunks"] = chunks

/opt/conda/lib/python3.7/site-packages/xarray/backends/zarr.py in _determine_zarr_chunks(enc_chunks, var_chunks, ndim, name)
154 if dchunks[-1] > zchunk:
155 raise ValueError(
--> 156 "Final chunk of Zarr array must be the same size or "
157 "smaller than the first. "
158 f"Specified Zarr chunk encoding['chunks']={enc_chunks_tuple}, "

ValueError: Final chunk of Zarr array must be the same size or smaller than the first. Specified Zarr chunk encoding['chunks']=(10,), for variable named 'x' but (20, 20, 20, 20, 20) in the variable's Dask chunks ((20, 20, 20, 20, 20),) is incompatible with this encoding. Consider either rechunking using chunk() or instead deleting or modifying encoding['chunks'].

Overwriting chunks on open_zarr with overwrite_encoded_chunks=True works, but I don't want that because it requires providing a uniform chunk size for all variables.
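For reference, that call would look something like this (a sketch; the single uniform chunk mapping is exactly the part I want to avoid):

# Sketch: overwrite the on-disk chunk encoding with the requested dask chunks
xr.open_zarr('/tmp/ds1.zarr', chunks={'d1': 20}, overwrite_encoded_chunks=True).to_zarr('/tmp/ds2.zarr', mode='w')

This workaround seems to be fine though: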

ds = xr.open_zarr('/tmp/ds1.zarr')
del ds.x.encoding['chunks']
ds.chunk(chunks=dict(d1=20)).to_zarr('/tmp/ds2.zarr', mode='w')

Does encoding['chunks'] serve any purpose after you've loaded a zarr store and all the variables are defined as dask arrays? In other words, is there any harm in deleting it from all dask variables if I want those variables to write back out to zarr using the dask chunk definitions instead (as in the sketch below)?
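A minimal sketch of deleting it everywhere, generalizing the workaround above (untested):

ds = xr.open_zarr('/tmp/ds1.zarr')
for var in ds.variables.values():
    # drop the encoded chunk size so the dask chunks win on write
    var.encoding.pop('chunks', None)
ds.chunk(chunks=dict(d1=20)).to_zarr('/tmp/ds2.zarr', mode='w')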

Environment:

Output of xr.show_versions():

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.6 | packaged by conda-forge | (default, Jun 1 2020, 18:57:50) [GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-42-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.6
libnetcdf: None

xarray: 0.16.0
pandas: 1.0.5
numpy: 1.19.0
scipy: 1.5.1
netCDF4: None
pydap: None
h5netcdf: None
h5py: 2.10.0
Nio: None
zarr: 2.4.0
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.21.0
distributed: 2.21.0
matplotlib: 3.3.0
cartopy: None
seaborn: 0.10.1
numbagg: None
pint: None
setuptools: 47.3.1.post20200616
pip: 20.1.1
conda: 4.8.2
pytest: 5.4.3
IPython: 7.15.0
sphinx: 3.2.1

@dcherian dcherian added the topic-zarr Related to zarr storage library label Aug 29, 2020
dcherian (Contributor) commented:

> Does encoding['chunks'] serve any purpose after you've loaded a zarr store and all the variables are defined as dask arrays?

No. I run into this frequently and it is annoying. @rabernat do you remember why you chose to keep chunks around in encoding?


aurghs commented Dec 27, 2020

I'm not sure, but this error seems to be a bug: there is a check on the final chunk whose inequality appears to go in the wrong direction.
The part of the code that decides which chunking should be used when both dask chunking and encoded chunking are defined is the following:

# the hard case
# DESIGN CHOICE: do not allow multiple dask chunks on a single zarr chunk
# this avoids the need to get involved in zarr synchronization / locking
# From zarr docs:
#  "If each worker in a parallel computation is writing to a separate
#   region of the array, and if region boundaries are perfectly aligned
#   with chunk boundaries, then no synchronization is required."
# TODO: incorporate synchronizer to allow writes from multiple dask
# threads
if var_chunks and enc_chunks_tuple:
    for zchunk, dchunks in zip(enc_chunks_tuple, var_chunks):
        if len(dchunks) == 1:
            continue
        for dchunk in dchunks[:-1]:
            if dchunk % zchunk:
                raise NotImplementedError(
                    f"Specified zarr chunks encoding['chunks']={enc_chunks_tuple!r} for "
                    f"variable named {name!r} would overlap multiple dask chunks {var_chunks!r}. "
                    "This is not implemented in xarray yet. "
                    "Consider either rechunking using `chunk()` or instead deleting "
                    "or modifying `encoding['chunks']`."
                )
        if dchunks[-1] > zchunk:
            raise ValueError(
                "Final chunk of Zarr array must be the same size or "
                "smaller than the first. "
                f"Specified Zarr chunk encoding['chunks']={enc_chunks_tuple}, "
                f"for variable named {name!r} "
                f"but {dchunks} in the variable's Dask chunks {var_chunks} is "
                "incompatible with this encoding. "
                "Consider either rechunking using `chunk()` or instead deleting "
                "or modifying `encoding['chunks']`."
            )

The aim of these checks, as described in the comment, is to avoid having multiple dask chunks within one zarr chunk. According to this logic, the inequality

if dchunks[-1] > zchunk:

has the wrong direction. It should instead be

if dchunks[-1] < zchunk:

though it seems to me that this last condition would then always be satisfied anyway. (In this issue's example, enc_chunks_tuple=(10,) and dchunks=(20, 20, 20, 20, 20), so dchunks[-1]=20 > zchunk=10 and the ValueError is raised even though every dask chunk is an exact multiple of the zarr chunk.)


aurghs commented Dec 27, 2020

> Does encoding['chunks'] serve any purpose after you've loaded a Zarr store and all the variables are defined as dask arrays?
>
> No. I run into this frequently and it is annoying. @rabernat do you remember why you chose to keep chunks around in encoding?

encoding["chunks"] is used in to_zarr. That seems reasonable: I would expect to be able to read and re-write a Zarr store without modifying the chunking on disk.
It seems to me that the dask chunks are used in writing only when encoding["chunks"] is not defined or is no longer compatible with the variable's shape; in all other cases encoding["chunks"] is used.
So if you want to use the encoded chunks, you have to be sure that they are still compatible with the variables' shapes and that each Zarr chunk is contained in only one dask chunk.
If you want to use the dask chunks instead, you can:

  • Delete the encoded chunking, as @eric-czech did above.
  • Pass an explicit (empty) encoding when you write: ds.to_zarr('/tmp/ds3.zarr', mode='w', encoding={'x': {}}) (see the sketch below).
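For a dataset with several variables, that second option generalizes to one empty encoding per variable (a sketch; I have only considered data variables, not coordinates):

# give every data variable an empty encoding so its dask chunks are used on write
ds.to_zarr('/tmp/ds3.zarr', mode='w', encoding={name: {} for name in ds.data_vars})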

Maybe this interface is a little confusing.
It would probably be better to move overwrite_encoded_chunks from open_dataset to to_zarr: the open_dataset interface would be cleaner, and it would be clear how to use the dask chunks in writing.
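Under that proposal, a write would read something like this (purely hypothetical; overwrite_encoded_chunks is not a to_zarr argument today):

# hypothetical API sketch, not an existing keyword argument of to_zarr
ds.chunk(dict(d1=20)).to_zarr('/tmp/ds2.zarr', mode='w', overwrite_encoded_chunks=True)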

Concerning the different chunking per variable, I link here this related issue:
#4623

mangecoeur (Contributor) commented:

Running into the same issue when I:

  1. Load input from a Zarr data source
  2. Queue some processing (delayed dask ufuncs)
  3. Re-chunk using chunk() to get the dask task size I want
  4. Use to_zarr to trigger the computation (dask distributed backend) and save to a new Zarr store on disk

I get the chunk size mismatch error, which I solve by manually overwriting the encoding['chunks'] value; that seems unintuitive to me. Since I'm going from one Zarr store to another, I assumed that calling chunk() would set the chunk size for both the dask arrays and the Zarr output, since calling to_zarr on a dask-backed array only works if the dask and Zarr encoding chunk sizes match.
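The manual overwrite looks something like this (a sketch; the variable name and chunk size follow the example at the top of this issue):

ds = ds.chunk(dict(d1=20))
# make the encoded chunks match the new dask chunks before writing
ds['x'].encoding['chunks'] = (20,)
ds.to_zarr('/tmp/ds2.zarr', mode='w')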

I didn't realize the overwrite_encoded_chunks option existed, but it's also a bit confusing that to get the right chunk size on the output I need to set an overwrite option on the input.

max-sixty (Collaborator) commented:

I think we can fold this into #6323

@max-sixty max-sixty added the plan to close May be closeable, needs more eyeballs label Nov 9, 2023