
Error when rechunking from Zarr store #4380

Closed

eric-czech opened this issue Aug 26, 2020 · 5 comments

Labels: plan to close (May be closeable, needs more eyeballs), topic-zarr (Related to zarr storage library)

Comments


eric-czech commented Aug 26, 2020

My assumption for this is that it should be possible to:

  1. Write to a zarr store with some chunk size along a dimension
  2. Load from that zarr store and rechunk to a multiple of that chunk size
  3. Write that result to another zarr store

However, I see this behavior instead:

import xarray as xr
import dask.array as da

ds = xr.Dataset(dict(
    x=xr.DataArray(da.random.random(size=100, chunks=10), dims='d1')
))

# Write the store
ds.to_zarr('/tmp/ds1.zarr', mode='w') 

# Read it out, rechunk it, and attempt to write it again
xr.open_zarr('/tmp/ds1.zarr').chunk(chunks=dict(d1=20)).to_zarr('/tmp/ds2.zarr', mode='w')

ValueError: Final chunk of Zarr array must be the same size or smaller than the first. 
Specified Zarr chunk encoding['chunks']=(10,), for variable named 'x' but (20, 20, 20, 20, 20) 
in the variable's Dask chunks ((20, 20, 20, 20, 20),) is incompatible with this encoding. 
Consider either rechunking using `chunk()` or instead deleting or modifying `encoding['chunks']`.
Full trace
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-...> in <module>
----> 1 xr.open_zarr('/tmp/ds1.zarr').chunk(chunks=dict(d1=20)).to_zarr('/tmp/ds2.zarr', mode='w')

/opt/conda/lib/python3.7/site-packages/xarray/core/dataset.py in to_zarr(self, store, mode, synchronizer, group, encoding, compute, consolidated, append_dim)
1656 compute=compute,
1657 consolidated=consolidated,
-> 1658 append_dim=append_dim,
1659 )
1660

/opt/conda/lib/python3.7/site-packages/xarray/backends/api.py in to_zarr(dataset, store, mode, synchronizer, group, encoding, compute, consolidated, append_dim)
1351 writer = ArrayWriter()
1352 # TODO: figure out how to properly handle unlimited_dims
-> 1353 dump_to_store(dataset, zstore, writer, encoding=encoding)
1354 writes = writer.sync(compute=compute)
1355

/opt/conda/lib/python3.7/site-packages/xarray/backends/api.py in dump_to_store(dataset, store, writer, encoder, encoding, unlimited_dims)
1126 variables, attrs = encoder(variables, attrs)
1127
-> 1128 store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
1129
1130

/opt/conda/lib/python3.7/site-packages/xarray/backends/zarr.py in store(self, variables, attributes, check_encoding_set, writer, unlimited_dims)
411 self.set_dimensions(variables_encoded, unlimited_dims=unlimited_dims)
412 self.set_variables(
--> 413 variables_encoded, check_encoding_set, writer, unlimited_dims=unlimited_dims
414 )
415

/opt/conda/lib/python3.7/site-packages/xarray/backends/zarr.py in set_variables(self, variables, check_encoding_set, writer, unlimited_dims)
466 # new variable
467 encoding = extract_zarr_variable_encoding(
--> 468 v, raise_on_invalid=check, name=vn
469 )
470 encoded_attrs = {}

/opt/conda/lib/python3.7/site-packages/xarray/backends/zarr.py in extract_zarr_variable_encoding(variable, raise_on_invalid, name)
214
215 chunks = _determine_zarr_chunks(
--> 216 encoding.get("chunks"), variable.chunks, variable.ndim, name
217 )
218 encoding["chunks"] = chunks

/opt/conda/lib/python3.7/site-packages/xarray/backends/zarr.py in _determine_zarr_chunks(enc_chunks, var_chunks, ndim, name)
154 if dchunks[-1] > zchunk:
155 raise ValueError(
--> 156 "Final chunk of Zarr array must be the same size or "
157 "smaller than the first. "
158 f"Specified Zarr chunk encoding['chunks']={enc_chunks_tuple}, "

ValueError: Final chunk of Zarr array must be the same size or smaller than the first. Specified Zarr chunk encoding['chunks']=(10,), for variable named 'x' but (20, 20, 20, 20, 20) in the variable's Dask chunks ((20, 20, 20, 20, 20),) is incompatible with this encoding. Consider either rechunking using chunk() or instead deleting or modifying encoding['chunks'].

Overwriting chunks on open_zarr with overwrite_encoded_chunks=True works, but I don't want that because it requires providing a uniform chunk size for all variables.
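For reference, that call would look something like this (a sketch; the single uniform chunk mapping is exactly the part I want to avoid):

# Sketch: overwrite the on-disk chunk encoding with the requested dask chunks
xr.open_zarr('/tmp/ds1.zarr', chunks={'d1': 20}, overwrite_encoded_chunks=True).to_zarr('/tmp/ds2.zarr', mode='w')

This workaround seems to be fine though: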

ds = xr.open_zarr('/tmp/ds1.zarr')
del ds.x.encoding['chunks']
ds.chunk(chunks=dict(d1=20)).to_zarr('/tmp/ds2.zarr', mode='w')

Does encoding['chunks'] serve any purpose after you've loaded a zarr store and all the variables are defined as dask arrays? In other words, is there any harm in deleting it from all dask variables if I want those variables to write back out to zarr using the dask chunk definitions instead (as in the sketch below)?
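A minimal sketch of deleting it everywhere, generalizing the workaround above (untested):

ds = xr.open_zarr('/tmp/ds1.zarr')
for var in ds.variables.values():
    # drop the encoded chunk size so the dask chunks win on write
    var.encoding.pop('chunks', None)
ds.chunk(chunks=dict(d1=20)).to_zarr('/tmp/ds2.zarr', mode='w')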

Environment:

Output of xr.show_versions():

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.6 | packaged by conda-forge | (default, Jun 1 2020, 18:57:50) [GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-42-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.6
libnetcdf: None

xarray: 0.16.0
pandas: 1.0.5
numpy: 1.19.0
scipy: 1.5.1
netCDF4: None
pydap: None
h5netcdf: None
h5py: 2.10.0
Nio: None
zarr: 2.4.0
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.21.0
distributed: 2.21.0
matplotlib: 3.3.0
cartopy: None
seaborn: 0.10.1
numbagg: None
pint: None
setuptools: 47.3.1.post20200616
pip: 20.1.1
conda: 4.8.2
pytest: 5.4.3
IPython: 7.15.0
sphinx: 3.2.1

@dcherian dcherian added the topic-zarr Related to zarr storage library label Aug 29, 2020
dcherian (Contributor) commented:

> Does encoding['chunks'] serve any purpose after you've loaded a zarr store and all the variables are defined as dask arrays?

No. I run into this frequently and it is annoying. @rabernat do you remember why you chose to keep chunks around in encoding?


aurghs commented Dec 27, 2020

I'm not sure, but this error seems to be a bug: there is a check on the final chunk whose inequality appears to go in the wrong direction.
The part of the code that decides which chunking should be used when both dask chunking and encoded chunking are defined is the following:

# the hard case
# DESIGN CHOICE: do not allow multiple dask chunks on a single zarr chunk
# this avoids the need to get involved in zarr synchronization / locking
# From zarr docs:
#  "If each worker in a parallel computation is writing to a separate
#   region of the array, and if region boundaries are perfectly aligned
#   with chunk boundaries, then no synchronization is required."
# TODO: incorporate synchronizer to allow writes from multiple dask
# threads
if var_chunks and enc_chunks_tuple:
    for zchunk, dchunks in zip(enc_chunks_tuple, var_chunks):
        if len(dchunks) == 1:
            continue
        for dchunk in dchunks[:-1]:
            if dchunk % zchunk:
                raise NotImplementedError(
                    f"Specified zarr chunks encoding['chunks']={enc_chunks_tuple!r} for "
                    f"variable named {name!r} would overlap multiple dask chunks {var_chunks!r}. "
                    "This is not implemented in xarray yet. "
                    "Consider either rechunking using `chunk()` or instead deleting "
                    "or modifying `encoding['chunks']`."
                )
        if dchunks[-1] > zchunk:
            raise ValueError(
                "Final chunk of Zarr array must be the same size or "
                "smaller than the first. "
                f"Specified Zarr chunk encoding['chunks']={enc_chunks_tuple}, "
                f"for variable named {name!r} "
                f"but {dchunks} in the variable's Dask chunks {var_chunks} is "
                "incompatible with this encoding. "
                "Consider either rechunking using `chunk()` or instead deleting "
                "or modifying `encoding['chunks']`."
            )

The aim of these checks, as described in the comment, is to avoid having multiple dask chunks within one zarr chunk. According to this logic, the inequality

if dchunks[-1] > zchunk:

has the wrong direction. It should instead be

if dchunks[-1] < zchunk:

though it seems to me that this last condition would then always be satisfied anyway. (In this issue's example, enc_chunks_tuple=(10,) and dchunks=(20, 20, 20, 20, 20), so dchunks[-1]=20 > zchunk=10 and the ValueError is raised even though every dask chunk is an exact multiple of the zarr chunk.)


aurghs commented Dec 27, 2020

> Does encoding['chunks'] serve any purpose after you've loaded a Zarr store and all the variables are defined as dask arrays?
>
> No. I run into this frequently and it is annoying. @rabernat do you remember why you chose to keep chunks around in encoding?

encoding["chunks"] is used in to_zarr. That seems reasonable: I would expect to be able to read and re-write a Zarr store without modifying the chunking on disk.
It seems to me that the dask chunks are used in writing only when encoding["chunks"] is not defined or is no longer compatible with the variable's shape; in all other cases encoding["chunks"] is used.
So if you want to use the encoded chunks, you have to be sure that they are still compatible with the variables' shapes and that each Zarr chunk is contained in only one dask chunk.
If you want to use the dask chunks instead, you can:

  • Delete the encoded chunking, as @eric-czech did above.
  • Pass an explicit (empty) encoding when you write: ds.to_zarr('/tmp/ds3.zarr', mode='w', encoding={'x': {}}) (see the sketch below).
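For a dataset with several variables, that second option generalizes to one empty encoding per variable (a sketch; I have only considered data variables, not coordinates):

# give every data variable an empty encoding so its dask chunks are used on write
ds.to_zarr('/tmp/ds3.zarr', mode='w', encoding={name: {} for name in ds.data_vars})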

Maybe this interface is a little confusing.
It would probably be better to move overwrite_encoded_chunks from open_dataset to to_zarr: the open_dataset interface would be cleaner, and it would be clear how to use the dask chunks in writing.
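Under that proposal, a write would read something like this (purely hypothetical; overwrite_encoded_chunks is not a to_zarr argument today):

# hypothetical API sketch, not an existing keyword argument of to_zarr
ds.chunk(dict(d1=20)).to_zarr('/tmp/ds2.zarr', mode='w', overwrite_encoded_chunks=True)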

Concerning the different chunking per variable, I link here this related issue:
#4623

mangecoeur (Contributor) commented:

Running into the same issue when I:

  1. Load input from a Zarr data source
  2. Queue some processing (delayed dask ufuncs)
  3. Re-chunk using chunk() to get the dask task size I want
  4. Use to_zarr to trigger the computation (dask distributed backend) and save to a new Zarr store on disk

I get the chunk size mismatch error, which I solve by manually overwriting the encoding['chunks'] value; that seems unintuitive to me. Since I'm going from one Zarr store to another, I assumed that calling chunk() would set the chunk size for both the dask arrays and the Zarr output, since calling to_zarr on a dask-backed array only works if the dask and Zarr encoding chunk sizes match.
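The manual overwrite looks something like this (a sketch; the variable name and chunk size follow the example at the top of this issue):

ds = ds.chunk(dict(d1=20))
# make the encoded chunks match the new dask chunks before writing
ds['x'].encoding['chunks'] = (20,)
ds.to_zarr('/tmp/ds2.zarr', mode='w')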

I didn't realize the overwrite_encoded_chunks option existed, but it's also a bit confusing that to get the right chunk size on the output I need to set an overwrite option on the input.

max-sixty (Collaborator) commented:

I think we can fold this into #6323

@max-sixty max-sixty added the plan to close May be closeable, needs more eyeballs label Nov 9, 2023