
Flexible backends - Harmonise zarr chunking with other backends chunking #4496

Closed
aurghs opened this issue Oct 8, 2020 · 7 comments

aurghs (Collaborator) commented Oct 8, 2020

Is your feature request related to a problem? Please describe.
In #4309 we proposed to separate the xarray and backend responsibilities, roughly as follows:

  • the backend returns a dataset;
  • xarray manages chunks and caching.

With the changes to open_dataset to also support zarr (#4187), we introduced a slightly different behavior for zarr chunking with respect to the other backends.

Behavior of all the backends except zarr

  • if chunks == {} or 'auto': dask is used with a single chunk per variable (see the sketch after the example below);
  • if the user defines chunks for only some of the dimensions, a single chunk is used along the remaining dimensions:
>>> ds = xr.open_dataset('test.nc', chunks={'x': 4})
>>> print(ds['foo'].chunks)
((4, 4, 4, 4, 4), (4,))
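For the first bullet, a doctest-style sketch assuming the same 20 × 4 'foo' variable as in the example above (sizes inferred from the output there):

>>> ds = xr.open_dataset('test.nc', chunks={})
>>> print(ds['foo'].chunks)
((20,), (4,))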

Zarr chunking behavior is very similar, but the default is different when the user does not choose a chunk size along some dimensions:

  • if chunks == {} or 'auto': in both cases the on-disk chunks are used (see the sketch after the example below);
  • if the user defines chunks for only some of the dimensions, the on-disk chunks are used along the remaining dimensions:
>>> ds = xr.open_dataset('test.zarr', engine='zarr', chunks={'x': 4})
>>> print(ds['foo'].encoding['chunks'])
(5, 2)
>>> print(ds['foo'].chunks)
((4, 4, 4, 4, 4), (2, 2))
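And for the first zarr bullet, again assuming a 20 × 4 'foo' variable with encoded chunks (5, 2):

>>> ds = xr.open_dataset('test.zarr', engine='zarr', chunks={})
>>> print(ds['foo'].chunks)
((5, 5, 5, 5), (2, 2))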

Describe the solution you'd like

We could easily extend the zarr behavior to all the backends (which, for now, don't use the field variable.encoding['chunks']): if no chunks are defined in the encoding, we use the dimension size as the default; otherwise we use the encoded chunks. For now this would not change any external behavior, but the other backends could use this interface if needed.
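A minimal sketch of this rule (the helper name is hypothetical, not part of the xarray API):

def _default_chunks(var):
    # Use the backend-provided encoded chunks when present ...
    encoded = var.encoding.get("chunks")
    if encoded is not None:
        return dict(zip(var.dims, encoded))
    # ... otherwise fall back to one chunk spanning each full dimension.
    return dict(zip(var.dims, var.shape))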
I have some additional notes:

  • The value auto is redundant because it has the same behavior as {}; we could remove one of them.
  • I would separate the concepts of "on-disk chunks" and "preferred chunking". We could use a different key in encoding, or ask the backend to expose a function that computes the preferred chunking.

One last question:

  • In the new interface of open_dataset there is a new key, imported from open_zarr: overwrite_encoded_chunks. Is it really needed? Why do we support overwriting the encoded chunks at read time? This operation can easily be done afterwards or at write time (see the sketch below).
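For context, the two alternatives look roughly like this (a rough sketch; the store path and chunk sizes are made up):

ds = xr.open_dataset("test.zarr", engine="zarr", chunks={"x": 10})

# after reading: drop or replace the stale encoded chunks before writing
del ds["foo"].encoding["chunks"]

# or: state the desired on-disk chunking explicitly at write time
ds.to_zarr("out.zarr", mode="w", encoding={"foo": {"chunks": (10, 2)}})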
aurghs changed the title from "Flexible backends - How to harmonise zarr chunking with that of the other backends" to "Flexible backends - Harmonise zarr chunking with other backends chunking" on Oct 8, 2020
aurghs (Collaborator, Author) commented Oct 9, 2020

  • The value auto is redundant because it has the same behavior as {}; we could remove one of them.

That's not completely true: with dask not installed, auto falls back to chunks=None, while {} raises an error. Probably that behavior makes sense.

shoyer (Member) commented Oct 15, 2020

With regards to overwrite_encoded_chunks=True, see #2530

aurghs (Collaborator, Author) commented Nov 2, 2020

I think we can keep discussing the xarray chunking interface here.

The chunking interface seems to be a tricky problem in xarray. Several already-implemented interfaces are involved:

  • dask: da.rechunk, da.from_array
  • xarray: xr.open_dataset
  • xarray: ds.chunk
  • xarray-zarr: xr.open_dataset(engine="zarr") (≈ xr.open_zarr)

They are similar, but there are some inconsistencies.

dask
The allowed values for chunking in dask are:

  • dictionary (or tuple)
  • integers > 0
  • -1: no chunking (along this dimension)
  • auto: allow the chunking (in this dimension) to accommodate ideal chunk sizes (default 128MiB)

The allowed values in the dictionary are: -1, auto, None (no change to the chunking along this dimension).
Note: None isn't supported outside the dictionary.
Note: if the chunking along some dimension is not specified, then the chunking along that dimension does not change (e.g. {} is equivalent to {0: None}).
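A small illustration of these rules with dask directly (the exact 'auto' sizes depend on dask's configured target chunk size; this example array is tiny, so 'auto' collapses to a single chunk):

import numpy as np
import dask.array as da

x = da.from_array(np.zeros((20, 4)), chunks=(5, 2))
print(x.rechunk({0: -1}).chunks)               # ((20,), (2, 2)) -- dim 1 left unchanged
print(x.rechunk({0: "auto", 1: None}).chunks)  # dim 0 auto-sized, dim 1 explicitly unchanged
print(x.rechunk({}).chunks)                    # ((5, 5, 5, 5), (2, 2)) -- nothing changes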

xarray: xr.open_dataset for all the engines != "zarr"
It works like dask, but None is also supported. If chunks is None then dask is not used at all.

xarray: ds.chunk
It works like dask, but None is also supported: None is equivalent to a dictionary with all values None (and to the empty dictionary).
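Following the equivalence described above, something like this should hold (assuming the same 20 × 4 'foo' variable used earlier):

>>> ds.chunk()['foo'].chunks == ds.chunk({})['foo'].chunks  # None (the default) behaves like {}
True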

xarray: xr.open_dataset(engine="zarr")
It works like dask, except that:

  • None is supported; if chunks is None then dask is not used at all.
  • If the chunking along some dimension is not specified, the encoded chunks are used.
  • auto is equivalent to the empty dictionary: the encoded chunks are used.
  • auto inside the dictionary is passed on to dask and behaves as in dask (see the examples below).
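A quick doctest-style illustration of the last two bullets, reusing the test.zarr store from the issue description (encoded chunks (5, 2) on a 20 × 4 variable; the x chunking in the second call comes from dask's 'auto' target, which collapses to one chunk for such a tiny array):

>>> xr.open_dataset('test.zarr', engine='zarr', chunks='auto')['foo'].chunks
((5, 5, 5, 5), (2, 2))
>>> xr.open_dataset('test.zarr', engine='zarr', chunks={'x': 'auto'})['foo'].chunks
((20,), (2, 2))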

Points to be discussed:

  1. auto and {}
    The main problem is how to unify the dask and xarray-zarr behavior.

    Option 1
    Maybe the encoded chunking provided by the backend can be seen simply as the current on-disk chunking of the data. According to the dask interface, if the chunks for some dimension are None or not defined in a dictionary, then the current chunking along that dimension doesn't change. From this perspective, we would have:

    • with auto, dask auto-chunking is used;
    • with -1, dask is used with a single chunk (no chunking);
    • with {}, the backend's encoded chunks are used (when available) for on-disk data (xr.open_dataset), and the current chunking is kept for already-opened datasets (ds.chunk).

    Note: ds.chunk behavior would be unchanged
    Note: xr.open_dataset would be unchanged, except for engine="zarr", since currently var.encoding["chunks"] is defined only by zarr.

    Option 2
    We could use a new, dedicated value for the encoded chunks (e.g. "encoded", name TBC). Something like:
    open_dataset(chunks="encoded")
    open_dataset(chunks={"x": "encoded", "y": 10,...})
    Both expressions could be supported.
    cons:

    • chunks="encoded": with zarr the user probably needs to specify always to use the encoded chunks.
    • chunks="encoded": the user must specify explicitly in the dictionary which dimension should be chunked with the encoded chunks, that's very inconvenient (but is it really used? @weiji14 do you have some idea about it?).
  2. None
    chunks=None should produce the same result in xr.open_dataset and ds.chunk.

@shoyer, @alexamici, @jhamman, @dcherian, @weiji14 suggestions are welcome

weiji14 (Contributor) commented Nov 4, 2020

Just a general comment on the xr.open_dataset(engine="zarr") part: I would prefer to keep or reduce the number of chunks= options (i.e. Option 1) rather than add another chunks="encoded" option.

For those who are confused, this is the current state of xr.open_mfdataset (correct me if I'm wrong):

| ⬇️ engine \ chunks ➡️ | None (default) | 'auto' | {} | -1 |
| --- | --- | --- | --- | --- |
| None (i.e. default, NetCDF) | np.ndarray | dask.Array (produces original chunks as in the NetCDF obj??) | dask.Array (rechunked into 1 chunk) | dask.Array (rechunked into 1 chunk) |
| zarr | np.ndarray | dask.Array (original chunks as in the Zarr obj) | dask.Array (original chunks as in the Zarr obj) | dask.Array (rechunked into 1 chunk + UserWarning) |

Sample code to test (run in jupyter notebook to see the dask chunk visual):

import xarray as xr
import fsspec

# Opening NetCDF: with chunks={} each variable becomes a single dask chunk
dataset: xr.Dataset = xr.open_dataset(
    "http://thredds.ucar.edu/thredds/dodsC/grib/NCEP/HRRR/CONUS_2p5km/Best", chunks={}
)
dataset.Temperature_height_above_ground.data

# Opening Zarr: with chunks={} the on-disk (encoded) Zarr chunks are used
zstore = fsspec.get_mapper(
    url="gs://cmip6/CMIP/NCAR/CESM2/historical/r9i1p1f1/Amon/tas/gn/"
)
dataset: xr.Dataset = xr.open_dataset(
    filename_or_obj=zstore,
    engine="zarr",
    chunks={},
    backend_kwargs=dict(consolidated=True),
)
dataset.tas.data

aurghs (Collaborator, Author) commented Nov 4, 2020

@weiji14 Thank you very much for your feedback. I think we should also align xr.open_mfdataset.
In the case of engine="zarr" and chunks=-1 there is a UserWarning also in xr.open_dataset, but I think it should be removed.

Maybe in the future we should consider integrating/using the dask function dask.array.core.normalize_chunks
(https://docs.dask.org/en/latest/array-api.html#dask.array.core.normalize_chunks) with the previous_chunks keyword (see comment #2530 (comment)).
It could be particularly useful for (re-)chunking that takes the previous chunks or the on-disk chunks into account, especially when the on-disk chunks are small.
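For reference, a rough sketch of that helper (the output values depend on dask's configured target chunk size):

from dask.array.core import normalize_chunks

# "auto" chunking that grows the small on-disk chunks into larger, aligned chunks
# instead of ignoring them.
print(normalize_chunks("auto", shape=(10000, 10000), dtype="float64",
                       previous_chunks=(100, 100)))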

aurghs mentioned this issue Nov 19, 2020
ravwojdyla commented Nov 23, 2020

Hi. I'm trying to find the issue closest to the problem I have, and this seems to be the best and most related one.

Say I have a zarr dataset with multiple variables Foo, Bar and Baz (and potentially many more) and two dimensions, x and y (potentially more). Both Foo and Bar are large 2-D arrays with dims x, y, while Baz is a relatively small 1-D array with dim y. I would like to read that dataset with xarray but increase the chunk sizes for x and y beyond the native zarr chunk size, and only for Foo and Bar; I would like to keep the native chunking for Baz. As far as I understand, I would currently do that with the chunks parameter to open_dataset/open_zarr, but if I pass, say, dict(x=N, y=M), that changes the chunking for all variables that use those dimensions, which isn't exactly what I need: I need the change only for Foo and Bar. Is there a way to do that? Should that be part of the "harmonisation"? One could imagine xarray accepting a dict of dicts, akin to {var: {dim: chunk_spec}}, to specify chunking for specific variables (see the sketch below).
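For concreteness, a sketch of what such a nested-dict interface could look like (hypothetical, not currently supported by xarray):

ds = xr.open_dataset(
    "store.zarr",
    engine="zarr",
    chunks={
        "Foo": {"x": 1000, "y": 1000},
        "Bar": {"x": 1000, "y": 1000},
        # "Baz" omitted: keep its native (encoded) chunking
    },
)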

Note that rechunking after reading is not what I want; I would like to specify the chunking at read time.

Let me know if you would prefer me to open a completely new issue for this.

aurghs (Collaborator, Author) commented Nov 29, 2020

@ravwojdyla I think that currently there is no way to do this, but it would be nice to have an interface that allows defining different chunks for each variable.
The main problem that I see in implementing it is keeping the `xr.open_dataset(..., chunks=)`, `ds.chunk` and `ds.chunks` interfaces backwards compatible.
A new issue for that would probably be better, since this refactor is already a little tricky and your proposal could be implemented separately.

alexamici added a commit that referenced this issue Dec 9, 2020
* modify get_chunks to align zarr chunking as described in issue #4496

* fix: maintain old open_zarr chunking interface

* add and fix tests

* black

* bugfix

* add few documentation on open_dataset chunking

* in test: re-add xafils for negative steps without dask

* Specify in reason that only zarr is expected to fail

* unify backend test negative_step with dask and without dask

* Add comment on has_dask usage

Co-authored-by: Alessandro Amici <a.amici@bopen.eu>
aurghs closed this as completed Dec 10, 2020