Improve zarr chunks docs (#9140)
* Improve zarr chunks docs

Makes them more structured and consistent. I think this removes a mistake re the default `chunks` arg in `open_zarr` (it's not `None`, it's `auto`).

Adds a comment re performance with `chunks=None`, closing #9111
max-sixty authored Jun 22, 2024
1 parent 2645d7f commit deb2082
Showing 3 changed files with 40 additions and 23 deletions.
2 changes: 2 additions & 0 deletions doc/whats-new.rst
@@ -40,6 +40,8 @@ Bug fixes
Documentation
~~~~~~~~~~~~~

- Improvements to Zarr & chunking docs (:pull:`9139`, :pull:`9140`, :pull:`9132`)
By `Maximilian Roos <https://github.com/max-sixty>`_

Internal Changes
~~~~~~~~~~~~~~~~
43 changes: 26 additions & 17 deletions xarray/backends/api.py
@@ -425,15 +425,19 @@ def open_dataset(
is chosen based on available dependencies, with a preference for
"netcdf4". A custom backend class (a subclass of ``BackendEntrypoint``)
can also be used.
chunks : int, dict, 'auto' or None, optional
If chunks is provided, it is used to load the new dataset into dask
arrays. ``chunks=-1`` loads the dataset with dask using a single
chunk for all arrays. ``chunks={}`` loads the dataset with dask using
engine preferred chunks if exposed by the backend, otherwise with
a single chunk for all arrays. In order to reproduce the default behavior
of ``xr.open_zarr(...)`` use ``xr.open_dataset(..., engine='zarr', chunks={})``.
``chunks='auto'`` will use dask ``auto`` chunking taking into account the
engine preferred chunks. See dask chunking for more details.
chunks : int, dict, 'auto' or None, default: None
If provided, used to load the data into dask arrays.
- ``chunks="auto"`` will use dask ``auto`` chunking taking into account the
engine preferred chunks.
- ``chunks=None`` skips using dask, which is generally faster for
small arrays.
- ``chunks=-1`` loads the data with dask using a single chunk for all arrays.
- ``chunks={}`` loads the data with dask using the engine's preferred chunk
size, generally identical to the format's chunk size. If not available, a
single chunk for all arrays.
See dask chunking for more details.
cache : bool, optional
If True, cache data loaded from the underlying datastore in memory as
NumPy arrays when accessed to avoid reading from the underlying data-
@@ -631,14 +635,19 @@ def open_dataarray(
Engine to use when reading files. If not provided, the default engine
is chosen based on available dependencies, with a preference for
"netcdf4".
chunks : int, dict, 'auto' or None, optional
If chunks is provided, it is used to load the new dataset into dask
arrays. ``chunks=-1`` loads the dataset with dask using a single
chunk for all arrays. `chunks={}`` loads the dataset with dask using
engine preferred chunks if exposed by the backend, otherwise with
a single chunk for all arrays.
``chunks='auto'`` will use dask ``auto`` chunking taking into account the
engine preferred chunks. See dask chunking for more details.
chunks : int, dict, 'auto' or None, default: None
If provided, used to load the data into dask arrays.
- ``chunks='auto'`` will use dask ``auto`` chunking taking into account the
engine preferred chunks.
- ``chunks=None`` skips using dask, which is generally faster for
small arrays.
- ``chunks=-1`` loads the data with dask using a single chunk for all arrays.
- ``chunks={}`` loads the data with dask using engine preferred chunks if
exposed by the backend, otherwise with a single chunk for all arrays.
See dask chunking for more details.
cache : bool, optional
If True, cache data loaded from the underlying datastore in memory as
NumPy arrays when accessed to avoid reading from the underlying data-
18 changes: 12 additions & 6 deletions xarray/backends/zarr.py
@@ -973,12 +973,18 @@ def open_zarr(
Array synchronizer provided to zarr
group : str, optional
Group path. (a.k.a. `path` in zarr terminology.)
chunks : int or dict or tuple or {None, 'auto'}, optional
Chunk sizes along each dimension, e.g., ``5`` or
``{'x': 5, 'y': 5}``. If `chunks='auto'`, dask chunks are created
based on the variable's zarr chunks. If `chunks=None`, zarr array
data will lazily convert to numpy arrays upon access. This accepts
all the chunk specifications as Dask does.
chunks : int, dict, 'auto' or None, default: 'auto'
If provided, used to load the data into dask arrays.
- ``chunks='auto'`` will use dask ``auto`` chunking taking into account the
engine preferred chunks.
- ``chunks=None`` skips using dask, which is generally faster for
small arrays.
- ``chunks=-1`` loads the data with dask using a single chunk for all arrays.
- ``chunks={}`` loads the data with dask using engine preferred chunks if
exposed by the backend, otherwise with a single chunk for all arrays.
See dask chunking for more details.
overwrite_encoded_chunks : bool, optional
Whether to drop the zarr chunks encoded for each variable when a
dataset is loaded with specified chunk sizes (default: False)
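
The `chunks` options that the three docstrings above now list can be illustrated with a small toy model. This is a sketch only, not xarray's implementation: `resolve_chunks`, `split`, and the `shape` and `preferred` arguments are hypothetical names invented here, and dask's `'auto'` heuristic is deliberately left out because it depends on dask's own configuration.

```python
def split(length, size):
    """Split a dimension of `length` into chunks of at most `size` elements."""
    full, rest = divmod(length, size)
    return (size,) * full + ((rest,) if rest else ())


def resolve_chunks(chunks, shape, preferred=None):
    """Toy model of the documented ``chunks`` options (hypothetical helper).

    shape:     dict mapping dimension name -> dimension length
    preferred: dict mapping dimension name -> the backend's preferred chunk
               length, or None if the backend exposes no preference
    Returns None when dask is skipped, otherwise a dict mapping each
    dimension name to a tuple of chunk lengths.
    """
    if chunks is None:
        # chunks=None: skip dask entirely.
        return None
    if chunks == -1:
        # chunks=-1: a single chunk spanning each dimension.
        return {dim: (length,) for dim, length in shape.items()}
    if chunks == {}:
        # chunks={}: the engine's preferred chunks if exposed,
        # otherwise a single chunk per dimension.
        preferred = preferred or {}
        return {
            dim: split(length, preferred.get(dim, length))
            for dim, length in shape.items()
        }
    # 'auto' and explicit sizes are outside the scope of this sketch.
    raise NotImplementedError(f"chunks={chunks!r} is not modelled here")


shape = {"time": 10, "x": 4}
preferred = {"time": 4, "x": 4}  # assumed on-disk chunking, for illustration

no_dask = resolve_chunks(None, shape, preferred)
single = resolve_chunks(-1, shape, preferred)
engine = resolve_chunks({}, shape, preferred)
fallback = resolve_chunks({}, shape, None)
```

In this model `chunks={}` follows the stored layout when one is available and degrades to the `chunks=-1` result when it is not, which is the distinction the reworded docstrings draw.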