Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize writes to existing Zarr stores. #8875

Merged
merged 4 commits into from
Mar 29, 2024

Conversation

dcherian
Copy link
Contributor

@dcherian dcherian commented Mar 25, 2024

We need to read existing variables to make sure we append or write to a region with the right encoding. Currently we decode all arrays in a Zarr group. Instead only decode those arrays for which we require encoding information.

We need to read existing variables to make sure we append or write to a
region with the right encoding. Currently we request all arrays in a
Zarr group. Instead only request those arrays for which we require
encoding information.
@dcherian dcherian force-pushed the optimize-zarr-appends branch from 9bc43dd to eb37aed Compare March 25, 2024 15:44
@dcherian dcherian requested review from shoyer and max-sixty and removed request for shoyer March 25, 2024 17:48
@@ -623,7 +623,12 @@ def store(
# avoid needing to load index variables into memory.
# TODO: consider making loading indexes lazy again?
existing_vars, _, _ = conventions.decode_cf_variables(
self.get_variables(), self.get_attrs()
{
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

feels like we should also be skipping this for mode="w" to allow overwriting the existing encoding?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you expand on what you mean? Is that because we would have just written these vars so already know their values/schema?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mode="w" means overwrite so we shoudn't care about encoding on disk, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a test, the encoding does get update as expected with mode="w", so presumably zarr is nuking the store with mode="w".

@dcherian dcherian added the plan to merge Final call for comments label Mar 28, 2024
@dcherian dcherian merged commit 5bf2cf4 into pydata:main Mar 29, 2024
29 checks passed
@dcherian dcherian deleted the optimize-zarr-appends branch March 29, 2024 14:35
dcherian added a commit to dcherian/xarray that referenced this pull request Apr 2, 2024
* main: (26 commits)
  [pre-commit.ci] pre-commit autoupdate (pydata#8900)
  Bump the actions group with 1 update (pydata#8896)
  New empty whatsnew entry (pydata#8899)
  Update reference to 'Weighted quantile estimators' (pydata#8898)
  2024.03.0: Add whats-new (pydata#8891)
  Add typing to test_groupby.py (pydata#8890)
  Avoid in-place multiplication of a large value to an array with small integer dtype (pydata#8867)
  Check for aligned chunks when writing to existing variables (pydata#8459)
  Add dt.date to plottable types (pydata#8873)
  Optimize writes to existing Zarr stores. (pydata#8875)
  Allow multidimensional variable with same name as dim when constructing dataset via coords (pydata#8886)
  Don't allow overwriting indexes with region writes (pydata#8877)
  Migrate datatree.py module into xarray.core. (pydata#8789)
  warn and return bytes undecoded in case of UnicodeDecodeError in h5netcdf-backend (pydata#8874)
  groupby: Dispatch quantile to flox. (pydata#8720)
  Opt out of auto creating index variables (pydata#8711)
  Update docs on view / copies (pydata#8744)
  Handle .oindex and .vindex for the PandasMultiIndexingAdapter and PandasIndexingAdapter (pydata#8869)
  numpy 2.0 copy-keyword and trapz vs trapezoid (pydata#8865)
  upstream-dev CI: Fix interp and cumtrapz (pydata#8861)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
plan to merge Final call for comments
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants