Lazy Loading with DataArray vs. Variable #8753

Closed
dcherian opened this issue Feb 15, 2024 Discussed in #8751 · 0 comments · Fixed by #8754

Discussed in #8751

Originally posted by ilan-gold February 15, 2024
My goal is to get a dataset from a custom io-zarr backend to load lazily. But when I create a DataArray from a Variable that wraps a LazilyIndexedArray, everything is read in. Is this expected? I specifically don't want to have to use dask if possible. I have seen https://github.com/aurghs/xarray-backend-tutorial/blob/main/2.Backend_with_Lazy_Loading.ipynb but it's a little bit different.

While I have a custom backend array inheriting from ZarrArrayWrapper, this example using ZarrArrayWrapper directly still highlights the same unexpected behavior of everything being read in.

```python
import zarr
import xarray as xr
from tempfile import mkdtemp
import numpy as np
from pathlib import Path
from collections import defaultdict


class AccessTrackingStore(zarr.DirectoryStore):
    """DirectoryStore that records which store keys are read for each tracked name."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._access_count = {}
        self._accessed = defaultdict(set)

    def __getitem__(self, key):
        # Count the read against every tracked name contained in the key.
        for tracked in self._access_count:
            if tracked in key:
                self._access_count[tracked] += 1
                self._accessed[tracked].add(key)
        return super().__getitem__(key)

    def get_access_count(self, key):
        return self._access_count[key]

    def set_key_trackers(self, keys_to_track):
        if isinstance(keys_to_track, str):
            keys_to_track = [keys_to_track]
        for k in keys_to_track:
            self._access_count[k] = 0

    def get_subkeys_accessed(self, key):
        return self._accessed[key]


# Write a 1000 x 1000 array to a temporary zarr group.
orig_path = Path(mkdtemp())
z = zarr.group(orig_path / "foo.zarr")
z['array'] = np.random.randn(1000, 1000)

# Reopen the same group through the tracking store.
store = AccessTrackingStore(orig_path / "foo.zarr")
store.set_key_trackers(['array'])
z = zarr.group(store)
arr = xr.backends.zarr.ZarrArrayWrapper(z['array'])
lazy_arr = xr.core.indexing.LazilyIndexedArray(arr)

# Creating the Variable reads just `.zarray`
var = xr.Variable(('x', 'y'), lazy_arr)
print('Variable read in ', store.get_subkeys_accessed('array'))

# Creating the DataArray from it reads in everything
da = xr.DataArray(var)
print('DataArray read in ', store.get_subkeys_accessed('array'))
```
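
For readers without zarr installed, the access-tracking idea in the reproducer can be sketched with a plain `dict` subclass. This is only an illustration under stated assumptions: `TrackingDict`, its `track` method, and the keys `array/0.0` and `array/0.1` are hypothetical stand-ins, not part of zarr or xarray. It shows the difference the issue is probing, where a metadata-only read touches one key and an eager read touches all of them.

```python
from collections import defaultdict


class TrackingDict(dict):
    """Hypothetical stand-in for a chunk store; records which keys are read."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.tracked = []
        self.accessed = defaultdict(set)

    def track(self, name):
        # Start recording reads of any key that contains `name`.
        self.tracked.append(name)

    def __getitem__(self, key):
        # Same substring matching as the reproducer's AccessTrackingStore.
        for name in self.tracked:
            if name in key:
                self.accessed[name].add(key)
        return super().__getitem__(key)


# Made-up keys that mimic one metadata entry plus two chunks.
store = TrackingDict(
    {"array/.zarray": b"{}", "array/0.0": b"a", "array/0.1": b"b"}
)
store.track("array")

# A lazy consumer should only need the metadata key.
_ = store["array/.zarray"]
metadata_only = set(store.accessed["array"])

# An eager consumer reads every key, which is what the issue reports
# happening when the DataArray is constructed.
for key in list(store.keys()):
    _ = store[key]
everything = set(store.accessed["array"])

print("lazy read:", sorted(metadata_only))
print("eager read:", sorted(everything))
```

Comparing the two sets is the same check the reproducer performs with `get_subkeys_accessed('array')` before and after constructing the `DataArray`.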