
Add example to create a virtual dataset using lithops #203

Merged

Conversation

thodson-usgs
Contributor

@thodson-usgs thodson-usgs commented Jul 29, 2024

At the suggestion of @TomNicholas, I created a simple example that uses lithops (and serverless compute) to create a virtual dataset from a list of netCDF files hosted on S3.
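In outline, the example's workflow looks roughly like the sketch below. This is illustrative only: the S3 URIs, the concat keywords, and the reducer function are placeholders rather than the exact code in the PR (the PR's script does define map_references and a file_pattern list, per the review excerpt later in this thread).

import lithops
import xarray as xr
from virtualizarr import open_virtual_dataset

# Placeholder list of netCDF files on S3; the real example retrieves this list programmatically.
file_pattern = [
    "s3://my-bucket/data/file_000.nc",
    "s3://my-bucket/data/file_001.nc",
]

def map_references(uri):
    # Open one netCDF file and return an in-memory virtual (reference-only) dataset.
    return open_virtual_dataset(uri, indexes={})

def reduce_references(results):
    # Concatenate the per-file virtual datasets into one combined virtual dataset.
    return xr.concat(results, dim="time", coords="minimal", compat="override")

fexec = lithops.FunctionExecutor()  # uses whichever serverless backend is configured
fexec.map_reduce(map_references, file_pattern, reduce_references)
combined_vds = fexec.get_result()

# Persist the combined references as a kerchunk JSON sidecar file.
combined_vds.virtualize.to_kerchunk("combined.json", format="json")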

This PR depends on the fix provided in #206 (now merged), which resolved the issue described below.

The workflow was broken in the latest version of VirtualiZarr: the example runs fine on 5d08519, but on 179bb2a it fails with ValueError: Could not convert object to NumPy datetime when I open the dataset using xarray:

Traceback (most recent call last):
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/zarr/meta.py", line 127, in decode_array_metadata
    fill_value = cls.decode_fill_value(meta["fill_value"], dtype, object_codec)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/zarr/meta.py", line 260, in decode_fill_value
    return np.array(v, dtype=dtype)[()]
           ^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: Could not convert object to NumPy datetime

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/xarray/backends/api.py", line 571, in open_dataset
    backend_ds = backend.open_dataset(
                 ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/kerchunk/xarray_backend.py", line 12, in open_dataset
    ref_ds = open_reference_dataset(
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/kerchunk/xarray_backend.py", line 46, in open_reference_dataset
    return xr.open_dataset(m, engine="zarr", consolidated=False, **open_dataset_options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/xarray/backends/api.py", line 571, in open_dataset
    backend_ds = backend.open_dataset(
                 ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/xarray/backends/zarr.py", line 1182, in open_dataset
    ds = store_entrypoint.open_dataset(
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/xarray/backends/store.py", line 43, in open_dataset
    vars, attrs = filename_or_obj.load()
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/xarray/backends/common.py", line 221, in load
    (_decode_variable_name(k), v) for k, v in self.get_variables().items()
                                              ^^^^^^^^^^^^^^^^^^^^
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/xarray/backends/zarr.py", line 563, in get_variables
    return FrozenDict(
           ^^^^^^^^^^^
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/xarray/core/utils.py", line 443, in FrozenDict
    return Frozen(dict(*args, **kwargs))
                  ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/xarray/backends/zarr.py", line 563, in <genexpr>
    return FrozenDict(
                     ^
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/zarr/hierarchy.py", line 691, in _array_iter
    yield _key if keys_only else (_key, self[key])
                                        ~~~~^^^^^
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/zarr/hierarchy.py", line 467, in __getitem__
    return Array(
           ^^^^^^
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/zarr/core.py", line 170, in __init__
    self._load_metadata()
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/zarr/core.py", line 193, in _load_metadata
    self._load_metadata_nosync()
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/zarr/core.py", line 207, in _load_metadata_nosync
    meta = self._store._metadata_class.decode_array_metadata(meta_bytes)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/thodson/micromamba/envs/cubed-dev/lib/python3.11/site-packages/zarr/meta.py", line 141, in decode_array_metadata
    raise MetadataError("error decoding metadata") from e
zarr.errors.MetadataError: error decoding metadata

@TomNicholas
Member

TomNicholas commented Jul 30, 2024

This is awesome @thodson-usgs ! Thanks for trying this out!

I'm marking this WIP, because the workflow is broken in the latest version of VirtualiZarr.

I'll use #201 to track whatever regression has occurred there, so we can talk about the cool serverless map_reduce you've done here! (which is relevant to #123)

My conception of what lithops.map_reduce is actually doing is quite fuzzy. Is it saving out any intermediate results to a storage layer? Is it doing multiple rounds of reduction (i.e. a tree-reduce)? Do you foresee any scaling issues with this approach?

@thodson-usgs
Contributor Author

Hmm. I pulled the latest version of VirtualiZarr into my testing environment, but I neglected to rebuild the runtime image, so I'll double-check that.

My conception of what lithops.map_reduce is actually doing is quite fuzzy. Is it saving out any intermediate results to a storage layer? Is it doing multiple rounds of reduction (i.e. a tree-reduce)? Do you foresee any scaling issues with this approach?

Right, I set up s3 storage for cubed, but I think this workflow runs entirely in memory. So we'll invariably hit scaling issues unless we avoid the reduce step by writing to disk during the map operation. Nevertheless, I'm really excited by how easy this was to set up, and I hope others will help improve upon it.

@thodson-usgs
Contributor Author

thodson-usgs commented Jul 30, 2024

Hmm. I pulled the latest version of VirtualiZarr into my testing environment, but I neglected to rebuild the runtime image, so I'll double-check that.

No, I rebuilt the runtime image using the latest VirtualiZarr commit, and the error persisted, so I believe it's real.

@TomNicholas
Member

Right, I set up s3 storage for cubed

You're not using cubed at all here, that's for the actual rechunking.

I think this workflow is entirely in memory.

Lithops does have the ability to persist things - did you set up the storage layer for that?

So we'll invariably hit scaling issues unless we avoid the reduce step by writing to disk during the map operation. Nevertheless, I'm really excited by how easy this was to set up, and I hope others will help improve upon it.

My plan for scaling this to arbitrary size is actually to use cubed for the virtualizarr array reduction too - see #123 (comment). I expect this to be pretty complicated to achieve though - I'm not even sure if it's possible yet.

@TomNicholas
Member

No, I rebuilt the runtime image using the latest VirtualiZarr commit, and the error persisted, so I believe it's real.

Presumably this error can be reproduced without lithops involved at all?

@thodson-usgs
Contributor Author

thodson-usgs commented Jul 30, 2024

Presumably this error can be reproduced without lithops involved at all?

Good point. I'll try that next, though something is fishy; otherwise, how did this work with previous versions?

My plan for scaling this to arbitrary size is actually to use cubed for the virtualizarr array reduction too - see #123 (comment). I expect this to be pretty complicated to achieve though - I'm not even sure if it's possible yet.

Ah, okay. So maybe my hope of simply creating a skeleton zarr, then writing the meta-chunks during the map with to_zarr(region=) is half-baked. (The current workflow writes to JSON, but I'll test cloud-optimized formats once this is running.)
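For reference, the skeleton-zarr idea mentioned above would look roughly like this with plain xarray and dask (a hypothetical sketch with dummy data and a local store; it would not apply to virtual references today, since, as noted below, no zarr reader understands a manifest yet):

import numpy as np
import xarray as xr

n_files, ny, nx = 10, 180, 360

# Template dataset with the final shape and dtype, chunked one time step per source file.
template = xr.Dataset(
    {"var": (("time", "y", "x"), np.zeros((n_files, ny, nx), dtype="float32"))}
).chunk({"time": 1})

store = "skeleton.zarr"  # local path for illustration; could be an s3:// URI

# 1. Write only the metadata (the "skeleton"); no chunk data is computed or written.
template.to_zarr(store, compute=False, mode="w")

# 2. Each map task would then write its file's slice into its own region of the store.
piece = template.isel(time=slice(0, 1)).load()
piece.to_zarr(store, region={"time": slice(0, 1)})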

@TomNicholas
Member

I'll try that next, though something is fishy; otherwise, how did this work with previous versions?

My current guess is that we simply introduced some accidental regression in virtualizarr recently. The way to find it is to (1) reproduce the error without all the lithops stuff (opening and concatenating 2 files should be enough), then (2) use git bisect to find the offending commit.
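For concreteness, a lithops-free reproduction could look something like this (a sketch only; the file names and concat dimension are placeholders):

import xarray as xr
from virtualizarr import open_virtual_dataset

# Open two of the netCDF files directly and build their virtual datasets.
vds1 = open_virtual_dataset("file_000.nc", indexes={})
vds2 = open_virtual_dataset("file_001.nc", indexes={})

# Combine and write the references, mirroring what the lithops reducer does.
combined = xr.concat([vds1, vds2], dim="time", coords="minimal", compat="override")
combined.virtualize.to_kerchunk("combined.json", format="json")

# Reopening through the kerchunk backend should surface the same MetadataError.
ds = xr.open_dataset("combined.json", engine="kerchunk")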

Ah, okay. So maybe my hope of simply creating a skeleton zarr, then writing the meta-chunks during the map with to_zarr(region=) is half-baked. (The current workflow writes to JSON, but I'll test cloud-optimized formats once this is running.)

If you write a manifest to zarr right now (i.e. a "virtual zarr store") you have no way of opening or loading the data via xarray/zarr, because no zarr reader understands what a manifest.json is yet.

My model of what we're trying to do is:

  • We need to get to one in-memory virtual xr.Dataset, on one worker, containing references to all the netCDF files. Once we have that, we can save it to kerchunk JSON / kerchunk parquet / zarr manifests / whatever and we would be done (for a large number of references we should currently save to the kerchunk parquet format; see the sketch after this list).
  • We know that that virtual dataset will fit on one worker, because we estimate the memory requirements in Performance roadmap #104 (comment).
  • Problem is that inspecting the netCDF files to generate chunk references is slow, and we want to parallelize that across many workers (i.e. call open_virtual_dataset('some_file.nc') on each worker).
  • This step is embarrassingly parallel and hence can be done nicely with serverless functions, i.e. lithops map.
  • Issue is that once we open them we have to combine them in-memory via concatenation. If we used dask we could communicate them all to one worker, but with serverless the only way to communicate is to write to persistent storage then read from that storage.
  • I believe lithops map_reduce uses a storage layer to hold intermediate results https://lithops-cloud.github.io/docs/source/design.html#computation-flow. So it does a map, then reduce, each of which must read from and write to that storage layer.
  • Using that might be enough for a 1D reduce job, especially if there aren't too many files. But to do it in N dimensions at large scale we really want an N-dimensional tree-reduce using serverless functions.
  • That's what cubed implements, for the case of N-D arrays. Then the question is how to get cubed to understand what to do with a ManifestArray instead of a np.ndarray?
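Regarding the kerchunk parquet option in the first bullet, a minimal sketch (assuming the installed VirtualiZarr supports format="parquet" in to_kerchunk, and with two files standing in for the full map/reduce result):

import xarray as xr
from virtualizarr import open_virtual_dataset

# Two files stand in for the combined result of the map/reduce step.
vds_list = [open_virtual_dataset(p, indexes={}) for p in ["file_000.nc", "file_001.nc"]]
combined_vds = xr.concat(vds_list, dim="time", coords="minimal", compat="override")

# Parquet keeps the reference store compact and partitioned when there are many chunk references.
combined_vds.virtualize.to_kerchunk("combined_refs.parquet", format="parquet")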

@thodson-usgs thodson-usgs force-pushed the lithops-to-kerchunk-example branch from 064380f to 8e301ef on July 31, 2024 19:35
@thodson-usgs thodson-usgs changed the title WIP: Add example to create a virtual dataset using lithops Add example to create a virtual dataset using lithops Jul 31, 2024
@thodson-usgs
Contributor Author

Updated with the changes made in #206. Also fine to close this example, if cubed can do this better.

Member

@TomNicholas TomNicholas left a comment


Thanks @thodson-usgs !

Also fine to close this example, if cubed can do this better.

I think doing this with Cubed will be quite involved. If this works effectively for you, then that's already awesome!

Three review comments on examples/virtualizarr-with-lithops/README.md (outdated, resolved)
thodson-usgs and others added 3 commits August 19, 2024 10:47
Co-authored-by: Tom Nicholas <tom@cworthy.org>
Co-authored-by: Tom Nicholas <tom@cworthy.org>
Co-authored-by: Tom Nicholas <tom@cworthy.org>
@TomNicholas TomNicholas merged commit 53a609f into zarr-developers:main Sep 5, 2024
8 checks passed
@TomNicholas
Member

Thank you @thodson-usgs !!

print(f"{len(file_pattern)} file paths were retrieved.")


def map_references(fil):
Collaborator


Is there a reason not to use file instead of fil?

Contributor Author


I must've copied that directly from an example. I'll need to check whether this follows some convention or is just a typo.

@douglatornell
Contributor

Just a guess... file was a reserved word in Python2. If the example code was old enough, or written by someone with habits formed in the Python2 days (I'm one of those folks 😁 ), that might explain things.

@thodson-usgs
Contributor Author

Just a guess... file was a reserved word in Python2. If the example code was old enough, or written by someone with habits formed in the Python2 days (I'm one of those folks 😁 ), that might explain things.

Ah, @douglatornell! I thought that was the case but then I didn't see file among the reserved words. Thanks for clarifying.

@douglatornell
Contributor

Yeah @thodson-usgs, I did a bit of a double take when I looked at the reserved words list and realized that file isn't reserved anymore in Python3.

@abarciauskas-bgse
Collaborator

@thodson-usgs excited to see the lithops integration!

@TomNicholas TomNicholas mentioned this pull request Dec 16, 2024
Labels: performance, references generation, usage example