
Add open_virtual_mfdataset #349

Draft · TomNicholas wants to merge 16 commits into main
Conversation

TomNicholas (Member) commented Dec 16, 2024

Here I have copied the code from xr.open_mfdataset, changed it to use open_virtual_dataset, and added an option to parallelize with lithops as an alternative to using dask.delayed.

I haven't even tried to run this yet, but I think this is the right approach, @tomwhite? I realised we don't need cubed's blockwise because xarray.open_mfdataset has internal logic to turn the N-dimensional concat into a 1D map already, so lithops.map should be fine?

Also, based on our conversation, I think we can use lithops.map instead of the lithops.map_reduce approach @thodson-usgs took in #203: the virtual datasets returned to the client are so small that a single reduction step on the client should be fine even at large scale (see also #104 for justification that we only need to send back kB-sized objects).
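For concreteness, here is a minimal sketch of the lithops path described above (illustrative only: the helper name, the call signature, and the use of combine_nested are assumptions, not the final API):

import lithops
import xarray as xr

from virtualizarr import open_virtual_dataset

def open_virtual_mfdataset_sketch(paths, concat_dim):
    # fan out: each serverless worker opens one file and builds its virtual dataset
    fn_exec = lithops.FunctionExecutor()
    futures = fn_exec.map(open_virtual_dataset, paths)

    # fan in: gather the kB-sized virtual datasets on the client in a single step
    virtual_datasets = fn_exec.get_result(futures)

    # xarray's combine logic stitches them together entirely client-side
    return xr.combine_nested(virtual_datasets, concat_dim=concat_dim)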

    datasets, closers = dask.compute(datasets, closers)
elif parallel == "lithops":

    def generate_refs(path):
TomNicholas (Member Author):
This is the equivalent of @thodson-usgs's map_references function.
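Presumably the truncated body above just wraps open_virtual_dataset, something like this (a guess at the elided code; any extra keyword arguments are omitted):

def generate_refs(path):
    # each worker returns its small virtual dataset directly to the client,
    # so the combine step can happen once, client-side, rather than via map_reduce
    return open_virtual_dataset(path)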


# wait for all the serverless workers to finish, and send their resulting virtual datasets back to the client
completed_futures, _ = fn_exec.wait(futures, download_results=True)
virtual_datasets = [future.get_result() for future in completed_futures]
TomNicholas (Member Author):

IIUC this will cause every serverless worker to send a small virtual dataset back to the client process over the internet somehow

elif combine == "by_coords":
    # Redo ordering from coordinates, ignoring how they were ordered
    # previously
    combined = combine_by_coords(
TomNicholas (Member Author):
This is only going to work if we have used loadable_variables to create indexes for 1D coordinates, so it's a good reason to implement the suggestion in #335 (comment)
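For context, creating those indexes currently means naming the 1D coordinates in loadable_variables, along these lines (file and coordinate names are illustrative):

# loading the 1D dimension coordinates gives them real pandas indexes,
# which combine_by_coords needs in order to infer the concatenation order
vds = open_virtual_dataset(
    "air1.nc",
    loadable_variables=["time", "lat", "lon"],
)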

Comment on lines 18 to 20
from xarray.backends.api import _multi_file_closer
from xarray.backends.common import _find_absolute_paths
from xarray.core.combine import _infer_concat_order_from_positions, _nested_combine
TomNicholas (Member Author):
I don't like importing these deep xarray internals like this (though _infer_concat_order_from_positions and _nested_combine haven't changed since I wrote them 6 years ago), but the only alternative would be to make a general virtualizarr backend engine for xarray (see #35).

Comment on lines +351 to +352
# lithops doesn't have a delayed primitive
open_ = open_virtual_dataset
TomNicholas (Member Author) commented Dec 16, 2024:
I think the code would be more straightforward if the parallel primitive we used for lithops were the same as the one we use for dask.
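One way to get that would be a small map-shaped shim that both backends share, so the call site looks the same either way (purely illustrative, not part of this PR):

import lithops

def lithops_compute(func, inputs):
    # eager equivalent of dask.compute(*[dask.delayed(func)(x) for x in inputs]),
    # since lithops exposes map rather than a delayed primitive
    fn_exec = lithops.FunctionExecutor()
    futures = fn_exec.map(func, inputs)
    return fn_exec.get_result(futures)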

Comment on lines +344 to +345
elif parallel == "lithops":
import lithops
TomNicholas (Member Author):

I believe all of this could also be useful upstream in xr.open_mfdataset

tomwhite (Collaborator):

> I realised we don't need cubed's blockwise because xarray.open_mfdataset has internal logic to turn the N-dimensional concat into a 1D map already, so lithops.map should be fine?

Yes, that should work fine. We may want to loosen/generalize blockwise slightly in Cubed so it can return arbitrary objects, which would allow this to be done with Cubed too, but that can come later.

> Also, based on our conversation, I think we can use lithops.map instead of the lithops.map_reduce approach @thodson-usgs took in #203: the virtual datasets returned to the client are so small that a single reduction step on the client should be fine even at large scale (see also #104 for justification that we only need to send back kB-sized objects).

Agreed - it will be interesting to see this for large datasets. (It's also similar to the approach I've taken for storing data in Icechunk where the changesets are returned to the client - again, small kB-sized UUIDs.)

Comment on lines +510 to +529
print(combined_vds)
print(expected_vds)
print(combined_vds.indexes)
print(expected_vds.indexes)
print(combined_vds["lat"].attrs)
print(expected_vds["lat"].attrs)
print(combined_vds["lat"].encoding)
print(expected_vds["lat"].encoding)
print(combined_vds["lat"].data)
print(expected_vds["lat"].data)
print(combined_vds["lat"].data.zarray)
print(expected_vds["lat"].data.zarray)
print(combined_vds["lat"].data.manifest.dict())
print(expected_vds["lat"].data.manifest.dict())

# TODO this assertion unintentionally triggers loading, see issue #354
# xrt.assert_identical(combined_vds.coords.variables['lat'], expected_vds.coords.variables['lat'])

# TODO I have no idea why this assertion fails for all the coords - everything about the coords looks identical
# xrt.assert_identical(combined_vds, expected_vds)
TomNicholas (Member Author):

I'm stuck on why this assert_identical on the datasets doesn't pass. It complains that all the coordinates are different, but every attribute I can think to check looks identical 😖
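A throwaway comparison helper along these lines might at least narrow down which piece of each coordinate differs (field access mirrors the prints above; purely a debugging sketch):

def diff_coord(name):
    a, b = combined_vds[name], expected_vds[name]
    # compare each piece that the prints above inspect by eye
    for field, equal in [
        ("attrs", a.attrs == b.attrs),
        ("encoding", a.encoding == b.encoding),
        ("zarray", a.data.zarray == b.data.zarray),
        ("manifest", a.data.manifest.dict() == b.data.manifest.dict()),
    ]:
        if not equal:
            print(f"{name}.{field} differs")

for name in combined_vds.coords:
    diff_coord(name)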

@@ -19,7 +19,7 @@ Reading
:toctree: generated/

open_virtual_dataset

open_virtual_mfdataset
TomNicholas (Member Author):

Note to self: the docs and especially the readme should be rewritten to put this function front and center.
