Introducing FilePatternToChunks: IO with Pangeo-Forge's FilePattern interface. #31

Merged (20 commits) on Sep 22, 2021

Conversation

@alxmrs (Contributor) commented Aug 9, 2021

This is the first of a few changes that will let users read in datasets using Pangeo-Forge's `FilePattern` interface [0]. Here, users can describe how data is stored along concat and merge dimensions. This transform will read the datasets in as chunks. This module can be leveraged in pipelines to convert natively formatted datasets to Zarr.

To make use of this transform, the user will need to install `pangeo-forge-recipes` separately. This dependency is included in the test dependencies.

As of now, this transform is not exposed to the user (i.e., not included in the primary `__init__.py`). I plan to do this (and update the docs) once the module is tested and feature complete (#29).

[0]: https://pangeo-forge.readthedocs.io/en/latest/file_patterns.html

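For context, a minimal sketch of how a `FilePattern` describing such a dataset is built. The path template and dimension keys below are hypothetical, and the `FilePatternToChunks` invocation in the comment is an assumption, since the transform isn't exposed yet:

```python
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern

def make_path(time: str) -> str:
  # Hypothetical layout: one netCDF file per month.
  return f'gs://my-bucket/dataset/{time}.nc'

pattern = FilePattern(make_path, ConcatDim('time', ['2021-01', '2021-02']))

# The transform introduced in this PR would consume the pattern and emit
# (xarray_beam.Key, xarray.Dataset) chunk pairs, roughly:
#   p | FilePatternToChunks(pattern)
```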
```python
) -> Iterator[Tuple[core.Key, xarray.Dataset]]:
  """Open datasets into chunks with XArray."""
  path = self.pattern[index]
  with FileSystems().open(path) as file:
```
Member commented:

Do Beam's filesystems really all work out of the box with Xarray? If so, that's awesome!

Can you verify that it works with both netCDF3 and netCDF4 files? These would be using different underlying storage backends (scipy vs h5netcdf).

To be honest, I'm a little skeptical that this will work well. I suspect we'll end up needing to copy temporary files to local disk (but I'd love to be proven wrong!)
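A quick way to check this (a sketch, assuming local test files `example3.nc` in netCDF3 format and `example4.nc` in netCDF4 format exist):

```python
import xarray
from apache_beam.io.filesystems import FileSystems

# netCDF3 files go through the scipy backend; netCDF4 through h5netcdf.
# Calling .load() inside the `with` block ensures the data is read before
# the file handle closes.
with FileSystems().open('example3.nc') as f:   # hypothetical netCDF3 file
  ds3 = xarray.open_dataset(f, engine='scipy').load()

with FileSystems().open('example4.nc') as f:   # hypothetical netCDF4 file
  ds4 = xarray.open_dataset(f, engine='h5netcdf').load()
```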

Contributor (Author) commented:

Let me experiment and see how this works. In my tests in the previous iteration of this change, this worked well with GCS's IO objects.

Member commented:

I experimented a bit more with this based on @mjwillson's suggestion.

Amazingly, it seems that using file-like objects in Xarray does actually work as used here, though making a local copy might still have better performance.

What doesn't work yet -- but hopefully with small upstream changes to Xarray could work -- is passing xarray datasets opened with these file-like objects into a Beam pipeline. That could let us do the actual data loading from netCDF in separate workers, which could be quite a win!

Contributor (Author) commented:

It's a bit unclear to me how this would not work in a Beam pipeline (or, what needs to be done to get this win). Can you explain a bit more?

Is this a correct understanding: With the change you're referring to, we could pickle the XArray open command (with the file-like object) as PCollections, which would allow us to split the open across workers?

Member commented:

> Is this a correct understanding: With the change you're referring to, we could pickle the XArray open command (with the file-like object) as PCollections, which would allow us to split the open across workers?

With this change, we could pickle lazy xarray.Dataset objects corresponding to open netCDF files and pass them between stages in a Beam pipeline.

Some data would still need to get loaded on the worker on which xarray.open_dataset() is called, but this could be much less data than the entire file (e.g., only the "metadata" part of the file). The bulk of the loading work could be split across multiple workers, which could be quite useful for processing large (GB+) netCDF files.
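In other words (a sketch of the idea, not something that worked at the time of this comment; the file and variable names are hypothetical):

```python
import xarray

# Opening with chunks={} yields a lazy, dask-backed Dataset; only the file's
# "metadata" (dimensions, dtypes, attributes) is read at this point.
ds = xarray.open_dataset('big.nc', chunks={})  # hypothetical local file

# If this lazy Dataset could be pickled into a PCollection, the actual byte
# reading would happen on whichever worker triggers the load:
subset = ds['temperature'].isel(time=slice(0, 10)).load()  # hypothetical variable
```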

…nks.

This transform will now only open file pattern datasets as whole chunks. Re-chunking (i.e., into "sub_chunks") can be delegated to a SplitChunk() transform layered after this one.
As a backup to the `FileSystems().open(...)` method, we use fsspec to create a local copy of the data for opening with `xr.open_dataset(...)`.
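The layering might look roughly like this (a sketch; the `SplitChunks` name, its chunk-size argument, and the use of `FilePatternToChunks` as a root transform are assumptions based on xarray-beam's existing transforms):

```python
import apache_beam as beam
import xarray_beam as xbeam

with beam.Pipeline() as p:
  (
      p
      | FilePatternToChunks(pattern)      # one whole chunk per file in the pattern
      | xbeam.SplitChunks({'time': 10})   # optional finer re-chunking downstream
  )
```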
@shoyer (Member) left a comment:

Looks great, thanks Alex!

Comment on lines 106 to 120
```python
try:
  yield xarray.open_dataset(file, **self.xarray_open_kwargs)
except (TypeError, OSError) as e:
  if not self.local_copy:
    raise ValueError(f'cannot open {path!r} with buffering.') from e

  # The cfgrib engine (and others) may fail with the FileSystems method of
  # opening with BufferedReaders. Here, we open the data locally to make
  # it easier to work with XArray.
  with fsspec.open_local(
      f"simplecache::{path}",
      simplecache={'cache_storage': '/tmp/files'}
  ) as fs_file:
    yield xarray.open_dataset(fs_file, **self.xarray_open_kwargs)
```
Member commented:

Rather than using `local_copy` as a fall-back, can we just use an `if` statement?

Suggested change (the try/except above becomes a plain if/else):

```python
if self.local_copy:
  # The cfgrib engine (and others) may fail with the FileSystems method of
  # opening with BufferedReaders. Here, we open the data locally to make
  # it easier to work with XArray.
  with fsspec.open_local(
      f"simplecache::{path}",
      simplecache={'cache_storage': '/tmp/files'}
  ) as fs_file:
    yield xarray.open_dataset(fs_file, **self.xarray_open_kwargs)
else:
  yield xarray.open_dataset(file, **self.xarray_open_kwargs)
```
The old contextmanager approach wasn't applicable, since `open_local` returns a string (path to the open file).
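For example (a sketch reusing `path` and `xarray_open_kwargs` from the snippet above):

```python
import fsspec
import xarray

# fsspec.open_local() returns a local path string rather than a context
# manager, so the cached copy can be opened directly:
local_path = fsspec.open_local(
    f"simplecache::{path}",
    simplecache={'cache_storage': '/tmp/files'},
)
ds = xarray.open_dataset(local_path, **xarray_open_kwargs)
```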