Serialization issue with distributed, h5netcdf, and fsspec (ImplicitToExplicitIndexingAdapter) #4591
Comments
I finally found a permutation that works, which makes me think this is an fsspec error.
import gcsfs
import xarray as xr
gcs = gcsfs.GCSFileSystem()
url = 'gs://ldeo-glaciology/bedmachine/BedMachineAntarctica_2019-11-05_v01.nc'
openfile = gcs.open(url, mode='rb')
dsgcs = xr.open_dataset(openfile, chunks=3000)
dsgcs.surface.mean().compute()
|
I don't think it's fsspec, the HTTPFileSystem and file objects are known to serialise. However
(that's one of the keys I picked from the graph at random, your keys may differ) |
Can you figure out how the http version differs from the gcs version? That might hold a clue. |
OK, I can see a thing after all... please stand by |
|
The |
Thanks for your quick response to this Martin! |
OK, I think I understand what's going on. Xarray serializes arguments that should suffice to recreate/open a backend-specific file object (e.g., |
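The idea of serializing only the arguments needed to reopen a file, rather than the open handle, can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not xarray's or fsspec's actual implementation: `ReopenableFile` and its methods are hypothetical names made up for this example.

```python
import os
import pickle
import tempfile


class ReopenableFile:
    """Hypothetical wrapper: pickles the recipe for opening a file,
    never the open handle itself."""

    def __init__(self, path, mode="rb"):
        self.path = path
        self.mode = mode
        self._handle = None  # opened lazily, excluded from the pickle

    def acquire(self):
        # Open on first use, or reopen if the handle was closed.
        if self._handle is None or self._handle.closed:
            self._handle = open(self.path, self.mode)
        return self._handle

    def close(self):
        if self._handle is not None:
            self._handle.close()
            self._handle = None

    def __getstate__(self):
        # Only the arguments travel between processes.
        return {"path": self.path, "mode": self.mode}

    def __setstate__(self, state):
        self.path = state["path"]
        self.mode = state["mode"]
        self._handle = None


# Write a small file, open it, then round-trip the wrapper through pickle.
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as tmp:
    tmp.write(b"bedmachine")
    tmp_path = tmp.name

original = ReopenableFile(tmp_path)
original.acquire().read(3)  # the handle is now open and past byte 0

clone = pickle.loads(pickle.dumps(original))
payload = clone.acquire().read()  # the clone reopens the file from the start

original.close()
clone.close()
os.remove(tmp_path)
```

The clone arrives with no handle and opens its own on first use. That only works if every argument in the recipe is itself serializable, which is the property that failed in this issue.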
This is fixed by fsspec/filesystem_spec#477. However, the existence of this issue points to the need for more ecosystem-wide integration testing of xarray / dask / zarr / fsspec. I know we discussed this on some other issue, but I can't find it... |
This issue appears to be back in some form, with The code looks like this, using fsspec's mapper API to access Azure blob store:
I have not tracked down a self-contained reproducer, as it only fails for one call but not others of a similar form. Reporting it while I dig into it further, in case you have any suggestions.
|
I only have vague thoughts. To be sure: you can pickle the file-system, any mapper ( The question here is, why msgpack is being invoked. Those items, as well as any internal xarray stuff should only be in tasks, and so pickled. Is there a high-level-graph layer encapsulating things that were previously pickled? The only things that appear in any HLG-layer should be the paths and storage options needed to open a file-system, not the file-system itself. |
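One way to narrow down which object refuses to serialize is to try pickling each value in a graph-like mapping individually. The helper below is a rough sketch using only the standard library; `find_unpicklable`, the toy graph, its keys, and the `gs://bucket/file.nc` path are all made up for illustration. Passing a real dask graph to it is an assumption that has not been verified here, and it checks pickle only, not msgpack.

```python
import pickle
import threading


def find_unpicklable(graph):
    """Return {key: error message} for every value that pickle rejects."""
    failures = {}
    for key, value in graph.items():
        try:
            pickle.dumps(value)
        except Exception as exc:  # pickle can raise several exception types
            failures[key] = f"{type(exc).__name__}: {exc}"
    return failures


# Toy stand-in for a task graph; a lock is a convenient unpicklable object.
toy_graph = {
    "open-dataset-0": ("gs://bucket/file.nc", {"mode": "rb"}),
    "getitem-1": (slice(0, 3000), "surface"),
    "bad-task-2": threading.Lock(),
}

failures = find_unpicklable(toy_graph)
for key, message in failures.items():
    print(key, "->", message)
```

Only the offending key is reported, which gives a starting point for checking whether a file-system, mapper, or open file object ended up somewhere it should not be.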
I am trying to use |
This was originally reported by @jkingslake at pangeo-data/pangeo-datastore#116.
What happened:
I tried to open a netcdf file over http using fsspec and the h5netcdf engine and compute data using dask.distributed. It appears that our ImplicitToExplicitIndexingAdapter is [no longer?] serializable?
What you expected to happen:
Things would work. Indeed, I could swear this used to work with previous versions.
Minimal Complete Verifiable Example:
raises the following error
Anything else we need to know?:
One can work around this by using the netcdf4 library's new and undocumented ability to open files over http.
However, the fsspec + h5netcdf path should work!
Environment:
Output of xr.show_versions()
Also fsspec 0.8.4
cc @martindurant for fsspec integration.