Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow other fsspec protocols than local and s3 #126

Merged

Conversation

TomAugspurger
Copy link
Contributor

@TomAugspurger TomAugspurger commented May 27, 2024

This change simplifies the handling of filepaths in _fsspec_openfile_from_filepath, and removes some restrictions around what can be passed in. Most notably, it allows the use of non-S3 and local filepaths.

I see that virtualizarr/tests/test_xarray.py::test_anon_read_s3 covers this, but that's failing for me on main with

E           aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host carbonplan-share.s3.regionone.amazonaws.com:443 ssl:default [Name or service not known]

Is that just a local configuration issue for me, or is it failing for others as well?

@norlandrhagen
Copy link
Collaborator

norlandrhagen commented May 28, 2024

Hey @TomAugspurger, I'm not getting that failure when running the tests on your branch, which is probably due to some AWS credential magic on my end. Updating the tests to run against minio or something similar would probably help with that..

@jbusecke
Copy link
Contributor

jbusecke commented Jun 3, 2024

Thanks for the PR @TomAugspurger.

I just tried this out and am running into some unexpected behavior:

I installed virtualizarr from this PR branc
!pip install git+https://github.com/TomAugspurger/VirtualiZarr.git@user/tom/feature/filesystems

For the setup

from virtualizarr import open_virtual_dataset
from virtualizarr.kerchunk import FileType

urls = [
    "http://aims3.llnl.gov/thredds/fileServer/css03_data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp126/r1i1p1f1/Amon/tas/gn/v20190710/tas_Amon_MPI-ESM1-2-HR_ssp126_r1i1p1f1_gn_201501-201912.nc",
    "http://aims3.llnl.gov/thredds/fileServer/css03_data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp126/r1i1p1f1/Amon/tas/gn/v20190710/tas_Amon_MPI-ESM1-2-HR_ssp126_r1i1p1f1_gn_202001-202412.nc",
    "http://aims3.llnl.gov/thredds/fileServer/css03_data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp126/r1i1p1f1/Amon/tas/gn/v20190710/tas_Amon_MPI-ESM1-2-HR_ssp126_r1i1p1f1_gn_202501-202912.nc",
    "http://aims3.llnl.gov/thredds/fileServer/css03_data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp126/r1i1p1f1/Amon/tas/gn/v20190710/tas_Amon_MPI-ESM1-2-HR_ssp126_r1i1p1f1_gn_203001-203412.nc",
]

Then I tried to naively do this:

vds_list = []
for url in urls:
    vds = open_virtual_dataset(
        url, indexes={}
    )
    vds_list.append(vds)

which failed with this error

--------------------------------------------------------------------------- TypeError Traceback (most recent call last) File [/srv/conda/envs/notebook/lib/python3.11/site-packages/fsspec/implementations/http.py:422](https://leap.2i2c.cloud/srv/conda/envs/notebook/lib/python3.11/site-packages/fsspec/implementations/http.py#line=421), in HTTPFileSystem._info(self, url, **kwargs) 420 try: 421 info.update( --> 422 await _file_info( 423 self.encode_url(url), 424 size_policy=policy, 425 session=session, 426 **self.kwargs, 427 **kwargs, 428 ) 429 ) 430 if info.get("size") is not None:

File /srv/conda/envs/notebook/lib/python3.11/site-packages/fsspec/implementations/http.py:831, in _file_info(url, session, size_policy, **kwargs)
830 elif size_policy == "get":
--> 831 r = await session.get(url, allow_redirects=ar, **kwargs)
832 else:

File /srv/conda/envs/notebook/lib/python3.11/site-packages/aiohttp/client.py:978, in ClientSession.get(self, url, allow_redirects, **kwargs)
976 """Perform HTTP GET request."""
977 return _RequestContextManager(
--> 978 self._request(hdrs.METH_GET, url, allow_redirects=allow_redirects, **kwargs)
979 )

TypeError: ClientSession._request() got an unexpected keyword argument 'key'

The above exception was the direct cause of the following exception:

FileNotFoundError Traceback (most recent call last)
Cell In[13], line 4
2 vds_list = []
3 for f in tqdm(urls):
----> 4 vds = open_virtual_dataset(
5 f, indexes={}
6 )
7 vds_list.append(vds)
9 #, reader_options={}

File /srv/conda/envs/notebook/lib/python3.11/site-packages/virtualizarr/xarray.py:108, in open_virtual_dataset(filepath, filetype, drop_variables, loadable_variables, indexes, virtual_array_class, reader_options)
102 return open_virtual_dataset_from_v3_store(
103 storepath=filepath, drop_variables=drop_variables, indexes=indexes
104 )
105 else:
106 # this is the only place we actually always need to use kerchunk directly
107 # TODO avoid even reading byte ranges for variables that will be dropped later anyway?
--> 108 vds_refs = kerchunk.read_kerchunk_references_from_file(
109 filepath=filepath,
110 filetype=filetype,
111 reader_options=reader_options,
112 )
113 virtual_vars = virtual_vars_from_kerchunk_refs(
114 vds_refs,
115 drop_variables=drop_variables + loadable_variables,
116 virtual_array_class=virtual_array_class,
117 )
118 ds_attrs = kerchunk.fully_decode_arr_refs(vds_refs["refs"]).get(".zattrs", {})

File /srv/conda/envs/notebook/lib/python3.11/site-packages/virtualizarr/kerchunk.py:76, in read_kerchunk_references_from_file(filepath, filetype, reader_options)
60 """
61 Read a single legacy file and return kerchunk references to its contents.
62
(...)
72 so ensure reader_options match selected Kerchunk reader arguments.
73 """
75 if filetype is None:
---> 76 filetype = _automatically_determine_filetype(
77 filepath=filepath, reader_options=reader_options
78 )
80 # if filetype is user defined, convert to FileType
81 filetype = FileType(filetype)

File /srv/conda/envs/notebook/lib/python3.11/site-packages/virtualizarr/kerchunk.py:117, in _automatically_determine_filetype(filepath, reader_options)
113 def _automatically_determine_filetype(
114 *, filepath: str, reader_options: Optional[dict] = {}
115 ) -> FileType:
116 file_extension = Path(filepath).suffix
--> 117 fpath = _fsspec_openfile_from_filepath(
118 filepath=filepath, reader_options=reader_options
119 )
121 if file_extension == ".nc":
122 # based off of: #43 (comment)
123 magic = fpath.read()

File /srv/conda/envs/notebook/lib/python3.11/site-packages/virtualizarr/utils.py:58, in _fsspec_openfile_from_filepath(filepath, reader_options)
56 # using dict merge operator to add in defaults if keys are not specified
57 storage_options = protocol_defaults | storage_options
---> 58 fpath = fsspec.filesystem(protocol, **storage_options).open(filepath)
60 return fpath

File /srv/conda/envs/notebook/lib/python3.11/site-packages/fsspec/spec.py:1298, in AbstractFileSystem.open(self, path, mode, block_size, cache_options, compression, **kwargs)
1296 else:
1297 ac = kwargs.pop("autocommit", not self._intrans)
-> 1298 f = self._open(
1299 path,
1300 mode=mode,
1301 block_size=block_size,
1302 autocommit=ac,
1303 cache_options=cache_options,
1304 **kwargs,
1305 )
1306 if compression is not None:
1307 from fsspec.compression import compr

File /srv/conda/envs/notebook/lib/python3.11/site-packages/fsspec/implementations/http.py:361, in HTTPFileSystem._open(self, path, mode, block_size, autocommit, cache_type, cache_options, size, **kwargs)
359 kw["asynchronous"] = self.asynchronous
360 kw.update(kwargs)
--> 361 size = size or self.info(path, **kwargs)["size"]
362 session = sync(self.loop, self.set_session)
363 if block_size and size:

File /srv/conda/envs/notebook/lib/python3.11/site-packages/fsspec/asyn.py:118, in sync_wrapper..wrapper(*args, **kwargs)
115 @functools.wraps(func)
116 def wrapper(*args, **kwargs):
117 self = obj or args[0]
--> 118 return sync(self.loop, func, *args, **kwargs)

File /srv/conda/envs/notebook/lib/python3.11/site-packages/fsspec/asyn.py:103, in sync(loop, func, timeout, *args, **kwargs)
101 raise FSTimeoutError from return_result
102 elif isinstance(return_result, BaseException):
--> 103 raise return_result
104 else:
105 return return_result

File /srv/conda/envs/notebook/lib/python3.11/site-packages/fsspec/asyn.py:56, in _runner(event, coro, result, timeout)
54 coro = asyncio.wait_for(coro, timeout=timeout)
55 try:
---> 56 result[0] = await coro
57 except Exception as ex:
58 result[0] = ex

File /srv/conda/envs/notebook/lib/python3.11/site-packages/fsspec/implementations/http.py:435, in HTTPFileSystem._info(self, url, **kwargs)
432 except Exception as exc:
433 if policy == "get":
434 # If get failed, then raise a FileNotFoundError
--> 435 raise FileNotFoundError(url) from exc
436 logger.debug("", exc_info=exc)
438 return {"name": url, "size": None, **info, "type": "file"}

FileNotFoundError: http://aims3.llnl.gov/thredds/fileServer/css03_data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp126/r1i1p1f1/Amon/tas/gn/v20190710/tas_Amon_MPI-ESM1-2-HR_ssp126_r1i1p1f1_gn_201501-201912.nc

TypeError: ClientSession._request() got an unexpected keyword argument 'key'

Makes me think that the protocol is not properly detected?

When I add reader_options={} it works as intended:

vds_list = []
for url in tqdm(urls):
    vds = open_virtual_dataset(
        url, indexes={}, reader_options={}
    )
    vds_list.append(vds)

I believe that the default values for reader options are basically invalidating this logic.

virtualizarr/utils.py Outdated Show resolved Hide resolved
@TomNicholas TomNicholas added the remote files Reading references from non-local files label Jun 3, 2024
@TomNicholas
Copy link
Member

Thanks for this contribution @TomAugspurger !

If @norlandrhagen is happy with this (including @jbusecke 's fix), then I am happy to merge it.

@norlandrhagen
Copy link
Collaborator

100% Thanks for the fixes @TomAugspurger and @jbusecke!

Co-authored-by: Julius Busecke <julius@ldeo.columbia.edu>
@TomNicholas TomNicholas merged commit cc97112 into zarr-developers:main Jun 5, 2024
5 checks passed
@jbusecke
Copy link
Contributor

jbusecke commented Jun 5, 2024

Oh weird. My test case is still failing after pulling from main? My fix might have not been sufficient?

@TomNicholas
Copy link
Member

Damn - is there a test/reproducer for this issue @jbusecke (that you can raise in a new issue)?

@jbusecke
Copy link
Contributor

jbusecke commented Jun 6, 2024

Yeah I have that on my list, but very busy this week, so might have to push to next week. Please ping me as needed 😆

@TomAugspurger TomAugspurger deleted the user/tom/feature/filesystems branch June 7, 2024 14:51
@jbusecke
Copy link
Contributor

@TomNicholas see #135

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
remote files Reading references from non-local files
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support opening files over http
4 participants