Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Opening http remote files still not working with defaults #135

Closed
jbusecke opened this issue Jun 11, 2024 · 5 comments · Fixed by #137
Closed

Opening http remote files still not working with defaults #135

jbusecke opened this issue Jun 11, 2024 · 5 comments · Fixed by #137

Comments

@jbusecke
Copy link
Contributor

Continuation of #126

Over at https://github.com/jbusecke/esgf-virtual-zarr-data-access I am maintaining a simple example how to virtualize CMIP6 datasets.

This is a minimal example to reproduce the bug:

from virtualizarr import open_virtual_dataset
url = "http://aims3.llnl.gov/thredds/fileServer/css03_data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp126/r1i1p1f1/Amon/tas/gn/v20190710/tas_Amon_MPI-ESM1-2-HR_ssp126_r1i1p1f1_gn_201501-201912.nc"
vds = open_virtual_dataset(
          url, indexes={},
          reader_options={}
       )

this works as intended, but if I leave out the reader_options={} it fails

from virtualizarr import open_virtual_dataset
url = "http://aims3.llnl.gov/thredds/fileServer/css03_data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp126/r1i1p1f1/Amon/tas/gn/v20190710/tas_Amon_MPI-ESM1-2-HR_ssp126_r1i1p1f1_gn_201501-201912.nc"
vds = open_virtual_dataset(
          url, indexes={},
       )
File ~/miniforge3/envs/esgf-virtual-zarr-data-access/lib/python3.11/site-packages/fsspec/implementations/http.py:361, in HTTPFileSystem._open(self, path, mode, block_size, autocommit, cache_type, cache_options, size, **kwargs) 359 kw["asynchronous"] = self.asynchronous 360 kw.update(kwargs) --> 361 size = size or self.info(path, **kwargs)["size"] 362 session = sync(self.loop, self.set_session) 363 if block_size and size:

File ~/miniforge3/envs/esgf-virtual-zarr-data-access/lib/python3.11/site-packages/fsspec/asyn.py:118, in sync_wrapper..wrapper(*args, **kwargs)
115 @functools.wraps(func)
116 def wrapper(*args, **kwargs):
117 self = obj or args[0]
--> 118 return sync(self.loop, func, *args, **kwargs)

File ~/miniforge3/envs/esgf-virtual-zarr-data-access/lib/python3.11/site-packages/fsspec/asyn.py:103, in sync(loop, func, timeout, *args, **kwargs)
101 raise FSTimeoutError from return_result
102 elif isinstance(return_result, BaseException):
--> 103 raise return_result
104 else:
105 return return_result

File ~/miniforge3/envs/esgf-virtual-zarr-data-access/lib/python3.11/site-packages/fsspec/asyn.py:56, in _runner(event, coro, result, timeout)
54 coro = asyncio.wait_for(coro, timeout=timeout)
55 try:
---> 56 result[0] = await coro
57 except Exception as ex:
58 result[0] = ex

File ~/miniforge3/envs/esgf-virtual-zarr-data-access/lib/python3.11/site-packages/fsspec/implementations/http.py:435, in HTTPFileSystem._info(self, url, **kwargs)
432 except Exception as exc:
433 if policy == "get":
434 # If get failed, then raise a FileNotFoundError
--> 435 raise FileNotFoundError(url) from exc
436 logger.debug("", exc_info=exc)
438 return {"name": url, "size": None, **info, "type": "file"}

FileNotFoundError: http://aims3.llnl.gov/thredds/fileServer/css03_data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp126/r1i1p1f1/Amon/tas/gn/v20190710/tas_Amon_MPI-ESM1-2-HR_ssp126_r1i1p1f1_gn_201501-201912.nc

I suspect that there is still something going wrong with the default keywords logic, as I raised back here?

Wondering if we should add some end to end tests with small files on every supported filesystem?

@norlandrhagen
Copy link
Collaborator

Wondering if we should add some end to end tests with small files on every supported filesystem?

The fsspec.open feels very inconsistent. Ryan went on a super deep dive looking into behavior across https/local etc.

@TomNicholas
Copy link
Member

TomNicholas commented Jun 12, 2024

The fsspec.open feels very inconsistent. Ryan went on a super deep dive looking into behavior across https/local etc.

God how annoying. Literally what is the point of fsspec if it doesn't actually specify a common behaviour across different filesystems? It really all is a symptom of this fsspec/filesystem_spec#1446.


What can we do about it in virtualizarr though? We could try to test all the filesystems we might care about (again fsspec should really be doing that...) We could maintain our own little open context manager that actually has consistent behaviour. Or perhaps we could use some other learnings from pangeo-forge?

@TomAugspurger
Copy link
Contributor

I suspect that there is still something going wrong with the default keywords logic, as I raised back #126 (comment)?

The problem is in virtualizarr.kerchunk.read_kerchunk_references_from_file:

reader_options: Optional[dict] = {
"storage_options": {"key": "", "secret": "", "anon": True}
},

That's defining reader_options with s3-specific values.

It's technically API-breaking, but I'd recommend just setting that to Optional[dict[str, Any]] = None. I don't know whether virtualizarr has a sufficiently better opinion for the default arguments that the backend (s3fs / adlfs / etc), but probably not.

@TomNicholas
Copy link
Member

Oh thanks @TomAugspurger ! I thought I fixed that bug in a commit I added to #126 but I obviously missed another occurrence of the default s3-specific values.

It would still be great to have tests that this all works on the main types of storage we care about.

@jbusecke
Copy link
Contributor Author

It would still be great to have tests that this all works on the main types of storage we care about.

I like this idea, but maybe I should take a backseat, since I have the tendency to always write massive end-to-end tests 😆.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants