Support opening files over http #121

Closed
jbusecke opened this issue May 20, 2024 · 4 comments · Fixed by #126
Labels
references generation (Reading byte ranges from archival files), remote files (Reading references from non-local files)

Comments

@jbusecke
Contributor

I have another use case and an associated feature request.
I want to prototype virtualizing some of these datasets with VirtualiZarr, but for that we need to support opening files over HTTP in addition to S3 and local paths.

Here is an MRE:

from virtualizarr import open_virtual_dataset  # MRE for issue
import fsspec
url = 'https://files.isimip.org/ISIMIP3b/InputData/climate/atmosphere/bias-adjusted/global/daily/historical/GFDL-ESM4/gfdl-esm4_r1i1p1f1_w5e5_historical_hurs_global_daily_1850_1850.nc'
vds = open_virtual_dataset(url)

Gives:

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
Cell In[14], line 4
      2 import fsspec
      3 url = 'https://files.isimip.org/ISIMIP3b/InputData/climate/atmosphere/bias-adjusted/global/daily/historical/GFDL-ESM4/gfdl-esm4_r1i1p1f1_w5e5_historical_hurs_global_daily_1850_1850.nc'
----> 4 vds = open_virtual_dataset(url)

File ~/TESTING/iri-isimip-virtual/VirtualiZarr/virtualizarr/xarray.py:108, in open_virtual_dataset(filepath, filetype, drop_variables, loadable_variables, indexes, virtual_array_class, reader_options)
    102     return open_virtual_dataset_from_v3_store(
    103         storepath=filepath, drop_variables=drop_variables, indexes=indexes
    104     )
    105 else:
    106     # this is the only place we actually always need to use kerchunk directly
    107     # TODO avoid even reading byte ranges for variables that will be dropped later anyway?
--> 108     vds_refs = kerchunk.read_kerchunk_references_from_file(
    109         filepath=filepath,
    110         filetype=filetype,
    111     )
    112     virtual_vars = virtual_vars_from_kerchunk_refs(
    113         vds_refs,
    114         drop_variables=drop_variables + loadable_variables,
    115         virtual_array_class=virtual_array_class,
    116     )
    117     ds_attrs = kerchunk.fully_decode_arr_refs(vds_refs["refs"]).get(".zattrs", {})

File ~/TESTING/iri-isimip-virtual/VirtualiZarr/virtualizarr/kerchunk.py:76, in read_kerchunk_references_from_file(filepath, filetype, reader_options)
     60 """
     61 Read a single legacy file and return kerchunk references to its contents.
     62 
   (...)
     72     so ensure reader_options match selected Kerchunk reader arguments.
     73 """
     75 if filetype is None:
---> 76     filetype = _automatically_determine_filetype(
     77         filepath=filepath, reader_options=reader_options
     78     )
     80 # if filetype is user defined, convert to FileType
     81 filetype = FileType(filetype)

File ~/TESTING/iri-isimip-virtual/VirtualiZarr/virtualizarr/kerchunk.py:117, in _automatically_determine_filetype(filepath, reader_options)
    113 def _automatically_determine_filetype(
    114     *, filepath: str, reader_options: Optional[dict] = {}
    115 ) -> FileType:
    116     file_extension = Path(filepath).suffix
--> 117     fpath = _fsspec_openfile_from_filepath(
    118         filepath=filepath, reader_options=reader_options
    119     )
    121     if file_extension == ".nc":
    122         # based off of: https://github.com/TomNicholas/VirtualiZarr/pull/43#discussion_r1543415167
    123         magic = fpath.read()

File ~/TESTING/iri-isimip-virtual/VirtualiZarr/virtualizarr/utils.py:60, in _fsspec_openfile_from_filepath(filepath, reader_options)
     57     fpath = fsspec.filesystem(protocol, **storage_options).open(filepath)
     59 else:
---> 60     raise NotImplementedError(
     61         "Only local and s3 file protocols are currently supported"
     62     )
     64 return fpath

NotImplementedError: Only local and s3 file protocols are currently supported

It seems that we might just have to add a new case to _fsspec_openfile_from_filepath (virtualizarr/utils.py) to make this work?
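
As a rough sketch of what such a case could look like (not the project's actual fix): derive the protocol from the path and let fsspec choose the filesystem, rather than special-casing local and S3. The helper name `open_any_protocol` is hypothetical; `fsspec.core.split_protocol` and `fsspec.filesystem` are existing fsspec calls. The demo writes to fsspec's in-memory filesystem so it runs without network access. An `https://` URL should take the same code path, assuming `aiohttp` is installed for fsspec's HTTP backend.

```python
import fsspec
from fsspec.core import split_protocol


def open_any_protocol(filepath, storage_options=None):
    """Hypothetical helper: open `filepath` with the fsspec filesystem matching its protocol."""
    protocol, _ = split_protocol(filepath)
    # Paths without a scheme (e.g. "/tmp/a.nc") are treated as local files.
    protocol = protocol or "file"
    fs = fsspec.filesystem(protocol, **(storage_options or {}))
    return fs.open(filepath, "rb")


# Offline demo: write a few placeholder bytes to the in-memory filesystem, then read them back.
with fsspec.open("memory://demo.nc", "wb") as f:
    f.write(b"CDF\x01 placeholder bytes, not a real netCDF file")

with open_any_protocol("memory://demo.nc") as f:
    header = f.read(3)
```

Whether empty storage options are a sensible default for every protocol is a separate question that this sketch does not settle.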

@TomNicholas TomNicholas added the references generation Reading byte ranges from archival files label May 20, 2024
@norlandrhagen
Collaborator

@jbusecke or anyone else wanna take this one? Don't wanna step on any toes.

Probably related headache issue:

fsspec/filesystem_spec#579

@TomAugspurger
Contributor

TomAugspurger commented May 20, 2024

I was messing around with this last week. The only tricky thing is the reader_options (I think the current defaults might not be generally appropriate).

diff --git a/virtualizarr/utils.py b/virtualizarr/utils.py
index 6ba7105..f332163 100644
--- a/virtualizarr/utils.py
+++ b/virtualizarr/utils.py
@@ -2,6 +2,8 @@ from __future__ import annotations
 
 from typing import TYPE_CHECKING, Optional
 
+from pandas import read_csv
+
 if TYPE_CHECKING:
     from fsspec.implementations.local import LocalFileOpener
     from s3fs.core import S3File
@@ -54,11 +56,15 @@ def _fsspec_openfile_from_filepath(
             # using dict merge operator to add in defaults if keys are not specified
             storage_options = s3_anon_defaults | storage_options
 
-        fpath = fsspec.filesystem(protocol, **storage_options).open(filepath)
-
     else:
-        raise NotImplementedError(
-            "Only local and s3 file protocols are currently supported"
+        storage_options = (
+            reader_options.get("storage_options") if reader_options else {}
         )
+    fpath = fsspec.filesystem(protocol, **storage_options).open(filepath)
+
+    # else:
+    #     raise NotImplementedError(
+    #         "Only local and s3 file protocols are currently supported"
+    #     )
 
     return fpath
diff --git a/virtualizarr/xarray.py b/virtualizarr/xarray.py
index 80fce37..9ea0170 100644
--- a/virtualizarr/xarray.py
+++ b/virtualizarr/xarray.py
@@ -108,6 +108,7 @@ def open_virtual_dataset(
         vds_refs = kerchunk.read_kerchunk_references_from_file(
             filepath=filepath,
             filetype=filetype,
+            reader_options=reader_options,
         )
         virtual_vars = virtual_vars_from_kerchunk_refs(
             vds_refs,

With that diff applied, I think I got HTTPS URLs (to Azure Blob Storage) working:

import virtualizarr
import planetary_computer
import xarray as xr
import virtualizarr.kerchunk

urls = [
    "https://nasagddp.blob.core.windows.net/nex-gddp-cmip6/NEX/GDDP-CMIP6/UKESM1-0-LL/ssp585/r1i1p1f2/tas/tas_day_UKESM1-0-LL_ssp585_r1i1p1f2_gn_2100.nc",
    "https://nasagddp.blob.core.windows.net/nex-gddp-cmip6/NEX/GDDP-CMIP6/UKESM1-0-LL/ssp585/r1i1p1f2/pr/pr_day_UKESM1-0-LL_ssp585_r1i1p1f2_gn_2100.nc"
]

urls = [planetary_computer.sign(url) for url in urls]

ds1 = virtualizarr.open_virtual_dataset(urls[0], reader_options={"storage_options": {}}, filetype=virtualizarr.kerchunk.FileType.netcdf4)

ds2 = virtualizarr.open_virtual_dataset(urls[1], reader_options={"storage_options": {}}, filetype=virtualizarr.kerchunk.FileType.netcdf4)

ds = xr.combine_by_coords([ds1, ds2], join="exact", combine_attrs="drop_conflicts")
ds
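
Both calls above pass `reader_options={"storage_options": {}}` explicitly. With the diff as written, a non-empty `reader_options` that lacks a `storage_options` key would make `reader_options.get("storage_options")` return `None`, and unpacking `**None` raises a `TypeError`. A more defensive variant might look like the sketch below. `resolve_storage_options` and `PROTOCOL_DEFAULTS` are hypothetical names, and the anonymous-access S3 default is only an assumption about what `s3_anon_defaults` holds.

```python
from typing import Optional

# Hypothetical per-protocol defaults; the S3 entry is an assumption,
# not necessarily the values VirtualiZarr uses.
PROTOCOL_DEFAULTS = {"s3": {"anon": True}}


def resolve_storage_options(protocol: str, reader_options: Optional[dict] = None) -> dict:
    """Return the storage options for `protocol`; the result is always a dict, never None."""
    user_options = (reader_options or {}).get("storage_options") or {}
    # User-supplied keys take precedence over the protocol defaults.
    return {**PROTOCOL_DEFAULTS.get(protocol, {}), **user_options}


https_options = resolve_storage_options("https", {"some_other_key": 1})
s3_options = resolve_storage_options("s3")
```

Keeping defaults per protocol means S3-specific keys are never handed to an HTTP filesystem, which may be one way to address the concern about the current defaults.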

@jbusecke
Contributor Author

That's super nice @TomAugspurger. Could you make a PR out of that branch of yours? I'd love to try this for some testing, and just want to avoid duplication.

@TomAugspurger
Contributor

TomAugspurger commented May 21, 2024 via email

@TomNicholas TomNicholas added the remote files Reading references from non-local files label Jun 3, 2024