Support opening files over http #121

jbusecke · 2024-05-20T14:51:09Z

I have another use case and an associated feature request.
I want to prototype virtualizing some of these datasets, but for that we need to support opening files via the HTTP protocol in addition to S3 and local.

Here is an MRE:

from virtualizarr import open_virtual_dataset  # MRE for issue
import fsspec
url = 'https://files.isimip.org/ISIMIP3b/InputData/climate/atmosphere/bias-adjusted/global/daily/historical/GFDL-ESM4/gfdl-esm4_r1i1p1f1_w5e5_historical_hurs_global_daily_1850_1850.nc'
vds = open_virtual_dataset(url)

Gives:

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
Cell In[14], line 4
      2 import fsspec
      3 url = 'https://files.isimip.org/ISIMIP3b/InputData/climate/atmosphere/bias-adjusted/global/daily/historical/GFDL-ESM4/gfdl-esm4_r1i1p1f1_w5e5_historical_hurs_global_daily_1850_1850.nc'
----> 4 vds = open_virtual_dataset(url)

File ~/TESTING/iri-isimip-virtual/VirtualiZarr/virtualizarr/xarray.py:108, in open_virtual_dataset(filepath, filetype, drop_variables, loadable_variables, indexes, virtual_array_class, reader_options)
    102     return open_virtual_dataset_from_v3_store(
    103         storepath=filepath, drop_variables=drop_variables, indexes=indexes
    104     )
    105 else:
    106     # this is the only place we actually always need to use kerchunk directly
    107     # TODO avoid even reading byte ranges for variables that will be dropped later anyway?
--> 108     vds_refs = kerchunk.read_kerchunk_references_from_file(
    109         filepath=filepath,
    110         filetype=filetype,
    111     )
    112     virtual_vars = virtual_vars_from_kerchunk_refs(
    113         vds_refs,
    114         drop_variables=drop_variables + loadable_variables,
    115         virtual_array_class=virtual_array_class,
    116     )
    117     ds_attrs = kerchunk.fully_decode_arr_refs(vds_refs["refs"]).get(".zattrs", {})

File ~/TESTING/iri-isimip-virtual/VirtualiZarr/virtualizarr/kerchunk.py:76, in read_kerchunk_references_from_file(filepath, filetype, reader_options)
     60 """
     61 Read a single legacy file and return kerchunk references to its contents.
     62 
   (...)
     72     so ensure reader_options match selected Kerchunk reader arguments.
     73 """
     75 if filetype is None:
---> 76     filetype = _automatically_determine_filetype(
     77         filepath=filepath, reader_options=reader_options
     78     )
     80 # if filetype is user defined, convert to FileType
     81 filetype = FileType(filetype)

File ~/TESTING/iri-isimip-virtual/VirtualiZarr/virtualizarr/kerchunk.py:117, in _automatically_determine_filetype(filepath, reader_options)
    113 def _automatically_determine_filetype(
    114     *, filepath: str, reader_options: Optional[dict] = {}
    115 ) -> FileType:
    116     file_extension = Path(filepath).suffix
--> 117     fpath = _fsspec_openfile_from_filepath(
    118         filepath=filepath, reader_options=reader_options
    119     )
    121     if file_extension == ".nc":
    122         # based off of: https://github.com/TomNicholas/VirtualiZarr/pull/43#discussion_r1543415167
    123         magic = fpath.read()

File ~/TESTING/iri-isimip-virtual/VirtualiZarr/virtualizarr/utils.py:60, in _fsspec_openfile_from_filepath(filepath, reader_options)
     57     fpath = fsspec.filesystem(protocol, **storage_options).open(filepath)
     59 else:
---> 60     raise NotImplementedError(
     61         "Only local and s3 file protocols are currently supported"
     62     )
     64 return fpath

NotImplementedError: Only local and s3 file protocols are currently supported

It seems that we might just have to add a new case here to make this work?
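
For illustration, here is a minimal sketch of what "adding a new case" could look like at the protocol-detection step. It is not the actual VirtualiZarr code: `infer_protocol`, `is_supported`, and `SUPPORTED_PROTOCOLS` are hypothetical names, and it assumes the protocol can be read straight from the URL scheme, with scheme-less paths treated as local files.

```python
from urllib.parse import urlsplit

# Hypothetical allow-list for illustration: the two existing cases plus HTTP(S).
SUPPORTED_PROTOCOLS = {"file", "s3", "http", "https"}


def infer_protocol(filepath: str) -> str:
    """Return the URL scheme of ``filepath``, or "file" when there is none."""
    scheme = urlsplit(filepath).scheme
    return scheme if scheme else "file"


def is_supported(filepath: str) -> bool:
    """Check whether the inferred protocol is in the (assumed) allow-list."""
    return infer_protocol(filepath) in SUPPORTED_PROTOCOLS
```

With this, the ISIMIP URL above would be classified as `https` rather than falling through to the `NotImplementedError` branch.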

norlandrhagen · 2024-05-20T21:05:22Z

@jbusecke or anyone else wanna take this one? Don't wanna step on any toes.

Probably related ~~headache~~ issue:

fsspec/filesystem_spec#579

TomAugspurger · 2024-05-20T21:19:07Z

I was messing around with this last week. The only tricky thing is the reader_options (I think the current defaults might not be generally appropriate).

diff --git a/virtualizarr/utils.py b/virtualizarr/utils.py
index 6ba7105..f332163 100644
--- a/virtualizarr/utils.py
+++ b/virtualizarr/utils.py
@@ -2,6 +2,8 @@ from __future__ import annotations
 
 from typing import TYPE_CHECKING, Optional
 
+from pandas import read_csv
+
 if TYPE_CHECKING:
     from fsspec.implementations.local import LocalFileOpener
     from s3fs.core import S3File
@@ -54,11 +56,15 @@ def _fsspec_openfile_from_filepath(
             # using dict merge operator to add in defaults if keys are not specified
             storage_options = s3_anon_defaults | storage_options
 
-        fpath = fsspec.filesystem(protocol, **storage_options).open(filepath)
-
     else:
-        raise NotImplementedError(
-            "Only local and s3 file protocols are currently supported"
+        storage_options = (
+            reader_options.get("storage_options") if reader_options else {}
         )
+    fpath = fsspec.filesystem(protocol, **storage_options).open(filepath)
+
+    # else:
+    #     raise NotImplementedError(
+    #         "Only local and s3 file protocols are currently supported"
+    #     )
 
     return fpath
diff --git a/virtualizarr/xarray.py b/virtualizarr/xarray.py
index 80fce37..9ea0170 100644
--- a/virtualizarr/xarray.py
+++ b/virtualizarr/xarray.py
@@ -108,6 +108,7 @@ def open_virtual_dataset(
         vds_refs = kerchunk.read_kerchunk_references_from_file(
             filepath=filepath,
             filetype=filetype,
+            reader_options=reader_options,
         )
         virtual_vars = virtual_vars_from_kerchunk_refs(
             vds_refs,

With that, I think I got HTTPS URLs (to Azure Blob Storage) working:

import virtualizarr
import planetary_computer
import xarray as xr
import virtualizarr.kerchunk

urls = [
    "https://nasagddp.blob.core.windows.net/nex-gddp-cmip6/NEX/GDDP-CMIP6/UKESM1-0-LL/ssp585/r1i1p1f2/tas/tas_day_UKESM1-0-LL_ssp585_r1i1p1f2_gn_2100.nc",
    "https://nasagddp.blob.core.windows.net/nex-gddp-cmip6/NEX/GDDP-CMIP6/UKESM1-0-LL/ssp585/r1i1p1f2/pr/pr_day_UKESM1-0-LL_ssp585_r1i1p1f2_gn_2100.nc"
]

urls = [planetary_computer.sign(url) for url in urls]

ds1 = virtualizarr.open_virtual_dataset(urls[0], reader_options={"storage_options": {}}, filetype=virtualizarr.kerchunk.FileType.netcdf4)

ds2 = virtualizarr.open_virtual_dataset(urls[1], reader_options={"storage_options": {}}, filetype=virtualizarr.kerchunk.FileType.netcdf4)

ds = xr.combine_by_coords([ds1, ds2], join="exact", combine_attrs="drop_conflicts")
ds
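
One thing to watch in the diff above: `reader_options.get("storage_options")` returns `None` when the key is missing, and `**None` then fails inside `fsspec.filesystem(...)`. That is probably why the example passes `{"storage_options": {}}` explicitly. A rough sketch of a more forgiving version follows. The helper names are hypothetical, and `S3_ANON_DEFAULTS` is an assumed stand-in for the `s3_anon_defaults` dict whose contents are not shown in the diff.

```python
from typing import Optional
from urllib.parse import urlsplit

# Assumed stand-in for s3_anon_defaults; the real defaults may differ.
S3_ANON_DEFAULTS = {"anon": True}


def resolve_storage_options(protocol: str, reader_options: Optional[dict]) -> dict:
    """Return a dict that is always safe to unpack into fsspec.filesystem()."""
    # `or {}` covers reader_options=None as well as a missing or None
    # "storage_options" entry.
    storage_options = dict((reader_options or {}).get("storage_options") or {})
    if protocol == "s3":
        # User-supplied keys win over the anonymous-access defaults.
        storage_options = S3_ANON_DEFAULTS | storage_options
    return storage_options


def open_any_protocol(filepath: str, reader_options: Optional[dict] = None):
    """Open ``filepath`` with whichever fsspec filesystem matches its scheme."""
    import fsspec  # imported lazily; only this function needs it

    protocol = urlsplit(filepath).scheme or "file"
    storage_options = resolve_storage_options(protocol, reader_options)
    return fsspec.filesystem(protocol, **storage_options).open(filepath)
```

With something like this, `reader_options` could be omitted entirely for plain HTTPS URLs that need no credentials.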

jbusecke · 2024-05-21T18:07:00Z

That's super nice @TomAugspurger. Could you make a PR out of that branch of yours? I'd love to try this for some testing, and just want to avoid duplication.

TomAugspurger · 2024-05-21T18:10:50Z

I won't have time in the next few weeks, but feel free to take whatever you want 🙂 Tangentially related: hopefully virtualizarr can be deliberate about exactly where IO can occur (ideally isolated to scanning source files). The majority of my headaches with running data pipelines comes from IO, and if that can be isolated / minimized we'll be better off for it.

TomNicholas added the references generation Reading byte ranges from archival files label May 20, 2024

TomAugspurger mentioned this issue May 27, 2024

Allow other fsspec protocols than local and s3 #126

Merged

2 tasks

TomNicholas added the remote files Reading references from non-local files label Jun 3, 2024

TomNicholas closed this as completed in #126 Jun 5, 2024
