Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: Unable to specify list for urlpath with NetCDFSource + fsspec #98

Closed
pbranson opened this issue Feb 2, 2021 · 6 comments
Closed

Comments

@pbranson
Copy link

pbranson commented Feb 2, 2021

I have encountered a bug with a recent change in either intake-xarray or fsspec where passing a list for the urlpath to intake-xarray fails. This worked previously. Sorry havent had time to dig further, but thought I would log it

A minimum reproducible example is:

import intake
from intake.catalog import Catalog
from intake.catalog.local import LocalCatalogEntry
mycat = Catalog.from_dict({
    'eta': LocalCatalogEntry('test', 'test multi file', 'netcdf', args={'urlpath':['https://dapds00.nci.org.au/thredds/dodsC/rr6/oceanmaps_datasets/version_3.3/forecast/latest/ocean_fc_2021011512_000_eta.nc','https://dapds00.nci.org.au/thredds/dodsC/rr6/oceanmaps_datasets/version_3.3/forecast/latest/ocean_fc_2021011512_024_eta.nc'],
                                                                            'concat_dim':'Time',
                                                                            'combine':'nested'}),
    })
print(mycat.eta.yaml())
mycat.eta.to_dask()

Which outputs:

sources:
  test:
    args:
      combine: nested
      concat_dim: Time
      urlpath:
      - https://dapds00.nci.org.au/thredds/dodsC/rr6/oceanmaps_datasets/version_3.3/forecast/latest/ocean_fc_2021011512_000_eta.nc
      - https://dapds00.nci.org.au/thredds/dodsC/rr6/oceanmaps_datasets/version_3.3/forecast/latest/ocean_fc_2021011512_024_eta.nc
    description: test multi file
    driver: intake_xarray.netcdf.NetCDFSource
    metadata:
      catalog_dir: ''


AttributeError                            Traceback (most recent call last)
<ipython-input-37-f969d7f0cd32> in <module>
      7     })
      8 print(mycat.eta.yaml())
----> 9 mycat.eta.to_dask()

~/miniconda3/envs/roampy-dev/lib/python3.7/site-packages/intake_xarray/base.py in to_dask(self)
     67     def to_dask(self):
     68         """Return xarray object where variables are dask arrays"""
---> 69         return self.read_chunked()
     70 
     71     def close(self):

~/miniconda3/envs/roampy-dev/lib/python3.7/site-packages/intake_xarray/base.py in read_chunked(self)
     42     def read_chunked(self):
     43         """Return xarray object (which will have chunks)"""
---> 44         self._load_metadata()
     45         return self._ds
     46 

~/miniconda3/envs/roampy-dev/lib/python3.7/site-packages/intake/source/base.py in _load_metadata(self)
    124         """load metadata only if needed"""
    125         if self._schema is None:
--> 126             self._schema = self._get_schema()
    127             self.datashape = self._schema.datashape
    128             self.dtype = self._schema.dtype

~/miniconda3/envs/roampy-dev/lib/python3.7/site-packages/intake_xarray/base.py in _get_schema(self)
     16 
     17         if self._ds is None:
---> 18             self._open_dataset()
     19 
     20             metadata = {

~/miniconda3/envs/roampy-dev/lib/python3.7/site-packages/intake_xarray/netcdf.py in _open_dataset(self)
     88         else:
     89             # https://github.com/intake/filesystem_spec/issues/476#issuecomment-732372918
---> 90             url = fsspec.open(self.urlpath, **self.storage_options).open()
     91 
     92         self._ds = _open_dataset(url, chunks=self.chunks, **kwargs)

~/miniconda3/envs/roampy-dev/lib/python3.7/site-packages/fsspec/core.py in open(urlpath, mode, compression, encoding, errors, protocol, newline, **kwargs)
    436         newline=newline,
    437         expand=False,
--> 438         **kwargs
    439     )[0]
    440 

~/miniconda3/envs/roampy-dev/lib/python3.7/site-packages/fsspec/core.py in open_files(urlpath, mode, compression, encoding, errors, name_function, num, protocol, newline, auto_mkdir, expand, **kwargs)
    285         storage_options=kwargs,
    286         protocol=protocol,
--> 287         expand=expand,
    288     )
    289     if "r" not in mode and auto_mkdir:

~/miniconda3/envs/roampy-dev/lib/python3.7/site-packages/fsspec/core.py in get_fs_token_paths(urlpath, mode, num, name_function, storage_options, protocol, expand)
    600         cls = get_filesystem_class(protocol)
    601         optionss = list(map(cls._get_kwargs_from_urls, urlpath))
--> 602         paths = [cls._strip_protocol(u) for u in urlpath]
    603         options = optionss[0]
    604         if not all(o == options for o in optionss):

~/miniconda3/envs/roampy-dev/lib/python3.7/site-packages/fsspec/core.py in <listcomp>(.0)
    600         cls = get_filesystem_class(protocol)
    601         optionss = list(map(cls._get_kwargs_from_urls, urlpath))
--> 602         paths = [cls._strip_protocol(u) for u in urlpath]
    603         options = optionss[0]
    604         if not all(o == options for o in optionss):

~/miniconda3/envs/roampy-dev/lib/python3.7/site-packages/fsspec/implementations/local.py in _strip_protocol(cls, path)
    145     def _strip_protocol(cls, path):
    146         path = stringify_path(path)
--> 147         if path.startswith("file://"):
    148             path = path[7:]
    149         path = os.path.expanduser(path)

AttributeError: 'list' object has no attribute 'startswith'

Relevant library versions:

aiohttp==3.7.3
intake==0.6.0
intake-xarray==0.4.1
fsspec==0.8.5
xarray==0.16.2

cc @martindurant @scottyhq

@pbranson pbranson changed the title Bug: Unable to specify list for urlpath with fsspec Bug: Unable to specify list for urlpath with NetCDFSource + fsspec Feb 2, 2021
@scottyhq
Copy link
Collaborator

scottyhq commented Feb 3, 2021

thanks for the detailed reports @pbranson looks like we should definitely add some additional tests in this library to try and catch these issues.

I think an open question right now is whether fsspec or xarray should handle the loading. Hopefully @martindurant can provide some guidance here. With some backend refactoring in xarray underway i'm not sure of the best way forward as it seems there are incompatibilities with how the data is stored (Zarr, netcdf, tiff) and what code ultimately handles I/O (zarr, h5netcdf, rasterio) given either URLs or fsspec objects. See pydata/xarray#4823 (comment).

@martindurant
Copy link
Member

xarray should do this in the long term, but the storage backend implementation has changed a lot. The PR that is still not merged now only deals with the zarr engine, precisely because it's not clear which engine can accept what kind of input, and there are no tests for any of it.

@pbranson
Copy link
Author

pbranson commented Feb 6, 2021

Yeah I'm not sure the best solution here.

It would seem that the solution would be to revert the NetCDFSource to defer to XArray, but where does that leave catalogs of netCDF files accessed from a http server.

Maybe could add an engine argument that passes urlpath as str for engine='netcdf'?

Or add some of the features (like a list or pattern of urls) that are in NetCDFSource to OpenDAPSource?

@aaronspring
Copy link
Collaborator

intake_thredds handles this

import intake_thredds
intake_thredds.__version__ # '2021.6.16'

intake_thredds.THREDDSMergedSource(url='simplecache::https://dapds00.nci.org.au/thredds/catalog/rr6/oceanmaps_datasets/version_3.3/analysis/eta/catalog.xml',
                                  path=['ocean_an00_2020052*12_eta.nc'],
                                  driver='netcdf').to_dask()
<xarray.Dataset>
Dimensions:      (xt_ocean: 3600, yt_ocean: 1500, Time: 9, nv: 2)
Coordinates:
  * xt_ocean     (xt_ocean) float64 0.05 0.15 0.25 0.35 ... 359.8 359.9 360.0
  * yt_ocean     (yt_ocean) float64 -74.95 -74.85 -74.75 ... 74.75 74.85 74.95
  * Time         (Time) datetime64[ns] 2020-05-22 2020-05-23 ... 2020-05-30
  * nv           (nv) float64 1.0 2.0
Data variables:
    eta_t        (Time, yt_ocean, xt_ocean) float32 dask.array<chunksize=(1, 1500, 3600), meta=np.ndarray>
    average_T1   (Time) datetime64[ns] dask.array<chunksize=(1,), meta=np.ndarray>
    average_T2   (Time) datetime64[ns] dask.array<chunksize=(1,), meta=np.ndarray>
    average_DT   (Time) timedelta64[ns] dask.array<chunksize=(1,), meta=np.ndarray>
    Time_bounds  (Time, nv) datetime64[ns] dask.array<chunksize=(1, 2), meta=np.ndarray>

@aaronspring
Copy link
Collaborator

@pbranson is this still an issue for you?

@pbranson
Copy link
Author

pbranson commented Feb 4, 2022

Thanks @aaronspring yes this isnt so much a problem now that intake-thredds has been developed further.

@pbranson pbranson closed this as completed Feb 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants