
ServerDisconnectedError and Cannot seek streaming HTTP file ValueError in versions 0.8.7 and 0.8.4 #550

Closed
jkellndorfer opened this issue Mar 1, 2021 · 41 comments


@jkellndorfer

I wonder if @rsignell-usgs or @martindurant have insights into this issue:

I have the following code snippet that worked in fsspec version 0.8.4 (for TEST 2):

import netrc
import fsspec
import aiohttp

(username, account, password) = netrc.netrc(file='/home/ubuntu/.netrc').authenticators("urs.earthdata.nasa.gov")
cloud_fs = fsspec.filesystem('http', client_kwargs={'auth': aiohttp.BasicAuth(username, password)})

# TEST 1
url='https://e4ftl01.cr.usgs.gov/MEASURES/NASADEM_HGT.001/2000.02.11/NASADEM_HGT_n00e006.zip'
ftmp=cloud_fs.open(url)
z=fsspec.filesystem('zip', fo=ftmp)
print('\n'.join(z.find('/')))

# TEST 2
url='https://datapool.asf.alaska.edu/SLC/SB/S1B_IW_SLC__1SDH_20201106T085415_20201106T085442_024142_02DE43_E9A2.zip'
ftmp=cloud_fs.open(url)
z=fsspec.filesystem('zip', fo=ftmp)
print('\n'.join(z.find('/')))

There are two issues:
1. TEST 1 produces a ServerDisconnectedError, but on a second run returns the expected output. This happens in both tested versions, 0.8.4 and 0.8.7.
2. TEST 2 produces the correct output in version 0.8.4 (on the first run) but fails with a ValueError: Cannot seek streaming HTTP file in version 0.8.7.

Should I approach this differently?
I attach the tracebacks below. Thanks!

ServerDisconnectedError

---------------------------------------------------------------------------
ServerDisconnectedError                   Traceback (most recent call last)
<ipython-input-3-1bac1fbcb8ec> in <module>
      1 url='https://e4ftl01.cr.usgs.gov/MEASURES/NASADEM_HGT.001/2000.02.11/NASADEM_HGT_n00e006.zip'
      2 ftmp=cloud_fs.open(url)
----> 3 z=fsspec.filesystem('zip', fo=ftmp)
      4 print('\n'.join(z.find('/')))

/s/anaconda/envs/seppo/lib/python3.8/site-packages/fsspec/registry.py in filesystem(protocol, **storage_options)
    233     """
    234     cls = get_filesystem_class(protocol)
--> 235     return cls(**storage_options)

/s/anaconda/envs/seppo/lib/python3.8/site-packages/fsspec/spec.py in __call__(cls, *args, **kwargs)
     56             return cls._cache[token]
     57         else:
---> 58             obj = super().__call__(*args, **kwargs)
     59             # Setting _fs_token here causes some static linters to complain.
     60             obj._fs_token_ = token

/s/anaconda/envs/seppo/lib/python3.8/site-packages/fsspec/implementations/zip.py in __init__(self, fo, mode, target_protocol, target_options, block_size, **kwargs)
     52             fo = files[0]
     53         self.fo = fo.__enter__()  # the whole instance is a context
---> 54         self.zip = zipfile.ZipFile(self.fo)
     55         self.block_size = block_size
     56         self.dir_cache = None

/s/anaconda/envs/seppo/lib/python3.8/zipfile.py in __init__(self, file, mode, compression, allowZip64, compresslevel, strict_timestamps)
   1267         try:
   1268             if mode == 'r':
-> 1269                 self._RealGetContents()
   1270             elif mode in ('w', 'x'):
   1271                 # set the modified flag so central directory gets written

/s/anaconda/envs/seppo/lib/python3.8/zipfile.py in _RealGetContents(self)
   1330         fp = self.fp
   1331         try:
-> 1332             endrec = _EndRecData(fp)
   1333         except OSError:
   1334             raise BadZipFile("File is not a zip file")

/s/anaconda/envs/seppo/lib/python3.8/zipfile.py in _EndRecData(fpin)
    272     except OSError:
    273         return None
--> 274     data = fpin.read()
    275     if (len(data) == sizeEndCentDir and
    276         data[0:4] == stringEndArchive and

/s/anaconda/envs/seppo/lib/python3.8/site-packages/fsspec/implementations/http.py in read(self, length)
    341             )  # all fits in one block anyway
    342         ):
--> 343             self._fetch_all()
    344         if self.size is None:
    345             if length < 0:

/s/anaconda/envs/seppo/lib/python3.8/site-packages/fsspec/asyn.py in wrapper(*args, **kwargs)
    119     def wrapper(*args, **kwargs):
    120         self = obj or args[0]
--> 121         return maybe_sync(func, self, *args, **kwargs)
    122
    123     return wrapper

/s/anaconda/envs/seppo/lib/python3.8/site-packages/fsspec/asyn.py in maybe_sync(func, self, *args, **kwargs)
     98         if inspect.iscoroutinefunction(func):
     99             # run the awaitable on the loop
--> 100             return sync(loop, func, *args, **kwargs)
    101         else:
    102             # just call the blocking function

/s/anaconda/envs/seppo/lib/python3.8/site-packages/fsspec/asyn.py in sync(loop, func, callback_timeout, *args, **kwargs)
     69     if error[0]:
     70         typ, exc, tb = error[0]
---> 71         raise exc.with_traceback(tb)
     72     else:
     73         return result[0]

/s/anaconda/envs/seppo/lib/python3.8/site-packages/fsspec/asyn.py in f()
     53             if callback_timeout is not None:
     54                 future = asyncio.wait_for(future, callback_timeout)
---> 55             result[0] = await future
     56         except Exception:
     57             error[0] = sys.exc_info()

/s/anaconda/envs/seppo/lib/python3.8/site-packages/fsspec/implementations/http.py in async_fetch_all(self)
    356         """
    357         if not isinstance(self.cache, AllBytes):
--> 358             r = await self.session.get(self.url, **self.kwargs)
    359             async with r:
    360                 r.raise_for_status()

/s/anaconda/envs/seppo/lib/python3.8/site-packages/aiohttp/client.py in _request(self, method, str_or_url, params, data, json, cookies, headers, skip_auto_headers, auth, allow_redirects, max_redirects, compress, chunked, expect100, raise_for_status, read_until_eof, proxy, proxy_auth, timeout, verify_ssl, fingerprint, ssl_context, ssl, proxy_headers, trace_request_ctx, read_bufsize)
    549                             resp = await req.send(conn)
    550                             try:
--> 551                                 await resp.start(conn)
    552                             except BaseException:
    553                                 resp.close()

/s/anaconda/envs/seppo/lib/python3.8/site-packages/aiohttp/client_reqrep.py in start(self, connection)
    888                 # read response
    889                 try:
--> 890                     message, payload = await self._protocol.read()  # type: ignore  # noqa
    891                 except http.HttpProcessingError as exc:
    892                     raise ClientResponseError(

/s/anaconda/envs/seppo/lib/python3.8/site-packages/aiohttp/streams.py in read(self)
    603             self._waiter = self._loop.create_future()
    604             try:
--> 605                 await self._waiter
    606             except (asyncio.CancelledError, asyncio.TimeoutError):
    607                 self._waiter = None

ServerDisconnectedError: Server disconnected

ValueError: Cannot seek streaming HTTP file

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-0f6667c2fc50> in <module>
      6 url='https://datapool.asf.alaska.edu/SLC/SB/S1B_IW_SLC__1SDH_20201106T085415_20201106T085442_024142_02DE43_E9A2.zip'
      7 ftmp=cloud_fs.open(url)
----> 8 z=fsspec.filesystem('zip', fo=ftmp)
      9 print('\n'.join(z.find('/')))

/s/anaconda/envs/seppo/lib/python3.8/site-packages/fsspec/registry.py in filesystem(protocol, **storage_options)
    242     """
    243     cls = get_filesystem_class(protocol)
--> 244     return cls(**storage_options)

/s/anaconda/envs/seppo/lib/python3.8/site-packages/fsspec/spec.py in __call__(cls, *args, **kwargs)
     64             return cls._cache[token]
     65         else:
---> 66             obj = super().__call__(*args, **kwargs)
     67             # Setting _fs_token here causes some static linters to complain.
     68             obj._fs_token_ = token

/s/anaconda/envs/seppo/lib/python3.8/site-packages/fsspec/implementations/zip.py in __init__(self, fo, mode, target_protocol, target_options, block_size, **kwargs)
     53             fo = files[0]
     54         self.fo = fo.__enter__()  # the whole instance is a context
---> 55         self.zip = zipfile.ZipFile(self.fo)
     56         self.block_size = block_size
     57         self.dir_cache = None

/s/anaconda/envs/seppo/lib/python3.8/zipfile.py in __init__(self, file, mode, compression, allowZip64, compresslevel, strict_timestamps)
   1267         try:
   1268             if mode == 'r':
-> 1269                 self._RealGetContents()
   1270             elif mode in ('w', 'x'):
   1271                 # set the modified flag so central directory gets written

/s/anaconda/envs/seppo/lib/python3.8/zipfile.py in _RealGetContents(self)
   1330         fp = self.fp
   1331         try:
-> 1332             endrec = _EndRecData(fp)
   1333         except OSError:
   1334             raise BadZipFile("File is not a zip file")

/s/anaconda/envs/seppo/lib/python3.8/zipfile.py in _EndRecData(fpin)
    262
    263     # Determine file size
--> 264     fpin.seek(0, 2)
    265     filesize = fpin.tell()
    266

/s/anaconda/envs/seppo/lib/python3.8/site-packages/fsspec/implementations/http.py in seek(self, *args, **kwargs)
    555
    556     def seek(self, *args, **kwargs):
--> 557         raise ValueError("Cannot seek streaming HTTP file")
    558
    559     async def _read(self, num=-1):

ValueError: Cannot seek streaming HTTP file
@martindurant (Member)

I cannot see anything wrong with your code, except I would use the more compact form

with fsspec.open("zip+http://...", http={"username": ..., "password": ...}) as openfiles:

unless you really want file-system-like operations rather than just file access.
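
For completeness, a hedged sketch of that compact form using the chained-URL syntax; the member path, archive URL, and credentials are placeholders, and per-protocol options are keyed by protocol name:

import aiohttp
import fsspec

# "zip://<member>::<remote-url>" opens one member of a remote zip in a
# single call; the https options are forwarded to the HTTP filesystem
of = fsspec.open(
    "zip://inner/file.txt::https://example.com/archive.zip",
    https={"client_kwargs": {"auth": aiohttp.BasicAuth("user", "pw")}},
)
with of as f:
    data = f.read()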

I note that the second of the URLs returns a 307 temporary redirect to a cloudfront.net URL. The specific URL is presumably time-sensitive and might expire, which is not something fsspec is prepared for: it will keep hitting the original URL.

In any case, fsspec seems to be able to infer the size of the files OK (ftmp.size), so I don't know how you end up in the Streaming branch.
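
A quick way to check which branch you land in, reusing cloud_fs and url from the issue: if the size resolves to None, the file object is the streaming, non-seekable variant, and zipfile cannot work on it (as the rest of this thread shows).

ftmp = cloud_fs.open(url)
print(ftmp.size)  # None means the file cannot seek, so zipfile will fail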

I cannot authenticate with the servers, so I can't debug further. Maybe you can send me credentials on a private channel.

@jkellndorfer (Author)

Thanks for taking a look. The file-system-like approach is flexible when the file is cached locally, e.g. when I do some full processing. Happy to send you credentials. What's your preferred private channel?

@martindurant (Member)

Email or gitter?

@jkellndorfer (Author)

sent via gitter ...

@martindurant (Member)

Sounds a lot like aio-libs/aiohttp#4549, but adding asyncio.sleep(0) does not seem to make a difference.

@jkellndorfer (Author)

Seems like there is some related issue. Interestingly, version 0.8.4 works for TEST 2 (which I am most concerned with). Do you think I need to pin version 0.8.4 in my conda environment for now, or is there another pathway?

@martindurant (Member)

I am experimenting a little. I have a feeling that the server is refusing multiple connections. Waiting works; possibly the connections time out. HTTPFileSystem does not explicitly have any retries implemented, because the HTTP layer may do this, but perhaps it should.

The creds were for both URLs, or just the first one? I'm pretty sure the second is something to do with the redirect: the new URL should be stored instead of continuing to use the original one.

@jkellndorfer (Author)

Thanks, Martin! Creds should be good for both URLs.

@martindurant (Member)

The second URL is working fine :)

@jkellndorfer (Author)

in version 0.8.7?

@martindurant (Member)

local main branch

In [1]: import aiohttp
In [2]: import fsspec
In [3]: user = "***"
In [4]: pw = "***"
In [5]: url='https://datapool.asf.alaska.edu/SLC/SB/S1B_IW_SLC__1SDH_20201106T085415_20201106T085442_024142_02DE43_E9A2.zip'
In [6]: h = fsspec.filesystem('http', client_kwargs={'auth': aiohttp.BasicAuth(user, pw)})
In [7]: f = h.open(url, 'rb')
In [8]: z = fsspec.filesystem("zip", fo=f)
In [9]: z.find("")
Out[9]:
['S1B_IW_SLC__1SDH_20201106T085415_20201106T085442_024142_02DE43_E9A2.SAFE/S1B_IW_SLC__1SDH_20201106T085415_20201106T085442_024142_02DE43_E9A2.SAFE-report-20201106T151146.pdf',
...

@martindurant (Member)

version '0.8.7+0.gef1001a.dirty', with this local change:

--- a/fsspec/implementations/http.py
+++ b/fsspec/implementations/http.py
@@ -607,11 +609,14 @@ async def _file_size(url, session=None, size_policy="head", **kwargs):
         r = await session.get(url, allow_redirects=ar, **kwargs)
     else:
         raise TypeError('size_policy must be "head" or "get", got %s' "" % size_policy)
-    async with r:
-        if "Content-Length" in r.headers:
-            return int(r.headers["Content-Length"])
-        elif "Content-Range" in r.headers:
-            return int(r.headers["Content-Range"].split("/")[1])
+    try:
+        async with r:
+            if "Content-Length" in r.headers:
+                return int(r.headers["Content-Length"])
+            elif "Content-Range" in r.headers:
+                return int(r.headers["Content-Range"].split("/")[1])
+    finally:
+        r.close()

@jkellndorfer (Author) commented Mar 1, 2021

I tried your sequence and still get the
ValueError: Cannot seek streaming HTTP file

fsspec 0.8.7 pyhd8ed1ab_0 conda-forge

@martindurant (Member)

I guess that in your case, the size determination is being hit by the disconnect error. You need the size to be able to seek...

I must stop here for the moment, I'm afraid. I should at least add logging to the http module, so we know which calls are succeeding versus failing. Would you be interested in wrapping the aiohttp client with https://pypi.org/project/aiohttp-retry/? It seems that might well be the kind of solution we are after.
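
For reference, a minimal sketch of that wrapping, assuming an fsspec newer than 0.8.7 that accepts a get_client callable for constructing the session; the credentials are placeholders, and aiohttp_retry's RetryClient mimics aiohttp.ClientSession:

import aiohttp
import fsspec
from aiohttp_retry import ExponentialRetry, RetryClient

async def get_retry_client(**kwargs):
    # RetryClient wraps a ClientSession and retries failed requests
    # with exponential backoff
    return RetryClient(retry_options=ExponentialRetry(attempts=5), **kwargs)

fs = fsspec.filesystem(
    "http",
    get_client=get_retry_client,
    client_kwargs={"auth": aiohttp.BasicAuth("user", "pw")},
)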

@jkellndorfer (Author)

Thanks, Martin. I'll give that a try and let you know how far I get.

@jkellndorfer (Author) commented Mar 1, 2021

Sorry to say, I didn't get very far, as the module doesn't even seem to import:

pip install aiohttp-retry
Requirement already satisfied: aiohttp-retry in /s/anaconda/lib/python3.8/site-packages (2.3.3)
Requirement already satisfied: aiohttp in /s/anaconda/lib/python3.8/site-packages (from aiohttp-retry) (3.7.4)
Requirement already satisfied: typing-extensions>=3.6.5 in /s/anaconda/lib/python3.8/site-packages (from aiohttp->aiohttp-retry) (3.7.4.3)
Requirement already satisfied: async-timeout<4.0,>=3.0 in /s/anaconda/lib/python3.8/site-packages (from aiohttp->aiohttp-retry) (3.0.1)
Requirement already satisfied: chardet<4.0,>=2.0 in /s/anaconda/lib/python3.8/site-packages (from aiohttp->aiohttp-retry) (3.0.4)
Requirement already satisfied: multidict<7.0,>=4.5 in /s/anaconda/lib/python3.8/site-packages (from aiohttp->aiohttp-retry) (5.1.0)
Requirement already satisfied: attrs>=17.3.0 in /s/anaconda/lib/python3.8/site-packages (from aiohttp->aiohttp-retry) (20.3.0)
Requirement already satisfied: yarl<2.0,>=1.0 in /s/anaconda/lib/python3.8/site-packages (from aiohttp->aiohttp-retry) (1.6.3)
Requirement already satisfied: idna>=2.0 in /s/anaconda/lib/python3.8/site-packages (from yarl<2.0,>=1.0->aiohttp->aiohttp-retry) (2.10)
(seppo) ubuntu@ip-172-31-62-164:/s/seppo$ ipython
Python 3.8.8 | packaged by conda-forge | (default, Feb 20 2021, 16:22:27)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.21.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from aiohttp_retry import RetryClient, ExponentialRetry
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-1-9214afd70e56> in <module>
----> 1 from aiohttp_retry import RetryClient, ExponentialRetry

ModuleNotFoundError: No module named 'aiohttp_retry'

@jkellndorfer (Author) commented Mar 1, 2021

More tests, more strange behavior...
I tried to downgrade to 0.8.4 on my new machine, and the same issues now happen. The difference is that the machines run different Linux kernel versions:

Not working: 5.4.0-1038-aws #40~18.04.1-Ubuntu SMP Sat Feb 6 01:56:56 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Working on:  5.4.0-1032-aws #33~18.04.1-Ubuntu SMP Thu Dec 10 08:19:06 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

On the working machine, ftmp.details includes the size:

In [6]: ftmp.details
Out[6]:
{'name': 'https://datapool.asf.alaska.edu/SLC/SB/S1B_IW_SLC__1SDH_20201106T085415_20201106T085442_024142_02DE43_E9A2.zip',
 'size': 5002388581,
 'type': 'file'}

But not on the non-working one:

In [19]: ftmp.details
Out[19]:
{'name': 'https://datapool.asf.alaska.edu/SLC/SB/S1B_IW_SLC__1SDH_20201106T085415_20201106T085442_024142_02DE43_E9A2.zip',
 'size': None}

Does that help?

@martindurant (Member)

So it certainly shows the difference: the server reset happened during the size query for the second instance. But I don't know what the system kernel has to do with it. This is a conda install of Python?

@jkellndorfer (Author)

Yes, it's a conda install with default channel conda-forge. I am attaching the YAML used to build the environment:
env.zip

@jkellndorfer (Author) commented Mar 1, 2021

One other tidbit of detail, though I assume there's no obvious connection: the non-working machine is in AWS us-west-2, the same region as the bucket where the zip is stored. The working machine is located in us-east-1. Who knows if that might make a difference?

@jkellndorfer (Author)

Upon further testing, I am homing in on the AWS region difference! I cloned my Amazon machine image from us-east-1 to use in us-west-2, giving me the same environment in both regions. Result: my fsspec/zip code works in us-east-1 but fails with ValueError: Cannot seek streaming HTTP file in us-west-2. Hmm. Puzzling.

@martindurant (Member)

Since you are not actually using the S3 API, but getting an HTTP URL from an intermediary, I really don't know why region would matter, except of course that latencies differ. EDIT: after writing the text below, I wonder whether the CloudFront cache is skipped for intra-region requests?

I suspect that it's the intermediary which is causing the hangup, not S3, so HTTPFileSystem should notice on the first call that a redirect is happening, and thereafter use the new URL rather than keep hitting the redirect server. I don't have proof of this... Actually the initial 307 redirect is to CloudFront, which gives a 303 redirect to another CloudFront URL embedding a signed S3 URL. The final response header has "'X-Cache': 'Miss from cloudfront'"; could the difference be whether the call is cached or not?

Also note that only the "datapool" version requires any credentials. In the case of auth failure, the final request generates yet another 303 redirect, which points to a "DENIED" HTML page. By playing with this chain of redirects, I was able to get a Bad Gateway response, so the server(s) are certainly not behaving entirely well.
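
For anyone reproducing this, a small sketch for inspecting that redirect chain with requests, assuming the user, pw, and url variables from the earlier snippets (stream=True avoids downloading the multi-GB body):

import requests

r = requests.get(url, auth=(user, pw), stream=True)
for hop in r.history:  # each intermediate 3xx response
    print(hop.status_code, "->", hop.headers.get("Location"))
print(r.status_code, r.headers.get("X-Cache"))  # final response, cache status
r.close()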

@jkellndorfer (Author) commented Mar 3, 2021 via email

@martindurant (Member)

It would be good to have their input, to see if anything suspicious is appearing in the logs, for instance.

@jkellndorfer (Author) commented Mar 3, 2021 via email

@bbuechler

Hello. 👋 It's my team at ASF that maintains both the AWS storage and the application that distributes the data. I have some ideas about why a migration from operating in US-EAST-1 to US-WEST-2 could cause problems, starting with the distribution app design:
[TEA architecture diagram]
Our data and distribution apps run in the US-WEST-2 region. When requests for data come from OUTSIDE that region, the data is proxied through CloudFront. However, requests from in-region are serviced with direct pre-signed S3 URLs. One way this can go wrong is that S3 (unlike CloudFront) will complain if you provide two types of authentication; since a pre-signed URL is considered one type of auth, also providing URS credentials in the form of username and password triggers the error. This is one of the most frequent issues to come up. You can see it by running the same request in and out of region:

# make a request OUT of region, push auth through all redirects: --location-trusted 
$ curl -L -b ~/junkcookies -c ~/junkcookies --location-trusted -u "$up" https://datapool.asf.alaska.edu/METADATA_GRD_HD/SA/S1A_IW_GRDH_1SDV_20200227T102024_20200227T102049_031437_039E84_650B.iso.xml -o - | head -10
<?xml version='1.0' encoding='UTF-8'?>
<gmd:DS_Series xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:gmd="http://www.isotc211.org/2005/gmd" xmlns:gco="http://www.isotc211.org/2005/gco" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:eos="http://earthdata.nasa.gov/schema/eos" xmlns:echo="http://www.echo.nasa.gov/ingest/schemas/operatations" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:gml="http://www.opengis.net/gml/3.2" xmlns:gmi="http://www.isotc211.org/2005/gmi" xmlns:gmx="http://www.isotc211.org/2005/gmx">
  <gmd:composedOf>
    <gmd:DS_DataSet>
      <gmd:has>
        <gmi:MI_Metadata>
          <gmd:fileIdentifier>
            <gco:CharacterString>S1A_IW_GRDH_1SDV_20200227T102024_20200227T102049_031437_039E84_650B.iso.xml</gco:CharacterString>
          </gmd:fileIdentifier>
          <gmd:language>

This has worked and we get our file. Now, make the same request from an EC2 instance running in US-WEST-2:

$ curl -L -b ~/junkcookies -c ~/junkcookies --location-trusted -u "$up" https://datapool.asf.alaska.edu/METADATA_GRD_HD/SA/S1A_IW_GRDH_1SDV_20200227T102024_20200227T102049_031437_039E84_650B.iso.xml -o - | head -10
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>InvalidArgument</Code><Message>Only one auth mechanism allowed; only the X-Amz-Algorithm query parameter, Signature query string parameter or the Authorization header should be specified</Message><ArgumentName>Authorization</ArgumentName><ArgumentValue>Basic Yxxxxxxxxxg=</ArgumentValue><RequestId>E7xxxxxxxxQ</RequestId><HostId>K4xxxxxxxxY=</HostId></Error>

Here we see the double-auth error.

I cannot say for certain this is the problem you're experiencing, but it is something to look out for. Mitigating this can be as easy as ensuring that the Authorization: Basic XXXXXX= header is ONLY provided to the authentication host (urs.earthdata.nasa.gov) and never to any other intermediate stops along the way.

I will reach out via email (I was forwarded your contact by Ciji) in case you'd like further assistance troubleshooting the problems. Provided a time range, a URL, and perhaps a URS ID and/or error message, it would not be terribly difficult for me to dig into the logs and see if I can find more useful troubleshooting/debug information.

@martindurant (Member)

The original exception was ServerDisconnectedError, which I would tentatively connect with the "throttling" part of the diagram you provide. Is it the case that two subsequent calls to the original (auth) server could cause a disconnect or other service refusal; i.e., if we had stashed the redirect URL and used it directly, do you think the problem would go away?

I cannot say for certain this is the problem you're experiencing, but it is something to look out for. Mitigating this can be as easy as ensuring that the Authorization: Basic XXXXXX= header is ONLY provided to the authentication host (urs.earthdata.nasa.gov) and never to any other intermediate stops along the way.

This would mean intercepting the redirect chain, which is certainly doable, but not something that easily fits into the fsspec implementation. Perhaps the workaround would be to directly hit the auth host with a requests call (i.e., not using fsspec), then find the resultant final URL and use that with fsspec.

@jkellndorfer (Author) commented Mar 4, 2021 via email

@bbuechler commented Mar 4, 2021

Out-of-region throttling from CloudFront is a transparent slowdown of the bandwidth; it's also pretty rare. Ideally you'd never see it, because you'll never be throttled. In region (from within US-WEST-2 in this case), there is never throttling, and you'll always have native AWS/S3 (fast!) performance.

If you're seeing issues with the authentication host (urs.earthdata.nasa.gov), that is outside my bailiwick and control. I do know that EDL (aka URS) limits each individual user_id to a limited number (100, I think?) of active sessions, but that shouldn't be in play here.

For optimum performance, I'd suggest these two tweaks if you're not currently doing them:

  • Authenticate ONCE with the datapool app by downloading a file, or with the auth service (https://auth.asf.alaska.edu/); capture and save the asf-urs cookie, and provide that cookie with all subsequent download requests (a sketch follows below). This cookie acts as your key to fast data access. If that cookie is NOT provided, every request triggers authentication, which is expensive in time and computation at the EDL level.
  • Don't cache redirect URLs. Pre-signed AWS URLs are only valid for about 50 minutes. There is no penalty for re-using the https://datapool.asf.alaska.edu/... URL as long as you're providing the asf-urs cookie.

Other than the double-auth problem, there shouldn't be any region-change issues directly coming into play.
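
A minimal sketch of the cookie-reuse pattern from the first bullet, assuming URS credentials stored in ~/.netrc under machine urs.earthdata.nasa.gov (requests re-applies netrc credentials when a redirect lands on that host, which also sidesteps the double-auth problem); the URL is the sample metadata file from above:

import requests

session = requests.Session()  # persists the asf-urs cookie across calls

url = "https://datapool.asf.alaska.edu/METADATA_GRD_HD/SA/S1A_IW_GRDH_1SDV_20200227T102024_20200227T102049_031437_039E84_650B.iso.xml"

r = session.get(url)   # first call walks the auth redirects and sets the cookie
r2 = session.get(url)  # later calls present the cookie and skip EDL auth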

@bbuechler commented Mar 4, 2021

Brian: Once we go that route, can we use the S3 URL directly?

If you're operating in US-WEST-2, your final stop in the redirect journey will always be a pre-signed S3 URL. This is true for the large majority of ASF data, including all Sentinel products. Again, I cannot pinpoint double-auth as the exact problem you're encountering; however, it is the issue most likely to be associated with a change of operating region.

This in-region/double-auth issue for accessing S3 data is not unique to ASF. The architecture is the same across all NASA DAACs and will affect access to all HTTPS datasets that have been migrated into AWS (though the cookie/authentication details can and will vary). It is also not limited to DAAC data: AWS S3 will ALWAYS complain about being provided double auth.

@martindurant (Member)

Martin, please recall that the ServerDisconnectedError happened with the access to a different DAAC. The ASF DAAC access attempt threw the ValueError: Cannot seek streaming HTTP file.

This happened when fsspec was unable to complete a HEAD request to determine the file's size, very likely because of the ServerDisconnectedError.

Martin: How would you attempt the request call ideally first to get to the final URL?

r = requests.get(url, auth=(user, pw), allow_redirects=False)  

and manually check the r.headers["Location"] until you stop seeing 3xx redirects. Here you can also omit the auth part at the appropriate stage.
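
Expanded into a loop, a hedged sketch of that manual walk, reusing user, pw, and url from the earlier snippets and, per the advice above, sending Basic auth only to the URS host:

from urllib.parse import urljoin, urlparse

import requests

session = requests.Session()  # keeps any cookies set along the way
final_url = url
while True:
    # only the EDL/URS host should ever see the Basic auth header
    host = urlparse(final_url).hostname
    auth = (user, pw) if host == "urs.earthdata.nasa.gov" else None
    r = session.get(final_url, auth=auth, allow_redirects=False, stream=True)
    if r.status_code not in (301, 302, 303, 307, 308):
        break
    final_url = urljoin(final_url, r.headers["Location"])
    r.close()
# final_url can now be handed to fsspec; if pre-signed, it expires in ~50 minutes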

@martindurant (Member)

I haven't had a chance yet to check that the cookie is being passed each time. If someone has the time, I would appreciate it.

@avalentino

Hi, I'm experiencing the same problem.
Is there any update on this issue?

@martindurant (Member)

Since this issue is rather old and many things have changed since, could you please write a full description of your situation, @avalentino: fsspec and aiohttp/requests versions, the code executed, and the exception returned.

@avalentino

OS: GNU/Linux - Ubuntu 22.10 - x86_64
Python: 3.10.7
fsspec: 2022.5.0
aiohttp: 3.8.1
requests: 2.27.1

import netrc
import fsspec
import aiohttp

db = netrc.netrc()
user, _, pwd = db.hosts['https://api.daac.asf.alaska.edu']
auth = aiohttp.BasicAuth(user, pwd)

url = "https://datapool.asf.alaska.edu/SLC/SA/S1A_IW_SLC__1SSH_20221030T141520_20221030T141547_045672_057648_393A.zip"

fd = fsspec.open(url, auth=auth)
fs = fsspec.filesystem('zip', fo=fd)

Output:

Traceback (most recent call last):
  File "~/projects/fsspec-sandbox/fsspec-test.py", line 12, in <module>
    fs = fsspec.filesystem('zip', fo=fd)
  File "/usr/lib/python3/dist-packages/fsspec/registry.py", line 262, in filesystem
    return cls(**storage_options)
  File "/usr/lib/python3/dist-packages/fsspec/spec.py", line 76, in __call__
    obj = super().__call__(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/fsspec/implementations/zip.py", line 58, in __init__
    self.zip = zipfile.ZipFile(self.fo)
  File "/usr/lib/python3.10/zipfile.py", line 1267, in __init__
    self._RealGetContents()
  File "/usr/lib/python3.10/zipfile.py", line 1330, in _RealGetContents
    endrec = _EndRecData(fp)
  File "/usr/lib/python3.10/zipfile.py", line 274, in _EndRecData
    data = fpin.read()
  File "/usr/lib/python3/dist-packages/fsspec/implementations/http.py", line 574, in read
    return super().read(length)
  File "/usr/lib/python3/dist-packages/fsspec/spec.py", line 1578, in read
    out = self.cache._fetch(self.loc, self.loc + length)
  File "/usr/lib/python3/dist-packages/fsspec/caching.py", line 377, in _fetch
    self.cache = self.fetcher(start, bend)
  File "/usr/lib/python3/dist-packages/fsspec/asyn.py", line 86, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/fsspec/asyn.py", line 66, in sync
    raise return_result
  File "/usr/lib/python3/dist-packages/fsspec/asyn.py", line 26, in _runner
    result[0] = await coro
  File "/usr/lib/python3/dist-packages/fsspec/implementations/http.py", line 613, in async_fetch_range
    r.raise_for_status()
  File "/usr/lib/python3/dist-packages/aiohttp/client_reqrep.py", line 1004, in raise_for_status
    raise ClientResponseError(
aiohttp.client_exceptions.ClientResponseError: 401, message='Unauthorized', url=URL('https://urs.earthdata.nasa.gov/oauth/authorize?client_id=BO_n7nTIlMljdvU6kRRB3g&response_type=code&redirect_uri=https://sentinel1.asf.alaska.edu/login&state=/SLC/SA/S1A_IW_SLC__1SSH_20221030T141520_20221030T141547_045672_057648_393A.zip&app_type=401')

@avalentino

Same results with the latest versions of requests, aiohttp, and fsspec installed using pip.

@martindurant (Member)

I note that this is not the exception that this thread is talking about.

Given "unauthorized", does a straight-forward get work?

import requests
r = requests.get(url, auth=auth)
r.content

@avalentino

Your code results in an authorization error.
Moreover, the file requested is more than 5 GB, so I modified your example a little:

import requests
auth = (user, pwd)
r = requests.get(url, auth=auth, allow_redirects=True, stream=True)
print(r)
print(r.raw.read(10))

The result is:

<Response [200]>
b'PK\x03\x04\x14\x03\x00\x00\x00\x00'

@martindurant (Member)

Instead of

fd = fsspec.open(url, auth=auth)

, please try

fd = fsspec.open(url, client_kwargs={"auth": auth})

@avalentino

Yes, this works.
Thanks a lot for your help.

>>> fd2 = fsspec.open(url, client_kwargs={"auth": auth})
>>> fs2 = fsspec.filesystem('zip', fo=fd2)
>>> fs2.listdir("")

[{'orig_filename': 'S1A_IW_SLC__1SSH_20221030T141520_20221030T141547_045672_057648_393A.SAFE/',
  'filename': 'S1A_IW_SLC__1SSH_20221030T141520_20221030T141547_045672_057648_393A.SAFE/',
  'date_time': (2022, 10, 30, 15, 44, 14),
  'compress_type': 0,
  '_compresslevel': None,
  'comment': b'',
  'extra': b'\n\x00 \x00\x00\x00\x00\x00\x01\x00\x18\x00\x80\x84\x8euv\xec\xd8\x01\x80\xd3\x85\\v\xec\xd8\x01\x80\x84\x8euv\xec\xd8\x01',
  'create_system': 3,
  'create_version': 63,
  'extract_version': 20,
  'reserved': 3,
  'flag_bits': 0,
  'volume': 0,
  'internal_attr': 0,
  'external_attr': 1106083856,
  'header_offset': 0,
  'CRC': 0,
  'compress_size': 0,
  'file_size': 0,
  '_raw_time': 32135,
  'name': 'S1A_IW_SLC__1SSH_20221030T141520_20221030T141547_045672_057648_393A.SAFE/',
  'size': 0,
  'type': 'directory'}]

@martindurant (Member)

I was about to say that auth is specifically mentioned in the docstring of HTTPFileSystem, but I see that it is inexplicably missing from the API docs page.
