ServerDisconnectedError and Cannot seek streaming HTTP file ValueError in version 0.8.7 and 0.8.4 #550
I cannot see anything wrong with your code, except I would use the more compact form
unless you really want file-system-like operations rather than just file access. I note that the second initial URL returns a 307 temporary redirect to a cloudfront.net URL. The specific URL is presumably time-sensitive and might expire, which is not something fsspec is prepared for - it will continue hitting the original URL. In any case, fsspec seems to be able to infer the size of the files OK. I cannot authenticate with the servers, so I can't debug further. Maybe you can send me credentials on a private channel. |
Thanks for taking a look. This approach is flexible when the file is cached locally, e.g. when I do some full processing. Happy to send you credentials. What's your preferred private channel? |
email, gitter ? |
sent via gitter ... |
Sounds a lot like aio-libs/aiohttp#4549, but adding |
Seems like there is some related issue. Interestingly, the 0.8.4 version works for TEST 2 (which I am most concerned with). Do you think I need to force version 0.8.4 in my conda environment for now, or can we see another pathway? |
I am experimenting a little. I have a feeling that the server is refusing multiple connections. Waiting works; possibly the connections time out. HTTPFileSystem does not explicitly have any retries implemented, because the HTTP layer may do this - but perhaps it should. The creds were for both URLs, or just the first one? I'm pretty sure the second is something to do with the redirect - the new URL should be stored instead of continuing to use the original one. |
thanks, Martin! Creds should be good for both URLs. |
The second URL is working fine :) |
in version 0.8.7? |
local main branch |
version '0.8.7+0.gef1001a.dirty', with local change:

```diff
--- a/fsspec/implementations/http.py
+++ b/fsspec/implementations/http.py
@@ -607,11 +609,14 @@ async def _file_size(url, session=None, size_policy="head", **kwargs):
         r = await session.get(url, allow_redirects=ar, **kwargs)
     else:
         raise TypeError('size_policy must be "head" or "get", got %s' "" % size_policy)
-    async with r:
-        if "Content-Length" in r.headers:
-            return int(r.headers["Content-Length"])
-        elif "Content-Range" in r.headers:
-            return int(r.headers["Content-Range"].split("/")[1])
+    try:
+        async with r:
+            if "Content-Length" in r.headers:
+                return int(r.headers["Content-Length"])
+            elif "Content-Range" in r.headers:
+                return int(r.headers["Content-Range"].split("/")[1])
+    finally:
+        r.close()
```
 |
I tried your sequence and still get the error, with fsspec 0.8.7 pyhd8ed1ab_0 conda-forge |
I guess that in your case, the size determination is being hit by the disconnect error. You need the size to be able to seek... I must stop here for the moment, I'm afraid. I should at least add logging to the http module, so we know what calls are succeeding versus failing; would you be interested in wrapping the aiohttp client with https://pypi.org/project/aiohttp-retry/ ? It seems that might well be the kind of solution we are after. |
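Martin mentions aiohttp-retry above; its real API wraps an aiohttp `ClientSession`, but the underlying idea can be sketched with a hypothetical, standard-library-only helper (the names, exception types, and backoff values below are illustrative assumptions, not the fsspec or aiohttp-retry API):

```python
# Hypothetical sketch of the retry idea discussed above. aiohttp-retry
# wraps a ClientSession similarly; here we just retry any awaitable
# operation with exponential backoff on transient connection errors.
import asyncio

async def with_retries(coro_factory, attempts=3, base_delay=0.1,
                       retry_on=(ConnectionError,)):
    for attempt in range(attempts):
        try:
            return await coro_factory()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: propagate the last error
            await asyncio.sleep(base_delay * 2 ** attempt)

# Demo: an operation that disconnects twice, then succeeds.
calls = {"n": 0}

async def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("server disconnected")
    return "ok"

result = asyncio.run(with_retries(lambda: flaky_fetch()))
print(result)  # prints "ok" after two retried failures
```

The same shape would apply around the `session.get` calls in `_file_size`, which is where the disconnects appear to happen here.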
thanks, Martin. I'll give that a try and let you know how far I get. |
Sorry to say, I didn't get very far as this doesn't seem to even want to load ...
|
More tests, more strange behavior...
On the working one I also get details with the size:
Not on the non-working version:
does that help? |
So it certainly shows the difference, that server-reset happened during size query for the second instance, but I don't know what the system kernel has to do with it. This is a conda install of python? |
yes. it's a conda install with default channel conda-forge. I am attaching the yaml to build the kernel. |
One other tidbit of detail, again where I assume we don't readily make a connection: the non-working machine is in AWS us-west-2, the same region as the bucket where the zip is stored. The working machine is located in us-east-1. Who knows if that might make a difference? |
Upon further testing I am homing in on the AWS region difference! I cloned my Amazon machine image from us-east-1 to be used in us-west-2. The machine gave me the same environment in both regions. Result: my fsspec/zip code works in us-east-1, but fails with the same error in us-west-2. |
Since you are not using the S3 API actually, but getting an HTTP URL from an intermediary, I really don't know why region would matter, except of course that latencies are different. -EDIT- after writing the text below, I wonder whether the cloudfront cache is skipped for intra-region requests?
I suspect that it's the intermediary which is causing the hangup, not S3 - so HTTPFileSystem should notice on the first call that a redirect is happening, and thereafter use the new URL rather than keep hitting the redirect server. I don't have proof for this... Actually the initial 307 redirect is to cloudfront, which gives a 303 redirect to another cloudfront URL embedding an S3 signed URL. The final response header has "'X-Cache': 'Miss from cloudfront'" - could the difference be whether the call is cached or not?
Also note that only the "datapool" version requires any credentials. In the case of auth failure, the final request generates yet another 303 redirect, which points to a "DENIED" html page. Note that by playing with this chain of redirects, I was able to get a Bad Gateway response, so the server(s) are certainly not playing quite well. |
Thanks, Martin for keeping at it. Latency differences and possible skipping of the cloudfront cache are good leads. Do you think that this can and should be caught within fsspec or should we talk to the NASA folks who set up the servers? I know the group at ASF quite well and would be happy to make an introduction.
|
It would be good to have their input, see if anything suspicious is appearing in the logs for instance. |
ok, I just pinged you on Gitter.
|
The original exception was ServerDisconnectedError, which I would tentatively connect with the "throttling" part of the diagram you provide. Is it the case that two subsequent calls to the original (auth) server could cause a disconnect or other service refusal; i.e., if we had stashed the redirect URL and used it directly, do you think the problem would go away?
This would mean intercepting the redirect chain, which is certainly doable, but not something that easily fits into the fsspec implementation. Perhaps the workaround would be to directly hit the auth host with a requests call (i.e., not using fsspec), find the resultant final URL, and use that with fsspec. |
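The workaround in the previous comment can be sketched like this; `resolve_final_url` is a hypothetical helper, not part of fsspec, and uses only the standard library (a `requests` call would look much the same):

```python
# Hypothetical helper sketching the workaround above: follow the redirect
# chain once up front, then hand the final URL to fsspec so repeated range
# requests do not keep hitting the auth/redirect server.
import urllib.request

def resolve_final_url(url):
    # urlopen follows 3xx redirects by default; geturl() reports the URL
    # the chain ended at.
    with urllib.request.urlopen(url) as r:
        return r.geturl()

# Usage sketch (credentials/cookies for the auth hop are omitted here):
# final = resolve_final_url(url)
# fd = fsspec.open(final)
```

Note that if the signed URL at the end of the chain is time-limited, the resolved URL would need to be refreshed when it expires.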
Thanks, Martin and Brian. Seems like we are on a good track. Martin, please recall that the `ServerDisconnectedError` happened with the access to a different DAAC. The ASF DAAC access attempt threw the `ValueError: Cannot seek streaming HTTP file`.
Martin: How would you ideally make that initial requests call to get to the final URL?
Brian: Once we go that route can we use the s3 url directly?
I cannot say for certain this is the problem you're experiencing, but it is something to look out for. Mitigating this can be as easy as ensuring that the `Authorization: Basic XXXXXX=` header is ONLY provided to that authentication host (urs.earthdata.nasa.gov) and never any other intermediate stops along the way. |
Out-of-region throttling from CloudFront is a transparent slowdown of the bandwidth; it's also pretty rare. Ideally you'd never see it, because you'll never be throttled. In region (from within us-west-2)... If you're seeing issues with the authentication host (urs.earthdata.nasa.gov)... For optimum performance, I'd suggest these two tweaks if you're not currently doing them:
Other than the double-auth problem, there shouldn't be any region-change issues that are directly coming into play. |
If you're operating in us-west-2... This issue with in-region/double-auth for accessing S3 data is not unique to ASF. This architecture is the same across all NASA DAACs and will affect access to all https datasets that have been migrated into AWS (though the cookie/authentication details can and will vary). It is also not limited to just DAAC data. AWS S3 will ALWAYS complain about being provided double auth. |
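The double-auth mitigation described above can be illustrated with a small sketch: when a redirect crosses to a different host, drop the `Authorization` header so basic-auth credentials only ever reach the authentication host (`headers_for_redirect` is a hypothetical helper; `requests`' `Session` performs a similar check internally when following redirects):

```python
# Sketch of the double-auth mitigation described above: strip the
# Authorization header whenever a redirect moves to a different host, so
# credentials are only ever sent to the authentication host.
from urllib.parse import urlparse

def headers_for_redirect(headers, old_url, new_url):
    out = dict(headers)
    if urlparse(old_url).hostname != urlparse(new_url).hostname:
        out.pop("Authorization", None)  # never leak creds cross-host
    return out

h = {"Authorization": "Basic XXXXXX=", "Accept": "*/*"}
print(headers_for_redirect(
    h,
    "https://urs.earthdata.nasa.gov/login",
    "https://datapool.asf.alaska.edu/file.zip",
))  # prints {'Accept': '*/*'}
```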
This happened when fsspec was unable to complete a HEAD request to determine the file's size, very likely because of ServerDisconnect.
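The size probe in question mirrors the header logic in the diff earlier in the thread; a minimal standalone sketch (`size_from_headers` is a hypothetical helper, not fsspec's actual function):

```python
# Hypothetical standalone version of the size probe discussed above: the
# size comes from Content-Length, or from the total after the "/" in a
# Content-Range header. If neither is present, the file is streaming-only
# and seeking (which the zip reader needs) is impossible.
def size_from_headers(headers):
    if "Content-Length" in headers:
        return int(headers["Content-Length"])
    elif "Content-Range" in headers:
        # e.g. "bytes 0-0/12345" -> total size is after the slash
        return int(headers["Content-Range"].split("/")[1])
    return None  # size unknown -> "Cannot seek streaming HTTP file"

print(size_from_headers({"Content-Length": "12345"}))         # 12345
print(size_from_headers({"Content-Range": "bytes 0-0/678"}))  # 678
print(size_from_headers({}))                                  # None
```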
and manually check the |
I haven't had a chance yet to check that the cookie is being passed each time. If someone has the time, would appreciate it. |
Hi, I'm experiencing the same problem. |
Since this issue is rather old and many things have changed since, can you please, @avalentino, write a full description of your situation: fsspec and aiohttp/requests versions, code executed, and exception returned. |
OS: GNU/Linux - Ubuntu 22.10 - x86_64

```python
import netrc
import fsspec
import aiohttp

db = netrc.netrc()
user, _, pwd = db.hosts['https://api.daac.asf.alaska.edu']
auth = aiohttp.BasicAuth(user, pwd)
url = "https://datapool.asf.alaska.edu/SLC/SA/S1A_IW_SLC__1SSH_20221030T141520_20221030T141547_045672_057648_393A.zip"
fd = fsspec.open(url, auth=auth)
fs = fsspec.filesystem('zip', fo=fd)
```

Output:
 |
Same results with the latest version of requests, aiohttp, fsspec installed using pip. |
I note that this is not the exception that this thread is talking about. Given "unauthorized", does a straightforward get work?

```python
import requests

r = requests.get(url, auth=auth)
r.content
```
 |
Your code results in an authorization error. With

```python
import requests

auth = (user, pwd)
r = requests.get(url, auth=auth, allow_redirects=True, stream=True)
print(r)
print(r.raw.read(10))
```

the result is
 |
Instead of

```python
fd = fsspec.open(url, auth=auth)
```

, please try

```python
fd = fsspec.open(url, client_kwargs={"auth": auth})
```
 |
Yes, this works.

```python
>>> fd2 = fsspec.open(url, client_kwargs={"auth": auth})
>>> fs2 = fsspec.filesystem('zip', fo=fd2)
>>> fs2.listdir("")
[{'orig_filename': 'S1A_IW_SLC__1SSH_20221030T141520_20221030T141547_045672_057648_393A.SAFE/',
  'filename': 'S1A_IW_SLC__1SSH_20221030T141520_20221030T141547_045672_057648_393A.SAFE/',
  'date_time': (2022, 10, 30, 15, 44, 14),
  'compress_type': 0,
  '_compresslevel': None,
  'comment': b'',
  'extra': b'\n\x00 \x00\x00\x00\x00\x00\x01\x00\x18\x00\x80\x84\x8euv\xec\xd8\x01\x80\xd3\x85\\v\xec\xd8\x01\x80\x84\x8euv\xec\xd8\x01',
  'create_system': 3,
  'create_version': 63,
  'extract_version': 20,
  'reserved': 3,
  'flag_bits': 0,
  'volume': 0,
  'internal_attr': 0,
  'external_attr': 1106083856,
  'header_offset': 0,
  'CRC': 0,
  'compress_size': 0,
  'file_size': 0,
  '_raw_time': 32135,
  'name': 'S1A_IW_SLC__1SSH_20221030T141520_20221030T141547_045672_057648_393A.SAFE/',
  'size': 0,
  'type': 'directory'}]
```
 |
I was about to say how auth is specifically mentioned in the docstring of HTTPFileSystem, but I see that it is inexplicably missing from the API docs page. |
I wonder if @rsignell-usgs or @martindurant have insights into this issue:
I have the following code snippet that worked in fsspec version 0.8.4 (for TEST 2):
There are two issues:
1. TEST 1 produces a `ServerDisconnectedError`, but on a second run returns the expected output. This happens in both tested versions, 0.8.4 and 0.8.7.
2. TEST 2 produces the correct output in version 0.8.4 (on the first run) but fails with a `ValueError: Cannot seek streaming HTTP file` in version 0.8.7.

Should I approach this differently?
I attach the tracebacks below. Thanks!
ServerDisconnectedError
ValueError: Cannot seek streaming HTTP file