Gateway opening HAMT dir produces: internalWebError: context canceled #8420

stephendonner · 2021-09-09T17:16:49Z

Checklist

This is a bug report, not a question. Ask questions on discuss.ipfs.io.
I have searched on the issue tracker for my bug.
I am running the latest go-ipfs version or have an issue updating.

Installation method

third-party binary

Version

go-ipfs version: 0.10.0-rc1

Config

No response

Description

DNS lookup failures for en.wikipedia-on-ipfs.org over localhost/IPNS

clean profile
launch Brave release (1.29.79) using --enable-logging=stderr --vmodule="*/variations/*"=6,"*/bat-native-ledger/*"=6,"*/brave_rewards/*"=6,"*/bat-native-ads/*"=6,"*/bat-native-confirmations/*"=6,"*/brave_ads/*"=9,"*/brave_user_model/*"=6 --brave-ads-staging --brave-ads-debug --rewards=staging=true,reconcile-interval=3 --use-dev-goupdater-url
load ipns://en.wikipedia-on-ipfs.org
click on Use a local node
that loads http://en.wikipedia-on-ipfs.org.ipns.localhost:48084/wiki/
click on Reload

Actual Results with 0.10.0-rc1: page does not load, internalWebError: context canceled is displayed

0.9.1	0.10.0-rc1

[5200:8036:0908/141344.875:VERBOSE1:ipfs_utils.cc(320)] [IPFS] TranslateIPFSURI new URL: http://localhost:48084/ipns/en.wikipedia-on-ipfs.org
[5200:9580:0908/141344.879:VERBOSE1:media_stream_manager.cc(705)] RFAOSF::Core() [process_id=38, frame_id=4]
[5200:8036:0908/141344.889:VERBOSE1:ipfs_utils.cc(320)] [IPFS] TranslateIPFSURI new URL: http://en.wikipedia-on-ipfs.org.ipns.localhost:48084/
[1420:7496:0908/141344.903:VERBOSE1:network_delegate.cc(34)] NetworkDelegate::NotifyBeforeURLRequest: http://localhost:45005/api/v0/swarm/peers
[5200:8036:0908/141344.907:VERBOSE1:mutable_profile_oauth2_token_service_delegate.cc(326)] MutablePO2TS::RefreshTokenIsAvailable
[5200:8036:0908/141344.907:VERBOSE1:account_consistency_mode_manager.cc(225)] Desktop Identity Consistency disabled as sign-in to Chromeis not allowed
[1420:7496:0908/141344.908:VERBOSE1:network_delegate.cc(34)] NetworkDelegate::NotifyBeforeURLRequest: http://en.wikipedia-on-ipfs.org.ipns.localhost:48084/wiki/
[5200:9580:0908/141344.956:VERBOSE1:media_stream_manager.cc(705)] RFAOSF::Core() [process_id=38, frame_id=4]
[5200:8036:0908/141344.957:VERBOSE1:ipfs_host_resolver.cc(86)] DNS resolving error:net::ERR_NAME_NOT_RESOLVED for host: _dnslink.en.wikipedia-on-ipfs.org.ipns.localhost

Expected Results:

ipns://en.wikipedia-on-ipfs.org / http://en.wikipedia-on-ipfs.org.ipns.localhost:48084/wiki/ should resolve

/cc @spylogsster @lidel

The text was updated successfully, but these errors were encountered:

lidel · 2021-09-09T18:49:52Z

Thank you @stephendonner, this one is tricky.

I confirmed the issue exists, however it is not related to DNS nor Brave, but seems to be related to the way this specific HAMT directory is rendered by the go-ipfs gateway when requested CIDs are not in the local datastore:

Minimal test case:

//  start empty repo (without wikipedia blocks)
$ docker run --rm -it --net=host ipfs/go-ipfs:v0.10.0-rc1  

//  then attempt to load it in the other terminal:
$ curl 'http://en.wikipedia-on-ipfs.org.ipns.localhost:8080/wiki/'
internalWebError: context canceled

Update: same problem with other HAMT-sharded websites, eg. http://tr.wikipedia-on-ipfs.org.ipns.localhost:8080/wiki/

BigLep · 2021-09-10T21:24:11Z

I've pasted in comments from internal Slack below.

We're going to meet sync on 2021-09-13 to determine the plan.

aschmahmann 1 day ago
The context cancelled issue seems to be coming from https://github.com/ipfs/go-path/blob/ca1fe7e14d5c8437849ebf47e8a6bea72830cb6c/resolver/resolver.go#L62, since https://github.com/ipfs/go-path/blob/ca1fe7e14d5c8437849ebf47e8a6bea72830cb6c/resolver/resolver.go#L76 creates and cancels a new context for a linksystem which is then used in https://github.com/ipfs/go-path/blob/ca1fe7e14d5c8437849ebf47e8a6bea72830cb6c/resolver/resolver.go#L91

aschmahmann 1 day ago
I'm going to do a little more looking around since I think there may be some other bugs in that area as well.

biglep 1 day ago
Thanks Adin. Please @mention folks once we have specific areas to have others involved in.

aschmahmann 1 day ago
cc @hannah @warpfork if you've got any thoughts on a better way to rework this. Perhaps it's just changing https://github.com/ipfs/go-fetcher/blob/8349e49b3fb0c15b9d27846bd398dd9cb4821cb7/impl/blockservice/fetcher.go#L137 to use a LinkContext instead of a more general one, but I'm not sure how these things were designed to interact. (edited)
fetcher.go
blk, err := bs.GetBlock(ctx, cidLink.Cid)
https://github.com/ipfs/go-fetcher|ipfs/go-fetcheripfs/go-fetcher | Added by GitHub

biglep 22 hours ago
I don't know when we'll get @hannah engagement given DataSystems onsite.
@warpfork: if you're able to engage when start Friday that would be great.
We can post next steps in the issue so it's public.

warpfork 15 hours ago
I don't know where any of the possibly-inappropriate context cancellation comes from, but it's certainly somewhere up towards application level. go-ipld-prime just takes context and passes it straight through. It doesn't interact with it at all.
LinkContext is just a wrapper around Context. It has a helper behavior that sets the context to context.Background() if you gave it nil; that's all. (And also, that's clearly not being engaged, since the background context can't be cancelled.) (edited)

warpfork 15 hours ago
Not sure where or how to begin debugging this in the application level. Not in my wheelhouse.
major key to fixing a bug around context cancellation is having a holistic policy for what context lifetimes are supposed to be. Trying to fix things without that is gonna be peek-and-poke/whack-a-mole.
I'm not in the wheelhouse of this application enough to drop all my other mental state and develop that policy, if it doesn't exist yet
'run a docker container' is not a great amount of leverage for testing or debugging whatever's going on here
(edited)

aschmahmann 14 hours ago
develop that policy, if it doesn't exist yet
Not sure exactly what you mean by policy here, but go-ipfs and its dependencies have basically two strategies in place (and it should probably be one)
The top level caller (e.g. the CLI) controls the context and everyone takes their time
As a sanity thing subcomponents will define their own timeouts for what should be reasonable
These are generally overestimates and hard coded as constants, but if they remain should likely be configurable
While 1 is a little easier to reason about, it's proven error prone in the past so we tend to use 2.
'run a docker container' is not a great amount of leverage for testing or debugging whatever's going on here
For things that are easily reproducible, like this I find it's really easy to just spin up go-ipfs with a debugger and drop a breakpoint in the relevant function to trace exactly where things go wrong.

warpfork 14 hours ago
Update: we're pretty sure https://github.com/ipfs/go-path/blob/ca1fe7e14d5c8437849ebf47e8a6bea72830cb6c/resolver/resolver.go#L222-L225 is the source of the pain, and we're pretty sure the reason it's a problem is because the context seen there ends up getting persisted into a hamt or the unixfs adl node, if a reifier has taken action and we're getting those kind of complex nodes returned from this function... which means having a cancel take effect when the function returns kinda borks any of the adls that captured it.
We're still debating what the appropriate fix to that is.
One likely fix direction is to just move the timeouts and cancellations further up the stack.
Sidenote: Some of these cancellations and timeouts were kept, on purpose, during the code review for this very area, because there was fear of changing them. But that was conservativism applied at a syntactic level, and that... does not appear to have helped: I don't know if I'd say it outright caused problems, but at the very least it short-circuited us out from looking for them early; and it definitely didn't solve anything. Probably a lesson to learn, there. (edited)
resolver.go
ctx, cancel := context.WithTimeout(ctx, time.Minute)
defer cancel()

session := r.FetcherFactory.NewSession(ctx)

https://github.com/ipfs/go-path|ipfs/go-pathipfs/go-path | Added by GitHub

aschmahmann 13 hours ago
It seems like the workable solutions here are:
Just bump the 1min timeout + context cancellation up a level so that each call is 1min instead of each link load being 1min
Modify go-fetcher so that we can make a fetcher with per-blockload timeouts

aschmahmann 13 hours ago
Aside from the above we've got a few questions to address:
Why didn't go-ipfs CI pick this up (I suspect this could be a bit of a pain for @warpfork so probably needs to be someone whose messed around with the sharness tests before, i.e. me, Petar, or Lidel)?
What is the mental model/warning labels that should be associated with ipld Nodes and ADLs?
It's possible that we should be telling people to be careful around what you put into a linksystem due to propagation
How should we be telling people to deal with wanting some cleverness (e.g. minimum throughputs, timeouts-per-block, etc.) around block loaders?
Could we have problems elsewhere in go-ipfs due to this problem?
Is this only going to be an ADL problem? If so does this include all DagPB nodes or just UnixFS HAMTs?
Do we ever return IPLD nodes that we then operate on?
Is going with option 2 the safer thing to do here to protect ourselves from other potential bugs in this area?

aschmahmann 13 hours ago
@warpfork would appreciate your thoughts on the above and I can respond and add my thoughts later today.

hannah 12 hours ago
Hey all, sounds like the issue is that go-unixfsnode captures context from the LinkContext passed to Reify, and uses that when it does hidden linkloading inside LookupByString. Should go-unixfsnode not capture this context? Should it capture a different context?

warpfork 12 hours ago
Yep.
I don't think we've got a different context for it to capture, and it does need something.
I think the context we gave it is just not the right one, or is being cancelled way too aggressively.
We could also have an API that lets the context be re-settable later, and that could be part of a solution. (We should have that available; but also I'm not sure it's technically on the critical path for fixing this.) If we made a solution based on this, we'd still have to make changes somewhere to actually do that re-setting of the context. (edited)

hannah 12 hours ago
cool @aschmahmann @warpfork I think you have as much knowledge of this as I do at this point — I’m on site with data systems, so I’m going to assume you all have this one unless you tell me otherwise. (edited)

biglep < 1 minute ago
Hi all. Thanks for engaging. I know this isn't what anyone wants to be dealing with coming off and heading to team offsites. It's also a bummer that we have the go-ipfs release blocked before the Stewards onsite where we want to pause on the normal day-to-day as much as possible.
In order to decide how much to engage on this for next week during the Stewards onsite, I'd like to determine what the steps are that we're going to take to resolve this (including the corresponding regression test coverage). Given it's already late Friday, I assume we won't get this decided until Monday. Unless there are other suggestions, I'll schedule time for Adin's Monday morning. I'll plan on including me, @aschmahmann, @warpfork, and @hannah. Let me know if others should be included.

BigLep · 2021-09-13T18:51:37Z

2021-09-13 verbal with @aschmahmann and @warpfork, we are going to:

Do the quick fix of moving the timeout up to a higher layer.
Investigate why this wasn't caught in our existing tests.
Determine if there are additional tests that should be added.
This is the team's top priority for this week to ideally unblock the go-ipfs release for the week of 2021-09-20.

aschmahmann · 2021-09-15T13:51:03Z

Proposed fix in fix: give one minute timeouts to function calls instead of block retrievals go-path#44
This error only occurs when an intermediate node in the HAMT is missing locally and when the daemon is run in online mode, which was not tested. It's also kind of a pain to test in sharness since the alternative non-bugged behavior is to run until the query times out searching for the missing block, which means disambiguating between error types (both of which look basically like context cancellation)
Probably a Go test in go-path or go-ipfs would be useful to reduce the likelihood of regression. This is a pretty niche case though.

warpfork · 2021-09-15T16:49:49Z

Nit: I don't think a good hunk of #2 there is true...? The timeout isn't actually happening. The problem occurs, afaict, because code in go-path now returned a Node with HAMT logic in it, and if the HAMT logic attempted any block loads, then instantly failed, because of the cancel() call that had already happened by the time that node was returned to the caller.

I don't think this distinction necessarily matters to the fix, but (unless I'm mistaken about this, and the timeouts were actually in play somehow), it does leave many of the questions open about why tests didn't find this.

BigLep · 2021-09-24T22:20:43Z

Corresponding tests are in #8436

stephendonner added kind/bug A bug in existing code (including security flaws) need/triage Needs initial labeling and prioritization labels Sep 9, 2021

This comment has been minimized.

Sign in to view

lidel changed the title ~~DNS lookup failures for en.wikipedia-on-ipfs.org over localhost/IPNS~~ go-ipfs 0.10.0-rc1 in Brave produces: internalWebError: context canceled Sep 9, 2021

lidel changed the title ~~go-ipfs 0.10.0-rc1 in Brave produces: internalWebError: context canceled~~ Gateway opening HAMT dir produces: internalWebError: context canceled Sep 9, 2021

stephendonner mentioned this issue Sep 9, 2021

Manual pass for go-ipfs 0.10.0 on Windows brave/brave-browser#17922

Closed

BigLep added this to the go-ipfs 0.10 milestone Sep 9, 2021

aschmahmann mentioned this issue Sep 14, 2021

fix: give one minute timeouts to function calls instead of block retrievals ipfs/go-path#44

Merged

spylogsster mentioned this issue Sep 17, 2021

Update ipfs node to 0.10 brave/brave-browser#17934

Closed

BigLep assigned aschmahmann Sep 21, 2021

aschmahmann mentioned this issue Sep 27, 2021

v0.10.0-rc2 #8459

Merged

5 tasks

BigLep closed this as completed Sep 28, 2021

stephendonner mentioned this issue Oct 6, 2021

IPFS node 0.10.0 release brave/qa-resources#365

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Gateway opening HAMT dir produces: internalWebError: context canceled #8420

Gateway opening HAMT dir produces: internalWebError: context canceled #8420

stephendonner commented Sep 9, 2021 •

edited by lidel

Loading

This comment has been minimized.

lidel commented Sep 9, 2021 •

edited

Loading

BigLep commented Sep 10, 2021

BigLep commented Sep 13, 2021

aschmahmann commented Sep 15, 2021

warpfork commented Sep 15, 2021 •

edited

Loading

BigLep commented Sep 24, 2021

Gateway opening HAMT dir produces: internalWebError: context canceled #8420

Gateway opening HAMT dir produces: internalWebError: context canceled #8420

Comments

stephendonner commented Sep 9, 2021 • edited by lidel Loading

Checklist

Installation method

Version

Config

Description

This comment has been minimized.

lidel commented Sep 9, 2021 • edited Loading

BigLep commented Sep 10, 2021

BigLep commented Sep 13, 2021

aschmahmann commented Sep 15, 2021

warpfork commented Sep 15, 2021 • edited Loading

BigLep commented Sep 24, 2021

stephendonner commented Sep 9, 2021 •

edited by lidel

Loading

lidel commented Sep 9, 2021 •

edited

Loading

warpfork commented Sep 15, 2021 •

edited

Loading