Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gateway opening HAMT dir produces: internalWebError: context canceled #8420

Closed
3 tasks done
stephendonner opened this issue Sep 9, 2021 · 7 comments
Closed
3 tasks done
Assignees
Labels
kind/bug A bug in existing code (including security flaws) need/analysis Needs further analysis before proceeding need/triage Needs initial labeling and prioritization P0 Critical: Tackled by core team ASAP topic/gateway Topic gateway
Milestone

Comments

@stephendonner
Copy link

stephendonner commented Sep 9, 2021

Checklist

Installation method

third-party binary

Version

go-ipfs version: 0.10.0-rc1

Config

No response

Description

DNS lookup failures for en.wikipedia-on-ipfs.org over localhost/IPNS

  1. clean profile
  2. launch Brave release (1.29.79) using --enable-logging=stderr --vmodule="*/variations/*"=6,"*/bat-native-ledger/*"=6,"*/brave_rewards/*"=6,"*/bat-native-ads/*"=6,"*/bat-native-confirmations/*"=6,"*/brave_ads/*"=9,"*/brave_user_model/*"=6 --brave-ads-staging --brave-ads-debug --rewards=staging=true,reconcile-interval=3 --use-dev-goupdater-url
  3. load ipns://en.wikipedia-on-ipfs.org
  4. click on Use a local node
  5. that loads http://en.wikipedia-on-ipfs.org.ipns.localhost:48084/wiki/
  6. click on Reload

Actual Results with 0.10.0-rc1: page does not load, internalWebError: context canceled is displayed

0.9.1 0.10.0-rc1
2021-09-09_20-28 2021-09-09_20-33
[5200:8036:0908/141344.875:VERBOSE1:ipfs_utils.cc(320)] [IPFS] TranslateIPFSURI new URL: http://localhost:48084/ipns/en.wikipedia-on-ipfs.org
[5200:9580:0908/141344.879:VERBOSE1:media_stream_manager.cc(705)] RFAOSF::Core() [process_id=38, frame_id=4]
[5200:8036:0908/141344.889:VERBOSE1:ipfs_utils.cc(320)] [IPFS] TranslateIPFSURI new URL: http://en.wikipedia-on-ipfs.org.ipns.localhost:48084/
[1420:7496:0908/141344.903:VERBOSE1:network_delegate.cc(34)] NetworkDelegate::NotifyBeforeURLRequest: http://localhost:45005/api/v0/swarm/peers
[5200:8036:0908/141344.907:VERBOSE1:mutable_profile_oauth2_token_service_delegate.cc(326)] MutablePO2TS::RefreshTokenIsAvailable
[5200:8036:0908/141344.907:VERBOSE1:account_consistency_mode_manager.cc(225)] Desktop Identity Consistency disabled as sign-in to Chromeis not allowed
[1420:7496:0908/141344.908:VERBOSE1:network_delegate.cc(34)] NetworkDelegate::NotifyBeforeURLRequest: http://en.wikipedia-on-ipfs.org.ipns.localhost:48084/wiki/
[5200:9580:0908/141344.956:VERBOSE1:media_stream_manager.cc(705)] RFAOSF::Core() [process_id=38, frame_id=4]
[5200:8036:0908/141344.957:VERBOSE1:ipfs_host_resolver.cc(86)] DNS resolving error:net::ERR_NAME_NOT_RESOLVED for host: _dnslink.en.wikipedia-on-ipfs.org.ipns.localhost

Expected Results:

ipns://en.wikipedia-on-ipfs.org / http://en.wikipedia-on-ipfs.org.ipns.localhost:48084/wiki/ should resolve

/cc @spylogsster @lidel

@stephendonner stephendonner added kind/bug A bug in existing code (including security flaws) need/triage Needs initial labeling and prioritization labels Sep 9, 2021
@welcome

This comment has been minimized.

@lidel lidel changed the title DNS lookup failures for en.wikipedia-on-ipfs.org over localhost/IPNS go-ipfs 0.10.0-rc1 in Brave produces: internalWebError: context canceled Sep 9, 2021
@lidel
Copy link
Member

lidel commented Sep 9, 2021

Thank you @stephendonner, this one is tricky.

I confirmed the issue exists, however it is not related to DNS nor Brave, but seems to be related to the way this specific HAMT directory is rendered by the go-ipfs gateway when requested CIDs are not in the local datastore:

Minimal test case:

//  start empty repo (without wikipedia blocks)
$ docker run --rm -it --net=host ipfs/go-ipfs:v0.10.0-rc1  

//  then attempt to load it in the other terminal:
$ curl 'http://en.wikipedia-on-ipfs.org.ipns.localhost:8080/wiki/'
internalWebError: context canceled

Update: same problem with other HAMT-sharded websites, eg. http://tr.wikipedia-on-ipfs.org.ipns.localhost:8080/wiki/

@lidel lidel changed the title go-ipfs 0.10.0-rc1 in Brave produces: internalWebError: context canceled Gateway opening HAMT dir produces: internalWebError: context canceled Sep 9, 2021
@lidel lidel added P0 Critical: Tackled by core team ASAP need/analysis Needs further analysis before proceeding need/triage Needs initial labeling and prioritization topic/gateway Topic gateway and removed need/triage Needs initial labeling and prioritization P0 Critical: Tackled by core team ASAP labels Sep 9, 2021
@BigLep BigLep added this to the go-ipfs 0.10 milestone Sep 9, 2021
@BigLep
Copy link
Contributor

BigLep commented Sep 10, 2021

I've pasted in comments from internal Slack below.

We're going to meet sync on 2021-09-13 to determine the plan.


aschmahmann 1 day ago
The context cancelled issue seems to be coming from https://github.com/ipfs/go-path/blob/ca1fe7e14d5c8437849ebf47e8a6bea72830cb6c/resolver/resolver.go#L62, since https://github.com/ipfs/go-path/blob/ca1fe7e14d5c8437849ebf47e8a6bea72830cb6c/resolver/resolver.go#L76 creates and cancels a new context for a linksystem which is then used in https://github.com/ipfs/go-path/blob/ca1fe7e14d5c8437849ebf47e8a6bea72830cb6c/resolver/resolver.go#L91

aschmahmann 1 day ago
I'm going to do a little more looking around since I think there may be some other bugs in that area as well.

biglep 1 day ago
Thanks Adin. Please @mention folks once we have specific areas to have others involved in.

aschmahmann 1 day ago
cc @hannah @warpfork if you've got any thoughts on a better way to rework this. Perhaps it's just changing https://github.com/ipfs/go-fetcher/blob/8349e49b3fb0c15b9d27846bd398dd9cb4821cb7/impl/blockservice/fetcher.go#L137 to use a LinkContext instead of a more general one, but I'm not sure how these things were designed to interact. (edited)
fetcher.go
blk, err := bs.GetBlock(ctx, cidLink.Cid)
https://github.com/ipfs/go-fetcher|ipfs/go-fetcheripfs/go-fetcher | Added by GitHub

biglep 22 hours ago
I don't know when we'll get @hannah engagement given DataSystems onsite.
@warpfork: if you're able to engage when start Friday that would be great.
We can post next steps in the issue so it's public.

warpfork 15 hours ago
I don't know where any of the possibly-inappropriate context cancellation comes from, but it's certainly somewhere up towards application level. go-ipld-prime just takes context and passes it straight through. It doesn't interact with it at all.
LinkContext is just a wrapper around Context. It has a helper behavior that sets the context to context.Background() if you gave it nil; that's all. (And also, that's clearly not being engaged, since the background context can't be cancelled.) (edited)

warpfork 15 hours ago
Not sure where or how to begin debugging this in the application level. Not in my wheelhouse.
major key to fixing a bug around context cancellation is having a holistic policy for what context lifetimes are supposed to be. Trying to fix things without that is gonna be peek-and-poke/whack-a-mole.
I'm not in the wheelhouse of this application enough to drop all my other mental state and develop that policy, if it doesn't exist yet
'run a docker container' is not a great amount of leverage for testing or debugging whatever's going on here
(edited)

aschmahmann 14 hours ago
develop that policy, if it doesn't exist yet
Not sure exactly what you mean by policy here, but go-ipfs and its dependencies have basically two strategies in place (and it should probably be one)
The top level caller (e.g. the CLI) controls the context and everyone takes their time
As a sanity thing subcomponents will define their own timeouts for what should be reasonable
These are generally overestimates and hard coded as constants, but if they remain should likely be configurable
While 1 is a little easier to reason about, it's proven error prone in the past so we tend to use 2.
'run a docker container' is not a great amount of leverage for testing or debugging whatever's going on here
For things that are easily reproducible, like this I find it's really easy to just spin up go-ipfs with a debugger and drop a breakpoint in the relevant function to trace exactly where things go wrong.

warpfork 14 hours ago
Update: we're pretty sure https://github.com/ipfs/go-path/blob/ca1fe7e14d5c8437849ebf47e8a6bea72830cb6c/resolver/resolver.go#L222-L225 is the source of the pain, and we're pretty sure the reason it's a problem is because the context seen there ends up getting persisted into a hamt or the unixfs adl node, if a reifier has taken action and we're getting those kind of complex nodes returned from this function... which means having a cancel take effect when the function returns kinda borks any of the adls that captured it.
We're still debating what the appropriate fix to that is.
One likely fix direction is to just move the timeouts and cancellations further up the stack.
Sidenote: Some of these cancellations and timeouts were kept, on purpose, during the code review for this very area, because there was fear of changing them. But that was conservativism applied at a syntactic level, and that... does not appear to have helped: I don't know if I'd say it outright caused problems, but at the very least it short-circuited us out from looking for them early; and it definitely didn't solve anything. Probably a lesson to learn, there. (edited)
resolver.go
ctx, cancel := context.WithTimeout(ctx, time.Minute)
defer cancel()

session := r.FetcherFactory.NewSession(ctx)

https://github.com/ipfs/go-path|ipfs/go-pathipfs/go-path | Added by GitHub

aschmahmann 13 hours ago
It seems like the workable solutions here are:
Just bump the 1min timeout + context cancellation up a level so that each call is 1min instead of each link load being 1min
Modify go-fetcher so that we can make a fetcher with per-blockload timeouts

aschmahmann 13 hours ago
Aside from the above we've got a few questions to address:
Why didn't go-ipfs CI pick this up (I suspect this could be a bit of a pain for @warpfork so probably needs to be someone whose messed around with the sharness tests before, i.e. me, Petar, or Lidel)?
What is the mental model/warning labels that should be associated with ipld Nodes and ADLs?
It's possible that we should be telling people to be careful around what you put into a linksystem due to propagation
How should we be telling people to deal with wanting some cleverness (e.g. minimum throughputs, timeouts-per-block, etc.) around block loaders?
Could we have problems elsewhere in go-ipfs due to this problem?
Is this only going to be an ADL problem? If so does this include all DagPB nodes or just UnixFS HAMTs?
Do we ever return IPLD nodes that we then operate on?
Is going with option 2 the safer thing to do here to protect ourselves from other potential bugs in this area?

aschmahmann 13 hours ago
@warpfork would appreciate your thoughts on the above and I can respond and add my thoughts later today.

hannah 12 hours ago
Hey all, sounds like the issue is that go-unixfsnode captures context from the LinkContext passed to Reify, and uses that when it does hidden linkloading inside LookupByString. Should go-unixfsnode not capture this context? Should it capture a different context?

warpfork 12 hours ago
Yep.
I don't think we've got a different context for it to capture, and it does need something.
I think the context we gave it is just not the right one, or is being cancelled way too aggressively.
We could also have an API that lets the context be re-settable later, and that could be part of a solution. (We should have that available; but also I'm not sure it's technically on the critical path for fixing this.) If we made a solution based on this, we'd still have to make changes somewhere to actually do that re-setting of the context. (edited)

hannah 12 hours ago
cool @aschmahmann @warpfork I think you have as much knowledge of this as I do at this point — I’m on site with data systems, so I’m going to assume you all have this one unless you tell me otherwise. (edited)

biglep < 1 minute ago
Hi all. Thanks for engaging. I know this isn't what anyone wants to be dealing with coming off and heading to team offsites. It's also a bummer that we have the go-ipfs release blocked before the Stewards onsite where we want to pause on the normal day-to-day as much as possible.
In order to decide how much to engage on this for next week during the Stewards onsite, I'd like to determine what the steps are that we're going to take to resolve this (including the corresponding regression test coverage). Given it's already late Friday, I assume we won't get this decided until Monday. Unless there are other suggestions, I'll schedule time for Adin's Monday morning. I'll plan on including me, @aschmahmann, @warpfork, and @hannah. Let me know if others should be included.

@BigLep
Copy link
Contributor

BigLep commented Sep 13, 2021

2021-09-13 verbal with @aschmahmann and @warpfork, we are going to:

  1. Do the quick fix of moving the timeout up to a higher layer.
  2. Investigate why this wasn't caught in our existing tests.
  3. Determine if there are additional tests that should be added.
    This is the team's top priority for this week to ideally unblock the go-ipfs release for the week of 2021-09-20.

@aschmahmann
Copy link
Contributor

  1. Proposed fix in fix: give one minute timeouts to function calls instead of block retrievals go-path#44
  2. This error only occurs when an intermediate node in the HAMT is missing locally and when the daemon is run in online mode, which was not tested. It's also kind of a pain to test in sharness since the alternative non-bugged behavior is to run until the query times out searching for the missing block, which means disambiguating between error types (both of which look basically like context cancellation)
  3. Probably a Go test in go-path or go-ipfs would be useful to reduce the likelihood of regression. This is a pretty niche case though.

@warpfork
Copy link
Member

warpfork commented Sep 15, 2021

Nit: I don't think a good hunk of #2 there is true...? The timeout isn't actually happening. The problem occurs, afaict, because code in go-path now returned a Node with HAMT logic in it, and if the HAMT logic attempted any block loads, then instantly failed, because of the cancel() call that had already happened by the time that node was returned to the caller.

I don't think this distinction necessarily matters to the fix, but (unless I'm mistaken about this, and the timeouts were actually in play somehow), it does leave many of the questions open about why tests didn't find this.

@BigLep
Copy link
Contributor

BigLep commented Sep 24, 2021

Corresponding tests are in #8436

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug A bug in existing code (including security flaws) need/analysis Needs further analysis before proceeding need/triage Needs initial labeling and prioritization P0 Critical: Tackled by core team ASAP topic/gateway Topic gateway
Projects
No open projects
Archived in project
Development

No branches or pull requests

5 participants