Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: lookup context cancellation race condition #656

Merged
merged 1 commit into from
May 27, 2020

Conversation

aschmahmann
Copy link
Contributor

This fixes the rare race from ipfs/kubo#7312.

The bug was that with particular timings if a lookup's context was cancelled then the lookup followup queries could continue even after the routing layer function (e.g. FindProvidersAsync) had returned to the called. This enabled the lookup followup for functions like FindProvidersAsync to try and write to a channel that had been closed, causing the panic.

This PR doesn't return control to the caller until after all the followup goroutines are cleaned up.

Note: I tried writing a test for this, but triggering this bug has proven pretty difficult since we were cancelling the followup query context when we returned to the user which means the window for this bug appearing seems quite small. Any suggestions are welcome.

@@ -130,10 +132,17 @@ processFollowUp:
}
case <-ctx.Done():
lookupRes.completed = false
cancelFollowUp()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Do we actually need this? The context should be canceled anyways.
  • On a related note, can we reliably detect that the query failed? If the context is canceled, we may read everything off the done channel before we check this.
  • It looks like we're treating completed inconsistently. Why are we marking the query as incomplete if the user explicitly stops it early?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we actually need this? The context should be canceled anyways.

You're right we shouldn't need that.

It looks like we're treating completed inconsistently. Why are we marking the query as incomplete if the user explicitly stops it early?

Completed is really more like "have we learned enough about the DHT from this query that we no longer need to refresh this bucket". In some ways this is a little weird because if our query targets a low cpl peer we probably don't need to worry about this in the followup, but if the query targets a high cpl peer we probably want to connect to all of them so we have the opportunity to add them to our routing table.

On a related note, can we reliably detect that the query failed? If the context is canceled, we may read everything off the done channel before we check this.

Probably if we rework things a little, although if we're concerned we could just check ctx.Done again at the end. If the query fails for any reason other than context cancellation then we're not worried since our job for the followup was to attempt to run the query against each of the K nodes that weren't yet queried, not to succeed. If we cancel the context though then we want to not mark the bucket as queried.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, LGTM for now. I'd prefer to just drain this channel then check to see if the context was canceled after we finish draining the channel, but this fixes the bug.

@Stebalien Stebalien merged commit 98c5089 into master May 27, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants