
Behaviour improvements when the ipfs daemon is unavailable #1762

Merged
merged 4 commits into from
Sep 15, 2022

Conversation

hsanjuan
Collaborator

@hsanjuan hsanjuan commented Sep 15, 2022

This PR introduces a series of improvements in how the cluster peer deals with non-responding IPFS daemons:

  • On boot, the cluster will wait for IPFS to be available
  • The pintracker will return an error on StatusAll/RecoverAll calls if ipfs is down
  • The API will return an error rather than NoContent when there are no elements to stream and there has been an error (bug)
  • After 10 consecutive request failures to IPFS, a rate limit of 1 req/s is applied until a request succeeds.

This should improve the cluster peer's behaviour during IPFS outages, which currently result in endless retrying and the associated CPU/memory growth, neither of which helps IPFS recover or restart swiftly.

Fixes #1733.

This commit introduces unlimited waiting on start until a request to `ipfs id`
succeeds.

Waiting has some consequences:

* State watching (recover/sync) and metrics publishing do not start until ipfs is ready
* swarm/connect is not triggered until ipfs is ready.

Once the first request to ipfs succeeds, everything proceeds as before.

This avoids attempting operations, such as publishing our IDs in metrics, when IPFS is
simply not there.
Unfortunately, we were not paying attention to errors while rpc-streaming pins
in the pintracker. As a result, the StatusAll operation would list all
pins as unexpectedly unpinned when ipfs was offline, which would trigger
recover/requeue operations for every pin.

This commit changes the behaviour so that if the IPFS Pin/ls call resulted in an
error, the whole StatusAll operation fails instead of completing.
This fixes a bug in the API code that made it return 204 No Content when the RPC
methods failed with an error before any items were put on the channel.
When IPFS starts failing or does not respond (e.g. during a restart), the cluster
is likely to start sending requests at very high rates: if there are 100k
items to be pinned and pins start failing immediately, the cluster burns through
the pin queue very quickly and every request fails. Meanwhile, IPFS
is hammered non-stop, which may make recovery harder.

This commit introduces a rate limit when requests to IPFS fail. After 10
failed requests, further requests are sent at a rate of at most 1 req/s. Once a
request succeeds, the rate limit is lifted.

This should prevent hammering the IPFS daemon, and also avoid the increased CPU
usage in the cluster as it burns through pinning queues while IPFS is offline,
which worsens the situation on affected machines (and emits far more logs).
@hsanjuan hsanjuan added this to the Release v1.0.3 milestone Sep 15, 2022
@hsanjuan hsanjuan merged commit c9895bf into master Sep 15, 2022
@hsanjuan hsanjuan deleted the fix/1733-ipfs-error-handling branch September 15, 2022 15:59
Successfully merging this pull request may close these issues.

IPFS being offline causes cascades of errors and potential memory usage bubble