
Behaviour improvements when the ipfs daemon is unavailable #1762

Merged
merged 4 commits into from
Sep 15, 2022

Conversation

hsanjuan
Collaborator

@hsanjuan hsanjuan commented Sep 15, 2022

This PR introduces a series of improvements in how the cluster peer deals with non-responding IPFS daemons:

  • On boot, the cluster will wait for IPFS to be available
  • The pintracker will return an error on StatusAll/RecoverAll calls if ipfs is down
  • The API will return an error rather than NoContent when there are no elements to stream and there has been an error (bug)
  • After 10 consecutive request failures to IPFS, a rate limit of 1 req/s is applied until a request succeeds.

This should improve the cluster peer's behaviour during IPFS outages, which currently result in endless retrying and the associated CPU/memory growth, neither of which helps IPFS recover or restart swiftly.

Fixes #1733.

This commit introduces unlimited waiting on start until a request to `ipfs id`
succeeds.

Waiting has some consequences:

* State watching (recover/sync) and metrics publishing do not start until ipfs is ready
* swarm/connect is not triggered until ipfs is ready.

Once the first request to ipfs succeeds, everything proceeds as before.

This avoids attempting operations, such as publishing our IDs in metrics, when IPFS is
simply not there.
Unfortunately, we were not paying attention to errors while rpc-streaming pins
in the pintracker. As a result, the StatusAll operation would list all
pins as unexpectedly unpinned when ipfs was offline, which would trigger
recover/requeue operations for every pin.

This commit changes the behaviour so that if the IPFS Pin/ls call resulted in an
error, the whole StatusAll operation fails instead of completing.
This fixes a bug in the API code that made it return 204 No Content when the RPC
methods failed with an error before any items were put on the channel.
When IPFS starts failing or does not respond (e.g. during a restart), the cluster
is likely to start sending requests at very high rates: if there are 100k
items to be pinned and pins start failing immediately, the cluster burns through
the pin queue very quickly and every request fails. Meanwhile, IPFS
is hammered non-stop, which may make recovery harder.

This commit introduces a rate limit when requests to IPFS fail. After 10
failed requests, further requests are sent at a rate of at most 1 req/s. Once a
request succeeds, the rate limit is lifted.

This should prevent hammering the IPFS daemon, and also avoid the increased CPU
usage in the cluster as it burns through pinning queues while IPFS is offline,
which worsens the situation on affected machines (and emits far more logs).
@hsanjuan hsanjuan added this to the Release v1.0.3 milestone Sep 15, 2022
@hsanjuan hsanjuan merged commit c9895bf into master Sep 15, 2022
@hsanjuan hsanjuan deleted the fix/1733-ipfs-error-handling branch September 15, 2022 15:59
Successfully merging this pull request may close these issues.

IPFS being offline causes cascades of errors and potential memory usage bubble