Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IPFS being offline causes cascades of errors and potential memory usage bubble #1733

Closed
hsanjuan opened this issue Jul 8, 2022 · 0 comments · Fixed by #1762
Closed

IPFS being offline causes cascades of errors and potential memory usage bubble #1733

hsanjuan opened this issue Jul 8, 2022 · 0 comments · Fixed by #1762
Labels
exp/intermediate Prior experience is likely helpful kind/bug A bug in existing code (including security flaws) P1 High: Likely tackled by core team if no one steps up status/ready Ready to be worked

Comments

@hsanjuan
Copy link
Collaborator

hsanjuan commented Jul 8, 2022

Our call to "ipfs pin ls" streams pins which we collect in a map.

If the request "dies" half-way (because IPFS dies), we end up with a map that does not have all the things it should and an error in the logs.

If this happens during a regular RecoverAll() check, the code will potentially think that a huge amount of IPFS pins are missing. This will result in all those items to be queued for pinning (so they go into memory).

While we are doing that, we will be attempting to pin things too, opening requests to IPFS obviously immediately fail, causing huge load and errors, while the queue is getting filled and memory ballooning.

Cluster should be aware if IPFS is not reachable (connection refused) and introduce some sort of delay / retry logic so that it is not possible to hammer a dead-node like now. Probably the ipfsconnector is the best place to have this logic, as it is the place that makes requests to IPFS and has common methods for that.

The problem with too many things being queued due to missing ipfs-pins entries in the pintracker is separate and involves surfacing and acting on pin-ls streaming errors, so that we abort StatusAll calls when they happen.

@hsanjuan hsanjuan added kind/bug A bug in existing code (including security flaws) need/triage Needs initial labeling and prioritization exp/intermediate Prior experience is likely helpful status/ready Ready to be worked P1 High: Likely tackled by core team if no one steps up and removed need/triage Needs initial labeling and prioritization labels Jul 8, 2022
@hsanjuan hsanjuan added this to the Release v1.0.3 milestone Sep 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
exp/intermediate Prior experience is likely helpful kind/bug A bug in existing code (including security flaws) P1 High: Likely tackled by core team if no one steps up status/ready Ready to be worked
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant