IPFS being offline causes cascades of errors and potential memory usage bubble #1733
Labels: exp/intermediate (prior experience is likely helpful), kind/bug (a bug in existing code, including security flaws), P1 (high: likely tackled by core team if no one steps up), status/ready (ready to be worked)
Our call to "ipfs pin ls" streams pins which we collect in a map.
If the request dies halfway (because IPFS dies), we end up with an incomplete map and an error in the logs.
If this happens during a regular RecoverAll() check, the code may conclude that a huge number of IPFS pins are missing. All of those items are then queued for pinning, so they are held in memory.
While that is happening, we are also attempting to pin things. Every new request to IPFS fails immediately, causing huge load and a flood of errors while the queue keeps filling and memory balloons.
Cluster should be aware when IPFS is not reachable (connection refused) and introduce some sort of delay/retry logic so that it cannot hammer a dead node the way it does now. The ipfsconnector is probably the best place for this logic, since it is the component that makes requests to IPFS and has common methods for that.
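To make the delay/retry idea concrete, here is a minimal sketch of a gate that the common request path could go through. It is not existing ipfs-cluster code: `backoffGate`, `newBackoffGate`, `ErrBackingOff` and `errDemoRefused` are hypothetical names, and detecting "connection refused" with `errors.Is(err, syscall.ECONNREFUSED)` is an assumption that may need adjusting per platform.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"syscall"
	"time"
)

// ErrBackingOff is returned without contacting IPFS while the gate is closed.
var ErrBackingOff = errors.New("ipfs unreachable: backing off")

// errDemoRefused simulates what a dial to a dead IPFS daemon might return.
var errDemoRefused = fmt.Errorf("dial tcp 127.0.0.1:5001: %w", syscall.ECONNREFUSED)

// backoffGate is a hypothetical helper (not part of ipfs-cluster) that
// refuses to issue requests for a growing delay after connection failures.
type backoffGate struct {
	mu      sync.Mutex
	base    time.Duration // delay applied after the first failure
	max     time.Duration // upper bound for the delay
	delay   time.Duration // current delay; 0 means IPFS looked healthy
	retryAt time.Time     // requests are rejected before this instant
}

func newBackoffGate(base, max time.Duration) *backoffGate {
	return &backoffGate{base: base, max: max}
}

// do runs req unless the gate is inside a back-off window.
func (g *backoffGate) do(req func() error) error {
	g.mu.Lock()
	closed := time.Now().Before(g.retryAt)
	g.mu.Unlock()
	if closed {
		return ErrBackingOff // fail fast, do not touch the dead node
	}

	err := req()

	g.mu.Lock()
	defer g.mu.Unlock()
	if err == nil {
		g.delay, g.retryAt = 0, time.Time{} // IPFS answered: reopen the gate
		return nil
	}
	if !errors.Is(err, syscall.ECONNREFUSED) {
		return err // not a reachability problem: leave the gate alone
	}
	// Double the delay on every consecutive refusal, capped at max.
	if g.delay == 0 {
		g.delay = g.base
	} else {
		g.delay *= 2
	}
	if g.delay > g.max {
		g.delay = g.max
	}
	g.retryAt = time.Now().Add(g.delay)
	return err
}

func main() {
	gate := newBackoffGate(50*time.Millisecond, 200*time.Millisecond)
	request := func() error { return errDemoRefused } // pretend IPFS is down
	for i := 0; i < 3; i++ {
		fmt.Println(gate.do(request))
	}
}
```

With something like this in the connector, callers that see `ErrBackingOff` could requeue or skip work cheaply instead of opening a new connection for every pin. Note that requests already in flight when IPFS dies will each bump the delay, which is probably acceptable for a sketch but worth revisiting in a real implementation.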
The problem with too many things being queued due to missing ipfs-pins entries in the pintracker is separate and involves surfacing and acting on pin-ls streaming errors, so that we abort StatusAll calls when they happen.
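The second part could look roughly like the sketch below: the collector returns an error instead of a partial map, and the recover check aborts on that error rather than treating unseen pins as missing. All names here (`pinLsEvent`, `collectPins`, `missingPins`, `feed`, `errStreamDied`) are hypothetical stand-ins, not the actual ipfs-cluster types or the real StatusAll/RecoverAll code paths.

```go
package main

import (
	"errors"
	"fmt"
)

// errStreamDied simulates the error seen when IPFS dies mid-stream.
var errStreamDied = errors.New("unexpected EOF")

// pinLsEvent is a hypothetical stand-in for one item of the "ipfs pin ls"
// stream: either a pinned CID or an error reported by the stream.
type pinLsEvent struct {
	Cid string
	Err error
}

// collectPins drains the stream into a set. If the stream reported any
// error, it returns that error and no map, so callers cannot mistake a
// truncated listing for the complete pinset.
func collectPins(stream <-chan pinLsEvent) (map[string]struct{}, error) {
	pins := make(map[string]struct{})
	var streamErr error
	for ev := range stream {
		if ev.Err != nil {
			if streamErr == nil {
				streamErr = ev.Err
			}
			continue // keep draining so the producer is never left blocked
		}
		pins[ev.Cid] = struct{}{}
	}
	if streamErr != nil {
		return nil, fmt.Errorf("pin ls aborted after %d pins: %w", len(pins), streamErr)
	}
	return pins, nil
}

// missingPins lists the expected CIDs that IPFS does not report as pinned.
// On a streaming error it aborts instead of reporting everything as missing.
func missingPins(expected []string, stream <-chan pinLsEvent) ([]string, error) {
	pins, err := collectPins(stream)
	if err != nil {
		return nil, err
	}
	var missing []string
	for _, c := range expected {
		if _, ok := pins[c]; !ok {
			missing = append(missing, c)
		}
	}
	return missing, nil
}

// feed builds an already-closed stream from fixed events (demo helper).
func feed(events ...pinLsEvent) <-chan pinLsEvent {
	ch := make(chan pinLsEvent, len(events))
	for _, ev := range events {
		ch <- ev
	}
	close(ch)
	return ch
}

func main() {
	expected := []string{"cid-a", "cid-b", "cid-c"}
	// IPFS dies after listing a single pin.
	broken := feed(pinLsEvent{Cid: "cid-a"}, pinLsEvent{Err: errStreamDied})
	missing, err := missingPins(expected, broken)
	fmt.Println(missing, err)
}
```

The design choice in this sketch is that a failed listing yields no result at all, so nothing gets queued for pinning on the basis of incomplete data; the next regular check can retry once IPFS is reachable again.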