
RFC: js-ipfs Garbage Collection #2012

Closed
dirkmc opened this issue May 1, 2019 · 18 comments · Fixed by #2022

@dirkmc
Contributor

dirkmc commented May 1, 2019

This issue is to discuss how best to implement Garbage Collection in js-ipfs

The go-ipfs Garbage Collector

We would like to learn from the experience of go-ipfs when building a Garbage Collector for js-ipfs.

Triggers

  • Manual: ipfs repo gc
  • Daemon flag: --enable-gc causes GC to run
    • only if storage exceeds StorageGCWatermark% of StorageMax (90% of 10G by default)
    • periodically every GCPeriod (1 hour by default)
    • when a file is added to unixfs (currently disabled)
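
For reference, these thresholds live in the Datastore section of the go-ipfs config (defaults shown). The JS object below is purely illustrative, not an API:

// Defaults from the go-ipfs Datastore config section (illustrative sketch only)
const gcConfig = {
  StorageMax: '10GB',        // soft upper bound on repo size
  StorageGCWatermark: 90,    // percent of StorageMax at which --enable-gc triggers a run
  GCPeriod: '1h'             // how often the periodic GC check fires
}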

Algorithm

Source code

// GC performs a mark and sweep garbage collection of the blocks in the blockstore
// first, it creates a 'marked' set and adds to it the following:
// - all recursively pinned blocks, plus all of their descendants (recursively)
// - bestEffortRoots, plus all of its descendants (recursively)
// - all directly pinned blocks
// - all blocks utilized internally by the pinner
//
// The routine then iterates over every block in the blockstore and
// deletes any block that is not found in the marked set.
func GC(ctx context.Context, bs bstore.GCBlockstore, dstor dstore.Datastore, pn pin.Pinner, bestEffortRoots []cid.Cid) <-chan Result {

Note that bestEffortRoots currently only contains the MFS root
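
For clarity, here is roughly the same mark and sweep flow sketched in JavaScript. This is illustrative only: blockstore, pinner, dag and getMfsRoot are placeholders, not real js-ipfs APIs.

// Hypothetical sketch of the go-ipfs mark-and-sweep flow in JS terms
async function gc ({ blockstore, pinner, dag, getMfsRoot }) {
  const marked = new Set()

  // Mark: everything reachable from recursive pins and the best-effort roots (the MFS root)
  for (const cid of await pinner.recursiveKeys()) {
    await markDescendants(dag, cid, marked)
  }
  await markDescendants(dag, await getMfsRoot(), marked)

  // Directly pinned blocks and the pinner's internal blocks are marked without traversal
  for (const cid of await pinner.directKeys()) marked.add(cid.toString())
  for (const cid of await pinner.internalPins()) marked.add(cid.toString())

  // Sweep: delete every block that was not marked
  for await (const cid of blockstore.queryKeys()) {
    if (!marked.has(cid.toString())) {
      await blockstore.delete(cid)
    }
  }
}

async function markDescendants (dag, cid, marked) {
  if (marked.has(cid.toString())) return
  marked.add(cid.toString())
  const { value: node } = await dag.get(cid)
  for (const link of (node.Links || [])) {
    await markDescendants(dag, link.Hash, marked)
  }
}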

Proposal for a js-ipfs Garbage Collector

Requirements

  • Manually Garbage Collect all blocks that are not pinned, not in MFS and not used internally
    ipfs repo gc: mark and sweep - remove all unreachable blocks
  • Passively Garbage Collect old blocks when new blocks are added to the blockstore
    ipfs daemon --enable-gc causes GC to run
    • whenever a block is added to blockstore
    • only if storage exceeds StorageGCWatermark% of StorageMax (90% of 10G by default)
    • remove only excess blocks, least recently accessed blocks first

Algorithm for passive GC

  • Track reference counts using persistent reference counting algorithm
    • reference count increment:
      • each time a block is pinned directly or indirectly
        • pinner must ensure that multiple pins of the same hash are only registered once
      • on mfs add
    • reference count decrement:
      • each time a block is unpinned directly or indirectly
      • on mfs rm
    • maintain last block access time
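
A minimal sketch of what the persistent reference counting could look like, assuming a datastore-like key/value store. All names below are hypothetical, not existing js-ipfs APIs:

// Hypothetical reference counter backed by a key/value datastore
class RefCounter {
  constructor (store) {
    this.store = store
  }

  // Called on pin (direct or indirect) and on MFS add
  async increment (cid) {
    const count = await this._get(`/refs/${cid}`)
    await this.store.put(`/refs/${cid}`, String(count + 1))
  }

  // Called on unpin and on MFS rm; a block with zero refs becomes eligible for passive GC
  async decrement (cid) {
    const count = await this._get(`/refs/${cid}`)
    if (count <= 1) {
      await this.store.delete(`/refs/${cid}`)
    } else {
      await this.store.put(`/refs/${cid}`, String(count - 1))
    }
  }

  // Maintain last block access time
  async touch (cid) {
    await this.store.put(`/atime/${cid}`, String(Date.now()))
  }

  async _get (key) {
    return Number(await this.store.get(key).catch(() => 0))
  }
}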
@dirkmc
Contributor Author

dirkmc commented May 1, 2019

@Stebalien @magik6k could I ask for your input on the above proposal, and for you to correct anything I've misunderstood about how go-ipfs GC works?

@alanshaw suggested using a tiered datastore. It would track

  • references to a block
  • the last access time of a block
  • which blocks are used by GC internally

Please CC anyone else who you think might have knowledge of this area

@Stebalien
Member

Overall, that looks like a good proposal to me.

Note: ipfs repo gc doesn't have to use mark and sweep assuming you have passive reference counts.

maintain last block access time

That may not scale well. We can make this approximate ("accessed within the last day") to reduce the overhead of repeated access but this will still turn reads into writes.

remove only excess blocks, least recently accessed blocks first

Ideally, this would use a combination of frequency and recency.

@dirkmc
Contributor Author

dirkmc commented May 2, 2019

Note: ipfs repo gc doesn't have to use mark and sweep assuming you have passive reference counts.

@Stebalien I was thinking that to minimize overhead I wouldn't maintain reference counts for blocks that are not pinned and not in MFS. Because ipfs repo gc is triggered manually it seems less of a problem for it to stop the world. Based on your experience with go-ipfs, which approach would you recommend for ipfs repo gc between mark-and-sweep (with less overhead) vs passive reference counts (faster but more overhead)?

That may not scale well. We can make this approximate ("accessed within the last day") to reduce the overhead of repeated access but this will still turn reads into writes.

You're right, scaling will be a challenge, even just for reference counts. A couple of ideas off the top of my head:

  1. GC runs in a background process
  • Batch updates to reference counts / last access (see the sketch after this list)
  • GC runs continuously but
    • checks these in-memory queued updates in addition to persisted reference count / last access information
    • writes batches every N ref count changes and P period of time
  • If there is a crash
    • reference counts can be re-calculated
    • last access is just a heuristic for caching so it's not imperative that it be recoverable
  2. Move GC into datastore-interface
  • Datastore itself probably has access to metadata like file access time
  • May simplify accounting
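
A rough sketch of the batching idea in point 1, assuming a hypothetical persistent store; none of these names are real js-ipfs APIs:

// Hypothetical batched ref count updates: queue deltas in memory, flush every
// maxBatch changes or flushMs milliseconds. After a crash, counts can be recalculated.
class BatchedRefUpdates {
  constructor (store, { maxBatch = 1000, flushMs = 30000 } = {}) {
    this.store = store
    this.pending = new Map() // cid -> accumulated delta
    this.maxBatch = maxBatch
    setInterval(() => this.flush(), flushMs)
  }

  queue (cid, delta) {
    this.pending.set(cid, (this.pending.get(cid) || 0) + delta)
    if (this.pending.size >= this.maxBatch) {
      return this.flush()
    }
  }

  // GC consults both the persisted counts and this in-memory queue
  pendingDelta (cid) {
    return this.pending.get(cid) || 0
  }

  async flush () {
    const batch = [...this.pending.entries()]
    this.pending.clear()
    for (const [cid, delta] of batch) {
      const count = Number(await this.store.get(`/refs/${cid}`).catch(() => 0)) + delta
      if (count > 0) {
        await this.store.put(`/refs/${cid}`, String(count))
      } else {
        await this.store.delete(`/refs/${cid}`)
      }
    }
  }
}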

@alanshaw
Member

alanshaw commented May 3, 2019

remove only excess blocks, least recently accessed blocks first

We should ensure that a block sticks around for an hour at minimum even if not accessed. This allows us to manage expectations about how long data will persist on a preload node/gateway.

Access time would really help us clean up blocks more intelligently; ideally we wouldn't collect a block that is often accessed but not pinned. However, storing creation time instead would at least allow us to clean up blocks not pinned and older than an hour without writing on read.

on mfs add

Remember, it's basically any MFS command that requires recalculating references, and when you make changes anywhere in the MFS tree they propagate all the way back to the root.

Datastore itself probably has access to metadata like file access time

I'm sure you know but the datastore is pluggable - not everything is stored as a file - s3, url, in memory.


I would really love to have passive GC in JS IPFS. Personally I'd focus on the minimum viable (stop the world) first and then when you have that working you'll have a better idea of what's possible or not and then you can move on to a more advanced algorithm.

@alanshaw alanshaw added exp/wizard Extensive knowledge (implications, ramifications) required status/in-progress In progress P0 Critical: Tackled by core team ASAP labels May 3, 2019
@Stebalien
Member

I was thinking that to minimize overhead I wouldn't maintain reference counts for blocks that are not pinned and not in MFS.

Unless I'm mistaken, we shouldn't need to maintain reference counts for un-pinned blocks regardless. How does passive GC differ from active ipfs repo gc in this case?

Batch updates to reference counts / last access
...
reference counts can be re-calculated

True. For reference counts, we could also try some kind of write-ahead log. That should be pretty easy to maintain without too much overhead (one more write per block).
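
For instance, it could be as simple as one small append per change, compacted into the real counts later. A hypothetical sketch, where log and refs stand in for datastores (not real js-ipfs APIs):

// Hypothetical write-ahead log for reference count deltas
async function logDelta (log, cid, delta) {
  // One extra small write per block change; cheap to append, replayable after a crash
  await log.put(`/wal/${Date.now()}:${cid}`, String(delta))
}

async function compact (log, refs) {
  for await (const { key, value } of log.query({ prefix: '/wal/' })) {
    const cid = key.split(':').pop()
    const count = Number(await refs.get(`/refs/${cid}`).catch(() => 0)) + Number(value)
    if (count > 0) {
      await refs.put(`/refs/${cid}`, String(count))
    } else {
      await refs.delete(`/refs/${cid}`)
    }
    await log.delete(key)
  }
}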

Move GC into datastore-interface

I'm not sure if the datastore has enough information for this. In go, at least, the datastore just deals with blobs of data and knows nothing about IPLD, references, etc.


@alanshaw

We should ensure that a block sticks around for an hour at minimum even if not accessed. This allows us to manage expectations about how long data will persist on a preload node/gateway.

We should definitely add a configurable grace period but, IMO, we should set this to 0 on our gateways. Users shouldn't expect us to store stuff for them.

However, storing creation time instead would at least allow us to clean up blocks not pinned and older than an hour without writing on read.

We'd probably want both: sparse access time updates (i.e., record an access every hour) and a creation date. We can reduce the frequency of access time updates as the content ages.

My rationale is that data is usually either ephemeral or permanent. The longer data sticks around, the longer it's likely to stick around, so we don't want to just remove old data.

Remember, it's basically any MFS command that requires recalculating references, and when you make changes anywhere in the MFS tree they propagate all the way back to the root.

We can do this lazily on GC.

I would really love to have passive GC in JS IPFS. Personally I'd focus on the minimum viable (stop the world) first and then when you have that working you'll have a better idea of what's possible or not and then you can move on to a more advanced algorithm.

Yeah, that's probably the best approach. I'd just keep alternatives in mind before over-optimizing that approach.

@dirkmc
Contributor Author

dirkmc commented May 4, 2019

Unless I'm mistaken, we shouldn't need to maintain reference counts for un-pinned blocks regardless. How does passive GC differ from active ipfs repo gc in this case?

You're right I don't think we need to maintain reference counts for blocks that are not pinned and not in MFS. If they are pinned or in MFS we will need to, because for example a block could be part of a file in MFS and also part of another file that's pinned.

Passive GC differs from active, manually run GC in the following ways (see the sketch below this list):

  • it is triggered when a block is added to the blockstore (if we're over the GC highwater mark)
  • it only removes enough blocks to drop below the GC highwater mark
  • it doesn't stop the world, it just removes the most eligible blocks for deletion
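
A rough sketch of that passive trigger; repo, config and leastRecentlyAccessed are placeholders, not real js-ipfs APIs:

// Hypothetical passive GC trigger, run whenever a block is added
async function maybeCollect (repo, config) {
  const max = config.StorageMax                         // e.g. 10 GiB in bytes
  const watermark = max * (config.StorageGCWatermark / 100)
  const size = (await repo.stat()).repoSize

  if (size <= watermark) return                         // under the high-water mark: do nothing

  let excess = size - watermark
  // Remove only enough of the most eligible (least recently accessed, unreferenced) blocks
  for await (const { cid, blockSize } of leastRecentlyAccessed(repo)) {
    if (excess <= 0) break
    await repo.blocks.delete(cid)
    excess -= blockSize
  }
}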

For reference counts, we could also try some kind of write-ahead log

I think that's a good idea 👍

I'd focus on the minimum viable (stop the world) first

Sounds like the consensus is to implement stop-the-world mark and sweep first, as we'll need it anyway, and then look at adding a more sophisticated reference counting algorithm that takes into account creation time and last access time. Does that sound right?

@dirkmc
Contributor Author

dirkmc commented May 4, 2019

Unless I'm mistaken, we shouldn't need to maintain reference counts for un-pinned blocks regardless. How does passive GC differ from active ipfs repo gc in this case?

Sorry @Stebalien I just realized what you meant - you're saying that if we're already reference counting we don't need to mark and sweep, we just remove all blocks that are not reference counted, is that right?

If that's the case then maybe it does make sense to skip the mark and sweep implementation, and go straight for the more sophisticated implementation with reference counting. Thoughts @alanshaw?

@Stebalien
Member

That's what I meant. However, going the simpler route first is probably still the better solution. On the other hand, that depends on how pressing it is to fix this.

@dirkmc dirkmc mentioned this issue May 8, 2019
@dirkmc
Contributor Author

dirkmc commented May 10, 2019

I've implemented mark-and-sweep in #2022 but there is currently no locking.

go-ipfs uses a special Blockstore that implements the GCLocker interface:

// GCLocker abstract functionality to lock a blockstore when performing
// garbage-collection operations.
type GCLocker interface {
	// GCLock locks the blockstore for garbage collection. No operations
	// that expect to finish with a pin should occur simultaneously.
	// Reading during GC is safe, and requires no lock.
	GCLock() Unlocker

	// PinLock locks the blockstore for sequences of puts expected to finish
	// with a pin (before GC). Multiple put->pin sequences can write through
	// at the same time, but no GC should happen simultaneously.
	// Reading during Pinning is safe, and requires no lock.
	PinLock() Unlocker

	// GcRequested returns true if GCLock has been called and is waiting to
	// take the lock
	GCRequested() bool
}

@alanshaw @achingbrain do you think it makes sense for us to follow a similar approach? It would mean modifying js-ipfs-repo's interface to add methods for locking

@Stebalien
Member

This doesn't actually have to go on the repo, you just need to make sure to take a read-lock when adding blocks and a write lock when garbage collecting.
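
Roughly like this (illustrative only; rwlock stands in for any async read/write mutex, such as the one the PR pulls in via achingbrain/mortice, and is not that library's actual API):

// Take a read lock for block writes, a write lock for GC
async function putBlock (rwlock, blockstore, block) {
  const release = await rwlock.readLock()   // many adds can run concurrently
  try {
    await blockstore.put(block.cid, block.data)
  } finally {
    release()
  }
}

async function runGc (rwlock, gc) {
  const release = await rwlock.writeLock()  // GC excludes all writers (and pinners)
  try {
    await gc()
  } finally {
    release()
  }
}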

@alanshaw
Member

@dirkmc can you flesh out what it might look like and where we'd have to place code to lock/unlock? This sounds sensible on the surface but I don't have the time to dig in right now.

Note that MFS has locking; it might be worth checking out how that is implemented. @achingbrain would be able to answer questions you have about it.

@dirkmc
Contributor Author

dirkmc commented May 13, 2019

The go implementation uses a read/write mutex composed of:

When adding a file with mfs add it waits for GC to complete (by checking the GC lock)

As @Stebalien points out this doesn't necessarily have to be on the blockstore interface, that's just how it's implemented in go.

@mikeal
Contributor

mikeal commented May 16, 2019

In the future, it might be a good idea to make more explicit separations between APIs that store data and those that do not. Right now, there are APIs that “happen to store data until there is a gc” and APIs that store data indefinitely (pinned). If instead, APIs that do not pin data also did not store that data at all, you wouldn’t be violating any assumptions users might have with smarter GC strategies. As things are now you’re painted into a bit of a corner and have limited opportunities for automated GC.

@satazor
Contributor

satazor commented May 24, 2019

A very small note for when this is released: warn the OrbitDB people once this is done, because they aren’t calling ipfs.pin.add, meaning data can be lost.

@dirkmc
Contributor Author

dirkmc commented May 24, 2019

Thanks @satazor. Could you point me to the piece of code where that happens? Note that ipfs.add() will pin by default

@satazor
Contributor

satazor commented May 24, 2019

@dirkmc They are using ipfs.dag.put which doesn’t pin by default (I think). Here it is: https://github.com/orbitdb/orbit-db-io/blob/master/index.js
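
For context, explicitly pinning after the write should be enough to protect the data from GC; something along these lines (illustrative, exact dag.put options depend on the js-ipfs version):

// Pin explicitly after dag.put so GC won't remove the block
const cid = await ipfs.dag.put(node, { format: 'dag-cbor', hashAlg: 'sha2-256' })
await ipfs.pin.add(cid)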

@alanshaw
Member

We should get #1867 implemented asap.

@lidel
Member

lidel commented Jul 17, 2019

@daviddias and @alanshaw regarding your comments in #2022 (comment)

Very quick question while I'm on the Core Implementations call: does js-ipfs GC take into account the ephemerality of IndexedDB? Check the LRU policy to see how this will affect pinning, GC and even fetching of large datasets: https://developer.mozilla.org/en-US/docs/Web/API/IndexedDB_API/Browser_storage_limits_and_eviction_criteria#LRU_policy

LRU deletes storage for an entire origin at a time, so we won't randomly lose blocks of data from our datastore, which in short means that this policy doesn't really affect our implementation of GC.

Quick reminder: there are different types of data storage on the web platform. We use Temporary by default to avoid annoying user prompts, at the cost of a theoretical data purge due to the LRU policy. There could be a configuration option to opt in to Persistent storage if the app developer is okay with trading user convenience for persistence guarantees, but that is a separate topic for another time/PR.

By default, JS IPFS in a browser context should not display any prompts, but should ship with a smart GC that is aware of how Temporary storage works and ensures GC happens before origin storage limits are hit, to maximize the lifetime of the IPFS cache.

alanshaw pushed a commit that referenced this issue Aug 26, 2019
resolves #2012

Depends on
- [x] #2004
- [x] ipfs-inactive/js-ipfs-http-client#992
- [x] ipfs-inactive/interface-js-ipfs-core#462
- [x] achingbrain/mortice#1

TODO:
- [x] Core (mark and sweep)
- [x] CLI
- [x] http interface
- [x] interface-js-ipfs-core tests ipfs-inactive/interface-js-ipfs-core#462
- [x] nodejs-specific tests
- [x] Locking
- [x] Tests for locking