
GC the CAS storage #5136

Closed
stuhood opened this issue Nov 27, 2017 · 10 comments

@stuhood
Member

stuhood commented Nov 27, 2017

We should add a method to Scheduler or Graph to garbage collect any entries in the Store that are not currently mentioned in the Graph, and which either:

  1. are older than X
  2. have not been accessed in Y

... depending on whether storing access times has a significant impact on performance.
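A minimal sketch of that predicate (everything here is hypothetical: the real Store keys blobs by Fingerprint, and would need to persist creation/access metadata alongside each entry):

```rust
use std::collections::HashSet;
use std::time::{Duration, SystemTime};

// Hypothetical stand-ins for the real Store's keys and per-entry metadata.
#[derive(PartialEq, Eq, Hash)]
struct Digest(String);

struct EntryMetadata {
  created_at: SystemTime,
  last_accessed: SystemTime,
}

// An entry is collectable if the Graph does not mention it, and it is
// either older than `max_age` (criterion 1) or has gone unaccessed for
// longer than `max_idle` (criterion 2).
fn is_collectable(
  digest: &Digest,
  meta: &EntryMetadata,
  live_in_graph: &HashSet<Digest>,
  max_age: Duration,
  max_idle: Duration,
) -> bool {
  if live_in_graph.contains(digest) {
    return false;
  }
  let now = SystemTime::now();
  let too_old = now
    .duration_since(meta.created_at)
    .map_or(false, |age| age > max_age);
  let idle_too_long = now
    .duration_since(meta.last_accessed)
    .map_or(false, |idle| idle > max_idle);
  too_old || idle_too_long
}
```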

@stuhood
Member Author

stuhood commented Nov 27, 2017

...but actually, if the Store database is going to be shared across workspaces, using access time might be necessary to avoid nuking things out from under other daemons.

@illicitonion
Contributor

Given the atomic update nature of lmdb, we could probably do something clever where a daemon marks a blob as being "leased", and is responsible for re-leasing at least every n hours or something.
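A minimal sketch of the lease idea, with an in-memory map standing in for the LMDB table (all names hypothetical; in the real store each update would be a single atomic write transaction, so concurrent daemons cannot corrupt each other's leases):

```rust
use std::collections::HashMap;
use std::time::{Duration, SystemTime};

// In-memory stand-in for an LMDB table mapping digest -> lease expiry.
struct Leases(HashMap<String, SystemTime>);

impl Leases {
  // Mark a blob as leased for `lease_time` from now; re-leasing just
  // overwrites the previous expiry.
  fn lease(&mut self, digest: &str, lease_time: Duration) {
    self.0.insert(digest.to_owned(), SystemTime::now() + lease_time);
  }

  // A blob is protected from GC while any daemon's lease is unexpired.
  fn is_leased(&self, digest: &str) -> bool {
    self
      .0
      .get(digest)
      .map_or(false, |expiry| *expiry > SystemTime::now())
  }
}
```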

@stuhood
Member Author

stuhood commented Nov 28, 2017

+1 on leases... so at read time, if it looks like the lease will expire in X time, renew it.
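The read-time check could be as small as the following sketch (`renewal_window` is the "X time" above; all names hypothetical):

```rust
use std::time::{Duration, SystemTime};

// Decide at read time whether to renew: renew if there is no lease on
// record, or if the current lease would expire within `renewal_window`.
fn should_renew(expiry: Option<SystemTime>, renewal_window: Duration) -> bool {
  match expiry {
    None => true,
    Some(expiry) => expiry <= SystemTime::now() + renewal_window,
  }
}
```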

@stuhood
Member Author

stuhood commented Jan 12, 2018

@illicitonion : I think that we'll want a relatively short gap between #5106 landing and this change, as it would be nice to avoid needing to deal with backwards compatibility of the LMDB store quite yet.

@stuhood stuhood modified the milestones: 1.4.x, 1.5.x Jan 12, 2018
@illicitonion
Contributor

Yeah, that's a pretty reasonable concern - will try to get this put together today :)

@illicitonion
Contributor

Ok, I've pushed the mechanics of a possible lease-and-garbage-collect implementation to https://github.com/twitter/pants/tree/dwagnerhall/fs/cas-garbage-collection - the commit message has some discussion around how to actually wire that up to be useful. I'm going to put together an implementation of (a), because it should be quick and easy, but I could be convinced that (b) is worthwhile...

@illicitonion
Contributor

Actually, I think I'll wait until we have a chat before implementing (a) - it's not entirely clear to me what the best way to write a "walk the graph to find all DigestFile nodes" block would be.

I suspect that the correct thing to do is to give Core a Set<Digest> field, and have Core kick off a background thread which does the whole timer-at-interval thing to lease each thing in that Set. DigestFile::run would then add to the Set via its reference to Context (yay handy context objects!), but I'm not sure what would remove things from that Set; maybe DigestFile would get a Drop impl which does the removal? I believe that Nodes get dropped when they're invalidated...

Alternatively, I could add a kind of hacky single-use Walk implementation on Graph which uses all roots, rather than requiring them to be specified? Or an even hackier single-use get_all_nodes_of_kind implementation on Graph which just iterates over the Nodes HashMap, and have the interval-timer thread poll Graph for the nodes it should be looking at, extending the leases on those...
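For that last option, a pared-down sketch (Graph and NodeKey here are hypothetical stand-ins, not the real types):

```rust
use std::collections::HashMap;

// Pared-down, hypothetical stand-ins for the real Graph and NodeKey.
enum NodeKey {
  DigestFile(String), // carries the digest of the file it represents
  Other,
}

struct Graph {
  nodes: HashMap<u64, NodeKey>,
}

impl Graph {
  // The "hacky single-use" accessor: iterate the node map and collect the
  // digests of all DigestFile nodes, so an interval-timer thread can
  // extend their leases without needing explicit roots.
  fn live_digests(&self) -> Vec<String> {
    self
      .nodes
      .values()
      .filter_map(|node| match node {
        &NodeKey::DigestFile(ref digest) => Some(digest.clone()),
        _ => None,
      })
      .collect()
  }
}
```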

WDYT?

@stuhood
Member Author

stuhood commented Jan 16, 2018

but I'm not sure what would remove things from that Set; maybe DigestFile would get a Drop impl which does the removal? I believe that Nodes get dropped when they're invalidated...

Nodes are cloned in various places, so I think that that Drop impl would be a bit too clever.

Or an even hackier single-use get_all_nodes_of_kind implementation to Graph which just iterates over the Nodes HashMap, and have the interval-timer thread poll Graph for the nodes it should be looking at, and extending the leases on those...

I think this is pretty reasonable, and is what I had in mind when I mentioned "walking the graph to find all DigestFile nodes".


Regarding A/B from twitter@880d7ff: it's not clear how B actually results in cleanup occurring... i.e., what's looking for expired leases?

@illicitonion
Contributor

We had some discussion on Slack; we settled on:

  1. When a DigestFile node is created, leasing its Digest.
  2. Kicking off a background thread from Python which periodically acquires the fork lock, and calls into the Rust to acquire the Graph lock and traverse the Graph, re-leasing any DigestFile nodes. The traversal will be very special-cased.
  3. Kicking off a background thread from Python which periodically acquires the fork lock, and calls into the Rust to initiate a GC pass.

2 and 3 may end up being the same service (which can be modelled on the FSEventService), or may end up being separate.
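Tying the earlier sketches together, the Rust entry points that services 2 and 3 would call might look something like this (Graph and Leases are the hypothetical types sketched above; the real Scheduler/Store APIs will differ):

```rust
use std::time::{Duration, SystemTime};

// Hypothetical Rust-side entry points, reusing the `Graph` and `Leases`
// sketches above. The Python threads hold the fork lock while calling in.
struct Scheduler {
  graph: Graph,
  leases: Leases,
}

impl Scheduler {
  // Service 2: traverse the Graph and re-lease every DigestFile's digest.
  fn lease_files_in_graph(&mut self, lease_time: Duration) {
    for digest in self.graph.live_digests() {
      self.leases.lease(&digest, lease_time);
    }
  }

  // Service 3: a GC pass that drops any lease which has expired; the real
  // implementation would also delete the corresponding blobs from LMDB.
  fn garbage_collect_store(&mut self) {
    let now = SystemTime::now();
    self.leases.0.retain(|_digest, expiry| *expiry > now);
  }
}
```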

We're going to punt, for now, on things which don't appear in the graph as DigestFile nodes (e.g. files in Snapshots received from remote execution).

@stuhood
Member Author

stuhood commented Jan 16, 2018

That looks great, thanks.

We're going to punt, for now, on things which don't appear in the graph as DigestFile nodes (e.g. files in Snapshots received from remote execution).

I think all Digests that are live in the graph would be candidates for this... I expect that the "does this represent digests" method would be quite similar to

```rust
///
/// If this NodeKey represents an FS operation, returns its Path.
///
pub fn fs_subject(&self) -> Option<&Path> {
  match self {
    &NodeKey::ReadLink(ref s) => Some((s.0).0.as_path()),
    &NodeKey::Scandir(ref s) => Some((s.0).0.as_path()),
    &NodeKey::Stat(ref s) => Some(s.0.as_path()),
    _ => None,
  }
}
```
(which you've updated in one of your branches to be less fragile).
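A hedged sketch of what that analogous method might look like (the DigestFile variant's payload accessor is an assumption; only fs_subject above is real):

```rust
///
/// If this NodeKey represents a Digest held in the Store, returns it, so
/// that a lease-extension pass can keep the underlying blob alive.
///
pub fn digests(&self) -> Option<&Digest> {
  match self {
    &NodeKey::DigestFile(ref d) => Some(&d.0),
    _ => None,
  }
}
```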
