
Selective cache invalidation #700

Merged: 59 commits, May 3, 2017

Conversation

@smashwilson (Contributor) commented Apr 24, 2017

Selectively invalidate the cached Repository state when (a) performing write operations ourselves and (b) receiving filesystem events. The goal is to reduce the number of times we need to shell out to git at all, avoiding the performance penalty of launching a new process each time.

Fixes #671. Related to #201 and #627.

Approach

The setup for the actual caching of operations was done back in #654, where both the Cache object and @invalidate() decoration were introduced. This pull request expands on that foundation by:

  • Adding an argument to @invalidate(): a function that returns the array of cache keys to evict once the Promise returned by the decorated operation settles, whether it resolves or rejects (a sketch of the decorator follows this list).
  • Reworking the Cache to track "groups" of keyed items for easy eviction of subsets of the cache without needing an O(n) key scan.
  • Passing filesystem events to the Repository rather than just a "hey, something changed somewhere" event, and using the paths of the modified files to invalidate the corresponding parts of the cache.
  • Writing a shit-ton of tests to guard against any operations caching data that we shouldn't. More on that later.
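
For illustration, here's a hedged sketch of roughly what the @invalidate(spec) plumbing described above does; the exact decorator and method signatures in the package differ in detail.

```js
// Simplified sketch: run the decorated operation, then evict whatever keys
// spec(...args) names, whether the operation resolves or rejects.
function invalidate(spec) {
  return function(target, name, descriptor) {
    const original = descriptor.value;
    descriptor.value = function(...args) {
      const evict = () => this.cache.invalidate(spec(...args));
      return original.apply(this, args).then(
        result => { evict(); return result; },
        err => { evict(); return Promise.reject(err); },
      );
    };
    return descriptor;
  };
}

// Illustrative use on a write operation (the key functions chosen per method vary):
// @invalidate(paths => paths.map(fileName => Keys.index.oneWith(fileName)))
// stageFiles(paths) { return this.git().stageFiles(paths); }
```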

Along the way:

  • I tinkered with getCurrentBranch(), because the filesystem event tests revealed a race condition: if an external process changed HEAD between the file content check and the git symbolic-ref call, the method would fail with an error.
  • I moved .isPartiallyStaged() from Present to Repository and rewrote it in terms of .getStatusesForChangedFiles(), so that it takes advantage of cached status results (a sketch follows this list). Also, it was annoying to test with the test harness.
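
As an example of the second point, isPartiallyStaged() can be phrased against the cached status results roughly like this. The stagedFiles/unstagedFiles property names are assumptions and the real method distinguishes specific status combinations; this is only a sketch of the shape.

```js
// Hedged sketch: a file is "partially staged" when it shows up with both
// staged and unstaged changes at the same time.
async function isPartiallyStaged(repository, fileName) {
  const {stagedFiles, unstagedFiles} = await repository.getStatusesForChangedFiles();
  return Boolean(stagedFiles[fileName]) && Boolean(unstagedFiles[fileName]);
}
```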

Cache implementation

All of the actual Cache implementation lives in lib/models/repository-states/present.js. It uses a pair of Maps: one that maps a unique string to the Promise generated the last time an operation was invoked, and one that maps a group name to a set of CacheKey objects that belong to that group.

Externally, callers populate the cache by calling .getOrSet() with a CacheKey. Each CacheKey contains a primary (globally-unique) key, often generated from a filename, and zero or more groups to which that key can belong. The primary key is used to do the lookup for an existing result; if none is present, the actual operation proceeds.
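
For example, an accessor wraps its git call in .getOrSet() roughly like this. The method shape is simplified (the real getFilePatchForPath takes additional options), so treat it as a sketch rather than the actual implementation.

```js
// Hedged sketch of an accessor populating the cache.
getFilePatchForPath(filePath, {staged} = {staged: false}) {
  return this.cache.getOrSet(Keys.filePatch.oneWith(filePath, {staged}), () => {
    // Only shell out to git when there's no cached Promise for this key.
    return this.git().getFilePatchForPath(filePath, {staged});
  });
}
```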

To evict items from the cache, @invalidate() passes it a collection of either CacheKey or GroupKey instances. A CacheKey evicts an item that matches its primary key exactly; a GroupKey evicts all keys that belong to the group it names.
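
Put together, a stripped-down sketch of this structure (simplified from the real implementation in present.js; error handling and some bookkeeping omitted):

```js
class CacheKey {
  constructor(primary, groups = []) {
    this.primary = primary;   // globally-unique key string
    this.groups = groups;     // names of groups this key belongs to
  }

  removeFromCache(cache) {
    cache.removePrimary(this.primary);
  }
}

class GroupKey {
  constructor(group) {
    this.group = group;
  }

  removeFromCache(cache) {
    cache.removeFromGroup(this.group);
  }
}

class Cache {
  constructor() {
    this.storage = new Map();  // primary key string => Promise from the last invocation
    this.byGroup = new Map();  // group name => Set of CacheKeys registered under it
  }

  getOrSet(key, operation) {
    const existing = this.storage.get(key.primary);
    if (existing !== undefined) {
      return existing;
    }

    const created = operation();
    this.storage.set(key.primary, created);
    for (const group of key.groups) {
      if (!this.byGroup.has(group)) {
        this.byGroup.set(group, new Set());
      }
      this.byGroup.get(group).add(key);
    }
    return created;
  }

  invalidate(keys) {
    for (const key of keys) {
      key.removeFromCache(this);
    }
  }

  removePrimary(primary) {
    this.storage.delete(primary);
  }

  removeFromGroup(group) {
    const keys = this.byGroup.get(group) || new Set();
    for (const key of keys) {
      this.storage.delete(key.primary);
    }
    this.byGroup.delete(group);
  }
}
```

Tracking group membership in the second Map is what makes group eviction cheap: evicting a group walks only the keys registered under that group rather than scanning every cached entry.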

To avoid leaving "magic strings" around (and having things fall out of sync), cache keys and groups are generated by functions and constants on the Keys object. The key helpers come in four flavors:

  • a single constant (Keys.changedFiles, Keys.filePatch.all);
  • a function that generates a single key from a parameter (naming convention: .oneWith(); e.g. Keys.index.oneWith(fileName));
  • a function that generates many keys (naming convention: .eachWithXyz(); e.g. Keys.filePatch.eachWithOpts(...));
  • a function that returns an array of related keys that are commonly evicted together (Keys.headOperationKeys()).
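
Here's an illustrative shape for a few of those helpers, reusing the CacheKey and GroupKey classes from the sketch above. The specific key strings, group names, and which keys headOperationKeys() bundles together are assumptions, not the exact implementation.

```js
const Keys = {
  // (a) single constants
  changedFiles: new CacheKey('changed-files', ['changed-files']),

  index: {
    all: new GroupKey('index'),
    // (b) one key per parameter
    oneWith: fileName => new CacheKey(`index:${fileName}`, ['index']),
  },

  filePatch: {
    all: new GroupKey('file-patch'),
    oneWith: (fileName, {staged}) => new CacheKey(
      `file-patch:${staged ? 'staged' : 'unstaged'}:${fileName}`,
      ['file-patch', `file-patch:${fileName}`],
    ),
    // (c) many keys at once
    eachWithOpts: (fileName, ...optsList) =>
      optsList.map(opts => Keys.filePatch.oneWith(fileName, opts)),
  },

  // (d) related keys that are commonly evicted together
  headOperationKeys: () => [Keys.changedFiles, Keys.filePatch.all, Keys.index.all],
};
```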

Test harness

The highest risk is failing to evict a cache key when the result of the underlying operation has changed, while what we want to optimize is evicting the smallest set of cache keys that could change as the result of a specific operation. To make these cases easier to capture, I wrote the assertCorrectInvalidation harness, which:

  • Uses a Map that gives friendly names to example calls of every Repository accessor that contains a this.cache.getOrSet() call.
  • Executes every call in the map once and remembers the Promises returned by each (before).
  • Executes the code under test. Ideally, this executes a single repository method with arguments and pre-existing state set up to result in as many changed Repository accessors as possible.
  • Executes every accessor in the map again and stores the Promises returned this time. If a call was not evicted from the cache by the operation, the Promise it returns will be === to the one collected before. (cached)
  • Clears the Repository's cache.
  • Executes the accessors one more time to determine what accessors actually changed output during the operation. (after)

To perform the actual assertion, the harness diffs (a) after against before to find the set of accessors whose results were actually modified by the operation and (b) cached against before to find the set of accessors that were evicted from the cache.

If an accessor should have been evicted but was not, the harness throws an error and fails the test case; that's the bad case. If an accessor should not have been evicted but was, that's okay: it just means we might have been able to keep something in the cache, so it only causes a failure when {strict: true} is specified. If {verbose: true} is specified, both accessor sets are dumped to the console to make it easier to see what's going on and fine-tune the eviction.
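
Condensed, the flow looks roughly like this. It's a sketch under a few assumptions: accessors arrive keyed by friendly name in a Map, the repository's cache can be cleared directly (clearCache() is a hypothetical name), and the value comparison shown is cruder than the real harness.

```js
async function assertCorrectInvalidation(repository, accessorsByName, operation, options = {}) {
  const {strict = false, verbose = false} = options;

  // Invoke every cached accessor and collect the Promises it returns.
  const getPromises = () => {
    const promises = new Map();
    for (const [name, accessor] of accessorsByName) {
      promises.set(name, accessor(repository));
    }
    return promises;
  };

  // Crude value snapshot used to detect real output changes.
  const resolveAll = async promises => {
    const values = new Map();
    for (const [name, promise] of promises) {
      values.set(name, JSON.stringify(await promise));
    }
    return values;
  };

  const before = getPromises();                 // populate the cache
  const beforeValues = await resolveAll(before);

  await operation(repository);                  // code under test

  const cached = getPromises();                 // identical Promises => still cached
  repository.clearCache();                      // assumed API for wiping the cache

  const after = getPromises();                  // ground truth with a cold cache
  const afterValues = await resolveAll(after);

  const changed = new Set();                    // accessors whose output changed
  const evicted = new Set();                    // accessors evicted by the operation
  for (const name of before.keys()) {
    if (afterValues.get(name) !== beforeValues.get(name)) { changed.add(name); }
    if (cached.get(name) !== before.get(name)) { evicted.add(name); }
  }

  if (verbose) {
    console.log('changed:', [...changed], 'evicted:', [...evicted]);
  }

  for (const name of changed) {
    if (!evicted.has(name)) {
      throw new Error(`Stale cache: ${name} changed but was not evicted`);
    }
  }
  if (strict) {
    for (const name of evicted) {
      if (!changed.has(name)) {
        throw new Error(`Over-eviction: ${name} was evicted but did not change`);
      }
    }
  }
}
```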

This PR includes test cases covering every Repository action method that is marked with the @invalidate() decorator, as well as the filesystem events generated by the underlying git operations they use.

Left to do

  • Add arguments to the @invalidate() decorator
  • Invalidate state based on filesystem events
  • Use cached status results for isPartiallyStaged()
  • More efficient wildcard eviction from the cache
  • Rename Keys. properties to clarify at a glance which are constants, which are functions that generate single keys, which are functions that generate multiple keys...
  • Fill in a lot of tests
    • Pending tests for writing a merge conflict to the index.
  • Docs and writeup

@smashwilson (Contributor, Author)

For reference, here's a waterfall I collected before starting this work:

[Screenshot: waterfall data, 2017-04-24 10:58 AM]

@smashwilson (Contributor, Author)

Here's how it looks with some fairly aggressive caching in place:

[Screenshot: waterfall data, 2017-04-24 4:18 PM]

I still have a bunch left to do -- mostly related to putting tests in place that ensure we're invalidating and not invalidating the correct state for each operation. But I think that's a promising start.

@smashwilson requested a review from kuychaco on April 27, 2017
@smashwilson (Contributor, Author)

Haha, yep. The render from the filesystem event does trample the one from the autorefresh, but both finish right around the 100ms mark:

[Screenshot: waterfall data, 2017-04-27 3:46 PM]

I might actually implement the "expected filesystem update" feature I talked about with @BinaryMuse back when we were first discussing autorefresh. In another PR, though; this one is already under the 100ms mark 😁

@smashwilson (Contributor, Author)

On reflection: expected updates might be more important to this than I'd thought, because git status touches the index, which creates an index filesystem event which invalidates the status cache... you can see the uncached git status call in the second waterfall. 🤔
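
That feature is not implemented in this PR; purely as a sketch of the idea, an "expected event" filter might let the Repository register paths it is about to touch itself so the resulting filesystem events don't needlessly invalidate the cache. All names below are hypothetical.

```js
// Hypothetical sketch only: suppress filesystem events that we caused ourselves.
class ExpectedEvents {
  constructor(ttlMs = 500) {
    this.ttlMs = ttlMs;
    this.expected = new Map(); // path => expiration timestamp
  }

  // Call before running a git command that will touch these paths.
  expect(paths) {
    const expiry = Date.now() + this.ttlMs;
    for (const path of paths) {
      this.expected.set(path, expiry);
    }
  }

  // Returns true when an incoming event was anticipated and can be ignored.
  shouldIgnore(path) {
    const expiry = this.expected.get(path);
    if (expiry === undefined) { return false; }
    this.expected.delete(path);
    return Date.now() <= expiry;
  }
}
```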
