Implement Test caching #4587

Closed · ity opened this issue May 12, 2017 · 11 comments

@ity
Contributor

ity commented May 12, 2017

Once #4586 lands, devise a strategy for the implementation of test caching.

There are at least two layers of caching: locally, by means of analysis files from zinc; and, if results are not found locally, by being able to invoke RemoteProcessExecution.

As step 1, we need to find a way of running locally if remote execution doesn't exist or is not hooked up yet.

@jsirois
Contributor

jsirois commented Jun 13, 2017

Some notes on invalidation vs caching. Here invalidation refers to executing task work in Task.invalidated blocks against invalid targets. Caching refers to storing the results of that work in the artifact cache using VersionedTargetSet.results_dir. One further bit of terminology is partition, which is the name for the set of targets passed to the Task.invalidated block:

  • Caching results for len(partition) > 1: This is trivial iff we always run all targets in the partition, but running just the invalid targets in the partition is a nicer experience (you can whittle away at failures in a loop of ::-style runs). Running just the invalid targets, though, requires being able to merge prior results for the partition; ie: knowing the details of junit xml, coverage data, or using tools that do, in order to merge data files. The alternative is to always run all targets in a partition if even one target is invalid. In this way, data files corresponding to the full partition are always generated, and so on a green partition the cached data files will always represent the full green run.

A compromise might be to only cache when all_vts == invalid_vts; ie when the partition goes green and the run was against the full partition. A common scenario would then be:

  1. Jane makes changes / adds new code and iterates ./pants test tests/python/stuff:: gradually getting greener until finally all test targets in the tests/python/stuff:: set pass. She commits the green change, but there is no cached result for it since green state for the partition was approached incrementally.
  2. Jake pulls in Jane's green change and runs ./pants test tests/python/stuff::. There is a cache miss and he does a full local run, but since tests/python/stuff:: is green, all_vts == invalid_vts and the result is now cached for others.

In this scenario, Jake will likely be a CI process, in which case human others will see a cached result from Jane's commit.

In code form, is this approach generically advisable?

class AnyTask(Task):
  def check_artifact_cache_for(self, invalidation_check):
    return VersionedTargetSet.from_versioned_targets(invalidation_check.all_vts)

  def execute(self):
    # Other strategies could be used here; the key bits under consideration are the cases
    # where the resulting partition has more than one actionable target for this task.
    partition = self.context.targets()

    with self.invalidated(partition) as invalidation_check:
      invalid_targets = [invalid_target
                         for vts in invalidation_check.invalid_vts
                         for invalid_target in vts.targets]
      results_dir = self._execute_task_work(invalid_targets)
      if invalidation_check.all_vts == invalidation_check.invalid_vts:
        # We executed against all targets in the partition
        # successfully (`_execute_task_work` did not raise), so cache data files for
        # others to see.
        cache_vts = self.check_artifact_cache_for(invalidation_check)
        self._copy_results(results_dir, cache_vts.results_dir)
      else:
        # We don't cache results; so others will need to re-run this partition.
        # NB: We will presumably commit this change now though and so others will get this
        # partition in a state that executes successfully; so when the 1st of the others
        # executes against this partition; they will hit `all_vts == invalid_vts` and
        # cache the results. That 1st of others is hopefully CI!
        pass

@ity
Contributor Author

ity commented Jun 14, 2017

To me, the compromise sounds like it would be the sanest first step. Internally for buildcache, we do something similar where we only write artifacts to the cache from a CI job and read from everywhere else (laptops, CI). This ensures repeatability and abstracts us from the issues of different configurations/environments, or just the unreliability of caching results of intermediate work. So it comes into play only if you are building/running code that's already in master.

Having said that, this sounds like it would provide little to no benefit while iterating on tests and would only benefit me in CI or when I am running tests locally (with no changes). A thought as a next step: what if we invalidated just the target roots when there is a change, so that
set(all_vts) - set(invalid_vts) = set(valid_vts), and the set(valid_vts) would be read from the cache? This might be similar to what you suggest in the first bullet point, but instead of merging, we would demarcate the output from each target root so that the outputs are easily composable for subsequent runs.
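
A rough sketch of that per-target-root idea, purely illustrative (the helpers _run_tests_for, _copy_results and _publish_report are hypothetical, not existing Task APIs):

class PerRootCachingTask(Task):
  def execute(self):
    target_roots = self.context.target_roots
    with self.invalidated(target_roots, invalidate_dependents=True) as invalidation_check:
      # Only the invalid roots actually run; valid roots are satisfied from the cache.
      for vt in invalidation_check.invalid_vts:
        results_dir = self._run_tests_for(vt.targets)    # hypothetical per-root runner
        self._copy_results(results_dir, vt.results_dir)  # each root keeps its own results_dir
      # Compose per-root results for reporting without merging any data files: each root's
      # junit xml / coverage output stays demarcated under its own results_dir.
      self._publish_report([vt.results_dir for vt in invalidation_check.all_vts])  # hypothetical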

@macripps

macripps commented Jun 14, 2017

It is perennially surprising to me that running ./pants test [targetA] [targetB] composes both targets' classpaths and then runs the tests in that shared classpath. If each of multiple test targets were run on its own (i.e. this would decompose into ./pants test [targetA] and ./pants test [targetB]), the caching would be a true reflection of each target's dependencies.

@stuhood
Member

stuhood commented Jun 14, 2017

Running just invalid though requires being able to merge prior results for the partition; ie: knowing the details of junit xml, coverage data, or using tools that do, to merge data files.

I'd be strongly opposed to doing any merging or splitting of outputs, so I think running only the invalid tests in a partition should not be a goal.

If that means that we can only actually utilize the cache when there are lots of partitions (cache usage would be optimal when len(partitions) == len(targets)), then so be it, IMO. That's the case that I think we should be optimizing by applying more parallelism, etc.

So, I think your example works verbatim, with the exception that there are multiple partitions involved?

partitions = self.partition(self.context.targets())
for partition in partitions:
  with self.invalidated(partition) as invalidation_check:
    ...

EDIT: Actually, trying to read a bit further between the lines here, I think you're probably pointing to the fact that both Ivy and Pex are using the entire target context, and that invalidate_dependents=True is insufficient in that case to determine whether the test needs to rerun? For that reason, jvm_compile currently additionally uses an InvalidationStrategy that includes the resolved ivy versions in the cache key... this correctly incorporates the results of ivy into the key.

I had forgotten that this would be an issue with test caching. The issue is essentially that target fingerprints are fingerprints of the content on disk, rather than a fingerprint of the resulting products (ie, the inputs to the test run).

Looking at this issue again after a while, I wonder whether a better strategy for achieving test caching might be to skip attempting to use the existing fingerprinting/cache mechanism, and double down on what we have planned for remoting. Namely, we would begin implementing what we had planned there in terms of invoking the pytest/junit processes in isolated process sandboxes, and fingerprinting all of their inputs with the API described here.
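
Purely as an illustration of that direction (this is not the engine's Snapshot or isolated-process API; the function and its arguments are made up), a cache key for a sandboxed test process could be derived from a digest over every input file plus the command line and environment:

import hashlib
import os

def fingerprint_sandbox_inputs(sandbox_dir, argv, env):
  # Digest every file materialized into the sandbox, plus the argv and environment,
  # so that the key captures all inputs to the process rather than just source files.
  digest = hashlib.sha256()
  for root, _, files in sorted(os.walk(sandbox_dir)):
    for name in sorted(files):
      path = os.path.join(root, name)
      digest.update(os.path.relpath(path, sandbox_dir).encode())
      with open(path, 'rb') as f:
        digest.update(f.read())
  digest.update('\0'.join(argv).encode())
  digest.update(repr(sorted(env.items())).encode())
  return digest.hexdigest()

A hit on such a key would mean the exact same process ran with the exact same inputs, so its outputs (junit xml, coverage files) could be replayed verbatim.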

@stuhood
Member

stuhood commented Jun 14, 2017

Namely, we would begin implementing what we had planned there in terms of invoking the pytest/junit processes in isolated process sandboxes, and fingerprinting all of their inputs with the API described here.

More concretely, the exact issues would be #4586, #4585, and then a new issue to invoke pex on Snapshot inputs using the isolated process abstraction available here: https://github.com/pantsbuild/pants/blob/master/tests/python/pants_test/engine/test_isolated_process.py


EDIT: As a temporary alternative, doing what ResolvedJarAwareFingerprintStrategy does for Ivy for pex requirements, and using it in the pytest runner, might work. It's clearly not the long-term solution, since it's a whitelist of inputs to fingerprint rather than a blacklist, but it might be sufficient here.

Given that both #remoting and test caching are currently goals, I think we'd be happy to have either solution.


EDIT2: @kwlzn pointed out that implementing what you proposed and only supporting caching in a very specific case ("opportunistic, root-level caching on the task side": ie, when all_vts == invalid_vts) would be fine for now.

We'll work on getting a more complete definition of #4586 (or perhaps a followup?) posted, which will describe executing pex invokes in the engine as described above.

@benjyw
Contributor

benjyw commented Jun 15, 2017

Just to be sure I understand the problem, the issue is that what you need to cache is not just a pass/fail bit for each target, but other state, such as coverage? What exactly is the current universe of such "other state"?

@stuhood
Member

stuhood commented Jun 15, 2017

@benjyw : This is the root of the complexity, IMO:

Actually, trying to read a bit further between the lines here, I think you're probably pointing to the fact that both Ivy and Pex are using the entire target context, and that invalidate_dependents=True is insufficient in that case to determine whether the test needs to rerun? For that reason, jvm_compile currently additionally uses an InvalidationStrategy that includes the resolved ivy versions in the cache key... this correctly incorporates the results of ivy into the key.

In essence: because pex and ivy look at "the entire context", and then create products that later tasks will consume, just looking at the transitive deps for any particular target is insufficient to get an accurate cache key (because something outside your transitive deps might have caused a different pex/ivy resolution).
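
As a purely illustrative sketch of the kind of key construction this implies (plain Python, not pants' actual FingerprintStrategy API), the pins produced by the global resolve get mixed into each test target's cache key:

from hashlib import sha1

def resolve_aware_fingerprint(target_fingerprint, resolved_versions):
  # Combine a target's own content-based fingerprint with the pins produced by the
  # global pex/ivy resolve, so that a change anywhere in the context that shifts
  # resolved versions also changes this target's test cache key.
  hasher = sha1(target_fingerprint.encode())
  for name, version in sorted(resolved_versions.items()):  # e.g. {'requests': '2.18.4'}
    hasher.update('{}=={}'.format(name, version).encode())
  return hasher.hexdigest()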

@jsirois
Contributor

jsirois commented Jun 20, 2017

@ity, re:

Having said that, this sounds like it would provide little to no benefit while iterating on tests and would only benefit me in CI or when I am running tests locally (with no changes).

As a local developer, this strategy provides the status quo for iteration since the invalidation (not caching) machinery is still in play. The only upside offered beyond the status quo is on cache hits (usually from CI runs), which would no-op or speed up the cold case. This strategy takes the conservative tack and does no "math", such that non-identical target sets, ie: ./pants test.pytest --fast tests:: vs ./pants test.pytest --no-fast tests:: vs ./pants test.pytest --fast tests/python/pants_test/base::, are considered inequivalent and only exact matches are cache hits. This is "obviously" safe - no knowledge of what the task actually does is needed beyond the invariant that it acts as a true function. The downside is that developers will not get cache hits unless CI tests the world fast and the dev does too (unlikely!), or CI tests the world slow and the dev tests a subset of the world slow (much more likely). We need to move towards the latter for sanity's sake. Other schemes require deep knowledge of / special code for the task at hand, afaict.
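
A minimal sketch of that conservative keying (a hypothetical helper, not the actual artifact cache key code): the key covers both the task options and the exact partition contents, so only identical invocations are cache hits.

from hashlib import sha1

def partition_cache_key(option_values, target_fingerprints):
  # option_values: e.g. {'fast': True}, so --fast and --no-fast runs never collide.
  # target_fingerprints: address -> fingerprint for every target in the partition, so
  # only an identical target set with identical content produces the same key.
  hasher = sha1()
  for name, value in sorted(option_values.items()):
    hasher.update('{}={}'.format(name, value).encode())
  for address, fingerprint in sorted(target_fingerprints.items()):
    hasher.update('{} {}'.format(address, fingerprint).encode())
  return hasher.hexdigest()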

@agenticarus agreed

@stuhood lots of edits there, and they all revolve around thinking more deeply about how tasks behave - namely the global resolution tasks. Per my response to Ity above, I'd love to ignore those problematic deep behaviors and set up caching initially to do the only obvious safe thing.

@benjyw: coverage and the junit xml (some folks point CIs at these), for this specific py.test case.

@jsirois
Contributor

jsirois commented Jun 20, 2017

Noting the results of an offline meeting re: this discussion to date. It's agreed that the minimal correct caching strategy makes sense to apply for now to both the pytest and junit test runners. The next step after that is completed will be #4676.

@benjy

benjy commented Jul 20, 2017

@jsirois hey, not sure if you're aware but the participant here is "benjyw" not "benjy" and I keep getting pinged for this issue. :)

@jsirois
Contributor

jsirois commented Aug 2, 2017

@ity I'm going to close this since I think the motivating cases - python and jvm - are now covered. That's not to say there aren't problems with using test caching in the real world today, for example as pointed out by #4780 wrt loose test file access. There may be some follow-up refactors to extract commonality in the PytestRun and JUnitRun approaches to caching, but I'll file new issues to track those. Please re-open with a comment, though, if this doesn't make sense.
