Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reuse input map and merkle tree for tools with lots of inputs #10875

Closed
JaredNeil opened this issue Feb 29, 2020 · 26 comments
Closed

Reuse input map and merkle tree for tools with lots of inputs #10875

JaredNeil opened this issue Feb 29, 2020 · 26 comments
Labels
P2 We'll consider working on this in future. (Assignee optional) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: feature request

Comments

@JaredNeil
Copy link
Contributor

Description of the feature request:

Create a separate input map and merkle tree for each tool used in an action, then combine those to calculate the final action digest. Each tool's input map and merkle tree should be reused between actions that depend on the same tool.

Feature requests: what underlying problem are you trying to solve with this feature?

These two function calls cause remote caching to be really slow for actions with lots of inputs. Having a way to re-use the parts that don't change between multiple actions that use the same tool could make these significantly faster.
Tools written in JavaScript will often have a high number of runfiles (>10,000) because they depend on the node_modules folder. This is also sometimes true of Python tools.

Have you found anything relevant by searching the web?

Any other information, logs, or outputs that you want to share?

In our specific case, we have an action per source file that uses a nodejs_binary tool. So we have 15,000 actions with the same ~10,000 runfiles for the tool and 1 unique input file. Each of these takes ~200ms to calculate the cache key, so that's 50 CPU-minutes of work just to check if the actions are cached.

Screenshot of profile showing Merkle Tree build times

@irengrig irengrig added team-Remote-Exec Issues and PRs for the Execution (Remote) team untriaged labels Mar 4, 2020
@moroten
Copy link
Contributor

moroten commented Mar 26, 2020

I've been profiling our Bazel and remote execution when building C/C++ code. We have experimented with a packaged C/C++ toolchain, but 10000 extra input files made MerkleTree.build very slow. The numbers below are with a system installed toolchain.

Bazel 2.2.0 gives me average self time of (listed numbers above 1 ms):

Remote.download                       35 ms (network latency)
check cache hit                       17 ms (network latency)
MerkleTree.build                      11 ms (cpu)
AbstractSpawnStrategy.getInputMapping  3 ms (cpu)

14ms/action * 100k actions / 4 CPU cores = 6 minutes for a fully cached build.

I tried to modify and cache parts of MerkleTree.build, but I find myself in the fundamental problem of always having to loop through all the files. Just that loop (10k inputs) will be too slow for 100k actions.

Instead, I tried to create a quick hash directly on the NestedSet returned by spawn.getInputFiles(). NestedSets are frozen after creation, so I added a weak hash map NestedSet.children (Object[]) -> hash. Note that the same input file might be included multiple times in the final hash, but that is okay.

Further, I use the hash for spawn.getInputFiles() to map to the result of MerkleTree.build. If I trigger discarding analysis cache with e.g. bazel query, the next run will then make use of the populated getInputFiles() -> MerkleTree hash map.

The result is then, given a populated getInputFiles() -> MerkleTree hash map (modified Bazel 3.0):

Remote.download                          35 ms (network latency)
check cache hit                          17 ms (network latency)
MerkleTree.build                        n/a ms (already cached)
AbstractSpawnStrategy.getInputMapping   n/a ms (already cached)
hashing of spawn.getInputFiles()       0.03 ms (cpu)

Perfect! All the cpu time has gone. 0.03ms/action * 100k actions = 3s.

As the spawn.getInputFiles() hash should be stable across Bazel clients (given the same Bazel version), I suggest that the getInputFiles() -> MerkleTree is actually uploaded to the remote cache. To avoid another network round trip, maybe the Remote Execution API should be updated to allow alias hashes for the action cache.


Now, Remote.download turns out to download the .d files which are listing all include used headers. Disabling the .d handling (in the Bazel code base) resulted in:

Remote.download                           0 ms (network latency)
check cache hit                          20 ms (network latency)
MerkleTree.build                        n/a ms (already cached)
AbstractSpawnStrategy.getInputMapping   n/a ms (already cached)
hashing of spawn.getInputFiles()       0.03 ms (cpu)
postprocessing.run                        9 ms (cpu <- What is this?)

What is postprocessing.run doing for a CppCompileAction?

Maybe the best way would be to cache the .d files on disk, and until any update of RE API, also store the getInputFiles() -> MerkleTree cache on disk.

@meisterT
Copy link
Member

cc @buchgr @ulfjack

@ulfjack
Copy link
Contributor

ulfjack commented Mar 30, 2020

@moroten Great analysis!

One option that we considered was building a merkle tree based on the nestedset structure, and merging them at subsequent levels. Did you try that?

@moroten
Copy link
Contributor

moroten commented Mar 30, 2020

@ulfjack My assumption is that the NestedSet structure doesn't reassemble a directory structure. The dependency graph might also include diamond structures. Do you want to make use of the fact that in most cases, each level of the NestedSet should only contain files from a single directory? I have not been able to conclude any good approach on this.

Another test I did was to call AbstractSpawnStrategy.getInputMapping and then cache merkle trees for chunks of the list (i.e. cache each directory). This cut the MerkleTree.build time by 50-75%, but that is far from enough.

@moroten
Copy link
Contributor

moroten commented Apr 13, 2020

I've been optimizing and profiling Bazel regarding this topic. My test example is running about 50000 actions of which most are compiling C/C++ code using remote execution with --remote_download_minimal to BuildBarn. When profiling, I've set --jobs to half of the number of logical cores, to avoid seeing thread stalls in the profile.

For each test, I run bazel test //... followed by bazel build //:foo --platform=something-else, to trigger "discarding analysis cache", and at last profiling bazel test //.... This way, I know that I will receive 100% remote cache hit.

My base line of Bazel 2.2.0 runs the execution phase in 34 s. Looking at the profile below show that most CPU is spent on getInputMapping and MerkleTree.build. This is with a system installed gcc, an packaged gcc would make the figures much worse.

2020-04-13-bazel-original

The fix is to calculate a fast hash, based on NestedSets/depsets. An action with fast hash is stored in the action cache to map to the real action key, based on the merkle tree. (The action cache storage is used in a hacky way.) A build based on Bazel 3.0.0rc2 shows that the CPU time has been converted into network latency, which is good, but the execution phase still takes 34 s.

2020-04-13-bazel-action-alias

To remove the latency, I added an in memory cache with eviction after one hour. This cuts the execution phase to 13 s. Most of the profile graph was empty, so I added more profiling points, see the red selection in the image below. (This test is based on Bazel 3.1.0rc1. The screen dump still covers 10 ms)

2020-04-13-bazel-in-memory-cache

  • Applying --use_action_cache=false saves 0.2 ms/action = 10 s in total.
  • A big chunk of the "compiling" task (0.3/0.7 ms) is spent on .d file parsing and handling. This seems unnecessary under remote execution, as the environment is sandboxed anyway. Being able to disable that handling would be good.
  • The discoverInputs and addDiscoveredInputs are slow (my added profiling points). I suspect that some NestedSet.toList() and other set operations takes time here. Would it be possible to just take the values straight from the BUILD.bazel files and not "discover" dependencies dynamically?
  • What is happening after the action has finished? Something is slow in AbstractParallelEvaluator.run().

@ulfjack
Copy link
Contributor

ulfjack commented Apr 16, 2020

C++ dependency discovery is complicated. Google internally uses the include scanner (which is open source, but not hooked up in Bazel), and that is a hard requirement for C++ modules (to reduce the number of inputs based on which modules are actually necessary rather than shipping the full set of transitive inputs).

I seem to remember that there is a way to disable .d file parsing, but keep in mind that it's problematic to do that if you use dynamic execution (the action doesn't know whether it's used with local or remote execution, or both). It's also incompatible with include scanning (which is different from what shows up as dependency discovery in the profile).

This would require a C++ expert to look into it.

@moroten
Copy link
Contributor

moroten commented Apr 21, 2020

For reference in my investigation, the sizes of the input lists are distributed as follows among the 44479 actions:

Number of input files Number of actions
1-10 26451
11-30 1399
31-100 2891
101-300 8775
301-1000 3028
1001-3000 572
3001-10000 217
10001-30000 1146

@moroten
Copy link
Contributor

moroten commented Apr 21, 2020

Below follows another view of my comment with profiling pictures above, but with summarized numbers from the profiling.

After removing all the actions with more than 10000 inputs in my C/C++ build, I did some more profiling. The numbers below were measured using --jobs=1 and with an in memory AC+CAS inside Bazel. The AC+CAS was fully populated to avoid an idling CPU due to network latency.

An adapted Bazel 3.0 gave
total execution phase duration was 50.24 s.
ActionContinuation.execute took 22.67 s of which 16.11 s is AbstractSpawnStrategy.getInputMapping+MerkleTree.build(ActionInput).
postprocessing.run took 3.47 s.

Using the fast cache based on depsets,
the total execution phase duration shrank to 35.26 s.
ActionContinuation.execute took 7.38 s of which 0.961 s is the fast hashing.
postprocessing.run took 3.27 s.

Saving 30 % CPU of for a fully cached build is not bad. The time outside the action execution seems to be updating the build graph, i.e. code related to "discover inputs" and the Skyframe framework (which does depset.toList() in a few places).

Also, plotting a graph over number of inputs vs. MerkleTree.build duration was roughly 0.1 ms/action + 0.002 ms/input in cpu time.

@moroten
Copy link
Contributor

moroten commented May 11, 2020

I've uploaded a branch with the code I'm testing right now (rebased version from 3.1.0, hopefully compiles): https://github.com/moroten/bazel/commits/optimize-remote-execution
The code abuses the remote action cache for uploading fast aliases. It also has in-memory CAS+AC, also integrated with findMissingBlobs(). The code is missing tests and is not implemented in any good structure, so use it only to find your own bottlenecks.

@buchgr
Copy link
Contributor

buchgr commented May 12, 2020

@moroten thanks! Can you also provide the repository you used to benchmark your fork?

@moroten
Copy link
Contributor

moroten commented May 12, 2020

@buchgr Unfortunately no. It is a proprietary repository with mostly C and C++ code.

When testing a bit more, it seems like activating the in-memory action cache gives a good speedup (roughly 40%), but additionally activating the in-memory CAS for .d files does add much more (roughly 10%). This is with a low latency connection to the Buildbarn server.

@moroten
Copy link
Contributor

moroten commented May 12, 2020

The findMissingBlobs() caching reduces my upload bandwidth usage by 10x, which is good when working remotely these days.

@buchgr
Copy link
Contributor

buchgr commented May 12, 2020

The findMissingBlobs() caching reduces my upload bandwidth usage by 10x, which is good when working remotely these days.

Nice! Skyframe tracks the files modified between builds. A nice optimization to make for incremental builds would be to not call findMissingBlobs() eagerly but instead proactively upload the modified files and assume that all other files already exist remotely.

@moroten
Copy link
Contributor

moroten commented May 12, 2020

The findMissingBlobs() caching reduces my upload bandwidth usage by 10x, which is good when working remotely these days.

Nice! Skyframe tracks the files modified between builds. A nice optimization to make for incremental builds would be to not call findMissingBlobs() eagerly but instead proactively upload the modified files and assume that all other files already exist remotely.

@buchgr Does your idea with Skyframe data idea hold when "discarding analysis cache"? In our CI builds, we are switching configuration (e.g. dbg vs. opt). (Very annoying when we know that our outputs go to different folders anyway.)

@buchgr
Copy link
Contributor

buchgr commented May 12, 2020

I don't know. Probably not. You would need to try it out. I was more thinking about incremental interactive builds anyway. Do you find that optimizing findMissingBlobs() makes a difference for CI builds?

@ulfjack
Copy link
Contributor

ulfjack commented May 12, 2020

I believe that the analysis cache does not include the file system cache, so it should work (for source files; output files are special in another way, so that shouldn't be a problem). Reducing the number of findMissingBlobs entries seems like a nice improvement - ideally, it would integrate with Skyframe somehow, but I don't see how that could be done.

@tomlu
Copy link
Contributor

tomlu commented May 14, 2020

Hi guys, I looked into this extensively a couple years ago. My conclusion was that generalised merkle tree caching doesn't really work. There aren't that many directories that are entirely shared between many actions, and the cache gets really really big.

I did however look into whether you could have so-called "input sets", which would be additional pre-cached sets of inputs supplied to action execution alongside the regular set of inputs. Rules would explicitly request caching for these based on knowledge that they would be large, frequently used sets. I think the idea was to do it for toolchains, runfiles, and middlemen.

Unfortunately the doc with the investigation contains internal information (since it related to our internal build distribution system) so I can't link it here verbatim. I can link any googler that wants it, and I am also happy to discuss general numbers. At a glance it looked like we could achieve something like 50% reduction in repeated inputs.

@JaredNeil
Copy link
Contributor Author

I think the idea was to do it for toolchains, runfiles, and middlemen.

That's what I was trying to suggest in the original issue when I said "each tool's input map and merkle tree should be reused between actions that depend on the same tool".

My conclusion was that generalized merkle tree caching doesn't really work.

Maybe I'm misunderstanding how merkle trees are used, but I don't think we need generalized caching. Couldn't you add a merkleHash field to whatever this "input set" object is and use that to cache the result of the merkle tree hash for that set of files? Once you have the hash, you don't need to keep the merkle tree around, you just calculate it once for each "input set", and then re-use that "input set" for every action that uses that tool.

@tomlu
Copy link
Contributor

tomlu commented May 21, 2020

Once you have the hash, you don't need to keep the merkle tree around, you just calculate it once for each "input set", and then re-use that "input set" for every action that uses that tool.

That's the idea, but one thing you may possibly be missing is that our remote execution protocols only support a single tree. They would need to support multiple. If not, you'd have to merge the input sets with the other inputs, which is relatively expensive.

@JaredNeil
Copy link
Contributor Author

As a sort of follow up to our specific problem that led me to file the original issue, I used vercel/ncc to bundle the node script and all of its dependencies into a single file. The average duration for MerkelTree.build went from 94ms to 0.15ms (625x speedup) and getInputMapping went from 104ms to 0.037ms (2810x speedup). A 100% disk cache hit build of the 40,000 actions that use this tool went from 138s to 14s (9.86x speedup).

@achew22
Copy link
Member

achew22 commented Dec 17, 2020

@JaredNeil do you have something like ncc_binary that generates that? Do you check in the binary, or do you build it dynamically? Also, any chance you could drop a gist of what you're doing to do the bundling?

@JaredNeil
Copy link
Contributor Author

We generate it at build time. It takes a little over 60s, but since it's basically always cached, that's not a big problem for us. The ncc_bundle rule I wrote is pretty naive, but should give you something to work from. ncc doesn't work for all of our scripts since it takes too long (at least 30 minutes, never let it finish, I think it got stuck) when bundling some of them. Not a priority for me to investigate further since our other tools are used in less than 1/10th the number of actions. Getting it working in the first place was a kind of hackathon project.

@achew22
Copy link
Member

achew22 commented Dec 21, 2020

@JaredNeil thank you so much for sharing these. In my local tests (admittedly in a synthetic test repo) I'm seeing pretty significant improvements. This is a great trick!

@glukasiknuro
Copy link
Contributor

glukasiknuro commented Jul 14, 2021

Would be possible to speed up computation of input digest by splitting input into disjoint NestedSets? Big libraries, or tools, would land in a single disjoint set each, and digest need to be computed only once for those.

Conceptually, for each NestedSet we could compute common path of all files within this set. To compute disjoint sets, first we could start with all input nested sets, and for nested sets that have unique path, we stop. For NestedSets that may contribute files to the same directories, need to expand those NestedSets to their children, and perform check and possibly expansion again.

I will give it a try, but let me know if there are obvious issues with this solution - @tomlu post above may suggest that in practice it may not give enough benefits, but in principle maybe it could be comparable to @moroten solution with fast digest.

@moroten
Copy link
Contributor

moroten commented Jul 15, 2021

It sounds like an interesting idea, but if I'm thinking of our code structure it would probably not work very well. All code includes basics/stuff.h so all NestedSets will have basics/ in their path. As the library to build resides outside basics/, the unique path will be the root. (Hopefully I understood your idea correctly.)

I hope to get time to experiment with bazelbuild/remote-apis#141 during August, implementing patches for Bazel and Buildbarn.

moroten added a commit to moroten/bazel that referenced this issue Aug 19, 2021
MerkleTree calculations are now cached for each node in the input
NestedSets. This drastically improves the speed when checking for cache
hit. One example reduced the calculation time from 78 ms to 3 ms for
3000 inputs.

This caching can be disabled using --remote_cache_merkle_trees=false
which will reduce the memory footprint. The caching is discarded after
each build to free up memory, the cache setup time is negligible.

Fixes bazelbuild#10875.
moroten added a commit to moroten/bazel that referenced this issue Aug 19, 2021
MerkleTree calculations are now cached for each node in the input
NestedSets (depsets). This drastically improves the speed when checking
for remote cache hits. One example reduced the Merkle tree calculation
time from 78 ms to 3 ms for 3000 inputs.

This caching can be disabled using --remote_cache_merkle_trees=false
which will reduce the memory footprint. The caching is discarded after
each build to free up memory, the cache setup time is negligible.

Fixes bazelbuild#10875.
@moroten
Copy link
Contributor

moroten commented Sep 14, 2021

The experiments have resulted in #13879 which is caching Merkle trees built from each NestedSet. This works great, but there is still some bug to sort out. With that patch, the Remote Execution API can stay as it is.

coeuvre pushed a commit to coeuvre/bazel that referenced this issue Oct 28, 2021
When --experimental_remote_merkle_tree_cache is set, Merkle tree calculations are cached for each node in the input NestedSets (depsets). This drastically improves the speed when checking for remote cache hits. One example reduced the Merkle tree calculation time from 78 ms to 3 ms for 3000 inputs.

The memory foot print of the cache is controlled by --experimental_remote_merkle_tree_cache_size.

The caching is discarded after each build to free up memory, the cache setup time is negligible.

Fixes bazelbuild#10875.

Closes bazelbuild#13879.

PiperOrigin-RevId: 405793372
Wyverald pushed a commit that referenced this issue Oct 28, 2021
When --experimental_remote_merkle_tree_cache is set, Merkle tree calculations are cached for each node in the input NestedSets (depsets). This drastically improves the speed when checking for remote cache hits. One example reduced the Merkle tree calculation time from 78 ms to 3 ms for 3000 inputs.

The memory foot print of the cache is controlled by --experimental_remote_merkle_tree_cache_size.

The caching is discarded after each build to free up memory, the cache setup time is negligible.

Fixes #10875.

Closes #13879.

PiperOrigin-RevId: 405793372

Co-authored-by: Fredrik Medley <fredrik.medley@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P2 We'll consider working on this in future. (Assignee optional) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: feature request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants