Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Experimental] Project Skymeld: Merging Analysis and Execution Phases of Skyframe #14057

Open
joeleba opened this issue Sep 28, 2021 · 21 comments
Assignees
Labels
P1 I'll work on this now. (Assignee required) team-Performance Issues for Performance teams type: feature request

Comments

@joeleba
Copy link
Member

joeleba commented Sep 28, 2021

Description of the problem / feature request:

In a regular build, Bazel loads and analyzes the target patterns with Skyframe to form the ActionGraph, this is the Loading-and-Analysis phase. It then performs some extra-Skyframe setup before commencing executing the actions, or the Execution Phase.

Our hypothesis: by allow interleaving the loading/analysis and execution phases, we could improve the build performance, especially for multi-target builds.

By removing the barrier between the phases, we allow targets which have finished analyzing to immediately start with execution. There tends to be many dormant threads towards the end of the analysis phase, and we could make use of those resources for the execution phase.

More details to follow.

@joeleba joeleba self-assigned this Sep 28, 2021
@joeleba joeleba added team-Performance Issues for Performance teams P1 I'll work on this now. (Assignee required) type: feature request labels Sep 28, 2021
@joeleba
Copy link
Member Author

joeleba commented Apr 12, 2022

A quick progress update:

  1. "Merging Analysis and Execution Phases of Skyframe" is a mouthful so we renamed the project to "Skymeld". If you see Skymeld in the code, it's referring to this project.

  2. The prototype is now basically able to handle build and test commands with a single --package_path. Things that work:

  • Basic build and test commands with a single --package_path.
  • Action conflict checking
  • Action cache
  • Error Reporting (sort of, we still need to redefine the meaning of exceptions where applicable)

Things that still need work/verification (non-exhaustive):

  • Dynamic/Remote execution
  • Progress reporting
  • Profiling
  • Resource management

@joeleba joeleba changed the title [Experimental] Merging Analysis and Execution Phases of Skyframe [Experimental] Project Skymeld: Merging Analysis and Execution Phases of Skyframe Apr 12, 2022
@saraadams
Copy link
Contributor

If I understood correctly at your BazelCon talk (thanks!), 6.0 will include this, guarded by an experimental flag. Can you share the flag(s) to set to enable this? The command-line reference for 6.0 isn't up yet, but I'd like to give it a try.

I found https://bazel.build/reference/command-line-reference#flag--experimental_skymeld_ui - but the documentation suggests this only updates the terminal output. Or does this flag have the undocumented side-effect of also enabling skymeld?

@benjaminp
Copy link
Collaborator

--experimental_merged_skyframe_analysis_execution

@joeleba
Copy link
Member Author

joeleba commented Feb 3, 2023

Hi everyone,

Exciting news! Skymeld has reached a certain level of maturity that we're now comfortable calling for its dogfooding in Bazel!

How to dogfood

  • Run bazel from HEAD, or with a rolling release (this is mostly so that you receive the latest fixes), and
  • Add --experimental_merged_skyframe_analysis_execution to your build and test commands.

Measuring impacts

Skymeld is expected to improve the end-to-end wall time of multi-target builds with remote execution. All the wall time wins should come from the analysis phase time.

It's best if you already have your own mechanism to track build performance. Using bazel-bench to benchmark your builds also provides a good estimate. To ensure a controlled environment, we recommend using a dedicated machine for this.

We're also very interested in any performance wins or issues (e.g. OOM) that you encounter.

Bug Reporting

Please file a bug with:

  • the tag [skymeld-dogfood] in the bug title, and
  • mention @joeleba

Excited to fix all the incoming bugs. Thanks in advance, adventurous dogfooders!

!gif

Edit 2023.07.24: remove the reference to --experimental_skymeld_ui (now no-op).

copybara-service bot pushed a commit that referenced this issue Jun 12, 2023
This is unnecessary. We don't have a use case now where this flag is enabled and
the skymeld flag isn't.

#14057

PiperOrigin-RevId: 539598912
Change-Id: I5e2bda47085c606728b3a4a19d38ee0afa214812
traversaro pushed a commit to traversaro/bazel that referenced this issue Jun 24, 2023
This is unnecessary. We don't have a use case now where this flag is enabled and
the skymeld flag isn't.

bazelbuild#14057

PiperOrigin-RevId: 539598912
Change-Id: I5e2bda47085c606728b3a4a19d38ee0afa214812
@joeleba
Copy link
Member Author

joeleba commented Jul 24, 2023

--experimental_skymeld_ui is now no-op and will be removed. Please remove it from your rc files if you have it enabled.

@sluongng
Copy link
Contributor

I did a small benchmark today against BuildBuddy code base

Building //server (our FOSS version) without remote cache

> hyperfine --prepare 'bazel clean --async' \
            --warmup 1 \
            'bazel build --config=x -k --remote_instance_name="$RANDOM" server' \
            'bazel build --config=x -k --remote_instance_name="$RANDOM" --config=skymeld server'

Benchmark 1: bazel build --config=x -k --remote_instance_name="$RANDOM" server
  Time (mean ± σ):     241.279 s ± 42.673 s    [User: 0.134 s, System: 0.149 s]
  Range (min … max):   169.694 s … 318.005 s    10 runs

Benchmark 2: bazel build --config=x -k --remote_instance_name="$RANDOM" --config=skymeld server
  Time (mean ± σ):     213.551 s ± 75.335 s    [User: 0.118 s, System: 0.129 s]
  Range (min … max):   148.260 s … 400.258 s    10 runs

Summary
  bazel build --config=x -k --remote_instance_name="$RANDOM" --config=skymeld server ran
    1.13 ± 0.45 times faster than bazel build --config=x -k --remote_instance_name="$RANDOM" server

Building //server with remote cache

> hyperfine --prepare 'bazel clean --async' \
            --warmup 1 \
            'bazel build --config=x -k server' \
            'bazel build --config=x -k --config=skymeld server'

Benchmark 1: bazel build --config=x -k server
  Time (mean ± σ):     19.282 s ±  0.473 s    [User: 0.014 s, System: 0.023 s]
  Range (min … max):   18.656 s … 20.218 s    10 runs

Benchmark 2: bazel build --config=x -k --config=skymeld server
  Time (mean ± σ):     17.732 s ±  0.407 s    [User: 0.014 s, System: 0.023 s]
  Range (min … max):   17.118 s … 18.626 s    10 runs

Summary
  bazel build --config=x -k --config=skymeld server ran
    1.09 ± 0.04 times faster than bazel build --config=x -k server

And building + testing all targets (FOSS and Enterprise) with remote cache

> hyperfine --prepare 'bazel clean --async' \
                --warmup 1 \
                'bazel build --config=x -k //...' \
                'bazel build --config=x -k --config=skymeld //...'
Benchmark 1: bazel build --config=x -k //...
  Time (mean ± σ):     27.113 s ±  1.020 s    [User: 0.014 s, System: 0.020 s]
  Range (min … max):   25.938 s … 28.833 s    10 runs

Benchmark 2: bazel build --config=x -k --config=skymeld //...
  Time (mean ± σ):     25.349 s ±  1.155 s    [User: 0.014 s, System: 0.021 s]
  Range (min … max):   23.567 s … 27.314 s    10 runs

Summary
  bazel build --config=x -k --config=skymeld //... ran
    1.07 ± 0.06 times faster than bazel build --config=x -k //...

where --config=x is a set of cross-build config that let us run everything on BuildBuddy RBE Build without Bytes with --jobs=100.


So overall, the perf gain is about 7->13% when skymeld is enabled.

I have been using it on my personal setup in the past week and noticed no observable issues.
We have also seen several customers setting this up in their build against BuildBuddy.

Great job @joeleba!

@joeleba
Copy link
Member Author

joeleba commented Jul 27, 2023

That's great! Thanks for the benchmark @sluongng !

@sluongng
Copy link
Contributor

@joeleba One thing I am very excited about with Skymeld is that it makes a repository action to be very close / similar to a build action. I was wondering if you could comment on the feasibility of making them one: allowing build action to create more targets inside sky frame, similar to Buck2's anon_target.

Having this is important because it would help us eliminate the requirement to run all repository rules locally and instead, to run them remotely with RBE. User's workspace would not need to download any of the external dependencies to start a build and instead, delegate to RBE service to handle the download and caching remotely. Repository rules would then be the same as build rules with additional network dependencies.

Do you think this is something that would be easy/hard to achieve once we stabilize Skymeld?

@fmeum
Copy link
Collaborator

fmeum commented Jul 27, 2023

@sluongng Just note that fully running a repository rule remotely would require some kind of a standalone Starlark interpreter to run the full logic. From what I have seen, Buck2's anonymous targets, when applied to the typical situation repo rule's solve in Bazel, would look more like a genrule that runs curl followed by a SHA256 check. That's something you can do in Bazel today even though it's discouraged.

@sluongng
Copy link
Contributor

@fmeum you could already emulate the action execution (i.e. download the archive, unpack it, and populate it BUILD files by some sort of patching). However, you could not feed the downloaded content and generated BUILD files as build targets back to skyframe. AFAIK, this ability is currently unique to repository rules as skyframe is mostly fixed after the Analysis phase.

So the ask here is not to reimplement http_archive as genrule + curl + tar in build rules. The missing feature is the API to define new build targets, statically or fixed, as a result of running a build rule. Today, we can return fixed output groups, exposed via fixed providers. The content of the output groups could be dynamic, but impossible to use a reference of the dynamic content from another rule.

Just note that fully running a repository rule remotely would require some kind of a standalone Starlark interpreter to run the full logic.

The starlark interpreter could sit on the user's laptop, and missing starlark files could be fetched lazily from RBE cache similar to Build without Bytes. The pain point I am trying to solve here is: In a large repo/workspace with tens to hundreds of thousand external dependencies, asking user to download all of that locally before starting a build remotely is not feasible. As we begin to blur the boundary between Analysis phase and Execution phase, it seems like we should be able to run and cache most of the heavy parts of the Analysis phase remotely... by allowing Execution rules to return typical Analysis rules' results and feed them back into skyframe graph.

@meteorcloudy
Copy link
Member

In a large repo/workspace with tens to hundreds of thousand external dependencies, asking user to download all of that locally before starting a build remotely is not feasible.

Totally understand this pain point.

As we begin to blur the boundary between Analysis phase and Execution phase, it seems like we should be able to run and cache most of the heavy parts of the Analysis phase remotely

Repository rules and build rules are fundamentally different, those differences won't go away just because of enabling Skymeld.

But to solve the pain point, we don't have to merge repository rules and build rules. What we really need is to somehow bring remote cache to repository rules. @Wyverald will work on a true repository cache design, it's probably a good idea also consider how it could work efficiently with remote execution.

FYI @coeuvre, our remote execution expert.

@joeleba
Copy link
Member Author

joeleba commented Jul 28, 2023

@sluongng Sorry, I'm not super familiar with how external dependencies are handled in Bazel. In particular, it's unclear to me how Skymeld could help this. My guess is "Skymeld would allow actions which don't require external deps to be run while the fetching is ongoing". Is that what you meant?

@sluongng
Copy link
Contributor

What we really need is to somehow bring remote cache to repository rules.

I could totally compromise on the solutions as long as the pain point is solved 🤗
I think we already have remotable repository rules today, where a repository rule could execute actions against RBE server to probe for capability (i.e. checking which linker is available).
If that could be extended to a degree where Bazel could skip most of the downloads on the host machine, and lazily load the needed files (i.e. starlark files), then we should be able to buy ourselves another few years before needing the FUSE solution(?)

My guess is "Skymeld would allow actions which don't require external deps to be run while the fetching is ongoing". Is that what you meant?

I was thinking that Skymeld + (some non-trivial amount of changes) might enable us to setup build rules and targets that:
a. Remote executable and network dependant
b. Fetch, unpack and patch archives
c. Expose the contents as new repositories / new sub build targets

The most critical part here is (c).
For example: I could have a target like this in buck2:

go_repo(name="foo", version = "0.0.1")

So when I run bazel build //:foo, after it executed, I could get "sub-targets" like this bazel build :foo[path/to/dir:bar_lib] and other rules could depend on my sub-target.

@Wyverald
Copy link
Member

@sluongng

I was thinking that Skymeld + (some non-trivial amount of changes) might enable us to setup build rules and targets that:
a. Remote executable and network dependant
b. Fetch, unpack and patch archives
c. Expose the contents as new repositories / new sub build targets

As you said, part (c) is very much non-trivial. It essentially asks the action graph and the configured target graph to be intertwined. This is likely to be a multi-year project; for reference, Skymeld has taken multiple years despite having arguably a less ambitious goal (it doesn't shuffle the "phases" around).

Even part (b) is not that trivial. It's basically saying that a build rule should be able to run a repo rule. It's easy to conceptualize a "build" version of http_archive, but in the general case, you need to give the remote worker the capability to run all repo rule APIs (downloads, executes, etc). Either that, or you somehow hand the remote worker a "standalone repo rule runner", which is deceptively hard to do (in short, you'd be handing over an entire Bazel). This is probably what @fmeum meant by his comment above.

And part (a) is also opening the floodgates... Bazel prides itself on hermeticity and reproducibility of builds, and that's what enables a lot of the caching. Repo rules are a clearly demarcated API to introduce nonhermeticity. If you blurred the lines, it'd be unclear when we could cache a build rule.


@meteorcloudy

But to solve the pain point, we don't have to merge repository rules and build rules. What we really need is to somehow bring remote cache to repository rules. @Wyverald will work on a true repository cache design, it's probably a good idea also consider how it could work efficiently with remote execution.

It's not very clear to me what "bringing remote cache to repository rules" means. I could imagine a couple of interpretations:

  1. Every time we want to fetch a repo, we check if the remote cache contains the fetched repo already. If so, we just retrieve the entire repo from the remote cache; if not, we fetch it locally and upload it to the remote cache (or do part (b) above, if we get there).

    • This is IMO of limited value. Put in other words, this is choosing to "download extracted contents from the remote cache" instead of "download archive from original source and extracting locally". It's unclear that trading bandwidth for CPU time is going to be a win.
  2. We no longer try to fetch repos wholesale when a remote cache is used. Instead, we check the remote cache on a per-file level. That is, when we want to load the package @foo//bar, we first try to fetch the @foo//bar:BUILD file from the remote cache. (Similarly for .bzl files.)

    • This is likely what Yun meant, but it's also a much grander scheme. IIUC this is on the same level of implementing an NFS inside Bazel. Any time we try to use a file from an external repo, we'd need to check the remote cache first. IMO this is way beyond the purview of the "true repository cache" design 😅 @lberki will probably have some things to say about that.

@meteorcloudy
Copy link
Member

  1. We no longer try to fetch repos wholesale when a remote cache is used.

Yes, this is what I meant.

IIUC this is on the same level of implementing an NFS inside Bazel.

Thanks for pointing this out! I agree with the conclusion, it is indeed beyond the scope of the "true repo cache" design. I think the key reason is that the current Bazel remote cache works for Bazel generated build outputs, however external repos are not build outputs, instead there are actually sources prepared before the build.

@sluongng
Copy link
Contributor

I agree with the analysis. Much appreciated.

Here is my understanding so far: we agreed on the problem that the analysis phase could be costly to run locally.

In general, there are 2 approaches to solving this:

  1. Make repo rules more similar to build rules: more remote cache / remote exec friendly
  2. Enhance build rules with repo rules capabilities: allow build rules to return "analysis result" and create more nodes in the sky frame's graph.

I was hoping that Skymeld would make (2) more feasible, but it seems like we are on the road of exploring (1) already.

But both approaches would then depend on the final critical component: the Starlark loading during the analysis phase must happen locally where Bazel's JVM run. Hence the need to be able to lazily load Starlark files and other needed dependencies. There are 2 ways to go about it:

a. Lazily load within Bazel, similar to Build without Bytes today.
b. Lazily load outside Bazel, implement a config that would allow loading from a FUSE/NFS mount.

I guess I will wait to see how (b) gona work out with #12823 (comment)

copybara-service bot pushed a commit that referenced this issue Oct 5, 2023
Bazel@HEAD + Downstream: https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/3355#_

The above run didn't perform worse than the latest [automated daily run](https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/3353#_)

#14057

PiperOrigin-RevId: 570950844
Change-Id: I0c38839cb2b9e979265bbe3bff903a54162da4f4
copybara-service bot pushed a commit that referenced this issue Oct 11, 2023
#14057

PiperOrigin-RevId: 572493968
Change-Id: Ifbb1840eb54cea6777e142408b784703382eb4f0
@joeleba
Copy link
Member Author

joeleba commented Oct 16, 2023

This is coming in Bazel 7.0, if nothing changes 😄

@joeleba
Copy link
Member Author

joeleba commented Nov 6, 2023

Note to the consumers of BEP metrics: please be aware of the new meanings of analysis_phase_time_in_ms and execution_phase_time_in_ms.

message TimingMetrics {
// For Skymeld,
// analysis_phase_time_in_ms + execution_phase_time_in_ms >= wall_time_in_ms
//
// The CPU time in milliseconds consumed during this build.
int64 cpu_time_in_ms = 1;
// The elapsed wall time in milliseconds during this build.
int64 wall_time_in_ms = 2;
// The elapsed wall time in milliseconds during the analysis phase.
// When analysis and execution phases are interleaved, this measures the
// elapsed time from the first analysis work to the last.
int64 analysis_phase_time_in_ms = 3;
// The elapsed wall time in milliseconds during the execution phase.
// When analysis and execution phases are interleaved, this measures the
// elapsed time from the first action execution to the last.
int64 execution_phase_time_in_ms = 4;
}
TimingMetrics timing_metrics = 5;

@saraadams
Copy link
Contributor

Note to the consumers of BEP metrics: please be aware of the new meanings of analysis_phase_time_in_ms and execution_phase_time_in_ms.

The new code documents:

   // For Skymeld, 
   // analysis_phase_time_in_ms + execution_phase_time_in_ms >= wall_time_in_ms 

If the sum is always >= than the wall time, does that mean the wall time is completely covered by only analysis and execution phase? How does this relate to

public enum ProfilePhase {
LAUNCH("launch", "Launch Blaze"),
INIT("init", "Initialize command"),
TARGET_PATTERN_EVAL("target pattern evaluation", "Evaluate target patterns"),
ANALYZE("interleaved loading-and-analysis", "Load and analyze dependencies"),
ANALYZE_AND_EXECUTE(
"interleaved loading, analysis and execution",
"Load, analyze dependencies and build artifacts"),
LICENSE("license checking", "Analyze licenses"),
PREPARE("preparation", "Prepare for build"),
EXECUTE("execution", "Build artifacts"),
FINISH("finish", "Complete build"),
UNKNOWN("unknown", "unknown");

Just looking at these two files, I would have assumed the wall time includes all phases, which would then mean that
analysis_phase_time_in_ms + execution_phase_time_in_ms < wall_time_in_ms
is possible, if the time required for the other phases is larger than the overlap achieved through interleaving the analysis and execution phase.

@joeleba
Copy link
Member Author

joeleba commented Nov 23, 2023

If the sum is always >= than the wall time, does that mean the wall time is completely covered by only analysis and execution phase?

Thanks for spotting this. The wall time indeed covers more than only the analysis and execution phases and it's technically possible that analysis_phase_time_in_ms + execution_phase_time_in_ms < wall_time_in_ms.

For context: the comment was mainly to highlight that:

  • analysis_phase_time_in_ms = elapsed time since the first analysis work to the last, and
  • execution_phase_time_in_ms = elapsed time since the first execution work to the last.

These aren't super meaningful, but they sort of fit the reality where we don't have the hard divide between analysis/execution anymore. It's the overlapping of the phases that makes it possible that analysis_phase_time_in_ms + execution_phase_time_in_ms >= wall_time_in_ms, something that's not valid before. The documentation in the code is still wrong though. I'll update it.

copybara-service bot pushed a commit to bazelbuild/intellij that referenced this issue Nov 23, 2023
Context: bazelbuild/bazel#14057 (comment)
PiperOrigin-RevId: 584831759
copybara-service bot pushed a commit that referenced this issue Nov 23, 2023
Context: #14057 (comment)
PiperOrigin-RevId: 584831759
Change-Id: I87359df6551f4221a6e506c1f458ccbeb9b798f2
joeleba added a commit to joeleba/bazel that referenced this issue Nov 23, 2023
Context: bazelbuild#14057 (comment)
PiperOrigin-RevId: 584831759
Change-Id: I87359df6551f4221a6e506c1f458ccbeb9b798f2
meteorcloudy pushed a commit that referenced this issue Nov 23, 2023
Context:
#14057 (comment)
PiperOrigin-RevId: 584831759
Change-Id: I87359df6551f4221a6e506c1f458ccbeb9b798f2
@matts1
Copy link
Contributor

matts1 commented Nov 24, 2023

Relevant

@joeleba One thing I am very excited about with Skymeld is that it makes a repository action to be very close / similar to a build action. I was wondering if you could comment on the feasibility of making them one: allowing build action to create more targets inside sky frame, similar to Buck2's anon_target.

Having this is important because it would help us eliminate the requirement to run all repository rules locally and instead, to run them remotely with RBE. User's workspace would not need to download any of the external dependencies to start a build and instead, delegate to RBE service to handle the download and caching remotely. Repository rules would then be the same as build rules with additional network dependencies.

Do you think this is something that would be easy/hard to achieve once we stabilize Skymeld?

I just wrote a proposal for this - feel free to give feedback: https://docs.google.com/document/d/1OsEHpsJXXMC9SFAmAh20S42Dbmgdj4cNyYAsFOHMibo/edit

mai93 pushed a commit to bazelbuild/intellij that referenced this issue Dec 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P1 I'll work on this now. (Assignee required) team-Performance Issues for Performance teams type: feature request
Projects
None yet
Development

No branches or pull requests

8 participants