linux-sandbox is not available occasionally since Bazel 6.0.0 #18071

Closed
tsawada opened this issue Apr 13, 2023 · 2 comments
Labels
P2 We'll consider working on this in future. (Assignee optional) team-Local-Exec Issues and PRs for the Execution (Local) team type: bug

Comments

@tsawada
Contributor

tsawada commented Apr 13, 2023

Description of the bug:

We noticed that Bazel occasionally (in about 5% of runs in our environment) fails because linux-sandbox is reported as not available.

ERROR: 'linux-sandbox' was requested for default strategies but no strategy with that identifier was registered. Valid values are: [processwrapper-sandbox, standalone, remote, worker, sandboxed, local]

It seems that a 1s timeout was recently introduced for the check of whether linux-sandbox is available (#15414), which might be too tight under load.
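To illustrate the suspected failure mode, here is a minimal sketch of an availability probe guarded by a timeout. It is not Bazel's actual implementation: `probe_available` is a hypothetical helper, and a sleeping Python child process stands in for a sandboxed command that starts slowly on a loaded host.

```python
import subprocess
import sys


def probe_available(cmd, timeout_s):
    """Return True if cmd exits with status 0 within timeout_s seconds.

    Hypothetical helper for illustration only; not Bazel's code.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # A slow probe is indistinguishable from a broken tool here,
        # so a tight timeout reports "unavailable" on a loaded host.
        return False
    except OSError:
        # The binary is missing or cannot be executed.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # Stand-in for a probe that takes about 2 seconds under load.
    slow_probe = [sys.executable, "-c", "import time; time.sleep(2)"]
    print(probe_available(slow_probe, timeout_s=1.0))
    print(probe_available(slow_probe, timeout_s=15.0))
```

With a 1s budget the slow probe is reported as unavailable, while a larger budget lets the same probe succeed, which is the kind of misjudgment described above.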

In our setup, we disable all other, weaker sandboxes for hermeticity, which turns this into a hard failure that is easy to notice. I suspect this is happening in more environments, but people haven't noticed because of the processwrapper-sandbox fallback.

CC @meisterT

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Run Bazel >=6.0.0 with --spawn_strategy=worker,linux-sandbox under heavy load many times.
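A rough way to measure the failure rate is to repeat the invocation and count how often the error message appears. The sketch below assumes the error is written to stderr; `count_marker_failures` is a hypothetical helper, and the command to run is left to the caller.

```python
import subprocess

# Prefix of the error message quoted in this report.
ERROR_MARKER = "'linux-sandbox' was requested for default strategies"


def count_marker_failures(cmd, runs, marker=ERROR_MARKER):
    """Run cmd `runs` times and count the runs whose stderr contains marker."""
    hits = 0
    for _ in range(runs):
        proc = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.PIPE,
            text=True,
        )
        if marker in proc.stderr:
            hits += 1
    return hits
```

For example, `count_marker_failures(["bazel", "build", "--spawn_strategy=worker,linux-sandbox", "//your:target"], runs=100)` would report how many of 100 builds hit the error, where `//your:target` is a placeholder. If the availability check only runs when the Bazel server starts, each run presumably needs a fresh server (for example via `bazel shutdown` between runs); that is an assumption about the setup, not something stated in this report.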

Which operating system are you running Bazel on?

Linux

What is the output of bazel info release?

No response

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

@tsawada
Contributor Author

tsawada commented Apr 13, 2023

The 1s timeout was introduced for #15373.

@Pavank1992 Pavank1992 added the team-Local-Exec Issues and PRs for the Execution (Local) team label Apr 13, 2023
@tjgq tjgq added P2 We'll consider working on this in future. (Assignee optional) and removed untriaged labels Apr 18, 2023
@sluongng
Contributor

sluongng commented May 9, 2023

I am slightly confused here:

  1. Why is the system under load? Bazel only runs this code upon startup (SandboxModule registration), before any action is executed. So the expectation here is that the host system should have enough resources dedicated to running Bazel and downstream actions. Are you suffering from the noisy neighbor problem here (i.e. running multiple Bazel JVMs, isolated by containers without dedicated resources, on the same host)?

  2. If the system does not have adequate resources to execute /bin/true inside a linux-sandbox, isn't that a pretty good sign to fail the build right there, instead of waiting and letting downstream actions suffer the same fate?

  3. Do you restart the Bazel JVM a lot? The Bazel JVM should be long-lived to retain the in-memory analysis cache. Restarting the Bazel JVM from cold will make your build slower and increase the chance of this problem happening.

From the points above, I would imagine your current setup is something like this: one VM or bare-metal host running Bazel in multiple containers. After each build, you discard your Bazel container, killing the Bazel JVM inside. You are oversubscribing the host to saturate CPU usage, which results in a "noisy neighbor" problem: builds from a few busier containers consume too much CPU, stopping new containers from being spun up successfully.

If that's indeed the case, my advice would be to switch to a set of "persistent containers" and reuse the containers and the Bazel JVM inside them between runs. Each container should be assigned a fixed amount of CPU and RAM, enforced by the container runtime, to ensure that it doesn't use more resources than it should.

I don't think the mitigation in #18151 will solve your issue at all. You are only moving the noisy neighbor issue from Bazel's startup phase down to the action execution phase. The 5% failure rate would remain, because all of your action execution would be delayed in scheduling by the same wait time.

iancha1992 pushed a commit to iancha1992/bazel that referenced this issue Jun 2, 2023
…ility

A 1s timeout was introduced in checking whether LinuxSandbox is available, to prevent a complete hangup on broken systems. However, it turned out that it occasionally results in misjudging that linux-sandbox being not available.
`local_termination_grace_seconds` defaults to 15s, which hopefully gives more headroom and configurability in various setups.

Fixes bazelbuild#18071

Closes bazelbuild#18151.

PiperOrigin-RevId: 536953768
Change-Id: I5d344ee5bf06cb9b13a2cba9d077f0981f4430a3
iancha1992 added a commit that referenced this issue Jun 6, 2023
…ility (#18568)

A 1s timeout was introduced in checking whether LinuxSandbox is available, to prevent a complete hangup on broken systems. However, it turned out that it occasionally results in misjudging that linux-sandbox being not available.
`local_termination_grace_seconds` defaults to 15s, which hopefully gives more headroom and configurability in various setups.

Fixes #18071

Closes #18151.

PiperOrigin-RevId: 536953768
Change-Id: I5d344ee5bf06cb9b13a2cba9d077f0981f4430a3

Co-authored-by: Takeo Sawada <myc.monad@gmail.com>

6 participants