Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Manifest-based runfiles lookup is non-hermetic #14500

Closed
fmeum opened this issue Dec 30, 2021 · 16 comments · Fixed by phst/runfiles#2
Closed

Manifest-based runfiles lookup is non-hermetic #14500

fmeum opened this issue Dec 30, 2021 · 16 comments · Fixed by phst/runfiles#2
Assignees
Labels
P1 I'll work on this now. (Assignee required) type: bug

Comments

@fmeum
Copy link
Collaborator

fmeum commented Dec 30, 2021

Description of the problem / feature request:

The Bazel runfiles libraries behave non-hermetically with --enable_runfiles=false.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

  1. Clone https://github.com/fmeum/bazel-non-hermetic-runfiles
  2. Run bazel run //parent:Parent and observe that the runfiles lookup succeeds.
  3. Comment out the data dependency of //child.
  4. Run bazel build //child.
  5. Revert the change made in step 3.
  6. Do not run bazel build //child.
  7. Run bazel run //parent:Parent and observe that the runfiles lookup fails.

What operating system are you running Bazel on?

Linux, but applies to all OSes. This is particularly important on Windows where --enable_runfiles defaults to false.

What's the output of bazel info release?

5.0.0rc3

 Have you found anything relevant by searching the web?

The change that introduced the regression is bfdfa6e. The same behavior can't be reproduced with Bazel 4.2.2, since the runfiles env variables are checked first there before the runfiles manifest is looked up next to the binary.

The root cause of the issue is that the first manifest-based lookup returns a path in the Bazel cache. But the contents of the cache are not hermetic and may contain outdated runfiles manifests for non-top-level binaries. If any of these binaries is invoked and first searches for the manifest next to itself instead of using the manifest set in RUNFILES_MANIFEST_FILE, it will behave non-hermetically.

The same issue can be reproduced with --enable_runfiles=true if for some reason the parent process resolves the runfiles tree symlink to the child process before launching it.

The process described in the original design doc is also affected. I do not know whether it has been updated after the breaking commit or has always been buggy in this way.

Any other information, logs, or outputs that you want to share?

The output of the reproducer with Bazel 4.2.2:

Runfiles variables in parent:
RUNFILES_MANIFEST_FILE=/home/fhenneke/.cache/bazel/_bazel_fhenneke/58c66360f3011c1a109b2b3996e84c32/execroot/__main__/bazel-out/k8-fastbuild/bin/parent/Parent.runfiles/MANIFEST
RUNFILES_DIR=/home/fhenneke/.cache/bazel/_bazel_fhenneke/58c66360f3011c1a109b2b3996e84c32/execroot/__main__/bazel-out/k8-fastbuild/bin/parent/Parent.runfiles
JAVA_RUNFILES=/home/fhenneke/.cache/bazel/_bazel_fhenneke/58c66360f3011c1a109b2b3996e84c32/execroot/__main__/bazel-out/k8-fastbuild/bin/parent/Parent.runfiles
RUNFILES_MANIFEST_ONLY=1
Starting child: /home/fhenneke/.cache/bazel/_bazel_fhenneke/58c66360f3011c1a109b2b3996e84c32/execroot/__main__/bazel-out/k8-fastbuild/bin/child/child

Child started with argv[0]=/home/fhenneke/.cache/bazel/_bazel_fhenneke/58c66360f3011c1a109b2b3996e84c32/execroot/__main__/bazel-out/k8-fastbuild/bin/child/child
Runfiles variables in child:
RUNFILES_MANIFEST_FILE=/home/fhenneke/.cache/bazel/_bazel_fhenneke/58c66360f3011c1a109b2b3996e84c32/execroot/__main__/bazel-out/k8-fastbuild/bin/parent/Parent.runfiles/MANIFEST
RUNFILES_DIR=/home/fhenneke/.cache/bazel/_bazel_fhenneke/58c66360f3011c1a109b2b3996e84c32/execroot/__main__/bazel-out/k8-fastbuild/bin/parent/Parent.runfiles
JAVA_RUNFILES=/home/fhenneke/.cache/bazel/_bazel_fhenneke/58c66360f3011c1a109b2b3996e84c32/execroot/__main__/bazel-out/k8-fastbuild/bin/parent/Parent.runfiles
Found runfile at: /home/fhenneke/git/bazel-non-hermetic-runfiles/data/foo.txt

The output of the last step of the reproducer with Bazel 5.0.0rc3:

Runfiles variables in parent:
RUNFILES_MANIFEST_FILE=/home/fhenneke/.cache/bazel/_bazel_fhenneke/58c66360f3011c1a109b2b3996e84c32/execroot/__main__/bazel-out/k8-fastbuild/bin/parent/Parent.runfiles/MANIFEST
RUNFILES_DIR=/home/fhenneke/.cache/bazel/_bazel_fhenneke/58c66360f3011c1a109b2b3996e84c32/execroot/__main__/bazel-out/k8-fastbuild/bin/parent/Parent.runfiles
JAVA_RUNFILES=/home/fhenneke/.cache/bazel/_bazel_fhenneke/58c66360f3011c1a109b2b3996e84c32/execroot/__main__/bazel-out/k8-fastbuild/bin/parent/Parent.runfiles
RUNFILES_MANIFEST_ONLY=1
Starting child: /home/fhenneke/.cache/bazel/_bazel_fhenneke/58c66360f3011c1a109b2b3996e84c32/execroot/__main__/bazel-out/k8-fastbuild/bin/child/child

Child started with argv[0]=/home/fhenneke/.cache/bazel/_bazel_fhenneke/58c66360f3011c1a109b2b3996e84c32/execroot/__main__/bazel-out/k8-fastbuild/bin/child/child
Runfiles variables in child:
RUNFILES_MANIFEST_FILE=/home/fhenneke/.cache/bazel/_bazel_fhenneke/58c66360f3011c1a109b2b3996e84c32/execroot/__main__/bazel-out/k8-fastbuild/bin/child/child.runfiles/MANIFEST
RUNFILES_DIR=/home/fhenneke/.cache/bazel/_bazel_fhenneke/58c66360f3011c1a109b2b3996e84c32/execroot/__main__/bazel-out/k8-fastbuild/bin/child/child.runfiles
JAVA_RUNFILES=/home/fhenneke/.cache/bazel/_bazel_fhenneke/58c66360f3011c1a109b2b3996e84c32/execroot/__main__/bazel-out/k8-fastbuild/bin/child/child.runfiles
Failed to find runfile at: /home/fhenneke/.cache/bazel/_bazel_fhenneke/58c66360f3011c1a109b2b3996e84c32/execroot/__main__/bazel-out/k8-fastbuild/bin/child/child.runfiles/__main__/data/foo.txt
@fmeum fmeum mentioned this issue Dec 30, 2021
9 tasks
@fmeum
Copy link
Collaborator Author

fmeum commented Dec 30, 2021

I think that bfdfa6e has to be reverted as the resulting lookup process is inherently unsound. This will of course break #10598, but I don't think that issue should be solved by modifying the runfiles lookup process. Would it be possible to merge in runfiles of the top-level target when using --run_under?

fmeum added a commit to fmeum/runfiles that referenced this issue Dec 30, 2021
fmeum added a commit to fmeum/runfiles that referenced this issue Dec 30, 2021
@phst
Copy link
Contributor

phst commented Dec 30, 2021

AIUI bfdfa6e is inconsistent with https://docs.google.com/document/d/e/2PACX-1vSDIrFnFvEYhKsCMdGdD40wZRBX3m3aZ5HhVj4CtHPmiXKDCxioTUbYsDydjKtFDAzER5eg7OjJWs3V/pub. The section "Action items" in https://docs.google.com/document/d/e/2PACX-1vSDIrFnFvEYhKsCMdGdD40wZRBX3m3aZ5HhVj4CtHPmiXKDCxioTUbYsDydjKtFDAzER5eg7OjJWs3V/pub states (emphasis mine):

  • Update //src/tools/launcher:
    • to check for the envvars before checking for a runfiles manifest next to the binary

That statement in https://docs.google.com/document/d/e/2PACX-1vSDIrFnFvEYhKsCMdGdD40wZRBX3m3aZ5HhVj4CtHPmiXKDCxioTUbYsDydjKtFDAzER5eg7OjJWs3V/pub appears to be inconsistent with the "Runfiles discovery" section in the same document, which says to check for argv[0] first. I think checking argv[0] first is problematic for several reasons, in particular, it doesn't work well with Bazel-built binaries invoking other Bazel-built binaries: in that case the runfiles should have been merged by the Bazel rule producing the outermost binary, and the inner binaries should use the outermost runfiles tree. Checking argv[0] should really only be used as a fallback when running a Bazel binary directly without bazel run, when no environment variables are available.

@fmeum
Copy link
Collaborator Author

fmeum commented Dec 30, 2021

I think that the runfiles libraries were indeed working as intended before bfdfa6e and the issue in #10598 is not an issue in Bazel, but in the particular target used with --run_under: If a parent process wants a child process to behave as if it were a top-level binary with respect to runfiles lookup, it has to ensure that none of the runfiles variables (RUNFILES_DIR, JAVA_RUNFILES, RUNFILES_MANIFEST_FILE) are forwarded to the child. The run under script in #10598 doesn't do this, which lead to the observed failure on Windows where the launcher sets the runfiles variables.

@phst
Copy link
Contributor

phst commented Dec 30, 2021

Yeah, I agree that at the very least bfdfa6e should be reverted (and an integration test case be added so that it stays reverted).

phst added a commit to phst/rules_elisp that referenced this issue Dec 30, 2021
fmeum added a commit to fmeum/rules_runfiles that referenced this issue Dec 30, 2021
fmeum added a commit to fmeum/rules_runfiles that referenced this issue Dec 30, 2021
@meteorcloudy
Copy link
Member

/cc @mai93

@meteorcloudy
Copy link
Member

Thanks for reporting this. Taking a closer look, I agree we shouldn't have changed the logic of FindManifestFileImpl to fix #10598. We'll rollback the change.

@meteorcloudy meteorcloudy added breakage P1 I'll work on this now. (Assignee required) release blocker type: bug and removed breakage labels Jan 3, 2022
bazel-io pushed a commit that referenced this issue Jan 4, 2022
*** Reason for rollback ***

Due to #14500

*** Original change description ***

Update find runfiles manifest file logic

This pull request updates the `FindManifestFileImpl` function to look for the runfiles manifest file of the binary first. If it is not found, it can use the manifest file specified in the environment variables `RUNFILES_MANIFEST_FILE` if it exists. The same logic is applied in `runfiles_src.cc` to find runfiles manifest file / directory of cc binaries.

This is done to avoid using the manifest file...

***

PiperOrigin-RevId: 419548921
meteorcloudy added a commit to meteorcloudy/bazel that referenced this issue Jan 5, 2022
*** Reason for rollback ***

Due to bazelbuild#14500

*** Original change description ***

Update find runfiles manifest file logic

This pull request updates the `FindManifestFileImpl` function to look for the runfiles manifest file of the binary first. If it is not found, it can use the manifest file specified in the environment variables `RUNFILES_MANIFEST_FILE` if it exists. The same logic is applied in `runfiles_src.cc` to find runfiles manifest file / directory of cc binaries.

This is done to avoid using the manifest file...

***

PiperOrigin-RevId: 419548921
@meteorcloudy
Copy link
Member

OK I sent #14515 to cherry-pick the rollback into 5.0

Wyverald pushed a commit that referenced this issue Jan 7, 2022
*** Reason for rollback ***

Due to #14500

*** Original change description ***

Update find runfiles manifest file logic

This pull request updates the `FindManifestFileImpl` function to look for the runfiles manifest file of the binary first. If it is not found, it can use the manifest file specified in the environment variables `RUNFILES_MANIFEST_FILE` if it exists. The same logic is applied in `runfiles_src.cc` to find runfiles manifest file / directory of cc binaries.

This is done to avoid using the manifest file...

***

PiperOrigin-RevId: 419548921
@Wyverald Wyverald closed this as completed Jan 7, 2022
@Wyverald Wyverald added this to the Bazel 5.0 Release Blockers milestone Jan 7, 2022
@fmeum
Copy link
Collaborator Author

fmeum commented Jan 10, 2022

@Wyverald Could you update the doc at https://docs.google.com/document/d/e/2PACX-1vSDIrFnFvEYhKsCMdGdD40wZRBX3m3aZ5HhVj4CtHPmiXKDCxioTUbYsDydjKtFDAzER5eg7OjJWs3V/pub with the (previous and now again current) behavior of the runfiles discovery process implemented in Bazel? Specifically, the points 1 and 2 under

The unified runfiles discovery strategy is to:

have to be interchanged. This would help me get this fix into third-party runfiles libraries, e.g. the one for Go.

@Wyverald
Copy link
Member

I don't actually have write access to that doc (I can't find the edit link at all, in fact). @meteorcloudy do you have write access?

@meteorcloudy
Copy link
Member

Hmmm, unfortunately nope, not even comment access..

@fmeum
Copy link
Collaborator Author

fmeum commented Jan 11, 2022

Since this is more or less the only canonical place where the behavior is documented (except for the code itself) and is linked to from a number of commits and third-party resources, it would be great if you could identify the author and regain edit rights.

@Wyverald
Copy link
Member

@laszlocsomor do you by any chance know what happened to your doc? :)

@laszlocsomor
Copy link
Contributor

Hey @Wyverald , I think @lberki obtained access to it. If not, you should be able to request ownership access if the doc is owned by a xoogler. @meteorcloudy I hope you know how this works :)

@meteorcloudy
Copy link
Member

OK, we'll figure it out!

@Wyverald
Copy link
Member

Yeah the problem is really that we can't find the actual link to the doc, just the /pub link. But I'll ask Lukács anyhow. Thanks!

@meteorcloudy
Copy link
Member

Thanks for @laszlocsomor's help, we have recovered the access and fixed the documentation!

fmeum added a commit to fmeum/runfiles that referenced this issue Jan 17, 2022
phst added a commit to phst/rules_elisp that referenced this issue Jan 22, 2022
Now that bazelbuild/bazel#14500 is fixed it should no
longer be necessary.
phst added a commit to phst/rules_elisp that referenced this issue Mar 5, 2022
Now that bazelbuild/bazel#14500 is fixed it should no
longer be necessary.
phst added a commit to phst/rules_elisp that referenced this issue Mar 5, 2022
This addresses bazelbuild/bazel#14343.

This is only possible because bazelbuild/bazel#14500
has been fixed in all supported Bazel versions.
phst added a commit to phst/rules_elisp that referenced this issue Mar 5, 2022
This addresses bazelbuild/bazel#14343.

This is only possible because bazelbuild/bazel#14500
has been fixed in all supported Bazel versions.
phst added a commit to phst/rules_elisp that referenced this issue Mar 5, 2022
Now that bazelbuild/bazel#14500 is fixed it should no
longer be necessary.
phst added a commit to phst/rules_elisp that referenced this issue Mar 5, 2022
Now that bazelbuild/bazel#14500 is fixed it should no
longer be necessary.
phst added a commit to phst/rules_elisp that referenced this issue Feb 23, 2024
This reverts commit ca3ba21.

With bazelbuild/bazel#14500
and bazelbuild/bazel#14515,
this should now be resolved in all supported Bazel versions.
phst added a commit to phst/rules_elisp that referenced this issue Feb 23, 2024
This reverts commit ca3ba21.

With bazelbuild/bazel#14500
and bazelbuild/bazel#14515,
this should now be resolved in all supported Bazel versions.
phst added a commit to phst/rules_elisp that referenced this issue Feb 23, 2024
This reverts commit ca3ba21.

With bazelbuild/bazel#14500
and bazelbuild/bazel#14515,
this should now be resolved in all supported Bazel versions.
phst added a commit to phst/rules_elisp that referenced this issue Feb 23, 2024
This reverts commit ca3ba21.

With bazelbuild/bazel#14500
and bazelbuild/bazel#14515,
this should now be resolved in all supported Bazel versions.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P1 I'll work on this now. (Assignee required) type: bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants