Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow runfiles to be easily/efficiently passed as action inputs #15486

Open
rickeylev opened this issue May 12, 2022 · 7 comments
Open

Allow runfiles to be easily/efficiently passed as action inputs #15486

rickeylev opened this issue May 12, 2022 · 7 comments
Labels
P4 This is either out of scope or we don't have bandwidth to review a PR. (No assignee) team-Rules-API API for writing rules/aspects: providers, runfiles, actions, artifacts type: feature request

Comments

@rickeylev
Copy link
Contributor

Description of the feature request:

If you have an action that needs to take a collection of runfiles as input, then it's very hard, error prone, and inefficient to correctly do this.

To correctly do this, the inputs passed to the action must include:

  • runfiles.files.
  • The Files within runfiles.symlinks. Getting these requires flattening the depset during analysis.
  • The Files within runfiles.root_symlinks. Also requires flattening the depset during analysis.
  • The symlink path strings of runfiles.symlinks. These either must be passed as args to the action, or a second file must be created to store their values and then that file must be passed to the action.
  • The symlink path strings of runfiles.root_symlinks. Same problem as above.
  • The path strings of runfiles.empty_filenames. Accessing this attribute triggers execution of the underlying RunfilesSupplier, which is normally delayed until execution time. Also same problem as with the symlink paths.

The above is possible using pure-starlark, but is non-ideal because it has to flatten depsets (extract all the files and pass them as inputs; put the symlink paths and empty_filenames paths into Args objects).

Normally the answer to "i need runfiles passed to an action" is to use tools and pass FilesToRunProvider. However, that isn't always possible because:

  1. FilesToRun can't be constructed by Starlark
  2. Getting a reference to FilesToRun requires a target, i.e. a macro has to wire together multiple targets
  3. (2) isn't always possible. Some examples: (a) if you want to properly implement --stamp behavior, embed build information, and trigger re-generation correctly, a target must pass all its runfiles et al to the action that generates that information; (b) for things like "build a zip file of myself" or some other packaging-esq type of thing.
  4. FilesToRun is more expensive: it means "materialize this set of paths when the action is run", when the minimum necessary is "make sure any changes to the runfiles re-trigger the action execution" and/or "I just need the underlying inputs available and their symlink path-mapping" (e.g. a manifest)

From looking around, I think Java-native rules basically deal with this problem by using SourceManifestAction or middleman files[1]. The both work about the same: create a file and write the runfiles as a manifest, then have that file be an input to another action. All that work gets deferred until execution, time, though, and all analysis phase pays is passing the runfiles and registering an action.

So, a couple half-baked ideas:

  1. Add a runfiles_inputs arg to ctx.actions.run. This would accept a list of runfiles objects. Like inputs, changes to them would re-trigger action execution. Whether they are materialized, and how they are, I don't know.
  2. Expose some SourceManifestAction-equivalent, i.e. a way to correctly and efficiently transform a runfiles object into a file that can be passed to inputs and and Args objects.

(2) Seems better overall as it's more flexible. If you have a file, you can then pass it to inputs or add it to args.

[1] Small aside: middleman files don't seem to play well with Starlark implemented rules. From what I could tell, all Starlark rules call some RunfilesSupport helper logic during their Java-phase, which registers various actions and outputs. So interacting with runfiles support in the middle of the Starlark phase of a rule tends to create an action/output conflict with what happens in the Java phase.

What underlying problem are you trying to solve with this feature?

Making it easy to correctly and efficiently pass runfiles information to actions.

Which operating system are you running Bazel on?

Linux

What is the output of bazel info release?

google build

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

google build

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

n/a

Have you found anything relevant by searching the web?

#15164 -- "i want to see the runfiles passed to an action"

Any other information, logs, or outputs that you want to share?

No response

@brandjon brandjon added team-Rules-API API for writing rules/aspects: providers, runfiles, actions, artifacts and removed team-Build-Language labels Nov 5, 2022
@comius comius added P4 This is either out of scope or we don't have bandwidth to review a PR. (No assignee) and removed untriaged labels Sep 14, 2023
@comius
Copy link
Contributor

comius commented Sep 14, 2023

Either P2 or P4. It's a pain point for existing rules, that used this natively. I'm split between making things easier, but also maintaining a new thing that was so far not possible in Starlark and rules implemented from scratch in Starlark didn't need that.

@dgoldstein0
Copy link

I am a little confused on the expected semantics, and also not entirely clear on the possibilities with FilesToRunProvider though #7187 makes it sound like the latter isn't really an option for starlark-written rules anyway. Anyhow I have a use case to throw into the mix here: rules for javascript bundlers (think webpack, rollup, esbuild, rspack, turbopack, etc.). JS bundlers are typically a lot more like linkers from other languages - take in lots of files, which could be compiled js (e.g. from typescript, jsx, etc.) or source files, and transform into either a single output or perhaps a directory (when code splitting is used to output multiple files); however unlike e.g. c++ linking, the filesystem layout of the inputs matters a lot. One of the big challenges of this use case today is that JS bundlers typically want to believe that all inputs - generated or not - are part of the same source tree - or else relative imports in them tend to not resolve correctly. Runfiles seem to be the right abstraction for this - they let us hide the difference between the source and generated files - making the File.short_path visible to the bundler rather than the full File.path.

The problem: the rules I've written for js bundlers currently are constructing a binary of the bundler plus all the inputs, and then executing that. which is inefficient - every input change results in rebuilding this binary before doing the actual build - and also makes bazel think that all of our input code is actually part of the build tool (so bzl query --notool_deps gives the wrong answers). If there were some way to pass runfiles objects as inputs to the action, I suspect these rules would be more efficient? But at the very least, it'd let bazel query understand these are inputs and not tool dependencies. And it's probably a blocker for having a useful implementation of persistent workers for these tools. An alternative is copying all the source input files, but copying is noticeably slower than constructing symlinks.

@rickeylev
Copy link
Contributor Author

This topic came up again and I came up with a more concrete API proposal:

runfiles_input = ctx.runfiles_to_action_input(runfiles)
ctx.actions.run(inputs = [runfiles_input], ...)

The runfiles_to_action_input method converts the runfiles to some object that represents all the underlying File objects of the passed-in runfiles. Perhaps a special proxy object, or a custom File implementation based upon e.g. Middleman. Whatever it is, it efficiently ensures the underlying File objects within runfiles are made available as inputs to the action.

I think this is about the most minimal API necessary.

To do something useful, e.g. put all runfiles into a zip, you have to create a manifest file manually (there are efficient user space ways to do this using map_each).

A couple other ideas that just occurred to me:

manifest, runfile_input = ctx.blabla(runfiles) -- generate a manifest for you

Or, model it as a TreeArtifact. Something like:

runfiles_root = ctx.declare_directory("foo_runfiles_root")
ctx.runfiles_to_directory(runfiles, output=runfiles_root)
ctx.actions.run(inputs=[runfiles_root], arguments=[runfiles_root.path])

@dgoldstein0
Copy link

I like the tree artifact idea, as that answers the "how do we translate the runfiles to an equivalent directory structure on disk". Though I wonder how performant that'll be, it'll definitely solve the --notool_deps problem with the "build and then execute a binary" strategy that makes the inputs look like part of the build tools.

That said it's also plausible that we have two separate use cases here - possibly not all actions that want to consume runfiles need the runfiles organization of the files materialized in the filesystem? So we may want two different apis, rather than one.

For the case of not materializing the filesystem directly, one option two new methods:

  1. A new function runfiles.get_all_files() that would return all File objects inside the runfiles regardless of which of the fields of runfiles (files, symlinks, root_symlinks, or empty_files) each file comes from. (I suppose this isn't strictly necessary, but it's a pretty substantial simplification)
  2. Args.add_runfiles() could give more flexibility in how runfiles are passed.

E.g. something like

def _runfile_to_arg(runfilePath, file):
  return "--path={}:{}".format(runfilePath, file.path)
...
args = ctx.actions.args()
args.add_runfiles(runfiles, format_each=_runfile_to_arg)
ctx.actions.run(inputs=[runfiles.get_all_files()], args=[args])

I think this most clearly competes to your manifest idea. Here you can put the manifest information into your Args object, whereas there, the manifest would presumably have a predetermined format and just be another file fed to the action.

@rickeylev
Copy link
Contributor Author

Yeah, given that runfiles are, essentially, representing a directory, using directory artifacts for them feels right.

re: runfiles.get_all_files: Yeah, this would simplify. The trouble is that, under the hood, symlinks/root_symlinks are depsets of SymlinkEntry objects (basically a (path, File) pair). So to get just the File objects the depsets have to be flattened. Or flattening has to be deferred, somehow (which puts us back to "special file to represent runfiles")

re: runfiles.add_runfiles: Oh interesting idea. You can already use (add_all + map_each) to convert a runfiles object to e.g. --path=... args. But perhaps a separate function would be more efficient?

@rickeylev
Copy link
Contributor Author

Thought of an alternative to/variation of runfiles.get_all_files. So under the hood, the problem is there's a depset of SymlinkEntry objects, which aren't File objects, but are wrappers around File objects.

So just have an adapter that unwraps them. e.g. runfiles.get_all_files() returns:

class SymlinkArtifactNestedSet implements NestedSet<Artifact> {
   constructor(NestedSet<SymlinkEntry> symlinks) { this.symlinks = symlinks }
   Artifact getNext() {
     symlinks.getNext().getArtifact();
   }
}

I don't know the exact NestedSet API, but I hope the gist is clear.

@dgoldstein0
Copy link

Huh, those SymlinkEntry don't seem to be exposed to starlark yet, at least not that I could find. Are there plans for that?

github-merge-queue bot pushed a commit to bazelbuild/rules_rust that referenced this issue Oct 3, 2024
…2891)

After #2887 I started
finding similar linker errors in environments that supported runfiles
and symlinks but have yet to see this issue in environments without
them. I believe my understanding of how the
[DefaultInfo.default_runfiles](https://bazel.build/rules/lib/providers/DefaultInfo#default_runfiles)
objects worked and in the previous change had attempted to rely on it
for a runfiles directory. However, there is no guarantee a directory
will be rendered and no way to force it to be rendered for actions
either (bazelbuild/bazel#15486). The
consequence is that creating a custom runfiles directory is no longer a
contingency and explicitly creating it is the correct thing to do to
ensure `CARGO_MANIFEST_DIR` linker inputs are always available to the
consuming actions.

With the propagation of `CARGO_MANIFEST_DIR` (now an action output) into
`Rustc` actions, a new flag is also introduced to prune unnecessary
files out of the directory to reduce bloat and potential invalidation in
downstream actions,
`--@rules_rust//cargo/settings:cargo_manifest_dir_retain_list`. This
flag accepts a list of path suffixes used to determine what (if
anything) in `CARGO_MANIFEST_DIR` should be retained and passed to
downstream actions.

Note that the failure mode this addresses is hard to observe as it
requires a crate which defines a linker path to something in
`CARGO_MANIFEST_DIR` and then for a clean build to be done with a change
to a dependent of said crate. If the runfiles directory for the build
script is never rendered then there will be no file to link into the
crate.
github-merge-queue bot pushed a commit to bazelbuild/rules_rust that referenced this issue Oct 3, 2024
…2891)

After #2887 I started
finding similar linker errors in environments that supported runfiles
and symlinks but have yet to see this issue in environments without
them. I believe my understanding of how the
[DefaultInfo.default_runfiles](https://bazel.build/rules/lib/providers/DefaultInfo#default_runfiles)
objects worked and in the previous change had attempted to rely on it
for a runfiles directory. However, there is no guarantee a directory
will be rendered and no way to force it to be rendered for actions
either (bazelbuild/bazel#15486). The
consequence is that creating a custom runfiles directory is no longer a
contingency and explicitly creating it is the correct thing to do to
ensure `CARGO_MANIFEST_DIR` linker inputs are always available to the
consuming actions.

With the propagation of `CARGO_MANIFEST_DIR` (now an action output) into
`Rustc` actions, a new flag is also introduced to prune unnecessary
files out of the directory to reduce bloat and potential invalidation in
downstream actions,
`--@rules_rust//cargo/settings:cargo_manifest_dir_filename_suffixes_to_retain`.
This flag accepts a list of path suffixes used to determine what (if
anything) in `CARGO_MANIFEST_DIR` should be retained and passed to
downstream actions.

Note that the failure mode this addresses is hard to observe as it
requires a crate which defines a linker path to something in
`CARGO_MANIFEST_DIR` and then for a clean build to be done with a change
to a dependent of said crate. If the runfiles directory for the build
script is never rendered then there will be no file to link into the
crate.

---------

Co-authored-by: Daniel Wagner-Hall <dawagner@gmail.com>
github-merge-queue bot pushed a commit to bazelbuild/rules_rust that referenced this issue Oct 3, 2024
…2891)

After #2887 I started
finding similar linker errors in environments that supported runfiles
and symlinks but have yet to see this issue in environments without
them. I believe my understanding of how the
[DefaultInfo.default_runfiles](https://bazel.build/rules/lib/providers/DefaultInfo#default_runfiles)
objects worked and in the previous change had attempted to rely on it
for a runfiles directory. However, there is no guarantee a directory
will be rendered and no way to force it to be rendered for actions
either (bazelbuild/bazel#15486). The
consequence is that creating a custom runfiles directory is no longer a
contingency and explicitly creating it is the correct thing to do to
ensure `CARGO_MANIFEST_DIR` linker inputs are always available to the
consuming actions.

With the propagation of `CARGO_MANIFEST_DIR` (now an action output) into
`Rustc` actions, a new flag is also introduced to prune unnecessary
files out of the directory to reduce bloat and potential invalidation in
downstream actions,
`--@rules_rust//cargo/settings:cargo_manifest_dir_filename_suffixes_to_retain`.
This flag accepts a list of path suffixes used to determine what (if
anything) in `CARGO_MANIFEST_DIR` should be retained and passed to
downstream actions.

Note that the failure mode this addresses is hard to observe as it
requires a crate which defines a linker path to something in
`CARGO_MANIFEST_DIR` and then for a clean build to be done with a change
to a dependent of said crate. If the runfiles directory for the build
script is never rendered then there will be no file to link into the
crate.

---------

Co-authored-by: Daniel Wagner-Hall <dawagner@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P4 This is either out of scope or we don't have bandwidth to review a PR. (No assignee) team-Rules-API API for writing rules/aspects: providers, runfiles, actions, artifacts type: feature request
Projects
None yet
Development

No branches or pull requests

5 participants