Allow runfiles to be easily/efficiently passed as action inputs #15486

rickeylev · 2022-05-12T19:45:08Z

Description of the feature request:

If you have an action that needs to take a collection of runfiles as input, then it's very hard, error prone, and inefficient to correctly do this.

To correctly do this, the inputs passed to the action must include:

runfiles.files.
The Files within runfiles.symlinks. Getting these requires flattening the depset during analysis.
The Files within runfiles.root_symlinks. Also requires flattening the depset during analysis.
The symlink path strings of runfiles.symlinks. These either must be passed as args to the action, or a second file must be created to store their values and then that file must be passed to the action.
The symlink path strings of runfiles.root_symlinks. Same problem as above.
The path strings of runfiles.empty_filenames. Accessing this attribute triggers execution of the underlying RunfilesSupplier, which is normally delayed until execution time. Also same problem as with the symlink paths.

The above is possible using pure-starlark, but is non-ideal because it has to flatten depsets (extract all the files and pass them as inputs; put the symlink paths and empty_filenames paths into Args objects).

Normally the answer to "i need runfiles passed to an action" is to use tools and pass FilesToRunProvider. However, that isn't always possible because:

FilesToRun can't be constructed by Starlark
Getting a reference to FilesToRun requires a target, i.e. a macro has to wire together multiple targets
(2) isn't always possible. Some examples: (a) if you want to properly implement --stamp behavior, embed build information, and trigger re-generation correctly, a target must pass all its runfiles et al to the action that generates that information; (b) for things like "build a zip file of myself" or some other packaging-esq type of thing.
FilesToRun is more expensive: it means "materialize this set of paths when the action is run", when the minimum necessary is "make sure any changes to the runfiles re-trigger the action execution" and/or "I just need the underlying inputs available and their symlink path-mapping" (e.g. a manifest)

From looking around, I think Java-native rules basically deal with this problem by using SourceManifestAction or middleman files[1]. The both work about the same: create a file and write the runfiles as a manifest, then have that file be an input to another action. All that work gets deferred until execution, time, though, and all analysis phase pays is passing the runfiles and registering an action.

So, a couple half-baked ideas:

Add a runfiles_inputs arg to ctx.actions.run. This would accept a list of runfiles objects. Like inputs, changes to them would re-trigger action execution. Whether they are materialized, and how they are, I don't know.
Expose some SourceManifestAction-equivalent, i.e. a way to correctly and efficiently transform a runfiles object into a file that can be passed to inputs and and Args objects.

(2) Seems better overall as it's more flexible. If you have a file, you can then pass it to inputs or add it to args.

[1] Small aside: middleman files don't seem to play well with Starlark implemented rules. From what I could tell, all Starlark rules call some RunfilesSupport helper logic during their Java-phase, which registers various actions and outputs. So interacting with runfiles support in the middle of the Starlark phase of a rule tends to create an action/output conflict with what happens in the Java phase.

What underlying problem are you trying to solve with this feature?

Making it easy to correctly and efficiently pass runfiles information to actions.

Which operating system are you running Bazel on?

Linux

What is the output of `bazel info release`?

google build

If `bazel info release` returns `development version` or `(@non-git)`, tell us how you built Bazel.

google build

What's the output of `git remote get-url origin; git rev-parse master; git rev-parse HEAD` ?

n/a

Have you found anything relevant by searching the web?

#15164 -- "i want to see the runfiles passed to an action"

Any other information, logs, or outputs that you want to share?

No response

The text was updated successfully, but these errors were encountered:

comius · 2023-09-14T13:58:41Z

Either P2 or P4. It's a pain point for existing rules, that used this natively. I'm split between making things easier, but also maintaining a new thing that was so far not possible in Starlark and rules implemented from scratch in Starlark didn't need that.

dgoldstein0 · 2024-01-20T09:00:59Z

I am a little confused on the expected semantics, and also not entirely clear on the possibilities with FilesToRunProvider though #7187 makes it sound like the latter isn't really an option for starlark-written rules anyway. Anyhow I have a use case to throw into the mix here: rules for javascript bundlers (think webpack, rollup, esbuild, rspack, turbopack, etc.). JS bundlers are typically a lot more like linkers from other languages - take in lots of files, which could be compiled js (e.g. from typescript, jsx, etc.) or source files, and transform into either a single output or perhaps a directory (when code splitting is used to output multiple files); however unlike e.g. c++ linking, the filesystem layout of the inputs matters a lot. One of the big challenges of this use case today is that JS bundlers typically want to believe that all inputs - generated or not - are part of the same source tree - or else relative imports in them tend to not resolve correctly. Runfiles seem to be the right abstraction for this - they let us hide the difference between the source and generated files - making the File.short_path visible to the bundler rather than the full File.path.

The problem: the rules I've written for js bundlers currently are constructing a binary of the bundler plus all the inputs, and then executing that. which is inefficient - every input change results in rebuilding this binary before doing the actual build - and also makes bazel think that all of our input code is actually part of the build tool (so bzl query --notool_deps gives the wrong answers). If there were some way to pass runfiles objects as inputs to the action, I suspect these rules would be more efficient? But at the very least, it'd let bazel query understand these are inputs and not tool dependencies. And it's probably a blocker for having a useful implementation of persistent workers for these tools. An alternative is copying all the source input files, but copying is noticeably slower than constructing symlinks.

rickeylev · 2024-08-14T18:32:43Z

This topic came up again and I came up with a more concrete API proposal:

runfiles_input = ctx.runfiles_to_action_input(runfiles)
ctx.actions.run(inputs = [runfiles_input], ...)

The runfiles_to_action_input method converts the runfiles to some object that represents all the underlying File objects of the passed-in runfiles. Perhaps a special proxy object, or a custom File implementation based upon e.g. Middleman. Whatever it is, it efficiently ensures the underlying File objects within runfiles are made available as inputs to the action.

I think this is about the most minimal API necessary.

To do something useful, e.g. put all runfiles into a zip, you have to create a manifest file manually (there are efficient user space ways to do this using map_each).

A couple other ideas that just occurred to me:

manifest, runfile_input = ctx.blabla(runfiles) -- generate a manifest for you

Or, model it as a TreeArtifact. Something like:

runfiles_root = ctx.declare_directory("foo_runfiles_root")
ctx.runfiles_to_directory(runfiles, output=runfiles_root)
ctx.actions.run(inputs=[runfiles_root], arguments=[runfiles_root.path])

dgoldstein0 · 2024-08-24T19:38:54Z

I like the tree artifact idea, as that answers the "how do we translate the runfiles to an equivalent directory structure on disk". Though I wonder how performant that'll be, it'll definitely solve the --notool_deps problem with the "build and then execute a binary" strategy that makes the inputs look like part of the build tools.

That said it's also plausible that we have two separate use cases here - possibly not all actions that want to consume runfiles need the runfiles organization of the files materialized in the filesystem? So we may want two different apis, rather than one.

For the case of not materializing the filesystem directly, one option two new methods:

A new function runfiles.get_all_files() that would return all File objects inside the runfiles regardless of which of the fields of runfiles (files, symlinks, root_symlinks, or empty_files) each file comes from. (I suppose this isn't strictly necessary, but it's a pretty substantial simplification)
Args.add_runfiles() could give more flexibility in how runfiles are passed.

E.g. something like

def _runfile_to_arg(runfilePath, file):
  return "--path={}:{}".format(runfilePath, file.path)
...
args = ctx.actions.args()
args.add_runfiles(runfiles, format_each=_runfile_to_arg)
ctx.actions.run(inputs=[runfiles.get_all_files()], args=[args])

I think this most clearly competes to your manifest idea. Here you can put the manifest information into your Args object, whereas there, the manifest would presumably have a predetermined format and just be another file fed to the action.

rickeylev · 2024-08-25T17:19:06Z

Yeah, given that runfiles are, essentially, representing a directory, using directory artifacts for them feels right.

re: runfiles.get_all_files: Yeah, this would simplify. The trouble is that, under the hood, symlinks/root_symlinks are depsets of SymlinkEntry objects (basically a (path, File) pair). So to get just the File objects the depsets have to be flattened. Or flattening has to be deferred, somehow (which puts us back to "special file to represent runfiles")

re: runfiles.add_runfiles: Oh interesting idea. You can already use (add_all + map_each) to convert a runfiles object to e.g. --path=... args. But perhaps a separate function would be more efficient?

rickeylev · 2024-08-29T19:08:44Z

Thought of an alternative to/variation of runfiles.get_all_files. So under the hood, the problem is there's a depset of SymlinkEntry objects, which aren't File objects, but are wrappers around File objects.

So just have an adapter that unwraps them. e.g. runfiles.get_all_files() returns:

class SymlinkArtifactNestedSet implements NestedSet<Artifact> {
   constructor(NestedSet<SymlinkEntry> symlinks) { this.symlinks = symlinks }
   Artifact getNext() {
     symlinks.getNext().getArtifact();
   }
}

I don't know the exact NestedSet API, but I hope the gist is clear.

dgoldstein0 · 2024-08-30T05:31:23Z

Huh, those SymlinkEntry don't seem to be exposed to starlark yet, at least not that I could find. Are there plans for that?

…2891) After #2887 I started finding similar linker errors in environments that supported runfiles and symlinks but have yet to see this issue in environments without them. I believe my understanding of how the [DefaultInfo.default_runfiles](https://bazel.build/rules/lib/providers/DefaultInfo#default_runfiles) objects worked and in the previous change had attempted to rely on it for a runfiles directory. However, there is no guarantee a directory will be rendered and no way to force it to be rendered for actions either (bazelbuild/bazel#15486). The consequence is that creating a custom runfiles directory is no longer a contingency and explicitly creating it is the correct thing to do to ensure `CARGO_MANIFEST_DIR` linker inputs are always available to the consuming actions. With the propagation of `CARGO_MANIFEST_DIR` (now an action output) into `Rustc` actions, a new flag is also introduced to prune unnecessary files out of the directory to reduce bloat and potential invalidation in downstream actions, `--@rules_rust//cargo/settings:cargo_manifest_dir_retain_list`. This flag accepts a list of path suffixes used to determine what (if anything) in `CARGO_MANIFEST_DIR` should be retained and passed to downstream actions. Note that the failure mode this addresses is hard to observe as it requires a crate which defines a linker path to something in `CARGO_MANIFEST_DIR` and then for a clean build to be done with a change to a dependent of said crate. If the runfiles directory for the build script is never rendered then there will be no file to link into the crate.

…2891) After #2887 I started finding similar linker errors in environments that supported runfiles and symlinks but have yet to see this issue in environments without them. I believe my understanding of how the [DefaultInfo.default_runfiles](https://bazel.build/rules/lib/providers/DefaultInfo#default_runfiles) objects worked and in the previous change had attempted to rely on it for a runfiles directory. However, there is no guarantee a directory will be rendered and no way to force it to be rendered for actions either (bazelbuild/bazel#15486). The consequence is that creating a custom runfiles directory is no longer a contingency and explicitly creating it is the correct thing to do to ensure `CARGO_MANIFEST_DIR` linker inputs are always available to the consuming actions. With the propagation of `CARGO_MANIFEST_DIR` (now an action output) into `Rustc` actions, a new flag is also introduced to prune unnecessary files out of the directory to reduce bloat and potential invalidation in downstream actions, `--@rules_rust//cargo/settings:cargo_manifest_dir_filename_suffixes_to_retain`. This flag accepts a list of path suffixes used to determine what (if anything) in `CARGO_MANIFEST_DIR` should be retained and passed to downstream actions. Note that the failure mode this addresses is hard to observe as it requires a crate which defines a linker path to something in `CARGO_MANIFEST_DIR` and then for a clean build to be done with a change to a dependent of said crate. If the runfiles directory for the build script is never rendered then there will be no file to link into the crate. --------- Co-authored-by: Daniel Wagner-Hall <dawagner@gmail.com>

sgowroji added type: feature request untriaged team-Build-Language labels May 13, 2022

aherrmann mentioned this issue Sep 2, 2022

Add rule to define hermetic POSIX toolchain tweag/rules_sh#34

Merged

fmeum mentioned this issue Oct 27, 2022

Non-runfile files from an executable get included in the input for ctx.actions.run() #16573

Closed

brandjon added team-Rules-API API for writing rules/aspects: providers, runfiles, actions, artifacts and removed team-Build-Language labels Nov 5, 2022

comius added P4 This is either out of scope or we don't have bandwidth to review a PR. (No assignee) and removed untriaged labels Sep 14, 2023

This was referenced Sep 20, 2024

Update cargo_build_script to always render custom runfiles dirs. bazelbuild/rules_rust#2891

Merged

Cargo incompatible_runfiles_cargo_manifest_dir bazelbuild/rules_rust#2898

Open

rickeylev mentioned this issue Sep 24, 2024

Efficient way to relocate files in runfiles #23728

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow runfiles to be easily/efficiently passed as action inputs #15486

Allow runfiles to be easily/efficiently passed as action inputs #15486

rickeylev commented May 12, 2022

comius commented Sep 14, 2023

dgoldstein0 commented Jan 20, 2024

rickeylev commented Aug 14, 2024

dgoldstein0 commented Aug 24, 2024

rickeylev commented Aug 25, 2024

rickeylev commented Aug 29, 2024

dgoldstein0 commented Aug 30, 2024

Allow runfiles to be easily/efficiently passed as action inputs #15486

Allow runfiles to be easily/efficiently passed as action inputs #15486

Comments

rickeylev commented May 12, 2022

Description of the feature request:

What underlying problem are you trying to solve with this feature?

Which operating system are you running Bazel on?

What is the output of bazel info release?

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

Have you found anything relevant by searching the web?

Any other information, logs, or outputs that you want to share?

comius commented Sep 14, 2023

dgoldstein0 commented Jan 20, 2024

rickeylev commented Aug 14, 2024

dgoldstein0 commented Aug 24, 2024

rickeylev commented Aug 25, 2024

rickeylev commented Aug 29, 2024

dgoldstein0 commented Aug 30, 2024

What is the output of `bazel info release`?

If `bazel info release` returns `development version` or `(@non-git)`, tell us how you built Bazel.

What's the output of `git remote get-url origin; git rev-parse master; git rev-parse HEAD` ?