Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

*_binary args get splitted #12313

Open
dmivankov opened this issue Oct 20, 2020 · 24 comments
Open

*_binary args get splitted #12313

dmivankov opened this issue Oct 20, 2020 · 24 comments
Labels
P3 We're not considering working on this, but happy to review a PR. (No assignee) team-Rules-API API for writing rules/aspects: providers, runfiles, actions, artifacts type: bug

Comments

@dmivankov
Copy link
Contributor

dmivankov commented Oct 20, 2020

Description of the problem / feature request:

Arguments set by args attribute of *_binary get splitted

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

BUILD:

cc_binary(
  name = "main",
  srcs = [":main.c"],
)
cc_binary(
  name = "main2",
  srcs = [":main.c"],
  args = ["a", "b c"], # "b c" gets split
)
cc_binary(
  name = "main3",
  srcs = [":main.c"],
  args = ["a", "\"b c\""], # excessive quoting is needed
)
cc_binary(
  name = "main4",
  srcs = [":main.c"],
  args = ["a \"b c\""], # we could say singleton lists are preferred and are splitted
)

main.c:

#include <stdio.h>
int main(int argc, char** argv) {
	for (int i = 1; i < argc; ++i) {
		printf("%d: %s\n", i, argv[i]);
	}
	return 0;
}

Testing:

$ bazel run :main -- a "b c"
1: a
2: b c
$ bazel run :main2
1: a
2: b
3: c
$ bazel run :main3
1: a
2: b c
$ bazel run :main4
1: a
2: b c

What operating system are you running Bazel on?

Linux/NixOS

What's the output of bazel info release?

release 3.3.1- (@non-git)

If bazel info release returns "development version" or "(@non-git)", tell us how you built Bazel.

Comes from NixOS packages
nix-env -iA nixpkgs.bazel

Any other information, logs, or outputs that you want to share?

Luckily

cc_binary(
  name = "main5",
  srcs = [":main.c"],
  args = ["a \"b c\" >/dev/null"],
)

treats ">/dev/null" as string rather than redirection. Could've been an injection vulnerability otherwise

Also tested with recent git version 6bf44ba

@aiuto aiuto added type: bug untriaged category: misc > misc team-Local-Exec Issues and PRs for the Execution (Local) team and removed category: misc > misc labels Oct 20, 2020
@dmivankov
Copy link
Contributor Author

dmivankov commented Oct 20, 2020

extra info

$ bazel build :main
$ bazel run --shell_executable=$(pwd)/bazel-bin/main :main2 -- "d e"
1: -c
2: SOME_PATH/execroot/__main__/bazel-out/k8-fastbuild/bin/main2 a b c 'd e'

also

BUILD file args=[" 'x' "] -> argv = ["x"]
bazel run -- " 'x' " -> argv = [" 'x' "]

@tetromino tetromino added team-Starlark team-Configurability platforms, toolchains, cquery, select(), config transitions and removed team-Local-Exec Issues and PRs for the Execution (Local) team labels Oct 22, 2020
@tetromino
Copy link
Contributor

Not sure who owns RunfilesSupport, but this looks like a real logic bug.

@jwnimmer-tri
Copy link
Contributor

To my reading, https://docs.bazel.build/versions/master/be/common-definitions.html#binary.args specifically says that args undergoes Bourne shell tokenization, which explaining the splitting. I don't understand what the bug is?

@dmivankov
Copy link
Contributor Author

May be just a documentation issue then
https://docs.bazel.build/versions/master/be/common-definitions.html#sh-tokenization

Certain string attributes of some rules are split into multiple words according to the tokenization rules of the Bourne shell

while
https://docs.bazel.build/versions/master/be/common-definitions.html#binary.args

args | List of strings;

Intuitively, when you supply array of strings, you don''t expect split and combine the lists behavior, also sh-tokenization is said to apply to string arguments, not to lists of string.

@ulfjack
Copy link
Contributor

ulfjack commented Oct 23, 2020

If I recall correctly, this has come up a few times before. I agree that the behavior should be changed to avoid shell tokenization; however, this is a difficult transition. It's not enough to have a flag because there's no way to write args in a way that is compatible with and without shell tokenization.

@dmivankov
Copy link
Contributor Author

One way to migrate is

  • add a flag that fails the build if tokenization changes the args (splits or removes spaces), should be doable as tokenization is done by bazel
  • add a builtin function tokenize(): string(or string..?) -> [string], need to carefully check how it interacts with variable expansion though
    then compatible way would be
args = ["--foo"] + tokenize(something) + ["abc"]
# or
args = ["--foo"] + tokenize(something1, something2) + ["abc"]

which would work regardless of flag given that there are no splits in free args

@dmivankov
Copy link
Contributor Author

We also probably need builtin quote(string_with_spaces_or_quotes): string -> string
So canonical arguments would be

args = ["--foo", "abc", quote(" x "), quote("'"), quote("\'x\'")] + tokenize("a \"b b\"") + tokenize("c d", "e f")
# meaning
argv = ["--foo", "abc", " x ", "'", "\'x\'", "a", "b b", "c", "d", "e", "f"]

and rejected arguments would be

args = ["a b", " x ", "\"x\"", "\'y\'", "\"x y\"", "a \"b\" c" ] + tokenize("a \" b")

Note that currently unquoting seems to do funny and unpredictable things

args = ["a", " \"b c\" ", " \"\"b\" c\" ", "'a ' ", "'d'", "\"e\"", "\"\"x\"\"", "\"'y'\"", "''\"''\"\"\"z\"\"\"''\"''", "''zz''", "'''xx'''"]

results in

1: a
2: b c
3: b c
4: a 
5: d
6: e
7: x
8: 'y'
9: ''z''
10: zz
11: xx

@aiuto aiuto added P2 We'll consider working on this in future. (Assignee optional) and removed team-Starlark untriaged labels Oct 26, 2020
@aiuto
Copy link
Contributor

aiuto commented Oct 26, 2020

I'll do some thinking about this. To me, a list of args should be based 1:1 as the arg list of the called process. The whole concept of shell tokanization is wrong.
OTOH, I would guess there are a lot of shell genrules that might break if we change this, and windows does not quite always parse well without quotes, so this is not a simple bug fix.

My hope is that we can devise a mechanism so that binaries can indicate that they parse args correctly, and thus no tokenization is needed. What you list in args is what you get.

@ulfjack
Copy link
Contributor

ulfjack commented Oct 26, 2020

@dmivankov, if I understand correctly, you're suggesting to add a flag that changes how Bazel parses args and simultaneously changes how quote and tokenize behave internally?

You didn't say that second part explicitly. If we don't change how quote and tokenize behave internally, then this would cause double-tokenization unless we can find a way to signal inline that a string should not be tokenized inside Bazel. However, the downside of this approach is that it would be confusing that tokenize doesn't actually tokenize unless the flag is set.

I think the most straightforward way to do this is to add a tag tokenize-args (alternatively, use a full attribute). This would support a migration (add tokenize-args on all rules, flip the flag to not tokenize by default, migrate all rules to explicitly tokenize using a separate method, and remove the tag).

@dmivankov
Copy link
Contributor Author

tokenize-args or even just argv=[..] that is never tokenized are also valid options.

With tokenize&quote we'd either need them to produce a tagged string/list of strings that indicates if it is wrapped(& into which of them) or not, or indeed make them behave differently depending on the flag. Technically we could also

  • make quote add '' when needed regardless of flag, so quote("a")=="a", quote("a b") == "'a b'"
  • make tokenize produce quote-ed elements
  • do double tokenization, but second run shouldn't be changing anything
  • flag only controls whether not using quote/tokenize is forbidden for args with spaces/special chars
  • and flag somehow only applies to user-provided strings, result of quote("a b") is "'a b'" doesn't fail the build but users can't specify such strings manually

@aiuto aiuto removed their assignment Jan 22, 2021
@gregestren gregestren added P3 We're not considering working on this, but happy to review a PR. (No assignee) and removed P2 We'll consider working on this in future. (Assignee optional) labels Dec 3, 2021
@trybka
Copy link
Contributor

trybka commented Jan 14, 2022

As another data point, this is also present in the defines attribute of cc_* rules and friends.

It was added ostensibly for copts compatibility, but defines errors if you have >1 tokens, which makes this quite a bit less useful anyhow.

defines might be a whole lot easier to clean-up lending themselves to having a per-package feature or something, IDK?

@aiuto aiuto removed P3 We're not considering working on this, but happy to review a PR. (No assignee) team-Configurability platforms, toolchains, cquery, select(), config transitions labels Jan 14, 2022
@gregestren
Copy link
Contributor

@aiuto how should we categorize this issue?

@aiuto
Copy link
Contributor

aiuto commented Feb 1, 2022

I don't know.
This is one of those things that does not fit any of the teams. It is execution semantics, but neither local nor remote.
So, either local-exec or xproduct, where some day we will re-triage it.

@sventiffe sventiffe added untriaged team-Local-Exec Issues and PRs for the Execution (Local) team labels Mar 16, 2022
@meisterT meisterT added team-XProduct and removed team-Local-Exec Issues and PRs for the Execution (Local) team labels Apr 7, 2022
fmeum added a commit to fmeum/bazel that referenced this issue Oct 8, 2022
Executable Starlark rules can use the new `arguments` parameter on
`RunEnvironmentInfo` to specify the arguments that Bazel should pass
on the command line with `test` or `run`.

If set to a non-`None` value, this parameter overrides the value of
the `args` attribute that is implicitly defined for all rules. This
allows Starlark rules to implement their own version of this attribute
which isn't bound to its proprietary processing (data label expansion
and tokenization).

Along the way, this commit adds test coverage and documentation for the
interplay between `RunEnvironmentInfo`'s `environment` and `--test_env`.

The value of the `arguments` field of `RunEnvironmentInfo` is
intentionally not exposed to Starlark yet: It is not clear how these
arguments should be represented and whether rules relying on the magic
`args` attribute should also provide this field.

Fixes bazelbuild#16076
Work towards bazelbuild#12313
fmeum added a commit to fmeum/bazel that referenced this issue Oct 8, 2022
Executable Starlark rules can use the new `arguments` parameter on
`RunEnvironmentInfo` to specify the arguments that Bazel should pass
on the command line with `test` or `run`.

If set to a non-`None` value, this parameter overrides the value of
the `args` attribute that is implicitly defined for all rules. This
allows Starlark rules to implement their own version of this attribute
which isn't bound to its proprietary processing (data label expansion
and tokenization).

Along the way, this commit adds test coverage and documentation for the
interplay between `RunEnvironmentInfo`'s `environment` and `--test_env`.

The value of the `arguments` field of `RunEnvironmentInfo` is
intentionally not exposed to Starlark yet: It is not clear how these
arguments should be represented and whether rules relying on the magic
`args` attribute should also provide this field.

Fixes bazelbuild#16076
Work towards bazelbuild#12313
fmeum added a commit to fmeum/bazel that referenced this issue Oct 8, 2022
Executable Starlark rules can use the new `arguments` parameter on
`RunEnvironmentInfo` to specify the arguments that Bazel should pass
on the command line with `test` or `run`.

If set to a non-`None` value, this parameter overrides the value of
the `args` attribute that is implicitly defined for all rules. This
allows Starlark rules to implement their own version of this attribute
which isn't bound to its proprietary processing (data label expansion
and tokenization).

Along the way, this commit adds test coverage and documentation for the
interplay between `RunEnvironmentInfo`'s `environment` and `--test_env`.

The value of the `arguments` field of `RunEnvironmentInfo` is
intentionally not exposed to Starlark yet: It is not clear how these
arguments should be represented and whether rules relying on the magic
`args` attribute should also provide this field.

RELNOTES: Executable starlark rules can use the `arguments` parameter of
`RunEnvironmentInfo` to specify their command-line arguments with `bazel
run` and `bazel test`.

Fixes bazelbuild#16076
Work towards bazelbuild#12313
@brandjon brandjon added team-Rules-API API for writing rules/aspects: providers, runfiles, actions, artifacts and removed team-Build-Language labels Nov 4, 2022
@comius comius added P3 We're not considering working on this, but happy to review a PR. (No assignee) and removed untriaged labels Nov 17, 2022
@comius
Copy link
Contributor

comius commented Nov 17, 2022

I think the default implementation should be the obvious one. Parameters in the list should be passed to the binary without any changes.

Doing this might break others who had to workaround this. So anybody fixing the implementation should also put it behind an incompatible flag.

@aiuto
Copy link
Contributor

aiuto commented Nov 23, 2022

+1. But the obvious thing is that args should be passed as is. Going back to the examples, main2 should print

1: a
2: b c

@fmeum
Copy link
Collaborator

fmeum commented Nov 23, 2022

What do you think should be the behavior of $(execpaths ...) with this flag? It would probably be more useful if that would still result in multiple arguments.

@aiuto
Copy link
Contributor

aiuto commented Nov 23, 2022

I think all args in a list should be passed as single strings and never retokenized by blaze. So, if there are paths in execpaths that have, for example, spaces in them, they should be expanded in a way that preserves the spaces on invocation of the command line.

@fmeum
Copy link
Collaborator

fmeum commented Nov 23, 2022

@aiuto I agree, but what about $(execpaths ...) resolving to multiple file paths (note the plural form)? In that case you would certainly want the result to be 'file1' 'file2' 'file3' on the command line rather than 'file1 file2 file3', but the $(execpaths ...) would still appear in a single args list entry.

For unrelated reasons, I have thought about rewriting the LocationExpander logic to return a CustomCommandLine instead of strings. That could solve this issue.

@aiuto
Copy link
Contributor

aiuto commented Nov 23, 2022

TBH. I'm not wrapping my head around that use case right now. That is, I don't have a mental model of when to use execpaths. I'm not sure I ever have used it. So, .... I don't know?

@phst
Copy link
Contributor

phst commented Jan 15, 2024

The primary use case appears to be complex scripts in genrules, e.g.

cmd = """
mkdir -p bazel/debian
tar -xf $(location //:bazel-srcs) -C ./bazel
for f in $(locations :debian-files); do
cp $$f ./bazel/debian/
done
cp $(location :changelog) ./bazel/debian
dpkg-source -b ./bazel
cp ./bazel_*.dsc $(location bazel.dsc)
cp ./bazel_*.tar.gz $(location bazel.tar.gz)
""",
or https://github.com/protocolbuffers/protobuf/blob/7b42f1c08b534f9e43629284e22acce6b13e7136/ruby/BUILD.bazel#L155-L174.

Ideally these would be converted to proper binaries with command-line arguments and custom actions using e.g. Args.add_all(ctx.attr.debian_files, format_each="--debian-file=%s"). That would resolve quoting/tokenization issues and obviate the need for shell quoting. So maybe we could consider any use of execpaths etc. a code smell?

@fmeum
Copy link
Collaborator

fmeum commented Jun 7, 2024

Based on the discussion so far and ignoring defines for the moment, I could see this being a reasonable path forward:

  1. Faithfully implement
    public static void tokenize(List<String> options, String optionString)
    in Starlark and make it available in bazel-skylib. Since it can't perform location expansion, it would treat $(FOO) and $(location //foo:bar) as if they were tokens that don't contain any spaces.
  2. Add a new --incompatible_no_args_tokenization that, if flipped, makes it so that all args values are treated literally with no postprocessing, with the exception of those targets that are tagged with tokenize-args.
  3. Provide a migration script (buildozer command) that tags all targets with tokenize-args that have args values that could possibly be subject to tokenization.
  4. Flip the flag, offering tokenize-args as a catch-all temporary and the tokenize function in Skylib as a permanent but not 100% compatible alternative.

@comius @lberki What do you think?

@phst
Copy link
Contributor

phst commented Oct 2, 2024

Maybe in addition to the tokenize-args tag there should be a no-tokenize-args tag so that people can opt out eagerly of tokenization independent of the flag.

Or maybe tokenize-args should be a feature so that users can say repo(default_features = ['-tokenize_args'] in their REPO.bazel?

@kjw-enf
Copy link

kjw-enf commented Oct 23, 2024

I recently wrote a starlark rule that treats args without $(location) expansion, without escaping necessary (i.e. $$), and without environment variable expansion (at least not in starlark) and I couldn't be happier.

context: pytest which requires a "main" wrapper calling pytest.main()

def _wrapper_pytest_impl(ctx):
    # store pytest args in an expanded template of wrapper_pytest.py
    output = ctx.actions.declare_file(ctx.attr.name)
    _flags = '\"' + '\",\n  \"'.join(ctx.attr.args) + '\"'  # quoted
    _srcs = '\"' + '\",\n  \"'.join([x.path for x in ctx.files.srcs]) + '\"'  # quoted
    ctx.actions.write(
        output = output,
        content = """# -*- coding: utf-8 -*-
# standard libraries
import os
import re
import sys
# third party libraries
import pytest
flags = [
  """ + _flags + """
]
srcs = [
  """ + _srcs + """
] 
# n.b. single dollar signs in $ENV expansion only; no shell escaping required
# TODO(kjw): this regex sucks
expanded_flags = [re.sub(r'\\$(\\S+)', lambda mo: os.environ.get(mo.group(1), ''), x) for x in flags]
if __name__ == "__main__":
    sys.exit(pytest.main(expanded_flags + srcs + sys.argv[1:]))
""",
        is_executable = True,
    )
    return [DefaultInfo(files = depset([output]))]

wrapper_pytest = rule(
    implementation = _wrapper_pytest_impl,
    attrs = {
        "srcs": attr.label_list(allow_files = True),
        "args": attr.string_list(),
        "output": attr.output(),
    },
)

Caveat: I'm a starlark noob, so I'm sure that there are better ways (i.e. with an Args object) to do this.

Yes, this means that some our_py_test(args=) usage looks funny, but it's correct if you think of python subprocess.Popen([cmd]) usage. i.e.:

our_py_test(args=["-k=a and b"])

Add in the complication that not all cli parsers support (single char flags with equals (i.e. -k=). Preferentially one should switch to the --flag= format which typically does. otherwise you're left with ["-k", "a and b"] which is slightly harder to read, and people not used to strict interpretations of args will be confused:

  • "doesn't that need quotes?"
  • "why is it two args and not one?"
  • "why is it one arg and not two?"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P3 We're not considering working on this, but happy to review a PR. (No assignee) team-Rules-API API for writing rules/aspects: providers, runfiles, actions, artifacts type: bug
Projects
None yet
Development

No branches or pull requests