Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor shell_escape_winsomely() and add escaping function for CMD.EXE syntax #34111

Closed
wants to merge 7 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion base/Base.jl
Original file line number Diff line number Diff line change
Expand Up @@ -173,9 +173,9 @@ include("iobuffer.jl")
# strings & printing
include("intfuncs.jl")
include("strings/strings.jl")
include("regex.jl")
include("parse.jl")
include("shell.jl")
include("regex.jl")
include("show.jl")
include("arrayshow.jl")
include("methodshow.jl")
Expand Down
6 changes: 4 additions & 2 deletions base/cmd.jl
Original file line number Diff line number Diff line change
Expand Up @@ -102,8 +102,10 @@ shell_escape(cmd::Cmd; special::AbstractString="") =
shell_escape(cmd.exec..., special=special)
shell_escape_posixly(cmd::Cmd) =
shell_escape_posixly(cmd.exec...)
shell_escape_winsomely(cmd::Cmd) =
shell_escape_winsomely(cmd.exec...)
escape_microsoft_c_args(cmd::Cmd) =
escape_microsoft_c_args(cmd.exec...)
escape_microsoft_c_args(io::IO, cmd::Cmd) =
escape_microsoft_c_args(io::IO, cmd.exec...)

function show(io::IO, cmd::Cmd)
print_env = cmd.env !== nothing
Expand Down
135 changes: 88 additions & 47 deletions base/shell.jl
Original file line number Diff line number Diff line change
Expand Up @@ -255,60 +255,101 @@ shell_escape_posixly(args::AbstractString...) =
sprint(print_shell_escaped_posixly, args...)


function print_shell_escaped_winsomely(io::IO, args::AbstractString...)
first = true
for arg in args
first || write(io, ' ')
first = false
# Quote any arg that contains a whitespace (' ' or '\t') or a double quote mark '"'.
# It's also valid to quote an arg with just a whitespace,
# but the following may be 'safer', and both implementations are valid anyways.
quotes = any(c -> c in (' ', '\t', '"'), arg) || isempty(arg)
quotes && write(io, '"')
backslashes = 0
for c in arg
if c == '\\'
backslashes += 1
"""
shell_escape_wincmd(s::AbstractString)
shell_escape_wincmd(io::IO, s::AbstractString)

The unexported `shell_escape_wincmd` function escapes Windows
`cmd.exe` shell meta characters. It escapes `()!^<>&|` by placing a
`^` in front. An `@` is only escaped at the start of the string. Pairs
of `"` characters and the strings they enclose are passed through
unescaped. Any remaining `"` is escaped with `^` to ensure that the
number of unescaped `"` characters in the result remains even.

Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The result is safe to pass as an argument to a command call being processed by `CMD.exe /c` and will be received verbatim by the target application (or this function will fail with an ArgumentError).
This function may be useful in concert with the `windows_verbatim` flag to [`Cmd`](@ref) when constructing process pipelines.
!warning
The argument parsing done by CMD when calling batch files (either inside `.bat` files or as arguments to them) is not compatible with the output of this function.
!important
Due to a peculiar behavior of the CMD parser/interpreter, each command after a literal `|` character
(indicating a command pipeline) must have `shell_escape_wincmd` applied twice since it will be parsed twice by CMD. For example:
```
to_print = "All for 1 & 1 for all!"
run(Cmd(Cmd(["cmd /c \"break | echo \$(shell_escape_wincmd(shell_escape_wincmd(to_print)))"]), windows_verbatim=true))
```

(adopting some of the advisory text from #33474):

julia/base/shell.jl

Lines 324 to 345 in 09c1d61

shell_escape_CMDly(arg::AbstractString) -> String
The unexported `shell_escape_CMDly` function takes a string and escapes any special characters
in such a way that it is safe to pass it as an argument to some `CMD.exe`. This may be useful
in concert with the `windows_verbatim` flag to [`Cmd`](@ref) when constructing process
pipelines.
See also [`shell_escape_PWSHly`](@ref).
# Example
```jldoctest
julia> println(shell_escape_CMDly("\"A B\\\" & C"))
^"A B\\^" ^& C
!important
Due to a peculiar behavior of the CMD, each command after a literal `|` character
(indicating a command pipeline) must have `shell_escape_CMDly` applied twice. For example:
```
to_print = "All for 1 & 1 for all!"
run(Cmd(Cmd(["cmd /c \"break | echo \$(shell_escape_CMDly(shell_escape_CMDly(to_print)))"]), windows_verbatim=true))
```
"""

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid I don't quite understand yet that suggested addition and I am not really familiar yet with all the behind-the-scenes processing that may go on during run(Cmd(Cmd(["cmd /c \"... on a Windows machine. My own use case and expertise is preparing command lines that get sent via ssh to a Windows server running OpenSSH sshd (e.g., in Distributed), where no libuv processing of command-line arguments passed to Windows processes is involved. I couldn't reproduce yet myself the effect referred to in this suggestion, and therefore would be more comfortable if we left this to a separate PR.

Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all the behind-the-scenes processing

There is none. That's the point of using that syntax and flag: to disable all helpers and get the raw string without libuv processing or such.

Though I would perhaps further clarify here that it may be best passed as CMD.exe /S /C " prog args " (with the '/S' and surrounding " pair).

Since `cmd.exe` substitutes variable references (like `%USER%`)
_before_ processing the escape characters `^` and `"`, this function
makes no attempt to escape the percent sign (`%`).

Input strings with ASCII control characters that cannot be escaped
(NUL, CR, LF) will cause an `ArgumentError` exception.

With an I/O stream parameter `io`, the result will be written there,
rather than returned as a string.

See also: [`escape_microsoft_c_args`](@ref), [`shell_escape_posixly`](@ref)

# Example
```jldoctest
julia> Base.shell_escape_wincmd("a^\\"^o\\"^u\\"")
"a^^\\"^o\\"^^u^\\""
```
"""
function shell_escape_wincmd(io::IO, s::AbstractString)
# https://stackoverflow.com/a/4095133/1990689
mgkuhn marked this conversation as resolved.
Show resolved Hide resolved
occursin(r"[\r\n\0]", s) &&
throw(ArgumentError("control character unsupported by CMD.EXE"))
i = 1
len = ncodeunits(s)
if len > 0 && s[1] == '@'
write(io, '^')
end
while i <= len
c = s[i]
if c == '"' && (j = findnext('"', s, nextind(s,i))) !== nothing
write(io, SubString(s,i,j))
i = j
else
if c in ('"', '(', ')', '!', '^', '<', '>', '&', '|')
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized that cmd /? emits this list, but that I don't know where this list comes from:

The special characters that require quotes are:
     <space>
     &()[]{}^=;!'+,`~

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hadn't spotted that list before, but it also seems wrong, at least for my use case of command invocation. For example: typing into the command line julia -e "print(\"hi\")">b or julia -e "print(\">\")"quite clearly demonstrates that any > in a CMD.EXE command-line argument needs to be escaped to not trigger stdout redirect by CMD.EXE. But the > character is not listed in that last line of cmd /?.

Reading the text before, it may well be meant in the context of the “completion” function, i.e. it may be a list of characters that the authors of the completion function worried about, and that function may work differently in different contexts. That may not be the same list of characters that we need to escape in command invocation, i.e. when sending strings such as set var=value&&pushd directory&&julia -e "print(\"...\")". When you ask for filename completion after having typed a filename that starts with >, you may want to redirect rather than have the > escaped.

Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but note that the list is intentionally incomplete with respect to our usage since \/:*?<>| (plus newline) can't appear in the result of the completion, so any of those might also be candidates for this list. The question then is where some of the others come from??

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Jan Erik / Dave Benham description of the CMD.EXE parser does say that under certain circumstances in Phase 7 if the quote flag is off, then when testing if a command token is an internal command, it “break[s] the command token before the first occurrence of + / [ ] <space> <tab> , ; or =”, which probably accounts for the remaining characters in that list. I'm not claiming to fully understand yet all of this, but I doubt it is relevant to our use cases. These characters seem just used as token delimiters, not as special characters that trigger substitutions of their own in regular command-line arguments.

[Note that my original approach (which did not factor Microsoft C ARGV quoting and CMD.EXE quoting into separate functions) would have been able to use quotes more aggressively (i.e., quote something for the C library to unquote because CMD.EXE might interpret it as a meta character, not because the C library might), and thereby guarantee far more easily that none of these contexts are ever reached. But I think we are fine. Can you construct any use case where my proposed current implementation is not already doing the right thing?]

write(io, '^', c)
else
# escape all backslashes and the following double quote
c == '"' && (backslashes = backslashes * 2 + 1)
for j = 1:backslashes
# backslashes aren't special here
write(io, '\\')
end
backslashes = 0
write(io, c)
end
end
# escape all backslashes, letting the terminating double quote we add below to then be interpreted as a special char
quotes && (backslashes *= 2)
for j = 1:backslashes
write(io, '\\')
end
quotes && write(io, '"')
i = nextind(s,i)
end
return nothing
end

shell_escape_wincmd(s::AbstractString) = sprint(shell_escape_wincmd, s;
sizehint = 2*sizeof(s))

"""
shell_escaped_winsomely(args::Union{Cmd,AbstractString...})::String

Convert the collection of strings `args` into single string suitable for passing as the argument
string for a Windows command line. Windows passes the entire command line as a single string to
the application (unlike POSIX systems, where the list of arguments are passed separately).
Many Windows API applications (including julia.exe), use the conventions of the [Microsoft C
runtime](https://docs.microsoft.com/en-us/cpp/c-language/parsing-c-command-line-arguments) to
split that command line into a list of strings. This function implements the inverse of such a
C runtime command-line parser. It joins command-line arguments to be passed to a Windows console
application into a command line, escaping or quoting meta characters such as space,
double quotes and backslash where needed. This may be useful in concert with the `windows_verbatim`
flag to [`Cmd`](@ref) when constructing process pipelines.
escape_microsoft_c_args(args::Union{Cmd,AbstractString...})
escape_microsoft_c_args(io::IO, args::Union{Cmd,AbstractString...})

# Example
```jldoctest
julia> println(shell_escaped_winsomely("A B\\", "C"))
"A B\\" C
Convert a collection of string arguments into a string that can be
passed to many Windows command-line applications.

Microsoft Windows passes the entire command line as a single string to
the application (unlike POSIX systems, where the shell splits the
command line into a list of arguments). Many Windows API applications
(including julia.exe), use the conventions of the [Microsoft C/C++
runtime](https://docs.microsoft.com/en-us/cpp/c-language/parsing-c-command-line-arguments)
to split that command line into a list of strings.

This function implements an inverse for a parser compatible with these rules.
It joins command-line arguments to be passed to a Windows
C/C++/Julia application into a command line, escaping or quoting the
meta characters space, TAB, double quote and backslash where needed.

See also: [`shell_escape_wincmd`](@ref), [`escape_raw_string`](@ref)
"""
shell_escape_winsomely(args::AbstractString...) =
sprint(print_shell_escaped_winsomely, args..., sizehint=(sum(length, args)) + 3*length(args))
function escape_microsoft_c_args(io::IO, args::AbstractString...)
# http://daviddeley.com/autohotkey/parameters/parameters.htm#WINCRULES
first = true
for arg in args
if first
first = false
else
write(io, ' ') # separator
end
if isempty(arg) || occursin(r"[ \t\"]", arg)
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if isempty(arg) || occursin(r"[ \t\"]", arg)
if isempty(arg) || any(c -> c in (' ', '\t', '"'), arg)

(faster)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depends on the length. On my computer, if length(arg)<27 then there are about 50 ns setup penalty associated with regex use:

julia> using BenchmarkTools
julia> arg = 'a'^1; @btime occursin(r"[ \t\"]", $arg) ; @btime any(c -> c in (' ', '\t', '"'), $arg)
  56.275 ns (0 allocations: 0 bytes)
  7.988 ns (0 allocations: 0 bytes)

For length(arg)==27 there is no difference:

julia> arg = 'a'^27; @btime occursin(r"[ \t\"]", $arg) ; @btime any(c -> c in (' ', '\t', '"'), $arg)
  88.455 ns (0 allocations: 0 bytes)
  89.344 ns (0 allocations: 0 bytes)

For length(arg)>27 regex is faster, asymptotically twice as fast:

julia> arg = 'a'^1000; @btime occursin(r"[ \t\"]", $arg) ; @btime any(c -> c in (' ', '\t', '"'), $arg)
  1.380 μs (0 allocations: 0 bytes)
  2.673 μs (0 allocations: 0 bytes)
false

So overall, I still prefer using a regular expression here (and I suspect they might be beneficially used in some of the other shell escaping functions as well).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it more common to have arg < 27 than long arguments with arg>27 .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are talking a maximum of ~50 ns benchmark overhead here on my 8-year-old PC (the time in which a bit travels 10 metres on a cable), which is many orders of magnitude less than the overhead of invoking a new process.

Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either is fine with me.

# Julia raw strings happen to use the same escaping convention
# as the argv[] parser in Microsoft's C runtime library.
escape_raw_string(io, arg)
else
write(io, arg)
end
end
end
escape_microsoft_c_args(args::AbstractString...) =
sprint(escape_microsoft_c_args, args...;
sizehint = (sum(sizeof.(args)) + 3*length(args)))
53 changes: 53 additions & 0 deletions base/strings/io.jl
Original file line number Diff line number Diff line change
Expand Up @@ -523,6 +523,59 @@ julia> println(raw"\\\\x \\\\\\"")
"""
macro raw_str(s); s; end

"""
escape_raw_string(s::AbstractString)
escape_raw_string(io, s::AbstractString)

Convert a string into a raw-string literal. This function counts in
input string `s` for any double-quote (`"`) character the number _n_
of preceeding backslash (`\\`) characters, and then increases there
the number of backslashes from _n_ to 2_n_+1 (even for _n_ = 0). It
also doubles any sequence of backslashes at the end of the string.
Finally, it returns the entire string enclosed with added double-quote
delimiters.

This escaping convention is used in raw strings and other non-standard
string literals. (It also happens to be the escaping convention
expected by the Microsoft C/C++ compiler runtime when it parses a
command-line string into the argv[] array.)

See also: [`escape_string`](@ref)
"""
function escape_raw_string(io, str::AbstractString)
write(io, '"')
escapes = 0
for c in str
if c == '\\'
escapes += 1
else
if c == '"'
# if one or more backslashes are followed by
# a double quote then escape all backslashes
# and the double quote
escapes = escapes * 2 + 1
end
while escapes > 0
write(io, '\\')
escapes -= 1
end
escapes = 0
write(io, c)
end
end
# also escape any trailing backslashes,
# so they do not affect the closing quote
while escapes > 0
write(io, '\\')
write(io, '\\')
escapes -= 1
end
write(io, '"')
end
escape_raw_string(str::AbstractString) = sprint(escape_raw_string, str;
sizehint = lastindex(str) + 2)


## multiline strings ##

"""
Expand Down
67 changes: 45 additions & 22 deletions test/spawn.jl
Original file line number Diff line number Diff line change
Expand Up @@ -677,86 +677,109 @@ if Sys.iswindows()
end


# shell escaping on Windows
@testset "shell_escape_winsomely" begin
@testset "shell escaping on Windows" begin
# Note argument A can be parsed both as A or "A".
# We do not test that the parsing satisfies either of these conditions.
# In other words, tests may fail even for valid parsing.
# This is done to avoid overly verbose tests.

# input :
# output: ""
@test Base.shell_escape_winsomely("") == "\"\""
@test Base.escape_microsoft_c_args("") == "\"\""

@test Base.shell_escape_winsomely("A") == "A"
@test Base.escape_microsoft_c_args("A") == "A"

@test Base.shell_escape_winsomely(`A`) == "A"
@test Base.escape_microsoft_c_args(`A`) == "A"

# input : hello world
# output: "hello world"
@test Base.shell_escape_winsomely("hello world") == "\"hello world\""
@test Base.escape_microsoft_c_args("hello world") == "\"hello world\""

# input : hello world
# output: "hello world"
@test Base.shell_escape_winsomely("hello\tworld") == "\"hello\tworld\""
@test Base.escape_microsoft_c_args("hello\tworld") == "\"hello\tworld\""

# input : hello"world
# output: "hello\"world" (also valid) hello\"world
@test Base.shell_escape_winsomely("hello\"world") == "\"hello\\\"world\""
@test Base.escape_microsoft_c_args("hello\"world") == "\"hello\\\"world\""

# input : hello""world
# output: "hello\"\"world" (also valid) hello\"\"world
@test Base.shell_escape_winsomely("hello\"\"world") == "\"hello\\\"\\\"world\""
@test Base.escape_microsoft_c_args("hello\"\"world") == "\"hello\\\"\\\"world\""

# input : hello\world
# output: hello\world
@test Base.shell_escape_winsomely("hello\\world") == "hello\\world"
@test Base.escape_microsoft_c_args("hello\\world") == "hello\\world"

# input : hello\\world
# output: hello\\world
@test Base.shell_escape_winsomely("hello\\\\world") == "hello\\\\world"
@test Base.escape_microsoft_c_args("hello\\\\world") == "hello\\\\world"

# input : hello\"world
# output: "hello\"world" (also valid) hello\"world
@test Base.shell_escape_winsomely("hello\\\"world") == "\"hello\\\\\\\"world\""
@test Base.escape_microsoft_c_args("hello\\\"world") == "\"hello\\\\\\\"world\""

# input : hello\\"world
# output: "hello\\\\\"world" (also valid) hello\\\\\"world
@test Base.shell_escape_winsomely("hello\\\\\"world") == "\"hello\\\\\\\\\\\"world\""
@test Base.escape_microsoft_c_args("hello\\\\\"world") == "\"hello\\\\\\\\\\\"world\""

# input : hello world\
# output: "hello world\\"
@test Base.shell_escape_winsomely("hello world\\") == "\"hello world\\\\\""
@test Base.escape_microsoft_c_args("hello world\\") == "\"hello world\\\\\""

# input : A\B
# output: A\B"
@test Base.shell_escape_winsomely("A\\B") == "A\\B"
@test Base.escape_microsoft_c_args("A\\B") == "A\\B"

# input : [A\, B]
# output: "A\ B"
@test Base.shell_escape_winsomely("A\\", "B") == "A\\ B"
@test Base.escape_microsoft_c_args("A\\", "B") == "A\\ B"

# input : A"B
# output: "A\"B"
@test Base.shell_escape_winsomely("A\"B") == "\"A\\\"B\""
@test Base.escape_microsoft_c_args("A\"B") == "\"A\\\"B\""

# input : [A B\, C]
# output: "A B\\" C
@test Base.shell_escape_winsomely("A B\\", "C") == "\"A B\\\\\" C"
@test Base.escape_microsoft_c_args("A B\\", "C") == "\"A B\\\\\" C"

# input : [A "B, C]
# output: "A \"B" C
@test Base.shell_escape_winsomely("A \"B", "C") == "\"A \\\"B\" C"
@test Base.escape_microsoft_c_args("A \"B", "C") == "\"A \\\"B\" C"

# input : [A B\, C]
# output: "A B\\" C
@test Base.shell_escape_winsomely("A B\\", "C") == "\"A B\\\\\" C"
@test Base.escape_microsoft_c_args("A B\\", "C") == "\"A B\\\\\" C"

# input :[A\ B\, C]
# output: "A\ B\\" C
@test Base.shell_escape_winsomely("A\\ B\\", "C") == "\"A\\ B\\\\\" C"
@test Base.escape_microsoft_c_args("A\\ B\\", "C") == "\"A\\ B\\\\\" C"

# input : [A\ B\, C, D K]
# output: "A\ B\\" C "D K"
@test Base.shell_escape_winsomely("A\\ B\\", "C", "D K") == "\"A\\ B\\\\\" C \"D K\""
@test Base.escape_microsoft_c_args("A\\ B\\", "C", "D K") == "\"A\\ B\\\\\" C \"D K\""

# shell_escape_wincmd
@test Base.shell_escape_wincmd("") == ""
@test Base.shell_escape_wincmd("\"") == "^\""
@test Base.shell_escape_wincmd("\"\"") == "\"\""
@test Base.shell_escape_wincmd("\"\"\"") == "\"\"^\""
@test Base.shell_escape_wincmd("\"\"\"\"") == "\"\"\"\""
@test Base.shell_escape_wincmd("a^\"^o\"^u\"") == "a^^\"^o\"^^u^\""
@test Base.shell_escape_wincmd("ä^\"^ö\"^ü\"") == "ä^^\"^ö\"^^ü^\""
@test Base.shell_escape_wincmd("@@()!^<>&|\"") == "^@@^(^)^!^^^<^>^&^|^\""
@test_throws ArgumentError Base.shell_escape_wincmd("\0")
@test_throws ArgumentError Base.shell_escape_wincmd("\r")
@test_throws ArgumentError Base.shell_escape_wincmd("\n")

# combined tests of shell_escape_wincmd and escape_microsoft_c_args
@test Base.shell_escape_wincmd(Base.escape_microsoft_c_args(
"julia", "-e", "println(ARGS)", raw"He said \"a^2+b^2=c^2\"!" )) ==
"julia -e println^(ARGS^) \"He said \\\"a^^2+b^^2=c^^2\\\"!\""

ascii95 = String(range(' ',stop='~')); # all printable ASCII characters
args = ["ab ^` c", " \" ", "\"", ascii95, ascii95,
"\"\\\"\\", "", "|", "&&", ";"];
@test Base.shell_escape_wincmd(Base.escape_microsoft_c_args(args...)) == "\"ab ^` c\" \" \\\" \" \"\\\"\" \" !\\\"#\$%^&'^(^)*+,-./0123456789:;^<=^>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^^_`abcdefghijklmnopqrstuvwxyz{^|}~\" \" ^!\\\"#\$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\" \"\\\"\\\\\\\"\\\\\" \"\" ^| ^&^& ;"

end
1 change: 1 addition & 0 deletions test/strings/io.jl
Original file line number Diff line number Diff line change
Expand Up @@ -148,6 +148,7 @@
@test "aaa \\g \n" == unescape_string(str, ['g'])
@test "aaa \\g \\n" == unescape_string(str, ['g', 'n'])
end
@test Base.escape_raw_string(raw"\"\\\"\\-\\") == "\"\\\"\\\\\\\"\\\\-\\\\\""
end
@testset "join()" begin
@test join([]) == join([],",") == ""
Expand Down