Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor/rename shell_escape_winsomely() and add escaping function for CMD.EXE syntax #38352

Merged
merged 2 commits into from
Nov 19, 2020

Conversation

mgkuhn
Copy link
Contributor

@mgkuhn mgkuhn commented Nov 8, 2020

This is a rebased resubmission of PR #34111

Refactored shell_escape_winsomely() into two separate functions escape_microsoft_c_args() and escape_raw_string() to make the former easier to understand. (The latter function was since already cherry-picked by @JeffBezanson in #35309, as it was needed for #29580, so no longer included here.) As requested by @vtjnash in #33465.

Also added Windows CMD.EXE escaping function shell_escape_wincmd() as requested by @vtjnash to be separated out from #30614 into base/shell.jl (originally needed by SSHManager to be able to prepare Windows command lines, but might be useful elsewhere).

The second commit deletes the shell_escape_winsomely() alias (originally my code from #30614, committed by @musm) as that is a misleading function name born out of a misunderstanding of how Windows command lines work (this function does not actually escape any “shell” meta characters, the newly added shell_escape_wincmd() does that instead.)

* Using `escape_raw_string()` makes `shell_escape_winsomely()` much easier to understand.

* Name of `shell_escape_winsomely()` changed to `escape_microsoft_c_args()`, as this function has nothing to do with any “shell”.

* Added `shell_escape_wincmd()` to escape metacharacters of `CMD.EXE`
Copy link
Contributor

@musm musm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

base/shell.jl Show resolved Hide resolved
base/shell.jl Show resolved Hide resolved
end
while i <= len
c = s[i]
if c == '"' && (j = findnext('"', s, nextind(s,i))) !== nothing
Copy link
Contributor

@musm musm Nov 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully follow https://stackoverflow.com/questions/4094699/how-does-the-windows-command-interpreter-cmd-exe-parse-scripts/4095133#4095133
It's quite detailed and not all of it looks relevant to the point of this method. I'm assuming this is a simplified version of "Command Line Parser" and is somewhat a reverse-engineered version of phase 2 in the linked page?

Copy link
Contributor Author

@mgkuhn mgkuhn Nov 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose of shell_escape_wincmd is to robustly perform exactly the kind of escaping needed to start other applications on the cmd.exe command line. It does exactly the type of escaping needed to pass arbitrary strings as environment variables and command-line arguments to called exe files, as needed by SSHManager in stdlib/Distributed. It attempts nothing else. There may be lots of other parts of the CMD.EXE control-structure syntax that may require a variety of other, context-dependent escaping rules. This function makes no attempt whatsoever to cater for any of these, because doing so would be extremely complex. If you find this function useful for base/shell.jl, you are welcome to have it there, otherwise, I would much prefer to keep it in stdlib/Distributed/src/managers.jl, where its application context is unambiguously clear and its inclusion is not delayed by another year due to confusion of what type of escaping the people who wanted it in shell.jl actually want.

The cmd.exe syntax is obviously extremely complex and ill-defined, and there is no chance one single escaping function can cover every possible need in every single application context. This function does one thing and does it well. Don't try to improve it. Take it or leave it as it is in stdlib/Distributed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification. Perhaps we should add an additional warning or note to the docstring to something of this effect. In other words, adding to the docstring that this method makes no additional attempts at escaping other control-structure syntax that may require a variety of other, context-dependent escaping rules.

Ultimately, I think we still want to use this Base.shell_escape_wincmd(Base.escape_microsoft_c_args(args...)) as a basic "dumb" shell interpreter if it's set to cmd.exe. Although, for that unrelated use case, I think instead looking at PowerShell's rules might be somewhat easier.

Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing after phase 2 on the linked page is related to command line parsing, and there's no escape character to avoid the problems that may get introduced in phase 1. After phase 2 it gets to command line interpretation, and each interpreter then may have their own rules that would be run prior to this (most commonly, Base.escape_microsoft_c_args).

The parsing rules are not complex at all, and fit quite compactly in this function. The rules are unusual and a bit awkward (because after the delimiter tokens are parsed, they may be left in the stream). But there's no reason to fear them.

the type of escaping needed to pass arbitrary strings as environment variables

No escaping is quite entirely possible on these to be safe, since cmd will always eval the resulting string, so they need to be mangled ahead of time to be safe in context, and the context isn't trivial to know. Curiously though, since expansion isn't recursive, this means that an entirely safe way to call cmd is:

run(setenv(`cmd /c %CMDLINE%`, "CMDLINE" => Base.shell_escape_wincmd(cmdline)))

Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By contrast, PowerShell's rules are terrifyingly complex to me, ill-defined, and have numerous known bugs (PowerShell/PowerShell#1995, PowerShell/PowerShell#3049, PowerShell/PowerShell-RFC#90). There may be a simple way through that mess, but given that those issues have been open for a number of years, I'm not particularly hopeful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@musm I prefer the current approach, where the docstring simply describes the algorithm implemented, without making any promises of what use cases it is or isn’t suitable for:

The unexported shell_escape_wincmd function escapes Windows
cmd.exe shell meta characters. It escapes ()!^<>&| by placing a
^ in front. An @ is only escaped at the start of the string. Pairs
of " characters and the strings they enclose are passed through
unescaped. Any remaining " is escaped with ^ to ensure that the
number of unescaped " characters in the result remains even.

Since cmd.exe substitutes variable references (like %USER%)
before processing the escape characters ^ and ", this function
makes no attempt to escape the percent sign (%).

It is the responsibility of the user to understand cmd.exe syntax well enough to decide whether this particular escaping algorithm is suitable for their particular use case. (I fear that at this stage trying to improve the docstring is likely to make it worse. There is probably nothing this docstring can say that would make cmd.exe command-line escaping as nice and easy to understand as POSIX shell escaping.)

The question is: do you want this version of shell_escape_wincmd in base/shell.jl? If not, I can remove it from this PR, so we can focus on merging the rest.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that document confused me more than helped to be honest, since it's describing what cmd does after receiving the stream. In any case, we aren't trying to parse the command line, but to escape characters to they get parsed correctly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The question is: do you want this version of shell_escape_wincmd in base/shell.jl? If not, I can remove it from this PR, so we can focus on merging the rest.

Yes let's keep it in base/shell.jl, it's more more appropriate.

Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose of documentation is not to be a design document or describe the implementation of an algorithm. The purpose is to describe the usage of it and what it promises. If this promise nothing, than it would not likely be a candidate for merging into the standard library. Fortunately, it can promise useful properties, so we do want to merge it.

@musm
Copy link
Contributor

musm commented Nov 19, 2020

Sans the minor docstring update, I plan to merge this soon. @mgkuhn please ping me if I forget.

Copy link
Sponsor Member

@vtjnash vtjnash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, realized I had written some suggested documentation, then forget to post it. The implementation lgtm.


Input strings with ASCII control characters that cannot be escaped
(NUL, CR, LF) will cause an `ArgumentError` exception.

Copy link
Sponsor Member

@vtjnash vtjnash Nov 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some additional documentation for users that would like to understand further the guarantees, usage, and gotchas of this function:

Suggested change
The result is safe to pass as an argument to a command call being processed by `CMD.exe /S /C " ... "` (with
surrounding double-quote pair) and will be received verbatim by the target application if the input does not
contain `%` (else this function will fail with an ArgumentError). The presence of `%` in the input string may result
in command injection vulnerabilities and may invalidate any claim of suitability of the output of this function for
use as an argument to cmd (due to the ordering described above), so use caution when assembling a string
from various sources.
This function may be useful in concert with the `windows_verbatim` flag to [`Cmd`](@ref) when constructing
process pipelines.
!warning
The argument parsing done by CMD when calling batch files (either inside `.bat` files or as arguments to
them) is not fully compatible with the output of this function. In particular, the processing of `%` is different.
!important
Due to a peculiar behavior of the CMD parser/interpreter, each command after a literal `|` character
(indicating a command pipeline) must have `shell_escape_wincmd` applied twice since it will be parsed twice by CMD. For example:
```
to_print = "All for 1 & 1 for all!"
run(Cmd(Cmd(["cmd /S /C \" break | echo $(shell_escape_wincmd(shell_escape_wincmd(to_print))) \""]), windows_verbatim=true))
```

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since I have not myself done the research needed to convince myself that this is all 100% true and accurate, may I suggest you make this extended documentation a separate follow-up PR under your name? This way, I don't end up with any git blame for security advice that I haven't personally written or verified.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these are good points to bring up and help understand what's going on, but I'm not sure that the !important note is relevant to the shell_escape_wincmd method? I think this would be good to add somewhere in the docs instead.

Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're welcome to make it a separate commit or add Co-authored-by to this commit.

This is adding it to the docs. Where else, other than the documentation for shell_escape_wincmd, would you suggest putting clarifications on the correct usage of shell_escape_wincmd?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mgkuhn Do you feel cmofrtable with @vtjnash 's suggestion to add this as a commit with a co-authored?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we squash and merge or do a merge commit if that is the case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I usually prefer "rebase and merge" on github if a PR is tidy, but don't know what's most customary here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer a separate PR for the extended documentation. That would also give me time to run some tests on the claims first before I can provide meaningful feedback. (All my tests so far have been in the ssh scenario, i.e. preparing command-line invocations of programs such as julia.exe with arguments, environment variables, and pushd, as they would be typed into the console keyboard. I have not yet studied what happens or needs to happen with metacharacters when invoking cmd.exe with run, or what escaping mechanisms libuv already offers.)

write(io, SubString(s,i,j))
i = j
else
if c in ('"', '(', ')', '!', '^', '<', '>', '&', '|')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aren't we missing \ in this list?

Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

\ isn't a significant character

Copy link
Contributor

@musm musm Nov 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine. I was looking https://ss64.com/nt/syntax-esc.html (which granted may not be 100% correct)
they say:

Escape Character
^ Escape character.
Adding the escape character before a command symbol allows it to be treated as ordinary text.
When piping or redirecting any of these characters you should prefix with the escape character: & \ < > ^ |

 e.g.  ^\  ^&  ^|  ^>  ^<  ^^ 

and then down the page

Many characters such as \ = ( ) do not need to be escaped when they are used within a "quoted string" typically these are characters you might find in a filename/path. The percent character is one exception to this rule, even though under NTFS % is a valid filename character.

Which kind of worried me since, we do not uniformly "quote strings", only if they contain spaces (see https://github.com/JuliaLang/julia/pull/38352/files#diff-92d17ed4c3ab24cca65d43dd11a040f4f8e96378384fa143e2d418c5b07656d3R341)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, \ is an extremely common Windows pathname character that we definitely do not want to escape unnecessarily, and cmd.exe just passes it through to the application.

Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Information online about cmd is hilariously, and almost notoriously bad, imo. That page contradict their own list in their example that follows immediately afterwards, in which they show that \ usage doesn't need escaping 😂. But even cmd /c help cmd is wrong here (missing some obvious ones, and including some nonsense ones), so it's perhaps not their fault.

The special characters that require quotes are:
     <space>
     &()[]{}^=;!'+,`~

@musm musm merged commit 72e67d7 into JuliaLang:master Nov 19, 2020
@musm musm added the system:windows Affects only Windows label Nov 19, 2020
vtjnash added a commit that referenced this pull request Nov 20, 2020
Note that most resources online are wrong, and even `cmd /c help cmd` prints the wrong list,
so it is important to be clear here about the actual guarantees this function can afford.

Refs #38352
vtjnash added a commit that referenced this pull request Nov 20, 2020
Note that most resources online are wrong, and even `cmd /c help cmd` prints the wrong list,
so it is important to be clear here about the actual guarantees this function can afford.

Refs #38352
vtjnash added a commit that referenced this pull request Dec 5, 2020
Note that most resources online are wrong, and even `cmd /c help cmd` prints the wrong list,
so it is important to be clear here about the actual guarantees this function can afford.

Refs #38352
vtjnash added a commit that referenced this pull request Dec 5, 2020
Note that most resources online are wrong, and even `cmd /c help cmd` prints the wrong list,
so it is important to be clear here about the actual guarantees this function can afford.

Refs #38352
vtjnash added a commit that referenced this pull request Dec 6, 2020
Note that most resources online are wrong, and even `cmd /c help cmd` prints the wrong list,
so it is important to be clear here about the actual guarantees this function can afford.

Refs #38352
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
system:windows Affects only Windows
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants