-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
implement raw strings (r#"foo"#) #9674
Conversation
I'm submitting this PR because I'd like some review and comments on whether I'm going in the right direction here. What bothers me is described in the commit message for 06eee38 (edit: removed that commit): I want the pretty-printer to reproduce the right style of string literal, but I hesitate to thread that extra info through all the mangling and data structures between the parser itself and what ends up in the AST. Would that be the right way to go? |
My initial thought is that maybe the pretty printer should be in charge of deciding for itself, based on the content of the string, what kind of string literal is best. E.g. if the author didn't "need" to use a raw string, then it seems to me like the pretty-printer could emit a normal string literal. And likewise, if the author painfully put in a string with many escapes, the pretty-printer might scan it for any embedded doublequotes, and if so, then the longest occurrence of (This would raise some policy questions like "how many escapes is too many" (as in, what is the threshold where I should switch to using a raw string). Off the top of my head, I would say more than two escapes implies print raw-string. Of course this would hopefully be easy to change after the fact.) Also, @kballard and @Kimundi will probably want to weigh in. :) |
addendum to my prior suggestion: My suggested policy was a bit naive; e.g. I am not sure I would want |
I suppose pretty-printing outputting the precise input sequence is already not viable since it's not recorded anywhere whether something like a newline a string resulted from a literal newline or an escape sequence (also There's so many ways to encode the same string, I'm not sure how far a heuristic would go towards finding the "ideal" string literal every time. :( |
@ben0x539 IMO the pretty-printer already breaks down pretty badly in a number of cases (e.g. the way that comments get moved around), so I do not think you should invest too much effort in trying to come up with the perfect answer here. My main piece of advice: get the pretty printer to the point where it can emit either variant of string literal, and then factor how the choice is made into a cleanly separated policy. Then the team can refine the policy later as we gain more experience. |
I have a change that would make the pretty printer emit whichever is the shorter kind of string literal. I'm not sure about that because that doesn't affect the With that change (added above), only the lexer and the parser are aware of the two kinds of string literals and everybody else just gets |
// option. This file may not be copied, modified, or distributed | ||
// except according to those terms. | ||
|
||
//error-pattern:unterminated raw string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it not possible to use the //~ ERROR
syntax here to make sure it's put in the right spot?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that it's not possible because the //~ ERROR
syntax must follow the syntax error, while the string necessarily can't be followed by a comment or it wouldn't be unterminated. Maybe there is a non-obvious way to get it right, but I haven't found it. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does using the //~^^... ERROR
variant, where you use as many ^
as needed to count backwards to the offending line, help here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I imagine that any comment after the opening "
will be included in the string if it isn't terminated (like here).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Unless the compile-test runner just checks for //~^...^ ERROR
as a string match, rather than anything lexical?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh. That actually works. I take everything back.
This looks good to me, and I would be willing to r+ this. I'd like to get at least one other opinion though. This is a pretty major change and I want to make sure we've covered all our bases in making sure it's well integrated. |
rebased to replace |
escaped in order to denote *itself*. | ||
|
||
Raw string literals do not process any escapes. They start with the character |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth putting an example or two here? e.g. r"foo\bar"
r####"string "### still in the string"####
or something.
@huonw I added a few examples in rust.md and changed the error message to |
|
||
"foo #\"# bar", r##"foo #"# bar"## // foo #"# bar | ||
|
||
"\x2603", "☃", r"☃" // ☃ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be \u2603
.
@huonw fixed, thanks |
@@ -353,6 +353,10 @@ whose literals are written between single quotes, as in `'x'`. | |||
Just like C, Rust understands a number of character escapes, using the backslash | |||
character, such as `\n`, `\r`, and `\t`. String literals, | |||
written between double quotes, allow the same escape sequences. | |||
|
|||
Raw string literals on the other hand do not process any escape sequences. | |||
They are written as `r##"blah"##`, with any (matching) number of `#` on either |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"optionally with a matching number of #
on each side", or something else to make it clear that zero #
s is allowed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or perhaps: "with any non-negative number of #
characters on each side" ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer the phrasing "zero or more" to "non-negative". The idea of a negative number of characters is just confusing.
I haven't read the details here but @pnkfelix I think I disagree that the pretty-printer should decide whether to use raw strings or not. Rather I think it would be more consistent if we store exactly what was written, as we do with constants, to write out during pretty-printing. |
I agree it's more consistent. Any chance we could change the |
As per @brson above, try to store the kind of string literal in the ast. Seems to work in expression context, The I apologize for holding up this feature for so long. This probably still isn't ready to merge, I don't feel very good about how the pretty-printing of raw string literals impacts the AST and its consumers and I'm still trying to think of a better way to handle it. |
Raw string literals are lexed into regular string literals. This is okay for them to "work" and be usable/testable, but the pretty-printer does not know about them yet and will just emit regular string literals.
Treat it as a synonym for LIT_STR for now.
For the benefit of the pretty printer we want to keep track of how string literals in the ast were originally represented in the source code. This commit changes parser functions so they don't extract strings from the token stream without at least also returning what style of string literal it was. This is stored in the resulting ast node for string literals, obviously, for the package id in `extern mod = r"package id"` view items, for the inline asm in `asm!()` invocations. For `asm!()`'s other arguments or for `extern "Rust" fn()` items, I just the style of string, because it seemed disproportionally cumbersome to thread that information through the string processing that happens with those string literals, given the limited advantage raw string literals would provide in these positions. The other syntax extensions don't seem to store passed string literals in the ast, so they also discard the style of strings they parse.
@alexcrichton thinks we can get away without pretty-printing ABI names in |
This branch parses raw string literals as in #9411.
This branch parses raw string literals as in #9411.
Fix `needless_borrow` false positive The PR fixes the false positive exposed by `@BusyJay's` example in: rust-lang/rust-clippy#9111 (comment) The current approach is described in rust-lang/rust-clippy#9674 (comment) and rust-lang/rust-clippy#9674 (comment). The original approach appears below. --- The proposed fix is to flag only "simple" trait implementations involving references, a concept that I introduce next. Intuitively, a trait implementation is "simple" if all it does is dereference and apply the trait implementation of a type named by a type parameter. `AsRef` provides a good example of a simple implementation: https://doc.rust-lang.org/std/convert/trait.AsRef.html#impl-AsRef%3CU%3E-for-%26T We can make this idea more precise as follows. Given a trait implementation, first determine whether the implementation is "used defined." If so, then examine its nested obligations. Consider the implementation simple if-and-only-if: - there is at least one nested obligation for the same trait - for each type `X` in the nested obligation's substitution, either `X` is the same as that of the original obligation's substitution, or the original type is `&X` For example, the following implementation from `@BusyJay's` example is "complex" (i.e., not simple) because it produces no nested obligations: ```rust impl<'a> Extend<&'a u8> for A { ... } ``` On the other hand, the following slightly modified implementation is simple, because it produces a nested obligation for `Extend<X>`: ```rust impl<'a, X> Extend<&'a X> for A where A: Extend<X> { ... } ``` How does flagging only simple implementations help? One way of interpreting the false positive in `@BusyJay's` example is that it separates a reference from a concrete type. Doing so turns a successful type inference into a failing one. By flagging only simple implementations, we separate references from type variables only, thereby eliminating this class of false positives. Note that `Deref` is a special case, as the obligations generated for it already involve the underlying type. r? `@Jarcho` (Sorry to keep pinging you with `needless_borrow` stuff. But my impression is no one knows this code better than you.) changelog: fix `needless_borrow` false positive
This branch parses raw string literals as in #9411.