implement raw strings (r#"foo"#) #9674

ben0x539 · 2013-10-02T05:49:50Z

This branch parses raw string literals as in #9411.

ben0x539 · 2013-10-02T05:52:58Z

I'm submitting this PR because I'd like some review and comments on whether I'm going in the right direction here.

What bothers me is described in the commit message for 06eee38 (edit: removed that commit): I want the pretty-printer to reproduce the right style of string literal, but I hesitate to thread that extra info through all the mangling and data structures between the parser itself and what ends up in the AST. Would that be the right way to go?

pnkfelix · 2013-10-02T07:16:17Z

@ben0x539

My initial thought is that maybe the pretty printer should be in charge of deciding for itself, based on the content of the string, what kind of string literal is best.

E.g. if the author didn't "need" to use a raw string, then it seems to me like the pretty-printer could emit a normal string literal. And likewise, if the author painfully put in a string with many escapes, the pretty-printer might scan it for any embedded doublequotes, and if so, then the longest occurrence of "##... (which will hopefully be none at all), and use that to determine what kind of raw string to produce.

(This would raise some policy questions like "how many escapes is too many" (as in, what is the threshold where I should switch to using a raw string). Off the top of my head, I would say more than two escapes implies print raw-string. Of course this would hopefully be easy to change after the fact.)

Also, @kballard and @Kimundi will probably want to weigh in. :)

pnkfelix · 2013-10-02T08:12:52Z

addendum to my prior suggestion: My suggested policy was a bit naive; e.g. I am not sure I would want "\n\n\n" or "\t\t\t" to be pretty-printed as a raw-string. And unicode escapes may also be exceptions of note (though I don't know if the pretty-printer actually prints those via escape sequences currently anyway). So maybe it would be better to count up the number of escapes for things like \\ and \", and if that is more than two, then use a raw-string. Anyway, I trust you to work out these sorts of details.

ben0x539 · 2013-10-02T08:27:18Z

I suppose pretty-printing outputting the precise input sequence is already not viable since it's not recorded anywhere whether something like a newline a string resulted from a literal newline or an escape sequence (also \x, \u, \U, other special characters). Inspecting spans would also fall short in the ast bits where there's no lit_str, just another kind of node with strs in it (though maybe they should be changed to carry a lit_ around and only extract the string later on?)

There's so many ways to encode the same string, I'm not sure how far a heuristic would go towards finding the "ideal" string literal every time. :(

pnkfelix · 2013-10-02T10:53:11Z

@ben0x539 IMO the pretty-printer already breaks down pretty badly in a number of cases (e.g. the way that comments get moved around), so I do not think you should invest too much effort in trying to come up with the perfect answer here.

My main piece of advice: get the pretty printer to the point where it can emit either variant of string literal, and then factor how the choice is made into a cleanly separated policy. Then the team can refine the policy later as we gain more experience.

ben0x539 · 2013-10-02T20:49:35Z

I have a change that would make the pretty printer emit whichever is the shorter kind of string literal. I'm not sure about that because that doesn't affect the --pretty normal mode which doesn't re-render literals but just gets them out of another run of the tokenizer via syntax::parse::comments from what I can tell.

With that change (added above), only the lexer and the parser are aware of the two kinds of string literals and everybody else just gets ast::lit_str nodes, which the pretty-printer selectively renders as either kind of literal in the --pretty expanded etc modes.

alexcrichton · 2013-10-02T23:49:46Z

src/test/compile-fail/raw-str-unterminated.rs

+// option. This file may not be copied, modified, or distributed
+// except according to those terms.
+
+//error-pattern:unterminated raw string


Is it not possible to use the //~ ERROR syntax here to make sure it's put in the right spot?

My understanding is that it's not possible because the //~ ERROR syntax must follow the syntax error, while the string necessarily can't be followed by a comment or it wouldn't be unterminated. Maybe there is a non-obvious way to get it right, but I haven't found it. :)

does using the //~^^... ERROR variant, where you use as many ^ as needed to count backwards to the offending line, help here?

I imagine that any comment after the opening " will be included in the string if it isn't terminated (like here).

(Unless the compile-test runner just checks for //~^...^ ERROR as a string match, rather than anything lexical?)

Oh. That actually works. I take everything back.

alexcrichton · 2013-10-03T01:25:46Z

This looks good to me, and I would be willing to r+ this. I'd like to get at least one other opinion though. This is a pretty major change and I want to make sure we've covered all our bases in making sure it's well integrated.

ben0x539 · 2013-10-03T02:23:03Z

history-edited the unterminated string literal test to use //~^^ ERROR. thanks, @pnkfelix / @huonw!

ben0x539 · 2013-10-03T07:45:43Z

rebased to replace loop with continue

huonw · 2013-10-03T08:42:56Z

doc/rust.md

    escaped in order to denote *itself*.

+Raw string literals do not process any escapes. They start with the character


Is it worth putting an example or two here? e.g. r"foo\bar" r####"string "### still in the string"#### or something.

ben0x539 · 2013-10-03T10:06:56Z

@huonw I added a few examples in rust.md and changed the error message to "only # is allowed in raw string delimitation; found illegal character: <whatever>"

huonw · 2013-10-03T10:15:28Z

doc/rust.md

+
+"foo #\"# bar", r##"foo #"# bar"## // foo #"# bar
+
+"\x2603", "☃", r"☃"                // ☃


Should be \u2603.

ben0x539 · 2013-10-03T11:12:29Z

@huonw fixed, thanks

huonw · 2013-10-03T11:55:03Z

doc/tutorial.md

@@ -353,6 +353,10 @@ whose literals are written between single quotes, as in `'x'`.
 Just like C, Rust understands a number of character escapes, using the backslash
 character, such as `\n`, `\r`, and `\t`. String literals,
 written between double quotes, allow the same escape sequences.
+
+Raw string literals on the other hand do not process any escape sequences.
+They are written as `r##"blah"##`, with any (matching) number of `#` on either


"optionally with a matching number of # on each side", or something else to make it clear that zero #s is allowed?

or perhaps: "with any non-negative number of # characters on each side" ?

I prefer the phrasing "zero or more" to "non-negative". The idea of a negative number of characters is just confusing.

brson · 2013-10-04T13:55:28Z

I haven't read the details here but @pnkfelix I think I disagree that the pretty-printer should decide whether to use raw strings or not. Rather I think it would be more consistent if we store exactly what was written, as we do with constants, to write out during pretty-printing.

ben0x539 · 2013-10-04T14:48:12Z

I agree it's more consistent.

Any chance we could change the extern "Rust" fn() syntax to something like extern(Rust) fn() though? ;)

ben0x539 · 2013-10-05T08:14:42Z

As per @brson above, try to store the kind of string literal in the ast. Seems to work in expression context, extern r"Rust" ABI decls, extern mod blah = r"blah" position, with extra fields in a bunch of ast nodes storing an enum StrStyle { CookedStr, RawStr(uint) } that are mostly ignored or blindly reproduced outside of the pretty printer. I hope I didn't miss string literals impliclty carried in the ast elsewhere.

The asm!() syntax extension half-works; the pretty printer will reproduce the asm code as a raw string if it was given as such. I wasn't sure whether adding another StrStyle field for all the other positions for string literals in the asm!() parameters so I'm only allowing non-raw strings there for now. Tempted to suggest the syntax extension could parse things like =r(0) instead of "r"(0). :}

I apologize for holding up this feature for so long. This probably still isn't ready to merge, I don't feel very good about how the pretty-printing of raw string literals impacts the AST and its consumers and I'm still trying to think of a better way to handle it.

Raw string literals are lexed into regular string literals. This is okay for them to "work" and be usable/testable, but the pretty-printer does not know about them yet and will just emit regular string literals.

Treat it as a synonym for LIT_STR for now.

For the benefit of the pretty printer we want to keep track of how string literals in the ast were originally represented in the source code. This commit changes parser functions so they don't extract strings from the token stream without at least also returning what style of string literal it was. This is stored in the resulting ast node for string literals, obviously, for the package id in `extern mod = r"package id"` view items, for the inline asm in `asm!()` invocations. For `asm!()`'s other arguments or for `extern "Rust" fn()` items, I just the style of string, because it seemed disproportionally cumbersome to thread that information through the string processing that happens with those string literals, given the limited advantage raw string literals would provide in these positions. The other syntax extensions don't seem to store passed string literals in the ast, so they also discard the style of strings they parse.

ben0x539 · 2013-10-08T02:32:42Z

@alexcrichton thinks we can get away without pretty-printing ABI names in extern "C" etc as the right kind of string, so I'm removing that (fairly invasive) commit and just letting the pretty printer print those strings, as well as the latter asm! parameters, as regular non-raw strings. Some examples for the behavior as of this push: http://ix.io/8nr becomes http://ix.io/8ns with --pretty normal. With --pretty expanded, we get http://ix.io/8nt. That the "Rust" in the extern "Rust" fn disappears is the same behavior as on master.

This branch parses raw string literals as in #9411.

Fix `needless_borrow` false positive The PR fixes the false positive exposed by `@BusyJay's` example in: rust-lang/rust-clippy#9111 (comment) The current approach is described in rust-lang/rust-clippy#9674 (comment) and rust-lang/rust-clippy#9674 (comment). The original approach appears below. --- The proposed fix is to flag only "simple" trait implementations involving references, a concept that I introduce next. Intuitively, a trait implementation is "simple" if all it does is dereference and apply the trait implementation of a type named by a type parameter. `AsRef` provides a good example of a simple implementation: https://doc.rust-lang.org/std/convert/trait.AsRef.html#impl-AsRef%3CU%3E-for-%26T We can make this idea more precise as follows. Given a trait implementation, first determine whether the implementation is "used defined." If so, then examine its nested obligations. Consider the implementation simple if-and-only-if: - there is at least one nested obligation for the same trait - for each type `X` in the nested obligation's substitution, either `X` is the same as that of the original obligation's substitution, or the original type is `&X` For example, the following implementation from `@BusyJay's` example is "complex" (i.e., not simple) because it produces no nested obligations: ```rust impl<'a> Extend<&'a u8> for A { ... } ``` On the other hand, the following slightly modified implementation is simple, because it produces a nested obligation for `Extend<X>`: ```rust impl<'a, X> Extend<&'a X> for A where A: Extend<X> { ... } ``` How does flagging only simple implementations help? One way of interpreting the false positive in `@BusyJay's` example is that it separates a reference from a concrete type. Doing so turns a successful type inference into a failing one. By flagging only simple implementations, we separate references from type variables only, thereby eliminating this class of false positives. Note that `Deref` is a special case, as the obligations generated for it already involve the underlying type. r? `@Jarcho` (Sorry to keep pinging you with `needless_borrow` stuff. But my impression is no one knows this code better than you.) changelog: fix `needless_borrow` false positive

alexcrichton reviewed Oct 2, 2013
View reviewed changes

huonw reviewed Oct 3, 2013
View reviewed changes

doc/rust.md

"foo #\"# bar", r##"foo #"# bar"## // foo #"# bar

"\x2603", "☃", r"☃" // ☃

Copy link

Member

huonw Oct 3, 2013

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be \u2603.

huonw reviewed Oct 3, 2013
View reviewed changes

ben0x539 added 7 commits October 8, 2013 01:44

pp: print spaces around '=' in 'extern mod blah = "blah"'

e007f94

lex raw string literals, like r#"blah"#

904c6c4

Raw string literals are lexed into regular string literals. This is okay for them to "work" and be usable/testable, but the pretty-printer does not know about them yet and will just emit regular string literals.

document raw string literals in tutorial.md and rust.md

6885c73

add vim syntax highlighting support for raw string literals

18099fe

add token::LIT_STR_RAW(ident, num of # symbols)

9787872

Treat it as a synonym for LIT_STR for now.

pp: add test for raw strs in non-expression positions

d7dfe0a

bors added a commit that referenced this pull request Oct 8, 2013

auto merge of #9674 : ben0x539/rust/raw-str, r=alexcrichton

be2b33e

This branch parses raw string literals as in #9411.

bors added a commit that referenced this pull request Oct 8, 2013

auto merge of #9674 : ben0x539/rust/raw-str, r=alexcrichton

c919629

This branch parses raw string literals as in #9411.

bors closed this Oct 8, 2013

bors merged commit d7dfe0a into rust-lang:master Oct 8, 2013

alexcrichton mentioned this pull request Oct 8, 2013

RFC: Syntax for raw string literals #9411

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

implement raw strings (r#"foo"#) #9674

implement raw strings (r#"foo"#) #9674

ben0x539 commented Oct 2, 2013

ben0x539 commented Oct 2, 2013

pnkfelix commented Oct 2, 2013

pnkfelix commented Oct 2, 2013

ben0x539 commented Oct 2, 2013

pnkfelix commented Oct 2, 2013

ben0x539 commented Oct 2, 2013

alexcrichton Oct 2, 2013

ben0x539 Oct 2, 2013

pnkfelix Oct 3, 2013

huonw Oct 3, 2013

huonw Oct 3, 2013

ben0x539 Oct 3, 2013

alexcrichton commented Oct 3, 2013

ben0x539 commented Oct 3, 2013

ben0x539 commented Oct 3, 2013

huonw Oct 3, 2013

ben0x539 commented Oct 3, 2013

huonw Oct 3, 2013

ben0x539 commented Oct 3, 2013

huonw Oct 3, 2013

pnkfelix Oct 3, 2013

lilyball Oct 3, 2013

brson commented Oct 4, 2013

ben0x539 commented Oct 4, 2013

ben0x539 commented Oct 5, 2013

ben0x539 commented Oct 8, 2013

		escaped in order to denote itself.

		Raw string literals do not process any escapes. They start with the character


		"foo #\"# bar", r##"foo #"# bar"## // foo #"# bar

		"\x2603", "☃", r"☃" // ☃

implement raw strings (r#"foo"#) #9674

implement raw strings (r#"foo"#) #9674

Conversation

ben0x539 commented Oct 2, 2013

ben0x539 commented Oct 2, 2013

pnkfelix commented Oct 2, 2013

pnkfelix commented Oct 2, 2013

ben0x539 commented Oct 2, 2013

pnkfelix commented Oct 2, 2013

ben0x539 commented Oct 2, 2013

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alexcrichton commented Oct 3, 2013

ben0x539 commented Oct 3, 2013

ben0x539 commented Oct 3, 2013

Choose a reason for hiding this comment

ben0x539 commented Oct 3, 2013

Choose a reason for hiding this comment

ben0x539 commented Oct 3, 2013

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

brson commented Oct 4, 2013

ben0x539 commented Oct 4, 2013

ben0x539 commented Oct 5, 2013

ben0x539 commented Oct 8, 2013