-
Notifications
You must be signed in to change notification settings - Fork 12.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
convert \r\n -> \n in include_str! macro #63681
Conversation
(rust_highfive has picked a reviewer for you, use r? to override) |
Seems like a reasonable change to me and it removes exceptions (yay!).
Currently the crater queue is quite long, but do you want to crater it in any case? I think this is small enough a change that we don't need an FCP but it seems good to discuss it a bit at least so I've nominated for the T-Lang meeting on Thursday (21:00 Stockholm/Paris time). Feel free to join that meeting. |
So, the primary alternatives / extensions are:
The choice between these extensions and doing nothing probably depends on what crater finds. |
This change is very easily revertable and backportable if necessary, so we may get the necessary experimental results faster and better by landing it on nightly, rather than by running crater. |
The job Click to expand the log.
I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact |
Ideally, the meaning of the program should be independent of the line endings used, because, for example, git can change line endings during a checkout. We currently do line-ending conversion in almost all cases, with `include_str` being an exception. This commit removes this exception, bringing `include_str` closer in behavior to string literals. Note that this is technically a breaking change. In case that you really mean to include a string with DOS line endings, you can use `include_bytes!` macro which is guaranteed to not do any translation, like this pub fn my_text() -> &'static str { unsafe { std::str::from_utf8_unchecked(MY_TEXT_BYTES); } } const MY_TEXT_BYTES: &[u8] = include_bytes("my_text.bin"); #[test] fn test_encoding() { std::str::from_utf8(MY_TEXT_BYTES) .unwrap(); }
dffb988
to
df73b26
Compare
Yay bad defaults strike again... even if there are ways to ensure the right behavior in I don't have much against this change, although I will say we should make |
@@ -7,6 +7,6 @@ fn main() { | |||
); | |||
assert_eq!( | |||
include_str!("data.bin"), | |||
"\u{FEFF}This file starts with BOM.\r\nLines are separated by \\r\\n.\r\n", | |||
"This file starts with BOM.\nLines are separated by \\r\\n.\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Lines are separated by \\n."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, it should be \\r\\n
, escaped \r is not changed in any way.
This is a breaking change. Unlike with |
For |
@petrochenkov I don't think so: line endings in multiline string literals are not observable, regardless of the way you get the literal. |
I mean the normalization is observable, the literals have |
cc @retep998 |
Regarding prior art on Windows, MSVC normalizes Rust always normalizes too (including raw strings), with |
Another alternative: we could add a new |
I personally am not opposed to changing |
Hey all, we discussed this in the @rust-lang/lang meeting. It seemed to us that it would be a good topic for an RFC, as the best decision is not entirely obvious. In particular it'd be good to get a survey of how similar features work in other languages. I think there was some general discomfort with changing the behavior here -- I guess it comes down to whether one considers the existing behavior a "bug" or a "feature". Going with another name (and probably deprecating the existing function) might obviate that concern. A stamp of approval from RFC commentors might also help. If you think this is unreasonable, feel free to complain. (Side question that occurs to me now: this seems like one of those "kind of lang kind of libs" questions, so I'd also like to hear what @rust-lang/libs members think about this question.) |
I would personally find it very surprising if As Koute said, Git's special handling of line endings is a big misfeature, and it seems strange to design around it. (And will probably be used less and less, with even Notepad supporting unix line endings.) In particular, I would be worried about any situation involving hashes and digital signatures. If data included with |
CRLFs aren't just a Windows thing, unfortunately. Most IETF network protocols (e.g. HTTP, SMTP, SIP, etc.) require CRLF line breaks. It's common to store protocol snippets in files for various testing purposes, and these files invariably contain the CRLFs. Some libraries treat these kind of protocols as binary, but others treat them as UTF-8 and may use @casey's point about breaking hashes and digital signatures is also a very serious concern. In Rust source files, newline normalization is indeed the right thing to do. But making |
If there are use cases for unnormalized UTF-8, then we need The result of |
I'm also opposed to this change. A lot of the discussion here seems to be focused around the notion that:
Both points are eminently reasonable, and indeed, I might even be swayed by them if we had a do-over and were able to redesign these APIs in a vacuum. But we're not. I'm opposed to this change not only because it's a breaking change, but it's a very subtle breaking change. It's the kind of breaking change that won't easily be detected, but could prove quite frustrating to folks who stumble over it. IMO, these kinds of cases---where one might reasonably call this an inconsistent wart---are precisely where we should be doubling down on our commitment to stability. On the one hand, we have a wart that we might be able to fix with fairly little breakage in practice. But on the other hand, we have a wart that doesn't seem to be causing any substantial frustration in practice, other than one perhaps deeming it inconsistent. From my perspective, that's not worth breaking our commitment to stability. I'd also like to address some other minor points:
If we were convinced that we wanted to solve this problem, then yes, I agree this would be a better alternative than changing the behavior of
Could you help me understand this a bit more? Are you saying that a breaking change was made to the semantics of raw string literals without going through an FCP? That seems like a process break down, and explicitly not something to cite as a precdent. |
Do you have any evidence which confirms this? Virtually every time I use
In case of C this is a huge misfeature, as then the behavior of your program becomes subtly different depending on which OS you're running on (e.g. on UNIX-like systems |
The final comment period, with a disposition to merge, as per the review above, is now complete. As the automated representative of the governance process, I would like to thank the author for their work and everyone else who contributed. The RFC will be merged soon. |
How on earth is the disposition merge? There are a number of outstanding issues that myself and others have brought up, including breaking the back-compatibility promise of Rust. |
This is just an automated, time-based message based on #63681 (comment). It occurred as none of the concerns were registered with rfcbot. |
Looking over my own code, all my use of Also, it's not obvious to me that there is any advantage in having Given my current understanding, this change is all downside with very little upside. Is internal consistency the only motivating factor for this change? |
Also, just a note that
(Just noting this because I was a little confused by the term "normalization" being used to include |
All of those fit into "just source code" in my terminology, and want newlines converted, and in theory may be written inline rather than in separate files. |
Well, things like this are often at the reviewer's discretion and depend on estimated impact, many compiler changes affect observable behavior in specially crafted corner cases, if we FCP them all we won't move anywhere.
|
Yeah I get that. I'm looking for a clarification here, as I want to make sure I'm understanding right. It sounds like that change was made to the language, as it impacts the semantics of how raw strings are interpreted. Are those kinds of changes regularly made without some kind of consensus from the lang team? It seems like it rises well beyond a tweak in observable compiler behavior. |
I think it's reasonable for the Rust toolchain to make assumptions about Rust source code, but I don't think is's a good idea to make assumptions about non-Rust source code. Here is an example of non-Rust source code distinguishing between |
Thank you to everyone who commented during the Final Comment Period here. The FCP exists exactly for situations like this, to bring attention to changes and give time for everyone to bring up additional data and concerns that may not have been raised before. We discussed this in the lang team meeting today, and decided that in light of the situations raised the breakage is more than we're willing to take, so we will not go forward with the change. @rfcbot fcp cancel If anyone is interested in this still happening, or otherwise enabling the "does the same thing despite git checkout platform" scenario, one could open an RFC to do this over an edition boundary, add a new macro, do something in the library, or a variety of other possibilities. |
@scottmcm proposal cancelled. |
^^ For what it's worth, it would be relatively trivial to implement this proposal as a proc macro. |
As a follow-up question: Is there any linting that people would expect here? For example, there could be a lint for inconsistent newlines in (Commenting separately from my previous "official" one, since this is just me wondering.) |
I think a lint, either enable-by-default or in clippy would make sense. Even for most situations where |
Based on our previous comment, I'm going to go ahead and close this PR. Thanks @matklad for the suggestion. |
Edit: lang team response #63681 (comment) (after 2019-10-03 meeting)
Ideally, the meaning of the program should be independent of the line
endings used, because, for example, git can change line endings during
a checkout.
We currently do line-ending conversion in almost all cases, with
include_str
being an exception. This commit removes this exception,bringing
include_str
closer in behavior to string literals.Note that this is technically a breaking change.
In case that you really mean to include a string with DOS line
endings, you can use
include_bytes!
macro which is guaranteed to notdo any translation, like this
cc @petrochenkov, @rust-lang/lang
Some preliminary discussion in #63525 (comment)