Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Translate \r\n -> \n when reading files #62865

Closed
matklad opened this issue Jul 22, 2019 · 2 comments · Fixed by #62948
Closed

Translate \r\n -> \n when reading files #62865

matklad opened this issue Jul 22, 2019 · 2 comments · Fixed by #62948
Labels
A-frontend Area: frontend (errors, parsing and HIR) C-cleanup Category: PRs that clean code up or issues documenting cleanup. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@matklad
Copy link
Member

matklad commented Jul 22, 2019

Currently, we handle \r\n explicitly in the lexer. We should do this at the file read time instead.

Motivation:

  1. Line endings should not affect semantics of the language. For example, git on windows by default checkouts with \r\n line endings, and it would be bad if compiling code on windows led to observably different results. This could be handled on a case-by-case basis in the lexer (the current approach), but it's easy to miss something (for example Raw Byte strings do not handle \r #60604). Additionally, proc macros currently see original tokens, and so can observe different line endings. The simplest way to make compiler lineending-invariant is to normalize line endings as soon as possible.

  2. For IDE features, which work close to the source code, and especially for code generation during refactorings, the surface are where you need to handle different line endings is much larger. It would be easier for IDE to assume \n line endings and convert at one place, at the boundary.

i propose that we convert \r\n once, in SourceFile::new constructor. Note that we already do BOM-removal there, so, in theory, all code should already be prepared to mismatch between in-memory and on-disk file content. The replacement shortens the string, so it can be a pretty fast in-place transformation.

One technical question is what to do with isolated \r? I see two options:

  • leave them as is. This makes the transform non-idempotent though: \r\r\n -> \r\n, and requires other code to explicitely not treat \r\n as line ending.
  • error at file load time if there's a lone \r in the source text. That is, after normalization, \r can not be found in the source code at all.

I propose to go with the second option it's slightly more annoying to implement, but seems more robust.

@matklad matklad changed the title Translate \r\n -> \n when reading file Translate \r\n -> \n when reading files Jul 22, 2019
@matklad
Copy link
Member Author

matklad commented Jul 22, 2019

cc @petrochenkov

@jonas-schievink jonas-schievink added A-frontend Area: frontend (errors, parsing and HIR) C-cleanup Category: PRs that clean code up or issues documenting cleanup. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Jul 22, 2019
@matklad
Copy link
Member Author

matklad commented Jul 24, 2019

Found a bug: here we load doc comment via a file, without translate_crlf step.

@bors bors closed this as completed in ef1ecbe Aug 18, 2019
sbrugman added a commit to sbrugman/ruff that referenced this issue Jan 21, 2023
sbrugman added a commit to sbrugman/ruff that referenced this issue Jan 23, 2023
sbrugman added a commit to sbrugman/ruff that referenced this issue Jan 24, 2023
github-merge-queue bot pushed a commit to leanprover/lean4 that referenced this issue May 19, 2024
To eliminate parsing differences between Windows and other platforms,
the frontend now normalizes all CRLF line endings to LF, like [in
Rust](rust-lang/rust#62865).

Effects:
- This makes Lake hashes be faithful to what Lean sees (Lake already
normalizes line endings before computing hashes).
- Docstrings now have normalized line endings. In particular, this fixes
`#guard_msgs` failing multiline tests for Windows users using CRLF.
- Now strings don't have different lengths depending on the platform.
Before this PR, the following theorem is true for LF and false for CRLF
files.
```lean
example : "
".length = 1 := rfl
```

Note: the normalization will take `\r\r\n` and turn it into `\r\n`. In
the elaborator, we reject loose `\r`'s that appear in whitespace. Rust
instead takes the approach of making the normalization routine fail.
They do this so that there's no downstream confusion about any `\r\n`
that appears.

Implementation note: the LSP maintains its own copy of a source file
that it updates when edit operations are applied. We are assuming that
edit operations never split or join CRLFs. If this assumption is not
correct, then the LSP copy of a source file can become slightly out of
sync. If this is an issue, there is some discussion
[here](#3903 (comment)).
github-merge-queue bot pushed a commit to leanprover/lean4 that referenced this issue May 20, 2024
To eliminate parsing differences between Windows and other platforms,
the frontend now normalizes all CRLF line endings to LF, like [in
Rust](rust-lang/rust#62865).

Effects:
- This makes Lake hashes be faithful to what Lean sees (Lake already
normalizes line endings before computing hashes).
- Docstrings now have normalized line endings. In particular, this fixes
`#guard_msgs` failing multiline tests for Windows users using CRLF.
- Now strings don't have different lengths depending on the platform.
Before this PR, the following theorem is true for LF and false for CRLF
files.
```lean
example : "
".length = 1 := rfl
```

Note: the normalization will take `\r\r\n` and turn it into `\r\n`. In
the elaborator, we reject loose `\r`'s that appear in whitespace. Rust
instead takes the approach of making the normalization routine fail.
They do this so that there's no downstream confusion about any `\r\n`
that appears.

Implementation note: the LSP maintains its own copy of a source file
that it updates when edit operations are applied. We are assuming that
edit operations never split or join CRLFs. If this assumption is not
correct, then the LSP copy of a source file can become slightly out of
sync. If this is an issue, there is some discussion
[here](#3903 (comment)).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-frontend Area: frontend (errors, parsing and HIR) C-cleanup Category: PRs that clean code up or issues documenting cleanup. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants