-
Notifications
You must be signed in to change notification settings - Fork 62
Conversation
Previously rustfix was treating the highlight indices as byte offsets. This caused it to crash when a highlight index was pointing inside a multi-byte character. This fix makes sure to index into the characters of the string instead of the bytes.
@rust-lang-nursery/libs, can someone review this PR? |
I'm not sure this makes sense to me. Why are we reporting spans using codepoint offsets? I imagine we know the encoding, so we should be using byte offsets. |
@BurntSushi Is that concern enough to block this PR? I'm not terribly familiar with rustc's internals, but it looks like things like |
I'm not familiar with rustc internals either. I wasn't necessarily intending to block this PR. It just seems weird to me that codepoint offsets are being used. The only place where that makes sense, I think, is when you want to represent offsets into the same text across multiple encodings. Codepoint offsets will be invariant, but byte offsets wouldn't be. The downside is that you need either more time or more memory to convert between them. |
From the |
Why is it desirable to avoid providing byte offsets in lieu of char offsets?
…On Tue, Jun 4, 2019, 14:14 Esteban Kuber ***@***.***> wrote:
From the rustc point of view, we could provide either, but any change in
this area has the potential of being a breaking change. I guess that the
ideal situation would be to have both char offset and byte offset provided
in the json output, but that would be overly verbose IMO. The reason we
provide char offsets instead of byte offsets is because we make this
conversion internally for user consumption, to avoid leaking byte offsets
in the cli ui.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#167>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AADPPYV4O4GV5CCIPRVMFJTPY2WH3ANCNFSM4HDVR7XA>
.
|
Because the original consumer of these are the users, which count chars on screen, not bytes. We could provide bytes in the json output, but that means opening the ticket to do so in the rust repo. |
Ah I see. Basically, if the intent is for users to read them directly, then I agree that codepoint offsets are bettee than byte offsets. But if you want a machine to read them and use them to slice the text, then we should be using byte offsets. Having both seems fine to me. (And indees, one might even want grapheme cluster offsets instead of codepoint offsets, for human consumption.) Either way, I don't mean to raise this as a blocking issue. I'm just giving my opinion on what we should be doing. If we need to slice by codepoint to solve a pressing issue, then by all means, go for it. |
@killercup Do you think you could take a look at this? The PR itself seems fine to me, I've tested it with a few different odd characters (like combining characters), and it seems to be ok. Overall I think it would probably be safer to have byte offsets, but I think that would require significant changes in rustc, and I'm not sure what tangible situations it would actually matter. |
By “characters on screen”, do you mean columns in a monospace font terminal’s grid? They’re not necessarily the same. And the count of |
Previously rustfix was treating the highlight indices as byte offsets.
This caused it to crash when a highlight index was pointing inside a
multi-byte character.
This fix makes sure to index into the characters of the string instead
of the bytes.
Fixes rust-lang/rust#59660