Fix string indexing #167

phansch · 2019-04-04T18:04:34Z

Previously rustfix was treating the highlight indices as byte offsets.
This caused it to crash when a highlight index was pointing inside a
multi-byte character.

This fix makes sure to index into the characters of the string instead
of the bytes.

Fixes rust-lang/rust#59660

Previously rustfix was treating the highlight indices as byte offsets. This caused it to crash when a highlight index was pointing inside a multi-byte character. This fix makes sure to index into the characters of the string instead of the bytes.

estebank · 2019-04-10T20:18:09Z

@rust-lang-nursery/libs, can someone review this PR?

BurntSushi · 2019-04-10T21:12:12Z

I'm not sure this makes sense to me. Why are we reporting spans using codepoint offsets? I imagine we know the encoding, so we should be using byte offsets.

ehuss · 2019-06-01T17:43:22Z

@BurntSushi Is that concern enough to block this PR? I'm not terribly familiar with rustc's internals, but it looks like things like Loc and LineInfo are very character-offset oriented. IIUC, all these structures would need to add a byte offset field as well, and it's not clear to me how easy that would be.

BurntSushi · 2019-06-02T13:42:57Z

I'm not familiar with rustc internals either. I wasn't necessarily intending to block this PR. It just seems weird to me that codepoint offsets are being used. The only place where that makes sense, I think, is when you want to represent offsets into the same text across multiple encodings. Codepoint offsets will be invariant, but byte offsets wouldn't be. The downside is that you need either more time or more memory to convert between them.

estebank · 2019-06-04T18:14:19Z

From the rustc point of view, we could provide either, but any change in this area has the potential of being a breaking change. I guess that the ideal situation would be to have both char offset and byte offset provided in the json output, but that would be overly verbose IMO. The reason we provide char offsets instead of byte offsets is because we make this conversion internally for user consumption, to avoid leaking byte offsets in the cli ui.

BurntSushi · 2019-06-04T19:17:15Z

Why is it desirable to avoid providing byte offsets in lieu of char offsets?

…

On Tue, Jun 4, 2019, 14:14 Esteban Kuber ***@***.***> wrote: From the rustc point of view, we could provide either, but any change in this area has the potential of being a breaking change. I guess that the ideal situation would be to have both char offset and byte offset provided in the json output, but that would be overly verbose IMO. The reason we provide char offsets instead of byte offsets is because we make this conversion internally for user consumption, to avoid leaking byte offsets in the cli ui. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#167>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AADPPYV4O4GV5CCIPRVMFJTPY2WH3ANCNFSM4HDVR7XA> .

estebank · 2019-06-04T19:20:47Z

Because the original consumer of these are the users, which count chars on screen, not bytes.

We could provide bytes in the json output, but that means opening the ticket to do so in the rust repo.

BurntSushi · 2019-06-04T19:24:42Z

Ah I see. Basically, if the intent is for users to read them directly, then I agree that codepoint offsets are bettee than byte offsets. But if you want a machine to read them and use them to slice the text, then we should be using byte offsets. Having both seems fine to me. (And indees, one might even want grapheme cluster offsets instead of codepoint offsets, for human consumption.)

Either way, I don't mean to raise this as a blocking issue. I'm just giving my opinion on what we should be doing. If we need to slice by codepoint to solve a pressing issue, then by all means, go for it.

ehuss · 2019-07-16T19:48:34Z

@killercup Do you think you could take a look at this? The PR itself seems fine to me, I've tested it with a few different odd characters (like combining characters), and it seems to be ok. Overall I think it would probably be safer to have byte offsets, but I think that would require significant changes in rustc, and I'm not sure what tangible situations it would actually matter.

SimonSapin · 2019-08-07T15:10:36Z

Because the original consumer of these are the users, which count chars on screen, not bytes.

By “characters on screen”, do you mean columns in a monospace font terminal’s grid? They’re not necessarily the same. And the count of char code points does not necessarily equal either of them, though it’s a less bad approximation than UTF-8 byte count.

Fix string indexing

0cc670d

Previously rustfix was treating the highlight indices as byte offsets. This caused it to crash when a highlight index was pointing inside a multi-byte character. This fix makes sure to index into the characters of the string instead of the bytes.

ehuss mentioned this pull request Aug 6, 2019

cargo fix panics thread 'main' panicked at 'byte index 22 is not a char boundary; rust-lang/cargo#7220

Closed

alexcrichton merged commit 836108f into rust-lang:master Aug 7, 2019

phansch deleted the fix_string_indexing branch August 7, 2019 14:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix string indexing #167

Fix string indexing #167

phansch commented Apr 4, 2019

estebank commented Apr 10, 2019

BurntSushi commented Apr 10, 2019

ehuss commented Jun 1, 2019

BurntSushi commented Jun 2, 2019

estebank commented Jun 4, 2019

BurntSushi commented Jun 4, 2019 via email

estebank commented Jun 4, 2019

BurntSushi commented Jun 4, 2019

ehuss commented Jul 16, 2019

SimonSapin commented Aug 7, 2019

Fix string indexing #167

Fix string indexing #167

Conversation

phansch commented Apr 4, 2019

estebank commented Apr 10, 2019

BurntSushi commented Apr 10, 2019

ehuss commented Jun 1, 2019

BurntSushi commented Jun 2, 2019

estebank commented Jun 4, 2019

BurntSushi commented Jun 4, 2019 via email

estebank commented Jun 4, 2019

BurntSushi commented Jun 4, 2019

ehuss commented Jul 16, 2019

SimonSapin commented Aug 7, 2019