Skip to content
This repository has been archived by the owner on Nov 24, 2023. It is now read-only.

Fix string indexing #167

Merged
merged 1 commit into from
Aug 7, 2019
Merged

Conversation

phansch
Copy link
Member

@phansch phansch commented Apr 4, 2019

Previously rustfix was treating the highlight indices as byte offsets.
This caused it to crash when a highlight index was pointing inside a
multi-byte character.

This fix makes sure to index into the characters of the string instead
of the bytes.

Fixes rust-lang/rust#59660

Previously rustfix was treating the highlight indices as byte offsets.
This caused it to crash when a highlight index was pointing inside a
multi-byte character.

This fix makes sure to index into the characters of the string instead
of the bytes.
@estebank
Copy link

@rust-lang-nursery/libs, can someone review this PR?

@BurntSushi
Copy link
Member

I'm not sure this makes sense to me. Why are we reporting spans using codepoint offsets? I imagine we know the encoding, so we should be using byte offsets.

@ehuss
Copy link
Collaborator

ehuss commented Jun 1, 2019

@BurntSushi Is that concern enough to block this PR? I'm not terribly familiar with rustc's internals, but it looks like things like Loc and LineInfo are very character-offset oriented. IIUC, all these structures would need to add a byte offset field as well, and it's not clear to me how easy that would be.

@BurntSushi
Copy link
Member

I'm not familiar with rustc internals either. I wasn't necessarily intending to block this PR. It just seems weird to me that codepoint offsets are being used. The only place where that makes sense, I think, is when you want to represent offsets into the same text across multiple encodings. Codepoint offsets will be invariant, but byte offsets wouldn't be. The downside is that you need either more time or more memory to convert between them.

@estebank
Copy link

estebank commented Jun 4, 2019

From the rustc point of view, we could provide either, but any change in this area has the potential of being a breaking change. I guess that the ideal situation would be to have both char offset and byte offset provided in the json output, but that would be overly verbose IMO. The reason we provide char offsets instead of byte offsets is because we make this conversion internally for user consumption, to avoid leaking byte offsets in the cli ui.

@BurntSushi
Copy link
Member

BurntSushi commented Jun 4, 2019 via email

@estebank
Copy link

estebank commented Jun 4, 2019

Because the original consumer of these are the users, which count chars on screen, not bytes.

We could provide bytes in the json output, but that means opening the ticket to do so in the rust repo.

@BurntSushi
Copy link
Member

Ah I see. Basically, if the intent is for users to read them directly, then I agree that codepoint offsets are bettee than byte offsets. But if you want a machine to read them and use them to slice the text, then we should be using byte offsets. Having both seems fine to me. (And indees, one might even want grapheme cluster offsets instead of codepoint offsets, for human consumption.)

Either way, I don't mean to raise this as a blocking issue. I'm just giving my opinion on what we should be doing. If we need to slice by codepoint to solve a pressing issue, then by all means, go for it.

@ehuss
Copy link
Collaborator

ehuss commented Jul 16, 2019

@killercup Do you think you could take a look at this? The PR itself seems fine to me, I've tested it with a few different odd characters (like combining characters), and it seems to be ok. Overall I think it would probably be safer to have byte offsets, but I think that would require significant changes in rustc, and I'm not sure what tangible situations it would actually matter.

@SimonSapin
Copy link

Because the original consumer of these are the users, which count chars on screen, not bytes.

By “characters on screen”, do you mean columns in a monospace font terminal’s grid? They’re not necessarily the same. And the count of char code points does not necessarily equal either of them, though it’s a less bad approximation than UTF-8 byte count.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DiagnosticSpanLine pointing inside char boundary
6 participants