Various unicode warnings [zero-width-space, unicode_canon, ascii_only] #85

Havvy · 2015-06-02T08:56:02Z

This program should warn that zero-width spaces are being used directly and should be replaced with their unicode escape code (\u{200B}).

See Unicode spaces for more information on them.
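To make the request concrete, here is a minimal sketch of the check on the text of a literal. It is not clippy's actual implementation; the function names `find_zero_width_spaces` and `escape_zero_width_spaces` are hypothetical and only the standard library is used.

```rust
/// Zero-width space, the character this issue is about.
const ZERO_WIDTH_SPACE: char = '\u{200B}';

/// Returns the byte offset of every zero-width space in `literal`.
/// (Hypothetical helper, not part of clippy.)
fn find_zero_width_spaces(literal: &str) -> Vec<usize> {
    literal
        .char_indices()
        .filter(|&(_, c)| c == ZERO_WIDTH_SPACE)
        .map(|(offset, _)| offset)
        .collect()
}

/// Builds the suggested replacement text: each invisible character is
/// swapped for the visible six-character escape sequence `\u{200B}`.
fn escape_zero_width_spaces(literal: &str) -> String {
    literal.replace(ZERO_WIDTH_SPACE, "\\u{200B}")
}
```

A lint built on this would warn whenever `find_zero_width_spaces` returns a non-empty list and offer the output of `escape_zero_width_spaces` as the suggestion.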

Manishearth · 2015-06-02T08:59:59Z

In literals, or in general?

Catching these in general seems a bit hard to do for clippy.

Havvy · 2015-06-02T09:05:39Z

I just tested, and they're not allowed in identifier names, so that only leaves literals, right?

Manishearth · 2015-06-02T09:06:52Z

Oh, I thought they may count as regular spaces as far as separating tokens goes.

That sounds good.

llogiq · 2015-06-02T09:14:44Z

Could this be generalized to also check for non-canonical Unicode representations? Or higher ranged Unicode glyphs in general (perhaps Allow by default)?

Manishearth · 2015-06-02T09:17:15Z

I was thinking the same thing, really. Though since source files are UTF-8, it's okay to use such characters anyway.

Havvy · 2015-06-02T09:19:44Z

I'd rather zero-width spaces be a specific lint, since it'd be useful to say "Hey, there's invisible characters here." An ascii_only lint might also be useful, but it shouldn't subsume this.

llogiq · 2015-06-02T09:31:41Z

@Havvy Full ack. Since the default behaviour should be different anyway, it makes sense to implement multiple lints for the different cases. This also enables users to set ascii_only to Warn, but zero_width_space to Deny (which it should be in my opinion, because frankly, that's just evil). Also unicode_canon could Warn or Allow by default – I don't have a strong opinion there.

So:

zero_width_space (default: Deny) for \u{200B}
unicode_canon (default: Warn, or Allow?) for combining unicode glyphs, e.g. a\u{300} instead of \u{e0} (unicode::decompose_canonical(c: char, f: FnMut(char)) should be helpful with that)
ascii_only (default: Allow) for anything above \u{7F}. This would include the cases covered by the above lints, but would not generate warnings where the respective more specific lint is also enabled.

If multiple lints would generate warnings for a piece of code, the lint with the higher level wins, then the more specific.
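The precedence rule above (higher level wins, then the more specific lint) could be sketched roughly as follows. All of the names here (`Level`, `Lint`, `specificity`, `matching_lints`, `pick_lint`, `default_level`) are hypothetical, and the `unicode_canon` test is deliberately simplified to "is this a combining diacritical mark", which is an assumption rather than a real canonical-form check.

```rust
/// Lint levels; the derived ordering makes Allow < Warn < Deny.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Level {
    Allow,
    Warn,
    Deny,
}

/// The three lints proposed in this thread.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Lint {
    AsciiOnly,
    UnicodeCanon,
    ZeroWidthSpace,
}

impl Lint {
    /// Larger means more specific (assumed ranking for this sketch).
    fn specificity(self) -> u8 {
        match self {
            Lint::AsciiOnly => 0,
            Lint::UnicodeCanon => 1,
            Lint::ZeroWidthSpace => 2,
        }
    }
}

/// Defaults as proposed above.
fn default_level(lint: Lint) -> Level {
    match lint {
        Lint::ZeroWidthSpace => Level::Deny,
        Lint::UnicodeCanon => Level::Warn,
        Lint::AsciiOnly => Level::Allow,
    }
}

/// Every lint that would fire for a single character.
fn matching_lints(c: char) -> Vec<Lint> {
    let mut lints = Vec::new();
    if c == '\u{200B}' {
        lints.push(Lint::ZeroWidthSpace);
    }
    // Simplification: only the Combining Diacritical Marks block.
    if ('\u{300}'..='\u{36F}').contains(&c) {
        lints.push(Lint::UnicodeCanon);
    }
    if c > '\u{7F}' {
        lints.push(Lint::AsciiOnly);
    }
    lints
}

/// Picks the single lint to report: highest level first, then the most
/// specific. Lints set to Allow never report.
fn pick_lint(c: char, level_of: impl Fn(Lint) -> Level) -> Option<(Lint, Level)> {
    matching_lints(c)
        .into_iter()
        .map(|lint| (lint, level_of(lint)))
        .filter(|&(_, level)| level != Level::Allow)
        .max_by_key(|&(lint, level)| (level, lint.specificity()))
}
```

With the defaults, `pick_lint('\u{200B}', default_level)` reports only `zero_width_space` at Deny. If a user raised `ascii_only` to Deny and left the others at Warn, `ascii_only` would win for the same character because its level is higher.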

llogiq · 2015-06-11T09:58:39Z

I just thought about another possible unicode warning: Detect right-to-left strings (or at least mixed left-to-right/right-to-left strings). We should check that on the whole crate, including comments. Note that this trick has been used e.g. in shell scripts to hide malicious code, so the check probably even belongs in the compiler.

I'll see if I can whip up a test case.
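As a rough sketch of what such a check might look for: one option is to flag the explicit bidirectional control characters, and another is to flag text that mixes ASCII letters with right-to-left letters. The helper names below are hypothetical, and the right-to-left test only covers the Hebrew and Arabic blocks, which is a coarse assumption and not a full implementation of the Unicode bidirectional algorithm.

```rust
/// Explicit bidirectional formatting characters: the marks U+200E/U+200F,
/// the embeddings and overrides U+202A..=U+202E, and the isolates
/// U+2066..=U+2069.
fn is_bidi_control(c: char) -> bool {
    matches!(
        c,
        '\u{200E}' | '\u{200F}' | '\u{202A}'..='\u{202E}' | '\u{2066}'..='\u{2069}'
    )
}

/// Coarse approximation of "right-to-left letter": anything in the
/// Hebrew or Arabic blocks. Other right-to-left scripts are ignored.
fn is_rtl_letter(c: char) -> bool {
    matches!(c, '\u{0590}'..='\u{05FF}' | '\u{0600}'..='\u{06FF}')
}

/// Byte offsets and values of all bidi control characters in `text`.
fn find_bidi_controls(text: &str) -> Vec<(usize, char)> {
    text.char_indices()
        .filter(|&(_, c)| is_bidi_control(c))
        .collect()
}

/// True if `text` contains both ASCII letters and right-to-left letters.
fn has_mixed_direction(text: &str) -> bool {
    let has_rtl = text.chars().any(is_rtl_letter);
    let has_ltr = text.chars().any(|c| c.is_ascii_alphabetic());
    has_rtl && has_ltr
}
```

Because both helpers take plain text, they could be run over comments as well as literals, which matches the "whole crate" scope suggested above.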

birkenfeld · 2015-08-12T18:43:32Z

I added the non-ascii lint in the linked PR.
As for the last point, I don't know whether clippy should have an opinion on which of NFC or NFD is the preferred normal form.

BTW, in the currently inactive test case the combining accent is at the wrong position, and applies to the opening quote :)

birkenfeld · 2015-08-12T18:45:40Z

What might be interesting is to recognize "problematic" characters that look basically the same as ASCII characters, e.g. some Greek and Cyrillic ones. (These are often used to impersonate legitimate file names and formerly DNS names.) But I don't have a list of those handy.
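To illustrate the idea, here is a tiny sketch with a hand-picked table of a few Cyrillic and Greek letters that render much like ASCII ones. The table is illustrative only and far from complete, and the function names `ascii_lookalike` and `find_lookalikes` are hypothetical.

```rust
/// Maps a handful of well-known lookalike characters to the ASCII
/// letter they resemble. Illustrative subset only, not a real list.
fn ascii_lookalike(c: char) -> Option<char> {
    match c {
        '\u{0430}' => Some('a'), // Cyrillic small letter a
        '\u{0435}' => Some('e'), // Cyrillic small letter ie
        '\u{043E}' => Some('o'), // Cyrillic small letter o
        '\u{0440}' => Some('p'), // Cyrillic small letter er
        '\u{0441}' => Some('c'), // Cyrillic small letter es
        '\u{03BF}' => Some('o'), // Greek small letter omicron
        '\u{0391}' => Some('A'), // Greek capital letter alpha
        _ => None,
    }
}

/// Returns (byte offset, found character, ASCII lookalike) for every
/// character of `text` that appears in the table above.
fn find_lookalikes(text: &str) -> Vec<(usize, char, char)> {
    text.char_indices()
        .filter_map(|(offset, c)| ascii_lookalike(c).map(|ascii| (offset, c, ascii)))
        .collect()
}
```

For example, `find_lookalikes("p\u{0430}ypal")` flags the second character, which displays like `a` but is the Cyrillic letter.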

Manishearth · 2015-08-12T18:51:59Z

We could use the same criteria punycode uses.

Manishearth · 2015-08-12T18:52:07Z

But I'm not sure if we really need to do that.

llogiq · 2015-08-14T07:34:47Z

@birkenfeld yeah I got that one wrong, fixed it. Anyway, the idea of unicode_canon is to check for NFC normal form.

birkenfeld · 2015-08-14T07:42:02Z

Why is NFC to be preferred?

llogiq · 2015-08-14T08:01:21Z

Because NFD would standardize on a\u{300} instead of \u{e0}, and the compatibility normalizations (NFKD/NFKC) change the meaning (e.g. x² becomes x2).
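A small standard-library-only sketch of the difference: the two spellings of "à" render the same but are different strings, which is what unicode_canon would care about. The helpers `code_points` and `compose_a_grave` are hypothetical, and `compose_a_grave` handles only this one pair as a toy stand-in; a real lint would need proper normalization from a Unicode library.

```rust
/// Lists the code points of `s` in `U+XXXX` notation.
fn code_points(s: &str) -> Vec<String> {
    s.chars().map(|c| format!("U+{:04X}", c as u32)).collect()
}

/// Toy stand-in for NFC composition: handles only `a` followed by
/// U+0300 (combining grave accent) and nothing else.
fn compose_a_grave(s: &str) -> String {
    s.replace("a\u{300}", "\u{e0}")
}

/// True if the two strings differ as code point sequences even though
/// `compose_a_grave` maps them to the same text.
fn differ_only_in_composition(left: &str, right: &str) -> bool {
    left != right && compose_a_grave(left) == compose_a_grave(right)
}
```

Here `code_points("a\u{300}")` yields two entries (U+0061, U+0300) while `code_points("\u{e0}")` yields one (U+00E0), so a plain `==` comparison of the two strings is false. The superscript in x² is likewise its own code point (U+00B2), which is why compatibility normalization to a plain 2 changes the meaning.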

Havvy changed the title ~~Warn when using zero-width spaces directly.~~ Various unicode warnings [zero-width-space, unicode_canon, ascii_only] Jun 2, 2015

llogiq mentioned this issue Jun 11, 2015

first unicode lint: zero_width_space #94

Merged

Manishearth added the good-first-issue, T-AST, and A-lint labels Aug 11, 2015

birkenfeld added a commit to birkenfeld/rust-clippy that referenced this issue Aug 12, 2015

unicode: add lint against non-ascii chars in literals (Allow by default), rust-lang#85

3044d3d

birkenfeld mentioned this issue Aug 12, 2015

unicode: add lint against non-ascii chars in literals (Allow by default) #146

Merged

This was referenced Aug 28, 2015

new lint: unicode_not_nfc #249

Closed

Unicode lints, second attempt: Lint whole strings, help with replacement #299

Merged

llogiq closed this as completed in #299 Sep 4, 2015

llogiq added a commit that referenced this issue Sep 4, 2015

Merge pull request #299 from Manishearth/unicode_str

0c50d76

Unicode lints, second attempt: Lint whole strings, help with replacement This fixes #85

alexjago mentioned this issue Nov 6, 2022

non_ascii_literal should not be machine-applicable to raw strings #9805

Open
