Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Various unicode warnings [zero-width-space, unicode_canon, ascii_only] #85

Closed
Havvy opened this issue Jun 2, 2015 · 15 comments · Fixed by #299
Closed

Various unicode warnings [zero-width-space, unicode_canon, ascii_only] #85

Havvy opened this issue Jun 2, 2015 · 15 comments · Fixed by #299
Labels
A-lint Area: New lints good-first-issue These issues are a good way to get started with Clippy T-AST Type: Requires working with the AST

Comments

@Havvy
Copy link

Havvy commented Jun 2, 2015

This program should warn that zero-width spaces are being used directly and should be replaced with their unicode escape code (\u{200B}).

See Unicode spaces for more information on them.

@Manishearth
Copy link
Member

In literals, or in general?

Catching these in general seems a bit hard to do for clippy.

@Havvy
Copy link
Author

Havvy commented Jun 2, 2015

I just tested, and they're not allowed in identifier names, so that only leaves literals, right?

@Manishearth
Copy link
Member

Oh, I thought they may count as regular spaces as far as separating tokens goes.

That sounds good.

@llogiq
Copy link
Contributor

llogiq commented Jun 2, 2015

Could this be generalized to also check for non-canonical Unicode representations? Or higher ranged Unicode glyphs in general (perhaps Allow by default)?

@Manishearth
Copy link
Member

I was thinking the same thing, really. Though UTF8 files mean that it's okay to do this anyway.

@Havvy
Copy link
Author

Havvy commented Jun 2, 2015

I'd rather zero-width spaces be a specific lint, since it'd be useful to say "Hey, there's invisible characters here." An ascii_only lint might also be useful, but it shouldn't subsume this.

@llogiq
Copy link
Contributor

llogiq commented Jun 2, 2015

@Havvy Full ack. Since the default behaviour should be different anyway, it makes sense to implement multiple lints for the different cases. This also enables users to set ascii_only to Warn, but zero_width_space to Deny (which it should be in my opinion, because frankly, that's just evil). Also unicode_canon could Warn or Allow by default – I don't have a strong opinion there.

So:

  • zero_width_space (default: Deny) for \u{200B}
  • unicode_canon (default: Warn – or Allow?) for combining unicode glyphs, e.g. a\u{300} instead of \u{e0} (the unicode::decompose_canonical(c: char, f: FnMut(char)) should be helpful with that)
  • ascii_only (default: Allow) for anything above \u{7F}.This would include the above lints but not generate warnings if the respective lint is also enabled.

If multiple lints would generate warnings for a piece of code, the lint with the higher level wins, then the more specific.

@Havvy Havvy changed the title Warn when using zero-width spaces directly. Various unicode warnings [zero-width-space, unicode_canon, ascii_only] Jun 2, 2015
@llogiq
Copy link
Contributor

llogiq commented Jun 11, 2015

I just thought about another possible unicode warning: Detect right-to-left strings (or at least mixed left-to-right/right-to-left strings). We should check that on the whole crate, including comments. Note that this trick has been used e.g. in shell scripts to hide malicious code, so the check probably even belongs in the compiler.

I'll see if I can whip up a test case.

@Manishearth Manishearth added good-first-issue These issues are a good way to get started with Clippy T-AST Type: Requires working with the AST A-lint Area: New lints labels Aug 11, 2015
birkenfeld added a commit to birkenfeld/rust-clippy that referenced this issue Aug 12, 2015
@birkenfeld
Copy link
Contributor

I added the non-ascii lint in the linked PR.
As for the last point, I don't know if clippy should have an opinion if NFC or NFD is the preferred normal form.

BTW, in the currently inactive test case the combining accent is at the wrong position, and applies to the opening quote :)

@birkenfeld
Copy link
Contributor

What might be interesting is to recognize "problematic" characters that look basically the same as ASCII characters, e.g. some Greek and Cyrillic ones. (These are often used to impersonate legitimate file names and formerly DNS names.) But I don't have a list of those handy.

@Manishearth
Copy link
Member

We could use the same criteria punycode uses.

@Manishearth
Copy link
Member

But I'm not sure if we really need to do that.

@llogiq
Copy link
Contributor

llogiq commented Aug 14, 2015

@birkenfeld yeah I got that one wrong, fixed it. Anyway, the idea of unicode_canon is to check for NFC normal form.

@birkenfeld
Copy link
Contributor

Why is NFC to be preferred?

@llogiq
Copy link
Contributor

llogiq commented Aug 14, 2015

Because NFD would standardize on a\u{300} instead of \u{e0}, and the compatibility normalizations (NFKD/NFKC) change the meaning (e.g. becomes x2).

llogiq added a commit that referenced this issue Sep 4, 2015
Unicode lints, second attempt: Lint whole strings, help with replacement

This fixes #85
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-lint Area: New lints good-first-issue These issues are a good way to get started with Clippy T-AST Type: Requires working with the AST
Projects
None yet
4 participants