
Missing validators #223

Closed
4 of 10 tasks
aochagavia opened this issue Nov 9, 2018 · 41 comments
Labels
E-easy, E-has-instructions, good first issue

Comments

@aochagavia
Contributor

aochagavia commented Nov 9, 2018

Literals:

  • String
  • Char
  • Byte
  • ByteString
  • RawString
  • RawByteString
  • Integers
  • Floats

Comments:

  • /* comments have a closing */
  • Doc comments are attached to nodes
@matklad
Member

matklad commented Nov 21, 2018

Path validators: I am not sure if we check this, but stuff like super::super::foo or use foo::{crate::bar}; should be forbidden.
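The rule matklad describes could be sketched roughly like this. This is an illustrative toy only (the names and the `&[&str]` representation are not rust-analyzer's actual API, which operates on syntax nodes in `crates/ra_syntax/src/validation.rs`); it flags `crate` in any position other than the first path segment, and does not attempt to cover the `super` case:

```rust
// Toy validator: return the index of a misplaced `crate` segment, if any.
// Segments are plain strings here; the real code walks syntax nodes.
fn crate_keyword_misplaced(segments: &[&str]) -> Option<usize> {
    segments
        .iter()
        .enumerate()
        .find(|&(i, seg)| i != 0 && *seg == "crate")
        .map(|(i, _)| i)
}
```

For example, `crate_keyword_misplaced(&["foo", "crate", "bar"])` reports index 1, while `["crate", "foo"]` is accepted.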

@kiljacken added the E-easy, E-has-instructions, and good first issue labels on Jan 10, 2020
@kiljacken
Contributor

These are still up for grabs if anybody wants to get their hands dirty and contribute something small but useful!

The relevant code is currently in crates/ra_syntax/src/validation.rs. The existing validators should serve as decent examples, but please don't hesitate to ask how to go about it, if you're interested in giving it a go!
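To give a feel for what one checklist item involves, here is a hedged sketch of the "/* comments have a closing */" check. It is a standalone toy, not the shape of the validators in `crates/ra_syntax/src/validation.rs` (those take syntax nodes and a sink for errors), and it deliberately ignores nested block comments:

```rust
// Toy check: a block comment is terminated if it is at least "/**/" long
// and ends with "*/". Nesting (/* /* */ */) is ignored in this sketch.
fn block_comment_is_terminated(text: &str) -> bool {
    text.len() >= 4 && text.starts_with("/*") && text.ends_with("*/")
}
```

The length check matters: without it, the three-character string `/*/` would wrongly count as terminated because its `*` is shared between the opener and the closer.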

bors bot added a commit that referenced this issue Jan 14, 2020
2834: refactor(ra_syntax.validation): removed code duplication from validate_literal() r=kiljacken a=Veetaha

Hi! This is my first ever contribution to this project.
I've taken some dirty job from issue #223

This is a simple atomic PR to remove code duplication according to FIXME comment in the function that is the main focus of the further development.

I just didn't want to mix refactoring with the implementation of new features...

I am not sure whether you prefer such atomic PRs here or you'd rather have a single PR that contains all atomic commits inside of it?

So if you want me to add all that validation in one PR I'll mark this one as WIP and update it when the work is finished, otherwise, I'll go with the option of creating separate PRs per each feature of validation of strings, numbers, and comments respectively.

### Comments about refactoring
Yeah, reducing the duplication is quite hard here, extracting into stateless functions could be another option but the number of their arguments would be very big and repeated across char and string implementations so that just writing their types and names would become cumbersome.
I tried the option of having everything captured implicitly in the closure but failed since rust doesn't have templated (or generic) closures as C++ does, this is needed because `unescape_byte*()` and `unescape_char|str()` have different return types...
Maybe I am missing something here? I may be wrong because I am not enough experienced in Rust...
Well, I am awaiting any kind of feedback!

Co-authored-by: Veetaha <gerzoh1@gmail.com>
@chuck-flowers

I'll take a crack at this over the next couple of days.

@Veetaha
Contributor

Veetaha commented Jan 19, 2020

@chuck-flowers, did you already start? I am also working on this, though not very fast ...

@chuck-flowers

I haven't done much yet

@Veetaha
Contributor

Veetaha commented Jan 19, 2020

@chuck-flowers I am not contributing very fast, but I suppose this issue is not the most important one, since it was opened more than a year ago, and I have already learned a decent amount of the codebase around ra_syntax, where the validation occurs.
Maybe you could tackle some other issue, to avoid conflicts with my changes? I also have some unfinished work to do on the previous refactoring...

@chuck-flowers

No problem. I can find something else to work on.

@Veetaha
Contributor

Veetaha commented Jan 19, 2020

Okay, thanks )

@Veetaha
Contributor

Veetaha commented Jan 19, 2020

@matklad
@kiljacken
I see we are dropping some useful information from rustc_lexer when tokenizing the input.
rustc_lexer::TokenKind contains information about whether a token is terminated, whether a block comment has a closing */, and other helpful details.

See its definition here (I wish I could embed code from foreign repos in github):
the link

```rust
pub enum LiteralKind {
    /// "12_u8", "0o100", "0b120i99"
    Int { base: Base, empty_int: bool },
    /// "12.34f32", "0b100.100"
    Float { base: Base, empty_exponent: bool },
    /// "'a'", "'\\'", "'''", "';"
    Char { terminated: bool },
    /// "b'a'", "b'\\'", "b'''", "b';"
    Byte { terminated: bool },
    /// ""abc"", ""abc"
    Str { terminated: bool },
    /// "b"abc"", "b"abc"
    ByteStr { terminated: bool },
    /// "r"abc"", "r#"abc"#", "r####"ab"###"c"####", "r#"a"
    RawStr { n_hashes: usize, started: bool, terminated: bool },
    /// "br"abc"", "br#"abc"#", "br####"ab"###"c"####", "br#"a"
    RawByteStr { n_hashes: usize, started: bool, terminated: bool },
}
```

But here we drop that knowledge:
https://github.com/rust-analyzer/rust-analyzer/blob/3a7724e44181ccd5c248589538bd82458b5a9407/crates/ra_syntax/src/parsing/lexer.rs#L17-L28

Well, saving this info in ra_parser::SyntaxKind is not possible since its representation must be fixed to u16. I guess we would need to recompute that anyway, wouldn't we?

If we do need to recompute it, I think we could reuse rustc_lexer::first_token(), which would return the same token that was previously dropped. Or do we want to lessen the coupling and implement that logic in our crate?
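Recomputing on demand could look roughly like this. The following is a self-contained toy (not rustc_lexer's actual code, and the function name is invented): it re-derives the equivalent of `Str { terminated }` from the token text alone, so nothing extra needs to fit into the u16-backed SyntaxKind:

```rust
// Toy sketch: re-derive "is this string literal terminated?" from the
// token text instead of storing it in SyntaxKind. Handles only plain
// "..." literals with simple backslash escapes; raw strings, byte
// strings, etc. are out of scope here.
fn str_literal_is_terminated(text: &str) -> bool {
    let mut chars = text.chars();
    if chars.next() != Some('"') {
        return false; // not a string literal at all
    }
    let mut escaped = false;
    for c in chars {
        match (escaped, c) {
            (true, _) => escaped = false,    // char after a backslash: consumed
            (false, '\\') => escaped = true, // start of an escape sequence
            (false, '"') => return true,     // unescaped closing quote
            _ => {}
        }
    }
    false // ran out of input before a closing quote
}
```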

@matklad
Member

matklad commented Jan 19, 2020 via email

@Veetaha
Contributor

Veetaha commented Jan 19, 2020

@matklad so do you think we can rework our tokenize(text: &str) -> Vec<Token>
into something like tokenize(text: &str) -> impl Iterator<Item = (Token, Option<SyntaxError>)>? I also wonder how we would go about validation apart from tokenizing (though, in fact, why would the user need that at all, if in order to get tokens they have to run the tokenizer, which already returns diagnostics?).

@Veetaha
Contributor

Veetaha commented Jan 19, 2020

Okay, let me re-review the whole ra_syntax crate and think about the most rational decision ...

@matklad
Member

matklad commented Jan 19, 2020

I'd imagine tokenize(text: &str) -> (Vec<Token>, Vec<SyntaxError>)
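A toy illustration of that proposed shape, with `Token` and `SyntaxError` as simplified stand-ins for the real ra_syntax types (the "lexer" here is a fake one-byte-per-token loop, purely to show errors being accumulated alongside tokens rather than interleaved with them):

```rust
// Simplified stand-ins for the real types.
#[derive(Debug, PartialEq)]
struct Token { len: usize }
#[derive(Debug, PartialEq)]
struct SyntaxError { msg: &'static str, offset: usize }

// Fake lexer: one byte per token, flagging NUL bytes as errors,
// just to demonstrate the (Vec<Token>, Vec<SyntaxError>) shape.
fn tokenize(text: &str) -> (Vec<Token>, Vec<SyntaxError>) {
    let mut tokens = Vec::new();
    let mut errors = Vec::new();
    for (offset, byte) in text.bytes().enumerate() {
        tokens.push(Token { len: 1 });
        if byte == b'\0' {
            errors.push(SyntaxError { msg: "unexpected NUL", offset });
        }
    }
    (tokens, errors)
}
```

The appeal of this shape is that the error vector is usually empty, so the happy path costs one empty `Vec` and no per-token `Option` handling for callers that only want tokens.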

@Veetaha
Contributor

Veetaha commented Jan 19, 2020

@matklad
Okay, I just want to clarify whether you do want to go with vectors and not iterators.
The reason I used impl Iterator in the first place was to let the caller decide whether they want all the tokens at once.
We already have such use cases:

https://github.com/rust-analyzer/rust-analyzer/blob/648241ee930de08ba70b0b5c2172dfb3cc7a34c6/crates/ra_ide/src/references/rename.rs#L20-L25

https://github.com/rust-analyzer/rust-analyzer/blob/648241ee930de08ba70b0b5c2172dfb3cc7a34c6/crates/ra_syntax/src/parsing/reparsing.rs#L56-L60

https://github.com/rust-analyzer/rust-analyzer/blob/648241ee930de08ba70b0b5c2172dfb3cc7a34c6/crates/ra_syntax/src/parsing/reparsing.rs#L67-L70

rustc_lexer also exposes iterator API

But another solution would be to add an explicit fn first_token(text: &str) -> Option<Token> function; rustc_lexer does that as well.

[UPD]: I tried doing this with impl Iterator and the code doesn't get convoluted

@bjorn3
Member

bjorn3 commented Jan 19, 2020

A better name would be single_token as they require exactly one token.

@Veetaha
Contributor

Veetaha commented Jan 19, 2020

@bjorn3 well, to me, first already means that it will return exactly one token.

@bjorn3
Member

bjorn3 commented Jan 19, 2020

In the mentioned examples there must not be two or more tokens, while first_token implies that more tokens may follow.
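The naming distinction under discussion can be made concrete with a toy one-char-per-token lexer (these functions are illustrative; the real signatures ended up different, as the PR descriptions below show): `first_token` lexes the leading token and may leave input behind, while `single_token` additionally demands that the token spans all of the input.

```rust
// Toy lexer: every char is one token, reported with its byte length.
fn first_token(text: &str) -> Option<(char, usize)> {
    text.chars().next().map(|c| (c, c.len_utf8()))
}

// Succeeds only when the whole input is exactly one token.
fn single_token(text: &str) -> Option<char> {
    first_token(text).and_then(|(tok, len)| {
        if len == text.len() { Some(tok) } else { None }
    })
}
```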

@Veetaha
Contributor

Veetaha commented Jan 20, 2020

Well, to conclude the previous ideas: I am not ready to choose which of the proposed design approaches would be best right now.

From my perspective, I need to review the ra_syntax and ra_parser crates thoroughly first.

I don't want to jump straight to conclusions; I try to be rational about this, so I'll get there once I have enough context.
This shouldn't be too complicated, since the codebase is pretty neat and I already have more than a high-level view of these two crates.
I may ask more questions here if you don't mind (just for the record, in case of any unforeseen circumstances).

@matklad
Member

matklad commented Jan 20, 2020

@Veetaha I think both iterators and vectors might work. I don't think we necessarily need iterators (as we either dump everything into a Vec anyway, or tokenize a really short string which should get tokenized into a single token). In these cases, in Rust, I usually just default to Vecs, as they are easier (they don't carry lifetimes with them and don't require opaque types).

It might be a good idea to move those two "odd" uses of tokenize inside the lexer module, so that we don't actually depend on the specifics of the interface.

@Veetaha
Contributor

Veetaha commented Jan 27, 2020

Some of the required validations were implemented in PR #2911. Note that they are performed at tokenization time and were named TokenizeErrors.

Other errors, like "Doc comments are attached to nodes" or "Symbol path must not contain the crate keyword in the middle", are soon to be implemented in a new PR. I'd like to set the latter errors apart with a dedicated enum like SemanticError, following the pattern in the SyntaxErrorKind definition (though one could argue that SemanticError is not a subtype of SyntaxError, in which case I'd ask them to propose a better name...). I'll try to come up with a better name if you guys wish.

Here is the updated todo list:

  • String is terminated
  • Char is terminated
  • Byte is terminated
  • ByteString is terminated
  • RawString is properly started and terminated
  • RawByteString is properly started and terminated
  • Integers have digits after the base specifier
  • Floats have digits after the exponent specifier
  • /* comments have a closing */
  • Lifetime names don't start with a number
  • Create tests for errors returned during tokenization
  • Report all unescape errors
  • Doc comments are attached to nodes
  • Symbol paths do not contain crate keyword in the middle
  • Come up with the name for post-parsing validation errors (unnecessary after future SyntaxError redesign)
  • Move validation functions that are used for debugging or tests into a separate file

[UPD according to PR review]

  • Use tuples (Token, Option<SyntaxError>) and (Vec<Token>, Vec<SyntaxError>) instead of ParsedToken[s]
  • Remove Location and change SyntaxError from being an enum to { msg: String, loc: TextRange } shape.
  • Remove some suggestion comments (like regarding enum_dispatch)
  • Use #[rustfmt::skip] when better formatting is obvious
  • Use early-return instead of if {} else {}
  • Use match ... { ... => return (val, Some(err)), ... => val } instead of local ok(), ok_if(), err() functions

bors bot added a commit that referenced this issue Feb 3, 2020
2911: Implement collecting errors while tokenizing r=matklad a=Veetaha

Now we are collecting errors from `rustc_lexer` and returning them in `ParsedToken { token, error }` and `ParsedTokens { tokens, errors }` structures **([UPD]: this is now simplified, see updates below)**.

The main changes are introduced in `ra_syntax/parsing/lexer.rs`. It now exposes the following functions and types:

```rust
pub fn tokenize(text: &str) -> ParsedTokens;
pub fn tokenize_append(text: &str, parsed_tokens_to_append_to: &mut ParsedTokens);
pub fn first_token(text: &str) -> Option<ParsedToken>; // allows any number of tokens in text
pub fn single_token(text: &str) -> Option<ParsedToken>; // allows only a single token in text

pub struct ParsedToken  { pub token: Token,       pub error: Option<SyntaxError> }
pub struct ParsedTokens { pub tokens: Vec<Token>, pub errors: Vec<SyntaxError>   }

pub enum TokenizeError { /* Simple enum which reflects rustc_lexer tokenization errors */ }
```
In the first commit I implemented it with iterators, but then decided that, since this crate is ad hoc for `rust-analyzer` and we clearly see the places where it is used, it would be better to simplify it to vectors.

This is currently WIP, because I want to add tests for the error messages generated by the lexer.
I'd like to hear your thoughts on how to define these tests in the `ra_syntax/test-data` dir.

Related issues: #223 

**[UPD]**

After the PR review the API was simplified:
```rust
pub fn tokenize(text: &str) -> (Vec<Token>, Vec<SyntaxError>);
// Both lex functions do not check for unescape errors
pub fn lex_single_syntax_kind(text: &str) -> Option<(SyntaxKind, Option<SyntaxError>)>;
pub fn lex_single_valid_syntax_kind(text: &str) -> Option<SyntaxKind>;

// This will be removed in the next PR in favour of simplifying `SyntaxError` to `(String, TextRange)`
pub enum TokenizeError { /* Simple enum which reflects rustc_lexer tokenization errors */ }

// this is private, but may be made public if such demand arises in the future (least-privilege principle)
fn lex_first_token(text: &str) -> Option<(Token, Option<SyntaxError>)>;
```

Co-authored-by: Veetaha <gerzoh1@gmail.com>
bors bot added a commit that referenced this issue Feb 18, 2020
3026: ra_syntax: reshape SyntaxError for the sake of removing redundancy r=matklad a=Veetaha

Followup of #2911, also puts some crosses to the todo list of #223.

**ACHTUNG!** A big part of the diff of this PR is test-data file changes.

Simplified `SyntaxError`, which was `SyntaxError { kind: { /* big enum */ }, location: Location }`, to `SyntaxError(String, TextRange)`. I am not sure whether the tuple struct is the best fit here; I am inclined to add names to the fields, because I already provide the getters `SyntaxError::message()` and `SyntaxError::range()`.
I also removed `Location` altogether ...

This is currently WIP, because the following is not done:
- [ ] ~~Add tests to `test_data` dir for unescape errors *// I don't know where to put these errors in particular, because they are out of the scope of the lexer and parser. However, I have an idea in mind that we move all the validators we have right now to the parsing stage, but this is up for discussion...*~~ **[UPD]** I came to the conclusion that tree validation logic, which unescape errors are a part of, should be rethought; we currently have no tests and no place to put tests for tree validations. So I'd like to extract the potential redesign (maybe moving tree validation to ra_parser) and the addition of tests into a separate task.

Co-authored-by: Veetaha <gerzoh1@gmail.com>
Co-authored-by: Veetaha <veetaha2@gmail.com>
@djrenren
Contributor

I took a shot at validating the use of crate in paths (#4178), but it turns out it gets pretty subtle in the case of use items, because crate is allowed at the semantic root of a path, not the syntactic root. Example:

use {crate as bar, {{crate as cat, foo}}};

This is pretty obviously solvable with a traversal of the tree, but I'm not sure what performance implications I should be worried about, if any. Any guidance would be super helpful.
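The traversal idea can be sketched with a toy model of use trees (the types here are illustrative, not rust-analyzer's): `crate` is legal only at the *semantic* root, i.e. when no path segment precedes it in any enclosing tree, no matter how many `{}` groups wrap it.

```rust
// Toy use-tree model: a group is a (possibly empty) path prefix plus
// nested trees, e.g. `foo::{bar, baz}` is Group(["foo"], [Path..]).
enum UseTree {
    Path(Vec<&'static str>),
    Group(Vec<&'static str>, Vec<UseTree>),
}

// `segs_before` counts path segments contributed by enclosing trees;
// `crate` is only allowed when its absolute segment index is 0.
fn crate_positions_ok(tree: &UseTree, segs_before: usize) -> bool {
    let segs_ok = |segs: &[&str]| {
        segs.iter()
            .enumerate()
            .all(|(i, s)| *s != "crate" || segs_before + i == 0)
    };
    match tree {
        UseTree::Path(segs) => segs_ok(segs),
        UseTree::Group(prefix, children) => {
            segs_ok(prefix)
                && children
                    .iter()
                    .all(|c| crate_positions_ok(c, segs_before + prefix.len()))
        }
    }
}
```

This captures the subtlety: braces contribute no segments, so `use {crate as bar}` is fine, while `use foo::{crate::bar}` is not, because `crate` sits after `foo` semantically.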

@Veetaha
Contributor

Veetaha commented Apr 28, 2020

Well, my initial thought was to not have a separate validation step, and instead collect errors like crate in an invalid position during parsing itself, where you have as much context available as you need, and even more. I am still not sure we cannot follow this idea, but I guess @matklad has a strong opinion against it. Anyway, this does seem like more of a semantic error; even the Rust reference doesn't prohibit the crate keyword in arbitrary positions at the grammar level.

From my naive perspective, the ultimate solution would be to desugar patterns like use {{{}}, {}}, where a path segment starts with a curly brace right away, by removing the curly braces completely and unwrapping this into multiple use statements. However, the desugaring machinery in rust-analyzer is not obvious, at least to me nowadays. Our syntax trees preserve the source code form without any transformations (I may be mistaken; maybe @matklad can clarify), so this would require additional groundwork for desugaring plumbing...

@djrenren
Contributor

Yeah, I also thought about moving this down to the parser, but I think it's important to move as many errors as possible out of the parser and into a separate pass/flow, so that in the long run we can give higher-quality error messages.

Also I found the desugaring / lowering pass: https://github.com/rust-analyzer/rust-analyzer/blob/master/crates/ra_hir_def/src/path/lower.rs

and reading through it I feel like an idiot. There's a PathSegmentKind that will tell me whether a segment is the crate keyword, so I'll rewrite in terms of that. For now, though, I think there's still value in the conservative validation pass, as long as we track that it doesn't error on UseTrees.

@matklad
Member

matklad commented Apr 29, 2020 via email

@jacg
Contributor

jacg commented Jun 9, 2020

As there seems to have been no activity here for over a month, would it make sense for me to try to make some progress on this? @Veetaha, are you planning to do more work here?

@Veetaha
Contributor

Veetaha commented Jun 9, 2020

@jacg yup, feel free to go :)
(no activity planned by me so far)

@jacg
Contributor

jacg commented Jun 10, 2020

Looking at validation.rs I'm really missing some tests to help explain what is going on, what cases have been taken into consideration etc.

Would there be any objection to rewriting the validate_* functions so that they return Vec<SyntaxError> (or maybe just a single Option<SyntaxError>; it's not entirely clear whether multiple errors can accumulate, or whether at most one is generated), which would get combined in the main validate function, and then adding a bunch of unit tests for the validate_*s?

@matklad
Member

matklad commented Jun 10, 2020

Would there be any objection to rewriting the validate_* functions so that they return Vec

The functions accept a sink argument (&mut Vec) by design, so as not to do N allocations. Returning an Option<SyntaxError> should be OK, though.
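The sink pattern being referred to can be sketched like this (validators and the error type are simplified stand-ins, not the actual functions in validation.rs): every validator pushes into one shared `Vec<SyntaxError>` instead of allocating and returning its own collection.

```rust
// Simplified stand-in for the real error type.
#[derive(Debug, PartialEq)]
struct SyntaxError(String);

// Each validator takes the shared sink and pushes zero or more errors.
fn validate_string(text: &str, acc: &mut Vec<SyntaxError>) {
    if text.starts_with('"') && (text.len() < 2 || !text.ends_with('"')) {
        acc.push(SyntaxError("unterminated string".into()));
    }
}

fn validate_block_comment(text: &str, acc: &mut Vec<SyntaxError>) {
    if text.starts_with("/*") && (text.len() < 4 || !text.ends_with("*/")) {
        acc.push(SyntaxError("unterminated block comment".into()));
    }
}

// The driver owns the single allocation and fans out to all validators.
fn validate(tokens: &[&str]) -> Vec<SyntaxError> {
    let mut acc = Vec::new();
    for t in tokens {
        validate_string(t, &mut acc);
        validate_block_comment(t, &mut acc);
    }
    acc
}
```

For unit testing, a test can still pass a fresh empty `Vec` and assert on its contents afterwards, which is roughly the approach discussed below.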

@Veetaha
Contributor

Veetaha commented Jun 10, 2020

Actually, Option<SyntaxError> is a special case of Iterator<Item = SyntaxError> (one that yields at most one element), but I suppose using impl Iterator<Item = T> in validation would be hard, so &mut Vec doesn't seem like a bad idea.

@jacg
Contributor

jacg commented Jun 10, 2020

@matklad So I can rely on each validate_* never being required to produce more than one SyntaxError per invocation?

@Veetaha

I suppose using impl Iterator<Item = T> in validation will be hard, so &mut Vec doesn't seem like a bad idea

I don't follow. What's wrong with the validate_*s returning Option<SyntaxError>, which can easily

  • be compared to the expected result in unit tests
  • have its contents, if any, extracted and pushed to errors inside validate

?

@matklad
Member

matklad commented Jun 10, 2020

So I can rely on each validate_* never being required to produce more than one SyntaxError per invocation?

I haven't looked at the code in a while, but I think producing more than one error is OK. Generally, a sink argument is the pattern that usually works best for producing a bunch of errors/warnings.

@jacg
Contributor

jacg commented Jun 10, 2020

So, in order to write unit tests for these, do you recommend just leaving the sink argument, and having the tests provide an empty one? It would make expressing the tests a bit more of a pain, but I guess that could actually be hidden in a little macro.

@matklad
Member

matklad commented Jun 10, 2020

I think those are best tested via data-driven tests, so that the tests are completely oblivious about the actual interface:

https://github.com/rust-analyzer/rust-analyzer/blob/master/crates/ra_syntax/test_data/parser/inline/err/0010_bad_tuple_index_expr.rast#L51-L53

@Veetaha
Contributor

Veetaha commented Jun 10, 2020

Yup, just adding a bunch of input .rs files somewhere under ra_syntax/test_data/parser/err should do the trick. Once you run cargo test, the .rast file will be autogenerated, and then you should check that it has the errors you expected at the end of the syntax tree.

@jacg
Contributor

jacg commented Jun 10, 2020

Oh, that's excellent, I'm so glad this exists!

@jacg
Contributor

jacg commented Jun 11, 2020

IIUC, the purpose of the validator is to generate syntax errors in the context of the parser being tolerant to invalid input: The parser accepts incorrect syntax, and the validator, in a separate traversal of the tree, should generate errors in such cases.

More specifically, in the context of this issue, let's consider RawString, which is the next unchecked SyntaxKind in the TODO list in the top comment.

I would expect validate_literal to occasionally receive tokens with SyntaxKind::RAW_STRING which do not correctly represent raw strings. The work to be done here would be to recognize such situations and generate an appropriate SyntaxError. However, I'm failing to find such a case. For example, if I type let x = r#"abc"; in my editor, I am immediately presented with

Syntax Error: Missing trailing `"` with `#` symbols to terminate the raw string literal

If I save this sample somewhere in test_data/parser/err, tests fail with the same message.

So it is unclear to me what needs to be done to complete raw string validation.

@lnicola
Member

lnicola commented Jun 11, 2020

I think every raw string (and raw byte string) that parses correctly (has the right number of hashes and quotes) is valid.

An example of invalid string would be "\c" or "\xgg".
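The kind of unescape validation being described can be sketched as follows. This is a hedged, self-contained toy, not rustc_lexer's actual unescape module (which is what rust-analyzer really delegates to), and it deliberately covers only a handful of escape forms:

```rust
// Toy escape checker for a string-literal body: rejects unknown escapes
// like \c and malformed hex escapes like \xgg. Unicode escapes (\u{...})
// and many other real rules are omitted in this sketch.
fn check_escapes(body: &str) -> Result<(), String> {
    let mut chars = body.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            continue;
        }
        match chars.next() {
            Some('n' | 't' | 'r' | '\\' | '\'' | '"' | '0') => {}
            Some('x') => {
                // \x must be followed by exactly two hex digits
                let hex_ok = |d: Option<char>| d.map_or(false, |d| d.is_ascii_hexdigit());
                if !(hex_ok(chars.next()) && hex_ok(chars.next())) {
                    return Err("invalid \\x escape".into());
                }
            }
            Some(other) => return Err(format!("unknown escape: \\{}", other)),
            None => return Err("lone backslash at end of literal".into()),
        }
    }
    Ok(())
}
```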

@jacg
Contributor

jacg commented Jun 11, 2020

Does that mean that there is nothing left to be done for raw strings?

@lnicola
Member

lnicola commented Jun 11, 2020

I might be wrong, but I think there's nothing to be done for raw strings and raw binary strings.

@Veetaha
Contributor

Veetaha commented Jun 11, 2020

@jacg

Does that mean that there is nothing left to be done for raw strings?

I am 90% sure that yes. Raw string errors are collected from rustc_lexer during the tokenization step: https://github.com/rust-analyzer/rust-analyzer/blob/bd61ad756cc0a7bfeaa5dae81ac5ab50a2e71697/crates/ra_syntax/src/parsing/lexer.rs#L218-L239.
I recommend looking at this comment with the remaining list of errors that are not yet reported, this is more up-to-date.

@matklad
Member

matklad commented Jul 15, 2020

I think we implemented the bulk of the stuff, so I'm closing this issue; the rest should be handled in specific issues.

@matklad matklad closed this as completed Jul 15, 2020