
Missing validators #223

Closed
4 of 10 tasks
aochagavia opened this issue Nov 9, 2018 · 41 comments
Labels
E-easy, E-has-instructions, good first issue

Comments

@aochagavia
Contributor

aochagavia commented Nov 9, 2018

Literals:

  • String
  • Char
  • Byte
  • ByteString
  • RawString
  • RawByteString
  • Integers
  • Floats

Comments:

  • /* comments have a closing */
  • Doc comments are attached to nodes
@matklad
Member

matklad commented Nov 21, 2018

Path validators: I am not sure if we check this, but stuff like super::super::foo or use foo::{crate::bar}; should be forbidden.
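The rule matklad describes could be sketched roughly like this. This is an illustrative toy only (the names and the `&[&str]` representation are not rust-analyzer's actual API, which operates on syntax nodes in `crates/ra_syntax/src/validation.rs`); it flags `crate` in any position other than the first path segment, and does not attempt to cover the `super` case:

```rust
// Toy validator: return the index of a misplaced `crate` segment, if any.
// Segments are plain strings here; the real code walks syntax nodes.
fn crate_keyword_misplaced(segments: &[&str]) -> Option<usize> {
    segments
        .iter()
        .enumerate()
        .find(|&(i, seg)| i != 0 && *seg == "crate")
        .map(|(i, _)| i)
}
```

For example, `crate_keyword_misplaced(&["foo", "crate", "bar"])` reports index 1, while `["crate", "foo"]` is accepted.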

@kiljacken added the E-easy, E-has-instructions, and good first issue labels on Jan 10, 2020
@kiljacken
Contributor

These are still up for grabs if anybody wants to get their hands dirty and contribute something small but useful!

The relevant code is currently in crates/ra_syntax/src/validation.rs. The existing validators should serve as decent examples, but please don't hesitate to ask how to go about it, if you're interested in giving it a go!
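To give a feel for what one checklist item involves, here is a hedged sketch of the "/* comments have a closing */" check. It is a standalone toy, not the shape of the validators in `crates/ra_syntax/src/validation.rs` (those take syntax nodes and a sink for errors), and it deliberately ignores nested block comments:

```rust
// Toy check: a block comment is terminated if it is at least "/**/" long
// and ends with "*/". Nesting (/* /* */ */) is ignored in this sketch.
fn block_comment_is_terminated(text: &str) -> bool {
    text.len() >= 4 && text.starts_with("/*") && text.ends_with("*/")
}
```

The length check matters: without it, the three-character string `/*/` would wrongly count as terminated because its `*` is shared between the opener and the closer.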

bors bot added a commit that referenced this issue Jan 14, 2020
2834: refactor(ra_syntax.validation): removed code duplication from validate_literal() r=kiljacken a=Veetaha

Hi! This is my first ever contribution to this project.
I've taken some dirty job from issue #223

This is a simple atomic PR to remove code duplication according to FIXME comment in the function that is the main focus of the further development.

I just didn't want to mix refactoring with the implementation of new features...

I am not sure whether you prefer such atomic PRs here or you'd rather have a single PR that contains all atomic commits inside of it?

So if you want me to add all that validation in one PR I'll mark this one as WIP and update it when the work is finished, otherwise, I'll go with the option of creating separate PRs per each feature of validation of strings, numbers, and comments respectively.

### Comments about refactoring
Yeah, reducing the duplication is quite hard here, extracting into stateless functions could be another option but the number of their arguments would be very big and repeated across char and string implementations so that just writing their types and names would become cumbersome.
I tried the option of having everything captured implicitly in the closure but failed since rust doesn't have templated (or generic) closures as C++ does, this is needed because `unescape_byte*()` and `unescape_char|str()` have different return types...
Maybe I am missing something here? I may be wrong because I am not enough experienced in Rust...
Well, I am awaiting any kind of feedback!

Co-authored-by: Veetaha <gerzoh1@gmail.com>
@chuck-flowers

I'll take a crack at this over the next couple of days.

@Veetaha
Contributor

Veetaha commented Jan 19, 2020

@chuck-flowers, did you already start? I am also working on this, though not very fast ...

@chuck-flowers

I haven't done much yet

@Veetaha
Contributor

Veetaha commented Jan 19, 2020

@chuck-flowers I am not contributing very fast, but I suppose this issue is not the most important one, since it was opened more than a year ago, and I have already learned a decent amount of the codebase around ra_syntax, where the validation occurs.
Maybe you could tackle some other issue, to avoid conflicts with my changes? I also have some unfinished work to do on the previous refactoring...

@chuck-flowers

No problem. I can find something else to work on.

@Veetaha
Contributor

Veetaha commented Jan 19, 2020

Okay, thanks )

@Veetaha
Contributor

Veetaha commented Jan 19, 2020

@matklad
@kiljacken
I see we are dropping some useful information from rustc_lexer when tokenizing the input.
rustc_lexer::TokenKind contains information about whether a token is terminated, whether a block comment has a closing */, and other helpful details.

See its definition here (I wish I could embed code from foreign repos in github):
the link

```rust
pub enum LiteralKind {
    /// "12_u8", "0o100", "0b120i99"
    Int { base: Base, empty_int: bool },
    /// "12.34f32", "0b100.100"
    Float { base: Base, empty_exponent: bool },
    /// "'a'", "'\\'", "'''", "';"
    Char { terminated: bool },
    /// "b'a'", "b'\\'", "b'''", "b';"
    Byte { terminated: bool },
    /// ""abc"", ""abc"
    Str { terminated: bool },
    /// "b"abc"", "b"abc"
    ByteStr { terminated: bool },
    /// "r"abc"", "r#"abc"#", "r####"ab"###"c"####", "r#"a"
    RawStr { n_hashes: usize, started: bool, terminated: bool },
    /// "br"abc"", "br#"abc"#", "br####"ab"###"c"####", "br#"a"
    RawByteStr { n_hashes: usize, started: bool, terminated: bool },
}
```

But here we drop that knowledge:
https://github.com/rust-analyzer/rust-analyzer/blob/3a7724e44181ccd5c248589538bd82458b5a9407/crates/ra_syntax/src/parsing/lexer.rs#L17-L28

Well, saving this info in ra_parser::SyntaxKind is not possible since its representation must be fixed to u16. I guess we would need to recompute that anyway, wouldn't we?

If we do need to recompute it, I think we could reuse rustc_lexer::first_token(), which would return the same token that was previously dropped. Or do we want to lessen the coupling and implement that logic in our crate?
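Recomputing on demand could look roughly like this. The following is a self-contained toy (not rustc_lexer's actual code, and the function name is invented): it re-derives the equivalent of `Str { terminated }` from the token text alone, so nothing extra needs to fit into the u16-backed SyntaxKind:

```rust
// Toy sketch: re-derive "is this string literal terminated?" from the
// token text instead of storing it in SyntaxKind. Handles only plain
// "..." literals with simple backslash escapes; raw strings, byte
// strings, etc. are out of scope here.
fn str_literal_is_terminated(text: &str) -> bool {
    let mut chars = text.chars();
    if chars.next() != Some('"') {
        return false; // not a string literal at all
    }
    let mut escaped = false;
    for c in chars {
        match (escaped, c) {
            (true, _) => escaped = false,    // char after a backslash: consumed
            (false, '\\') => escaped = true, // start of an escape sequence
            (false, '"') => return true,     // unescaped closing quote
            _ => {}
        }
    }
    false // ran out of input before a closing quote
}
```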

@matklad
Member

matklad commented Jan 19, 2020 via email

@Veetaha
Contributor

Veetaha commented Jan 19, 2020

@matklad so do you think we can rework our tokenize(text: &str) -> Vec<Token>
into something like tokenize(text: &str) -> impl Iterator<Item = (Token, Option<SyntaxError>)>? I also wonder how we would go about validation apart from tokenizing (though, in fact, why would the user need that at all, if in order to get tokens they have to run the tokenizer, which already returns diagnostics?).

@Veetaha
Contributor

Veetaha commented Jan 19, 2020

Okay, let me re-review the whole ra_syntax crate and think about the most rational decision ...

@matklad
Member

matklad commented Jan 19, 2020

I'd imagine tokenize(text: &str) -> (Vec<Token>, Vec<SyntaxError>)
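A toy illustration of that proposed shape, with `Token` and `SyntaxError` as simplified stand-ins for the real ra_syntax types (the "lexer" here is a fake one-byte-per-token loop, purely to show errors being accumulated alongside tokens rather than interleaved with them):

```rust
// Simplified stand-ins for the real types.
#[derive(Debug, PartialEq)]
struct Token { len: usize }
#[derive(Debug, PartialEq)]
struct SyntaxError { msg: &'static str, offset: usize }

// Fake lexer: one byte per token, flagging NUL bytes as errors,
// just to demonstrate the (Vec<Token>, Vec<SyntaxError>) shape.
fn tokenize(text: &str) -> (Vec<Token>, Vec<SyntaxError>) {
    let mut tokens = Vec::new();
    let mut errors = Vec::new();
    for (offset, byte) in text.bytes().enumerate() {
        tokens.push(Token { len: 1 });
        if byte == b'\0' {
            errors.push(SyntaxError { msg: "unexpected NUL", offset });
        }
    }
    (tokens, errors)
}
```

The appeal of this shape is that the error vector is usually empty, so the happy path costs one empty `Vec` and no per-token `Option` handling for callers that only want tokens.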

@Veetaha
Contributor

Veetaha commented Jan 19, 2020

@matklad
Okay, I just want to clarify whether you do want to go with vectors and not iterators.
The reason I used impl Iterator in the first place was to let the caller decide whether they want all the tokens at once.
We already have such use cases:

https://github.com/rust-analyzer/rust-analyzer/blob/648241ee930de08ba70b0b5c2172dfb3cc7a34c6/crates/ra_ide/src/references/rename.rs#L20-L25

https://github.com/rust-analyzer/rust-analyzer/blob/648241ee930de08ba70b0b5c2172dfb3cc7a34c6/crates/ra_syntax/src/parsing/reparsing.rs#L56-L60

https://github.com/rust-analyzer/rust-analyzer/blob/648241ee930de08ba70b0b5c2172dfb3cc7a34c6/crates/ra_syntax/src/parsing/reparsing.rs#L67-L70

rustc_lexer also exposes iterator API

But another solution would be to add an explicit fn first_token(text: &str) -> Option<Token> function; rustc_lexer does that as well.

[UPD]: I tried doing this with impl Iterator and the code doesn't get convoluted

@bjorn3
Member

bjorn3 commented Jan 19, 2020

A better name would be single_token as they require exactly one token.

@Veetaha
Contributor

Veetaha commented Jan 19, 2020

@bjorn3 well, to me, first already means that it will return exactly one token.

@bjorn3
Member

bjorn3 commented Jan 19, 2020

In the mentioned examples there must not be two or more tokens, while first_token implies that more tokens may follow.
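The naming distinction under discussion can be made concrete with a toy one-char-per-token lexer (these functions are illustrative; the real signatures ended up different, as the PR descriptions below show): `first_token` lexes the leading token and may leave input behind, while `single_token` additionally demands that the token spans all of the input.

```rust
// Toy lexer: every char is one token, reported with its byte length.
fn first_token(text: &str) -> Option<(char, usize)> {
    text.chars().next().map(|c| (c, c.len_utf8()))
}

// Succeeds only when the whole input is exactly one token.
fn single_token(text: &str) -> Option<char> {
    first_token(text).and_then(|(tok, len)| {
        if len == text.len() { Some(tok) } else { None }
    })
}
```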

@Veetaha
Contributor

Veetaha commented Jan 20, 2020

Well, to conclude the previous ideas: I am not ready to choose which of the proposed design approaches would be best right now.

From my perspective, I need to review the ra_syntax and ra_parser crates thoroughly first.

I don't want to jump straight to conclusions; I try to be rational about this, so I'll get there once I have enough context.
This shouldn't be too complicated, since the codebase is pretty neat and I already have more than a high-level view of these two crates.
I may ask more questions here if you don't mind (just for the record, in case of any unforeseen circumstances).

@matklad
Member

matklad commented Jan 20, 2020

@Veetaha I think both iterators and vectors might work. I don't think we necessarily need iterators (as we either dump everything into a Vec anyway, or tokenize a really short string which should get tokenized into a single token). In these cases, in Rust, I usually just default to Vecs, as they are easier (they don't carry lifetimes with them and don't require opaque types).

It might be a good idea to move those two "odd" uses of tokenize inside the lexer module, so that we don't actually depend on the specifics of the interface.

@Veetaha
Contributor

Veetaha commented Jan 27, 2020

Some of the required validations were implemented in PR #2911. Note that they are performed at tokenization time and were named TokenizeErrors.

Other errors, like "Doc comments are attached to nodes" or "Symbol path must not contain the crate keyword in the middle", are soon to be implemented in a new PR. I'd like to set the latter errors apart with a dedicated enum like SemanticError, following the pattern in the SyntaxErrorKind definition (though one could argue that SemanticError is not a subtype of SyntaxError, in which case I'd ask them to propose a better name...). I'll try to come up with a better name if you guys wish.

Here is the updated todo list:

  • String is terminated
  • Char is terminated
  • Byte is terminated
  • ByteString is terminated
  • RawString is properly started and terminated
  • RawByteString is properly started and terminated
  • Integers have digits after the base specifier
  • Floats have digits after the exponent specifier
  • /* comments have a closing */
  • Lifetime names don't start with a number
  • Create tests for errors returned during tokenization
  • Report all unescape errors
  • Doc comments are attached to nodes
  • Symbol paths do not contain crate keyword in the middle
  • Come up with the name for post-parsing validation errors (unnecessary after future SyntaxError redesign)
  • Move validation functions that are used for debugging or tests into a separate file

[UPD according to PR review]

  • Use tuples (Token, Option<SyntaxError>) and (Vec<Token>, Vec<SyntaxError>) instead of ParsedToken[s]
  • Remove Location and change SyntaxError from being an enum to { msg: String, loc: TextRange } shape.
  • Remove some suggestion comments (like regarding enum_dispatch)
  • Use #[rustfmt::skip] when better formatting is obvious
  • Use early-return instead of if {} else {}
  • Use match ... { ... => return (val, Some(err)), ... => val } instead of local ok(), ok_if(), err() functions

bors bot added a commit that referenced this issue Feb 3, 2020
2911: Implement collecting errors while tokenizing r=matklad a=Veetaha

Now we are collecting errors from `rustc_lexer` and returning them in `ParsedToken { token, error }` and `ParsedTokens { tokens, errors }` structures **([UPD]: this is now simplified, see updates below)**.

The main changes are introduced in `ra_syntax/parsing/lexer.rs`. It now exposes the following functions and types:

```rust
pub fn tokenize(text: &str) -> ParsedTokens;
pub fn tokenize_append(text: &str, parsed_tokens_to_append_to: &mut ParsedTokens);
pub fn first_token(text: &str) -> Option<ParsedToken>; // allows any number of tokens in text
pub fn single_token(text: &str) -> Option<ParsedToken>; // allows only a single token in text

pub struct ParsedToken  { pub token: Token,       pub error: Option<SyntaxError> }
pub struct ParsedTokens { pub tokens: Vec<Token>, pub errors: Vec<SyntaxError>   }

pub enum TokenizeError { /* Simple enum which reflects rustc_lexer tokenization errors */ }
```
In the first commit I implemented it with iterators, but then decided that, since this crate is ad hoc for `rust-analyzer` and we clearly see the places where it is used, it would be better to simplify it to vectors.

This is currently WIP, because I want to add tests for the error messages generated by the lexer.
I'd like to hear your thoughts on how to define these tests in the `ra_syntax/test-data` dir.

Related issues: #223 

**[UPD]**

After the PR review the API was simplified:
```rust
pub fn tokenize(text: &str) -> (Vec<Token>, Vec<SyntaxError>);
// Both lex functions do not check for unescape errors
pub fn lex_single_syntax_kind(text: &str) -> Option<(SyntaxKind, Option<SyntaxError>)>;
pub fn lex_single_valid_syntax_kind(text: &str) -> Option<SyntaxKind>;

// This will be removed in the next PR in favour of simplifying `SyntaxError` to `(String, TextRange)`
pub enum TokenizeError { /* Simple enum which reflects rustc_lexer tokenization errors */ }

// this is private, but may be made public if such demand arises in the future (least-privilege principle)
fn lex_first_token(text: &str) -> Option<(Token, Option<SyntaxError>)>;
```

Co-authored-by: Veetaha <gerzoh1@gmail.com>
bors bot added a commit that referenced this issue Feb 18, 2020
3026: ra_syntax: reshape SyntaxError for the sake of removing redundancy r=matklad a=Veetaha

Followup of #2911, also puts some crosses to the todo list of #223.

**ACHTUNG!** A big part of the diff of this PR is test-data file changes.

Simplified `SyntaxError`, which was `SyntaxError { kind: { /* big enum */ }, location: Location }`, to `SyntaxError(String, TextRange)`. I am not sure whether the tuple struct is the best fit here; I am inclined to add names to the fields, because I already provide the getters `SyntaxError::message()` and `SyntaxError::range()`.
I also removed `Location` altogether ...

This is currently WIP, because the following is not done:
- [ ] ~~Add tests to `test_data` dir for unescape errors *// I don't know where to put these errors in particular, because they are out of the scope of the lexer and parser. However, I have an idea in mind that we move all the validators we have right now to the parsing stage, but this is up for discussion...*~~ **[UPD]** I came to the conclusion that tree validation logic, which unescape errors are a part of, should be rethought; we currently have no tests and no place to put tests for tree validations. So I'd like to extract the potential redesign (maybe moving tree validation to ra_parser) and the addition of tests into a separate task.

Co-authored-by: Veetaha <gerzoh1@gmail.com>
Co-authored-by: Veetaha <veetaha2@gmail.com>
@djrenren
Contributor

I took a shot at validating the use of crate in paths (#4178), but it turns out it gets pretty subtle in the case of use items, because crate is allowed at the semantic root of a path, not the syntactic root. Example:

use {crate as bar, {{crate as cat, foo}}};

This is pretty obviously solvable with a traversal of the tree, but I'm not sure what performance implications I should be worried about, if any. Any guidance would be super helpful.
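The traversal idea can be sketched with a toy model of use trees (the types here are illustrative, not rust-analyzer's): `crate` is legal only at the *semantic* root, i.e. when no path segment precedes it in any enclosing tree, no matter how many `{}` groups wrap it.

```rust
// Toy use-tree model: a group is a (possibly empty) path prefix plus
// nested trees, e.g. `foo::{bar, baz}` is Group(["foo"], [Path..]).
enum UseTree {
    Path(Vec<&'static str>),
    Group(Vec<&'static str>, Vec<UseTree>),
}

// `segs_before` counts path segments contributed by enclosing trees;
// `crate` is only allowed when its absolute segment index is 0.
fn crate_positions_ok(tree: &UseTree, segs_before: usize) -> bool {
    let segs_ok = |segs: &[&str]| {
        segs.iter()
            .enumerate()
            .all(|(i, s)| *s != "crate" || segs_before + i == 0)
    };
    match tree {
        UseTree::Path(segs) => segs_ok(segs),
        UseTree::Group(prefix, children) => {
            segs_ok(prefix)
                && children
                    .iter()
                    .all(|c| crate_positions_ok(c, segs_before + prefix.len()))
        }
    }
}
```

This captures the subtlety: braces contribute no segments, so `use {crate as bar}` is fine, while `use foo::{crate::bar}` is not, because `crate` sits after `foo` semantically.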

@Veetaha
Contributor

Veetaha commented Apr 28, 2020

Well, my initial thought was to not have a separate validation step, and instead collect errors like crate in an invalid position during parsing itself, where you have as much context available as you need, and even more. I am still not sure we cannot follow this idea, but I guess @matklad has a strong opinion against it. Anyway, this does seem like more of a semantic error; even the Rust reference doesn't prohibit the crate keyword in arbitrary positions at the grammar level.

From my naive perspective, the ultimate solution would be to desugar patterns like use {{{}}, {}}, where a path segment starts with a curly brace right away, by removing the curly braces completely and unwrapping this into multiple use statements. However, the desugaring machinery in rust-analyzer is not obvious, at least to me nowadays. Our syntax trees preserve the source code form without any transformations (I may be mistaken; maybe @matklad can clarify), so this would require additional groundwork for desugaring plumbing...

@djrenren
Contributor

Yeah, I also thought about moving this down to the parser, but I think it's important to move as many errors as possible out of the parser and into a separate pass/flow, so that in the long run we can give higher-quality error messages.

Also I found the desugaring / lowering pass: https://github.com/rust-analyzer/rust-analyzer/blob/master/crates/ra_hir_def/src/path/lower.rs

and reading through it I feel like an idiot. There's a PathSegmentKind that will tell me whether a segment is the crate keyword, so I'll rewrite in terms of that. For now, though, I think there's still value in the conservative validation pass, as long as we track that it doesn't error on UseTrees.

@matklad
Member

matklad commented Apr 29, 2020 via email

@jacg
Contributor

jacg commented Jun 9, 2020

As there seems to have been no activity here for over a month, would it make sense for me to try to make some progress on this? @Veetaha, are you planning to do more work here?

@Veetaha
Contributor

Veetaha commented Jun 9, 2020

@jacg yup, feel free to go :)
(no activity planned by me so far)

@jacg
Contributor

jacg commented Jun 10, 2020

Looking at validation.rs I'm really missing some tests to help explain what is going on, what cases have been taken into consideration etc.

Would there be any objection to rewriting the validate_* functions so that they return Vec<SyntaxError> (or maybe just a single Option<SyntaxError>; it's not entirely clear whether multiple errors can accumulate, or whether at most one is generated), which would get combined in the main validate function, and then adding a bunch of unit tests for the validate_*s?

@matklad
Member

matklad commented Jun 10, 2020

Would there be any objection to rewriting the validate_* functions so that they return Vec

The functions accept a sink argument (&mut Vec) by design, so as not to do N allocations. Returning an Option<SyntaxError> should be OK, though.
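The sink pattern being referred to can be sketched like this (validators and the error type are simplified stand-ins, not the actual functions in validation.rs): every validator pushes into one shared `Vec<SyntaxError>` instead of allocating and returning its own collection.

```rust
// Simplified stand-in for the real error type.
#[derive(Debug, PartialEq)]
struct SyntaxError(String);

// Each validator takes the shared sink and pushes zero or more errors.
fn validate_string(text: &str, acc: &mut Vec<SyntaxError>) {
    if text.starts_with('"') && (text.len() < 2 || !text.ends_with('"')) {
        acc.push(SyntaxError("unterminated string".into()));
    }
}

fn validate_block_comment(text: &str, acc: &mut Vec<SyntaxError>) {
    if text.starts_with("/*") && (text.len() < 4 || !text.ends_with("*/")) {
        acc.push(SyntaxError("unterminated block comment".into()));
    }
}

// The driver owns the single allocation and fans out to all validators.
fn validate(tokens: &[&str]) -> Vec<SyntaxError> {
    let mut acc = Vec::new();
    for t in tokens {
        validate_string(t, &mut acc);
        validate_block_comment(t, &mut acc);
    }
    acc
}
```

For unit testing, a test can still pass a fresh empty `Vec` and assert on its contents afterwards, which is roughly the approach discussed below.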

@Veetaha
Contributor

Veetaha commented Jun 10, 2020

Actually, Option<SyntaxError> is a special case of Iterator<Item = SyntaxError> (one that yields at most one element), but I suppose using impl Iterator<Item = T> in validation would be hard, so &mut Vec doesn't seem like a bad idea.

@jacg
Contributor

jacg commented Jun 10, 2020

@matklad So I can rely on each validate_* never being required to produce more than one SyntaxError per invocation?

@Veetaha

I suppose using impl Iterator<Item = T> in validation will be hard, so &mut Vec doesn't seem like a bad idea

I don't follow. What's wrong with the validate_*s returning Option<SyntaxError>, which can easily

  • be compared to the expected result in unit tests
  • have its contents, if any, extracted and pushed to errors inside validate

?

@matklad
Member

matklad commented Jun 10, 2020

So I can rely on each validate_* never being required to produce more than one SyntaxError per invocation?

I haven't looked at the code in a while, but I think producing more than one error is OK. Generally, a sink argument is the pattern that usually works best for producing a bunch of errors/warnings.

@jacg
Contributor

jacg commented Jun 10, 2020

So, in order to write unit tests for these, do you recommend just leaving the sink argument, and having the tests provide an empty one? It would make expressing the tests a bit more of a pain, but I guess that could actually be hidden in a little macro.

@matklad
Member

matklad commented Jun 10, 2020

I think those are best tested via data-driven tests, so that the tests are completely oblivious about the actual interface:

https://github.com/rust-analyzer/rust-analyzer/blob/master/crates/ra_syntax/test_data/parser/inline/err/0010_bad_tuple_index_expr.rast#L51-L53

@Veetaha
Contributor

Veetaha commented Jun 10, 2020

Yup, just adding a bunch of input .rs files somewhere under ra_syntax/test_data/parser/err should do the trick. Once you run cargo test, the .rast file will be autogenerated, and then you should check that it has the errors you expected at the end of the syntax tree.

@jacg
Contributor

jacg commented Jun 10, 2020

Oh, that's excellent, I'm so glad this exists!

@jacg
Contributor

jacg commented Jun 11, 2020

IIUC, the purpose of the validator is to generate syntax errors in the context of the parser being tolerant to invalid input: The parser accepts incorrect syntax, and the validator, in a separate traversal of the tree, should generate errors in such cases.

More specifically, in the context of this issue, let's consider RawString, which is the next unchecked SyntaxKind in the TODO list in the top comment.

I would expect validate_literal to occasionally receive tokens with SyntaxKind::RAW_STRING which do not correctly represent raw strings. The work to be done here would be to recognize such situations and generate an appropriate SyntaxError. However, I'm failing to find such a case. For example, if I type let x = r#"abc"; in my editor, I am immediately presented with

Syntax Error: Missing trailing `"` with `#` symbols to terminate the raw string literal

If I save this sample somewhere in test_data/parser/err, tests fail with the same message.

So it is unclear to me what needs to be done to complete raw string validation.

@lnicola
Member

lnicola commented Jun 11, 2020

I think every raw string (and raw byte string) that parses correctly (has the right number of hashes and quotes) is valid.

An example of invalid string would be "\c" or "\xgg".
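The kind of unescape validation being described can be sketched as follows. This is a hedged, self-contained toy, not rustc_lexer's actual unescape module (which is what rust-analyzer really delegates to), and it deliberately covers only a handful of escape forms:

```rust
// Toy escape checker for a string-literal body: rejects unknown escapes
// like \c and malformed hex escapes like \xgg. Unicode escapes (\u{...})
// and many other real rules are omitted in this sketch.
fn check_escapes(body: &str) -> Result<(), String> {
    let mut chars = body.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            continue;
        }
        match chars.next() {
            Some('n' | 't' | 'r' | '\\' | '\'' | '"' | '0') => {}
            Some('x') => {
                // \x must be followed by exactly two hex digits
                let hex_ok = |d: Option<char>| d.map_or(false, |d| d.is_ascii_hexdigit());
                if !(hex_ok(chars.next()) && hex_ok(chars.next())) {
                    return Err("invalid \\x escape".into());
                }
            }
            Some(other) => return Err(format!("unknown escape: \\{}", other)),
            None => return Err("lone backslash at end of literal".into()),
        }
    }
    Ok(())
}
```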

@jacg
Contributor

jacg commented Jun 11, 2020

Does that mean that there is nothing left to be done for raw strings?

@lnicola
Member

lnicola commented Jun 11, 2020

I might be wrong, but I think there's nothing to be done for raw strings and raw binary strings.

@Veetaha
Contributor

Veetaha commented Jun 11, 2020

@jacg

Does that mean that there is nothing left to be done for raw strings?

I am 90% sure that yes. Raw string errors are collected from rustc_lexer during the tokenization step: https://github.com/rust-analyzer/rust-analyzer/blob/bd61ad756cc0a7bfeaa5dae81ac5ab50a2e71697/crates/ra_syntax/src/parsing/lexer.rs#L218-L239.
I recommend looking at this comment with the remaining list of errors that are not yet reported, this is more up-to-date.

@matklad
Member

matklad commented Jul 15, 2020

I think we implemented the bulk of the stuff, so I'm closing this issue; the rest should be handled in specific issues.

@matklad matklad closed this as completed Jul 15, 2020