Missing validators #223
Path validators: I am not sure if we check this, but stuff like …
These are still up for grabs if anybody wants to get their hands dirty and contribute something small but useful! The relevant code is currently in `crates/ra_syntax/src/validation.rs`. The existing validators should serve as decent examples, but please don't hesitate to ask how to go about it if you're interested in giving it a go!
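As a rough idea of what such a validator looks like, here is a minimal sketch of the sink-style shape the thread discusses later. All names and the error message are hypothetical stand-ins; the real validators live in `crates/ra_syntax/src/validation.rs`.

```rust
// A minimal sketch of the "sink" validator pattern (hypothetical types).
#[derive(Debug, PartialEq)]
struct SyntaxError {
    message: String,
    offset: usize,
}

// A validator inspects a piece of syntax and pushes any problems it finds
// into the shared error sink instead of returning a Result.
fn validate_char_literal(text: &str, offset: usize, errors: &mut Vec<SyntaxError>) {
    // A char literal must end with a closing quote: 'x'
    if !text.ends_with('\'') || text.len() < 2 {
        errors.push(SyntaxError {
            message: "unterminated character literal".into(),
            offset,
        });
    }
}

fn main() {
    let mut errors = Vec::new();
    validate_char_literal("'a'", 0, &mut errors); // fine, no error pushed
    validate_char_literal("'a", 10, &mut errors); // unterminated, one error
    assert_eq!(errors.len(), 1);
    assert_eq!(errors[0].offset, 10);
}
```

The sink style lets one traversal collect many errors without aborting on the first one.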
2834: refactor(ra_syntax.validation): removed code duplication from validate_literal() r=kiljacken a=Veetaha

Hi! This is my first ever contribution to this project. I've taken some dirty work from issue #223. This is a simple atomic PR to remove code duplication, following the FIXME comment in the function that is the main focus of the further development. I just didn't want to mix refactoring with the implementation of new features... I am not sure whether you prefer such atomic PRs here, or whether you'd rather have a single PR that contains all the atomic commits inside of it? So if you want me to add all that validation in one PR, I'll mark this one as WIP and update it when the work is finished; otherwise, I'll go with the option of creating separate PRs for each feature: validation of strings, numbers, and comments respectively.

### Comments about refactoring

Yeah, reducing the duplication is quite hard here. Extracting into stateless functions could be another option, but the number of their arguments would be very big and repeated across the char and string implementations, so that just writing out their types and names would become cumbersome. I tried the option of having everything captured implicitly in a closure, but failed, since Rust doesn't have templated (or generic) closures as C++ does; this is needed because `unescape_byte*()` and `unescape_char|str()` have different return types... Maybe I am missing something here? I may be wrong, because I am not experienced enough in Rust... Well, I am awaiting any kind of feedback!

Co-authored-by: Veetaha <gerzoh1@gmail.com>
I'll take a crack at this over the next couple of days.
@chuck-flowers, did you already start? I am also working on this, though not very fast ...
I haven't done much yet.
@chuck-flowers I am not contributing very fast, but I suppose this issue is not the most important one, since it was opened more than a year ago, and I have already learned a decent amount of the codebase around ra_syntax, where the validation occurs.
No problem. I can find something else to work on.
Okay, thanks :)
@matklad @kiljacken I see we are dropping some useful information from rustc_lexer when tokenizing the input. `rustc_lexer::TokenKind` contains information about whether a token is terminated or not, whether a block comment has a closing `*/` or not, and some other helpful details. See its definition here (I wish I could embed code from foreign repos in GitHub): https://github.com/rust-lang/rust/blob/a916ac22b9f7f1f0f7aba0a41a789b3ecd765018/src/librustc_lexer/src/lib.rs#L121

```rust
pub enum LiteralKind {
    /// "12_u8", "0o100", "0b120i99"
    Int { base: Base, empty_int: bool },
    /// "12.34f32", "0b100.100"
    Float { base: Base, empty_exponent: bool },
    /// "'a'", "'\\'", "'''", "';"
    Char { terminated: bool },
    /// "b'a'", "b'\\'", "b'''", "b';"
    Byte { terminated: bool },
    /// ""abc"", ""abc"
    Str { terminated: bool },
    /// "b"abc"", "b"abc"
    ByteStr { terminated: bool },
    /// "r"abc"", "r#"abc"#", "r####"ab"###"c"####", "r#"a"
    RawStr { n_hashes: usize, started: bool, terminated: bool },
    /// "br"abc"", "br#"abc"#", "br####"ab"###"c"####", "br#"a"
    RawByteStr { n_hashes: usize, started: bool, terminated: bool },
}
```

But here we drop that knowledge: https://github.com/rust-analyzer/rust-analyzer/blob/3a7724e/crates/ra_syntax/src/parsing/lexer.rs#L17-L28

Well, saving this info in `ra_parser::SyntaxKind` is not possible, since its representation must be fixed to `u16`. I guess we would need to recompute that anyway, yes? If we do need to recompute it, I think we could reuse …
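To illustrate what the `terminated` flag in `LiteralKind::Str` tracks, here is a tiny self-contained re-check for plain `"..."` literals. This is illustrative only, not the rustc_lexer implementation.

```rust
// Illustrative re-implementation of the "terminated" notion that
// rustc_lexer::LiteralKind::Str { terminated } carries, for plain "..." literals.
fn str_literal_terminated(src: &str) -> bool {
    let mut chars = src.chars();
    // A string literal must start with a quote.
    if chars.next() != Some('"') {
        return false;
    }
    let mut escaped = false;
    for c in chars {
        match c {
            '\\' if !escaped => escaped = true,
            '"' if !escaped => return true, // unescaped closing quote found
            _ => escaped = false,
        }
    }
    false // ran out of input before a closing quote
}

fn main() {
    assert!(str_literal_terminated(r#""abc""#));
    assert!(!str_literal_terminated(r#""abc"#)); // missing closing quote
    assert!(!str_literal_terminated(r#""ends with \""#)); // escaped quote does not close
}
```

The lexer computes this while scanning anyway, which is why dropping the flag and recomputing it later feels wasteful.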
Ideally, the text tokenization function should produce a separate list of diagnostics, just like `SourceFile::parse` does.
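The shape suggested above — tokenization returning diagnostics alongside tokens rather than failing — could look roughly like the following sketch. All type names here are hypothetical stand-ins, and the toy lexer only distinguishes strings from everything else.

```rust
// Sketch: tokenize returns the tokens it could produce plus a separate
// list of diagnostics, mirroring how SourceFile::parse reports errors.
#[derive(Debug)]
struct Token {
    kind: &'static str,
    len: usize,
}

#[derive(Debug)]
struct Diagnostic(String);

fn tokenize(text: &str) -> (Vec<Token>, Vec<Diagnostic>) {
    let mut tokens = Vec::new();
    let mut diagnostics = Vec::new();
    let mut rest = text;
    while !rest.is_empty() {
        let (kind, len) = if rest.starts_with('"') {
            // Consume up to a closing quote; report if we never find one,
            // but still emit a token so parsing can continue.
            match rest[1..].find('"') {
                Some(i) => ("STRING", i + 2),
                None => {
                    diagnostics.push(Diagnostic("unterminated string".into()));
                    ("STRING", rest.len())
                }
            }
        } else {
            let i = rest.find('"').unwrap_or(rest.len());
            ("TEXT", i)
        };
        tokens.push(Token { kind, len });
        rest = &rest[len..];
    }
    (tokens, diagnostics)
}

fn main() {
    let (tokens, diagnostics) = tokenize(r#"a "ok" b "oops"#);
    assert_eq!(tokens.len(), 4);
    assert_eq!(diagnostics.len(), 1);
}
```

The key property is that errors never abort lexing: every byte of input still ends up covered by some token.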
@matklad so do you think we can rework our …
Okay, let me re-review the whole ra_syntax crate and think of the most rational decision ...
I'd imagine …
@matklad But another solution would be to add an explicit … [UPD]: I tried doing this with …
A better name would be …
@bjorn3 well, to me, …
In the mentioned examples there must not be two or more tokens, while …
Well, to conclude the previous ideas: I am not ready to choose which of the proposed design approaches would be the best now. From my perspective, … I don't want to jump straight to conclusions; I try to be rational, so I'll get there once I have enough context.
@Veetaha I think both iterators and vectors might work. I think we don't necessarily need iterators (as we either dump everything into a vec anyway, or tokenize a really short string which should get tokenized into a single token). In such cases, in Rust, I usually just default to … It might be a good idea to move those two "odd" uses of tokenize inside the lexer module, so that we don't actually depend on the specifics of the interface.
Some of the required validations were implemented in PR #2911. Note that they are done at tokenization time and were named … Other errors, like "Doc comments are attached to nodes" or "Symbol path must not contain …", … Here is the updated todo list:
[UPD according to PR review]
2911: Implement collecting errors while tokenizing r=matklad a=Veetaha

Now we are collecting errors from `rustc_lexer` and returning them in `ParsedToken { token, error }` and `ParsedTokens { tokens, errors }` structures **([UPD]: this is now simplified, see the updates below)**. The main changes are introduced in `ra_syntax/parsing/lexer.rs`. It now exposes the following functions and types:

```rust
pub fn tokenize(text: &str) -> ParsedTokens;
pub fn tokenize_append(text: &str, parsed_tokens_to_append_to: &mut ParsedTokens);
pub fn first_token(text: &str) -> Option<ParsedToken>; // allows any number of tokens in text
pub fn single_token(text: &str) -> Option<ParsedToken>; // allows only a single token in text

pub struct ParsedToken { pub token: Token, pub error: Option<SyntaxError> }
pub struct ParsedTokens { pub tokens: Vec<Token>, pub errors: Vec<SyntaxError> }

pub enum TokenizeError { /* Simple enum which reflects rustc_lexer tokenization errors */ }
```

In the first commit I implemented it with iterators, but then decided that since this crate is ad hoc for `rust-analyzer` and we clearly see the places of its usage, it would be better to simplify it to vectors. This is currently WIP, because I want to add tests for error messages generated by the lexer. I'd like to hear your thoughts on how to define these tests in the `ra_syntax/test-data` dir.
Related issues: #223

**[UPD]** After the PR review the API was simplified:

```rust
pub fn tokenize(text: &str) -> (Vec<Token>, Vec<SyntaxError>);

// Both lex functions do not check for unescape errors
pub fn lex_single_syntax_kind(text: &str) -> Option<(SyntaxKind, Option<SyntaxError>)>;
pub fn lex_single_valid_syntax_kind(text: &str) -> Option<SyntaxKind>;

// This will be removed in the next PR in favour of simplifying `SyntaxError` to `(String, TextRange)`
pub enum TokenizeError { /* Simple enum which reflects rustc_lexer tokenization errors */ }

// This is private, but may be made public if such demand arises in the future (least-privilege principle)
fn lex_first_token(text: &str) -> Option<(Token, Option<SyntaxError>)>;
```

Co-authored-by: Veetaha <gerzoh1@gmail.com>
3026: ra_syntax: reshape SyntaxError for the sake of removing redundancy r=matklad a=Veetaha

Followup of #2911; also ticks some boxes on the todo list of #223. **Attention!** A big part of the diff of this PR is test-data file changes.

Simplified `SyntaxError`, which was `SyntaxError { kind: { /* big enum */ }, location: Location }`, to `SyntaxError(String, TextRange)`. I am not sure whether the tuple struct is the best fit here; I am inclined to add names to the fields, because I already provide the getters `SyntaxError::message()` and `SyntaxError::range()`. I also removed `Location` altogether ...

This is currently WIP, because the following is not done:

- [ ] ~~Add tests to the `test_data` dir for unescape errors. *// I don't know where to put these errors in particular, because they are out of the scope of the lexer and parser. However, I have an idea in mind that we move all validators we have right now to the parsing stage, but this is up to discussion...*~~ **[UPD]** I came to the conclusion that the tree validation logic, which unescape errors are a part of, should be rethought; we currently have no tests and no place to put tests for tree validations. So I'd like to extract the potential redesign (maybe a move of tree validation to ra_parser) and adding tests for this into a separate task.

Co-authored-by: Veetaha <gerzoh1@gmail.com>
Co-authored-by: Veetaha <veetaha2@gmail.com>
I took a shot at validating uses like `use {crate as bar, {{crate as cat, foo}}};`. This is pretty obviously solvable with a traversal of the tree, but I'm not sure what performance implications I should be worried about, if any. Any guidance would be super helpful.
Well, my initial thought was about not having a separate validation step and collecting errors like … From my naive perspective, what I see as the ultimate solution is desugaring such patterns like …
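The desugaring idea can be sketched on a toy use-tree: flatten nested groups into full paths, then flag `crate` appearing anywhere but the first segment. The real lowering lives in `ra_hir_def/src/path/lower.rs`; everything below is an illustrative stand-in.

```rust
// Toy model of a use-tree: either a plain path or a prefixed group.
enum UseTree {
    Path(Vec<&'static str>),                // use a::b;
    Group(Vec<&'static str>, Vec<UseTree>), // use prefix::{...};
}

// Desugar nested groups into flat, fully spelled-out paths.
fn flatten(tree: &UseTree, prefix: &[&'static str], out: &mut Vec<Vec<&'static str>>) {
    match tree {
        UseTree::Path(segments) => {
            let mut path = prefix.to_vec();
            path.extend(segments);
            out.push(path);
        }
        UseTree::Group(segments, children) => {
            let mut new_prefix = prefix.to_vec();
            new_prefix.extend(segments);
            for child in children {
                flatten(child, &new_prefix, out);
            }
        }
    }
}

// `crate` is only valid as the very first segment of a flattened path.
fn misplaced_crate(path: &[&'static str]) -> bool {
    path.iter().skip(1).any(|s| *s == "crate")
}

fn main() {
    // Models: use foo::{crate::bar, baz};
    let tree = UseTree::Group(
        vec!["foo"],
        vec![UseTree::Path(vec!["crate", "bar"]), UseTree::Path(vec!["baz"])],
    );
    let mut paths = Vec::new();
    flatten(&tree, &[], &mut paths);
    assert_eq!(paths, vec![vec!["foo", "crate", "bar"], vec!["foo", "baz"]]);
    assert!(misplaced_crate(&paths[0])); // foo::crate::bar is invalid
    assert!(!misplaced_crate(&paths[1])); // foo::baz is fine
}
```

After flattening, the validity check becomes a trivial per-path predicate instead of a structural question about nesting.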
Yeah, I also thought about moving this down to the parser, but I think it's important to move as many errors as possible out of the parser and into a separate pass/flow, so that in the long run we can give higher-quality error messages. Also, I found the desugaring/lowering pass: https://github.com/rust-analyzer/rust-analyzer/blob/master/crates/ra_hir_def/src/path/lower.rs and reading through it I feel like an idiot. There's a `PathSegmentKind` that will tell me whether the segment is the crate keyword, so I'll rewrite in terms of that. For now though, I think there's still value in the conservative validation pass, so long as we track that it doesn't error on UseTrees.
Imo, this should just be a separate traversal of the syntax tree. I wouldn't worry about perf.
As there seems to have been no activity here for over a month, would it make sense for me to try to make some progress on this? @Veetaha, are you planning to do more work here?
@jacg yup, feel free to go :)
Looking at … Would there be any objection to rewriting the …
The functions by design accept a sink argument (…).
Actually …
@matklad So I can rely on each …
I don't follow. What's wrong with the …?
I haven't looked at the code in a while, but I think producing more than one error is ok. Generally, a sink argument is a pattern that usually works best for producing a bunch of errors/warnings.
So, in order to write unit tests for these, do you recommend just keeping the sink argument and having the tests provide an empty one? It would make expressing the tests a bit more of a pain, but I guess that could actually be hidden in a little macro.
I think those are best tested via data-driven tests, so that the tests are completely oblivious to the actual interface: …
Yup, just adding a bunch of input …
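The data-driven style discussed above can be sketched as follows: each test case is an input plus an expected-output snapshot, and the test knows nothing about the validator's interface beyond "text in, rendered errors out". Here `render_errors` and its error message are hypothetical stand-ins for running the real validators.

```rust
// Hypothetical stand-in for running the real validators over some input
// and rendering the collected errors to text.
fn render_errors(input: &str) -> String {
    if input.contains('\t') {
        "error: tab character in literal\n".to_string()
    } else {
        String::new()
    }
}

// The data-driven harness: compare rendered errors against a snapshot.
fn check(input: &str, expected: &str) {
    let actual = render_errors(input);
    assert_eq!(actual, expected, "snapshot mismatch for input {:?}", input);
}

fn main() {
    // In the real setup the inputs and snapshots would live as file pairs
    // under a test-data directory, e.g. err/0001.rs alongside err/0001.txt.
    check("let x = 1;", "");
    check("let s = \"a\tb\";", "error: tab character in literal\n");
}
```

Because the snapshot files are plain text, refactoring the validator API (sink argument or not) never touches the tests.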
Oh, that's excellent, I'm so glad this exists!
IIUC, the purpose of the validator is to generate syntax errors in the context of the parser being tolerant of invalid input: the parser accepts incorrect syntax, and the validator, in a separate traversal of the tree, should generate errors in such cases. More specifically, in the context of this issue, let's consider … I would expect …
If I save this sample somewhere in … So it is unclear to me what needs to be done to complete raw string validation.
I think every raw string (and raw byte string) that parses correctly (has the right number of hashes and quotes) is valid. An example of an invalid string would be …
Does that mean that there is nothing left to be done for raw strings?
I might be wrong, but I think there's nothing to be done for raw strings and raw byte strings.
I am 90% sure that yes. Raw string errors are collected from …
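The well-formedness condition mentioned above — the right number of hashes and quotes — can be sketched as a standalone check. This is illustrative only; the real lexer tracks `n_hashes`, `started`, and `terminated` while scanning.

```rust
// Illustrative check that a raw string literal is well formed: the opening
// r#*" must be matched by a closing "#* with the same number of hashes.
fn raw_str_well_formed(src: &str) -> bool {
    let body = match src.strip_prefix('r') {
        Some(b) => b,
        None => return false,
    };
    // Count the opening hashes ('#' is one byte, so the count is a byte index).
    let n_hashes = body.chars().take_while(|&c| c == '#').count();
    let inner = match body[n_hashes..].strip_prefix('"') {
        Some(i) => i,
        None => return false, // e.g. r# with no opening quote: started but malformed
    };
    // The closer is a quote followed by the same number of hashes.
    let closer: String = std::iter::once('"')
        .chain(std::iter::repeat('#').take(n_hashes))
        .collect();
    inner.contains(&closer)
}

fn main() {
    assert!(raw_str_well_formed(r####"r"abc""####));
    assert!(raw_str_well_formed(r####"r#"abc"#"####));
    assert!(!raw_str_well_formed(r####"r#"abc""####)); // hash count mismatch
    assert!(!raw_str_well_formed("r#\"a")); // unterminated
}
```

Since the lexer already rejects anything failing this condition at tokenization time, there is indeed nothing left for a separate tree validator to check.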
I think we implemented the bulk of the stuff, so I'm closing this issue; the rest should be handled in specific issues.
Literals:
Comments: