Correctly tokenize nested comments #1629
Conversation
The tokenizer currently throws an EOF error for `select 'foo' /*/**/*/`.

`last_ch` causes problems when tokenizing nested comments: we have to consume the `/*` or `*/` combination as a unit.

The existing `tokenize_nested_multiline_comment` test fails after fixing this logic:

/*multi-line\n* \n/* comment \n /*comment*/*/ */ /comment*/
^^ Start                                      ^^ End nested comment

Relevant: apache#726
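For a quick repro, here is a minimal sketch that feeds the failing input from this description through the tokenizer. It assumes the `sqlparser` crate as a dependency; picking `PostgreSqlDialect` is my assumption (PostgreSQL documents nested block comments), and which built-in dialects actually opt into nesting is decided by the PR, not by this snippet.

```rust
use sqlparser::dialect::PostgreSqlDialect;
use sqlparser::tokenizer::Tokenizer;

fn main() {
    // Input from the description: the outer comment contains a complete
    // inner `/* */` pair, so the first `*/` must not terminate it.
    let sql = "select 'foo' /*/**/*/";
    let dialect = PostgreSqlDialect {};

    // Before the fix this ended in an EOF error; with depth-aware scanning
    // it should produce a single multi-line comment whitespace token.
    match Tokenizer::new(&dialect, sql).tokenize() {
        Ok(tokens) => println!("{tokens:?}"),
        Err(err) => println!("tokenizer error: {err:?}"),
    }
}
```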
@iffyio I'm thinking about adding an option in the dialect, because not every database supports these nested comments.
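To make that option concrete, here is a self-contained sketch of what a per-dialect switch could look like. The trait and type names below are illustrative stand-ins, not the actual sqlparser-rs `Dialect` trait or whatever this PR finally adds:

```rust
// Illustrative stand-in for a dialect trait; not the actual sqlparser-rs API.
trait CommentDialect {
    /// Whether `/* ... /* ... */ ... */` nests, i.e. the comment only ends
    /// once every opening `/*` has been matched by a closing `*/`.
    fn supports_nested_comments(&self) -> bool {
        // Default to the common behaviour: the first `*/` ends the comment.
        false
    }
}

struct MySqlLike;
struct PostgresLike;

impl CommentDialect for MySqlLike {}

impl CommentDialect for PostgresLike {
    fn supports_nested_comments(&self) -> bool {
        true // PostgreSQL documents nested block comments.
    }
}

fn main() {
    assert!(!MySqlLike.supports_nested_comments());
    assert!(PostgresLike.supports_nested_comments());
}
```

In this sketch a default of `false` keeps existing dialects unchanged, so only dialects that explicitly opt in get the nested-comment behaviour.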
Thanks @hansott! Left a couple comments
src/tokenizer.rs
Outdated
if last_ch == '/' && ch == '*' {
if ch == '/' && matches!(chars.peek(), Some('*')) && supports_nested_comments {
    s.push(ch);
    s.push(chars.next().unwrap()); // consume the '*'
can we rewrite this to return an error in order to avoid the unwrap?
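For illustration, one way that could look as a standalone sketch; the helper name and error type here are made up for the example and are not the sqlparser-rs `TokenizerError`:

```rust
use std::iter::Peekable;
use std::str::Chars;

/// Stand-in error type, for illustration only.
#[derive(Debug)]
struct CommentError(String);

/// Consume the character the caller has just peeked and push it onto `s`,
/// surfacing an error instead of unwrapping if the input unexpectedly ends.
fn push_next(chars: &mut Peekable<Chars<'_>>, s: &mut String) -> Result<(), CommentError> {
    match chars.next() {
        Some(ch) => {
            s.push(ch);
            Ok(())
        }
        None => Err(CommentError("unexpected EOF inside comment".to_string())),
    }
}

fn main() {
    let mut chars = "*/".chars().peekable();
    let mut s = String::from("/");
    if matches!(chars.peek(), Some('*')) {
        push_next(&mut chars, &mut s).expect("peeked character must still be there");
    }
    assert_eq!(s, "/*");
}
```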
src/tokenizer.rs
Outdated
    break Ok(Some(Token::Whitespace(Whitespace::MultiLineComment(s))));
}
s.push(slash.unwrap());
same here, it would be nice if we could avoid unwrapping in the code
src/tokenizer.rs
Outdated
loop {
    match chars.next() {
        Some(ch) => {
            if last_ch == '/' && ch == '*' {
            if ch == '/' && matches!(chars.peek(), Some('*')) && supports_nested_comments {
could we move some of the conditions out into the match statement? That way we avoid the extra nesting and `continue`, and the different cases are clearer, e.g.
Some('/') if matches!(chars.peek(), Some('*')) && supports_nested_comments => { ... }
Some('*') if matches!(chars.peek(), Some('/')) => { ... }
Some(ch) => { ... }
None => { ... }
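Putting that together, here is a self-contained sketch of the scanning loop with the conditions lifted into match guards and an explicit depth counter; the function name, signature, and `nested` bookkeeping are illustrative rather than the exact code in this PR:

```rust
use std::iter::Peekable;
use std::str::Chars;

/// Scan the body of a `/* ... */` comment, assuming the opening `/*` has
/// already been consumed. Returns the comment body, or `None` on EOF.
fn scan_multiline_comment(
    chars: &mut Peekable<Chars<'_>>,
    supports_nested_comments: bool,
) -> Option<String> {
    let mut s = String::new();
    let mut nested = 1; // depth of unclosed `/*`

    loop {
        match chars.next() {
            // A `/*` inside the comment: consume both chars, go one level deeper.
            Some('/') if matches!(chars.peek(), Some('*')) && supports_nested_comments => {
                chars.next();
                s.push_str("/*");
                nested += 1;
            }
            // A `*/`: consume both chars and close one level.
            Some('*') if matches!(chars.peek(), Some('/')) => {
                chars.next();
                nested -= 1;
                if nested == 0 {
                    break Some(s); // end of the outermost comment
                }
                s.push_str("*/");
            }
            Some(ch) => s.push(ch),
            None => break None, // unterminated comment
        }
    }
}

fn main() {
    // Body of `select 'foo' /*/**/*/ tail` after the opening `/*`.
    let mut chars = "/**/*/ tail".chars().peekable();
    let body = scan_multiline_comment(&mut chars, true);
    assert_eq!(body.as_deref(), Some("/**/"));
    assert_eq!(chars.collect::<String>(), " tail");
}
```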
Makes sense, will do!
@iffyio Addressed your feedback, thanks a lot! Tests passing ✅
…o patch-carriage-return
* 'main' of github.com:hansott/datafusion-sqlparser-rs:
  Add support for MySQL's INSERT INTO ... SET syntax (apache#1641)
  Add support for Snowflake LIST and REMOVE (apache#1639)
  Add support for the SQL OVERLAPS predicate (apache#1638)
  Add support for various Snowflake grantees (apache#1640)
  Add support for USE SECONDARY ROLE (vs. ROLES) (apache#1637)
  Correctly tokenize nested comments (apache#1629)
  Add support for MYSQL's `RENAME TABLE` (apache#1616)
  Test benchmarks and Improve benchmark README.md (apache#1627)