Correctly tokenize nested comments #1629
Conversation
The tokenizer currently throws an EOF error for `select 'foo' /*/**/*/`.

`last_ch` causes problems when tokenizing nested comments: we have to consume the `/*` or `*/` combination as a unit.

The existing `tokenize_nested_multiline_comment` test fails after fixing this logic:

/*multi-line\n* \n/* comment \n /*comment*/*/ */ /comment*/
^^ Start                                      ^^ End nested comment

Relevant: apache#726
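For a quick repro, here is a minimal sketch that feeds the failing input from this description through the tokenizer. It assumes the `sqlparser` crate as a dependency; picking `PostgreSqlDialect` is my assumption (PostgreSQL documents nested block comments), and which built-in dialects actually opt into nesting is decided by the PR, not by this snippet.

```rust
use sqlparser::dialect::PostgreSqlDialect;
use sqlparser::tokenizer::Tokenizer;

fn main() {
    // Input from the description: the outer comment contains a complete
    // inner `/* */` pair, so the first `*/` must not terminate it.
    let sql = "select 'foo' /*/**/*/";
    let dialect = PostgreSqlDialect {};

    // Before the fix this ended in an EOF error; with depth-aware scanning
    // it should produce a single multi-line comment whitespace token.
    match Tokenizer::new(&dialect, sql).tokenize() {
        Ok(tokens) => println!("{tokens:?}"),
        Err(err) => println!("tokenizer error: {err:?}"),
    }
}
```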
@iffyio I'm thinking about adding an option in the dialect, because not every database supports these nested comments.
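To make that option concrete, here is a self-contained sketch of what a per-dialect switch could look like. The trait and type names below are illustrative stand-ins, not the actual sqlparser-rs `Dialect` trait or whatever this PR finally adds:

```rust
// Illustrative stand-in for a dialect trait; not the actual sqlparser-rs API.
trait CommentDialect {
    /// Whether `/* ... /* ... */ ... */` nests, i.e. the comment only ends
    /// once every opening `/*` has been matched by a closing `*/`.
    fn supports_nested_comments(&self) -> bool {
        // Default to the common behaviour: the first `*/` ends the comment.
        false
    }
}

struct MySqlLike;
struct PostgresLike;

impl CommentDialect for MySqlLike {}

impl CommentDialect for PostgresLike {
    fn supports_nested_comments(&self) -> bool {
        true // PostgreSQL documents nested block comments.
    }
}

fn main() {
    assert!(!MySqlLike.supports_nested_comments());
    assert!(PostgresLike.supports_nested_comments());
}
```

In this sketch a default of `false` keeps existing dialects unchanged, so only dialects that explicitly opt in get the nested-comment behaviour.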
Thanks @hansott! Left a couple comments
src/tokenizer.rs
Outdated
if last_ch == '/' && ch == '*' {
if ch == '/' && matches!(chars.peek(), Some('*')) && supports_nested_comments {
    s.push(ch);
    s.push(chars.next().unwrap()); // consume the '*'
can we rewrite this to return an error in order to avoid the unwrap?
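For illustration, one way that could look as a standalone sketch; the helper name and error type here are made up for the example and are not the sqlparser-rs `TokenizerError`:

```rust
use std::iter::Peekable;
use std::str::Chars;

/// Stand-in error type, for illustration only.
#[derive(Debug)]
struct CommentError(String);

/// Consume the character the caller has just peeked and push it onto `s`,
/// surfacing an error instead of unwrapping if the input unexpectedly ends.
fn push_next(chars: &mut Peekable<Chars<'_>>, s: &mut String) -> Result<(), CommentError> {
    match chars.next() {
        Some(ch) => {
            s.push(ch);
            Ok(())
        }
        None => Err(CommentError("unexpected EOF inside comment".to_string())),
    }
}

fn main() {
    let mut chars = "*/".chars().peekable();
    let mut s = String::from("/");
    if matches!(chars.peek(), Some('*')) {
        push_next(&mut chars, &mut s).expect("peeked character must still be there");
    }
    assert_eq!(s, "/*");
}
```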
src/tokenizer.rs
Outdated
    break Ok(Some(Token::Whitespace(Whitespace::MultiLineComment(s))));
}
s.push(slash.unwrap());
same here, it would be nice if we could avoid unwrapping in the code
src/tokenizer.rs
Outdated
loop {
    match chars.next() {
        Some(ch) => {
            if last_ch == '/' && ch == '*' {
            if ch == '/' && matches!(chars.peek(), Some('*')) && supports_nested_comments {
could we move some of the conditions out into the match statement? That way we avoid the extra nesting and `continue`, and the different cases are clearer, e.g.
Some('/') if matches!(chars.peek(), Some('*')) && supports_nested_comments => { ... }
Some('*') if matches!(chars.peek(), Some('/')) => { ... }
Some(ch) => { ... }
None => { ... }
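Putting that together, here is a self-contained sketch of the scanning loop with the conditions lifted into match guards and an explicit depth counter; the function name, signature, and `nested` bookkeeping are illustrative rather than the exact code in this PR:

```rust
use std::iter::Peekable;
use std::str::Chars;

/// Scan the body of a `/* ... */` comment, assuming the opening `/*` has
/// already been consumed. Returns the comment body, or `None` on EOF.
fn scan_multiline_comment(
    chars: &mut Peekable<Chars<'_>>,
    supports_nested_comments: bool,
) -> Option<String> {
    let mut s = String::new();
    let mut nested = 1; // depth of unclosed `/*`

    loop {
        match chars.next() {
            // A `/*` inside the comment: consume both chars, go one level deeper.
            Some('/') if matches!(chars.peek(), Some('*')) && supports_nested_comments => {
                chars.next();
                s.push_str("/*");
                nested += 1;
            }
            // A `*/`: consume both chars and close one level.
            Some('*') if matches!(chars.peek(), Some('/')) => {
                chars.next();
                nested -= 1;
                if nested == 0 {
                    break Some(s); // end of the outermost comment
                }
                s.push_str("*/");
            }
            Some(ch) => s.push(ch),
            None => break None, // unterminated comment
        }
    }
}

fn main() {
    // Body of `select 'foo' /*/**/*/ tail` after the opening `/*`.
    let mut chars = "/**/*/ tail".chars().peekable();
    let body = scan_multiline_comment(&mut chars, true);
    assert_eq!(body.as_deref(), Some("/**/"));
    assert_eq!(chars.collect::<String>(), " tail");
}
```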
Makes sense, will do!
@iffyio Addressed your feedback, thanks a lot! Tests passing ✅
…o patch-carriage-return
* 'main' of github.com:hansott/datafusion-sqlparser-rs:
  Add support for MySQL's INSERT INTO ... SET syntax (apache#1641)
  Add support for Snowflake LIST and REMOVE (apache#1639)
  Add support for the SQL OVERLAPS predicate (apache#1638)
  Add support for various Snowflake grantees (apache#1640)
  Add support for USE SECONDARY ROLE (vs. ROLES) (apache#1637)
  Correctly tokenize nested comments (apache#1629)
  Add support for MYSQL's `RENAME TABLE` (apache#1616)
  Test benchmarks and Improve benchmark README.md (apache#1627)