Implement `Spanned` to retrieve source locations on AST nodes #1435

Nyrox · 2024-09-19T11:28:35Z

This PR adds a new trait Spanned to retrieve source spans on AST nodes, see #161 , by recursively traversing the AST and combining spans. This approach is in contrast to the one taken in #839 and #790 by trying to minimise the amount of breaking changes for downstream users, by avoiding wrapping everything in WithSpan<T>.

Main Changes

Spanned trait and Span type added
TokenWithLocation now stores Span, instead of Location
Ident now stores a source span (omitted from Partial/-EQ for compatibility and tests)
Some AST nodes store source tokens where not intrusive
Parser::parse_keyword_token added

The general philosophy of the PR is to be "good enough" without breaking things. As a result certain expressions will have broken or incorrect spans. I imagine these can be cleaned up in future PRs (which might require breaking changes).

f.e. many expressions do not include keywords in their span i.e.

<expr> IS NOT NULL
|....|

will have it's source span simply reported as <expr>::span and there is many such cases, some of which are easier to fix than others. For expressions we can't generate spans for, I use Span::EMPTY as a sort of sentinel value to indicate missing information.

With this approach the only downstream changes a user should have to do to upgrade, should be adding additional fields when matching on AST nodes.

Future Work

Store spans for ast::value::Value. This seems like a breaking change to me, which would require a WithSpan<T> like type
Store keyword TokenWithLocation for expressions that currently don't have them
Implement spans for the rest of the AST, namely Statements. I focused on getting Queries done.

Nyrox · 2024-10-02T11:49:53Z

@alamb Hey, we've started using this functionality internally with pretty great success so far. It's still a draft for now, because of the missing todo!'s, but I would appreciate feedback on the overall design and if you think this can get merged in the foreseeable future once issues are addressed. 😄

alamb · 2024-10-03T11:09:14Z

Thanks @Nyrox -- I'll try and take a look over the next few days

cc @lovasoa @iffyio and @jmhain

src/parser/mod.rs

yuyang-ok · 2024-10-08T06:22:47Z

How is this PR going? is it going to merge or what?

Nyrox · 2024-10-08T10:52:47Z

Alright, I have now gotten rid of all the todo!s and warnings and un-drafted the PR. There is a lot of missing implementations of spans and I have documented those (in hindsight maybe a derive macro would have been the way to go here, someone better at writing those than me is free to take a shot at it 😅 ).

iffyio

Thanks for tackling this @Nyrox! took a quick look and left some comments inline, I'll make some time to do another pass

iffyio · 2024-10-08T17:11:02Z

src/ast/mod.rs

+    where
+        S: Into<String>,
+    {
+        assert!(quote == '\'' || quote == '"' || quote == '`' || quote == '[');


for the assert I think we can remove to avoid panicking in any case. if we want to restrict the quote char the fn would rather return a Result<Self> I imagine, though in this case I think it could make more sense to let the parser accept any char

Removing it sounds good to me, I just wanted to have the same behaviour as Ident::with_quote which still has the check, so it should prob. be removed in both places then

What's the verdict here? Remove or not? :P I am not sure it's in the scope of this PR tbh.

Ah I didn't realise the behavior exists already, it sounds reasonable to follow suit. The current behavior feels unusual to me though, and I agree that it'd be out of scope for this PR to change it. I can take a closer look at it and we can have the behavior here match the outcome

src/ast/spans.rs

Co-authored-by: Ifeanyi Ubah <ify1992@yahoo.com>

iffyio

Thanks @Nyrox! The changes look reasonable to me overall given the discussion in the GH issue. Left some comments, one mostly wondering around the equality behavior now that the token location is embedded within the AST

src/ast/mod.rs

src/ast/spans.rs

src/tokenizer.rs

iffyio · 2024-10-10T05:20:51Z

tests/sqlparser_common.rs

@@ -9805,6 +9853,7 @@ fn parse_map_access_expr() {
 #[test]
 fn parse_connect_by() {
    let expect_query = Select {
+        select_token: TokenWithLocation::wrap(Token::make_keyword("SELECT")),


hmm, wondering, I'm not sure if the behavior to include tokenlocation in comparisons is desirable, thinking rather comparing any two select statements or ast nodes should have the same behavior as today, ignoring any metadata?

Yeah I agree it can be a bit tricky. For this particular case it doesn't matter I think, because TokenWithLocation's custom Eq implementation ignoring spans. So as long as selects actually have a select token they will compare.
What actually is the use case of having Eq on ast nodes? I guess it's so you can throw them into HashMaps?

Ah, no my mistake I missed that the span is being ignored in the comparison here as well. the actual token isn't currently ignored however? thinking that potentially makes it awkward needing to figure out what a token on a property should be, and not sure if there are guarantees provided by the parser on its value? (I'm assuming having the token included in the node is useful)

Not only Eq, but PartialEq as well, basically thinking as with today, any two subexpressions should remain equal regardless of what part of the tree they show up in (for example required when taking fingerprints of sql and comparing them)

Hmm yeah. I mean "Token" in most of these cases is going to be Keyword, so there is some certainty to them being predictable. I.e. if two queries have different tokens, I would very much expect them to actually be different queries.

But I also get if it opens too many cans of worms, so we could just omit it from the trait implementations again, it's just a bit annoying then that we would have to start writing custom implementations for every type that is storing a Token.

The Token is useful, because you need to store it (or a at least it's Span) so you can include it for reported spans.
I.e. select a from b, if you don't store the token for 'select' can't include that as part of it's span :/

Yeah I agree it can become unweildy to try to workaround this by needing custom impls for each type, also as you mention its probably a reasonable assumption that the queries are different if they have different tokens (given that they come from the same parser)

src/tokenizer.rs

src/ast/spans.rs

…ely on self

src/ast/spans.rs

Nyrox · 2024-10-22T12:32:19Z

Bump 😅

lustefaniak · 2024-10-22T21:22:41Z

Hi @Nyrox!
Thanks for this PR and pushing locations support further!

We had an internal discussion to try to upstream the location changes we use in SYNQ. I found some time to finish rebase today and then noticed your PR. (#1480)

I like your suggested changes, as they are very noninvasive for the downstream consumers. If this could be included in the main, that would be absolutely amazing. Having location inside every Ident is probably best. In my PR, I wrapped all Idents with location, but our fork uses them only when we need them. Having it always available inside of Indent sounds much better.

It would also be good to design and include wrapping of nodes. It is essential, especially for enum cases where adding location to each of the enum cases would be too verbose. Also in some places those "smaller" locations just don't make much sense. In my PR, I'm using fc8e44a#diff-fc6a8d66b8cb6bd48a119a70e489cbcef82e7f551e7cf08e8e972ef4e774ef49R12259-R12308 to align captured span to a non-whitespace portion of the query based on token indexes. It is very fast to add such spans with accurate capture this way.

Nyrox · 2024-10-23T12:01:36Z

Hi @lustefaniak, thanks I remember seeing your fork when initially researching this 😅
I agree something like a WithSpan<T> type will probably be needed to improve spans, especially in the case of enums as you said. I think for now the focus, as by the discussions with @alamb was mostly on just reducing the breaking surface as much as we can and get something in. Maybe in a future PR we can start thinking about wrapping all the enums, but then at least we have a gradual roll out 😃

Nyrox · 2024-10-25T11:19:13Z

Hey all, think we've gotten quite far here - is there anything actively blocking this from getting approved? Are there any actions I can take to help you guys or do you just need the time to get to this item? 😊

Nyrox added 7 commits September 16, 2024 10:36

feat(tokenizer): add source location spans to tokens

1a77bac

feat: begin work on trait Spanned

d818012

implement a bunch more stuff

079a4e2

Merge branch 'feat/ast-source-locations'

b97a781

fix: restore old behaviour of location display

df9ab1e

implement spans for eveeeeen more ast nodes

b718c76

feat: more ast nodes

8986a1e

Nyrox changed the title ~~Implement Spanned to retrieve sourcec locations on AST nodes~~ Implement Spanned to retrieve source locations on AST nodes Sep 19, 2024

start working on better tests

aeb4f3a

alamb mentioned this pull request Sep 23, 2024

Syntax highlight for keywords datafusion-contrib/datafusion-dft#149

Merged

Nyrox added 3 commits September 25, 2024 14:31

feat: implement spans for Wildcard projections

4de3209

make union_spans public

a04888a

enable serde feat for spans and locations

1b2b03d

Nyrox force-pushed the main branch from b7335cb to 1b2b03d Compare September 30, 2024 08:55

lovasoa reviewed Oct 3, 2024

View reviewed changes

src/parser/mod.rs Outdated Show resolved Hide resolved

feat: implement remaining ast nodes

5f60bdc

Nyrox added 2 commits October 8, 2024 12:38

fix unused variable warnings

ea8a6b1

undo parse_keyword signature change

6a9250a

Nyrox marked this pull request as ready for review October 8, 2024 10:50

fix: diverging hash and partialeq implementations

0804e99

iffyio reviewed Oct 8, 2024

View reviewed changes

Nyrox and others added 4 commits October 9, 2024 09:53

Update src/ast/spans.rs

eb9ff9a

Co-authored-by: Ifeanyi Ubah <ify1992@yahoo.com>

improve docs & un-pub union_spans

a93cebc

move union_spans to top of file

734264a

replace old tests

441ceb1

iffyio reviewed Oct 10, 2024

View reviewed changes

Nyrox and others added 3 commits October 11, 2024 11:05

pr feedback

e6a4340

add small comment

16a3f2a

refactor: rewrite all span implementations to pattern match exhaustiv…

98b051d

…ely on self

lovasoa reviewed Oct 18, 2024

View reviewed changes

src/ast/spans.rs Outdated Show resolved Hide resolved

for_clause is mssql, not mysql

d76d1e0

lustefaniak mentioned this pull request Oct 22, 2024

locations: Add support for Spans with exact location of AST nodes #1480

Open

3 tasks

Nyrox and others added 2 commits October 25, 2024 13:16

Merge branch 'main' into main

bf75fe4

cargo fmt

71c27ea

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement `Spanned` to retrieve source locations on AST nodes #1435

Implement `Spanned` to retrieve source locations on AST nodes #1435

Nyrox commented Sep 19, 2024 •

edited

Loading

Nyrox commented Oct 2, 2024

alamb commented Oct 3, 2024

yuyang-ok commented Oct 8, 2024

Nyrox commented Oct 8, 2024

iffyio left a comment

iffyio Oct 8, 2024

Nyrox Oct 9, 2024

Nyrox Oct 16, 2024

iffyio Oct 18, 2024

iffyio left a comment

iffyio Oct 10, 2024

Nyrox Oct 11, 2024

iffyio Oct 12, 2024

Nyrox Oct 16, 2024

iffyio Oct 18, 2024

Nyrox commented Oct 22, 2024

lustefaniak commented Oct 22, 2024 •

edited

Loading

Nyrox commented Oct 23, 2024

Nyrox commented Oct 25, 2024

Implement Spanned to retrieve source locations on AST nodes #1435

Are you sure you want to change the base?

Implement Spanned to retrieve source locations on AST nodes #1435

Conversation

Nyrox commented Sep 19, 2024 • edited Loading

Main Changes

Future Work

Nyrox commented Oct 2, 2024

alamb commented Oct 3, 2024

yuyang-ok commented Oct 8, 2024

Nyrox commented Oct 8, 2024

iffyio left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

iffyio left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Nyrox commented Oct 22, 2024

lustefaniak commented Oct 22, 2024 • edited Loading

Nyrox commented Oct 23, 2024

Nyrox commented Oct 25, 2024

Implement `Spanned` to retrieve source locations on AST nodes #1435

Implement `Spanned` to retrieve source locations on AST nodes #1435

Nyrox commented Sep 19, 2024 •

edited

Loading

lustefaniak commented Oct 22, 2024 •

edited

Loading