Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Spanned to retrieve source locations on AST nodes #1435

Open
wants to merge 25 commits into
base: main
Choose a base branch
from

Conversation

Nyrox
Copy link

@Nyrox Nyrox commented Sep 19, 2024

This PR adds a new trait Spanned to retrieve source spans on AST nodes, see #161 , by recursively traversing the AST and combining spans. This approach is in contrast to the one taken in #839 and #790 by trying to minimise the amount of breaking changes for downstream users, by avoiding wrapping everything in WithSpan<T>.

Main Changes

  • Spanned trait and Span type added
  • TokenWithLocation now stores Span, instead of Location
  • Ident now stores a source span (omitted from Partial/-EQ for compatibility and tests)
  • Some AST nodes store source tokens where not intrusive
  • Parser::parse_keyword_token added

The general philosophy of the PR is to be "good enough" without breaking things. As a result certain expressions will have broken or incorrect spans. I imagine these can be cleaned up in future PRs (which might require breaking changes).

f.e. many expressions do not include keywords in their span i.e.

<expr> IS NOT NULL
|....|

will have it's source span simply reported as <expr>::span and there is many such cases, some of which are easier to fix than others. For expressions we can't generate spans for, I use Span::EMPTY as a sort of sentinel value to indicate missing information.

With this approach the only downstream changes a user should have to do to upgrade, should be adding additional fields when matching on AST nodes.

Future Work

  • Store spans for ast::value::Value. This seems like a breaking change to me, which would require a WithSpan<T> like type
  • Store keyword TokenWithLocation for expressions that currently don't have them
  • Implement spans for the rest of the AST, namely Statements. I focused on getting Queries done.

@Nyrox Nyrox changed the title Implement Spanned to retrieve sourcec locations on AST nodes Implement Spanned to retrieve source locations on AST nodes Sep 19, 2024
@Nyrox
Copy link
Author

Nyrox commented Oct 2, 2024

@alamb Hey, we've started using this functionality internally with pretty great success so far. It's still a draft for now, because of the missing todo!'s, but I would appreciate feedback on the overall design and if you think this can get merged in the foreseeable future once issues are addressed. 😄

@alamb
Copy link
Contributor

alamb commented Oct 3, 2024

Thanks @Nyrox -- I'll try and take a look over the next few days

cc @lovasoa @iffyio and @jmhain

src/parser/mod.rs Outdated Show resolved Hide resolved
@yuyang-ok
Copy link

How is this PR going? is it going to merge or what?

@Nyrox Nyrox marked this pull request as ready for review October 8, 2024 10:50
@Nyrox
Copy link
Author

Nyrox commented Oct 8, 2024

Alright, I have now gotten rid of all the todo!s and warnings and un-drafted the PR. There is a lot of missing implementations of spans and I have documented those (in hindsight maybe a derive macro would have been the way to go here, someone better at writing those than me is free to take a shot at it 😅 ).

Copy link
Contributor

@iffyio iffyio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for tackling this @Nyrox! took a quick look and left some comments inline, I'll make some time to do another pass

where
S: Into<String>,
{
assert!(quote == '\'' || quote == '"' || quote == '`' || quote == '[');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for the assert I think we can remove to avoid panicking in any case. if we want to restrict the quote char the fn would rather return a Result<Self> I imagine, though in this case I think it could make more sense to let the parser accept any char

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing it sounds good to me, I just wanted to have the same behaviour as Ident::with_quote which still has the check, so it should prob. be removed in both places then

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the verdict here? Remove or not? :P I am not sure it's in the scope of this PR tbh.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I didn't realise the behavior exists already, it sounds reasonable to follow suit. The current behavior feels unusual to me though, and I agree that it'd be out of scope for this PR to change it. I can take a closer look at it and we can have the behavior here match the outcome

src/ast/spans.rs Outdated Show resolved Hide resolved
src/ast/spans.rs Outdated Show resolved Hide resolved
src/ast/spans.rs Outdated Show resolved Hide resolved
src/ast/spans.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@iffyio iffyio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Nyrox! The changes look reasonable to me overall given the discussion in the GH issue. Left some comments, one mostly wondering around the equality behavior now that the token location is embedded within the AST

src/ast/mod.rs Outdated Show resolved Hide resolved
src/ast/spans.rs Outdated Show resolved Hide resolved
src/tokenizer.rs Show resolved Hide resolved
@@ -9805,6 +9853,7 @@ fn parse_map_access_expr() {
#[test]
fn parse_connect_by() {
let expect_query = Select {
select_token: TokenWithLocation::wrap(Token::make_keyword("SELECT")),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, wondering, I'm not sure if the behavior to include tokenlocation in comparisons is desirable, thinking rather comparing any two select statements or ast nodes should have the same behavior as today, ignoring any metadata?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I agree it can be a bit tricky. For this particular case it doesn't matter I think, because TokenWithLocation's custom Eq implementation ignoring spans. So as long as selects actually have a select token they will compare.
What actually is the use case of having Eq on ast nodes? I guess it's so you can throw them into HashMaps?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, no my mistake I missed that the span is being ignored in the comparison here as well. the actual token isn't currently ignored however? thinking that potentially makes it awkward needing to figure out what a token on a property should be, and not sure if there are guarantees provided by the parser on its value? (I'm assuming having the token included in the node is useful)

Not only Eq, but PartialEq as well, basically thinking as with today, any two subexpressions should remain equal regardless of what part of the tree they show up in (for example required when taking fingerprints of sql and comparing them)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm yeah. I mean "Token" in most of these cases is going to be Keyword, so there is some certainty to them being predictable. I.e. if two queries have different tokens, I would very much expect them to actually be different queries.

But I also get if it opens too many cans of worms, so we could just omit it from the trait implementations again, it's just a bit annoying then that we would have to start writing custom implementations for every type that is storing a Token.

The Token is useful, because you need to store it (or a at least it's Span) so you can include it for reported spans.
I.e. select a from b, if you don't store the token for 'select' can't include that as part of it's span :/

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I agree it can become unweildy to try to workaround this by needing custom impls for each type, also as you mention its probably a reasonable assumption that the queries are different if they have different tokens (given that they come from the same parser)

src/tokenizer.rs Show resolved Hide resolved
src/ast/spans.rs Outdated Show resolved Hide resolved
src/ast/spans.rs Outdated Show resolved Hide resolved
src/ast/spans.rs Outdated Show resolved Hide resolved
@Nyrox
Copy link
Author

Nyrox commented Oct 22, 2024

Bump 😅

@lustefaniak
Copy link
Contributor

lustefaniak commented Oct 22, 2024

Hi @Nyrox!
Thanks for this PR and pushing locations support further!

We had an internal discussion to try to upstream the location changes we use in SYNQ. I found some time to finish rebase today and then noticed your PR. (#1480)

I like your suggested changes, as they are very noninvasive for the downstream consumers. If this could be included in the main, that would be absolutely amazing. Having location inside every Ident is probably best. In my PR, I wrapped all Idents with location, but our fork uses them only when we need them. Having it always available inside of Indent sounds much better.

It would also be good to design and include wrapping of nodes. It is essential, especially for enum cases where adding location to each of the enum cases would be too verbose. Also in some places those "smaller" locations just don't make much sense. In my PR, I'm using fc8e44a#diff-fc6a8d66b8cb6bd48a119a70e489cbcef82e7f551e7cf08e8e972ef4e774ef49R12259-R12308 to align captured span to a non-whitespace portion of the query based on token indexes. It is very fast to add such spans with accurate capture this way.

@Nyrox
Copy link
Author

Nyrox commented Oct 23, 2024

Hi @lustefaniak, thanks I remember seeing your fork when initially researching this 😅
I agree something like a WithSpan<T> type will probably be needed to improve spans, especially in the case of enums as you said. I think for now the focus, as by the discussions with @alamb was mostly on just reducing the breaking surface as much as we can and get something in. Maybe in a future PR we can start thinking about wrapping all the enums, but then at least we have a gradual roll out 😃

@Nyrox
Copy link
Author

Nyrox commented Oct 25, 2024

Hey all, think we've gotten quite far here - is there anything actively blocking this from getting approved? Are there any actions I can take to help you guys or do you just need the time to get to this item? 😊

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants