
gh-102856: Python tokenizer implementation for PEP 701 #104323

Merged · 20 commits · May 21, 2023

Conversation

@mgmacias95 (Contributor) commented May 9, 2023

@sunmy2019 (Member)

A thought: should this be aligned with the C tokenizer?

If so, we can add tests comparing the Python tokenizer with the internal C tokenizer.

@sunmy2019 requested a review from isidentical · May 9, 2023
@mgmacias95 (Contributor, Author)

It should be aligned with the C tokenizer, but some tokens differ. For example, the C tokenizer returns LBRACE and RBRACE tokens for { and }, while the Python one just returns an OP token.

Matching tests sound like a good idea to make sure they stay aligned. I can add them :).
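For reference, a minimal sketch of the difference being discussed, using the documented tokenize API (treat the exact output as illustrative): the pure-Python module reports braces with the generic OP type, and the specific LBRACE/RBRACE names only show up via exact_type.

    import io
    import tokenize

    # The Python tokenize module reports '{' and '}' with the generic OP type;
    # the specific LBRACE/RBRACE names are only exposed via tok.exact_type.
    for tok in tokenize.generate_tokens(io.StringIO("{1: 2}\n").readline):
        print(tokenize.tok_name[tok.type], tokenize.tok_name[tok.exact_type], repr(tok.string))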

@lysnikolaou (Member)

Not sure about matching tests. There are many, often very slight, differences between the implementations of the C tokenizer and the Python tokenize module, and that's something we've been okay with for a long time. I wonder whether writing those tests is worth the effort.

@lysnikolaou (Member) left a comment

Thanks @mgmacias95 for working on this! I just had a first look at it and it looks great in general.

However, I think the regex for matching f-strings is going to fail for nested strings that use the same quote. Since this is something that the PEP explicitly allows and that has also been somewhat "advertised", I feel most people would expect the tokenize module to support it as well, especially since we're putting in the work to support the new tokens.

Am I missing something? Is that maybe handled a different way? How do others feel about not supporting nested strings with the same quotation?
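For context, this is the kind of same-quote nesting PEP 701 allows (a minimal, hypothetical example; it requires Python 3.12+ and is a SyntaxError on earlier versions):

    # PEP 701 allows reusing the outer f-string's quote character inside a
    # replacement field; earlier Pythons reject this at the syntax level.
    names = ["Ada", "Grace"]
    print(f"Names: {", ".join(names)}")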

@pablogsal (Member) commented May 12, 2023

> Thanks @mgmacias95 for working on this! I just had a first look at it and it looks great in general.
>
> However, I think the regex for matching f-strings is going to fail for nested strings that use the same quote. Since this is something that the PEP explicitly allows and that has also been somewhat "advertised", I feel most people would expect the tokenize module to support it as well, especially since we're putting in the work to support the new tokens.
>
> Am I missing something? Is that maybe handled a different way? How do others feel about not supporting nested strings with the same quotation?

Sorry for the lack of context, let me explain the plan:

  • The current tokenize implementation is based on regular expressions. Unfortunately, this makes chunking the f-string into the parts we need very difficult, because we technically need to parse character by character to stop exactly where we need to; the current design makes that very hard, and a full reimplementation would make the whole ordeal much trickier if we don't want to break anything.
  • The general plan is to first identify the full f-string and then pass it to a post-processing function that chunks it into the appropriate tokens. The challenge is to correctly identify the full f-string when there are repeated nested quotes. This is possible, but it requires a special branch in the tokenizer's normal mode that activates when an f-string is detected and switches to custom character-by-character parsing, identifying { and } and matching quotes with a stack (see the sketch after this list). This lets us reuse as much of the non-f-string code as possible while applying the new strategy.
  • As we are a bit short on time, I think the better strategy is to first merge a version that doesn't handle nested quotes, make sure it works in general, and then fix nested quotes separately (which is then just a matter of implementing the code that correctly matches the start and end quotes so we can pass the real f-string to the chunking code).
  • Since we want to merge as much as possible before the beta freeze and fix bugs and edge cases later, the most contentious code is possibly this one (and not the 'identify-nested-f-strings' part).
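As a rough illustration of the character-by-character scan described in the second bullet above, here is a minimal, hypothetical sketch (not the code that was merged; the function name is invented, and edge cases such as triple-quoted and nested f-strings are ignored):

    def find_fstring_end(src: str, start: int, quote: str) -> int:
        """Return the index just past the closing quote of an f-string whose
        opening quote (of type `quote`) ends at index `start`."""
        depth = 0      # nesting level of { ... } replacement fields
        quotes = []    # stack of quote characters opened inside replacement fields
        i = start
        while i < len(src):
            ch = src[i]
            if ch == "\\":
                i += 2                         # skip escaped characters
                continue
            if depth == 0 and not quotes and src.startswith(quote, i):
                return i + len(quote)          # closing quote of the outer f-string
            if ch == "{":
                if depth == 0 and src.startswith("{{", i):
                    i += 2                     # escaped brace, not a replacement field
                    continue
                depth += 1
            elif ch == "}" and depth:
                depth -= 1
            elif ch in "'\"" and depth:
                if quotes and quotes[-1] == ch:
                    quotes.pop()               # closing an inner string
                else:
                    quotes.append(ch)          # opening an inner string
            i += 1
        raise SyntaxError("unterminated f-string")

For example, find_fstring_end('f"outer {"inner"} end"', 2, '"') returns the index just past the final quote, i.e. the point where the regular tokenizer could resume.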

> Not sure about matching tests.

I don't think this is possible, as the two tokenizers are incompatible. One of them emits tokens the other does not, and they also represent the same input differently (the Python tokenizer emits a generic OP token for operators, while the C tokenizer emits specific ones). Also, there are tokens that the C tokenizer never emits (such as ENCODING). We would lose more time trying to match them and track down the differences than we would by treating them separately.

What do you think?
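As one concrete illustration of the mismatch (a minimal sketch using the documented tokenize API; the UTF-8 result assumes a source without an encoding cookie): the Python module starts every stream with an ENCODING token, which has no counterpart in the C tokenizer's output.

    import io
    import tokenize

    # tokenize.tokenize() yields an ENCODING token first; the C tokenizer has
    # no equivalent token in its stream.
    first = next(tokenize.tokenize(io.BytesIO(b"x = 1\n").readline))
    print(tokenize.tok_name[first.type], first.string)  # ENCODING utf-8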

@sunmy2019 (Member)

> Since we want to merge as much as possible before the beta freeze and fix bugs and edge cases later, the most contentious code is possibly this one (and not the 'identify-nested-f-strings' part).

That makes sense.

> We would lose more time trying to match them and track down the differences than we would by treating them separately.

I see the point here. I agree.

@lysnikolaou (Member)

The plan makes sense! Thanks for the thorough explanation @pablogsal!

@pablogsal marked this pull request as ready for review · May 18, 2023
@pablogsal (Member)

CC: @isidentical @lysnikolaou

Ok, we are switching directions. It turns out the handling I described was even more difficult, because we also need code that knows how to handle the escaped {{ and }}, and it would get even harder if we also had to parse these while identifying the entire f-string (with nested quotes) for later post-processing. We did some prototypes and the result was quite verbose and unmaintainable, and that's on top of the current tokenizer.

So we had an idea: what if we could reuse the C tokenizer, since it already knows how to handle all of this? The problem, as I said before, is that the C tokenizer emits tokens in a different fashion and doesn't even bother with some of them (like COMMENT or NL tokens). So what we have decided is to teach the C tokenizer to optionally emit these tokens and adapt it so that we can mimic the output of the Python tokenizer as closely as possible. There are some very minor things the Python tokenizer does that are quite odd (like emitting DEDENT tokens at the end with line numbers that do not exist) that we have decided not to replicate, but otherwise this is a HUGE win because:

  • We can remove the custom Python tokenizer entirely.
  • The Python tokenizer and the C tokenizer will never get out of sync.
  • Tokenization is now MUCH faster because it runs at the C level.
  • We get much better coverage of the weird, subtle things the C tokenizer does (we found a bunch of actual errors in the C implementation because of this).
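For anyone curious what the extra tokens look like, here is a minimal sketch using the module's documented API (the pure-Python implementation already emitted COMMENT and NL; the change described above teaches the C tokenizer to optionally do the same):

    import io
    import tokenize

    src = "x = 1  # a comment\n\n"
    # COMMENT and NL tokens show up in tokenize's output, although the C
    # tokenizer normally discards them when feeding the parser.
    for tok in tokenize.generate_tokens(io.StringIO(src).readline):
        print(tokenize.tok_name[tok.type], repr(tok.string))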

@pablogsal added the 🔨 test-with-buildbots label · May 19, 2023
@bedevere-bot

🤖 New build scheduled with the buildbot fleet by @pablogsal for commit f1a5090 🤖

If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.

@bedevere-bot removed the 🔨 test-with-buildbots label · May 19, 2023
@isidentical (Member) commented May 19, 2023

Quick question @pablogsal: Do we still maintain the untokenize(tokenize($src)) == $src guarantee with this switch (with the alternative mode enabled in the C tokenizer)?

Stuff like unnecessary DEDENTs might be relevant to the pretty code generation phase here, although I am not super sure (just a hunch):

cpython/Lib/tokenize.py, lines 204 to 207 at d78c3bc:

    elif tok_type == DEDENT:
        indents.pop()
        self.prev_row, self.prev_col = end
        continue

@pablogsal (Member)

> Quick question @pablogsal: Do we still maintain the untokenize(tokenize($src)) == $src guarantee with this switch (with the alternative mode enabled in the C tokenizer)?

I think we still do (the tests pass, and we spent a huge amount of time making sure they do with all the files, which was non-trivial). I may still be missing something, though.
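A minimal sketch of the round-trip property under discussion, written against the documented tokenize/untokenize API (this is not the actual test from the CPython suite):

    import io
    import tokenize

    def roundtrips(source: bytes) -> bool:
        # With full TokenInfo tuples, untokenize() is meant to reproduce the
        # original source bytes, which is the guarantee being discussed.
        tokens = list(tokenize.tokenize(io.BytesIO(source).readline))
        return tokenize.untokenize(tokens) == source

    print(roundtrips(b"def f(x):\n    return x  # comment\n"))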

Review comment on the removed test code below:

        self.assertEqual(tokens, expected_tokens,
                         "bytes not decoded with encoding")

    def test__tokenize_does_not_decode_with_encoding_none(self):

This is being removed because it was testing the _tokenize implementation, which doesn't exist anymore and is not public.

@pablogsal added the 🔨 test-with-buildbots label · May 20, 2023
@bedevere-bot

🤖 New build scheduled with the buildbot fleet by @pablogsal for commit 7fb58b0 🤖

If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.

@bedevere-bot removed the 🔨 test-with-buildbots label · May 20, 2023
@sunmy2019 (Member)

> what if we could reuse the C tokenizer, since it already knows how to handle all of this?

+1 for this. Just as I mentioned here: pablogsal#67 (comment)

@pablogsal (Member) left a comment

LGTM! Great job! 🚀

@AlexWaygood (Member)

Looks like this change broke IDLE:

@mgmacias95 deleted the python_tokenizer branch · May 21, 2023
@jayaddison

It's possible that the tokenizer changes here introduced some parsing-related test failures in Sphinx (sphinx-doc/sphinx#11436); we've begun looking into the cause.

(It may turn out to be a problem to resolve on the Sphinx side; I'm not yet sure whether it suggests a bug in CPython, but I thought it'd be worthwhile to mention.)

@pablogsal (Member)

Would it be possible to give us a self-contained reproducer? With that, we may be able to tell whether this is an expected change or a bug.

@pablogsal (Member)

Also, could you please open an issue for this once you have your reproducer?

@jayaddison

Yep, sure thing @pablogsal (on both counts: a repro case and a bug report to go along with it). I'll send those along when available (or will similarly confirm if it turns out to be a non-issue).

@jayaddison

Thanks to @mgmacias95's explanation, I believe the updated tokenizer is working as expected.

There is some code in Sphinx that handles what I'd consider a quirk of the previous tokenizer, and that code isn't compatible with the updated (improved, I think) representation of dedent tokens.

(apologies for the distraction, and thank you for the help)
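For anyone hitting a similar issue, a quick way to inspect how DEDENT tokens are reported on a given Python version (a minimal sketch; the positions of trailing DEDENTs are what the thread above describes as having changed):

    import io
    import tokenize

    src = "if True:\n    pass\n"
    for tok in tokenize.generate_tokens(io.StringIO(src).readline):
        if tok.type == tokenize.DEDENT:
            # Print the full TokenInfo, including start/end positions, so the
            # behaviour can be compared across versions.
            print(tok)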
