
gh-102856: Python tokenizer implementation for PEP 701 #104323

Merged · 20 commits · May 21, 2023

Conversation

@mgmacias95 (Contributor) commented May 9, 2023

@sunmy2019 (Member)

A thought: should this be aligned with the C tokenizer?

If so, we can add tests comparing the Python tokenizer with the internal C tokenizer.

@sunmy2019 requested a review from isidentical · May 9, 2023
@mgmacias95 (Contributor, Author)

It should be aligned with the C tokenizer, but some tokens differ. For example, the C tokenizer returns LBRACE and RBRACE tokens for { and }, while the Python one just returns an OP token.

Matching tests sound like a good idea to make sure they stay aligned. I can add them :).
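For reference, a minimal sketch of the difference being discussed, using the documented tokenize API (treat the exact output as illustrative): the pure-Python module reports braces with the generic OP type, and the specific LBRACE/RBRACE names only show up via exact_type.

    import io
    import tokenize

    # The Python tokenize module reports '{' and '}' with the generic OP type;
    # the specific LBRACE/RBRACE names are only exposed via tok.exact_type.
    for tok in tokenize.generate_tokens(io.StringIO("{1: 2}\n").readline):
        print(tokenize.tok_name[tok.type], tokenize.tok_name[tok.exact_type], repr(tok.string))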

@lysnikolaou (Member)

Not sure about matching tests. There are many, often very slight, differences between the implementations of the C tokenizer and the Python tokenize module, and that's something we've been okay with for a long time. I wonder whether writing those tests is worth the effort.

@lysnikolaou (Member) left a comment

Thanks @mgmacias95 for working on this! I just had a first look at it and it looks great in general.

However, I think the regex for matching f-strings is going to fail for nested strings that use the same quote. Since this is something that the PEP explicitly allows and that has also been somewhat "advertised", I feel most people would expect the tokenize module to support it as well, especially since we're putting in the work to support the new tokens.

Am I missing something? Is that maybe handled a different way? How do others feel about not supporting nested strings with the same quotation?
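For context, this is the kind of same-quote nesting PEP 701 allows (a minimal, hypothetical example; it requires Python 3.12+ and is a SyntaxError on earlier versions):

    # PEP 701 allows reusing the outer f-string's quote character inside a
    # replacement field; earlier Pythons reject this at the syntax level.
    names = ["Ada", "Grace"]
    print(f"Names: {", ".join(names)}")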

@pablogsal (Member) commented May 12, 2023

> Thanks @mgmacias95 for working on this! I just had a first look at it and it looks great in general.
>
> However, I think the regex for matching f-strings is going to fail for nested strings that use the same quote. Since this is something that the PEP explicitly allows and that has also been somewhat "advertised", I feel most people would expect the tokenize module to support it as well, especially since we're putting in the work to support the new tokens.
>
> Am I missing something? Is that maybe handled a different way? How do others feel about not supporting nested strings with the same quotation?

Sorry for the lack of context, let me explain the plan:

  • The current tokenize implementation is based on regular expressions. Unfortunately, this makes chunking the f-string into the parts we need very difficult, because we technically need to parse character by character to stop exactly where we need to; the current design makes that very hard, and a full reimplementation would make the whole ordeal much trickier if we don't want to break anything.
  • The general plan is to first identify the full f-string and then pass it to a post-processing function that chunks it into the appropriate tokens. The challenge is to correctly identify the full f-string when there are repeated nested quotes. This is possible, but it requires a special branch in the tokenizer's normal mode that activates when an f-string is detected and switches to custom character-by-character parsing, identifying { and } and matching quotes with a stack (see the sketch after this list). This lets us reuse as much of the non-f-string code as possible while applying the new strategy.
  • As we are a bit short on time, I think the better strategy is to first merge a version that doesn't handle nested quotes, make sure it works in general, and then fix nested quotes separately (which is then just a matter of implementing the code that correctly matches the start and end quotes so we can pass the real f-string to the chunking code).
  • Since we want to merge as much as possible before the beta freeze and fix bugs and edge cases later, the most contentious code is possibly this one (and not the 'identify-nested-f-strings' part).
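As a rough illustration of the character-by-character scan described in the second bullet above, here is a minimal, hypothetical sketch (not the code that was merged; the function name is invented, and edge cases such as triple-quoted and nested f-strings are ignored):

    def find_fstring_end(src: str, start: int, quote: str) -> int:
        """Return the index just past the closing quote of an f-string whose
        opening quote (of type `quote`) ends at index `start`."""
        depth = 0      # nesting level of { ... } replacement fields
        quotes = []    # stack of quote characters opened inside replacement fields
        i = start
        while i < len(src):
            ch = src[i]
            if ch == "\\":
                i += 2                         # skip escaped characters
                continue
            if depth == 0 and not quotes and src.startswith(quote, i):
                return i + len(quote)          # closing quote of the outer f-string
            if ch == "{":
                if depth == 0 and src.startswith("{{", i):
                    i += 2                     # escaped brace, not a replacement field
                    continue
                depth += 1
            elif ch == "}" and depth:
                depth -= 1
            elif ch in "'\"" and depth:
                if quotes and quotes[-1] == ch:
                    quotes.pop()               # closing an inner string
                else:
                    quotes.append(ch)          # opening an inner string
            i += 1
        raise SyntaxError("unterminated f-string")

For example, find_fstring_end('f"outer {"inner"} end"', 2, '"') returns the index just past the final quote, i.e. the point where the regular tokenizer could resume.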

> Not sure about matching tests.

I don't think this is possible, as the two tokenizers are incompatible. One of them emits tokens the other does not, and they also represent the same input differently (the Python tokenizer emits a generic OP token for operators, while the C tokenizer emits specific ones). Also, there are tokens that the C tokenizer never emits (such as ENCODING). We would lose more time trying to match them and track down the differences than we would by treating them separately.

What do you think?
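As one concrete illustration of the mismatch (a minimal sketch using the documented tokenize API; the UTF-8 result assumes a source without an encoding cookie): the Python module starts every stream with an ENCODING token, which has no counterpart in the C tokenizer's output.

    import io
    import tokenize

    # tokenize.tokenize() yields an ENCODING token first; the C tokenizer has
    # no equivalent token in its stream.
    first = next(tokenize.tokenize(io.BytesIO(b"x = 1\n").readline))
    print(tokenize.tok_name[first.type], first.string)  # ENCODING utf-8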

@sunmy2019 (Member)

> Since we want to merge as much as possible before the beta freeze and fix bugs and edge cases later, the most contentious code is possibly this one (and not the 'identify-nested-f-strings' part).

That makes sense.

> We would lose more time trying to match them and track down the differences than we would by treating them separately.

I see the point here. I agree.

@lysnikolaou (Member)

The plan makes sense! Thanks for the thorough explanation @pablogsal!

@pablogsal marked this pull request as ready for review · May 18, 2023
@pablogsal (Member)

CC: @isidentical @lysnikolaou

Ok, we are switching directions. It turns out the handling I described was even more difficult, because we also need code that knows how to handle the escaped {{ and }}, and it would get even harder if we also had to parse these while identifying the entire f-string (with nested quotes) for later post-processing. We did some prototypes and the result was quite verbose and unmaintainable, and that's on top of the current tokenizer.

So we had an idea: what if we could reuse the C tokenizer, since it already knows how to handle all of this? The problem, as I said before, is that the C tokenizer emits tokens in a different fashion and doesn't even bother with some of them (like COMMENT or NL tokens). So what we have decided is to teach the C tokenizer to optionally emit these tokens and adapt it so that we can mimic the output of the Python tokenizer as closely as possible. There are some very minor things the Python tokenizer does that are quite odd (like emitting DEDENT tokens at the end with line numbers that do not exist) that we have decided not to replicate, but otherwise this is a HUGE win because:

  • We can remove the custom Python tokenizer entirely.
  • The Python tokenizer and the C tokenizer will never get out of sync.
  • Tokenization is now MUCH faster because it runs at the C level.
  • We get much better coverage of the weird, subtle things the C tokenizer does (we found a bunch of actual errors in the C implementation because of this).
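For anyone curious what the extra tokens look like, here is a minimal sketch using the module's documented API (the pure-Python implementation already emitted COMMENT and NL; the change described above teaches the C tokenizer to optionally do the same):

    import io
    import tokenize

    src = "x = 1  # a comment\n\n"
    # COMMENT and NL tokens show up in tokenize's output, although the C
    # tokenizer normally discards them when feeding the parser.
    for tok in tokenize.generate_tokens(io.StringIO(src).readline):
        print(tokenize.tok_name[tok.type], repr(tok.string))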

@pablogsal added the 🔨 test-with-buildbots label · May 19, 2023
@bedevere-bot

🤖 New build scheduled with the buildbot fleet by @pablogsal for commit f1a5090 🤖

If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.

@bedevere-bot removed the 🔨 test-with-buildbots label · May 19, 2023
@isidentical (Member) commented May 19, 2023

Quick question @pablogsal: Do we still maintain the untokenize(tokenize($src)) == $src guarantee with this switch (with the alternative mode enabled in the C tokenizer)?

Stuff like unnecessary DEDENTs might be relevant to the pretty code generation phase here, although I am not super sure (just a hunch):

cpython/Lib/tokenize.py, lines 204 to 207 at d78c3bc:

    elif tok_type == DEDENT:
        indents.pop()
        self.prev_row, self.prev_col = end
        continue

@pablogsal (Member)

> Quick question @pablogsal: Do we still maintain the untokenize(tokenize($src)) == $src guarantee with this switch (with the alternative mode enabled in the C tokenizer)?

I think we still do (the tests pass, and we spent a huge amount of time making sure they do with all the files, which was non-trivial). I may still be missing something, though.
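A minimal sketch of the round-trip property under discussion, written against the documented tokenize/untokenize API (this is not the actual test from the CPython suite):

    import io
    import tokenize

    def roundtrips(source: bytes) -> bool:
        # With full TokenInfo tuples, untokenize() is meant to reproduce the
        # original source bytes, which is the guarantee being discussed.
        tokens = list(tokenize.tokenize(io.BytesIO(source).readline))
        return tokenize.untokenize(tokens) == source

    print(roundtrips(b"def f(x):\n    return x  # comment\n"))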

Review comment on the removed test code below:

        self.assertEqual(tokens, expected_tokens,
                         "bytes not decoded with encoding")

    def test__tokenize_does_not_decode_with_encoding_none(self):

This is being removed because it was testing the _tokenize implementation, which doesn't exist anymore and is not public.

@pablogsal added the 🔨 test-with-buildbots label · May 20, 2023
@bedevere-bot

🤖 New build scheduled with the buildbot fleet by @pablogsal for commit 7fb58b0 🤖

If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.

@bedevere-bot removed the 🔨 test-with-buildbots label · May 20, 2023
@sunmy2019 (Member)

> what if we could reuse the C tokenizer, since it already knows how to handle all of this?

+1 for this. Just as I mentioned here: pablogsal#67 (comment)

@pablogsal (Member) left a comment

LGTM! Great job! 🚀

@AlexWaygood (Member)

Looks like this change broke IDLE:

@mgmacias95 deleted the python_tokenizer branch · May 21, 2023
@jayaddison

It's possible that the tokenizer changes here introduced some parsing-related test failures in Sphinx (sphinx-doc/sphinx#11436); we've begun looking into the cause.

(It may turn out to be a problem to resolve on the Sphinx side; I'm not yet sure whether it suggests a bug in CPython, but I thought it'd be worthwhile to mention.)

@pablogsal (Member)

Would it be possible to give us a self-contained reproducer? With that, we may be able to tell whether this is an expected change or a bug.

@pablogsal (Member)

Also, could you please open an issue for this once you have your reproducer?

@jayaddison

Yep, sure thing @pablogsal (on both counts: a repro case and a bug report to go along with it). I'll send those along when available (or will similarly confirm if it turns out to be a non-issue).

@jayaddison

Thanks to @mgmacias95's explanation, I believe the updated tokenizer is working as expected.

There is some code in Sphinx that handles what I'd consider a quirk of the previous tokenizer, and that code isn't compatible with the updated (improved, I think) representation of dedent tokens.

(apologies for the distraction, and thank you for the help)
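For anyone hitting a similar issue, a quick way to inspect how DEDENT tokens are reported on a given Python version (a minimal sketch; the positions of trailing DEDENTs are what the thread above describes as having changed):

    import io
    import tokenize

    src = "if True:\n    pass\n"
    for tok in tokenize.generate_tokens(io.StringIO(src).readline):
        if tok.type == tokenize.DEDENT:
            # Print the full TokenInfo, including start/end positions, so the
            # behaviour can be compared across versions.
            print(tok)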
