gh-102856: Python tokenizer implementation for PEP 701 #104323
Conversation
mgmacias95 commented May 9, 2023 (edited by bedevere-bot)
- Issue: PEP 701 – Syntactic formalization of f-strings #102856
A thought: should this be aligned with the C tokenizer? If so, we can add tests to compare the Python tokenizer and the internal C tokenizer.
It should be aligned with the C tokenizer, but some tokens differ. For example, the C tokenizer returns LBRACE and RBRACE tokens ({ and }) while the Python one just returns an OP token. Matching tests sound like a good idea to make sure they stay aligned. I can add them :).
Not sure about matching tests. There are many, often very slight, differences between the implementations of the C tokenizer and the Python tokenize module, and that's something we've been okay with for a long time. I wonder whether writing those tests is unnecessary effort.
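A minimal sketch of the difference described above, using only the public tokenize API (not code from this PR): the pure-Python tokenizer reports a generic OP token, while TokenInfo.exact_type still recovers the specific operator such as LBRACE or RBRACE.

```python
import io
import tokenize

src = "d = {1: 2}\n"
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    if tok.type == tokenize.OP:
        # The stream says OP, but exact_type recovers LBRACE, COLON, RBRACE, ...
        print(tok.string, tokenize.tok_name[tok.exact_type])
```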
Thanks @mgmacias95 for working on this! I just had a first look at it and it looks great in general.
However, I think that the regex for matching f-strings is going to fail for nested strings that use the same quote. Since this is something the PEP explicitly allows and that has been somewhat "advertised", I feel that most people would expect the tokenize module to support it as well, especially since we're putting in the work to support the new tokens.
Am I missing something? Is that maybe handled a different way? How do others feel about not supporting nested strings with the same quotation?
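For reference, this is the kind of input under discussion; a quick check (assuming Python 3.12+, where the FSTRING_START/FSTRING_MIDDLE/FSTRING_END tokens added by this work exist) that nested f-strings reusing the same quote tokenize correctly:

```python
import io
import tokenize

# Nested f-string reusing the same quote character, as allowed by PEP 701.
src = 'f"outer {f"inner {1 + 1}"} end"\n'
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))
```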
Sorry for the lack of context, let me explain the plan:
I don't think this is possible, as the two tokenizers are incompatible: one of them emits tokens that the other does not, and they also emit different tokens for the same input (the Python tokenizer emits a generic OP token where the C tokenizer emits exact tokens such as LBRACE and RBRACE). What do you think?
It makes sense.
I see the point here. I agree.
The plan makes sense! Thanks for the thorough explanation @pablogsal!
Ok, we are switching directions. Turns out the handling I described was even more difficult because we also need to factor in code that knows how to handle escaped characters. So we had an idea: what if we could reuse the C tokenizer, since it already knows how to handle this? The problem, as I said before, is that the C tokenizer emits tokens in a different fashion and doesn't even bother emitting some of them (such as the COMMENT and NL tokens that the tokenize module produces).
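A minimal sketch of that adaptation idea (a hypothetical helper, not the code in this PR): collapse the exact operator types the C tokenizer uses back into the generic OP token that tokenize.py has historically emitted, relying on token.EXACT_TOKEN_TYPES and leaving exact_type to recover the operator.

```python
import token
from tokenize import TokenInfo

# token.EXACT_TOKEN_TYPES maps operator strings to exact types, e.g. "{" -> LBRACE.
EXACT_OPERATOR_TYPES = set(token.EXACT_TOKEN_TYPES.values())

def collapse_exact_types(tokens):
    """Re-emit exact operator tokens as the generic OP token.

    TokenInfo.exact_type still recovers the specific operator from the
    token string, so no information is lost.
    """
    for tok in tokens:
        if tok.type in EXACT_OPERATOR_TYPES:
            yield TokenInfo(token.OP, tok.string, tok.start, tok.end, tok.line)
        else:
            yield tok
```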
🤖 New build scheduled with the buildbot fleet by @pablogsal for commit f1a5090 🤖 If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.
Quick question @pablogsal: do we still maintain the round-trip property? Stuff like what's tested in Lines 204 to 207 in d78c3bc?
I think we still do (the tests pass, and we spent a huge amount of time making sure they do with all the files, which was non-trivial). I may still be missing something though.
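For context, the round-trip property at stake can be checked with the public tokenize API; a minimal example with deliberately odd spacing:

```python
import io
import tokenize

# Deliberately irregular spacing; a lossless round trip must preserve it.
src = "def f( a ,b ):\n    return (a+ b)\n"
tokens = list(tokenize.generate_tokens(io.StringIO(src).readline))
assert tokenize.untokenize(tokens) == src
```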
        self.assertEqual(tokens, expected_tokens,
                         "bytes not decoded with encoding")

    def test__tokenize_does_not_decode_with_encoding_none(self):
This is being removed because it was testing the _tokenize implementation, which doesn't exist anymore and was never public.
🤖 New build scheduled with the buildbot fleet by @pablogsal for commit 7fb58b0 🤖 If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.
+1 for this, just like I mentioned here: pablogsal#67 (comment)
LGTM! Great job! 🚀
Looks like this change broke IDLE:
It's possible that the tokenizer changes here introduced some parsing-related test failures in Sphinx: sphinx-doc/sphinx#11436. We've begun looking into the cause. (It may turn out to be a problem to resolve on the Sphinx side; I'm not yet sure whether it suggests a bug in CPython, but I thought it'd be worthwhile to mention.)
Would it be possible to give us a self-contained reproducer? With that, we may be able to tell whether this is an expected change or a bug.
Also, could you please open an issue for this once you have your reproducer?
Yep, sure thing @pablogsal (on both counts: a repro case and a bug report to go along with it). I'll send those along when available (or will similarly confirm if it turns out to be a non-issue).
Thanks to @mgmacias95's explanation, I believe that the updated tokenizer is working as expected. There is some code in Sphinx to handle what I'd consider a quirk of the previous tokenizer, and that code isn't compatible with the updated (improved, I think) representation of DEDENT tokens. (Apologies for the distraction, and thank you for the help!)
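A minimal way to inspect that DEDENT representation with the public tokenize API (the exact positions printed depend on the interpreter version):

```python
import io
import tokenize

src = "if x:\n    pass\n"
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    # DEDENT tokens carry an empty string; comparing their reported start
    # positions across interpreter versions shows the change discussed here.
    print(tokenize.tok_name[tok.type], tok.start, repr(tok.string))
```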