gh-102856: Python tokenizer implementation for PEP 701 (#104323)
This commit replaces the Python implementation of the tokenize module with one that reuses the
real C tokenizer via a private extension module. The tokenize module now implements a
compatibility layer that transforms tokens from the C tokenizer into the tokens the old Python
implementation produced, preserving backward compatibility.
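A quick way to see the preserved contract (a minimal sketch using only the public API; exact offsets depend on the input): tokenize.generate_tokens still yields TokenInfo named tuples with the type, string, start/end positions, and physical line that downstream code expects.

    import io
    import tokenize

    # The tokens now originate in the C tokenizer, but the public API still
    # yields the familiar TokenInfo named tuples.
    src = "x = 1  # a comment\n"
    for tok in tokenize.generate_tokens(io.StringIO(src).readline):
        print(tokenize.tok_name[tok.type], repr(tok.string), tok.start, tok.end)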

As the C tokenizer does not emit some tokens that the Python tokenizer provided (such as comments and non-semantic newlines), a new special mode has been added to the C tokenizer; it is currently used only via the extension module that exposes it to the Python layer. This mode forces the C tokenizer to emit the extra tokens and to attach the metadata needed to match the old Python implementation.
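Comments and non-semantic newlines therefore keep appearing in the stream that the tokenize module produces, even though the parser-facing tokenizer normally drops them; the private extension module itself is an implementation detail, not a public API. A small check through the public API:

    import io
    import tokenize

    src = "# just a comment\n\nx = 1\n"
    names = [tokenize.tok_name[t.type]
             for t in tokenize.generate_tokens(io.StringIO(src).readline)]
    # Expect COMMENT and NL entries alongside NAME/OP/NUMBER/NEWLINE/ENDMARKER.
    print(names)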

Co-authored-by: Pablo Galindo <pablogsal@gmail.com>
mgmacias95 and pablogsal authored May 21, 2023
1 parent 3ed57e4 commit 6715f91
Showing 22 changed files with 426 additions and 376 deletions.
4 changes: 4 additions & 0 deletions Doc/library/token-list.inc

(Generated file; diff not rendered.)

2 changes: 2 additions & 0 deletions Doc/library/token.rst
@@ -50,11 +50,13 @@ The following token type values aren't used by the C tokenizer but are needed for
the :mod:`tokenize` module.

.. data:: COMMENT
:noindex:

Token value used to indicate a comment.


.. data:: NL
:noindex:

Token value used to indicate a non-terminating newline. The
:data:`NEWLINE` token indicates the end of a logical line of Python code;
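The NL/NEWLINE distinction described above is easy to observe on a logical line that spans two physical lines (a small illustration):

    import io
    import tokenize

    # The line break inside the parentheses is non-semantic, so it tokenizes
    # as NL; the final line break ends the logical line and tokenizes as NEWLINE.
    src = "total = (1 +\n         2)\n"
    for tok in tokenize.generate_tokens(io.StringIO(src).readline):
        if tok.type in (tokenize.NL, tokenize.NEWLINE):
            print(tokenize.tok_name[tok.type], tok.start)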
4 changes: 2 additions & 2 deletions Grammar/Tokens
@@ -64,9 +64,9 @@ SOFT_KEYWORD
FSTRING_START
FSTRING_MIDDLE
FSTRING_END
COMMENT
NL
ERRORTOKEN

# These aren't used by the C tokenizer but are needed for tokenize.py
COMMENT
NL
ENCODING
1 change: 1 addition & 0 deletions Include/internal/pycore_global_objects_fini_generated.h

(Generated file; diff not rendered.)

1 change: 1 addition & 0 deletions Include/internal/pycore_global_strings.h
@@ -406,6 +406,7 @@ struct _Py_global_strings {
STRUCT_FOR_ID(exception)
STRUCT_FOR_ID(exp)
STRUCT_FOR_ID(extend)
STRUCT_FOR_ID(extra_tokens)
STRUCT_FOR_ID(facility)
STRUCT_FOR_ID(factory)
STRUCT_FOR_ID(false)
1 change: 1 addition & 0 deletions Include/internal/pycore_runtime_init_generated.h

(Generated file; diff not rendered.)

4 changes: 3 additions & 1 deletion Include/internal/pycore_token.h
@@ -77,7 +77,9 @@ extern "C" {
#define FSTRING_START 61
#define FSTRING_MIDDLE 62
#define FSTRING_END 63
#define ERRORTOKEN 64
#define COMMENT 64
#define NL 65
#define ERRORTOKEN 66
#define N_TOKENS 68
#define NT_OFFSET 256

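Because COMMENT and NL now sit in the C tokenizer's numbering ahead of ERRORTOKEN, the numeric token values shift; code should compare against the named constants rather than hard-coded integers. A quick check (the values printed match this commit and may differ in other releases):

    import token

    for name in ("COMMENT", "NL", "ERRORTOKEN"):
        # 64, 65 and 66 respectively as of this commit.
        print(name, getattr(token, name))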
3 changes: 3 additions & 0 deletions Include/internal/pycore_unicodeobject_generated.h

(Generated file; diff not rendered.)

4 changes: 2 additions & 2 deletions Lib/inspect.py
@@ -2187,15 +2187,15 @@ def _signature_strip_non_python_syntax(signature):
if string == ',':
current_parameter += 1

if (type == ERRORTOKEN) and (string == '$'):
if (type == OP) and (string == '$'):
assert self_parameter is None
self_parameter = current_parameter
continue

add(string)
if (string == ','):
add(' ')
clean_signature = ''.join(text)
clean_signature = ''.join(text).strip()
return clean_signature, self_parameter


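The inspect change tracks how the compatibility layer now classifies the '$' marker that __text_signature__ strings use for the self parameter: it arrives as an OP token instead of ERRORTOKEN. A quick probe that prints whatever the running interpreter reports:

    import io
    import tokenize

    sig = "($self, /, value)"
    for tok in tokenize.generate_tokens(io.StringIO(sig).readline):
        if tok.string == "$":
            # OP after this change; ERRORTOKEN on older versions.
            print(tokenize.tok_name[tok.type])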
10 changes: 10 additions & 0 deletions Lib/tabnanny.py
@@ -107,6 +107,10 @@ def check(file):
errprint("%r: Token Error: %s" % (file, msg))
return

except SyntaxError as msg:
errprint("%r: Token Error: %s" % (file, msg))
return

except IndentationError as msg:
errprint("%r: Indentation Error: %s" % (file, msg))
return
@@ -272,6 +276,12 @@ def format_witnesses(w):
return prefix + " " + ', '.join(firsts)

def process_tokens(tokens):
try:
_process_tokens(tokens)
except TabError as e:
raise NannyNag(e.lineno, e.msg, e.text)

def _process_tokens(tokens):
INDENT = tokenize.INDENT
DEDENT = tokenize.DEDENT
NEWLINE = tokenize.NEWLINE
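With the wrapper above, TabError raised by the tokenizer is converted into the NannyNag report that callers already handle, and check() gains a matching SyntaxError path. Programmatic use is unchanged (a minimal sketch; 'example.py' is a hypothetical path):

    import tabnanny

    # Reports ambiguous tab/space indentation; diagnostics are printed just
    # as they are for `python -m tabnanny example.py`.
    tabnanny.verbose = 1
    tabnanny.check("example.py")  # hypothetical file to check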
4 changes: 2 additions & 2 deletions Lib/test/test_tabnanny.py
@@ -223,7 +223,7 @@ def test_when_nannynag_error_verbose(self):
with TemporaryPyFile(SOURCE_CODES["nannynag_errored"]) as file_path:
out = f"{file_path!r}: *** Line 3: trouble in tab city! ***\n"
out += "offending line: '\\tprint(\"world\")\\n'\n"
out += "indent not equal e.g. at tab size 1\n"
out += "inconsistent use of tabs and spaces in indentation\n"

tabnanny.verbose = 1
self.verify_tabnanny_check(file_path, out=out)
@@ -315,7 +315,7 @@ def validate_cmd(self, *args, stdout="", stderr="", partial=False, expect_failur
def test_with_errored_file(self):
"""Should displays error when errored python file is given."""
with TemporaryPyFile(SOURCE_CODES["wrong_indented"]) as file_path:
stderr = f"{file_path!r}: Indentation Error: "
stderr = f"{file_path!r}: Token Error: "
stderr += ('unindent does not match any outer indentation level'
' (<tokenize>, line 3)')
self.validate_cmd(file_path, stderr=stderr, expect_failure=True)
