Untokenize and retokenize does not round-trip #83134

Open

Zac-HD mannequin opened this issue Dec 2, 2019 · 3 comments
Labels
3.10 (only security fixes) · stdlib (Python modules in the Lib dir) · type-bug (An unexpected behavior, bug, or error)

Comments

Zac-HD mannequin commented Dec 2, 2019

BPO 38953
Nosy @terryjreedy, @meadori, @Zac-HD

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2019-12-02.11:06:14.288>
labels = ['type-bug', 'library', '3.10']
title = 'Untokenize and retokenize does not round-trip'
updated_at = <Date 2020-10-24.00:58:47.260>
user = 'https://github.com/Zac-HD'

bugs.python.org fields:

activity = <Date 2020-10-24.00:58:47.260>
actor = 'Zac Hatfield-Dodds'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation = <Date 2019-12-02.11:06:14.288>
creator = 'Zac Hatfield-Dodds'
dependencies = []
files = []
hgrepos = []
issue_num = 38953
keywords = []
message_count = 3.0
messages = ['357704', '379498', '379506']
nosy_count = 3.0
nosy_names = ['terry.reedy', 'meador.inge', 'Zac Hatfield-Dodds']
pr_nums = []
priority = 'normal'
resolution = None
stage = 'test needed'
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue38953'
versions = ['Python 3.10']


Zac-HD mannequin commented Dec 2, 2019

I've been working on a tool called Hypothesmith - https://github.com/Zac-HD/hypothesmith - to generate arbitrary Python source code, inspired by CSmith's success in finding C compiler bugs. It's based on the grammar, but ultimately only generates strings that compile accepts; this is the only way I know of to answer the question "is this string valid Python?"!

I should be clear that I don't think the minimal examples are representative of real problems that users may encounter! However, fuzzing is very effective at finding important bugs if we can get these apparently-trivial ones out of the way by changing either the code or the test :-)

@example("#")
@example("\n\\\n")
@example("#\n\x0cpass#\n")
@given(source_code=hypothesmith.from_grammar().map(fixup).filter(str.strip))
def test_tokenize_round_trip_string(source_code):
    tokens = list(tokenize.generate_tokens(io.StringIO(source_code).readline))
    outstring = tokenize.untokenize(tokens)  # may have changed whitespace from source
    output = tokenize.generate_tokens(io.StringIO(outstring).readline)
    assert [(t.type, t.string) for t in tokens] == [(t.type, t.string) for t in output]

Each of the @example cases is accepted by compile but fails the test; the @given case describes how to generate more such strings. You can read more details in the Hypothesmith repo if interested.

I think these are real but probably unimportant bugs, and I'd love to start a conversation about what properties should *always* hold for functions dealing with Python source code - and about how best to report research results if I can demonstrate that they don't!

(for example, lib2to3 has many similar failures but I don't want to open a long list of low-value issues)

Zac-HD mannequin added the 3.8 (only security fixes), stdlib (Python modules in the Lib dir), and type-bug (An unexpected behavior, bug, or error) labels on Dec 2, 2019
terryjreedy (Member) commented

Zac, thank you for Hypothesmith. I am thinking about how to maybe use it to test certain IDLE functions. But your 2nd example, as posted, does not compile, even with 3.8. Typo?

Thank you also for the two failure examples. I worked on untokenize in 2013 and am not surprised that there are still bugs. The test assertion matches the doc's claim that the result of untokenize "is guaranteed to tokenize back to match the input" as far as type and string go. To get output that could be used to fix the bugs, I converted the test to unittest (and ran it with 3.10).

from io import StringIO as SIO
import tokenize
import unittest

class RoundtripTest(unittest.TestCase):
    def test_examples(self):
        examples = ("#", "\n\\\n", "#\n\x0cpass#\n",)
        for code in examples:
            with self.subTest(code=code):
                tokens = list(tokenize.generate_tokens(SIO(code).readline))
                print(tokens)
                outstring = tokenize.untokenize(tokens)  # may change whitespace from source
                print(outstring)
                output = tokenize.generate_tokens(SIO(outstring).readline)
                self.assertEqual([(t.type, t.string) for t in tokens],
                                 [(t.type, t.string) for t in output])

unittest.main()

"#" compiles: untokenize calls add_whitespace, which failed on line 173 with
ValueError: start (1,1) precedes previous end (2,0)
tokens = [
TokenInfo(type=60 (COMMENT), string='#', start=(1, 0), end=(1, 1), line='#'),
TokenInfo(type=61 (NL), string='', start=(1, 1), end=(1, 1), line='#'),
TokenInfo(type=4 (NEWLINE), string='', start=(1, 1), end=(1, 2), line=''),
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')]
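
For reference, the failure can be reproduced standalone along these lines (behavior as observed with 3.10; the exact message may differ in other versions):

import io
import tokenize

tokens = list(tokenize.generate_tokens(io.StringIO("#").readline))
try:
    tokenize.untokenize(tokens)
except ValueError as e:
    print(e)  # start (1,1) precedes previous end (2,0)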

The doc for NL, a tokenize-only token, says "Token value used to indicate a non-terminating newline. The NEWLINE token indicates the end of a logical line of Python code; NL tokens are generated when a logical line of code is continued over multiple physical lines." The NL token seems to be a mistake here.

Calling add_whitespace also seems like a mistake. In any case, raising on a valid token stream is obviously bad.

"\n\\\n" does not compile in 3.8 or 3.10.
>>> compile("\n\\\n", '', 'exec')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "", line 2
    \
     ^
SyntaxError: unexpected EOF while parsing

generate_tokens calls _tokenize, which failed on line 521, as it should, with
tokenize.TokenError: ('EOF in multi-line statement', (3, 0))
Nothing to fix here.
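
The expected TokenError can be shown standalone too (again as observed with 3.10):

import io
import tokenize

try:
    # generate_tokens is lazy, so list() is needed to force tokenization
    list(tokenize.generate_tokens(io.StringIO("\n\\\n").readline))
except tokenize.TokenError as e:
    print(e)  # ('EOF in multi-line statement', (3, 0))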

"#\n\x0cpass#\n": outstring is '#\n pass#\n', which fails compile with IndentationError.
[(60, '#'), (61, '\n'),           (1, 'pass'), (60, '#'), (4, '\n'),          (0, '')] !=
[(60, '#'), (61, '\n'), (5, ' '), (1, 'pass'), (60, '#'), (4, '\n'), (6, ''), (0, '')]

test_tokenize tests the roundtrip with various modules and test strings, but perhaps none with formfeed. I think the bug is tokenizing 'pass' as starting in column 1 instead of column 0.
TokenInfo(type=1 (NAME), string='pass', start=(2, 1), end=(2, 5), line='\x0cpass#\n')

Formfeed = '\f' = '\x0c' is legal here precisely because it is non-space 'whitespace' that does not advance the column counter. In _tokenize, \f sets the column counter for indentation to 0. Otherwise, there would be an IndentationError, as there is with the outstring. But the string position ignores the indentation counter. Either the string position must be adjusted, so \f is replaced with nothing, or a token for \f must be emitted so it is not replaced.

Tokens 5 and 6 are INDENT and DEDENT, so the latter will go away with the former.
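
Both symptoms can be seen together in a few lines (as observed with 3.10):

import io
import tokenize

src = "#\n\x0cpass#\n"
tokens = list(tokenize.generate_tokens(io.StringIO(src).readline))
name = next(t for t in tokens if t.type == tokenize.NAME)
print(name.start)  # (2, 1): column 1, even though \f reset the indent counter to 0
print(repr(tokenize.untokenize(tokens)))  # '#\n pass#\n': the \f comes back as a space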

What is a bit silly about untokenize is that it ignores the physical line of 5-tuples even when present. Another issue, along with the dreadful API.

A note could be added to https://docs.python.org/3/reference/lexical_analysis.html#whitespace-between-tokens when _tokenize is patched.
---

BPO is aimed at facilitating patches. Other discussions are best done elsewhere. But I have a quick question and comment.

Can Hypothesis be integrated as-is with unittest? Does it work to decorate test_xyz and get a sensible report of multiple failures? Is there now, or might there be in the future, an iterator interface, so that one could write "for testcase in testcases: with subTest: ..."?

About properties: your write-up at https://github.com/Zac-HD/stdlib-property-tests pointed me to metamorphic testing (https://www.hillelwayne.com/post/metamorphic-testing/), which led me to the PDF version of "Metamorphic Testing: A Review of Challenges and Opportunities". Most properties I have seen correspond to metamorphic relations, and 'metamorphic' is broader than may be immediately obvious. I would like to discuss this more on a better channel.

terryjreedy added the 3.10 (only security fixes) label and removed the 3.8 (only security fixes) label on Oct 24, 2020

Zac-HD mannequin commented Oct 24, 2020

Thanks for your comments, Terry - I'm delighted that it's useful. It's been a while since I wrote that example, and it's entirely possible that it's just a typo.

Hypothesis does indeed support unittest, including for multiple-failure reporting and so on.
You can see my unittest implementation of the tokenise tests at https://github.com/Zac-HD/stdlib-property-tests/blob/b5ef97f9e7fd1b0e7a028823e436f78f374cf0dc/tests/test_source_code.py#L87-L133

Subtests are a little tricky, because the default interaction is to store the subtests for *all* test cases generated by Hypothesis. We therefore monkeypatch subTest to a no-op (see the sketch below), though more sophisticated handling is almost certainly possible. More generally, when using Hypothesis I would usually have @given supply a single input and call the test method once per input; this replaces the usual pattern of looping over a list of inputs with subTest.
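
A rough sketch of that monkeypatch, for the curious - the real version lives in the stdlib-property-tests repo linked above, and the names here are illustrative only:

import contextlib
import unittest

@contextlib.contextmanager
def _noop_subtest(self, *args, **kwargs):
    # Discard the subTest bookkeeping entirely, so Hypothesis does not
    # accumulate one stored subtest per generated input.
    yield

unittest.TestCase.subTest = _noop_subtest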

An iterator interface is not available because in general (see e.g. hypothesis.strategies.data() or stateful testing) it is not possible to separate data generation from test execution, and also because Hypothesis uses feedback from previous inputs in deciding what to generate. Instead of "for testcase in testcases: with subTest: ...", I'd write "@given(testcase=...) def test_foo(self, testcase): ...".
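
Concretely, that pattern looks something like this (the strategy is just a placeholder for illustration):

import unittest
from hypothesis import given, strategies as st

class ExampleTest(unittest.TestCase):
    @given(testcase=st.text())
    def test_foo(self, testcase):
        # Hypothesis calls this method once per generated input,
        # playing the role of the traditional subTest loop.
        self.assertIsInstance(testcase, str)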

I've sent you an email for the other conversation. (anyone else interested is invited to get in touch via zhd.dev :-))
