List index out of range + unparseable UTF8 chars #430

Warchant · 2023-10-04T12:00:49Z

After toml was not able to parse no-break-space (u'\uc2a0'), I became curious - what other utf8 chars cannot be parsed by toml.

Came up with a following script:

import struct
import sys
import traceback

import toml


def find_all_chars_that_are_unsupported_by_toml_parser():
    bad = []
    for a in range(0, 256):
        for b in range(0, 256):
            c = None
            try:
                c = struct.pack("BB", a, b)
                toml.loads(f"{c.decode('utf-8')}")
            except UnicodeDecodeError:
                pass
            except IndexError as e:
                traceback.print_exc()
                print(c)
                sys.exit(1)
            except toml.TomlDecodeError as e:
                print(e)
                bad.append(c)

    return bad



bad = find_all_chars_that_are_unsupported_by_toml_parser()
print(len(bad))

One interesting thing is that when c=b'\x00\r', then toml.loads fails with IndexError:

    toml.loads(f"{c.decode('utf-8')}")
  File "C:\W\cli\.venv\lib\site-packages\toml\decoder.py", line 207, in loads
    if item == '\r' and sl[i + 1] == '\n':
IndexError: list index out of range

This issue is a request to fix IndexError.
If you replace this:

-toml.loads(f"{c.decode('utf-8')}")
+toml.loads(f"c = '{c.decode('utf-8')}'")

Then IndexError is gone, but there are 506 (on Windows) symbols cannot be parsed by toml.

The text was updated successfully, but these errors were encountered:

JamesParrott · 2023-10-06T11:33:48Z

Well done enumerating all the possibilities. I hope the devs deem fit to address it and give this one (and all the others) the attention they deserve.

In the mean time, while you wait for a fix, there are plenty of other great options. Don't feel that you too need to fork your own TOML reader and writer library like I did....

Why don't you formalise your findings, and add a test to: https://github.com/uiri/toml/blob/master/tests/test_api.py as a PR?

You'll face the same problem I did - you'll submit a PR that causes the CI pipeline to run the test to fail.

However this is because the underlying problems are: firstly the code in toml is broken, and secondly the existing test masks this problem so is also broken.

When code is broken, it should fail a test. Writing a test that fails is step 1 in Test Driven Development.

Just like the lack of tests does not imply code is correct, the existence of a broken test, no other test covering that area, and then passing all the tests, does not imply code is correct either.

Warchant · 2023-10-08T13:36:03Z

A question though - if toml file contains non-standard (utf8) spaces (such as zero-width space), should toml parsing succeed or fail?

Now it fails.

JamesParrott · 2023-10-08T20:41:00Z

A question though - if toml file contains non-standard (utf8) spaces (such as zero-width space), should toml parsing succeed or fail?

Now it fails.

If I understand it and recall it correctly:

String values must always be quoted, so the file ought to be parsed as long as non-standard white space is quoted (all else being well).

In Toml 1.0.0, no type of white space at all can be in a bare key (unquoted).

unquoted-key = 1*( ALPHA / DIGIT / %x2D / %x5F ) ; A-Z / a-z / 0-9 / - / _

So if the file contains unquoted non-standard whitespace, correct behaviour of a "strict-mode" Toml 1.0.0 parser is to raise an error. But I think one test suites lets the tester choose to allow things like this still to be parsed.

toml = expression *( newline expression )

expression =  ws [ comment ]
expression =/ ws keyval ws [ comment ]
expression =/ ws table ws [ comment ]

;; Whitespace

ws = *wschar
wschar =  %x20  ; Space
wschar =/ %x09  ; Horizontal tab

https://github.com/toml-lang/toml/blob/8eae5e1c005bc5836098505f85a7aa06568999dd/toml.abnf#L18C1-L28C33

But Toml is still a language under active development. In the latest WIP, even Emoji could be legal in bare keys. I'm not familiar with ABNF notation or unicode ranges to say for sure what the ranges below contain
, but I believe the intention was still to exclude any type of white space from bare keys.

;; Unquoted key

unquoted-key = 1*unquoted-key-char
unquoted-key-char = ALPHA / DIGIT / %x2D / %x5F         ; a-z A-Z 0-9 - _
unquoted-key-char =/ %xB2 / %xB3 / %xB9 / %xBC-BE       ; superscript digits, fractions
unquoted-key-char =/ %xC0-D6 / %xD8-F6 / %xF8-37D       ; non-symbol chars in Latin block
unquoted-key-char =/ %x37F-1FFF                         ; exclude GREEK QUESTION MARK, which is basically a semi-colon
unquoted-key-char =/ %x200C-200D / %x203F-2040          ; from General Punctuation Block, include the two tie symbols and ZWNJ, ZWJ
unquoted-key-char =/ %x2070-218F / %x2460-24FF          ; include super-/subscripts, letterlike/numberlike forms, enclosed alphanumerics
unquoted-key-char =/ %x2C00-2FEF / %x3001-D7FF          ; skip arrows, math, box drawing etc, skip 2FF0-3000 ideographic up/down markers and spaces
unquoted-key-char =/ %xF900-FDCF / %xFDF0-FFFD          ; skip D800-DFFF surrogate block, E000-F8FF Private Use area, FDD0-FDEF intended for process-internal use (unicode)
unquoted-key-char =/ %x10000-EFFFF                      ; all chars outside BMP range, excluding Private Use planes (F0000-10FFFF)

toml-lang/toml#891
https://github.com/toml-lang/toml/blob/23c3fb79f3f54ebc01110b963d7119006d91facc/toml.abnf#L55

pradyunsg added component: decoder Related to parsing in `toml.load` syntax: strings Related to string literals labels Oct 7, 2023

Warchant pushed a commit to Warchant/toml that referenced this issue Oct 8, 2023

Fix uiri#430: list index out of range

7d2fae0

Warchant pushed a commit to Warchant/toml that referenced this issue Oct 8, 2023

Fix uiri#430: list index out of range

41aaa56

uiri closed this as completed in 06f54ed Oct 8, 2023

uiri mentioned this issue Oct 8, 2023

Fix #430: list index out of range #431

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

List index out of range + unparseable UTF8 chars #430

List index out of range + unparseable UTF8 chars #430

Warchant commented Oct 4, 2023

JamesParrott commented Oct 6, 2023 •

edited

Loading

Warchant commented Oct 8, 2023

JamesParrott commented Oct 8, 2023 •

edited

Loading

List index out of range + unparseable UTF8 chars #430

List index out of range + unparseable UTF8 chars #430

Comments

Warchant commented Oct 4, 2023

JamesParrott commented Oct 6, 2023 • edited Loading

Warchant commented Oct 8, 2023

JamesParrott commented Oct 8, 2023 • edited Loading

JamesParrott commented Oct 6, 2023 •

edited

Loading

JamesParrott commented Oct 8, 2023 •

edited

Loading