Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

List index out of range + unparseable UTF8 chars #430

Closed
Warchant opened this issue Oct 4, 2023 · 3 comments
Closed

List index out of range + unparseable UTF8 chars #430

Warchant opened this issue Oct 4, 2023 · 3 comments
Labels
component: decoder Related to parsing in `toml.load` syntax: strings Related to string literals

Comments

@Warchant
Copy link
Contributor

Warchant commented Oct 4, 2023

After toml was not able to parse no-break-space (u'\uc2a0'), I became curious - what other utf8 chars cannot be parsed by toml.

Came up with a following script:

import struct
import sys
import traceback

import toml


def find_all_chars_that_are_unsupported_by_toml_parser():
    bad = []
    for a in range(0, 256):
        for b in range(0, 256):
            c = None
            try:
                c = struct.pack("BB", a, b)
                toml.loads(f"{c.decode('utf-8')}")
            except UnicodeDecodeError:
                pass
            except IndexError as e:
                traceback.print_exc()
                print(c)
                sys.exit(1)
            except toml.TomlDecodeError as e:
                print(e)
                bad.append(c)

    return bad



bad = find_all_chars_that_are_unsupported_by_toml_parser()
print(len(bad))

One interesting thing is that when c=b'\x00\r', then toml.loads fails with IndexError:

    toml.loads(f"{c.decode('utf-8')}")
  File "C:\W\cli\.venv\lib\site-packages\toml\decoder.py", line 207, in loads
    if item == '\r' and sl[i + 1] == '\n':
IndexError: list index out of range

This issue is a request to fix IndexError.
If you replace this:

-toml.loads(f"{c.decode('utf-8')}")
+toml.loads(f"c = '{c.decode('utf-8')}'")

Then IndexError is gone, but there are 506 (on Windows) symbols cannot be parsed by toml.

@JamesParrott
Copy link

JamesParrott commented Oct 6, 2023

Well done enumerating all the possibilities. I hope the devs deem fit to address it and give this one (and all the others) the attention they deserve.

In the mean time, while you wait for a fix, there are plenty of other great options. Don't feel that you too need to fork your own TOML reader and writer library like I did....

Why don't you formalise your findings, and add a test to: https://github.com/uiri/toml/blob/master/tests/test_api.py as a PR?

You'll face the same problem I did - you'll submit a PR that causes the CI pipeline to run the test to fail.

However this is because the underlying problems are: firstly the code in toml is broken, and secondly the existing test masks this problem so is also broken.

When code is broken, it should fail a test. Writing a test that fails is step 1 in Test Driven Development.

Just like the lack of tests does not imply code is correct, the existence of a broken test, no other test covering that area, and then passing all the tests, does not imply code is correct either.

@pradyunsg pradyunsg added component: decoder Related to parsing in `toml.load` syntax: strings Related to string literals labels Oct 7, 2023
Warchant pushed a commit to Warchant/toml that referenced this issue Oct 8, 2023
@Warchant
Copy link
Contributor Author

Warchant commented Oct 8, 2023

A question though - if toml file contains non-standard (utf8) spaces (such as zero-width space), should toml parsing succeed or fail?

Now it fails.

Warchant pushed a commit to Warchant/toml that referenced this issue Oct 8, 2023
@uiri uiri closed this as completed in 06f54ed Oct 8, 2023
@JamesParrott
Copy link

JamesParrott commented Oct 8, 2023

A question though - if toml file contains non-standard (utf8) spaces (such as zero-width space), should toml parsing succeed or fail?

Now it fails.

If I understand it and recall it correctly:

String values must always be quoted, so the file ought to be parsed as long as non-standard white space is quoted (all else being well).

In Toml 1.0.0, no type of white space at all can be in a bare key (unquoted).

unquoted-key = 1*( ALPHA / DIGIT / %x2D / %x5F ) ; A-Z / a-z / 0-9 / - / _

So if the file contains unquoted non-standard whitespace, correct behaviour of a "strict-mode" Toml 1.0.0 parser is to raise an error. But I think one test suites lets the tester choose to allow things like this still to be parsed.

toml = expression *( newline expression )

expression =  ws [ comment ]
expression =/ ws keyval ws [ comment ]
expression =/ ws table ws [ comment ]

;; Whitespace

ws = *wschar
wschar =  %x20  ; Space
wschar =/ %x09  ; Horizontal tab

https://github.com/toml-lang/toml/blob/8eae5e1c005bc5836098505f85a7aa06568999dd/toml.abnf#L18C1-L28C33

But Toml is still a language under active development. In the latest WIP, even Emoji could be legal in bare keys. I'm not familiar with ABNF notation or unicode ranges to say for sure what the ranges below contain
, but I believe the intention was still to exclude any type of white space from bare keys.

;; Unquoted key

unquoted-key = 1*unquoted-key-char
unquoted-key-char = ALPHA / DIGIT / %x2D / %x5F         ; a-z A-Z 0-9 - _
unquoted-key-char =/ %xB2 / %xB3 / %xB9 / %xBC-BE       ; superscript digits, fractions
unquoted-key-char =/ %xC0-D6 / %xD8-F6 / %xF8-37D       ; non-symbol chars in Latin block
unquoted-key-char =/ %x37F-1FFF                         ; exclude GREEK QUESTION MARK, which is basically a semi-colon
unquoted-key-char =/ %x200C-200D / %x203F-2040          ; from General Punctuation Block, include the two tie symbols and ZWNJ, ZWJ
unquoted-key-char =/ %x2070-218F / %x2460-24FF          ; include super-/subscripts, letterlike/numberlike forms, enclosed alphanumerics
unquoted-key-char =/ %x2C00-2FEF / %x3001-D7FF          ; skip arrows, math, box drawing etc, skip 2FF0-3000 ideographic up/down markers and spaces
unquoted-key-char =/ %xF900-FDCF / %xFDF0-FFFD          ; skip D800-DFFF surrogate block, E000-F8FF Private Use area, FDD0-FDEF intended for process-internal use (unicode)
unquoted-key-char =/ %x10000-EFFFF                      ; all chars outside BMP range, excluding Private Use planes (F0000-10FFFF)

toml-lang/toml#891
https://github.com/toml-lang/toml/blob/23c3fb79f3f54ebc01110b963d7119006d91facc/toml.abnf#L55

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component: decoder Related to parsing in `toml.load` syntax: strings Related to string literals
Projects
None yet
Development

No branches or pull requests

3 participants