-
Notifications
You must be signed in to change notification settings - Fork 190
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
List index out of range + unparseable UTF8 chars #430
Comments
Well done enumerating all the possibilities. I hope the devs deem fit to address it and give this one (and all the others) the attention they deserve. In the mean time, while you wait for a fix, there are plenty of other great options. Don't feel that you too need to fork your own TOML reader and writer library like I did.... Why don't you formalise your findings, and add a test to: https://github.com/uiri/toml/blob/master/tests/test_api.py as a PR? You'll face the same problem I did - you'll submit a PR that causes the CI pipeline to run the test to fail. However this is because the underlying problems are: firstly the code in toml is broken, and secondly the existing test masks this problem so is also broken. When code is broken, it should fail a test. Writing a test that fails is step 1 in Test Driven Development. Just like the lack of tests does not imply code is correct, the existence of a broken test, no other test covering that area, and then passing all the tests, does not imply code is correct either. |
A question though - if toml file contains non-standard (utf8) spaces (such as zero-width space), should toml parsing succeed or fail? Now it fails. |
If I understand it and recall it correctly: String values must always be quoted, so the file ought to be parsed as long as non-standard white space is quoted (all else being well). In Toml 1.0.0, no type of white space at all can be in a bare key (unquoted).
So if the file contains unquoted non-standard whitespace, correct behaviour of a "strict-mode" Toml 1.0.0 parser is to raise an error. But I think one test suites lets the tester choose to allow things like this still to be parsed.
But Toml is still a language under active development. In the latest WIP, even Emoji could be legal in bare keys. I'm not familiar with ABNF notation or unicode ranges to say for sure what the ranges below contain
toml-lang/toml#891 |
After
toml
was not able to parseno-break-space
(u'\uc2a0'
), I became curious - what other utf8 chars cannot be parsed bytoml
.Came up with a following script:
One interesting thing is that when
c=b'\x00\r'
, then toml.loads fails with IndexError:This issue is a request to fix IndexError.
If you replace this:
Then IndexError is gone, but there are 506 (on Windows) symbols cannot be parsed by
toml
.The text was updated successfully, but these errors were encountered: