Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PEP 680: Further explain why files are read as binary #2281

Merged
merged 2 commits into from
Jan 27, 2022

Conversation

hauntsaninja
Copy link
Contributor

@hauntsaninja hauntsaninja commented Jan 26, 2022

On the discuss thread, questions were brought up about the binary file type: https://discuss.python.org/t/pep-680-tomllib-support-for-parsing-toml-in-the-standard-library/13040/43
Ensuring that it's easy to correctly parse TOML files containing non-ASCII characters on Windows is a good reason to use binary files. This PR adds some words to highlight this reason for using binary files.

cc @hukkin

Copy link
Contributor

@hukkin hukkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy with the change (if it helps anyone understand the issue better), but also don't think it's really necessary. My thinking is that even if all platforms defaulted to UTF-8, it makes no sense providing an API that allows changing that default, because TOML strictly requires UTF-8. Why allow configuring incorrect parse results?

Copy link
Member

@CAM-Gerlach CAM-Gerlach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your changes LGTM, thanks! I just have one comment to clarify something that confused me when I was reading the PEP initially.

pep-0680.rst Outdated
Comment on lines 293 to 294
encodings, such as Windows), and avoid incorrectly parsing single carriage
returns as valid TOML due to universal newlines in text mode.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
encodings, such as Windows), and avoid incorrectly parsing single carriage
returns as valid TOML due to universal newlines in text mode.
encodings, such as Windows), and avoid incorrectly parsing files using a
single carriage return (``\r``) for the end of line character as valid TOML
(due to Python's universal newline conversion in text mode).

Despite careful re-reading, I was initially confused by this line when reading the original PEP, and then misunderstood it to mean files that were just single CRs, rather than files that use single CRs as the newline character. I suggest clarifying this as above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a commit to clarify this. I think this reason is much less important than the other one, so don't wish to spend too many words on this.

Copy link
Member

@CAM-Gerlach CAM-Gerlach Jan 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The commit doesn't really clarify the confusion I had over its intended meaning, sorry. The problematic usage is with using a single CR as an end of line character, not single CRs anywhere in the file being an invalid character as the current wording implies (which both won't universal newlines unless it is the EOL character, and appears to be legal TOML inside multi-line double-quoted strings, at least per the verbatim letter of the spec).

If it isn't that important, such that it wouldn't be worth adding a few extra words to make its intended meaning clear and explicit and avoid introducing additional confusion, then maybe it would be best to just remove it entirely?

Copy link
Contributor

@hukkin hukkin Jan 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a preference on the wording here, but I do think this is important and should not be removed.

Regarding TOML spec confusion: note that CR character is always part of an EOL sequence (CR+LF). If it isn't then the TOML is invalid. A bare CR is not valid in multi-line strings (double or single quoted) as per toml.abnf (I've also made a clarification of this in the Markdown spec, but that isn't included in TOML v1.0.0).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a preference on the wording here, but I do think this is important and should not be removed.

In that case, I suggest including the above clarification (or similar), as the present wording was the source of confusion on what you meant here (by someone who makes heavy use of TOML, though lacking your depth of expertise into the intricacies of the format).

Regarding TOML spec confusion: note that CR character is always part of an EOL sequence (CR+LF). If it isn't then the TOML is invalid. A bare CR is not valid in multi-line strings (double or single quoted) as per toml.abnf.

Ah, thanks for the clarification. I was mislead by this line,

Any Unicode character may be used except those that must be escaped: backslash and the control characters other than tab, line feed, and carriage return (U+0000 to U+0008, U+000B, U+000C, U+000E to U+001F, U+007F).

and the lack of explicitly requiring any CF to always be followed by LF, but indeed the formal ABNF is correct.

@CAM-Gerlach CAM-Gerlach changed the title PEP 680: further explain binary files PEP 680: Further explain why files are read as binary Jan 27, 2022
@encukou encukou merged commit 5796270 into python:main Jan 27, 2022
@hauntsaninja hauntsaninja deleted the pep-680-edits-2 branch January 27, 2022 10:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants