Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PEP 680: Further explain why files are read as binary #2281

Merged
merged 2 commits into from
Jan 27, 2022
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 3 additions & 2 deletions pep-0680.rst
Original file line number Diff line number Diff line change
Expand Up @@ -289,8 +289,9 @@ be worked around in user code, or by using a third-party library.

The proposed API takes a binary file, while ``toml.load`` takes a text file and
``json.load`` takes either. Using a binary file allows us to ensure UTF-8 is
the encoding used, and avoid incorrectly parsing single carriage returns as
valid TOML due to universal newlines in text mode.
the encoding used (ensuring correct parsing on platforms with other default
encodings, such as Windows), and avoid incorrectly parsing single carriage
returns as valid TOML due to universal newlines in text mode.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
encodings, such as Windows), and avoid incorrectly parsing single carriage
returns as valid TOML due to universal newlines in text mode.
encodings, such as Windows), and avoid incorrectly parsing files using a
single carriage return (``\r``) for the end of line character as valid TOML
(due to Python's universal newline conversion in text mode).

Despite careful re-reading, I was initially confused by this line when reading the original PEP, and then misunderstood it to mean files that were just single CRs, rather than files that use single CRs as the newline character. I suggest clarifying this as above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a commit to clarify this. I think this reason is much less important than the other one, so don't wish to spend too many words on this.

Copy link
Member

@CAM-Gerlach CAM-Gerlach Jan 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The commit doesn't really clarify the confusion I had over its intended meaning, sorry. The problematic usage is with using a single CR as an end of line character, not single CRs anywhere in the file being an invalid character as the current wording implies (which both won't universal newlines unless it is the EOL character, and appears to be legal TOML inside multi-line double-quoted strings, at least per the verbatim letter of the spec).

If it isn't that important, such that it wouldn't be worth adding a few extra words to make its intended meaning clear and explicit and avoid introducing additional confusion, then maybe it would be best to just remove it entirely?

Copy link
Contributor

@hukkin hukkin Jan 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a preference on the wording here, but I do think this is important and should not be removed.

Regarding TOML spec confusion: note that CR character is always part of an EOL sequence (CR+LF). If it isn't then the TOML is invalid. A bare CR is not valid in multi-line strings (double or single quoted) as per toml.abnf (I've also made a clarification of this in the Markdown spec, but that isn't included in TOML v1.0.0).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a preference on the wording here, but I do think this is important and should not be removed.

In that case, I suggest including the above clarification (or similar), as the present wording was the source of confusion on what you meant here (by someone who makes heavy use of TOML, though lacking your depth of expertise into the intricacies of the format).

Regarding TOML spec confusion: note that CR character is always part of an EOL sequence (CR+LF). If it isn't then the TOML is invalid. A bare CR is not valid in multi-line strings (double or single quoted) as per toml.abnf.

Ah, thanks for the clarification. I was mislead by this line,

Any Unicode character may be used except those that must be escaped: backslash and the control characters other than tab, line feed, and carriage return (U+0000 to U+0008, U+000B, U+000C, U+000E to U+001F, U+007F).

and the lack of explicitly requiring any CF to always be followed by LF, but indeed the formal ABNF is correct.



Type accepted as the first argument of ``tomllib.loads``
Expand Down