Improve handling of UTF-16-LE encoded .pth files #119496

Open
ncoghlan opened this issue May 24, 2024 · 3 comments
Labels
type-feature A feature request or enhancement

Comments

@ncoghlan
Contributor

ncoghlan commented May 24, 2024

Feature or enhancement

I just finished an extended Windows bug hunt that I eventually tracked down to a .pth file being encoded in UTF-16-LE rather than an ASCII-compatible encoding.

I only figured it out by turning off frozen modules and hacking site.py to dump the output of .pth files as it tried to process them (at which point I checked the .pth file encoding in VSCode and sure enough, UTF-16-LE was down in the corner of the file window).

I hit the bug by porting a Linux shell script to Windows PowerShell, not thinking about the fact that | Out-File on Windows PowerShell 5.1 defaults to UTF-16-LE (newer versions of PowerShell, which aren't the ones baked into the OS, default to UTF-8 without a BOM).

A UTF-16-LE file inevitably contains NULL bytes, while a file in UTF-8 or the locale encoding shouldn't contain any, so it seems to me we should be able to handle such situations more gracefully. At the very least, we could log an error if NULL bytes are present in the file. Potentially we could even make UTF-16-LE encoded .pth files straight up work, either by checking for NULL bytes and using UTF-16-LE instead of UTF-8 and the locale encoding if we find one, or else by trying UTF-16-LE before trying the locale encoding on Windows.
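
A minimal sketch of the NULL-byte heuristic described above. The helper name `decode_pth_bytes` is hypothetical and is not part of `site.py`; the order of checks (UTF-16 BOM first, then NULL bytes, then `utf-8-sig`, then the locale encoding) is an assumption for illustration, not a description of how CPython behaves.

```python
import codecs
import locale


def decode_pth_bytes(raw: bytes) -> str:
    """Hypothetical helper: guess the encoding of raw .pth file contents."""
    # A UTF-16 BOM is the clearest signal; the "utf-16" codec consumes it
    # and picks the matching byte order.
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return raw.decode("utf-16")
    # No BOM, but NULL bytes are present: assume UTF-16-LE, since neither
    # UTF-8 nor a typical locale encoding would produce them for path text.
    if b"\x00" in raw:
        return raw.decode("utf-16-le")
    # Otherwise try UTF-8 (with or without a BOM), then the locale encoding.
    try:
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        return raw.decode(locale.getpreferredencoding(False))


print(repr(decode_pth_bytes("C:\\libs\n".encode("utf-16-le"))))
print(repr(decode_pth_bytes("C:\\libs\n".encode("utf-8-sig"))))
```

A real implementation would also need to decide what to report when the guessed encoding fails to decode; this sketch simply lets the `UnicodeDecodeError` propagate.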

(This is somewhat related to #77102)


@ncoghlan ncoghlan added the type-feature A feature request or enhancement label May 24, 2024
@ncoghlan
Contributor Author

ncoghlan commented May 24, 2024

Edit: the #77102 implementation has been updated to use utf-8-sig, so handling UTF-8 with a BOM is no longer a problem. It's just UTF-16-LE being both easy to accidentally generate and nightmarish to debug when it happens that's still a potential concern.


Experimenting a bit further, I suspect utf-8-sig is going to be a problem, too, since it decodes cleanly with utf-8, but gives garbage data:

>>> "text".encode("utf-8-sig").decode("utf-8")
'\ufefftext'
>>> "text".encode("utf-8-sig").decode("utf-8-sig")
'text'
>>> "text".encode("utf-8").decode("utf-8-sig")
'text'
>>>

The problem child is Windows PowerShell 5.1, where Out-File and friends default to UTF-16-LE and allow UTF-8 with a BOM to be selected, but getting them to emit UTF-8 without a BOM is a major pain (see the voluminous essays on https://stackoverflow.com/questions/5596982/using-powershell-to-write-a-file-in-utf-8-without-the-bom as well as the PowerShell 5.1 Out-File docs).

As the examples above show, I think decoding with utf-8-sig instead of utf-8 will solve that part of the problem.

ncoghlan added a commit that referenced this issue May 24, 2024
`Out-File -Encoding utf8` and similar commands in Windows PowerShell 5.1 emit
UTF-8 with a BOM marker, which the regular `utf-8` codec decodes incorrectly.

`utf-8-sig` accepts a BOM, but also works correctly without one.

This change also makes .pth files match the way Python source files are handled.

Co-authored-by: Inada Naoki <songofacandy@gmail.com>
miss-islington pushed a commit to miss-islington/cpython that referenced this issue May 24, 2024
`Out-File -Encoding utf8` and similar commands in Windows PowerShell 5.1 emit
UTF-8 with a BOM marker, which the regular `utf-8` codec decodes incorrectly.

`utf-8-sig` accepts a BOM, but also works correctly without one.

This change also makes .pth files match the way Python source files are handled.

(cherry picked from commit bf5b646)

Co-authored-by: Alyssa Coghlan <ncoghlan@gmail.com>
Co-authored-by: Inada Naoki <songofacandy@gmail.com>
@hugovk
Member

hugovk commented Jun 15, 2024

Triage: can this be closed or is there more to do?

@ncoghlan
Contributor Author

ncoghlan commented Jun 16, 2024

UTF-16-LE still fails silently (the NULLs mean decoding fails, so the file gets ignored).

The already merged PRs just made utf-8-bom work in the #77102 implementation (since that was just a change of input codec to accept UTF-8 both with and without a BOM, rather than adding an entirely new encoding to try)
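
A small illustration of how UTF-16-LE bytes behave under the `utf-8-sig` codec, assuming one file that starts with a UTF-16-LE BOM and one that does not. The helper `try_utf8_sig` is hypothetical and only exists for this example; what `site.py` does with the result afterwards is not modelled here.

```python
import codecs


def try_utf8_sig(raw: bytes):
    """Return the decoded text, or None if the bytes are not valid UTF-8."""
    try:
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        return None


# UTF-16-LE content with and without a leading byte order mark.
with_bom = codecs.BOM_UTF16_LE + "text".encode("utf-16-le")
without_bom = "text".encode("utf-16-le")

# 0xFF can never start a UTF-8 sequence, so the BOM makes decoding fail.
print(try_utf8_sig(with_bom))

# Without the BOM, ASCII-only text decodes, but every other character is NULL.
print(repr(try_utf8_sig(without_bom)))
```

Either way the decoded result is not usable as a path entry, which is why an explicit check for a UTF-16 BOM or NULL bytes would make the failure much easier to diagnose.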
