subprocess.run() defaults to the wrong text encoding under Windows #105312
Comments
I get the same in main with a US machine. |
If Python is attached to a console session, the console's current input code page is The CMD shell's internal commands such as
>>> encoding = os.device_encoding(1)
>>> encoding
'cp850'
>>> p = subprocess.run("echo ö", shell=True, stdout=subprocess.PIPE, encoding=encoding)
>>> p.stdout.strip()
'ö'
Alternatively, since the console defaults to using the system OEM code page, you can use
There is no universal I/O encoding or API query. For example, "sort.exe" uses the process OEM code page instead of the console output code page. In the following example, I've set the console code pages to 850, and I chose the character "¢" because it's encoded differently in each of the code pages: 437 (process OEM), 1252 (process ANSI), and 850.
>>> open('f.txt', 'w', encoding='utf-16').write('¢')
1
>>> p = subprocess.run('sort.exe "f.txt"', stdout=subprocess.PIPE, encoding='oem')
>>> p.stdout.strip()
'¢'
>>> p = subprocess.run('sort.exe "f.txt"', stdout=subprocess.PIPE, encoding='ansi')
>>> p.stdout.strip() # wrong
'›'
>>> p = subprocess.run('sort.exe "f.txt"', stdout=subprocess.PIPE, encoding='cp850')
>>> p.stdout.strip() # wrong
'ø'
For another example, "attrib.exe" uses the process ANSI code page.
>>> os.mkdir('spam')
>>> open(r'spam\¢', 'w').close()
>>> p = subprocess.run(r'attrib.exe "spam\*"', stdout=subprocess.PIPE, text=True)
>>> p.stdout.strip()
'A C:\\Temp\\spam\\¢'
>>> p = subprocess.run(r'attrib.exe "spam\*"', stdout=subprocess.PIPE, encoding='ansi')
>>> p.stdout.strip()
'A C:\\Temp\\spam\\¢'
>>> p = subprocess.run(r'attrib.exe "spam\*"', stdout=subprocess.PIPE, encoding='oem')
>>> p.stdout.strip() # wrong
'A C:\\Temp\\spam\\ó'
>>> p = subprocess.run(r'attrib.exe "spam\*"', stdout=subprocess.PIPE, encoding='cp850')
>>> p.stdout.strip() # wrong
'A C:\\Temp\\spam\\ó'
The list of mutually inconsistent examples could go on. There is no standard. Common choices are the process ANSI code page, process OEM code page, UTF-16, UTF-8, or the current input code page or current output code page of a console session.
The current console code page in general has nothing to do with the user locale (e.g. day/month names, number/currency symbols) or the user's preferred UI language (text resources, messages). It's a bad choice for the locale encoding, unless it's UTF-8. The best choice in general is the ANSI code page of the user locale, unless the process ANSI code page is UTF-8. Next best is the process ANSI code page, which is normally based on the system locale and commonly matches the user locale. Python uses the process ANSI code page as the locale encoding, unless it's overridden by UTF-8 mode.
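The wrong results in the "sort.exe" and "attrib.exe" examples above can be reproduced at the byte level on any platform, since Python ships codecs for all three code pages. This is only a sketch: `reinterpret` is a hypothetical helper name, and the code pages are the ones named in the text (437 as OEM, 1252 as ANSI, 850 as the console code page).

```python
# Sketch: encode "¢" with the code page a tool writes in, then decode the
# bytes with each candidate code page to see which readers get it wrong.
CENT = "\u00a2"  # the character "¢"


def reinterpret(text, written_as, read_as):
    """Encode text with one code page and decode the bytes with another."""
    return text.encode(written_as).decode(read_as)


CODE_PAGES = ("cp437", "cp1252", "cp850")

# A tool that writes OEM (cp437) bytes, like "sort.exe" in the example.
for read_as in CODE_PAGES:
    print("written as cp437, read as", read_as, "->",
          repr(reinterpret(CENT, "cp437", read_as)))

# A tool that writes ANSI (cp1252) bytes, like "attrib.exe" in the example.
for read_as in CODE_PAGES:
    print("written as cp1252, read as", read_as, "->",
          repr(reinterpret(CENT, "cp1252", read_as)))
```

Only the reader that uses the writer's code page gets "¢" back; the other two produce the same substitutes shown in the transcripts.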
That revision changed their Since 3.6, Python's I/O stack uses the |
This is great detail and information about the behaviour of Windows. Thanks, @eryksun! Still, is the expectation that any user of So, while I agree with you that text encoding clearly needs to be configurable because there is no clear standard, I get back to my initial impression that the default is wrongly chosen. FWIW (probably not too much): the OEM codepage is also used by any dotnet executable (no need to include its source code here):
>>> subprocess.run("SayHi.exe", text=True, stdout=subprocess.PIPE).stdout
'”\n' |
Some user expectations: I came across
print(subprocess.run(["python", "-c", "print('Я')"], shell=True, text=True, capture_output=True).stderr)
results in the following:
|
Unfortunately, we're somewhat restricted in how many people we can break here. Python on Windows uses a different encoding based on whether it's attached to a console (no compatibility requirements) or to a file/pipe (many compatibility requirements). In the former case, it uses the native console APIs with UTF-16-LE text to write output, and so virtually any character will work. For files/pipes, we use the normal method of looking at
If you set
If someone comes up with a clever way we can change the default encoding to UTF-8 without upsetting everyone who needs it to behave the way it currently does, or at least safely deprecating/warning them about an upcoming change, we'd be open to changing it. Until then, we're kinda stuck. |
Thank you for explaining in detail, this works like a charm:
Also tried with no
As a side note - one might want to change the codepage if output is in ???? characters, for example code below with
|
The Your original example captured
>>> os.environ['PYTHONIOENCODING']
'UTF-8'
>>> p = subprocess.run(["python", "-c", "print('Я')"], text=True, capture_output=True)
>>> p.stdout
'Я\n'
On the parent side of the pipes, the
>>> os.environ['PYTHONIOENCODING']
'UTF-8'
>>> p = subprocess.run(["python", "-c", "print('Я')"], encoding='utf-8', capture_output=True)
>>> p.stdout
'Я\n'
Alternatively, override all I/O to UTF-8 by setting
>>> os.environ['PYTHONUTF8']
'1'
>>> sys.flags.utf8_mode
1
>>> p = subprocess.run(["python", "-c", "print('Я')"], text=True, capture_output=True)
>>> p.stdout
'Я\n'
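The transcripts above rely on the environment variable already being set in the parent's environment. As a minimal sketch of the same idea, the variable can instead be passed only to the child through the `env` parameter, with the parent decoding the pipe as UTF-8. `run_child_utf8` is a hypothetical helper name, and `sys.executable` stands in for the `"python"` command used above.

```python
import os
import subprocess
import sys


def run_child_utf8(code):
    """Run a child Python whose stdio is UTF-8 and decode its pipes as UTF-8."""
    # Child side: PYTHONIOENCODING makes the child encode stdout/stderr as UTF-8.
    env = dict(os.environ, PYTHONIOENCODING="utf-8")
    # Parent side: encoding="utf-8" decodes the captured pipes to match.
    return subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        encoding="utf-8",
        capture_output=True,
    )


# The child code uses an escape so the command line itself stays ASCII.
p = run_child_utf8("print('\\u042f')")
print(repr(p.stdout))
```

Both sides have to agree: setting only the child's encoding, or only the parent's `encoding` argument, leaves the two ends of the pipe using different encodings.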
The "chcp.com" command gets and sets the current code page of the console session. I assume your shell is either CMD or another that works similarly.
The CMD shell is partly a legacy console application. When attached to a console session, CMD uses the console code page as the I/O encoding for files and pipes, such as text that's piped to a child process from the
Python is a legacy Windows application that uses the system code page as its default I/O encoding for a file or pipe, whether or not it's attached to a console session. Like the CMD shell, Python 3.6+ uses Unicode for console I/O via the console's wide-character API.
Note that the system code page (i.e. system ANSI) and default console code page (i.e. system OEM) are potentially inconsistent with text that's based on the locale (e.g. number, currency, and time formatting characters; names of weekdays and months) and UI language (e.g. the messages and strings used by the system or a library/application). Usually the locale and UI language are configured consistently with each other for the current user, but the system code page and default console code page are configured at the system level.
A good option for the I/O encoding is to use the code page that's configured for the current user locale, but override to UTF-8 if either the user locale is Unicode-only (e.g. Hindi, India) or the system code page is set to UTF-8 (i.e. code page 65001). That said, it's becoming a non-issue as more and more systems configure the system code page and default console code page as UTF-8. All of this mess with legacy encodings will one day be a footnote in history. |
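The selection rule in the last paragraph can be written down as a small pure function. This is only a sketch of the decision itself: `choose_io_encoding` and its parameters are hypothetical, obtaining the two code pages from Windows is left out, and representing a Unicode-only user locale as `None` is an assumption made for illustration.

```python
UTF8_CODE_PAGE = 65001  # the UTF-8 code page number mentioned in the text


def choose_io_encoding(user_locale_ansi_cp, system_ansi_cp):
    """Pick an I/O encoding name from two code page numbers.

    user_locale_ansi_cp: ANSI code page of the current user locale, or None
        for a Unicode-only locale (an assumption of this sketch).
    system_ansi_cp: the system ANSI code page.
    """
    if user_locale_ansi_cp is None or system_ansi_cp == UTF8_CODE_PAGE:
        return "utf-8"
    # Python names its code page codecs "cpNNNN"; not every Windows code
    # page necessarily has a matching codec.
    return f"cp{user_locale_ansi_cp}"


print(choose_io_encoding(1252, 1252))   # user locale code page wins
print(choose_io_encoding(None, 1252))   # Unicode-only locale
print(choose_io_encoding(1251, 65001))  # system code page is UTF-8
```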
subprocess.check_output also gives the same error in Python 3.10.10. |
This is a western Europe Windows 11 machine:
As you can see, there is codepage confusion. You don't get back what you wrote out.
Windows has different codepage settings applied, depending on context. File encoding (also called ANSI codepage) is not necessarily identical with console encoding (also called OEM codepage), see https://stackoverflow.com/a/43194047. The OEM codepage contains legacy graphical symbols like "╣" or "▒".
On my machine:
The character "ö" is encoded as byte 0x94 in CP850 (see table there). In CP1252, this byte maps to "”".
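The mapping can be checked directly with Python's codecs. In this sketch, `raw` stands in for the bytes a console program would write under code page 850; no child process is actually run.

```python
# Bytes as written by a program using CP850 (the console/OEM code page here).
raw = "ö\r\n".encode("cp850")
print(raw)  # the first byte is 0x94

# Decoding with the ANSI code page (CP1252) gives the wrong character.
wrong = raw.decode("cp1252").strip()

# Decoding with the code page the bytes were written in round-trips.
right = raw.decode("cp850").strip()

print(repr(wrong), repr(right))
```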
The suggestion here is that subprocess related things should not pass the choice of the default encoding to io.TextIOWrapper (which is documented to take locale.getencoding()), but should instead default to the value returned by GetConsoleCP(). This would be exactly the same as what the Go people decided to do.
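Until the default changes, a caller can approximate this suggestion with a thin wrapper. The sketch below is not a proposed patch: `console_input_encoding` and `run_text` are hypothetical names, and it assumes that `os.device_encoding(0)` reflects the console's input code page on Windows when standard input is a console. When standard input is not a terminal, `os.device_encoding` returns `None` and the sketch falls back to the locale encoding.

```python
import locale
import os
import subprocess
import sys


def console_input_encoding():
    """Return the console's encoding if stdin is a terminal, else the locale's."""
    # os.device_encoding() returns None when the descriptor is not a terminal.
    return os.device_encoding(0) or locale.getpreferredencoding(False)


def run_text(args, **kwargs):
    """Like subprocess.run(), but default to the console encoding for pipes."""
    # An explicit encoding= passed by the caller still takes precedence.
    kwargs.setdefault("encoding", console_input_encoding())
    return subprocess.run(args, **kwargs)


p = run_text([sys.executable, "-c", "print('hi')"], stdout=subprocess.PIPE)
print(repr(p.stdout))
```

As the comments above show, no single default is right for every child program, so a wrapper like this only moves the guess; passing `encoding` explicitly per tool remains the reliable option.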