Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unusable on Windows/Python2.7 with unicode paths #56

Closed
Suor opened this issue Jun 9, 2019 · 12 comments
Closed

Unusable on Windows/Python2.7 with unicode paths #56

Suor opened this issue Jun 9, 2019 · 12 comments

Comments

@Suor
Copy link

Suor commented Jun 9, 2019

The issue is that sys.getfilesystemencoding() == "mcbs" on Windows/Python 2.7, which is lossy. So one looses info when passes path as unicode to Path.

@Suor Suor changed the title Doesn't work on Windows/Python2.7 with unicode paths Unusable on Windows/Python2.7 with unicode paths Jun 9, 2019
@mcmtroffaes
Copy link
Collaborator

I'm sorry, but I cannot reproduce your problem. Unicode paths work fine as long as they can be encoded in the underlying file system encoding. See also the advice offered at

https://www.python.org/dev/peps/pep-0529/

@Suor
Copy link
Author

Suor commented Jun 10, 2019

Here is a test script:

# encoding: utf-8
import pathlib2
import os
import sys
print 'filesystemencoding', sys.getfilesystemencoding()

test = u'тест' # This is "test" in russian in utf-8

open(test, 'w').close()
p = pathlib2.Path(test)
print 'path', p
print 'os.path.exists', os.path.exists(test)
print 'pathlib.Path.exists', p.exists()

And its output:

filesystemencoding mbcs
path ????
os.path.exists True
pathlib.Path.exists
Traceback (most recent call last):
  File "test.py", line 13, in <module>
    print 'pathlib.Path.exists', p.exists()
  File "C:\Users\User\Envs\dvc-p2\lib\site-packages\pathlib2\__init__.py", line 1549, in exists
    self.stat()
  File "C:\Users\User\Envs\dvc-p2\lib\site-packages\pathlib2\__init__.py", line 1356, in stat
    return self._accessor.stat(self)
  File "C:\Users\User\Envs\dvc-p2\lib\site-packages\pathlib2\__init__.py", line 541, in wrapped
    return strfunc(str(pathobj), *args)
WindowsError: [Error 123] The filename, directory name, or volume label syntax is incorrect: '????'

I use Python 2.7.14 on Windows 10 Dev Evaluation image. As you can see open() and os.path work fine with utf-8 path, but not pathlib2.

See also the advice offered at

What advice do you mean?

@mcmtroffaes mcmtroffaes reopened this Jun 10, 2019
@mcmtroffaes
Copy link
Collaborator

Thanks for the test case, that really helps.

Can you check the result of u'тест'.encode("mbcs") (I do not immediately have access to a windows machine so I cannot test it right now myself)? If that works, then the issue is not caused by using a path that cannot be encoded in mbcs in Windows (as you mentioned in the issue originally), but rather, it looks like for some reason the accessor functions aren't properly working.

If the encode works, can you also see if open(u'тест'.encode("mbcs"), 'w').close() works and creates the correct file?

@mcmtroffaes
Copy link
Collaborator

Well, from the tests I've done so far it looks like u'тест' is not encodable by sys.filesystemencoding() on Windows. Why os.path methods work on these paths when passed as unicode is a mystery. To make things even worse, encode with mbcs does not respect the "strict" argument.

This is not advertised behaviour (see https://docs.python.org/2/howto/unicode.html#unicode-filenames ...), and it looks like Python 2.7 is broken in this respect, at least on Windows. :-(

@Suor
Copy link
Author

Suor commented Jun 10, 2019

u'тест'.encode('mbcs') returns '????', so data is lost after that.

@mcmtroffaes
Copy link
Collaborator

Thanks for testing - I've seen the same on appveyor. I'm afraid this looks like a bug in Python 2.7.

@Suor
Copy link
Author

Suor commented Jun 10, 2019

Why are you encoding strings anyway? This is not done in Python 3 and is weird to see in a backport. I was quite surprised when my tests failed because Path.name is an encoded string in Python 2 - a real headache for a cross-python app.

@mcmtroffaes
Copy link
Collaborator

mcmtroffaes commented Jun 10, 2019

The original authors designed pathlib to store file paths as byte strings in the file system encoding. I'm afraid I do not know why. From a user perspective, it would make much more sense for unicode to be used for filenames consistently everywhere.

The fact that for instance open(u'тест'.encode(sys.getfilesystemencoding()), 'w') is not the same as open(u'тест', 'w') seems to me to be a bug in Python 2.7. The backport was designed with the assumption that these would be equivalent.

The way to fix this is to figure out which encoding to use in place of mbcs on Windows. I'm sorry to say that I have no idea on how to do this correctly. From what I understand, the whole purpose of "mbcs" is for Python to do this correctly for us. But it doesn't...

@Suor
Copy link
Author

Suor commented Jun 10, 2019

Or you can provide a way for users to set this encoding. I would just say something like:

pathlib.setfilesystemencoding('utf-8')

and be ok with it. And then drop it in a half of the year)

@mcmtroffaes
Copy link
Collaborator

Well, you can just piggy back sys.getfilesystemencoding() to return the correct encoding for your file system. For example:

import sys
sys.getfilesystemencoding = lambda: "utf16"

if your filesystem uses utf16 encoding (which, I understand, is the standard used in windows). Obviously, if you specify the wrong encoding, things will go horribly wrong, so do this at your own risk.

pathlib2 uses sys.getfilesystemencoding() to get the encoding everywhere, so this hack should work as foreseen (although I have not tested it).

Does this help?

@Suor
Copy link
Author

Suor commented Jun 14, 2019

I don't want to hack that globally. I've already worked around it by fixing my PathInfo subclass.

@mcmtroffaes
Copy link
Collaborator

Sure. I'll close this as it's an upstream bug in Python 2.7 (which is fixed in Python 3.6 as per the PEP I linked above in my first reply). In retrospect, I wish I had been aware earlier of how big of a problem this was.

mcmtroffaes added a commit that referenced this issue Jun 14, 2019
Warn users about unicode on Python 2.7 on Windows, and point to #56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants