Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

non-ASCII characters in filenames break git ls-files parsing #23

Closed
myint opened this issue Feb 2, 2014 · 7 comments · Fixed by #24
Closed

non-ASCII characters in filenames break git ls-files parsing #23

myint opened this issue Feb 2, 2014 · 7 comments · Fixed by #24

Comments

@myint
Copy link
Contributor

myint commented Feb 2, 2014

I'm not sure whether I tried check-manifest on this repository before. So I'm not sure whether this is a regression or whether it never worked before. I can reproduce this on 0.18 and 0.19.dev0. I've seen this on Python 3.3, but it seems to also occur if I try Python 2.7.

$ git clone https://github.com/hhatto/autopep8
$ cd autopep8
$ check-manifest --version
check-manifest version 0.19.dev0
$ check-manifest
listing source files under version control: 96 files and directories
copying source files to a temporary directoryTraceback (most recent call last):
  File "/Users/myint/Library/Python/3.3/bin/check-manifest", line 9, in <module>
    load_entry_point('check-manifest==0.19.dev0', 'console_scripts', 'check-manifest')()
  File "/Users/myint/Library/Python/3.3/lib/python/site-packages/check_manifest.py", line 658, in main
    update=args.update, python=args.python):
  File "/Users/myint/Library/Python/3.3/lib/python/site-packages/check_manifest.py", line 572, in check_manifest
    copy_files(source_files, tempsourcedir)
  File "/Users/myint/Library/Python/3.3/lib/python/site-packages/check_manifest.py", line 191, in copy_files
    shutil.copy2(filename, destfile)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/shutil.py", line 243, in copy2
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/shutil.py", line 109, in copyfile
    with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: '"test'
@mgedmin
Copy link
Owner

mgedmin commented Feb 2, 2014

Caused by a file named é.txt in the source repository. git ls-files outputs it as "test/example/e\314\201.txt" (with the quotes and everything).

check-manifest should be changed to parse that correctly.

mgedmin added a commit that referenced this issue Feb 3, 2014
As long as they're valid in your locale.

Solution by @myint from #24, but with added tests and an additional
charset decoding fix in run().

Closes #23.
@mgedmin
Copy link
Owner

mgedmin commented Feb 3, 2014

And now I've got Windows Jenkins failures due to this test. :(

...
  File "c:\buildslave\workspace\check-manifest-on-windows\tests.py", line 779, in _add_to_vcs
    self._run('git', 'add', '--', *filenames)
  File "c:\buildslave\workspace\check-manifest-on-windows\tests.py", line 685, in _run
    stderr=subprocess.STDOUT)
  File "c:\python27\Lib\subprocess.py", line 709, in __init__
    errread, errwrite)
  File "c:\python27\Lib\subprocess.py", line 957, in _execute_child
    startupinfo)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 11: ordinal not in range(128)

@mgedmin
Copy link
Owner

mgedmin commented Feb 3, 2014

With the following change I get two failures instead of four:

diff --git a/tests.py b/tests.py
index 3b7e096..c9f52e1 100644
--- a/tests.py
+++ b/tests.py
@@ -681,6 +681,8 @@ class TestZestIntegration(unittest.TestCase):
 class VCSHelper(object):

     def _run(self, *command):
+        import locale
+        command = [s.encode(locale.getpreferredencoding()) for s in command]
         p = subprocess.Popen(command, stdout=subprocess.PIPE,
                              stderr=subprocess.STDOUT)
         stdout, stderr = p.communicate()

The failures are:

  • Bzr returns [u'\u201a.txt'] from get_vcs_files() (that's a u'‚' character, how does this make sense?)
  • Git returns [u'\xc3\xa9.txt'] from get_vcs_files(), probably because no I have no idea.

The expected value is [u'\xe9.txt'], obviously. Also, locale.getpreferredencoding() returns 'cp1252' on my Windows slave.

@mgedmin mgedmin reopened this Feb 3, 2014
@mgedmin
Copy link
Owner

mgedmin commented Feb 3, 2014

For the record, LC_ALL=en_US.ISO-8859-1 detox passes all tests on Linux. (I had to Google how to build a Latin-1 locale on Ubuntu.)

@mgedmin
Copy link
Owner

mgedmin commented Feb 3, 2014

Incidentally, this repo (hhatto/autopep8) cannot be checked with latest check-manifest master on Windows because git treats filenames as bytes and doesn't convert UTF-8 to the locale encoding (cp1252), and then check-manifest fails:

listing source files under version controlTraceback (most recent call last):
  File "c:/buildslave/workspace/check-manifest-on-windows/check_manifest.py", line 708, in <module>
    main()
  File "c:/buildslave/workspace/check-manifest-on-windows/check_manifest.py", line 669, in main
    update=args.update, python=args.python):
  File "c:/buildslave/workspace/check-manifest-on-windows/check_manifest.py", line 578, in check_manifest
    all_source_files = sorted(get_vcs_files())
  File "c:/buildslave/workspace/check-manifest-on-windows/check_manifest.py", line 326, in get_vcs_files
    return normalize_names(vcs.get_versioned_files())
  File "c:/buildslave/workspace/check-manifest-on-windows/check_manifest.py", line 274, in get_versioned_files
    output = run(['git', 'ls-files', '-z'])
  File "c:/buildslave/workspace/check-manifest-on-windows/check_manifest.py", line 142, in run
    output = pipe.communicate()[0].decode(locale.getpreferredencoding())
  File "c:\python27\lib\encodings\cp1252.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 337: character maps to <undefined>

mgedmin added a commit that referenced this issue Feb 3, 2014
I'm actually surprised -- don't all the Win32 APIs have Unicode
versions?  Eh, Python.

Related to issue #23.
@mgedmin
Copy link
Owner

mgedmin commented Feb 3, 2014

I'm closing this issue and creating #26 for the remainder.

I should probably also create an issue for "check-manifest can't deal with filenames not expressible in your locale", i.e. the above comment, but it seems like a large task and I don't know how to approach it. Let's wait for users to complain (and, perhaps, offer solutions).

@mgedmin mgedmin closed this as completed Feb 3, 2014
@mgedmin
Copy link
Owner

mgedmin commented Feb 3, 2014

Bzr returns [u'\u201a.txt'] from get_vcs_files() (that's a u'‚' character, how does this make sense?)

FWIW this error just disappeared for no reason.

Maybe there's something different in my environment that doesn't show up in a Jenkins job.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants