Unicode character support in filenames when reading a zip created in OSX #131
Comments
In practice, if bit 11 is unset, the file name can be in any ASCII-compatible encoding, not just Code Page 437. For example, on Windows with the zh-CN locale, 7-Zip encodes file names in GBK (the default encoding for that locale on Windows). I believe most compression tools on Windows do the same. So there is no way to determine the file name encoding of a zip file from an unknown source if bit 11 is unset. If we assume UTF-8 even when bit 11 is unset, zip files using CP437 will break (I doubt there are many now), those created by OSX are rescued, and most of those created on non-English Windows stay broken, as they are today. So it seems not a bad idea. Exposing the original bytes of the file name is easy, but there is no reliable way to guess the actual encoding (especially in a browser), so it is not very useful currently.
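A minimal sketch of the "assume UTF-8, otherwise fall back" idea discussed above. The names `decodeFilename` and `latin1Fallback` are hypothetical and not part of zip.js; the fallback is whatever decoder the caller thinks fits their data (Latin-1 here only because it needs no lookup table). Note that bytes in a legacy encoding can occasionally also be valid UTF-8, so this is a heuristic, not a detection algorithm.

```javascript
// Hypothetical helper, not a zip.js API: decode the raw bytes of an entry name.
// utf8Flag is the value of bit 11 of the entry's general purpose bit flag.
function decodeFilename(bytes, utf8Flag, fallbackDecode) {
  if (utf8Flag) {
    // The archive declares UTF-8, so decode leniently.
    return new TextDecoder("utf-8").decode(bytes);
  }
  try {
    // With fatal: true, decode() throws a TypeError on malformed UTF-8.
    return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  } catch (e) {
    // Not valid UTF-8: hand the raw bytes to a caller-supplied decoder.
    return fallbackDecode(bytes);
  }
}

// Illustrative fallback only: treats each byte as a Latin-1 code point.
const latin1Fallback = (bytes) => String.fromCharCode(...bytes);

// "é.txt" encoded as UTF-8, with bit 11 unset (as described for OSX archives).
const osxName = decodeFilename(
  Uint8Array.from([0xc3, 0xa9, 0x2e, 0x74, 0x78, 0x74]),
  false,
  latin1Fallback
);

// 0xE9 followed by "." is malformed UTF-8, so the fallback is used.
const legacyName = decodeFilename(
  Uint8Array.from([0xe9, 0x2e, 0x74]),
  false,
  latin1Fallback
);
```

Passing the fallback as a function keeps the guess about the legacy encoding in the application, which is where knowledge about the likely source of the archive lives.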
Exposing the original data is useful because in our application we can detect the presence of Although the above is less ideal than your suggestion of just assuming they're UTF-8 in the first place. @gildas-lormeau any thoughts?
What about a new options object, used as follows (e.g. using the same decoders for filename and comment):
Or a low-level hook to be run before/after a zip entry is created, which allows the caller to "patch" a zip file before/after it is processed, e.g. (before) Line 658 in 1bead0a
or (after) Line 673 in 1bead0a
The hard parts of this issue are:
However, the algorithms above are quite complicated and require a large amount of data, so I doubt many web apps want to do that. See also Zip files and Encoding – I hate you, and Non-UTF-8 encoding in ZIP file.
I created a zip file with a UTF-8 filename, using the
Trying to default to UTF-8 might be a good option since it is compatible with ASCII (not extended ASCII).
One random idea, why not just return the binary data (TypedArray) of the filename as part of |
As suggested by @greggman, I added a new property named |
Expected: Given I've created a zip file using the out-of-the-box zip functionality in OSX, when I read it and examine the file names of the files in the zip, then it should return the same name correctly encoded.
Actual: The file names are parsed as if they are ASCII and not UTF-8.
I'm currently running OS X 10.10.4.
Why is this the case?
There's a configuration bit set in a zip file entry that describes whether the file names and comment must conform to UTF-8 (the specifics below). In OSX, the "Compress" option that comes out of the box (selecting files, right click, "Compress") doesn't seem to set this bit. Currently the zip.js library parses strings as ASCII unless this bit is set.
https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT
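For reference, a small sketch of how that bit can be read. In APPNOTE.TXT it is bit 11 (mask `0x0800`) of the general purpose bit flag, which sits at byte offset 8 of a central directory file header. The function name `hasUtf8Flag` is made up for illustration and is not how zip.js exposes it.

```javascript
// Bit 11 of the general purpose bit flag: names and comments are UTF-8.
const UTF8_FLAG = 0x0800;
// Central directory file header signature ("PK\x01\x02" read as little-endian).
const CENTRAL_HEADER_SIGNATURE = 0x02014b50;

// Hypothetical helper: view is a DataView over the archive bytes,
// headerOffset points at the start of a central directory file header.
function hasUtf8Flag(view, headerOffset) {
  if (view.getUint32(headerOffset, true) !== CENTRAL_HEADER_SIGNATURE) {
    throw new Error("Not a central directory file header");
  }
  // Layout: signature (4), version made by (2), version needed (2), flags (2).
  const flags = view.getUint16(headerOffset + 8, true);
  return (flags & UTF8_FLAG) !== 0;
}

// Build two fake 46-byte headers to exercise the helper.
function fakeHeader(flags) {
  const view = new DataView(new ArrayBuffer(46));
  view.setUint32(0, CENTRAL_HEADER_SIGNATURE, true);
  view.setUint16(8, flags, true);
  return view;
}

const flagged = hasUtf8Flag(fakeHeader(0x0800), 0);
const unflagged = hasUtf8Flag(fakeHeader(0x0008), 0);
```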
Solutions?
After some more digging I thought we might be able to just assume UTF-8 always (as ASCII is a subset), but since the default is extended ASCII (Code Page 437) we probably shouldn't as characters 128-255 will be different to those in UTF-8.
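To illustrate the concern: plain ASCII names decode identically under UTF-8, but a name using the upper half of Code Page 437 does not. In the sketch below the bytes are hand-written examples (0x82 is "é" in CP437), not taken from a real archive.

```javascript
// Lenient decoder: malformed sequences become U+FFFD instead of throwing.
const utf8Decoder = new TextDecoder("utf-8");

// "a.txt": pure ASCII, identical in CP437 and UTF-8.
const asciiBytes = Uint8Array.from([0x61, 0x2e, 0x74, 0x78, 0x74]);

// "é.txt" as written by a CP437 tool: 0x82 is "é" in CP437,
// but in UTF-8 a lone 0x82 is a stray continuation byte.
const cp437Bytes = Uint8Array.from([0x82, 0x2e, 0x74, 0x78, 0x74]);

const asciiResult = utf8Decoder.decode(asciiBytes);
const cp437Result = utf8Decoder.decode(cp437Bytes);

console.log(asciiResult); // "a.txt"
console.log(cp437Result); // replacement character followed by ".txt"
```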
Some other ideas: