Unicode character support in filenames when reading a zip created in OSX #131

Closed
timclipsham opened this issue Jul 16, 2015 · 7 comments

@timclipsham

Expected: Given I've created a zip file using the out-of-the-box zip functionality in OSX, when I read it and examine the file names of the files in the zip, then it should return the same name correctly encoded.

Actual: The file names are parsed as if they are ASCII and not UTF-8.

I'm currently running OS X 10.10.4.

Why is this the case?
There's a configuration bit set in a zip file entry that describes whether the file names and comment must conform to UTF-8 (the specifics below). In OSX, the "Compress" option that comes out of the box (selecting files, right click, "Compress") doesn't seem to set this bit. Currently the zip.js library parses strings as ASCII unless this bit is set.

D.2 If general purpose bit 11 is unset, the file name and comment should conform to the original ZIP character encoding. If general purpose bit 11 is set, the filename and comment must support The Unicode Standard, Version 4.1.0 or greater using the character encoding form defined by the UTF-8 storage specification...

https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT
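To make the flag concrete, here is a minimal sketch of testing bit 11 on a local file header. The helper name `isUtf8FlagSet` is hypothetical and not part of zip.js; the 2-byte flag at offset 6 and the little-endian signature follow the ZIP local file header layout.

```javascript
// Bit 11 (mask 0x0800) of the general purpose bit flag marks UTF-8 names/comments.
const UTF8_FLAG = 0x0800;
// "PK\x03\x04" read as a little-endian 32-bit integer.
const LOCAL_FILE_HEADER_SIGNATURE = 0x04034b50;

// Hypothetical helper (not a zip.js API): `bytes` is a Uint8Array that
// contains a local file header starting at `offset`.
function isUtf8FlagSet(bytes, offset = 0) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  if (view.getUint32(offset, true) !== LOCAL_FILE_HEADER_SIGNATURE) {
    throw new Error("Not a local file header");
  }
  // The general purpose bit flag is the 2-byte field at offset 6.
  const flags = view.getUint16(offset + 6, true);
  return (flags & UTF8_FLAG) !== 0;
}
```

An archive written by the OSX "Compress" option would be expected to return false here even when its names are UTF-8, which is the problem described above.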

Solutions?
After some more digging I thought we might be able to just assume UTF-8 always (as ASCII is a subset), but since the default is extended ASCII (Code Page 437) we probably shouldn't, as bytes 128-255 mean something different in CP437 than they do in UTF-8.

Some other ideas:

  1. Allow an override option (we can probably have a good guess and set it if they're using a browser on a mac)
  2. Expose another property that returns the original byte array of the filename
  3. Attempt to detect UTF-8 and use it when the bytes are valid. May work most of the time?
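Idea 3 could be sketched roughly as follows, assuming the raw filename bytes are available as a Uint8Array. `decodeFilename` and `decodeAsByteChars` are hypothetical names, not zip.js functions. The strict `TextDecoder` throws on byte sequences that are not valid UTF-8, which is what drives the fallback.

```javascript
// Fallback used when the bytes are not valid UTF-8: map each byte to the
// character with the same code (similar to the current ASCII-style parsing).
function decodeAsByteChars(bytes) {
  return Array.from(bytes, (b) => String.fromCharCode(b)).join("");
}

// Hypothetical helper: try strict UTF-8 first, fall back otherwise.
function decodeFilename(bytes, fallbackDecode = decodeAsByteChars) {
  try {
    // fatal: true makes decode() throw a TypeError on malformed input
    // instead of silently inserting U+FFFD replacement characters.
    return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  } catch (e) {
    return fallbackDecode(bytes);
  }
}
```

Pure ASCII names always pass the UTF-8 check, so they are unaffected. A name in another encoding can still happen to be valid UTF-8 by coincidence, so this is a heuristic rather than a guarantee.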
@duanyao
Contributor

duanyao commented Jul 16, 2015

In practice, if bit 11 is unset, the encoding of the file name can be anything compatible with ASCII, not just Code Page 437. For example, on Windows with locale zh-CN, 7-zip uses GBK (the default encoding of this locale on Windows) to encode file names. I believe most compression tools on Windows also do that. So there is no way to determine the file name encoding of a zip file from an unknown source if bit 11 is unset.

If we assume UTF-8 even if bit 11 is unset, zip files using CP437 will break (I doubt there are many now), those created by OSX are rescued, and most of those created on non-English Windows stay as broken as they are today. So it seems not a bad idea.

Exposing the original bytes of the filename is easy, but there is no reliable way to guess the actual encoding (especially in browser), so it is not very useful currently.

@timclipsham
Author

Exposing the original data is useful because in our application we can detect the presence of a __MACOSX folder inside the zip and then assume, with a high level of confidence, that the archive was made on OSX. We can then decode the file names ourselves as UTF-8, ignoring the decoded strings given by zip.js.
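A minimal sketch of that application-side workaround, assuming each entry carries the library's decoded `filename` plus the raw name bytes in a property called `rawFilename` (assumed here to be a Uint8Array). The function names are hypothetical.

```javascript
// "__MACOSX/" is pure ASCII, so it survives whichever ASCII-compatible
// decoding the library applied to the name.
function looksLikeMacArchive(entries) {
  return entries.some((entry) => entry.filename.startsWith("__MACOSX/"));
}

// Hypothetical helper: returns one display name per entry.
// Assumes entry.rawFilename is a Uint8Array of the undecoded name bytes.
function decodeEntryNames(entries) {
  const utf8 = new TextDecoder("utf-8");
  const fromMac = looksLikeMacArchive(entries);
  return entries.map((entry) =>
    fromMac ? utf8.decode(entry.rawFilename) : entry.filename
  );
}
```

Archives without a __MACOSX folder keep whatever names the library produced, so the heuristic only changes behaviour for archives that look like they came from OSX.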

Although the above is less ideal than your suggestion of just assuming they're UTF-8 in the first place.

@gildas-lormeau any thoughts?

@Rob--W
Collaborator

Rob--W commented Jul 17, 2015

What about a new options object, used as follows (e.g. using the same decoders for filename and comment):

function decodeMyStuff(isBit11Set, rawOctetString) {
    return rawOctetString; // TODO: Decode
}
zipReader.getEntries({
    decodeFilename: decodeMyStuff,
    decodeComment: decodeMyStuff
}, function(entries) { ... });
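For illustration, one possible decoder with the proposed `(isBit11Set, rawOctetString)` shape is sketched below. It assumes `rawOctetString` is a string whose character codes are the raw bytes; the options object itself is only a proposal in this thread, and the names `octetStringToBytes` and `decodeUtf8WhenPossible` are hypothetical.

```javascript
// Convert a string of octets (one char code per byte) into a Uint8Array.
function octetStringToBytes(rawOctetString) {
  const bytes = new Uint8Array(rawOctetString.length);
  for (let i = 0; i < rawOctetString.length; i++) {
    bytes[i] = rawOctetString.charCodeAt(i) & 0xff;
  }
  return bytes;
}

// Hypothetical decoder matching the proposed callback signature.
function decodeUtf8WhenPossible(isBit11Set, rawOctetString) {
  const bytes = octetStringToBytes(rawOctetString);
  if (isBit11Set) {
    // The archive declares UTF-8, so decode leniently.
    return new TextDecoder("utf-8").decode(bytes);
  }
  try {
    // No declaration: accept UTF-8 only if the bytes are strictly valid.
    return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  } catch (e) {
    // Otherwise hand back the octets unchanged.
    return rawOctetString;
  }
}
```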

Or a low-level hook to be run before/after a zip entry is created, which allows the caller to "patch" a zip file before/after it is processed, e.g. (before)

if (data.view.getUint32(index) != 0x504b0102) {

or (after)

entries.push(entry);

@duanyao
Contributor

duanyao commented Jul 17, 2015

The hard parts of this issue are:

  • Figure out the actual encoding of file names
    This may be achieved by running an encoding detection algorithm. Such an algorithm needs sufficient text as input to deliver a reliable result, so one may want to input all raw file names at once instead of one by one. So I think it is better to simply expose filenameBytes and isBit11Set via Entry.
  • Decode file names in that encoding
    For non-UTF encodings, this usually means looking up in a large table, such as GB18030 table for Chinese encoding. Encoding detection algorithm also needs such tables.

However, the algorithms above are quite complicated and require a large amount of data, so I doubt a lot of web apps want to do that.
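As a much simpler stand-in for real encoding detection, the sketch below looks at all raw file names of an archive together: if every name is valid UTF-8 it picks UTF-8, otherwise it uses a caller-supplied fallback label. This is only a UTF-8 validity check, not a detection algorithm. The function names are hypothetical, and the fallback label must be one the runtime's `TextDecoder` supports.

```javascript
// Decide on one encoding for the whole archive.
// rawNames: array of Uint8Array, one per entry.
function pickEncoding(rawNames, fallbackLabel) {
  const strict = new TextDecoder("utf-8", { fatal: true });
  try {
    for (const bytes of rawNames) {
      strict.decode(bytes); // throws on the first name that is not valid UTF-8
    }
    return "utf-8";
  } catch (e) {
    return fallbackLabel;
  }
}

// Decode every name with the encoding chosen for the archive as a whole.
function decodeAllNames(rawNames, fallbackLabel) {
  const decoder = new TextDecoder(pickEncoding(rawNames, fallbackLabel));
  return rawNames.map((bytes) => decoder.decode(bytes));
}
```

An application that expects archives from zh-CN Windows users might pass "gbk" as the fallback label, accepting that the guess is wrong for other legacy encodings.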

See also Zip files and Encoding – I hate you, and Non-UTF-8 encoding in ZIP file.

@Rob--W
Collaborator

Rob--W commented Nov 30, 2016

I created a zip file with a UTF-8 filename, using the zip command on OS X, and confirmed that bit 11 is NOT set. I checked whether the file name may be hidden somewhere else (in an extra field called "up" - see "(UPath) 0x7075" in https://fossies.org/linux/zip/proginfo/extrafld.txt), but did not find anything else (there are two extra fields, a timestamp (UT) and permission bits (ux), but these are of course not relevant):

00000000: 504b 0304 0a00 0000 0000 9b03 7e49 0000  PK..........~I..
00000010: 0000 0000 0000 0000 0000 0400 1c00 f09f  ................
00000020: 92a9 5554 0900 0335 0f3e 5835 0f3e 5875  ..UT...5.>X5.>Xu
00000030: 780b 0001 04f5 0100 0004 1400 0000 504b  x.............PK
00000040: 0102 1e03 0a00 0000 0000 9b03 7e49 0000  ............~I..
00000050: 0000 0000 0000 0000 0000 0400 1800 0000  ................
00000060: 0000 0000 0000 b081 0000 0000 f09f 92a9  ................
00000070: 5554 0500 0335 0f3e 5875 780b 0001 04f5  UT...5.>Xux.....
00000080: 0100 0004 1400 0000 504b 0506 0000 0000  ........PK......
00000090: 0100 0100 4a00 0000 3e00 0000 0000       ....J...>.....

Trying to default to UTF-8 might be a good option since it is compatible with ASCII (not extended ASCII).
We don't need to bundle large tables if we use the new-ish TextDecoder API.
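As a rough check of this, the sketch below takes the first 34 bytes of the hex dump above (the local file header plus the 4-byte name), reads the flag and name length, and decodes the name with `TextDecoder`. The helper name `readLocalFileName` is hypothetical; the field offsets (flag at 6, name length at 26, name at 30) follow the ZIP local file header layout.

```javascript
// First 34 bytes of the dump: 30-byte local file header + 4-byte file name.
const localHeader = Uint8Array.from([
  0x50, 0x4b, 0x03, 0x04, // signature "PK\x03\x04"
  0x0a, 0x00,             // version needed to extract
  0x00, 0x00,             // general purpose bit flag (bit 11 unset)
  0x00, 0x00,             // compression method
  0x9b, 0x03, 0x7e, 0x49, // last modified time and date
  0, 0, 0, 0,             // CRC-32
  0, 0, 0, 0,             // compressed size
  0, 0, 0, 0,             // uncompressed size
  0x04, 0x00,             // file name length = 4
  0x1c, 0x00,             // extra field length = 28
  0xf0, 0x9f, 0x92, 0xa9, // file name bytes
]);

// Hypothetical helper: decode the name of the local header at offset 0 as UTF-8.
function readLocalFileName(bytes) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const flags = view.getUint16(6, true);
  const nameLength = view.getUint16(26, true);
  const rawName = bytes.subarray(30, 30 + nameLength);
  return {
    bit11Set: (flags & 0x0800) !== 0,
    name: new TextDecoder("utf-8").decode(rawName),
  };
}

const result = readLocalFileName(localHeader);
// result.bit11Set is false, and result.name is the single character U+1F4A9.
```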

@greggman

One random idea, why not just return the binary data (TypedArray) of the filename as part of Entry and whatever other bits/flags that might be useful and let users of the library apply whatever algorithms they want to that binary data of the name? Then this library doesn't have to decide how to handle the names but would make it easy for others to decide what they want to do?

@gildas-lormeau
Owner

As suggested by @greggman, I added a new property named rawFilename which will allow the user to decode the string properly.
