Unicode character support in filenames when reading a zip created in OSX #131

timclipsham · 2015-07-16T15:59:23Z

Expected: Given I've created a zip file using the out-of-the-box zip functionality in OSX, when I read it and examine the file names of the files in the zip, then it should return the same name correctly encoded.

Actual: The file names are parsed as if they are ASCII and not UTF-8.

I'm currently running OS X 10.10.4.

Why is this the case?
There's a configuration bit set in a zip file entry that describes whether the file names and comment must conform to UTF-8 (the specifics below). In OSX, the "Compress" option that comes out of the box (selecting files, right click, "Compress") doesn't seem to set this bit. Currently the zip.js library parses strings as ASCII unless this bit is set.

D.2 If general purpose bit 11 is unset, the file name and comment should conform to the original ZIP character encoding. If general purpose bit 11 is set, the filename and comment must support The Unicode Standard, Version 4.1.0 or greater using the character encoding form defined by the UTF-8 storage specification...

https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT
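For illustration, the bit 11 check described above can be sketched as follows. This is a minimal sketch, not the zip.js source: the helper names `isUtf8FlagSet` and `readCentralDirectoryFlags` are hypothetical, and the code assumes a `DataView` positioned over a central directory file header, where the general purpose bit flag is a little-endian 16-bit field at offset 8.

```javascript
// Bit 11 of the general purpose bit flag (the "language encoding flag").
const UTF8_FLAG = 1 << 11; // 0x0800

// Hypothetical helper: true when the entry declares UTF-8 names and comments.
function isUtf8FlagSet(generalPurposeFlags) {
  return (generalPurposeFlags & UTF8_FLAG) !== 0;
}

// Hypothetical helper: read the flags from a central directory file header.
// Layout: signature (4 bytes), version made by (2), version needed (2), flags (2).
function readCentralDirectoryFlags(view, headerOffset) {
  return view.getUint16(headerOffset + 8, true); // ZIP fields are little-endian
}

// Example: a header fragment whose flags field is 0x0800.
const headerFragment = new Uint8Array([
  0x50, 0x4b, 0x01, 0x02, // signature "PK\x01\x02"
  0x1e, 0x03,             // version made by
  0x0a, 0x00,             // version needed to extract
  0x00, 0x08,             // general purpose bit flag = 0x0800 (little-endian)
]);
const fragmentFlags = readCentralDirectoryFlags(new DataView(headerFragment.buffer), 0);
console.log(isUtf8FlagSet(fragmentFlags)); // true
```

An archive produced by the OSX "Compress" option would yield `false` here even when its names are UTF-8, which is the mismatch this issue describes.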

Solutions?
After some more digging I thought we might be able to just assume UTF-8 always (as ASCII is a subset), but since the default is extended ASCII (Code Page 437) we probably shouldn't as characters 128-255 will be different to those in UTF-8.

Some other ideas:

Allow an override option (we can probably have a good guess and set it if they're using a browser on a mac)
Expose another property that returns the original byte array of the filename
Attempt to detect for UTF-8 and then use it. May work most of the time?
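The third idea (detect UTF-8, then use it) could look roughly like this. It is a sketch under assumptions: `tryDecodeUtf8` is a hypothetical helper, and it relies on the standard `TextDecoder` with `fatal: true`, which throws on malformed input instead of substituting U+FFFD. Strict validation can still accept bytes that were written in another encoding but happen to form valid UTF-8, so this is a heuristic rather than a guarantee.

```javascript
// Hypothetical helper: return the decoded string if `bytes` is well-formed
// UTF-8, or null otherwise.
function tryDecodeUtf8(bytes) {
  try {
    return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  } catch (e) {
    return null; // malformed UTF-8: the caller should fall back to another encoding
  }
}

// "é.txt" encoded as UTF-8: 0xC3 0xA9 is a valid two-byte sequence.
const utf8Name = new Uint8Array([0xc3, 0xa9, 0x2e, 0x74, 0x78, 0x74]);
// "é.txt" in Code Page 437: a lone 0x82 byte is not valid UTF-8.
const cp437Name = new Uint8Array([0x82, 0x2e, 0x74, 0x78, 0x74]);

console.log(tryDecodeUtf8(utf8Name));  // "é.txt"
console.log(tryDecodeUtf8(cp437Name)); // null
```

Pure ASCII names pass the check trivially, which is harmless because ASCII decodes identically under UTF-8 and CP437.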

duanyao · 2015-07-16T18:00:56Z

In practice, if bit 11 is unset, the file name encoding can be anything compatible with ASCII, not just Code Page 437. For example, on Windows with the zh-CN locale, 7-Zip uses GBK (the default encoding of this locale on Windows) to encode file names. I believe most compression tools on Windows do the same. So there is no way to determine the file name encoding of a zip file from an unknown source if bit 11 is unset.

If we assume UTF-8 even when bit 11 is unset, zip files using CP437 will break (I doubt there are many now), those created by OSX are rescued, and most of those created on non-English Windows stay broken, as they are today. So it does not seem a bad idea.

Exposing the original bytes of the filename is easy, but there is no reliable way to guess the actual encoding (especially in a browser), so it is not very useful currently.

timclipsham · 2015-07-17T05:41:32Z

Exposing the original data is useful because in our application we can detect the presence of __MACOSX folder inside the zip and then assume with a high level of confidence the zip archive was made on OSX and then decode the file names ourselves as UTF-8, ignoring the decoded strings given by zip.js.

Although the above is less ideal than your suggestion of just assuming they're UTF-8 in the first place.
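The `__MACOSX` heuristic described in this comment might be sketched like this. The entry shape is an assumption made for illustration: `filename` is the string as decoded by the library and `rawFilename` is a hypothetical `Uint8Array` holding the undecoded name bytes. The helper names are hypothetical too.

```javascript
// Assumed entry shape: { filename: string, rawFilename: Uint8Array }.

// The marker folder name is pure ASCII, so it reads the same under any
// ASCII-compatible decoding.
function looksLikeMacArchive(entries) {
  return entries.some((entry) => entry.filename.startsWith("__MACOSX/"));
}

// Return display names, re-decoding the raw bytes as UTF-8 for Mac archives
// and keeping the library's strings otherwise.
function resolveFilenames(entries) {
  if (!looksLikeMacArchive(entries)) {
    return entries.map((entry) => entry.filename);
  }
  const decoder = new TextDecoder("utf-8");
  return entries.map((entry) => decoder.decode(entry.rawFilename));
}

// Example with a name that was mis-decoded byte by byte.
const encoder = new TextEncoder();
const sampleEntries = [
  { filename: "caf\u00c3\u00a9.txt", rawFilename: encoder.encode("caf\u00e9.txt") },
  { filename: "__MACOSX/._cafe", rawFilename: encoder.encode("__MACOSX/._cafe") },
];
console.log(resolveFilenames(sampleEntries)); // [ "café.txt", "__MACOSX/._cafe" ]
```

The design choice is to decide once per archive rather than per entry, since the marker folder says something about the tool that wrote the whole file.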

@gildas-lormeau any thoughts?

Rob--W · 2015-07-17T12:24:12Z

What about a new options object, used as follows (e.g. using the same decoders for filename and comment):

// Proposed decoder hook: receives the state of bit 11 and the raw bytes as an octet string.
function decodeMyStuff(isBit11Set, rawOctetString) {
    return rawOctetString; // TODO: Decode
}
zipReader.getEntries({
    decodeFilename: decodeMyStuff,
    decodeComment: decodeMyStuff
}, function(entries) { /* ... */ });

Or a low-level hook to be run before or after a zip entry is created, which allows the caller to "patch" a zip file before or after it is processed, e.g. before

zip.js/WebContent/zip.js, line 658 in 1bead0a:

if (data.view.getUint32(index) != 0x504b0102) {

or after

zip.js/WebContent/zip.js, line 673 in 1bead0a:

entries.push(entry);

duanyao · 2015-07-17T14:53:08Z

The hard parts of this issue are:

Figure out the actual encoding of file names
This may be achieved by running an encoding detection algorithm. Such an algorithm needs sufficient text as input to deliver a reliable result, so one may want to feed it all raw file names at once instead of one by one. So I think it is better to simply expose filenameBytes and isBit11Set via Entry.
Decode file names in that encoding
For non-UTF encodings, this usually means looking up a large table, such as the GB18030 table for Chinese encodings. Encoding detection algorithms also need such tables.

However, the algorithms above are quite complicated and require a large amount of data, so I doubt many web apps want to do that.

See also "Zip files and Encoding – I hate you" and "Non-UTF-8 encoding in ZIP file".

Rob--W · 2016-11-30T00:06:15Z

I created a zip file with a UTF-8 filename, using the zip command on OS X, and confirmed that bit 11 is NOT set. I checked whether the file name might be stored somewhere else (in an extra field called "up"; see "(UPath) 0x7075" in https://fossies.org/linux/zip/proginfo/extrafld.txt), but did not find anything (there are two extra fields, a timestamp (UT) and Unix user/group IDs (ux), but these are of course not relevant):

00000000: 504b 0304 0a00 0000 0000 9b03 7e49 0000  PK..........~I..
00000010: 0000 0000 0000 0000 0000 0400 1c00 f09f  ................
00000020: 92a9 5554 0900 0335 0f3e 5835 0f3e 5875  ..UT...5.>X5.>Xu
00000030: 780b 0001 04f5 0100 0004 1400 0000 504b  x.............PK
00000040: 0102 1e03 0a00 0000 0000 9b03 7e49 0000  ............~I..
00000050: 0000 0000 0000 0000 0000 0400 1800 0000  ................
00000060: 0000 0000 0000 b081 0000 0000 f09f 92a9  ................
00000070: 5554 0500 0335 0f3e 5875 780b 0001 04f5  UT...5.>Xux.....
00000080: 0100 0004 1400 0000 504b 0506 0000 0000  ........PK......
00000090: 0100 0100 4a00 0000 3e00 0000 0000       ....J...>.....
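To make the dump easier to follow, here is a small sketch that parses its first 34 bytes (the local file header plus the file name). The offsets follow the local file header layout in APPNOTE.TXT; the variable names are purely illustrative.

```javascript
// First 34 bytes of the dump: local file header (30 bytes) + file name (4 bytes).
const dumpHex =
  "504b0304 0a00 0000 0000 9b03 7e49 00000000 00000000 00000000 0400 1c00 f09f92a9";
const dumpBytes = Uint8Array.from(
  dumpHex.replace(/\s+/g, "").match(/../g),
  (pair) => parseInt(pair, 16)
);
const dumpView = new DataView(dumpBytes.buffer);

const localSignature = dumpView.getUint32(0, false); // 0x504b0304 when read big-endian
const localFlags = dumpView.getUint16(6, true);      // general purpose bit flag
const nameLength = dumpView.getUint16(26, true);     // 4
const extraLength = dumpView.getUint16(28, true);    // 28 (0x1c): the UT and ux fields
const nameBytes = dumpBytes.subarray(30, 30 + nameLength);

console.log((localFlags & 0x0800) !== 0); // false: bit 11 is not set
// The name bytes f0 9f 92 a9 are nevertheless valid UTF-8 for U+1F4A9.
console.log(new TextDecoder("utf-8", { fatal: true }).decode(nameBytes));
```

So the archive stores a UTF-8 name without announcing it, and a reader that trusts the flag alone will fall back to its legacy decoding.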

Trying to default to UTF-8 might be a good option since it is compatible with ASCII (not extended ASCII).
We don't need to bundle large tables if we use the new-ish TextDecoder API.
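A rough sketch of how `TextDecoder` could stand in for bundled tables, assuming the caller already knows (or has guessed) the encoding label. `decodeWithLabel` is a hypothetical helper. Which labels are available depends on the runtime, so the sketch returns `null` when the constructor rejects the label with a `RangeError`.

```javascript
// Hypothetical helper: decode `bytes` with a caller-chosen encoding label.
// Returns null when the runtime does not support the label.
function decodeWithLabel(bytes, label) {
  let decoder;
  try {
    decoder = new TextDecoder(label);
  } catch (e) {
    if (e instanceof RangeError) return null; // unsupported label
    throw e;
  }
  return decoder.decode(bytes);
}

// "中文" encoded as GBK, as an archiver on zh-CN Windows might store it.
const gbkName = new Uint8Array([0xd6, 0xd0, 0xce, 0xc4]);
console.log(decodeWithLabel(gbkName, "gbk")); // "中文" where the runtime supports the "gbk" label
```

This only covers decoding. Choosing the label in the first place is still the unsolved part discussed earlier in the thread.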

greggman · 2017-12-31T05:17:09Z

One random idea, why not just return the binary data (TypedArray) of the filename as part of Entry and whatever other bits/flags that might be useful and let users of the library apply whatever algorithms they want to that binary data of the name? Then this library doesn't have to decide how to handle the names but would make it easy for others to decide what they want to do?

gildas-lormeau · 2021-01-13T18:44:28Z

As suggested by @greggman, I added a new property named rawFilename which will allow the user to decode the string properly.

duanyao mentioned this issue Jul 25, 2016

unzip the filename not support chinese #152

Closed

This was referenced Dec 31, 2017

unzip the filename not support chinese （解压360压缩的文件名称中文乱码） #181

Closed

Support UTF-8 filename #176

Closed

benoitsan mentioned this issue May 7, 2018

Filename decoding issue for zip files archived by macOS weichsel/ZIPFoundation#63

Closed

TalAloni mentioned this issue Nov 23, 2018

ZipFile: Bugfix: Use unicode encoding to read the name string if UseUnicode is set icsharpcode/SharpZipLib#284

Closed

gildas-lormeau closed this as completed in 0e14812 Jan 13, 2021

rikyoz mentioned this issue May 7, 2024

[Bug]: Some zip file names are garbled rikyoz/bit7z#207

Closed

eliandoran mentioned this issue Jul 30, 2024

Support ignoring Language Encoding flag #521

Merged
