Filename characters are displaying incorrectly #8

adyster · 2022-09-17T21:36:59Z

Filenames that contain non-English characters (e.g. ä,é,è,ö ...) are not listed correctly.

When I run normal command line 7z, it works fine:
7z l Archive.7z

I get the following output:

7-Zip [64] 17.04 : Copyright (c) 1999-2021 Igor Pavlov : 2017-08-28
p7zip Version 17.04 (locale=utf8,Utf16=on,HugeFiles=on,64 bits,8 CPUs LE)

Scanning the drive for archives:
1 file, 25043856 bytes (24 MiB)

Listing archive: Archive.7z

--
Path = Archive.7z
Type = 7z
Physical Size = 25043856
Headers Size = 250
Method = LZMA2:24
Solid = +
Blocks = 1

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2022-09-17 08:23:47 ....A      6104109     25043606  image-one.png
2022-09-17 08:22:45 ....A      4844990               imagè-thréê.png
2022-09-17 08:23:25 ....A      8346941               imäge-twö.png
2022-09-17 08:22:08 ....A      5759066               ìmágé-fòûr.png
------------------- ----- ------------ ------------  ------------------------

But when I run the 7z-wasm version:
npx 7z-wasm l Archive.7z

I get the following output:

7-Zip (z) 22.01 (LE) : Copyright (c) 1999-2022 Igor Pavlov : 2022-07-15
 32-bit ILP32 locale=C.UTF-8 Threads:1

Scanning the drive for archives:
1 file, 25043856 bytes (24 MiB)

Listing archive: Archive.7z

--
Path = Archive.7z
Type = 7z
Physical Size = 25043856
Headers Size = 250
Method = LZMA2:24
Solid = +
Blocks = 1

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2022-09-17 08:23:47 ....A      6104109     25043606  image-one.png
2022-09-17 08:22:45 ....A      4844990               imagￃﾨ-thrￃﾩￃﾪ.png
2022-09-17 08:23:25 ....A      8346941               imￃﾤge-twￃﾶ.png
2022-09-17 08:22:08 ....A      5759066               ￃﾬmￃﾡgￃﾩ-fￃﾲￃﾻr.png
------------------- ----- ------------ ------------  ------------------------

I wonder if it has something to do with the locale? Is C.UTF-8 more restrictive? How can that be changed? Any ideas?


use-strict · 2022-09-18T07:20:11Z

I created an archive containing a single file named üöäăîș.txt. Listing the archive indeed produces incorrect output (most likely stdout is encoded as ASCII?). Extracting the archive however produced the correct filenames. This worked when extracting all files implicitly or explicitly by filename.

As a workaround, you could extract the archive and traverse the output directory instead. Of course, assuming the overhead is acceptable.
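A minimal sketch of that workaround, assuming the archive has already been extracted with something like `npx 7z-wasm x Archive.7z -oout`. The `out` directory name and the `listFiles` helper are made up for illustration. The code only walks the directory with Node's `fs` module, so the names come from the filesystem rather than from the garbled listing:

```javascript
const fs = require("fs");
const path = require("path");

// Hypothetical helper: recursively collect relative file paths under `dir`.
function listFiles(dir, base = dir) {
  const result = [];
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      result.push(...listFiles(full, base));
    } else {
      result.push(path.relative(base, full));
    }
  }
  return result;
}

// Print every file under the (assumed) extraction directory, if it exists.
if (fs.existsSync("out")) {
  for (const name of listFiles("out")) console.log(name);
}
```

The cost is the one mentioned above: the whole archive has to be unpacked just to learn its file names.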

adyster · 2022-09-18T07:34:41Z

Ok, thanks for the suggestion.

I think the ASCII output can be converted to UTF-8. I tried this utility and it converts the names correctly:
https://onlineutf8tools.com/convert-ascii-to-utf8

adyster · 2022-09-18T10:12:43Z

FYI, if anyone else stumbles upon this issue, there's a simple little utility that can convert the ASCII to UTF-8 here:
https://github.com/mathiasbynens/utf8.js
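For reference, the same repair can probably be done without an extra dependency. The sketch below uses Node's built-in `TextDecoder` rather than utf8.js, and `repairListingText` is a hypothetical name. It assumes every character in the garbled listing stands for exactly one byte of the original UTF-8 output, which matches the listing shown in the report:

```javascript
// Hypothetical helper: undo the garbling seen in the listing.
// Assumption: each character carries one original UTF-8 byte in its
// low 8 bits (plain ASCII characters already satisfy this).
function repairListingText(garbled) {
  const bytes = Uint8Array.from(garbled, (ch) => ch.charCodeAt(0) & 0xff);
  return new TextDecoder("utf-8").decode(bytes);
}

// "\uFFC3\uFFA4" and "\uFFC3\uFFB6" are the garbled pairs from the listing.
console.log(repairListingText("im\uFFC3\uFFA4ge-tw\uFFC3\uFFB6.png")); // imäge-twö.png
```

This only post-processes text that has already been printed; it does not change what 7z-wasm writes.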

use-strict · 2022-09-18T12:27:38Z

The culprit is this callback: https://github.com/use-strict/7z-wasm/blob/master/cli.js#L21

It turns out Emscripten spits out individual signed bytes, so we need to decode the UTF-8 encoded output manually instead of treating each byte as an ASCII character.
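To illustrate the effect, here is a small standalone sketch (not the code from cli.js) showing what happens when the two UTF-8 bytes of "ä" arrive as signed values and each is passed to `String.fromCharCode` separately:

```javascript
// UTF-8 bytes of "ä" are 0xC3 0xA4 (195, 164).
const unsigned = Array.from(Buffer.from("ä", "utf8"));

// Reinterpret each byte as a signed 8-bit value: [-61, -92].
const signed = Array.from(new Int8Array(unsigned));

// String.fromCharCode truncates its argument to 16 bits, so -61 becomes
// 0xFFC3 and -92 becomes 0xFFA4: two unrelated characters instead of one.
const garbled = String.fromCharCode(...signed);

console.log(signed);                               // [ -61, -92 ]
console.log(garbled.length);                       // 2
console.log(garbled.codePointAt(0).toString(16));  // ffc3
console.log(garbled.codePointAt(1).toString(16));  // ffa4
```

Those two code points are the same pair that appears in place of "ä" in the 7z-wasm listing above.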

Due to lack of time, I'm going to treat this as low priority, but I am accepting PRs.

The solution is to decode UTF-8 on the fly, character by character. We DON'T want to buffer the output and print it all at once before exiting.

- Positive bytes represent 7-bit ASCII codes and are printed as they are (String.fromCharCode(byte)).
- Negative bytes should be grouped and pushed to a temporary byte array (a UTF-8 character can have up to 4 bytes) for further processing.
- When we encounter a positive byte or 0 (null), empty the temporary array into a regular UTF-8 TextDecoder and print out the resulting character.
I'm assuming the output cannot terminate with a non-ASCII character. 7-Zip appears to always terminate the output with a newline (\n) anyway.
TextDecoder requires Node 11+ and bumping the engine requirement of the package.
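The steps above can be sketched roughly as follows. This is an untested-against-7z-wasm illustration, not the actual fix: `createStdoutDecoder` and the `write` parameter are hypothetical names, and treating a `null` or 0 byte as "flush only, print nothing" is an assumption.

```javascript
// Hypothetical factory: returns a per-byte callback that decodes UTF-8
// on the fly and hands the resulting text to `write`.
function createStdoutDecoder(write) {
  const decoder = new TextDecoder("utf-8");
  let pending = []; // bytes of the non-ASCII run collected so far

  return function onByte(byte) {
    if (byte !== null && byte < 0) {
      // Negative signed byte: part of a multi-byte UTF-8 sequence.
      pending.push(byte & 0xff);
      return;
    }
    // ASCII byte, 0, or null: first emit whatever was collected.
    if (pending.length > 0) {
      write(decoder.decode(Uint8Array.from(pending)));
      pending = [];
    }
    // Assumption: null/0 only signals a flush and is not printed.
    if (byte === null || byte === 0) return;
    write(String.fromCharCode(byte));
  };
}

// Usage: the signed bytes of "imäge\n", fed one at a time.
const decodeByte = createStdoutDecoder((text) => process.stdout.write(text));
for (const b of [105, 109, -61, -92, 103, 101, 10]) decodeByte(b);
```

Because pending bytes are only emitted when the next ASCII byte arrives, the sketch relies on the same assumption stated above: the output ends with an ASCII character such as a newline.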

RixInGithub · 2025-01-03T11:03:24Z

You can try exposing a runtime function. I know that a UTF8ArrayToString function exists, but it only takes ArrayBuffer views. My idea is that we collect every byte into a Uint8Array and then convert the array into a string with UTF8ArrayToString, which may solve the problem.
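A rough sketch of that idea, with two assumptions stated up front: the built-in `TextDecoder` stands in for `UTF8ArrayToString` (whether the build exposes that runtime function is not confirmed in this thread), and the buffer is flushed at every newline so output is not held back until exit. `createLineBufferedDecoder` is a hypothetical name.

```javascript
// Hypothetical factory: collects bytes and decodes them one line at a time.
function createLineBufferedDecoder(write) {
  const decoder = new TextDecoder("utf-8");
  let bytes = [];

  const flush = () => {
    if (bytes.length === 0) return;
    // TextDecoder is used here in place of UTF8ArrayToString.
    write(decoder.decode(Uint8Array.from(bytes)));
    bytes = [];
  };

  return function onByte(byte) {
    if (byte === null) {
      flush(); // assumption: null is a flush request
      return;
    }
    const unsigned = byte & 0xff; // signed byte back to 0..255
    bytes.push(unsigned);
    if (unsigned === 10) flush(); // newline completes a line
  };
}

// Usage: the signed bytes of "ü.txt\n".
const onByte = createLineBufferedDecoder((text) => process.stdout.write(text));
for (const b of [-61, -68, 46, 116, 120, 116, 10]) onByte(b);
```

Compared with grouping only the negative bytes, this delays output by up to one line but never splits a multi-byte character across two decode calls, as long as a line ends with a newline.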

use-strict mentioned this issue Mar 16, 2023

Windows gbk encoded zip archive decompression exception #15

Closed
