Filename characters are displaying incorrectly #8

Open
adyster opened this issue Sep 17, 2022 · 5 comments

@adyster

adyster commented Sep 17, 2022

Filenames that contain non-ASCII characters (e.g. ä, é, è, ö) are not listed correctly.

When I run the regular command-line 7z, it works fine:
7z l Archive.7z

I get the following output:

7-Zip [64] 17.04 : Copyright (c) 1999-2021 Igor Pavlov : 2017-08-28
p7zip Version 17.04 (locale=utf8,Utf16=on,HugeFiles=on,64 bits,8 CPUs LE)

Scanning the drive for archives:
1 file, 25043856 bytes (24 MiB)

Listing archive: Archive.7z

--
Path = Archive.7z
Type = 7z
Physical Size = 25043856
Headers Size = 250
Method = LZMA2:24
Solid = +
Blocks = 1

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2022-09-17 08:23:47 ....A      6104109     25043606  image-one.png
2022-09-17 08:22:45 ....A      4844990               imagè-thréê.png
2022-09-17 08:23:25 ....A      8346941               imäge-twö.png
2022-09-17 08:22:08 ....A      5759066               ìmágé-fòûr.png
------------------- ----- ------------ ------------  ------------------------

But when I run the 7z-wasm version:
npx 7z-wasm l Archive.7z

I get the following output:

7-Zip (z) 22.01 (LE) : Copyright (c) 1999-2022 Igor Pavlov : 2022-07-15
 32-bit ILP32 locale=C.UTF-8 Threads:1

Scanning the drive for archives:
1 file, 25043856 bytes (24 MiB)

Listing archive: Archive.7z

--
Path = Archive.7z
Type = 7z
Physical Size = 25043856
Headers Size = 250
Method = LZMA2:24
Solid = +
Blocks = 1

   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2022-09-17 08:23:47 ....A      6104109     25043606  image-one.png
2022-09-17 08:22:45 ....A      4844990               imagᅢᄄ-thrᅢ랡.png
2022-09-17 08:23:25 ....A      8346941               imᅢᄂge-twᅢᄊ.png
2022-09-17 08:22:08 ....A      5759066               ᅢᆲmᅢᄀgᅢᄅ-fᅢ배ᄏr.png
------------------- ----- ------------ ------------  ------------------------

I wonder if it has something to do with the locale? Is C.UTF-8 more restrictive? How can that be changed? Any ideas?

@use-strict
Owner

I created an archive containing a single file named üöäăîș.txt. Listing the archive does indeed produce incorrect output (most likely stdout is being treated as ASCII?). Extracting the archive, however, produced the correct filenames. This worked both when extracting all files implicitly and when extracting explicitly by filename.

As a workaround, you could extract the archive and traverse the output directory instead. Of course, assuming the overhead is acceptable.

@adyster
Author

adyster commented Sep 18, 2022

Ok, thanks for the suggestion.

I think the garbled names can be converted back to UTF-8. I tried this utility and it converts the names correctly:
https://onlineutf8tools.com/convert-ascii-to-utf8

@adyster
Author

adyster commented Sep 18, 2022

FYI, if anyone else stumbles upon this issue, there's a simple little utility that can convert the garbled output back to UTF-8 here:
https://github.com/mathiasbynens/utf8.js
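
For anyone who prefers not to add a dependency, the same repair can be sketched with Node built-ins. This is a minimal sketch, assuming the garbled characters come from each signed output byte being passed straight to String.fromCharCode (as later comments in this thread suggest); `repairListing` is a hypothetical name, not part of 7z-wasm or utf8.js.

```javascript
// Hypothetical helper: recover the original text from a listing in which
// every UTF-8 byte was turned into its own character.
// Assumption: each garbled character's low 8 bits still hold the original byte.
function repairListing(garbled) {
  // Mask each character code down to one byte, rebuilding the raw UTF-8 stream.
  const bytes = Uint8Array.from(garbled, (ch) => ch.charCodeAt(0) & 0xff);
  // Decode the byte stream as UTF-8.
  return new TextDecoder("utf-8").decode(bytes);
}
```

Plain ASCII characters pass through unchanged, but text that was never garbled and legitimately contains characters above U+00FF would be damaged, so this should only be applied to output from the affected listing.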

@use-strict
Owner

The culprit is this callback: https://github.com/use-strict/7z-wasm/blob/master/cli.js#L21

It turns out Emscripten spits out individual signed bytes, so we need to decode the UTF-8 encoded output manually instead of treating each byte as an ASCII character.

Due to lack of time, I'm going to treat this as low priority, but I am accepting PRs.

The solution is to decode UTF-8 on the fly, character by character. We DON'T want to buffer the output and print it all at once before exiting.

  • positive bytes represent 7-bit ASCII codes and are printed as they are (String.fromCharCode(byte))
  • negative bytes should be grouped and pushed to a temporary byte array (a UTF-8 character can have up to 4 bytes) for further processing
  • when we encounter a positive byte or 0 (null), empty the temporary array into a regular UTF-8 TextDecoder and print out the resulting character.
  • I'm assuming the output cannot terminate with a non-ASCII character. 7-Zip appears to always terminate the output with a newline (\n) anyway.
  • TextDecoder requires Node 11+ and bumping the engine requirement of the package.
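
The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual cli.js code: `createUtf8Printer` and its `write` parameter are hypothetical names, and the callback is assumed to receive one signed byte per call, with `null` as a flush signal.

```javascript
// Hypothetical factory: returns a per-byte callback that decodes UTF-8 on the fly.
// `write` is whatever prints a string, e.g. (s) => process.stdout.write(s).
function createUtf8Printer(write) {
  const decoder = new TextDecoder("utf-8");
  const pending = []; // bytes of the non-ASCII run currently being collected

  // Decode and print whatever non-ASCII bytes have been collected so far.
  const flushPending = () => {
    if (pending.length === 0) return;
    write(decoder.decode(Uint8Array.from(pending)));
    pending.length = 0;
  };

  return (byte) => {
    // null (flush signal) or 0: nothing printable, just empty the temporary array.
    if (byte === null || byte === 0) {
      flushPending();
      return;
    }
    // Negative (or > 127) values are part of a multi-byte UTF-8 sequence.
    if (byte < 0 || byte > 0x7f) {
      pending.push(byte & 0xff); // store as an unsigned byte
      return;
    }
    // 7-bit ASCII: flush any pending sequence first, then print as-is.
    flushPending();
    write(String.fromCharCode(byte));
  };
}
```

Because pending bytes are only flushed when an ASCII byte or a flush signal arrives, this relies on the same assumption as the list above: the output ends with an ASCII character such as the trailing newline.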

@RixInGithub

You can try exposing a runtime function. I know that a UTF8ArrayToString function exists, but it only takes ArrayBuffer views. My idea is that we collect every byte into a Uint8Array, then pass the array to UTF8ArrayToString to get a string, which may solve the problem.
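
A rough sketch of that collect-then-decode idea is below. It makes two assumptions that are not from this thread: bytes are flushed at each newline (so output still appears line by line rather than all at once), and TextDecoder stands in for Emscripten's UTF8ArrayToString so the example runs without the wasm module. `createLineBufferedDecoder` is a hypothetical name.

```javascript
// Hypothetical factory: collects output bytes and decodes them one line at a time.
// `write` is whatever prints a string, e.g. (s) => process.stdout.write(s).
function createLineBufferedDecoder(write) {
  const decoder = new TextDecoder("utf-8"); // stand-in for UTF8ArrayToString
  let bytes = [];

  // Turn the collected bytes into a Uint8Array, decode them, and print the result.
  const flush = () => {
    if (bytes.length === 0) return;
    write(decoder.decode(new Uint8Array(bytes)));
    bytes = [];
  };

  return (byte) => {
    if (byte === null) {
      flush(); // explicit flush signal
      return;
    }
    bytes.push(byte & 0xff); // normalise signed bytes to 0..255
    if (byte === 0x0a) flush(); // newline completes a line
  };
}
```

Flushing on newlines keeps multi-byte sequences intact, since a newline byte never occurs inside a UTF-8 multi-byte character.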
