[6.3.0] testing_phase2 Please ignore. #374

Open
iancha1992 opened this issue Oct 10, 2023 · 2 comments

@iancha1992
Owner

Forked from #346

@iancha1992 iancha1992 added this to the 6.3.0 release blockers milestone Oct 10, 2023
@iancha1992
Owner Author

Cherry-picked in bazelbuild#19776

iancha1992 pushed a commit that referenced this issue Nov 8, 2024
Bazel aims to support arbitrary file system path encodings (even raw byte sequences) by attempting to force the JVM to use a Latin-1 locale for OS interactions. As a result, Bazel internally encodes `String`s as raw byte arrays with a Latin-1 coder and no encoding information. Whenever it interacts with encoding-aware APIs, this may require a reencoding of the `String` contents, depending on the OS and availability of a Latin-1 locale.
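The reason a Latin-1 coder can carry arbitrary bytes is that ISO-8859-1 maps every byte value 0..255 to exactly one `char`, so decoding and re-encoding is lossless. The following is a minimal sketch of that property only; the class and method names (`Latin1RoundTrip`, `wrapRawBytes`, `unwrapRawBytes`) are hypothetical and are not taken from Bazel's code.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration, not Bazel's implementation.
final class Latin1RoundTrip {
  /** Wraps arbitrary bytes in a String, one char (U+0000..U+00FF) per byte. */
  static String wrapRawBytes(byte[] raw) {
    return new String(raw, StandardCharsets.ISO_8859_1);
  }

  /** Recovers the original bytes from a String produced by wrapRawBytes. */
  static byte[] unwrapRawBytes(String wrapped) {
    return wrapped.getBytes(StandardCharsets.ISO_8859_1);
  }

  public static void main(String[] args) {
    // 0xFF and 0xFE never occur in valid UTF-8, so these bytes are not a
    // decodable UTF-8 path, yet they survive the Latin-1 round trip.
    byte[] raw = {'a', (byte) 0xC3, (byte) 0xA4, (byte) 0xFF, (byte) 0xFE};
    String wrapped = wrapRawBytes(raw);
    System.out.println(wrapped.length() == raw.length);             // true
    System.out.println(Arrays.equals(raw, unwrapRawBytes(wrapped))); // true
  }
}
```

The same round trip through UTF-8 would replace the invalid bytes with U+FFFD and lose information, which is why a string wrapped this way has to be reencoded before it reaches an encoding-aware API.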

This PR introduces the concepts of *internal*, *Unicode*, and *platform* strings and adds dedicated optimized functions for converting between these three types (see the class comment on the new `StringEncoding` helper class for details). These functions are then used to standardize and fix conversion throughout the code base. As a result, a number of new end-to-end integration tests for the handling of Unicode in file paths, command-line arguments and environment variables now pass.
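To make the distinction between *internal* and *Unicode* strings concrete, here is a rough sketch that assumes the internal form holds UTF-8 bytes, one byte per `char`. The helper names below are hypothetical stand-ins, not the signatures of the `StringEncoding` class, and the sketch leaves out *platform* strings because their conversion depends on the OS and the JVM's `sun.jnu.encoding`.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not the actual StringEncoding API.
final class EncodingSketch {
  /** Unicode string -> "internal" string: each char holds one UTF-8 byte. */
  static String unicodeToInternalSketch(String unicode) {
    return new String(unicode.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
  }

  /** "Internal" string -> Unicode string, assuming the wrapped bytes are valid UTF-8. */
  static String internalToUnicodeSketch(String internal) {
    return new String(internal.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    String unicode = "\u00E4"; // LATIN SMALL LETTER A WITH DIAERESIS
    String internal = unicodeToInternalSketch(unicode);
    // UTF-8 for U+00E4 is the byte pair C3 A4, so the internal form has two chars.
    System.out.println(internal.length());                                 // 2
    System.out.println(internalToUnicodeSketch(internal).equals(unicode)); // true
  }
}
```

Both directions allocate an intermediate byte array, so this shows the semantics of the conversion rather than an optimized implementation.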

Full support for Unicode beyond the current active code page on Windows is left to a follow-up PR as it may require patching the embedded JDK.

* Replace ad-hoc conversion logic with the new consistent set of helper functions.
* Make more parts of the Bazel client's Windows implementation Unicode-aware. This also fixes the behavior of `SetEnv` on Windows, which previously would remove an environment variable if passed an empty value for it, which doesn't match the Unix behavior.
* Drop the `charset` parameter from all methods related to parameter files. The `ISO-8859-1` vs. `UTF-8` choice was flawed since Bazel's internal string representation doesn't maintain any encoding information - `ISO-8859-1` just meant "write out raw bytes", which is the only choice that matches what arguments would look like if passed on the command line.
* Convert server args to the internal string representation. The arguments for requests to the server were already converted to Bazel's internal string representation, but the server's own arguments were not, which resulted in a mismatch between `--client_cwd` and `--workspace_directory` if the workspace path contains non-ASCII characters.
* Read the downloader config using Bazel's filesystem implementation.
* Make `MacOSXFsEventsDiffAwareness` UTF-8 aware. It previously used the `GetStringUTF` JNI method, which, despite its name, doesn't return the UTF-8 representation of a string, but modified CESU-8 (nobody ever wants this).
* Correctly reencode path strings for `LocalDiffAwareness`.
* Correctly reencode the value of `user.dir`.
* Correctly turn `ExecRequest` fields into strings for `ProcessBuilder` for `bazel --batch run`. This makes it possible to reenable the `test_consistent_command_line_encoding` test, fixing bazelbuild#1775.
* Fix encoding issues in `TargetCompleteEvents`.
* Fix encoding issues in `SubprocessFactory` implementations.
* Drop obsolete warning if `file.encoding` doesn't equal `ISO-8859-1` as file names are encoded with `sun.jnu.encoding` now.
* Consistently reencode internal strings passed into and out of `FileSystem` implementations, e.g. if reading a symlink target. Tests are added that verify the interaction between `FileSystem` implementations and the Java (N)IO APIs on Unicode file paths.
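
The `GetStringUTF` point in the list above can be reproduced without JNI. The JNI specification calls the format that function family returns "modified UTF-8", and `DataOutputStream.writeUTF` is documented to write the same format, so the sketch below uses it as a stand-in. The class and method names are hypothetical, and nothing here is taken from `MacOSXFsEventsDiffAwareness`.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration of standard vs. modified UTF-8.
final class ModifiedUtf8Demo {
  static byte[] standardUtf8(String s) {
    return s.getBytes(StandardCharsets.UTF_8);
  }

  static byte[] modifiedUtf8(String s) {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(buffer)) {
      out.writeUTF(s); // writes a 2-byte length prefix, then modified UTF-8
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    byte[] withLength = buffer.toByteArray();
    return Arrays.copyOfRange(withLength, 2, withLength.length);
  }

  public static void main(String[] args) {
    String emoji = "\uD83D\uDE00"; // U+1F600, outside the Basic Multilingual Plane
    // Standard UTF-8 encodes the code point in 4 bytes; modified UTF-8 encodes
    // each surrogate separately in 3 bytes, giving 6 bytes in total.
    System.out.println(standardUtf8(emoji).length); // 4
    System.out.println(modifiedUtf8(emoji).length); // 6
  }
}
```

A file path containing such a character therefore has different byte representations in the two formats, so a comparison against a path that is actually UTF-8 fails.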

Fixes bazelbuild#1775.

Fixes bazelbuild#11602.

Fixes bazelbuild#18293.

Work towards #374.

Work towards bazelbuild#23859.

Closes bazelbuild#24010.

PiperOrigin-RevId: 694114597
Change-Id: I5bdcbc14a90dd1f0f34698aebcbd07cd2bde7a23

Thank you for contributing to the Bazel repository! This issue has been marked as stale since it has not had any activity in the last 1+ years. It will be closed in the next 90 days unless any other activity occurs. If you think this issue is still relevant and should stay open, please post any comment here and the issue will no longer be marked as stale.

@github-actions github-actions bot added the stale label Dec 14, 2024
iancha1992 pushed a commit that referenced this issue Jan 24, 2025
Work towards #374

Closes bazelbuild#24935.

PiperOrigin-RevId: 718549143
Change-Id: Ibe6c685a2f8dd75430cae7f770d392de35bdeb68
iancha1992 pushed a commit that referenced this issue Jan 29, 2025
…bazelbuild#25097)

Work towards #374

Closes bazelbuild#24935.

PiperOrigin-RevId: 718549143
Change-Id: Ibe6c685a2f8dd75430cae7f770d392de35bdeb68

Co-authored-by: Fabian Meumertzheim <fabian@meumertzhe.im>
iancha1992 pushed a commit that referenced this issue Jan 30, 2025
If enabled (or set to `error`), Bazel fails if Starlark files are not UTF-8 encoded. If set to `warning` (the default), it emits a warning instead.

Bazel already assumes that Starlark files are UTF-8 encoded, e.g. for filenames in actions executed remotely. This flag doesn't change that assumption; it only makes encoding failures more visible.
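
For illustration, a strict UTF-8 check of the kind described above can be sketched with a `CharsetDecoder` configured to report malformed input instead of silently replacing it. This is an assumption about the general technique, not a description of how Bazel implements the flag; `Utf8Check` and `isValidUtf8` are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not Bazel's implementation.
final class Utf8Check {
  /** Returns true if the bytes decode as UTF-8 without any malformed sequence. */
  static boolean isValidUtf8(byte[] content) {
    try {
      StandardCharsets.UTF_8
          .newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT)
          .decode(ByteBuffer.wrap(content));
      return true;
    } catch (CharacterCodingException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    byte[] utf8 = "name = \"\u00E4\"".getBytes(StandardCharsets.UTF_8);
    byte[] latin1 = "name = \"\u00E4\"".getBytes(StandardCharsets.ISO_8859_1);
    System.out.println(isValidUtf8(utf8));   // true
    System.out.println(isValidUtf8(latin1)); // false: a lone 0xE4 byte is malformed UTF-8
  }
}
```

A caller could map a `false` result to either a warning or an error, which mirrors the two modes described above.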

Work towards #374

Closes bazelbuild#24944.

PiperOrigin-RevId: 721513249
Change-Id: I1d3363168c6cd5d37abf96e0401e34866b6679d7
iancha1992 pushed a commit that referenced this issue Feb 3, 2025
If enabled (or set to `error`), Bazel fails if Starlark files are not UTF-8
encoded. If set to `warning` (the default), it emits a warning instead.

Bazel already assumes that Starlark files are UTF-8 encoded, e.g. for
filenames in actions executed remotely. This flag doesn't change that
assumption; it only makes encoding failures more visible.

Work towards #374

Closes bazelbuild#24944.

PiperOrigin-RevId: 721513249
Change-Id: I1d3363168c6cd5d37abf96e0401e34866b6679d7

(cherry picked from commit e7934ce)

Fixes bazelbuild#25148