-
Notifications
You must be signed in to change notification settings - Fork 4.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUILD file parsing is not symmetric w.r.t. actions.write when processing UTF-8 encoded BUILD files. #10174
Comments
cc @alandonovan Without looking into details, I'd assume that |
What also works is if bazel just reads octets, with no interpretation. File write actions should also not have an encoding. This make bazel an efficient passthrough. |
This is essentially a duplicate of #374: Bazel treats BUILD and .bzl files, and file names returned by readdir, as if they are encoded as Latin 1, and then uses many weird hacks to try to correct for that mistake. (When we made this decision, a long time ago, it was in fear that if we chose UTF-8, then it would be possible to create directories that couldn't be deleted because their file names (byte strings) were not valid UTF-8 strings. Of course, the solution to that is: make the DeleteTree operation more robust. I also blame my youthful ignorance of string encoding.) I tried a fix a few months back, but it's a complex and potentially breaking change. For example, it changes the result of |
Going back to the comment of 2019-11-29, yes. I think file write is broken, in that it converts to UTF-8, whereas it should just write what was parsed in. bazel/src/main/java/com/google/devtools/build/lib/analysis/actions/FileWriteAction.java Line 238 in 1557ac8
|
Since Bazel reads BUILD and .bzl files as (essentially) raw bytes, any UTF-8 encoded input is essentially pre-UTF-8 encoded on the way out. Closes bazelbuild#10174 DO NOT SUBMIT: This is a breaking change, so it needs a migration path. Current idea: - introduce a new attribute to write 'encode_as_utf_8'. Default is gated by incompatible_write_action_encodes_as_utf_8. - for bazel we do the multi-release cycle flip - for google we agressively try to change the default - add encode_as_utf_8 to the few broken rules - flip the default - fix the user.
And yes. Removing any mention of UTF8 in that file does fix this issue. The sequence of octets from my BUILD file are perfectly preserved in the file created by actions.write. |
Tony: it is correct that the file actually written by the action contain UTF-8 encoded text. The question is whether Either way, please document in FileWriteAction whatever you discover. |
Good question. It is probably a starlark string value, since the code creating the content is taking a string attribute from a rule. You can see it in the repo repository. I expect a 27 character string of text, but len(x) gives me 28, hinting that the copyright symbol is taking 2 chars instead of 1. When we write that out, we get 4 bytes, since each of the 2 chars is getting the UTF-8 externalization. After I do the global presubmit with no encodings, I'll try ISO8859_1. |
Baseline: 7404d17 Cherry picks: + a4334be: fixup! Gracefully handle the lack of subreaper support in Linux. Incompatible changes: - This removes the short-lived --process_wrapper_extra_flags flag, which was introduced primarily to roll out a bug fix. Unfortunately, this made us inadvertently expose all of the process-wrapper's command line interface to the public, which should not have happened. Given the corner case of the utility of this flag, the lack of documentation for it, and the fact that it only appeared in a single release, we are treating this as a bug instead of a backwards compatibility breakage. New features: - bazel info: Allow to specify multiple keys. - Support code coverage with GCC 9. Important changes: - Allow InstrumentedFilesInfo fields to be read from Starlark. - The --starlark_cpu_profile=<file> flag writes a profile in pprof format containing a statistical summary of CPU usage by all Starlark execution during the bazel command. Use it to identify slow Starlark functions in loading and analysis. - The --debug_depset_flag has been removed as it is in effect always on at no cost. - Rule authors should use the incompatible_use_toolchain_transition rule attribute to migrate to using the toolchain transition. jcater to udpate notes further. - `apple_binary` rules now accept the `stamp` attribute with the same semantics that it has in `cc_binary` rules. - --incompatible_objc_provider_remove_compile_info turns off the compile info/mege_zip Starlark APIs in ObjcProvider. See #11359. - The --debug_depset_flag has been removed as it is in effect always on at no cost. - Fix behavior of ctx.actions.write so content is written without an incorrect encoding to UTF-8. See #10174 for details. - Collect more performance metrics for worker execution. - Add flag --incompatible_force_strict_header_check_from_starlark - Configure coverage and runfiles for sh_library. - Adds --incompatible_blacklisted_protos_requires_proto_info to indicate whether proto_lang_toolchain.blacklisted_protos requires ProtoInfo. This release contains contributions from many people at Google, as well as Andrzej Guszak, Benjamin Peterson, Benjamin Romano, Carlos Eduardo Seo, Claudio Bley, dannysullivan, David Ostrovsky, George Gensure, Graham Jenson, Grzegorz Lukasik, Gunnar Wagenknecht, Henk van der Laan, Jin, John Millikin, Marin Baron, Nikhil Marathe, Robin Nabel, Ryan Beasley, Samuel Giddins, Sergey Balabanov, utsav-dbx, Vo Van Nghia, Yannic Bonenberger.
Baseline: 7404d17 Cherry picks: + a4334be: fixup! Gracefully handle the lack of subreaper support in Linux. Incompatible changes: - This removes the short-lived --process_wrapper_extra_flags flag, which was introduced primarily to roll out a bug fix. Unfortunately, this made us inadvertently expose all of the process-wrapper's command line interface to the public, which should not have happened. Given the corner case of the utility of this flag, the lack of documentation for it, and the fact that it only appeared in a single release, we are treating this as a bug instead of a backwards compatibility breakage. New features: - bazel info: Allow to specify multiple keys. - Support code coverage with GCC 9. Important changes: - Allow InstrumentedFilesInfo fields to be read from Starlark. - The --starlark_cpu_profile=<file> flag writes a profile in pprof format containing a statistical summary of CPU usage by all Starlark execution during the bazel command. Use it to identify slow Starlark functions in loading and analysis. - The --debug_depset_flag has been removed as it is in effect always on at no cost. - Rule authors should use the incompatible_use_toolchain_transition rule attribute to migrate to using the toolchain transition. jcater to udpate notes further. - `apple_binary` rules now accept the `stamp` attribute with the same semantics that it has in `cc_binary` rules. - --incompatible_objc_provider_remove_compile_info turns off the compile info/mege_zip Starlark APIs in ObjcProvider. See #11359. - The --debug_depset_flag has been removed as it is in effect always on at no cost. - Fix behavior of ctx.actions.write so content is written without an incorrect encoding to UTF-8. See #10174 for details. - Collect more performance metrics for worker execution. - Add flag --incompatible_force_strict_header_check_from_starlark - Configure coverage and runfiles for sh_library. - Adds --incompatible_blacklisted_protos_requires_proto_info to indicate whether proto_lang_toolchain.blacklisted_protos requires ProtoInfo. This release contains contributions from many people at Google, as well as Andrzej Guszak, Benjamin Peterson, Benjamin Romano, Carlos Eduardo Seo, Claudio Bley, dannysullivan, David Ostrovsky, George Gensure, Graham Jenson, Grzegorz Lukasik, Gunnar Wagenknecht, Henk van der Laan, Jin, John Millikin, Marin Baron, Nikhil Marathe, Robin Nabel, Ryan Beasley, Samuel Giddins, Sergey Balabanov, utsav-dbx, Vo Van Nghia, Yannic Bonenberger.
Description of the problem / feature request:
Multi-byte UTF-8 sequences in BUILD files get parsed as multiple distinct characters instead of a single code point
Feature requests: what underlying problem are you trying to solve with this feature?
It is useful to have full Unicode support for attributes.
Bugs: what's the simplest, easiest way to reproduce this bug?
Repo here: https://github.com/tonyaiuto/bazel/tree/master/utf8_encode
What operating system are you running Bazel on?
Linux
What's the output of
bazel info release
?bazel 1.1.0
Have you found anything relevant by searching the web?
#4551 is similar.
Analysis from the sample repo.
The iso8859-Latin-1 copyright symbol is 0xA9. The UTF-8 encoding of that
is the two bytes [c2, a9]. What we see in the output file is clearly
those two bytes again encoded as utf-8.
The BUILD file is in UTF-8 format:
We can see the signature of the encoded copyright symbol as 302 251.
It would seem that
distinct character [0xc2, 0xa9].
characters into the 4 need for their UTF-8 representation.
Potential fixes
BUILD files are UTF-8
broken anyway, so who could be using anything beyond ASCII successfully)
Allow BUILD files to specify an encoding
We could borrow from (PEP-263)[https://www.python.org/dev/peps/pep-0263/]
and use:
Byte are bytes
ctx.actions.write() should not assume an encoding for the output. It
would emit exactly what it is given.
More switches and bells.
ctx.actions.write() could have an encoding attribute. Use 'none' for this case.
The text was updated successfully, but these errors were encountered: