Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Skip UTF-8 BOM sequence when reading BUILD & .bzl files. #4551

Closed
excitoon opened this issue Jan 31, 2018 · 13 comments
Closed

Skip UTF-8 BOM sequence when reading BUILD & .bzl files. #4551

excitoon opened this issue Jan 31, 2018 · 13 comments
Labels
P4 This is either out of scope or we don't have bandwidth to review a PR. (No assignee) stale Issues or PRs that are stale (no activity for 30 days) team-Starlark-Integration Issues involving Bazel's integration with Starlark, excluding builtin symbols type: feature request
Milestone

Comments

@excitoon
Copy link
Contributor

excitoon commented Jan 31, 2018

Description of the problem / feature request:

Unfortunately bazel does not support UTF-8 BOM sequences in BUILD files.

Output of bazel build ...:

C:\Users\Chebotarev_V\Documents\bazel-issues\utf-8-bom-support>bazel build ...
........................
ERROR: C:/users/chebotarev_v/documents/bazel-issues/utf-8-bom-support/BUILD:1:1: invalid character: 'я'
ERROR: C:/users/chebotarev_v/documents/bazel-issues/utf-8-bom-support/BUILD:1:2: invalid character: '╗'
ERROR: C:/users/chebotarev_v/documents/bazel-issues/utf-8-bom-support/BUILD:1:3: invalid character: '┐'
ERROR: package contains errors:
ERROR: error loading package '': Package '' contains errors
INFO: Elapsed time: 5.202s
FAILED: Build did NOT complete successfully (1 packages loaded)

Please find an example in my repository:
https://github.com/excitoon/bazel-issues/tree/master/utf-8-bom-support

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

https://github.com/excitoon/bazel-issues/tree/master/utf-8-bom-support

bazel build ...

What operating system are you running Bazel on?

Windows 10 x64

What's the output of bazel info release?

Build label: 0.9.0
Build target: bazel-out/x64_windows-opt/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
Build time: Tue Dec 19 09:32:04 2017 (1513675924)
Build timestamp: 1513675924
Build timestamp as int: 1513675924

Have you found anything relevant by searching the web?

Nothing.

@meteorcloudy
Copy link
Member

@philwo Did we have a discussion about this a long time ago?

@meteorcloudy meteorcloudy added type: feature request P2 We'll consider working on this in future. (Assignee optional) labels Jan 31, 2018
@philwo
Copy link
Member

philwo commented Jan 31, 2018

Well... if you ask me, then BUILD files should be parsed as UTF-8 by default, file names contain human readable text and we should support this and also https://www.python.org/dev/peps/pep-0263/.

Unfortunately BUILD files are parsed as latin1 and.. you know what, we could just still support this and ignore the BOM, it can't get much worse than the current situation anyway. Let's do this.

@philwo
Copy link
Member

philwo commented Jan 31, 2018

@dslomov @laurentlb Your opinion on this?

@laurentlb
Copy link
Contributor

Seems reasonable to skip those bytes.

@aiuto
Copy link
Contributor

aiuto commented Dec 10, 2018

In the short term, skipping the BOM bytes is an obvious win.
Does anyone have an opinion about parsing in UTF-8? I agree with @philwo that we should be doing that.

@aiuto aiuto added team-Bazel General Bazel product/strategy issues and removed category: misc > misc labels Mar 29, 2019
@aiuto aiuto added this to the 1.0 milestone Mar 29, 2019
@dslomov dslomov added team-Starlark and removed team-Bazel General Bazel product/strategy issues labels Jul 23, 2019
@meisterT meisterT removed this from the 1.0 milestone May 12, 2020
@aiuto aiuto changed the title Support of UTF-8 BOM sequence Skip UTF-8 BOM sequence when reading BUILD & .bzl files. Aug 19, 2020
@aiuto
Copy link
Contributor

aiuto commented Aug 19, 2020

Now that I understand Bazel internals better, I have updated my opinion from 2018.

  • We already do effectively parse BUILD files as UTF-8.
    • We read them as raw bytes and represent our strings as raw bytes rather than UTF-16 strings. (Just like C strings).
    • We ctx.actions.write does not encode to UTF-8. An attribute containing multi-byte sequences in a BUILD file will be written out exactly the same way.
  • This essentially means that we support any other encoding you happen to be using, as long as your source tree and all the tools expect the same encoding.
    • What we can't do is let you declare your BUILD files are in Shift_JIS and you want actions to write files in UTF-8. I don't consider this an important enough use case to worry about at this time.

Changing this behavior is expensive, so the I am back to just skipping the BOM sequence as the win.

@excitoon
Copy link
Contributor Author

Cool!

@alandonovan
Copy link
Contributor

I agree: UTF-8 completely eliminates the need for BOMs; the fact that some Microsoft tools put BOMs into UTF-8 text files is really a bug. (And I agree that for our purposes here, Bazel treats BUILD files as UTF-8, even though internally it's a horror show.)

@aiuto
Copy link
Contributor

aiuto commented Aug 19, 2020

Shall we label this "help wanted" and/or good first issue?
The behavior is well specified and it should be easy to write the tests.

@alandonovan
Copy link
Contributor

Because U+FEFF is a space (albeit zero-width), I was hoping we could just leave it there and let the Starlark scanner treat it like any other space---except of course spaces are significant to the syntax, so we can't do that. I think the correct solution involves two changes:

  1. Make the Starlark scanner skip past an initial U+FEFF char, as required by Unicode, without advancing the column number. This would handle cases when encodings are being used correctly, such as Copybara's use of ParserInput.fromUTF8.
  2. Make the ParserInput.fromLatin1 function skip an initial [EF BB BF] prefix. This corresponds to the string "" in Latin1, which should never appear in a Starlark file, so although it is not the correct way to handle Latin1, in practice it is safe. The fromLatin1 function exists only for Blaze's use-case in which the bytes are in fact UTF-8, and once Blaze starts using string encodings correctly this function will go away.

@brandjon
Copy link
Member

Update: Frontend team takes the position that BUILD and .bzl files are UTF-8, and has no plans to add a PEP 263-style encoding declaration. It sounds like we're in agreement that the BOM should be permitted, and I've filed bazelbuild/starlark#170 to change the Starlark spec accordingly (not that a Starlark change is necessarily a prerequisite for a Bazel change).

Since we intend to eliminate the latin-1 hack I don't think we need to worry about accommodating the BOM in it.

@brandjon brandjon added P4 This is either out of scope or we don't have bandwidth to review a PR. (No assignee) team-Starlark-Integration Issues involving Bazel's integration with Starlark, excluding builtin symbols and removed P2 We'll consider working on this in future. (Assignee optional) team-Starlark labels Feb 16, 2021
@github-actions
Copy link

Thank you for contributing to the Bazel repository! This issue has been marked as stale since it has not had any activity in the last 1+ years. It will be closed in the next 14 days unless any other activity occurs or one of the following labels is added: "not stale", "awaiting-bazeler". Please reach out to the triage team (@bazelbuild/triage) if you think this issue is still relevant or you are interested in getting the issue resolved.

@github-actions github-actions bot added the stale Issues or PRs that are stale (no activity for 30 days) label May 10, 2023
@github-actions
Copy link

This issue has been automatically closed due to inactivity. If you're still interested in pursuing this, please reach out to the triage team (@bazelbuild/triage). Thanks!

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale May 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P4 This is either out of scope or we don't have bandwidth to review a PR. (No assignee) stale Issues or PRs that are stale (no activity for 30 days) team-Starlark-Integration Issues involving Bazel's integration with Starlark, excluding builtin symbols type: feature request
Projects
None yet
Development

No branches or pull requests

10 participants