Specify whether files may begin with a UTF-8 BOM #170

Open
brandjon opened this issue Feb 16, 2021 · 1 comment

Comments

@brandjon
Member

The language spec currently says that files are UTF-8 encoded. Following the feature request in bazelbuild/bazel#4551, we should decide whether to allow an optional BOM (EF BB BF) at the beginning of the file, which would be stripped before lexing.
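For illustration only, here is a minimal sketch of what "strip an optional BOM before lexing" could look like on the raw bytes. The helper name `strip_utf8_bom` is hypothetical and is not taken from any existing implementation. It assumes the file has already been read as bytes and that only a single leading BOM is removed.

```python
# Hypothetical helper: not part of any existing Starlark implementation.
UTF8_BOM = b"\xef\xbb\xbf"  # the three bytes EF BB BF


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a single leading UTF-8 BOM, if one is present."""
    if data.startswith(UTF8_BOM):
        return data[len(UTF8_BOM):]
    return data


with_bom = UTF8_BOM + b"x = 1\n"
without_bom = b"x = 1\n"

# Both inputs yield the same bytes for the lexer to consume.
stripped_a = strip_utf8_bom(with_bom)
stripped_b = strip_utf8_bom(without_bom)
```

A BOM anywhere other than the very start of the file is left in place by this sketch, so whatever the lexer already does with it there would be unchanged.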

BOMs are unnecessary and not recommended for UTF-8, but prohibiting them is hostile to some Windows text editors. Conversely, allowing them seems harmless.

From what I can tell, standard UTF-8 passes a decoded BOM through unmodified without stripping. But that doesn't stop plenty of decoders from stripping the BOM, e.g. Python's utf-8-sig codec (as distinct from its utf-8 codec).
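The difference between the two Python codecs can be seen in a short sketch. The sample content `x = 1` is made up for illustration; `codecs.BOM_UTF8` is the standard library constant for the bytes EF BB BF.

```python
import codecs

# Made-up file content with a leading UTF-8 BOM.
data = codecs.BOM_UTF8 + "x = 1\n".encode("utf-8")

# The plain utf-8 codec keeps the BOM as a leading U+FEFF character.
plain = data.decode("utf-8")

# The utf-8-sig codec drops one leading BOM if present.
sig = data.decode("utf-8-sig")

# utf-8-sig also accepts input that has no BOM at all.
no_bom = "x = 1\n".encode("utf-8").decode("utf-8-sig")
```

Here `plain` starts with `"\ufeff"`, while `sig` and `no_bom` are both just `"x = 1\n"`.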

@adonovan
Contributor

adonovan commented May 6, 2021

> Conversely, allowing them seems harmless.

Not harmless: it has a complexity cost, and the lexer is already complicated.

> From what I can tell, standard UTF-8 passes a decoded BOM through unmodified without stripping.

A BOM is just a special kind of space character that our lexer rejects (outside of a string literal). I would prefer that we teach people to fix their misconfigured editors to stop putting unwanted invisible spaces in text files.
