Specify whether files may begin with a UTF-8 BOM #170

brandjon · 2021-02-16T18:01:06Z

The language spec currently says that files are UTF-8 encoded. Following the feature request in bazelbuild/bazel#4551, we should decide whether to allow an optional BOM (EF BB BF) at the beginning of the file, which would be stripped before lexing.
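For concreteness, here is a minimal Python sketch of what "stripped before lexing" could look like if the BOM were allowed. The function name `strip_utf8_bom` is hypothetical and does not come from any existing implementation; the sketch assumes the lexer receives the raw file bytes.

```python
# The three bytes that encode U+FEFF in UTF-8.
UTF8_BOM = b"\xef\xbb\xbf"


def strip_utf8_bom(source: bytes) -> bytes:
    """Return source without a single leading UTF-8 BOM, if one is present.

    Hypothetical helper: only a BOM at the very start of the file is removed.
    Any later occurrence of the same bytes is left for the lexer to handle.
    """
    if source.startswith(UTF8_BOM):
        return source[len(UTF8_BOM):]
    return source
```

Under this sketch a file with a BOM and a file without one reach the lexer as identical bytes, so the lexer itself would not need to change.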

BOMs are unnecessary and not recommended for UTF-8, but prohibiting them is hostile to some Windows text editors. Conversely, allowing them seems harmless.

From what I can tell, standard UTF-8 passes a decoded BOM through unmodified without stripping. But that doesn't stop plenty of decoders from stripping the BOM, e.g. Python's utf-8-sig codec (as distinct from its utf-8 codec).
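The difference between the two Python codecs can be checked directly. This is only an illustration of the standard library's behavior, and the sample source text `x = 1` is made up:

```python
import codecs

# Raw file bytes: the UTF-8 BOM (EF BB BF) followed by ordinary source text.
raw = codecs.BOM_UTF8 + b"x = 1\n"

# The plain utf-8 codec decodes the BOM to U+FEFF and keeps it in the result.
kept = raw.decode("utf-8")

# The utf-8-sig codec drops one leading BOM while decoding.
removed = raw.decode("utf-8-sig")

print(repr(kept))     # '\ufeffx = 1\n'
print(repr(removed))  # 'x = 1\n'
```

With the `utf-8` codec the lexer would see U+FEFF as the first character of the file; with `utf-8-sig` it would not. Input without a BOM decodes identically under both codecs.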


adonovan · 2021-05-06T15:22:59Z

> Conversely, allowing them seems harmless.

Not harmless: it has a complexity cost, and the lexer is already complicated.

> From what I can tell, standard UTF-8 passes a decoded BOM through unmodified without stripping.

A BOM is just a special kind of space character that our lexer rejects (outside of a string literal). I would prefer that we teach people to fix their misconfigured editors to stop putting unwanted invisible spaces in text files.

brandjon added P4 type: feature request labels Feb 16, 2021

brandjon mentioned this issue Feb 16, 2021

Skip UTF-8 BOM sequence when reading BUILD & .bzl files. bazelbuild/bazel#4551

Closed
