Allow any characters in filenames / labels #374

damienmg · 2015-08-13T08:24:10Z

Ultimately any character can be part of a filename. We should probably allow that.

Some mangling to generate the corresponding label should probably be done.

Original report on the mailing-list:
https://groups.google.com/d/msgid/bazel-discuss/CAN0GiO3__5jXo5rZqroSj0mFxpqCzUZZVkY%3DSNsJK1%2BZ1BdJLg%40mail.gmail.com

abergmeier · 2015-10-19T18:37:04Z

So are we talking Unicode?
Where does "any character" stop?
How do you treat characters that are not allowed on certain platforms?
When using mangling, how do you handle collisions?

kayasoze · 2015-10-19T18:59:37Z

In POSIX, filenames are "bags of bytes"--there is no encoding; however, NUL and / are not allowed. Windows has a few more restrictions. Perhaps the BUILD file should be parsed in the encoding of the system locale, usually UTF-8, and filenames run through a ValidForCurrentPlatform() function which checks for disallowed characters. However, opting for strict platform neutrality in this way means that Bazel would have to represent filenames as a bag of bytes and not a Unicode string, as there is no guarantee that the filename will roundtrip through Unicode correctly. The problem can probably be simplified by restricting filenames to be UTF-8 or UTF-16, which should cover most people's needs even though that's not strict POSIX.
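A check like the ValidForCurrentPlatform() mentioned above could look roughly like the following. This is only a sketch: the function name `valid_for_platform` and its `platform` parameter are made up for illustration, it works on already-decoded names rather than raw bytes, and it ignores other Windows rules such as reserved device names and trailing dots or spaces.

```python
import sys

# Characters the Win32 file APIs reject in a name, plus ASCII control characters.
_WINDOWS_FORBIDDEN = set('<>:"/\\|?*') | {chr(c) for c in range(32)}
# POSIX only forbids NUL and the path separator inside a single name.
_POSIX_FORBIDDEN = {"\0", "/"}


def valid_for_platform(name: str, platform: str = sys.platform) -> bool:
    """Return True if `name` is usable as a single path component (hypothetical helper)."""
    if not name:
        return False
    if platform.startswith("win"):
        forbidden = _WINDOWS_FORBIDDEN
    else:
        forbidden = _POSIX_FORBIDDEN
    return not any(ch in forbidden for ch in name)
```

With this, a name such as `a:b` passes on Linux but is rejected for Windows, which is the kind of platform-specific failure a neutral label syntax would have to report.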

ulfjack · 2015-10-22T11:05:41Z

Well, I think we can probably require valid UTF-8 file names and strongly recommend that people use UTF-8 for their file system. For labels / BUILD files, we probably need an escaping scheme, at least for the control characters. If there's a file that isn't valid UTF-8, we give an error message?

btelles · 2015-11-18T19:30:20Z

Our company codes mainly in C++, but our frontend uses a lot of JS and nodejs modules which have all sorts of characters in the filenames--for example, -, #, @, (, and ).

Right now this is a major blocker for getting our whole codebase under one build system, since we can't reference files with semi-special characters. I don't think Bazel should decide what characters are acceptable in file names, as that reduces file names to those that fit both (1) supported languages and (2) supported platforms. This seems unnecessarily restrictive, and is becoming a major pain point for us.

ulfjack · 2015-11-19T08:18:38Z

Agreed. Unfortunately, it's a bit tricky to fix, as a lot of code assumes that the mapping from labels to file names (and vice versa) is trivial, and doesn't require escaping. Any suggestions on an escaping scheme?

damienmg · 2015-11-19T09:02:46Z

URL based?

abergmeier · 2015-11-19T09:23:19Z

You mean a custom URI scheme? Sounds good.

damienmg · 2015-11-19T09:25:17Z

I mean replacing special characters with %XX, where XX is the hexadecimal value of each byte of the character's UTF-8 encoding.
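For illustration, Python's standard library already implements this kind of URL-style escaping, so the idea can be tried out without any Bazel code. The helper names `escape_name` and `unescape_name` are hypothetical, and escaping everything outside the unreserved set is an assumption; the thread does not settle which characters count as special.

```python
from urllib.parse import quote, unquote


def escape_name(name: str) -> str:
    # safe="" escapes everything except ASCII letters, digits and "_.-~";
    # non-ASCII characters become one %XX per UTF-8 byte.
    return quote(name, safe="")


def unescape_name(escaped: str) -> str:
    # Inverse of escape_name: %XX sequences are decoded as UTF-8 bytes.
    return unquote(escaped)
```

For example, `escape_name("a#b")` gives `a%23b`, and `unescape_name` maps it back to the original name.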

ulfjack · 2016-03-02T19:36:11Z

Sorry, I won't be able to work on this. @philwo had an interest, maybe he can make some progress here. :-/

kayasoze · 2016-03-22T17:16:30Z

This blocks our Bazel deployment as well.

RonnieAtOracle · 2016-03-22T17:20:02Z

This is blocking us. We have a templating system where we need to build our template files. The filenames themselves contain template variables (e.g. ${ServiceName}.java). None of $, {, and } is supported by Bazel in file names.

philwo · 2016-05-25T12:59:09Z

I totally agree that this is important and should be done, and I want it myself. However, I don't have the time to work on it in the coming months, so I have to unassign myself.

DemiMarie · 2016-09-01T17:14:51Z

Here is my proposal:

Metacharacters (:, %, =, any others) must be %-encoded
All other characters are allowed. This includes Unicode characters.
Non-UTF8 names are not allowed, even if escaped. This is because there is no good way to handle them cross-platform, nor to display them to the user.
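A minimal sketch of how this proposal might behave, assuming the metacharacter set is exactly `:`, `%` and `=` (the proposal leaves "any others" open). The function names are invented for illustration and are not Bazel APIs.

```python
import string

METACHARACTERS = frozenset(":%=")  # assumed set; the proposal leaves it open
_HEX_DIGITS = set(string.hexdigits)


def encode_label_part(raw: bytes) -> str:
    """Turn an on-disk name into a label fragment, rejecting non-UTF-8 names."""
    try:
        name = raw.decode("utf-8")
    except UnicodeDecodeError as e:
        raise ValueError(f"file name is not valid UTF-8: {raw!r}") from e
    # Only metacharacters are escaped; everything else, including Unicode, passes through.
    return "".join(f"%{ord(ch):02X}" if ch in METACHARACTERS else ch for ch in name)


def decode_label_part(label: str) -> str:
    """Inverse of encode_label_part; escapes may not smuggle in non-UTF-8 bytes."""
    raw = bytearray()
    i = 0
    while i < len(label):
        if label[i] == "%":
            code = label[i + 1:i + 3]
            if len(code) != 2 or not set(code) <= _HEX_DIGITS:
                raise ValueError(f"malformed escape in {label!r}")
            raw.append(int(code, 16))
            i += 3
        else:
            raw.extend(label[i].encode("utf-8"))
            i += 1
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as e:
        raise ValueError(f"label does not decode to valid UTF-8: {label!r}") from e
```

Decoding goes through bytes on purpose, so that an escape such as `%FF` is rejected instead of silently producing a name that is not valid UTF-8.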

mihnita · 2016-10-06T17:25:23Z

Plain ASCII (and only part of it at that) makes this feel like we are in the early 90s.

There are reasonable ways to handle that.
For POSIX using the default locale would be good enough.

If my project is C/C++, and it is cross-platform, and I have problems handling Unicode, then I will not use Unicode in file names. And the fact that bazel "explodes" is not such a problem.
But if I use something like Java, then "it just works", and bazel would work too.

Even better would be to allow for a character-set option in the project file.
This is what maven does. And what Java does with -Dfile.encoding=UTF-8
So if it is there, use it. If not, then take the system charset.

I did not move one project to bazel because its unit tests check that Unicode file names work.
So the files are there, make it through git, work with maven and ant and gradle and java.
But bazel fails because there is an "@" in the file name... Which is supported on all OSes.
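The "use the project setting if present, otherwise the system charset" rule suggested above can be sketched in a few lines. The `project_charset` parameter is hypothetical; Bazel has no such option as far as this thread shows.

```python
import codecs
import locale
from typing import Optional


def resolve_filename_charset(project_charset: Optional[str] = None) -> str:
    """Prefer an explicit project setting; otherwise fall back to the system charset."""
    name = project_charset or locale.getpreferredencoding(False)
    # codecs.lookup raises LookupError for unknown charsets and normalizes aliases.
    return codecs.lookup(name).name
```

Calling `resolve_filename_charset("UTF-8")` returns the normalized name `utf-8`, while calling it without arguments returns whatever the current locale implies.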

phst · 2025-01-15T21:53:14Z

Thanks!
What about the runfiles manifest on Windows? It can't be UTF-16 (that would be the appropriate OS encoding, but not ASCII-compatible), so is it UTF-8 even on Windows?

phst · 2025-01-15T21:56:56Z

> Starlark and other Bazel files (e.g. .bazelrc, manifests, ...) are assumed (but not enforced) to be encoded in UTF-8

Stardoc assumes Latin-1 for docstrings, though. Encoding a Starlark file in UTF-8 will result in double-encoding, cf. https://github.com/phst/rules_elisp/blob/master/docs/generate.py#L236-L239
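To make the double-encoding effect concrete, here is a small self-contained illustration of what happens when UTF-8 bytes are interpreted as Latin-1 and then written out as UTF-8 again. It only demonstrates the general mechanism; the function names are invented, and it is not a claim about how Stardoc or the linked script is implemented.

```python
def misread_as_latin1(text: str) -> str:
    """Simulate a tool that reads UTF-8 bytes but treats each byte as a Latin-1 character."""
    return text.encode("utf-8").decode("latin-1")


def repair_double_encoding(garbled: str) -> str:
    """Undo the misreading: map characters back to bytes, then decode them as UTF-8."""
    return garbled.encode("latin-1").decode("utf-8")


original = "é"
garbled = misread_as_latin1(original)     # two characters, U+00C3 and U+00A9
double_encoded = garbled.encode("utf-8")  # four bytes instead of the original two
```

Every non-ASCII character ends up as two or more unrelated characters, and the damage is reversible only as long as the intermediate string is mapped back through Latin-1.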

phst · 2025-01-15T22:02:11Z

If Starlark files are now assumed to be UTF-8, then I guess for Stardoc https://github.com/bazelbuild/bazel/blob/8.0.0/src/main/java/com/google/devtools/build/lib/starlarkdocextract/RuleInfoExtractor.java#L65 and similar occurrences (basically wherever a string proto field in the Stardoc proto is set) need to be fixed.

fmeum · 2025-01-16T07:26:58Z

> Thanks! What about the runfiles manifest on Windows? It can't be UTF-16 (that would be the appropriate OS encoding, but not ASCII-compatible), so is it UTF-8 even on Windows?

Yes, all output files produced by Bazel should use UTF-8 and \n line endings on all platforms, including Windows.

> If Starlark files are now assumed to be UTF-8, then I guess for Stardoc https://github.com/bazelbuild/bazel/blob/8.0.0/src/main/java/com/google/devtools/build/lib/starlarkdocextract/RuleInfoExtractor.java#L65 and similar occurrences (basically wherever a string proto field in the Stardoc proto is set) need to be fixed

Thanks for pointing that out, I sent #24935 to fix this.

See bazelbuild/bazel#374 (comment): > all output files produced by Bazel should use UTF-8 and \n line endings on > all platforms, including Windows.

phst · 2025-01-22T03:08:43Z

I wasn't aware of that docs statement, I will update it.

Here's another doc that I guess is outdated now: https://bazel.build/concepts/labels

Repository names: No documentation, but I'd assume that even with this change repository names (both canonical and apparent) are ASCII-only, with some more restrictions (no newlines, spaces, slashes, ...)
Package names: Since package names correspond to directory names, these can now also contain non-ASCII characters?
Target names: Definitely can contain non-ASCII characters now.

phst · 2025-01-22T03:19:40Z

> Yes, all output files produced by Bazel should use UTF-8 and \n line endings on all platforms, including Windows.

OK, then the runfiles libraries also need to be adapted.

Go: Probably no change needed, Go uses WTF-8 for filenames on Windows
Python: Probably only need to make sure that files are opened with the correct encoding (fix: Fix encoding of runfiles manifest and repository mapping files. rules_python#2568)
C++: A lot more changes required. The use of narrow strings throughout makes manifest files and directories with non-ASCII characters nonportable/impossible. We need at least overloads for Create and Rlocation with std::wstring on Windows.
Others: ???
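For the Python case, "opened with the correct encoding" could look roughly like the sketch below. It assumes a simplified manifest layout of one `key value` pair per line separated by the first space, and ignores any escaping the real format may apply; `read_manifest` is a made-up name, not the rules_python API.

```python
from pathlib import Path
from typing import Dict


def read_manifest(path: Path) -> Dict[str, str]:
    """Parse a simplified runfiles-style manifest as UTF-8 with \\n line endings."""
    entries: Dict[str, str] = {}
    # Without an explicit encoding, open() uses the locale encoding,
    # which on Windows may be a legacy ANSI code page.
    with open(path, encoding="utf-8", newline="\n") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            key, _, value = line.partition(" ")
            entries[key] = value
    return entries
```

Passing `newline="\n"` keeps Python from treating a bare `\r` as a line terminator, in line with the statement that output files always use `\n`.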

fmeum · 2025-01-22T07:34:28Z

Thanks for sending the fix for Python!

> C++: A lot more changes required. The use of narrow strings throughout makes manifest files and directories with non-ASCII characters nonportable/impossible. We need at least overloads for Create and Rlocation with std::wstring on Windows.

Microsoft now recommends using the -A variety of functions with the UTF-8 code page (forced via an app manifest) instead of the wide string functions for new software. Existing software probably already has its own conversion functions to and from UTF-8, so I would personally lean against complicating the API for everyone by adding more overloads.

…2568) See bazelbuild/bazel#374 (comment): > all output files produced by Bazel should use UTF-8 and \n line endings on > all platforms, including Windows. Previously this would use the legacy ANSI codepage on Windows.

Work towards #374 Closes #24935. PiperOrigin-RevId: 718549143 Change-Id: Ibe6c685a2f8dd75430cae7f770d392de35bdeb68

bazelbuild/bazel#24935 changes the observable behavior of starlark_doc_extract, and consumers need to adapt. Work towards bazelbuild/bazel#374 Work towards phst/rules_elisp#818

Work towards bazelbuild#374 Closes bazelbuild#24935. PiperOrigin-RevId: 718549143 Change-Id: Ibe6c685a2f8dd75430cae7f770d392de35bdeb68

…#25097) Work towards #374 Closes #24935. PiperOrigin-RevId: 718549143 Change-Id: Ibe6c685a2f8dd75430cae7f770d392de35bdeb68 Co-authored-by: Fabian Meumertzheim <fabian@meumertzhe.im>

If enabled (or set to `error`), fail if Starlark files are not UTF-8 encoded. If set to `warning` (the default), emits a warning instead. Bazel already assumes that Starlark files are UTF-8 encoded for e.g. filenames in actions executed remotely. This flag doesn't affect this, it only makes encoding failures more visible. Work towards #374 Closes #24944. PiperOrigin-RevId: 721513249 Change-Id: I1d3363168c6cd5d37abf96e0401e34866b6679d7

If enabled (or set to `error`), fail if Starlark files are not UTF-8 encoded. If set to `warning` (the default), emits a warning instead. Bazel already assumes that Starlark files are UTF-8 encoded for e.g. filenames in actions executed remotely. This flag doesn't affect this, it only makes encoding failures more visible. Work towards bazelbuild#374 Closes bazelbuild#24944. PiperOrigin-RevId: 721513249 Change-Id: I1d3363168c6cd5d37abf96e0401e34866b6679d7 (cherry picked from commit e7934ce)

If enabled (or set to `error`), fail if Starlark files are not UTF-8 encoded. If set to `warning` (the default), emits a warning instead. Bazel already assumes that Starlark files are UTF-8 encoded for e.g. filenames in actions executed remotely. This flag doesn't affect this, it only makes encoding failures more visible. Work towards #374 Closes #24944. PiperOrigin-RevId: 721513249 Change-Id: I1d3363168c6cd5d37abf96e0401e34866b6679d7 (cherry picked from commit e7934ce) Fixes #25148

damienmg added type: bug P2 We'll consider working on this in future. (Assignee optional) labels Aug 13, 2015

kchodorow mentioned this issue Aug 31, 2015

Bazel doesn't like directories containing '+' #399

Closed

kchodorow assigned ulfjack Aug 31, 2015

kchodorow mentioned this issue Oct 19, 2015

Extracting file permission in zip fails in 0.1.1 #518

Closed

kchodorow mentioned this issue Nov 2, 2015

new_git_repository unable to handle Unicode characters in paths #551

Closed

ulfjack assigned philwo and unassigned ulfjack Mar 2, 2016

philwo removed their assignment May 25, 2016

damienmg added the category: misc > misc label Jun 14, 2016

damienmg modified the milestone: 0.7 Jun 14, 2016

dfabulich mentioned this issue Jun 20, 2016

Distributed caching: build fails when build inputs include directories #1415

Closed

nelhage mentioned this issue Aug 16, 2016

new_http_archive can't handle archives containing unicode-encoded filenames #1653

Closed

helenalt added the onbas-p1 label Aug 17, 2016

damienmg mentioned this issue Oct 31, 2016

# is not supported in srcs (duplicate of #374) #2006

Closed

fmeum mentioned this issue Jan 16, 2025

Convert proto output from internal string encoding to Unicode #24935

Closed

fmeum mentioned this issue Jan 16, 2025

Add --incompatible_enforce_starlark_utf8 #24944

Closed

phst mentioned this issue Jan 20, 2025

fix: Fix encoding of runfiles manifest and repository mapping files. bazelbuild/rules_python#2568

Merged

copybara-service bot pushed a commit that referenced this issue Jan 22, 2025

Convert proto output from internal string encoding to Unicode

ffc2989

Work towards #374 Closes #24935. PiperOrigin-RevId: 718549143 Change-Id: Ibe6c685a2f8dd75430cae7f770d392de35bdeb68

phst mentioned this issue Jan 26, 2025

Add feature check for UTF-8 Stardoc input. bazel-contrib/bazel_features#92

Merged

tjgq pushed a commit to tjgq/bazel that referenced this issue Jan 27, 2025

[8.1.0] Convert proto output from internal string encoding to Unicode

945b922

Work towards bazelbuild#374 Closes bazelbuild#24935. PiperOrigin-RevId: 718549143 Change-Id: Ibe6c685a2f8dd75430cae7f770d392de35bdeb68

tjgq mentioned this issue Jan 27, 2025

[8.1.0] Convert proto output from internal string encoding to Unicode #25097

Merged

fmeum added a commit to fmeum/bazel that referenced this issue Jan 29, 2025

Convert proto output from internal string encoding to Unicode

4fc30d3

Work towards bazelbuild#374 Closes bazelbuild#24935. PiperOrigin-RevId: 718549143 Change-Id: Ibe6c685a2f8dd75430cae7f770d392de35bdeb68

iancha1992 mentioned this issue Jan 30, 2025

[8.1.0] Add --incompatible_enforce_starlark_utf8 #25148

Closed

fmeum mentioned this issue Jan 31, 2025

[8.1.0] Add --incompatible_enforce_starlark_utf8 #25152

Merged
