starlark: add 'bytes' data type, for binary strings #330

alandonovan · 2020-12-10T21:08:20Z

This change defines a 'bytes' data type, an immutable string of
bytes. In this Go implementation of Starlark, ordinary strings
are also strings of bytes, so the behavior of the two is very similar.
However, that is not required by the spec. Other implementations of
Starlark, notably in Java, may use strings of UTF-16 code units for the
ordinary string type, and thus need a distinct type for byte strings.
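
To make the representation difference concrete, here is a small sketch in Python (used only because Starlark is a Python dialect; this is not code from the Go implementation). It measures the same text as a sequence of UTF-8 bytes, which is how the description says the Go implementation treats strings, and as a sequence of UTF-16 code units, which a Java-based implementation may use. The helper names `utf8_len` and `utf16_len` are hypothetical.

```python
def utf8_len(s):
    """Element count if a string is a sequence of UTF-8 bytes."""
    return len(s.encode("utf-8"))


def utf16_len(s):
    """Element count if a string is a sequence of UTF-16 code units."""
    # Each UTF-16 code unit occupies two bytes in the encoded form.
    return len(s.encode("utf-16-le")) // 2


# U+00E9 needs 2 UTF-8 bytes; U+1F600 needs 4 UTF-8 bytes
# or 2 UTF-16 code units (a surrogate pair).
text = "h\u00e9llo \U0001F600"
print(utf8_len(text), utf16_len(text))
```

The two counts differ for any text outside ASCII, so a program that indexes or measures strings is not portable between such implementations, whereas a dedicated byte-string type can behave the same way in both.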

lib/proto/proto.go

starlark/testdata/bytes.star

jayconrod · 2020-12-11T21:31:30Z

starlark/testdata/bytes.star

+#
+# string to number:
+# - ord(bytes)    returns numeric byte value of bytes[0] (requires len=1).
+# - ord(bytes[i]) returns numeric byte value of bytes[i].


ord(bytes[i]) is ord(int); I don't think that should be valid.

alandonovan · 2020-12-11

Ah, indexing a Python 3 byte string returns a numeric byte value, not a bytes value of length 1 (unlike text strings, where indexing yields a string of length 1). I've changed Starlark/Go to match.
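
The Python 3 behavior referred to here can be checked directly. The sketch below is plain Python 3, not Starlark; the helper `ord_rejects` is a hypothetical name introduced only for this illustration.

```python
data = b"abc"

first = data[0]       # indexing a bytes value yields an int
one_byte = data[0:1]  # slicing yields a bytes value of length 1


def ord_rejects(value):
    """Report whether ord() refuses the given value with a TypeError."""
    try:
        ord(value)
    except TypeError:
        return True
    return False


print(type(first).__name__, first)
print(type(one_byte).__name__, ord(one_byte))
print(ord_rejects(first))  # ord(int) is an error in Python 3
```

Under these semantics ord(bytes[i]) amounts to ord(int), which is the case the review comment objected to.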

It occurs to me that ord(string) is not very useful on its own, because it requires the string to encode exactly one code point, yet the language provides no way to ensure that (other than iterating over string.codepoints()). Perhaps ord(string) should return the first code point? Or it could take an optional index=int parameter for the start offset, which would avoid the substring allocation incurred by ord(string[i:]).
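
As a rough sketch of the second idea, the following Python function decodes the single code point whose UTF-8 encoding starts at a given byte offset. The name `ord_at` and its signature are assumptions made for illustration; nothing in this thread settles the actual API. Python still allocates a tiny slice internally, so the sketch shows only the proposed interface, not the allocation saving.

```python
def ord_at(data: bytes, index: int = 0) -> int:
    """Return the code point whose UTF-8 encoding starts at data[index].

    Hypothetical helper: raises ValueError if data[index] does not
    begin a valid UTF-8 sequence.
    """
    lead = data[index]
    if lead < 0x80:
        width = 1          # ASCII
    elif lead >> 5 == 0b110:
        width = 2          # 110xxxxx
    elif lead >> 4 == 0b1110:
        width = 3          # 1110xxxx
    elif lead >> 3 == 0b11110:
        width = 4          # 11110xxx
    else:
        raise ValueError("offset %d is not the start of a UTF-8 sequence" % index)
    # decode() validates continuation bytes; UnicodeDecodeError is a ValueError.
    return ord(data[index:index + width].decode("utf-8"))


sample = "a\u00e9\u20ac".encode("utf-8")  # 1-byte, 2-byte and 3-byte encodings
print(ord_at(sample), hex(ord_at(sample, 1)), hex(ord_at(sample, 3)))
```

With an explicit offset the caller never has to prove that a string holds exactly one code point, which is the weakness of ord(string) noted above.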

starlark/testdata/string.star

alandonovan · 2021-02-05T22:40:31Z

Hi Jon, Jay,

I have updated this change to the Go impl, and the spec change in bazelbuild/starlark#161 to clean up the following primary parts of the problem:

bytes values
bytes literal syntax
\x \X \u \U escapes
portability of strings

The remaining parts of the problem are:

conversions (string, bytes, numbers), ord, chr, and strictness of conversion failures
additional methods, including iterators
hash(bytes)

To make this process more manageable and reduce the number of rounds of review on an increasingly large pair of PRs, I propose that we aim to commit the spec change and the Go implementation once we are happy with the "phase 1" features, and then make follow-up changes to address the "phase 2" items. Given that no-one is yet using bytes, I don't think this will be disruptive.

cheers
alan

jayconrod

Quick first look: this phased approach sounds good to me. I still need to look at the parser change and tests in more detail, but looks good so far.

internal/compile/compile.go

alandonovan · 2021-02-10T23:26:21Z

PTAL; thanks.

THIS IS AN INCOMPATIBLE LANGUAGE CHANGE; see below.

This change defines a 'bytes' data type, an immutable string of bytes. In this Go implementation of Starlark, ordinary strings are also strings of bytes, so the behavior of the two is very similar. However, that is not required by the spec. Other implementations of Starlark, notably in Java, may use strings of UTF-16 code units for the ordinary string type, and thus need a distinct type for byte strings.

See testdata/bytes.star for a tour of the API, and some remaining questions. See the attached issue for an outline of the proposed spec change. A Java implementation is underway, but is greatly complicated by Bazel's unfortunate misdecoding of UTF-8 files as Latin-1.

The string.elems iterable view is now indexable.

The old syntax.quote function (which was in fact not used except in tests) has been replaced by syntax.Quote, which is similar to Go's strconv.Quote.

This change removes go.starlark.net.lib.proto.Bytes.

IMPORTANT: string literals that previously used hex escapes \xXX or octal escapes \OOO to denote byte values greater than 127 will now result in a compile error advising you to use \u escapes instead if you want the UTF-8 encoding of a code point in the range U+80 to U+FF. A string literal can no longer denote invalid text, such as the 1-element string formerly written "\xff".

Updates bazelbuild/starlark#112
Fixes #222

Change-Id: Ieccd177a2662ca2106016165b50073a670ae7f2c
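
The distinction behind the IMPORTANT note can be illustrated in Python 3. This is a sketch of the encoding facts only. Python's own escape rules differ from Starlark's: in a Python text literal "\xff" means U+00FF, so the single byte is written here with Python's b"..." literal. The helper `is_valid_utf8` is a hypothetical name.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Report whether data is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


raw = b"\xff"                    # one byte with value 255
encoded = "\u00ff".encode("utf-8")  # UTF-8 encoding of code point U+00FF

print(len(raw), is_valid_utf8(raw))
print(len(encoded), is_valid_utf8(encoded))
```

The lone byte 0xFF is not valid UTF-8, while the code point U+00FF encodes as the two bytes 0xC3 0xBF. That is the difference between the one-element string formerly written "\xff" and what a \u escape produces.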

jayconrod

All looks good. Sorry for the delay; busy week.

adonovan force-pushed the bytes branch 2 times, most recently from 13022e3 to 6444f7e · December 10, 2020 21:35

alandonovan requested review from brandjon and jayconrod · December 10, 2020 21:35

adonovan force-pushed the bytes branch 2 times, most recently from 045b668 to ff8aef7 · December 10, 2020 21:58

jayconrod approved these changes · Dec 11, 2020

lib/proto/proto.go (resolved)

adonovan force-pushed the bytes branch from ff8aef7 to 00c391f · December 11, 2020 17:15

alandonovan commented · Dec 11, 2020

starlark/testdata/bytes.star (outdated, resolved)

adonovan force-pushed the bytes branch from 00c391f to 3b93dd8 · December 11, 2020 19:57

jayconrod reviewed · Dec 11, 2020

adonovan force-pushed the bytes branch 2 times, most recently from aba7d20 to f02c162 · December 13, 2020 16:23

brandjon reviewed · Dec 22, 2020

starlark/testdata/string.star (outdated, resolved)

brandjon reviewed · Dec 23, 2020

brandjon mentioned this pull request · Dec 23, 2020

spec: new 'bytes' data type bazelbuild/starlark#112 (Open)

brandjon reviewed · Dec 23, 2020

starlark/testdata/bytes.star (resolved)

adonovan force-pushed the bytes branch 2 times, most recently from 7df1736 to 7a5ed99 · February 5, 2021 22:18

jayconrod reviewed · Feb 8, 2021

internal/compile/compile.go (resolved)

brandjon approved these changes · Feb 9, 2021

adonovan force-pushed the bytes branch from 7a5ed99 to a348d16 · February 10, 2021 23:25

adonovan force-pushed the bytes branch 3 times, most recently from a596359 to 392640f · February 11, 2021 23:50

adonovan force-pushed the bytes branch from 392640f to 2e65a7e · February 12, 2021 00:47

jayconrod approved these changes · Feb 12, 2021

alandonovan merged commit ebe61bd into master · Feb 12, 2021

alandonovan deleted the bytes branch · February 12, 2021 21:57

emcfarlane mentioned this pull request · Feb 16, 2022

Buildifer 2 support byte string and load comments bazelbuild/buildtools#1049 (Open)
