Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

starlark: add 'bytes' data type, for binary strings #330

Merged
merged 1 commit into from
Feb 12, 2021
Merged

Conversation

alandonovan
Copy link
Contributor

This change defines a 'bytes' data type, an immutable string of
bytes. In this Go implementation of Starlark, ordinary strings
are also strings of bytes, so the behavior of the two is very similar.
However, that is not required by the spec. Other implementations of
Starlark, notably in Java, may use strings of UTF-16 codes for the
ordinary string type, and thus need a distinct type for byte strings.

@adonovan adonovan force-pushed the bytes branch 2 times, most recently from 13022e3 to 6444f7e Compare December 10, 2020 21:35
@adonovan adonovan force-pushed the bytes branch 2 times, most recently from 045b668 to ff8aef7 Compare December 10, 2020 21:58
lib/proto/proto.go Show resolved Hide resolved
lib/proto/proto.go Show resolved Hide resolved
starlark/testdata/bytes.star Outdated Show resolved Hide resolved
starlark/testdata/bytes.star Outdated Show resolved Hide resolved
#
# string to number:
# - ord(bytes) returns numeric byte value of bytes[0] (requires len=1).
# - ord(bytes[i]) returns numeric byte value of bytes[i].
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ord(bytes[i]) is ord(int); I don't think that should be valid.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, indexing a Python3 byte string returns a numeric byte value, not a bytes of length 1 (like text strings). I've changed Starlark/Go to match.

It occurs to me that ord(string) is not very useful on its own, because it requires the string to encode exactly one code point, yet the language provides no way to ensure that (other than iterating over string.codepoints()). Perhaps ord(string) should return the first code point? Or it could have an optional index=int parameter for the start offset, which avoids the need to allocate a substring by using ord(string[i:]).

@adonovan adonovan force-pushed the bytes branch 2 times, most recently from aba7d20 to f02c162 Compare December 13, 2020 16:23
starlark/testdata/bytes.star Outdated Show resolved Hide resolved
starlark/testdata/bytes.star Show resolved Hide resolved
starlark/testdata/bytes.star Show resolved Hide resolved
starlark/testdata/bytes.star Outdated Show resolved Hide resolved
starlark/testdata/bytes.star Outdated Show resolved Hide resolved
starlark/testdata/bytes.star Show resolved Hide resolved
starlark/testdata/bytes.star Show resolved Hide resolved
starlark/testdata/bytes.star Outdated Show resolved Hide resolved
starlark/testdata/bytes.star Show resolved Hide resolved
@adonovan adonovan force-pushed the bytes branch 2 times, most recently from 7df1736 to 7a5ed99 Compare February 5, 2021 22:18
@alandonovan
Copy link
Contributor Author

Hi Jon, Jay,

I have updated this change to the Go impl, and the spec change in bazelbuild/starlark#161 to clean up the following primary parts of the problem:

  • bytes values
  • bytes literal syntax
  • \x \X \u \U escapes
  • portability of strings

The remaining parts of the problem are:

  • conversions (string, bytes, numbers), ord, chr, and strictness of conversion failures
  • additional methods, including iterators
  • hash(bytes)

To make this process more manageable and reduce the number of rounds of review on an increasingly large pair of PRs, I propose that we aim to commit the spec change and the Go implementation once we are happy with the "phase 1" features, and then make follow-up changes to address the "phase 2" items. Given that no-one is yet using bytes, I don't think this will be disruptive.

cheers
alan

Copy link
Collaborator

@jayconrod jayconrod left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick first look: this phased approach sounds good to me. I still need to look at the parser change and tests in more detail, but looks good so far.

internal/compile/compile.go Show resolved Hide resolved
@alandonovan
Copy link
Contributor Author

PTAL; thanks.

@adonovan adonovan force-pushed the bytes branch 3 times, most recently from a596359 to 392640f Compare February 11, 2021 23:50
THIS IS AN INCOMPATIBLE LANGUAGE CHANGE; see below

This change defines a 'bytes' data type, an immutable string of
bytes. In this Go implementation of Starlark, ordinary strings
are also strings of bytes, so the behavior of the two is very similar.
However, that is not required by the spec. Other implementations of
Starlark, notably in Java, may use strings of UTF-16 codes for the
ordinary string type, and thus need a distinct type for byte strings.

See testdata/bytes.star for a tour of the API, and some remaining
questions. See the attached issue for an outline of the proposed
spec change. A Java implementation is underway, but is greatly
complicated by Bazel's unfortunate misdecoding of UTF-8 files as
Latin1.

The string.elems iterable view is now indexable.

The old syntax.quote function (which was in fact not used
except in tests) has been replaced by syntax.Quote,
which is similar to Go's strconv.Quote.

This change removes go.starlark.net.lib.proto.Bytes.

IMPORTANT: string literals that previously used hex escapes
\xXX or octal escapes \OOO to denote byte values greater than 127
will now result in a compile error advising you to use \u
escapes instead if you want the UTF-8 encoding of a code point
in the range U+80 to U+FF. A string literal can no longer
denote invalid text, such as the 1-element string formerly
written "\xff".

Updates bazelbuild/starlark#112
Fixes #222

Change-Id: Ieccd177a2662ca2106016165b50073a670ae7f2c
Copy link
Collaborator

@jayconrod jayconrod left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All looks good. Sorry for the delay; busy week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants