-
Notifications
You must be signed in to change notification settings - Fork 218
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
starlark: add 'bytes' data type, for binary strings #330
Conversation
13022e3
to
6444f7e
Compare
045b668
to
ff8aef7
Compare
starlark/testdata/bytes.star
Outdated
# | ||
# string to number: | ||
# - ord(bytes) returns numeric byte value of bytes[0] (requires len=1). | ||
# - ord(bytes[i]) returns numeric byte value of bytes[i]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ord(bytes[i])
is ord(int)
; I don't think that should be valid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, indexing a Python3 byte string returns a numeric byte value, not a bytes of length 1 (like text strings). I've changed Starlark/Go to match.
It occurs to me that ord(string) is not very useful on its own, because it requires the string to encode exactly one code point, yet the language provides no way to ensure that (other than iterating over string.codepoints()). Perhaps ord(string) should return the first code point? Or it could have an optional index=int parameter for the start offset, which avoids the need to allocate a substring by using ord(string[i:]).
aba7d20
to
f02c162
Compare
7df1736
to
7a5ed99
Compare
Hi Jon, Jay, I have updated this change to the Go impl, and the spec change in bazelbuild/starlark#161 to clean up the following primary parts of the problem:
The remaining parts of the problem are:
To make this process more manageable and reduce the number of rounds of review on an increasingly large pair of PRs, I propose that we aim to commit the spec change and the Go implementation once we are happy with the "phase 1" features, and then make follow-up changes to address the "phase 2" items. Given that no-one is yet using bytes, I don't think this will be disruptive. cheers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quick first look: this phased approach sounds good to me. I still need to look at the parser change and tests in more detail, but looks good so far.
PTAL; thanks. |
a596359
to
392640f
Compare
THIS IS AN INCOMPATIBLE LANGUAGE CHANGE; see below This change defines a 'bytes' data type, an immutable string of bytes. In this Go implementation of Starlark, ordinary strings are also strings of bytes, so the behavior of the two is very similar. However, that is not required by the spec. Other implementations of Starlark, notably in Java, may use strings of UTF-16 codes for the ordinary string type, and thus need a distinct type for byte strings. See testdata/bytes.star for a tour of the API, and some remaining questions. See the attached issue for an outline of the proposed spec change. A Java implementation is underway, but is greatly complicated by Bazel's unfortunate misdecoding of UTF-8 files as Latin1. The string.elems iterable view is now indexable. The old syntax.quote function (which was in fact not used except in tests) has been replaced by syntax.Quote, which is similar to Go's strconv.Quote. This change removes go.starlark.net.lib.proto.Bytes. IMPORTANT: string literals that previously used hex escapes \xXX or octal escapes \OOO to denote byte values greater than 127 will now result in a compile error advising you to use \u escapes instead if you want the UTF-8 encoding of a code point in the range U+80 to U+FF. A string literal can no longer denote invalid text, such as the 1-element string formerly written "\xff". Updates bazelbuild/starlark#112 Fixes #222 Change-Id: Ieccd177a2662ca2106016165b50073a670ae7f2c
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All looks good. Sorry for the delay; busy week.
This change defines a 'bytes' data type, an immutable string of
bytes. In this Go implementation of Starlark, ordinary strings
are also strings of bytes, so the behavior of the two is very similar.
However, that is not required by the spec. Other implementations of
Starlark, notably in Java, may use strings of UTF-16 codes for the
ordinary string type, and thus need a distinct type for byte strings.