scanner: accept \uXXXX and \UXXXXXXXX escapes (and specify them!) #222

Closed
alandonovan opened this issue Jul 9, 2019 · 0 comments
alandonovan commented Jul 9, 2019

Strings are quoted as if by Go's fmt %q verb, which quotes non-printable Unicode code points using \uXXXX or \UXXXXXXXX. But the Starlark scanner does not currently recognize this syntax, nor does the spec say anything about the form of string literals.

Welcome to Starlark (go.starlark.net)
>>> chr(0x00A0) # NO-BREAK SPACE (non printable)
"\u00a0"
>>> "\u00a0"
"\\u00a0"
>>> chr(0x400) # CYRILLIC CAPITAL LETTER IE WITH GRAVE
"Ѐ"
>>> '\u0400'
"\\u0400"
>>> chr(0x0001f63f) # CRYING CAT FACE
"😿"
>>> '\U0001f63f'
"\\U0001f63f"

Contrast with Python3:

Python 3.6.5 (default, Mar 31 2018, 05:34:57) 
>>> chr(0x00A0) # NO-BREAK SPACE (non printable)
'\xa0'
>>> '\xa0'
'\xa0'
>>> chr(0x400) # CYRILLIC CAPITAL LETTER IE WITH GRAVE
'Ѐ'
>>> '\u0400'
'Ѐ'
>>> '\U0001f63f'
'😿'
>>> chr(0x0001f63f) # CRYING CAT FACE
'😿'

The Starlark spec and implementations should allow \uXXXX and \UXXXXXXXX escapes within strings, with exactly 4 or 8 hex digits.
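
As a rough illustration of the "exactly 4 or 8 hex digits" rule, here is a minimal Python sketch of how a scanner might decode these two escapes. The function name `decode_unicode_escapes` is hypothetical and is not part of any Starlark implementation. It ignores every other escape sequence, and its rejection of surrogates and of values above U+10FFFF is an assumption, not something this issue specifies.

```python
import string

HEX_DIGITS = set(string.hexdigits)


def decode_unicode_escapes(s):
    """Decode \\uXXXX and \\UXXXXXXXX in s; all other text is copied as-is.

    Hypothetical helper for illustration only. Other escapes such as
    \\n or \\\\ are deliberately not handled here.
    """
    out = []
    i = 0
    while i < len(s):
        if s[i] == "\\" and i + 1 < len(s) and s[i + 1] in "uU":
            # \u takes exactly 4 hex digits, \U exactly 8.
            width = 4 if s[i + 1] == "u" else 8
            digits = s[i + 2 : i + 2 + width]
            if len(digits) != width or any(c not in HEX_DIGITS for c in digits):
                raise ValueError("escape needs exactly %d hex digits" % width)
            code = int(digits, 16)
            # Assumption: surrogates and out-of-range values are errors.
            if code > 0x10FFFF or 0xD800 <= code <= 0xDFFF:
                raise ValueError("invalid code point U+%04X" % code)
            out.append(chr(code))
            i += 2 + width
        else:
            out.append(s[i])
            i += 1
    return "".join(out)
```

Under this sketch, a malformed escape such as `\u0x400` is rejected rather than silently kept as literal text, which is one possible design; a scanner could equally report the error with a source position.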

Python2 & 3 also accept \xXX escapes, with two hex digits. Should Starlark?
(FWIW: C++ and Go do too; Java does not, and furthermore its \uXXXX notation denotes a UTF-16 code unit, not a Unicode code point.)
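
To make the code unit versus code point distinction concrete, the following Python sketch lists the UTF-16 code units of a string. The helper name `utf16_code_units` is made up for this example. It shows that the single code point U+1F63F needs two UTF-16 code units, which is why a Java-style \uXXXX escape would have to be written twice to denote it.

```python
def utf16_code_units(s):
    """Return the UTF-16 code units of s as a list of integers."""
    data = s.encode("utf-16-be")  # big-endian, no byte order mark
    return [int.from_bytes(data[i : i + 2], "big") for i in range(0, len(data), 2)]


cat = "\U0001f63f"  # CRYING CAT FACE
print(len(cat))                                 # 1 (one code point)
print([hex(u) for u in utf16_code_units(cat)])  # ['0xd83d', '0xde3f']
```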

adonovan added a commit that referenced this issue Feb 5, 2021
THIS IS AN INCOMPATIBLE LANGUAGE CHANGE; see below

This change defines a 'bytes' data type, an immutable string of
bytes. In this Go implementation of Starlark, ordinary strings
are also strings of bytes, so the behavior of the two is very similar.
However, that is not required by the spec. Other implementations of
Starlark, notably in Java, may use strings of UTF-16 codes for the
ordinary string type, and thus need a distinct type for byte strings.

See testdata/bytes.star for a tour of the API, and some remaining
questions. See the attached issue for an outline of the proposed
spec change. A Java implementation is underway, but is greatly
complicated by Bazel's unfortunate misdecoding of UTF-8 files as
Latin1.

The string.elems iterable view is now indexable.

The old syntax.quote function (which was in fact not used
except in tests) has been replaced by syntax.Quote, which
is similar to Go's strconv.Quote, but uses \X to escape
raw code units as needed.

This change removes go.starlark.net.lib.proto.Bytes.

IMPORTANT: string literals that previously used hex escapes
\xXX or octal escapes \OOO to denote byte values greater than 127
will now result in a compile error advising you to use \X or \u
escapes instead.

Updates bazelbuild/starlark#112
Fixes #222

Change-Id: Ieccd177a2662ca2106016165b50073a670ae7f2c
alandonovan self-assigned this Feb 5, 2021
adonovan added a commit that referenced this issue Feb 10, 2021
THIS IS AN INCOMPATIBLE LANGUAGE CHANGE; see below

This change defines a 'bytes' data type, an immutable string of
bytes. In this Go implementation of Starlark, ordinary strings
are also strings of bytes, so the behavior of the two is very similar.
However, that is not required by the spec. Other implementations of
Starlark, notably in Java, may use strings of UTF-16 codes for the
ordinary string type, and thus need a distinct type for byte strings.

See testdata/bytes.star for a tour of the API, and some remaining
questions. See the attached issue for an outline of the proposed
spec change. A Java implementation is underway, but is greatly
complicated by Bazel's unfortunate misdecoding of UTF-8 files as
Latin1.

The string.elems iterable view is now indexable.

The old syntax.quote function (which was in fact not used
except in tests) has been replaced by syntax.Quote,
which is similar to Go's strconv.Quote.

This change removes go.starlark.net.lib.proto.Bytes.

IMPORTANT: string literals that previously used hex escapes
\xXX or octal escapes \OOO to denote byte values greater than 127
will now result in a compile error advising you to use \u
escapes instead if you want the UTF-8 encoding of a code point
in the range U+80 to U+FF. A string literal can no longer
denote invalid text, such as the 1-element string formerly
written "\xff".

Updates bazelbuild/starlark#112
Fixes #222

Change-Id: Ieccd177a2662ca2106016165b50073a670ae7f2c
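
The distinction drawn in the IMPORTANT note above, between a raw byte and the UTF-8 encoding of a code point in the range U+80 to U+FF, can be illustrated in Python, whose str and bytes types stand in here for Starlark's string and bytes. This is only an analogy, not the Starlark API, and `is_valid_utf8` is a hypothetical helper.

```python
def is_valid_utf8(data):
    """Report whether the bytes object data is well-formed UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


raw = b"\xff"                       # the single byte 0xFF
encoded = "\u00ff".encode("utf-8")  # UTF-8 encoding of code point U+00FF

print(is_valid_utf8(raw))      # False: a lone 0xFF byte is not valid text
print(encoded)                 # b'\xc3\xbf'
print(is_valid_utf8(encoded))  # True
```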
adonovan added a commit that referenced this issue Feb 11, 2021
adonovan added a commit that referenced this issue Feb 11, 2021
adonovan added a commit that referenced this issue Feb 11, 2021
adonovan added a commit that referenced this issue Feb 12, 2021