scanner: accept \uXXXX and \UXXXXXXXX escapes (and specify them!) #222

alandonovan · 2019-07-09T21:37:20Z

Strings are quoted as if by Go's fmt %q operator, which quotes non-printable Unicode code points using \uXXXX or \UXXXXXXXX. But this syntax is not currently recognized by the Starlark scanner, nor does the spec say anything about the form of string literals.

Welcome to Starlark (go.starlark.net)
>>> chr(0x00A0) # NO-BREAK SPACE (non printable)
"\u00a0"
>>> "\u00a0"
"\\u00a0"
>>> chr(0x400) # CYRILLIC CAPITAL LETTER IE WITH GRAVE
"Ѐ"
>>> '\u0400'
"\\u0400"
>>> chr(0x0001f63f) # CRYING CAT FACE
"😿"
>>> '\U0001f63f'
"\\U0001f63f"

Contrast with Python3:

Python 3.6.5 (default, Mar 31 2018, 05:34:57) 
>>> chr(0x00A0) # NO-BREAK SPACE (non printable)
'\xa0'
>>> '\xa0'
'\xa0'
>>> chr(0x400) # CYRILLIC CAPITAL LETTER IE WITH GRAVE
'Ѐ'
>>> '\u0400'
'Ѐ'
>>> '\U0001f63f'
'😿'
>>> chr(0x0001f63f) # CRYING CAT FACE
'😿'

The Starlark spec and implementations should allow \uXXXX and \UXXXXXXXX escapes within strings, with exactly 4 or 8 hex digits.
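To make the "exactly 4 or 8 hex digits" rule concrete, here is a minimal Python sketch of how a scanner might decode these two escapes. The function name decode_unicode_escapes is hypothetical and is not part of any Starlark implementation. Rejecting surrogate halves and values above U+10FFFF is an assumption borrowed from Go's own rules for \u and \U, not something this issue specifies. All other escapes are passed through untouched.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")

def decode_unicode_escapes(body):
    """Decode \\uXXXX and \\UXXXXXXXX in the body of a string literal.

    Hypothetical helper for illustration only. Other backslash escapes
    are copied through unchanged.
    """
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(body):
            raise ValueError("trailing backslash")
        kind = body[i + 1]
        if kind in ("u", "U"):
            # \u takes exactly 4 hex digits, \U exactly 8.
            width = 4 if kind == "u" else 8
            digits = body[i + 2 : i + 2 + width]
            if len(digits) != width or not all(d in HEX_DIGITS for d in digits):
                raise ValueError(f"\\{kind} escape needs exactly {width} hex digits")
            code_point = int(digits, 16)
            # Assumption: surrogates and out-of-range values are rejected.
            if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
                raise ValueError("escape is not a valid Unicode code point")
            out.append(chr(code_point))
            i += 2 + width
        else:
            # Not a Unicode escape: keep the backslash pair as written.
            out.append(body[i : i + 2])
            i += 2
    return "".join(out)
```

Under these assumptions, a malformed escape such as \u0x400 is an error rather than being silently kept as literal text, and an escaped backslash followed by u00a0 is left alone.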

Python2 & 3 also accept \xXX escapes, with two hex digits. Should Starlark?
(FWIW: C++ and Go do too; Java does not, and furthermore its \uXXXX notation denotes a UTF-16 code unit, not a Unicode code point.)
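The distinction between a UTF-16 code unit and a code point can be checked with a short Python sketch. It uses only the standard str.encode method; the variable names are illustrative. The CRYING CAT FACE from the transcripts above is a single code point, but it needs two UTF-16 code units (a surrogate pair), which is why a four-digit escape that denotes one code unit cannot express it on its own.

```python
cat = "\U0001f63f"  # CRYING CAT FACE, one code point

# Big-endian UTF-16 without a byte order mark: two 16-bit code units.
utf16_units = cat.encode("utf-16-be")

code_point_count = len(cat)              # 1
code_unit_count = len(utf16_units) // 2  # 2
surrogate_pair_hex = utf16_units.hex()   # "d83dde3f"
```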


THIS IS AN INCOMPATIBLE LANGUAGE CHANGE; see below

This change defines a 'bytes' data type, an immutable string of bytes. In this Go implementation of Starlark, ordinary strings are also strings of bytes, so the behavior of the two is very similar. However, that is not required by the spec. Other implementations of Starlark, notably in Java, may use strings of UTF-16 codes for the ordinary string type, and thus need a distinct type for byte strings.

See testdata/bytes.star for a tour of the API, and some remaining questions.

See the attached issue for an outline of the proposed spec change. A Java implementation is underway, but is greatly complicated by Bazel's unfortunate misdecoding of UTF-8 files as Latin1.

The string.elems iterable view is now indexable.

The old syntax.quote function (which was in fact not used except in tests) has been replaced by syntax.Quote, which is similar to Go's strconv.Quote, but uses \X to escape raw code units as needed.

This change removes go.starlark.net.lib.proto.Bytes.

IMPORTANT: string literals that previously used hex escapes \xXX or octal escapes \OOO to denote byte values greater than 127 will now result in a compile error advising you to use \X or \u escapes instead.

Updates bazelbuild/starlark#112
Fixes #222

Change-Id: Ieccd177a2662ca2106016165b50073a670ae7f2c

THIS IS AN INCOMPATIBLE LANGUAGE CHANGE; see below

This change defines a 'bytes' data type, an immutable string of bytes. In this Go implementation of Starlark, ordinary strings are also strings of bytes, so the behavior of the two is very similar. However, that is not required by the spec. Other implementations of Starlark, notably in Java, may use strings of UTF-16 codes for the ordinary string type, and thus need a distinct type for byte strings.

See testdata/bytes.star for a tour of the API, and some remaining questions.

See the attached issue for an outline of the proposed spec change. A Java implementation is underway, but is greatly complicated by Bazel's unfortunate misdecoding of UTF-8 files as Latin1.

The string.elems iterable view is now indexable.

The old syntax.quote function (which was in fact not used except in tests) has been replaced by syntax.Quote, which is similar to Go's strconv.Quote.

This change removes go.starlark.net.lib.proto.Bytes.

IMPORTANT: string literals that previously used hex escapes \xXX or octal escapes \OOO to denote byte values greater than 127 will now result in a compile error advising you to use \u escapes instead if you want the UTF-8 encoding of a code point in the range U+80 to U+FF. A string literal can no longer denote invalid text, such as the 1-element string formerly written "\xff".

Updates bazelbuild/starlark#112
Fixes #222

Change-Id: Ieccd177a2662ca2106016165b50073a670ae7f2c
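The difference the IMPORTANT note draws, between a raw byte above 127 and the UTF-8 encoding of a code point in the range U+80 to U+FF, can be illustrated in Python, which already separates str from bytes. This is only an analogy using standard Python calls; it does not show the Starlark bytes API itself, and the variable names are illustrative.

```python
text = "\u00ff"  # the code point U+00FF as text

# Encoding the code point as UTF-8 yields two bytes, not one.
utf8_bytes = text.encode("utf-8")  # b'\xc3\xbf'

# A lone 0xFF byte is a valid byte string but not valid UTF-8 text.
raw = b"\xff"
try:
    raw.decode("utf-8")
    raw_is_valid_utf8 = True
except UnicodeDecodeError:
    raw_is_valid_utf8 = False
```

This is the sense in which the 1-element string formerly written "\xff" denoted invalid text: the single byte 0xFF is not the encoding of any code point.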

alandonovan self-assigned this Feb 5, 2021

alandonovan closed this as completed in ebe61bd Feb 12, 2021
