diff --git a/CHANGELOG.md b/CHANGELOG.md
index 0e75f51..bc2c41e 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -8,8 +8,27 @@
 * Single line comments (`//`) can now be immediately followed by a newline.
 * All literal whitespace following a `\` in a string is now discarded.
 * Vertical tabs (`U+000B`) are now considered to be whitespace.
-* Identifiers can't start with `r#`, so they're easy to distinguish from raw strings. (They already similarly can't start with a digit, or a sign+digit, so they're easy to distinguish from numbers.)
-* The grammar syntax itself has been described, and some confusing definitions in the grammar have been fixed accordingly (mostly related to escaped characters).
+* Identifiers can't start with `r#`, so they're easy to distinguish from raw
+  strings. (They already similarly can't start with a digit, or a sign+digit,
+  so they're easy to distinguish from numbers.)
+* The grammar syntax itself has been described, and some confusing definitions
+  in the grammar have been fixed accordingly (mostly related to escaped
+  characters).
+* `,`, `<`, and `>` are now legal identifier characters. They were previously
+  reserved for KQL but this is no longer necessary.
+* Code points under `0x20`, code points above `0x10FFFF`, Delete control
+  character (`0x7F`), and the [unicode "direction control"
+  characters](https://www.w3.org/International/questions/qa-bidi-unicode-controls)
+  are now completely banned from appearing literally in KDL documents. They
+  can now only be represented in regular strings, and there are no facilities to
+  represent them in raw strings. This should be considered a security
+  improvement.
+* Raw strings no longer require an `r` prefix: they are now specified by using
+  `#""#`.
+* `#` is an illegal initial identifier character, but is allowed in other
+  places in identifiers.
+* Line continuations can be followed by an EOF now, instead of requiring a
+  newline (or comment). `node \` is now a legal KDL document.
 
 ### KQL
 
diff --git a/SPEC.md b/SPEC.md
index 3b5a782..9480301 100644
--- a/SPEC.md
+++ b/SPEC.md
@@ -94,7 +94,7 @@ foo 1 key="val" 3 {
 
 A bare Identifier is composed of any Unicode codepoint other than [non-initial
 characters](#non-initial-characters), followed by any number of Unicode
-codepoints other than [non-identifier characters](#non-identifier-characters),
+code points other than [non-identifier characters](#non-identifier-characters),
 so long as this doesn't produce something confusable for a [Number](#number),
 [Boolean](#boolean), or [Null](#null). For example, both a [Number](#number)
 and an Identifier can start with `-`, but when an Identifier starts with `-`
@@ -122,9 +122,9 @@ of having an identifier look like a negative number.
 The following characters cannot be used anywhere in a bare
 [Identifier](#identifier):
 
-* Any codepoint with hexadecimal value `0x20` or below.
-* Any codepoint with hexadecimal value higher than `0x10FFFF`.
-* Any of `\/(){}<>;[]=,"`
+* Any of `\/(){};[]="`
+* Any [disallowed literal code points](#disallowed-literal-code-points) in KDL
+  documents.
 
 ### Line Continuation
 
@@ -137,6 +137,7 @@ characters and an optional single-line comment. It must be terminated by a
 Following a line continuation, processing of a Node can continue as usual.
 
 #### Example
+
 ```kdl
 my-node 1 2 \  // comments are ok after \
         3 4    // This is the actual end of the Node.
@@ -309,6 +310,10 @@ String Value can encompass multiple lines without behaving like a Newline for
 
 Strings _MUST_ be represented as UTF-8 values.
 
+Strings _MUST NOT_ include the code points for [disallowed literal
+code points](#disallowed-literal-code-points) directly. If needed, they can be
+specified with their corresponding `\u{}` escape.
+
 #### Escapes
 
 In addition to literal code points, a number of "escapes" are supported.
@@ -362,17 +367,22 @@ support `\`-escapes.
 They otherwise share the same properties as far as literal [Newline](#newline)
 characters go, and the requirement of UTF-8 representation.
 
-Raw String literals are represented as `r`, followed by zero or more `#`
-characters, followed by `"`, followed by any number of UTF-8 literals. The string is then
-closed by a `"` followed by a _matching_ number of `#` characters. This means
-that the string sequence `"` or `"#` and such must not match the closing `"`
-with the same or more `#` characters as the opening `r`.
+Raw String literals are represented with one or more `#` characters, followed
+by `"`, followed by any number of UTF-8 literals. The string is then closed by
+a `"` followed by a _matching_ number of `#` characters. This means that the
+string sequence `"` or `"#` and such must not match the closing `"` with the
+same or more `#` characters as the opening `#`, in the body of the string.
+
+Like Strings, Raw Strings _MUST NOT_ include any of the [disallowed literal
+code points](#disallowed-literal-code-points) as code points in their body.
+Unlike with Strings, these cannot simply be escaped, and are thus
+unrepresentable when using Raw Strings.
 
 #### Example
 
 ```kdl
-just-escapes r"\n will be literal"
-quotes-and-escapes r#"hello\n\r\asd"world"#
+just-escapes #"\n will be literal"#
+quotes-and-escapes ##"hello\n\r\asd"#world"##
 ```
 
 ### Number
@@ -470,6 +485,16 @@ lines](https://www.unicode.org/versions/Unicode13.0.0/ch05.pdf):
 
 Note that for the purpose of new lines, CRLF is considered _a single newline_.
 
+### Disallowed Literal Code Points
+
+The following code points may not appear literally anywhere in the document.
+They may be represented in Strings (but not Raw Strings) using `\u{}`.
+
+* Any codepoint with hexadecimal value below `0x20` (various control characters).
+* `0x7F` (the Delete control character).
+* Any codepoint with hexadecimal value higher than `0x10FFFF`.
+* `0x2066-2069` and `0x202A-202E`, the [unicode "direction control" characters](https://www.w3.org/International/questions/qa-bidi-unicode-controls)
+
 ## Full Grammar
 
 This is the full official grammar for KDL and should be considered
@@ -494,25 +519,24 @@ node-children := '{' nodes '}'
 node-terminator := single-line-comment | newline | ';' | eof
 
 identifier := string | bare-identifier
-bare-identifier := (unambiguous-ident | numberish-ident | stringish-ident) - keyword
-unambiguous-ident := (identifier-char - digit - sign - "r") identifier-char*
+bare-identifier := (unambiguous-ident | numberish-ident) - keyword
+unambiguous-ident := (identifier-char - digit - sign - "#") identifier-char*
 numberish-ident := sign ((identifier-char - digit) identifier-char*)?
-stringish-ident := "r" ((identifier-char - "#") identifier-char*)?
-identifier-char := unicode - line-space - [\\/(){}<>;\[\]=,"]
+identifier-char := unicode - line-space - [\\/(){};\[\]="] - disallowed-literal-code-points
+
 keyword := boolean | 'null'
 prop := identifier '=' value
 value := type? (string | number | keyword)
 type := '(' identifier ')'
 
 string := raw-string | escaped-string
-escaped-string := '"' character* '"'
-character := '\' escape | [^\"]
+escaped-string := '"' string-character* '"'
+string-character := '\' escape | [^\"] - disallowed-literal-code-points
 escape := ["\\bfnrt] | 'u{' hex-digit{1, 6} '}' | (unicode-space | newline)+
 hex-digit := [0-9a-fA-F]
 
-raw-string := 'r' raw-string-hash
-raw-string-hash := '#' raw-string-hash '#' | raw-string-quotes
-raw-string-quotes := '"' .* '"'
+raw-string := '#' raw-string-quotes '#' | '#' raw-string '#'
+raw-string-quotes := '"' (unicode - disallowed-literal-code-points)* '"'
 
 number := decimal | hex | octal | binary
 
@@ -528,7 +552,7 @@ binary := sign? '0b' ('0' | '1') ('0' | '1' | '_')*
 
 boolean := 'true' | 'false'
 
-escline := '\\' ws* (single-line-comment | newline)
+escline := '\\' ws* (single-line-comment | newline | eof)
 
 newline := See Table (All line-break white_space)
 
@@ -536,6 +560,8 @@ ws := bom | unicode-space | multi-line-comment
 
 bom := '\u{FEFF}'
 
+disallowed-literal-code-points := See Table (Disallowed Literal Code Points)
+
 unicode-space := See Table (All White_Space unicode characters which are not `newline`)
 
 single-line-comment := '//' ^newline* (newline | eof)
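
Review note: the "Disallowed Literal Code Points" rule added to SPEC.md could be enforced as a scan over the decoded document before parsing. The Python sketch below is only an illustration under stated assumptions, not code from any KDL implementation. `find_disallowed_literals`, `BIDI_CONTROLS`, and `ASSUMED_ALLOWED_CONTROLS` are hypothetical names. It reads the lower bound as "under `0x20`", following the CHANGELOG entry, and it assumes that tab, LF, VT, FF, and CR stay legal because the spec uses them as whitespace or newlines; the diff does not state that exemption explicitly.

```python
# Hypothetical pre-parse check; names and the exemption set are assumptions.

# Direction control characters listed in the diff: 0x2066-2069 and 0x202A-202E.
BIDI_CONTROLS = set(range(0x2066, 0x206A)) | set(range(0x202A, 0x202F))

# Assumption: these code points under 0x20 remain legal because the spec
# treats them as whitespace or newlines (tab, LF, VT, FF, CR).
ASSUMED_ALLOWED_CONTROLS = {0x09, 0x0A, 0x0B, 0x0C, 0x0D}


def find_disallowed_literals(text):
    """Return (index, 'U+XXXX') for each disallowed literal code point."""
    hits = []
    for index, ch in enumerate(text):
        cp = ord(ch)
        is_control = cp < 0x20 and cp not in ASSUMED_ALLOWED_CONTROLS
        # Code points above 0x10FFFF cannot occur in a Python str, so that
        # bullet of the rule needs no explicit check here.
        if is_control or cp == 0x7F or cp in BIDI_CONTROLS:
            hits.append((index, f"U+{cp:04X}"))
    return hits


print(find_disallowed_literals('node "a\u202eb"'))  # [(7, 'U+202E')]
```

A document that trips this check can still carry the same code point in a regular String through a `\u{}` escape, which is the route the spec text describes; Raw Strings have no equivalent.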