[spec] Text format #471

rossberg · 2017-05-11T16:27:07Z

This PR specifies the text format, based on earlier discussion between @binji, @lukewagner, @sunfishcode, and myself. Preview at http://webassembly.github.io/spec/

The changes relative to the .wast format currently implemented in the interpreter and other tools are the following.

Removals:

- some of the more baroque forms of sugar for `if`
- binary module bodies
- anything script related (assertions, invokes, etc.)
- `infinity` as a secondary spelling for float `inf`
- (we had discussed removing the optional module name as well, but in the light of design PR WebAssembly/design#1055, we probably want to keep it?)

Additions:

- `\u{...}` escapes in strings (see below)
- more than just one inline export (that is, you can write `(func $f (export "f1") (export "f2") ...)`, closing a gap in the syntax), and it combines with import
- the toplevel `(module ...)` is optional

Changes:

I took the liberty to propose one breaking change to make the syntax forward compatible with some of the future extensions that have been discussed:

- non-empty block signatures must now be written `(result i32)`, in order to generalise cleanly to function signatures

Unicode:

- the lexical syntax is defined in terms of Unicode characters (i.e., code points)
- comments and strings may contain mostly arbitrary Unicode, the rest stays within ASCII
- in strings, a Unicode character denotes its UTF-8 encoding
- in strings, Unicode characters can be given explicitly with `\u{...}` notation
- .wat files are assumed to be encoded in UTF-8
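
To make the string rules above concrete, here is a minimal Python sketch of how the body of a string literal could map to bytes. It is an illustration only, not code from the interpreter: `string_bytes` is a hypothetical helper, it handles only literal characters and `\u{...}` escapes (other escapes are out of scope), and rejecting surrogates and values above U+10FFFF is an assumption of the sketch.

```python
import re

# Matches a \u{...} escape with one or more hex digits (hypothetical helper pattern).
_U_ESCAPE = re.compile(r"\\u\{([0-9A-Fa-f]+)\}")


def string_bytes(body: str) -> bytes:
    # 'body' is the text between the quotes of a string literal.
    # Literal characters and \u{...} escapes both end up as UTF-8 bytes.
    def expand(match):
        code_point = int(match.group(1), 16)
        # Assumption of this sketch: only Unicode scalar values are accepted.
        if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
            raise ValueError("not a Unicode scalar value: %x" % code_point)
        return chr(code_point)

    return _U_ESCAPE.sub(expand, body).encode("utf-8")


# A literal character and its explicit escape denote the same bytes.
print(string_bytes("caf\u00e9") == string_bytes(r"caf\u{e9}"))
```

Under these assumptions, writing a character literally and writing it as an escape are interchangeable, which is the point of defining strings in terms of their UTF-8 encoding.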

The PR includes changes to the interpreter implementing the Unicode support (but not yet the other changes).

Misc Remarks:

- formatting characters: currently only the minimum set of formatting characters is allowed as white space (\t, \n, \r); we could include more, e.g. the whole set of ASCII "format effectors" (\b, \v, \f), but you quickly get into a lot of Unicode complexity if you want to go further than that
- Unicode in comments: similarly, in order to avoid getting into Unicode specifics, any legal code point is currently allowed in comments; should we be more restrictive?
- binary module bodies: they would seem pretty unusual for a "text" format, so are not included. Is there a reason to keep them?
- abbreviations: to avoid combinatorial complexity in defining the AST to map on, most syntactic sugar is specified in the form of "abbreviations", simple rewritings into the core syntax
- inline function signatures: I tried to come up with a decent way to describe their rewriting into type indices (and the potential insertion of new type definitions) in terms of rules, but ultimately gave up; it's too cumbersome to express succinctly; so this is the one part that is left partially informal (though hopefully still unambiguous)
- formatting: many of the rules do not currently fit the page width; I left them as is for now, and plan to clean up layout issues once the spec is complete, probably tweaking some layout parameters as well
- PDF: currently doesn't build; due to various limitations of MathJax (such as the inability to use packages) I had to resort to some hacks for special characters that apparently don't work in proper LaTeX; will fix later
- tests: lots of stuff we could write tests for, e.g. regarding the Unicode support...
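
As an illustration of what "abbreviations as simple rewritings" could look like for the inline export sugar, here is a rough Python sketch over S-expressions represented as nested lists. The function name, the list representation, and the output order (exports first, then the function) are assumptions made for the example; the spec text is the authority on the actual rewriting, and inline imports and anonymous functions are not handled here.

```python
def desugar_inline_exports(func):
    # 'func' is a nested-list S-expression such as
    # ["func", "$f", ["export", '"f1"'], ["export", '"f2"'], ...].
    # This sketch requires an explicit $identifier on the function.
    if func[:1] != ["func"] or len(func) < 2 or not isinstance(func[1], str):
        raise ValueError("expected a func form with an explicit identifier")
    name, rest = func[1], func[2:]
    exports = []
    # Peel off every leading (export "...") and turn it into a standalone export.
    while rest and isinstance(rest[0], list) and rest[0][:1] == ["export"]:
        exports.append(["export", rest[0][1], ["func", name]])
        rest = rest[1:]
    # What remains is the function in core syntax.
    return exports + [["func", name] + rest]


sugared = ["func", "$f", ["export", '"f1"'], ["export", '"f2"'],
           ["result", "i32"], ["i32.const", "0"]]
for form in desugar_inline_exports(sugared):
    print(form)
```

Treating sugar as a source-to-source rewrite like this keeps the core grammar small, since only the desugared forms need to be mapped onto the AST.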

jfbastien · 2017-05-11T16:32:52Z

On Unicode comments: are U+2028 and U+2029 allowed?

rossberg · 2017-05-11T16:38:46Z

@jfbastien, yes, currently, any code point is allowed, with no special interpretation. The only ones with specific meaning are ; ( ) and \n. (In particular, U+2028/9 would not end a line comment, but nor does ASCII \r, \f, or \v.)
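
A small Python sketch of the rule described above, where only \n terminates a line comment. `line_comment_end` is a hypothetical helper written for this illustration, not interpreter code. The contrast with Python's own `str.splitlines`, which does treat U+2028 and \r as line boundaries, shows why the question is worth asking.

```python
def line_comment_end(source: str, start: int) -> int:
    # Return the index where the line comment beginning at 'start' ends.
    # Only "\n" terminates it; U+2028, U+2029, \r, \f and \v are ordinary content.
    if not source.startswith(";;", start):
        raise ValueError("no line comment at this position")
    newline = source.find("\n", start)
    return len(source) if newline == -1 else newline


src = ";; a\u2028b\rc\n(module)"
end = line_comment_end(src, 0)
print(repr(src[:end]))        # the whole comment, U+2028 and \r included
print(len(src.splitlines()))  # Python's notion of "lines" splits it further
```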

jfbastien · 2017-05-11T16:46:39Z

Gotcha, that makes sense. I'm not sure we want to restrict more than what you have, silly Unicode-isms are silly. Were we to try to fool-proof things there's other semicolons such as U+FF1B ； and friends ؛፤⁏⍮⸵︔﹔
❨ and more ❩ ❪ parens than ❫ ⦗ one can ⦘ ﹙ keep track ﹚ ﹝ of ﹞ （ because lol Unicode ）.

I just would rather ask so we all think about JSONP and shed a Unicode tear 💧 for it.

sunfishcode · 2017-05-11T16:52:09Z

Concerning module names and WebAssembly/design#1055, the module name in the name section has no semantic effect, and so seems like it could differ from the module name used for linking. I'd suggest this is an interesting enough corner case that it's worth testing, which would mean we'd want distinct syntax for the name-section module name.

Are the baroque forms of sugar mentioned above being removed from the .wast format too? Similarly, are the various Additions and Changes being made to the .wast format too?

What is the spec.exec.1 branch? This PR contains some changes that look like changes already in master, so it's not clear what the specific changes are here.

rossberg · 2017-05-11T16:59:53Z

@sunfishcode, yes, I plan to make the changes to the interpreter, since .wast should remain a strict superset of .wat that differs only in terms of the additional script constructs.

Branch spec.exec.1 corresponds to PR #467 -- I used that as a baseline here, since there were too many dependent changes.

rossberg · 2017-05-11T17:03:47Z

One item I forgot initially: this also doesn't include the binary module body syntax that we use for some tests. Should it?

jfbastien · 2017-05-11T17:09:25Z

> One item I forgot initially: this also doesn't include the binary module body syntax that we use for some tests. Should it?

Can the proposed format generate all interesting invalid inputs without this?

Additionally, can the proposed format generate equivalent modules (say, non-canonical LEBs)?

rossberg · 2017-05-11T17:16:30Z

@jfbastien, no, without the direct binary notation something like LEB isn't even a thing in the text format. I guess the question we need to answer is whether it is the purpose of the official text format (as opposed to .wast) to enable expressing such tests.
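
For readers unfamiliar with the term, a short Python sketch of what a "non-canonical LEB" is: the same unsigned integer encoded with redundant padding bytes. The function names and the `min_length` parameter are made up for this example; it shows generic unsigned LEB128 and says nothing about which encodings a given decoder accepts.

```python
def encode_uleb128(value: int, min_length: int = 1) -> bytes:
    # Unsigned LEB128; min_length > 1 forces a padded (non-canonical) encoding.
    if value < 0:
        raise ValueError("unsigned LEB128 cannot encode negative values")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value != 0 or len(out) + 1 < min_length:
            out.append(byte | 0x80)  # continuation bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_uleb128(data: bytes) -> int:
    # Accumulate 7 bits per byte, least significant group first.
    result = 0
    for shift, byte in enumerate(data):
        result |= (byte & 0x7F) << (7 * shift)
        if byte & 0x80 == 0:
            break
    return result


canonical = encode_uleb128(3)
padded = encode_uleb128(3, min_length=3)
print(canonical.hex(), padded.hex(), decode_uleb128(canonical) == decode_uleb128(padded))
```

Both byte sequences decode to the same number, and a text format that only writes the number has no way to say which of them is meant.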

jfbastien · 2017-05-11T17:17:37Z

> @jfbastien, no, without the direct binary notation something like LEB isn't even a thing in the text format. I guess the question we need to answer is whether it is the purpose of the official text format (as opposed to .wast) to enable expressing such tests.

Yes, I think that's the right question to ask. I think we can move forward with your proposal, without this addition, and then add it later if we decide we need it. Correct?

rossberg · 2017-05-11T17:18:58Z

@jfbastien, correct.

rossberg · 2017-05-17T13:51:57Z

I included the changes necessary to both interpreter and test suite to adjust to the listed grammar modifications. Also a few fixes to the spec, notably including #478 and clarification of the ability to combine inline import/export sugar.
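
To illustrate the kind of token-separation issue behind #478, here is a deliberately simplified Python tokenizer. It is only a sketch: the regular expression approximates "a run of characters that are not white space, parentheses, quotes or semicolons" and is not the character set from the spec, comments are not handled, and `tokenize` is a hypothetical name.

```python
import re

# Parenthesis, string literal, or a maximal run of other non-separator characters.
_TOKEN = re.compile(r'[()]|"(?:[^"\\]|\\.)*"|[^\s()";]+')


def tokenize(source: str) -> list:
    tokens = []
    pos = 0
    while pos < len(source):
        if source[pos] in " \t\n\r":  # the white space set discussed in this PR
            pos += 1
            continue
        match = _TOKEN.match(source, pos)
        if match is None:
            raise ValueError("unexpected character at offset %d" % pos)
        tokens.append(match.group())
        pos = match.end()
    return tokens


print(tokenize("(i32.const 0)"))
print(tokenize("(i32.const0)"))
```

With longest-match tokenisation, `i32.const0` comes out as a single unknown token instead of being split into `i32.const` and `0`, so a later stage can reject it.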

rossberg · 2017-05-22T12:21:38Z

Anybody opposed to landing this? Anybody willing to review the PR? :)

lukewagner · 2017-05-25T02:25:01Z

document/text/conventions.rst

+
+* Terminal symbols are either literal strings of characters enclosed in quotes: :math:`\text{module}`;
+  or expressed as `Unicode <http://www.unicode.org/versions/latest/>`_ code points: :math:`\unicode{0A}`.
+  (All characters written literally are unambguously drawn from the `7-bit ASCII <http://webstore.ansi.org/RecordDetail.aspx?sku=INCITS+4-1986%5bR2012%5d>`_ subset of Unicode.)


*unambiguously

lukewagner · 2017-05-25T02:38:02Z

document/text/lexical.rst

+     \Tkeyword ~|~ \TuN ~|~ \TsN ~|~ \TfN ~|~ \Tstring ~|~ \Tid ~|~
+     \text{(} ~|~ \text{)} ~|~ \Treserved \\
+   \production{keyword} & \Tkeyword &::=&
+     \mbox{(any terminal symbol in the grammar that is non of the above)} \\


lukewagner · 2017-05-25T02:43:34Z

document/text/values.rst

+Values
+------
+
+The grammar produtions in this section define *lexical syntax*,


*productions

lukewagner · 2017-05-25T03:38:02Z

Nice work! lgtm with a few nits above and here:

- lexical.html#characters : the Note seems a bit confusing to me given the preceding para just said all valid Unicode code points are characters. I think what's being said is that, across the entire set of rules which define the text format, it is noted to be the case that, outside of comments and string literals, only a subset of 7-bit ASCII characters are used?
- the abstract syntax of a floating constant is defined by values.html#floating-point to be a sequence of (IEEE754-interpreted) bytes however values.html#floating-point seems to produce reals with only a fuzzy note above that they get rounded; could you instead have an explicit realBytes(...) function applied to the reals to explicitly produce the bytes?
- types.html#table-types : perhaps add a Note that elemtype may be extended with other types in the future
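
A rough Python sketch of the kind of explicit real-to-bytes function suggested in the second point. `real_bytes` is a hypothetical name and this is not how the spec defines it: Python's `float` is already a binary64 value rather than an exact real, so the 32-bit case rounds twice, and `struct.pack` raises `OverflowError` for magnitudes too large for the target width instead of rounding.

```python
import struct


def real_bytes(value: float, bits: int) -> bytes:
    # Round 'value' to the nearest IEEE 754 value of the given width and
    # return its little-endian byte sequence.
    formats = {32: "<f", 64: "<d"}
    if bits not in formats:
        raise ValueError("unsupported width: %d" % bits)
    return struct.pack(formats[bits], value)


print(real_bytes(0.1, 32).hex())
print(real_bytes(0.1, 64).hex())
```

Making the rounding an explicit function application, rather than a side note, means each floating-point literal denotes exactly one byte sequence per width.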

rossberg · 2017-05-29T11:51:48Z

On 25 May 2017 at 05:38, Luke Wagner ***@***.***> wrote:

> Nice work! lgtm with a few nits above and here:
>
> - lexical.html#characters : the Note seems a bit confusing to me given the preceding para just said all valid unicode code points are characters. I *think* what's being said is that, across the entire set of rules which define the text format, it is noted to be the case that, outside of comments and string literals, only a subset of 7-bit ASCII characters are used?

Yes, that was the intention. Reworded.

> - the abstract syntax of a floating constant is defined by values.html#floating-point to be a sequence of (IEEE754-interpreted) bytes however values.html#floating-point seems to produce reals with only a fuzzy note above that they get rounded; could you instead have an explicit realBytes(...) function applied to the reals to explicitly produce the bytes?

Right, that's the purpose of the `ieee` meta functions used in the attribute expressions. I yet have to define those, though (see the todo), which I plan to do when I get to the numeric ops.

> - types.html#table-types : perhaps add a Note that elemtype may be extended with other types in the future

Added.

lukewagner · 2017-05-29T15:02:10Z

@rossberg-chromium Ah, I see now; they all feed in to fN; I missed that before.

rossberg · 2017-06-01T09:38:04Z

Landing with above LGTM and no objections.

@binji

This change specifies the text format, based on earlier discussion between @binji, @lukewagner, @sunfishcode, and myself. It also adapts the interpreter to the changes listed below.

The changes relative to the .wast format previously implemented in the interpreter and other tools are the following.

Removals:

- some of the more baroque forms of sugar for `if`
- binary module bodies
- anything script related (assertions, invokes, etc)
- `infinity` as a secondary spelling for float `inf`

Additions:

- \u{...} escapes in strings (see below)
- more than just one inline export (that is, you can write (func $f (export "f1") (export "f2") ...), closing a gap in the syntax), and it combines with import
- the toplevel (module ...) is optional

Changes:

One breaking change makes the syntax forward compatible with some of the future extensions that have been discussed:

- non-empty block signatures must now be written (result i32), in order to generalise cleanly to function signatures

Unicode:

- the lexical syntax is defined in terms of Unicode characters (i.e., code points)
- comments and strings may contain mostly arbitrary Unicode, the rest stays within ASCII
- in strings, a Unicode character denotes its UTF-8 encoding
- in strings, Unicode characters can be given explicitly with \u{...} notation
- .wat files are assumed to be encoded in UTF-8

Misc Remarks:

- formatting characters: currently only the minimum set of formatting characters are allowed as white space (\t, \n, \r); we could include more, e.g. the whole set of ASCII "format effectors" (\b, \v, \f), but you quickly get into a lot of Unicode complexity if you want to go further than that
- Unicode in comments: similarly, in order to avoid getting into Unicode specifics, any legal code point is currently allowed in comments; should we be more restrictive?
- binary module bodies: they would seem pretty unusual for a "text" format, so are not included for now.
- abbreviations: to avoid combinatorial complexity in defining the AST to map on, most syntactic sugar is specified in the form of "abbreviations", simple rewritings into the core syntax
- inline function signatures: I tried to come up with a decent way to describe their rewriting into type indices (and the potential insertion of new type definitions) in terms of rules, but ultimately gave up; it's too cumbersome to express succinctly; so this is the one part that is left partially informal (though hopefully still unambiguous)
- formatting: many of the rules do not currently fit the page width; I left them as is for now, and plan to clean up layout issues once the spec is complete, probably tweaking some layout parameters as well
- tests: lots of stuff we could write tests for, e.g. regarding the Unicode support...

The new limits were previously merged in WebAssembly#360, but apparently that PR was to another branch that itself was never merged. Reland that change with two additional fixes:

- Add digit separators in the numbers to improve readability and match the formatting of other limits.
- Add an explicit limit of 1,000,000 types per recursion group. This is not intended to be a functional change, but this limit was previously left implicit.


rossberg added 17 commits April 27, 2017 17:58

Start on text format

fa6ee73

Core syntax done

cefc934

Lexical

fed5963

Disallow control characters in strings

748ccac

Lexical; index context

006f359

Unicode

26b9df9

Finish

6cfc342

Merge branch 'master' into spec.textual

e2f42d3

Locals, params

5cac42d

Reject malformed UTF-8 sources

de44c0c

Inline types

ac11a6c

Free module ordering

818b4cc

Free module ordering

5658fab

More abbreviations

dfaa467

Complete

7907caa

Merge branch 'spec.exec.1' into spec.textual

3cae95a

Fix various xref and TeX bugs

bb30331

rossberg changed the title ~~[spec] Specify text format~~ [spec] Text format May 11, 2017

rossberg added 2 commits May 11, 2017 19:59

Merge branch 'spec.exec.1' into spec.textual

998078c

C&P typo

b184b66

Stricter lexical rules for token separation

7fc3ba7

rossberg mentioned this pull request May 16, 2017

[spec+interpreter] Accepts "i32.const0" (missing space) #478

Closed

rossberg added 4 commits May 16, 2017 11:16

Parens are tokens too

302d843

Update interpreter to match text format spec

d1469fa

Support .wat files

678dade

Forgot added test file

6bfd70e

Fix tokenisation

bcec37f

This was referenced May 18, 2017

Interpreter: Inconsistencies in s-expression format #437

Closed

S-Expression Syntax #466

Closed

lukewagner reviewed May 25, 2017


binji mentioned this pull request May 26, 2017

Add options to wat-writer WebAssembly/wabt#436

Closed

6 tasks

rossberg added 2 commits May 29, 2017 12:53

More Unicode-related fixes in the interpreter

2e52a8a

Comments

997a258

rossberg added 2 commits May 29, 2017 17:47

Fix Latex

47b7405

Rename text to string token

f1f24e4

rossberg merged commit 0a8fda1 into spec.exec.1 Jun 1, 2017

rossberg deleted the spec.textual branch June 1, 2017 09:42

sbc100 mentioned this pull request Jun 10, 2017

Output inf/-inf rather than infinity/-infinity WebAssembly/binaryen#1046

Merged

cretz mentioned this pull request Oct 11, 2017

Spec Text Format For Quoted Modules And Binary #578

Closed

dhil pushed a commit to dhil/webassembly-spec that referenced this pull request Nov 13, 2023

[js-api] Restore accidentally-deleted digit separators

51de001

These were accidentally deleted in WebAssembly#471.
