[spec] Text format #471

Merged
merged 35 commits into from
Jun 1, 2017

@rossberg rossberg commented May 11, 2017

This PR specifies the text format, based on earlier discussion between @binji, @lukewagner, @sunfishcode, and myself. Preview at http://webassembly.github.io/spec/

The changes relative to the .wast format currently implemented in the interpreter and other tools are the following.

Removals:

  • some of the more baroque forms of sugar for if
  • binary module bodies
  • anything script related (assertions, invokes, etc)
  • infinity as a secondary spelling for float inf
  • (we had discussed removing the optional module name as well, but in light of design PR WebAssembly/design#1055, we probably want to keep it?)

Additions:

  • \u{...} escapes in strings (see below)
  • more than just one inline export (that is, you can write (func $f (export "f1") (export "f2") ...), closing a gap in the syntax), and it combines with import
  • the toplevel (module ...) is optional

Changes:

I took the liberty to propose one breaking change to make the syntax forward compatible with some of the future extensions that have been discussed:

  • non-empty block signatures must now be written (result i32), in order to generalise cleanly to function signatures

Unicode:

  • the lexical syntax is defined in terms of Unicode characters (i.e., code points)
  • comments and strings may contain mostly arbitrary Unicode, the rest stays within ASCII
  • in strings, a Unicode character denotes its UTF-8 encoding
  • in strings, Unicode characters can be given explicitly with \u{...} notation
  • .wat files are assumed to be encoded in UTF-8
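
A minimal sketch of the string rule above, in Python rather than the interpreter's own code: a literal character contributes its UTF-8 encoding, and `\u{...}` contributes the UTF-8 encoding of the named code point. The function name `string_bytes` is hypothetical, other escape forms are left out, and rejecting surrogates is an assumption based on UTF-8 itself, not something stated in this thread.

```python
def string_bytes(body: str) -> bytes:
    """Return the bytes denoted by the text between a string literal's quotes.

    Sketch only: handles plain characters and \\u{...}; any other backslash
    escape is rejected instead of being interpreted.
    """
    out = bytearray()
    i = 0
    while i < len(body):
        if body.startswith("\\u{", i):
            close = body.index("}", i)  # raises ValueError if "}" is missing
            code_point = int(body[i + 3:close], 16)
            # Assumption: surrogates and values past U+10FFFF are rejected,
            # since they have no UTF-8 encoding.
            if 0xD800 <= code_point <= 0xDFFF or code_point > 0x10FFFF:
                raise ValueError(f"not a Unicode scalar value: {code_point:#x}")
            out += chr(code_point).encode("utf-8")
            i = close + 1
        elif body[i] == "\\":
            raise ValueError("escape form not covered by this sketch")
        else:
            # A literal character stands for its own UTF-8 encoding.
            out += body[i].encode("utf-8")
            i += 1
    return bytes(out)
```

Under these assumptions, `string_bytes(r"caf\u{e9}")` and `string_bytes("café")` yield the same four-plus-one bytes, which is the point of the rule: the escape is just another spelling of the character.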

The PR includes changes to the interpreter implementing the Unicode support (but not yet the other changes).

Misc Remarks:

  • formatting characters: currently only the minimum set of formatting characters is allowed as white space (\t, \n, \r); we could include more, e.g. the whole set of ASCII "format effectors" (\b, \v, \f), but you quickly get into a lot of Unicode complexity if you want to go further than that

  • Unicode in comments: similarly, in order to avoid getting into Unicode specifics, any legal code point is currently allowed in comments; should we be more restrictive?

  • binary module bodies: they would seem pretty unusual for a "text" format, so are not included. Is there a reason to keep them?

  • abbreviations: to avoid combinatorial complexity in defining the AST to map on, most syntactic sugar is specified in the form of "abbreviations", simple rewritings into the core syntax

  • inline function signatures: I tried to come up with a decent way to describe their rewriting into type indices (and the potential insertion of new type definitions) in terms of rules, but ultimately gave up; it's too cumbersome to express succinctly; so this is the one part that is left partially informal (though hopefully still unambiguous)

  • formatting: many of the rules do not currently fit the page width; I left them as is for now, and plan to clean up layout issues once the spec is complete, probably tweaking some layout parameters as well

  • PDF: currently doesn't build; due to various limitations of MathJax (such as the inability to use packages) I had to resort to some hacks for special characters that apparently don't work in proper LaTeX; will fix later

  • tests: lots of stuff we could write tests for, e.g. regarding the Unicode support...
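
To illustrate the "abbreviations" remark above, here is a rough Python sketch of one such rewriting: hoisting inline exports out of a function definition into separate export definitions. S-expressions are modelled as nested lists, the name `desugar_inline_exports` is hypothetical, and the sketch assumes the function already carries an identifier (generating a fresh one is not shown). It is not the interpreter's implementation.

```python
def desugar_inline_exports(func: list) -> list:
    """Rewrite ['func', '$f', ['export', n1], ..., body...] into core form.

    Returns a list of definitions: one ['export', name, ['func', '$f']] per
    inline export, followed by the function with the inline exports removed.
    """
    head, ident, *rest = func
    if head != "func" or not (isinstance(ident, str) and ident.startswith("$")):
        raise ValueError("sketch only handles (func $id ...)")
    exports = []
    # Peel off leading (export "name") clauses, preserving their order.
    while rest and isinstance(rest[0], list) and rest[0][:1] == ["export"]:
        name = rest.pop(0)[1]
        exports.append(["export", name, ["func", ident]])
    return exports + [["func", ident, *rest]]
```

For `(func $f (export "f1") (export "f2") (result i32) ...)` this produces two export definitions referring to `$f`, then the plain `(func $f (result i32) ...)`, so the AST only ever needs to represent the core syntax.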

@rossberg rossberg changed the title from "[spec] Specify text format" to "[spec] Text format" May 11, 2017
@jfbastien
Member

On Unicode comments: are U+2028 and U+2029 allowed?

@rossberg
Member Author

@jfbastien, yes, currently, any code point is allowed, with no special interpretation. The only ones with specific meaning are ; ( ) and \n. (In particular, U+2028/9 would not end a line comment, but nor does ASCII \r, \f, or \v.)
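
As a rough illustration of that answer, the Python sketch below drops line comments by scanning from `;;` to the next `\n` and nothing else. The function name is hypothetical, and the sketch deliberately ignores string literals and block comments, so it is not a full lexer.

```python
def strip_line_comments(source: str) -> str:
    """Remove ';;' line comments; only '\\n' terminates a comment."""
    out = []
    i = 0
    while i < len(source):
        if source.startswith(";;", i):
            end = source.find("\n", i)
            # Skip to the newline (kept in the output) or to end of input.
            i = len(source) if end == -1 else end
        else:
            out.append(source[i])
            i += 1
    return "".join(out)
```

Note that Python's `str.splitlines()` would not be a faithful substitute here, because it also treats `\r`, `\v`, `\f`, U+2028 and U+2029 as line boundaries.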

@jfbastien
Member

jfbastien commented May 11, 2017

Gotcha, that makes sense. I'm not sure we want to restrict more than what you have, silly Unicode-isms are silly. Were we to try to fool-proof things there's other semicolons such as U+FF1B ; and friends ؛፤⁏⍮⸵︔﹔
❨ and more ❩ ❪ parens than ❫ ⦗ one can ⦘ ﹙ keep track ﹚ ﹝ of ﹞ ( because lol Unicode ).

I just would rather ask so we all think about JSONP and shed a Unicode tear 💧 for it.

@sunfishcode
Member

Concerning module names and WebAssembly/design#1055, the module name in the name section has no semantic effect, and so seems like it could differ from the module name used for linking. I'd suggest this is an interesting enough corner case that it's worth testing, which would mean we'd want distinct syntax for the name-section module name.

Are the baroque forms of sugar mentioned above being removed from the .wast format too? Similarly, are the various Additions and Changes being made to the .wast format too?

What is the spec.exec.1 branch? This PR contains some changes that look like changes already in master, so it's not clear what the specific changes are here.

@rossberg
Member Author

@sunfishcode, yes, I plan to make the changes to the interpreter, since .wast should remain a strict superset of .wat that differs only in terms of the additional script constructs.

Branch spec.exec.1 corresponds to PR #467 -- I used that as a baseline here, since there were too many dependent changes.

@rossberg
Member Author

rossberg commented May 11, 2017

One item I forgot initially: this also doesn't include the binary module body syntax that we use for some tests. Should it?

@jfbastien
Member

> One item I forgot initially: this also doesn't include the binary module body syntax that we use for some tests. Should it?

Can the proposed format generate all interesting invalid inputs without this?

Additionally, can the proposed format generate equivalent modules (say, non-canonical LEBs)?

@rossberg
Member Author

@jfbastien, no, without the direct binary notation something like LEB isn't even a thing in the text format. I guess the question we need to answer is whether it is the purpose of the official text format (as opposed to .wast) to enable expressing such tests.
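
For readers unfamiliar with the "non-canonical LEB" case being discussed: the same unsigned integer has several valid LEB128 byte sequences, and only a byte-level notation can choose between them. The Python sketch below shows this; the helper names and the `pad_to` parameter are hypothetical and exist only for illustration.

```python
def encode_uleb128(value: int, pad_to: int = 0) -> bytes:
    """Encode an unsigned integer as LEB128, optionally padded with
    redundant continuation bytes up to pad_to bytes (non-canonical)."""
    if value < 0:
        raise ValueError("unsigned values only")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value or len(out) + 1 < pad_to:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_uleb128(data: bytes) -> int:
    """Decode the first unsigned LEB128 value found in data."""
    result = 0
    for index, byte in enumerate(data):
        result |= (byte & 0x7F) << (7 * index)
        if not byte & 0x80:
            return result
    raise ValueError("truncated LEB128 value")
```

Here `encode_uleb128(3)` gives the canonical one-byte form and `encode_uleb128(3, pad_to=2)` gives a two-byte form that decodes to the same value. A text-format integer literal such as `3` carries no information about which encoding to emit.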

@jfbastien
Member

> @jfbastien, no, without the direct binary notation something like LEB isn't even a thing in the text format. I guess the question we need to answer is whether it is the purpose of the official text format (as opposed to .wast) to enable expressing such tests.

Yes, I think that's the right question to ask. I think we can move forward with your proposal, without this addition, and then add it later if we decide we need it. Correct?

@rossberg
Member Author

@jfbastien, correct.

@rossberg
Member Author

rossberg commented May 17, 2017

I included the changes necessary to both interpreter and test suite to adjust to the listed grammar modifications. Also a few fixes to the spec, notably including #478 and clarification of the ability to combine inline import/export sugar.

@rossberg
Member Author

Anybody opposed to landing this? Anybody willing to review the PR? :)


* Terminal symbols are either literal strings of characters enclosed in quotes: :math:`\text{module}`;
or expressed as `Unicode <http://www.unicode.org/versions/latest/>`_ code points: :math:`\unicode{0A}`.
(All characters written literally are unambguously drawn from the `7-bit ASCII <http://webstore.ansi.org/RecordDetail.aspx?sku=INCITS+4-1986%5bR2012%5d>`_ subset of Unicode.)
Member

*unambiguously

Member Author

Done.

\Tkeyword ~|~ \TuN ~|~ \TsN ~|~ \TfN ~|~ \Tstring ~|~ \Tid ~|~
\text{(} ~|~ \text{)} ~|~ \Treserved \\
\production{keyword} & \Tkeyword &::=&
\mbox{(any terminal symbol in the grammar that is non of the above)} \\
Member

*none

Member Author

Done.

Values
------

The grammar produtions in this section define *lexical syntax*,
Member

*productions

Member Author

Done.

@lukewagner
Member

Nice work! lgtm with a few nits above and here:

  • lexical.html#characters : the Note seems a bit confusing to me given the preceding para just said all valid unicode code points are characters. I think what's being said is that, across the entire set of rules which define the text format, it is noted to be the case that, outside of comments and string literals, only a subset of 7-bit ASCII characters are used?
  • the abstract syntax of a floating constant is defined by values.html#floating-point to be a sequence of (IEEE754-interpreted) bytes; however, values.html#floating-point seems to produce reals with only a fuzzy note above that they get rounded. Could you instead have an explicit realBytes(...) function applied to the reals to explicitly produce the bytes?
  • types.html#table-types : perhaps add a Note that elemtype may be extended with other types in the future
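
One way to picture the suggested realBytes(...) function is the Python sketch below. The name `real_bytes`, the `width` parameter, and the little-endian byte order are all assumptions made for illustration; the spec's actual notation may differ. Because the input is already a Python float (a 64-bit double), a 32-bit result is rounded twice, which can differ from rounding the exact real once in rare cases.

```python
import struct


def real_bytes(x: float, width: int) -> bytes:
    """Round x to an IEEE 754 value of the given width and return its bytes.

    Assumption: bytes are produced in little-endian order.
    """
    if width == 32:
        return struct.pack("<f", x)  # rounds the double to single precision
    if width == 64:
        return struct.pack("<d", x)  # already a double; no further rounding
    raise ValueError("width must be 32 or 64")
```

For instance, `real_bytes(0.1, 32).hex()` is `cdcccc3d`, the single-precision value nearest to 0.1, which makes the rounding step explicit rather than leaving it to a note.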

@rossberg
Member Author

rossberg commented May 29, 2017 via email

@lukewagner
Member

@rossberg-chromium Ah, I see now; they all feed in to fN; I missed that before.

@rossberg
Member Author

rossberg commented Jun 1, 2017

Landing with above LGTM and no objections.

@rossberg rossberg merged commit 0a8fda1 into spec.exec.1 Jun 1, 2017
@rossberg rossberg deleted the spec.textual branch June 1, 2017 09:42
rossberg added a commit that referenced this pull request Jun 1, 2017
This change specifies the text format, based on earlier discussion between @binji, @lukewagner, @sunfishcode, and myself. It also adapts the interpreter to the changes listed below.

The changes relative to the .wast format previously implemented in the interpreter and other tools are the following.

Removals:

- some of the more baroque forms of sugar for `if`
- binary module bodies
- anything script related (assertions, invokes, etc)
- `infinity` as a secondary spelling for float `inf`

Additions:

- \u{...} escapes in strings (see below)
- more than just one inline export (that is, you can write (func $f (export "f1") (export "f2") ...), closing a gap in the syntax), and it combines with import
- the toplevel (module ...) is optional

Changes:

One breaking change makes the syntax forward compatible with some of the future extensions that have been discussed:

- non-empty block signatures must now be written (result i32), in order to generalise cleanly to function signatures

Unicode:

- the lexical syntax is defined in terms of Unicode characters (i.e., code points)
- comments and strings may contain mostly arbitrary Unicode, the rest stays within ASCII
- in strings, a Unicode character denotes its UTF-8 encoding
- in strings, Unicode characters can be given explicitly with \u{...} notation
- .wat files are assumed to be encoded in UTF-8

Misc Remarks:

- formatting characters: currently only the minimum set of formatting characters is allowed as white space (\t, \n, \r); we could include more, e.g. the whole set of ASCII "format effectors" (\b, \v, \f), but you quickly get into a lot of Unicode complexity if you want to go further than that

- Unicode in comments: similarly, in order to avoid getting into Unicode specifics, any legal code point is currently allowed in comments; should we be more restrictive?

- binary module bodies: they would seem pretty unusual for a "text" format, so are not included for now.

- abbreviations: to avoid combinatorial complexity in defining the AST to map on, most syntactic sugar is specified in the form of "abbreviations", simple rewritings into the core syntax

- inline function signatures: I tried to come up with a decent way to describe their rewriting into type indices (and the potential insertion of new type definitions) in terms of rules, but ultimately gave up; it's too cumbersome to express succinctly; so this is the one part that is left partially informal (though hopefully still unambiguous)

- formatting: many of the rules do not currently fit the page width; I left them as is for now, and plan to clean up layout issues once the spec is complete, probably tweaking some layout parameters as well

- tests: lots of stuff we could write tests for, e.g. regarding the Unicode support...
dhil pushed a commit to dhil/webassembly-spec that referenced this pull request Nov 13, 2023
The new limits were previously merged in WebAssembly#360, but apparently that PR was to
another branch that itself was never merged. Reland that change with two
additional fixes:

 - Add digit separators in the numbers to improve readability and match the
   formatting of other limits.

 - Add an explicit limit of 1,000,000 types per recursion group. This is not
   intended to be a functional change, but this limit was previously left
   implicit.
dhil pushed a commit to dhil/webassembly-spec that referenced this pull request Nov 13, 2023
7 participants