[spec] Text format #471
On Unicode comments: are U+2028 and U+2029 allowed?
@jfbastien, yes. Currently any code point is allowed, with no special interpretation. The only ones with specific meaning are ; ( ) and \n. (In particular, U+2028/9 would not end a line comment, but neither would ASCII \r, \f, or \v.)
Gotcha, that makes sense. I'm not sure we want to restrict more than what you have; silly Unicode-isms are silly. Were we to try to fool-proof things, there are other semicolons such as U+FF1B ； and friends ؛፤⁏⍮⸵︔﹔. I would just rather ask so we all think about JSONP and shed a Unicode tear 💧 for it.
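To make the line-comment rule discussed above concrete, here is a minimal Python sketch. It is an illustration only, not the interpreter's lexer: the function name `strip_line_comments` is hypothetical, and the sketch deliberately ignores string literals and `(; ... ;)` block comments. It only shows that a `;;` comment runs to the next \n (or end of input) and that U+2028, U+2029, \r, \f, and \v do not terminate it.

```python
def strip_line_comments(source: str) -> str:
    """Drop `;;` line comments from `source`.

    Simplifying assumption: string literals and block comments are not
    recognised, so a `;;` inside a string would be treated as a comment.
    A comment ends only at U+000A (which is kept) or at end of input.
    """
    out = []
    i = 0
    while i < len(source):
        if source.startswith(";;", i):
            end = source.find("\n", i)
            if end == -1:
                break          # comment runs to end of input
            i = end            # resume at the newline itself
        else:
            out.append(source[i])
            i += 1
    return "".join(out)


# U+2028 and \r sit inside the comment and do not end it.
print(repr(strip_line_comments("a ;; x\u2028y\rz\nb")))
```

Under these assumptions the comment in the example swallows everything up to the \n, including the U+2028 and the \r.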
Concerning module names and WebAssembly/design#1055, the module name in the name section has no semantic effect, and so seems like it could differ from the module name used for linking. I'd suggest this is an interesting enough corner case that it's worth testing, which would mean we'd want distinct syntax for the name-section module name. Are the baroque forms of sugar mentioned above being removed from the .wast format too? Similarly, are the various Additions and Changes being made to the .wast format too? What is the spec.exec.1 branch? This PR contains some changes that look like changes already in master, so it's not clear what the specific changes are here.
@sunfishcode, yes, I plan to make the changes to the interpreter, since .wast should remain a strict superset of .wat that differs only in terms of the additional script constructs. Branch spec.exec.1 corresponds to PR #467 -- I used that as a baseline here, since there were too many dependent changes.
One item I forgot initially: this also doesn't include the binary module body syntax that we use for some tests. Should it?
Can the proposed format generate all interesting invalid inputs without this? Additionally, can the proposed format generate equivalent modules (say, non-canonical LEBs)?
@jfbastien, no, without the direct binary notation something like LEB isn't even a thing in the text format. I guess the question we need to answer is whether it is the purpose of the official text format (as opposed to .wast) to enable expressing such tests.
Yes, I think that's the right question to ask. I think we can move forward with your proposal, without this addition, and then add it later if we decide we need it. Correct?
@jfbastien, correct.
I included the changes necessary to both interpreter and test suite to adjust to the listed grammar modifications. Also a few fixes to the spec, notably including #478 and clarification of the ability to combine inline import/export sugar.
Anybody opposed to landing this? Anybody willing to review the PR? :)
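For readers unfamiliar with the "non-canonical LEBs" point in the exchange above, the following Python sketch shows what is meant, as far as I read it: the same integer can be written as unsigned LEB128 in more than one byte sequence, and a text format that only writes numbers cannot choose between them. The helper names and the `pad_to` parameter are hypothetical, introduced only for this illustration.

```python
def encode_uleb128(value: int, pad_to: int = 0) -> bytes:
    """Encode a non-negative integer as unsigned LEB128.

    `pad_to` (hypothetical knob) forces at least that many bytes, which
    yields an over-long, non-canonical encoding of the same value.
    """
    if value < 0:
        raise ValueError("unsigned LEB128 needs a non-negative value")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        more = value != 0 or len(out) + 1 < pad_to
        out.append(byte | (0x80 if more else 0))
        if not more:
            return bytes(out)


def decode_uleb128(data: bytes) -> int:
    """Decode the unsigned LEB128 value at the start of `data`."""
    result = 0
    shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return result
        shift += 7
    raise ValueError("truncated LEB128")


canonical = encode_uleb128(3)             # one byte
padded = encode_uleb128(3, pad_to=2)      # two bytes, same value
print(canonical.hex(), padded.hex())
print(decode_uleb128(canonical) == decode_uleb128(padded))
```

Both byte sequences decode to 3, so two binaries differing only in this padding describe the same module. That distinction only exists at the binary level.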
document/text/conventions.rst
Outdated

* Terminal symbols are either literal strings of characters enclosed in quotes: :math:`\text{module}`;
or expressed as `Unicode <http://www.unicode.org/versions/latest/>`_ code points: :math:`\unicode{0A}`.
(All characters written literally are unambguously drawn from the `7-bit ASCII <http://webstore.ansi.org/RecordDetail.aspx?sku=INCITS+4-1986%5bR2012%5d>`_ subset of Unicode.)
*unambiguously
Done.
document/text/lexical.rst
Outdated
\Tkeyword ~|~ \TuN ~|~ \TsN ~|~ \TfN ~|~ \Tstring ~|~ \Tid ~|~
\text{(} ~|~ \text{)} ~|~ \Treserved \\
\production{keyword} & \Tkeyword &::=&
\mbox{(any terminal symbol in the grammar that is non of the above)} \\
*none
Done.
document/text/values.rst
Outdated
Values
------

The grammar produtions in this section define *lexical syntax*,
*productions
Done.
Nice work! lgtm with a few nits above and here:
On 25 May 2017 at 05:38, Luke Wagner ***@***.***> wrote:
> Nice work! lgtm with a few nits above and here:
> - lexical.html#characters : the Note seems a bit confusing to me given
>   the preceding para just said all valid unicode code points are characters.
>   I *think* what's being said is that, across the entire set of rules
>   which define the text format, it is noted to be the case that, outside of
>   comments and string literals, only a subset of 7-bit ASCII characters are
>   used?
Yes, that was the intention. Reworded.
> - the abstract syntax of a floating constant is defined by
>   values.html#floating-point to be a sequence of (IEEE754-interpreted) bytes;
>   however, values.html#floating-point seems to produce reals with only a fuzzy
>   note above that they get rounded; could you instead have an explicit
>   realBytes(...) function applied to the reals to explicitly produce the
>   bytes?
Right, that's the purpose of the `ieee` meta functions used in the
attribute expressions. I have yet to define those, though (see the todo),
which I plan to do when I get to the numeric ops.
> - types.html#table-types : perhaps add a Note that elemtype may be
>   extended with other types in the future
Added.
@rossberg-chromium Ah, I see now; they all feed in to
Landing with above LGTM and no objections.
This change specifies the text format, based on earlier discussion between @binji, @lukewagner, @sunfishcode, and myself. It also adapts the interpreter to the changes listed below.
The changes relative to the .wast format previously implemented in the interpreter and other tools are the following.
Removals:
- some of the more baroque forms of sugar for `if`
- binary module bodies
- anything script related (assertions, invokes, etc)
- `infinity` as a secondary spelling for float `inf`
Additions:
- \u{...} escapes in strings (see below)
- more than just one inline export (that is, you can write (func $f (export "f1") (export "f2") ...), closing a gap in the syntax), and it combines with import
- the toplevel (module ...) is optional
Changes:
One breaking change makes the syntax forward compatible with some of the future extensions that have been discussed:
- non-empty block signatures must now be written (result i32), in order to generalise cleanly to function signatures
Unicode:
- the lexical syntax is defined in terms of Unicode characters (i.e., code points)
- comments and strings may contain mostly arbitrary Unicode, the rest stays within ASCII
- in strings, a Unicode character denotes its UTF-8 encoding
- in strings, Unicode characters can be given explicitly with \u{...} notation
- .wat files are assumed to be encoded in UTF-8
Misc Remarks:
- formatting characters: currently only the minimum set of formatting characters are allowed as white space (\t, \n, \r); we could include more, e.g. the whole set of ASCII "format effectors" (\b, \v, \f), but you quickly get into a lot of Unicode complexity if you want to go further than that
- Unicode in comments: similarly, in order to avoid getting into Unicode specifics, any legal code point is currently allowed in comments; should we be more restrictive?
- binary module bodies: they would seem pretty unusual for a "text" format, so are not included for now.
- abbreviations: to avoid combinatorial complexity in defining the AST to map on, most syntactic sugar is specified in the form of "abbreviations", simple rewritings into the core syntax
- inline function signatures: I tried to come up with a decent way to describe their rewriting into type indices (and the potential insertion of new type definitions) in terms of rules, but ultimately gave up; it's too cumbersome to express succinctly; so this is the one part that is left partially informal (though hopefully still unambiguous)
- formatting: many of the rules do not currently fit the page width; I left them as is for now, and plan to clean up layout issues once the spec is complete, probably tweaking some layout parameters as well
- tests: lots of stuff we could write tests for, e.g. regarding the Unicode support...
The new limits were previously merged in WebAssembly#360, but apparently that PR was to another branch that itself was never merged. Reland that change with two additional fixes: - Add digit separators in the numbers to improve readability and match the formatting of other limits. - Add an explicit limit of 1,000,000 types per recursion group. This is not intended to be a functional change, but this limit was previously left implicit.
These were accidentally deleted in WebAssembly#471.
This PR specifies the text format, based on earlier discussion between @binji, @lukewagner, @sunfishcode, and myself. Preview at http://webassembly.github.io/spec/
The changes relative to the .wast format currently implemented in the interpreter and other tools are the following.
Removals:
- some of the more baroque forms of sugar for `if`
- `infinity` as a secondary spelling for float `inf`
Additions:
Changes:
I took the liberty to propose one breaking change to make the syntax forward compatible with some of the future extensions that have been discussed:
Unicode:
The PR includes changes to the interpreter implementing the Unicode support (but not yet the other changes).
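As a rough illustration of the string rules stated for this change (a Unicode character in a string denotes its UTF-8 encoding, and \u{...} names a character explicitly), here is a small Python sketch. It is not the interpreter's implementation: `string_literal_bytes` is a hypothetical helper, it handles only \u{...} and leaves every other escape untouched, and the rejection of surrogates and out-of-range values is my assumption about what a valid code point means here.

```python
import re

# Matches \u{...} with one or more hex digits inside the braces.
_UNICODE_ESCAPE = re.compile(r"\\u\{([0-9a-fA-F]+)\}")


def string_literal_bytes(body: str) -> bytes:
    """Return the bytes denoted by the body of a string literal.

    Only \\u{...} escapes are interpreted in this sketch; all other
    characters, including other backslash escapes, are passed through
    and encoded as UTF-8.
    """
    def replace(match: re.Match) -> str:
        code_point = int(match.group(1), 16)
        # Assumption: only Unicode scalar values are accepted.
        if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
            raise ValueError(f"invalid code point: {match.group(0)}")
        return chr(code_point)

    return _UNICODE_ESCAPE.sub(replace, body).encode("utf-8")


# The escaped and the literal spelling denote the same bytes.
print(string_literal_bytes("caf\\u{e9}") == string_literal_bytes("café"))
```

The point of the example is that the escaped and literal spellings are interchangeable at the byte level, since both end up as the UTF-8 encoding of the same code point.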
Misc Remarks:
- formatting characters: currently only the minimum set of formatting characters are allowed as white space (\t, \n, \r); we could include more, e.g. the whole set of ASCII "format effectors" (\b, \v, \f), but you quickly get into a lot of Unicode complexity if you want to go further than that
- Unicode in comments: similarly, in order to avoid getting into Unicode specifics, any legal code point is currently allowed in comments; should we be more restrictive?
- binary module bodies: they would seem pretty unusual for a "text" format, so are not included. Is there a reason to keep them?
- abbreviations: to avoid combinatorial complexity in defining the AST to map on, most syntactic sugar is specified in the form of "abbreviations", simple rewritings into the core syntax
- inline function signatures: I tried to come up with a decent way to describe their rewriting into type indices (and the potential insertion of new type definitions) in terms of rules, but ultimately gave up; it's too cumbersome to express succinctly; so this is the one part that is left partially informal (though hopefully still unambiguous)
- formatting: many of the rules do not currently fit the page width; I left them as is for now, and plan to clean up layout issues once the spec is complete, probably tweaking some layout parameters as well
- PDF: currently doesn't build; due to various limitations of MathJax (such as the inability to use packages) I had to resort to some hacks for special characters that apparently don't work in proper LaTeX; will fix later
- tests: lots of stuff we could write tests for, e.g. regarding the Unicode support...