Skip to content
This repository has been archived by the owner on Dec 17, 2018. It is now read-only.

Make the parser conform to ICU MessageFormat #3

Merged
merged 1 commit into from
Jul 18, 2017

Conversation

nkovacs
Copy link
Contributor

@nkovacs nkovacs commented Jun 30, 2017

Apostrophes are now handled correctly, emulating
ICU's default DOUBLE_OPTIONAL behavior.

Octothorpe is handled correctly, and can be escaped using
apostrophes, but only inside plural
(depending on strictNumberSign).

A single apostrophe only starts quoted literal
text if it immediately precedes a curly brace ({}),
or, if inside a plural, an octothorpe (#).
The parser now supports the strictNumberSign option,
since that determines whether a quoted octothorpe
is parsed as '#' or just #.

Since choice format isn't supported, the pipe symbol
never causes an apostrophe to start quoted literal text.

Parameters to functions may contain whitespace and
quoted special characters, but argStyle is still trimmed
and split into multiple parameters.
A new option, strictFunctionParams, activates ICU-compatible
parsing, which parses everything from the second
comma to the closing curly brace as a single "argStyleText"
parameter.

Fixes #1, fixes #2

@jsf-clabot
Copy link

jsf-clabot commented Jun 30, 2017

CLA assistant check
All committers have signed the CLA.

Copy link
Member

@eemeli eemeli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for taking so long to get to this. Overall, I like these changes. However, it would be good if you could split the quote-escaping changes into a separate PR, as that's a change that'll require us to update the major version of the parser -- it will break some users' strings.

parser.pegjs Outdated

functionParamsStrict = _ ',' p:paramStrict { return p; }

functionParamsDefault = _ ',' _ p:paramDefault _ { return p.replace(/^[ \t\n\r]*|[ \t\n\r]*$/g, ''); }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not return p.trim()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because _ is just [ \t\n\r] and trim is [\s], which is [ \f\n\r\t\v\u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff], and I wanted to keep it the same.

Changing them both to \s would probably be fine.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yah, let's do that. The _ definition is an ancient remnant.

@nkovacs
Copy link
Contributor Author

nkovacs commented Jul 17, 2017

I've split the function parameter stuff into #4.
It should be backwards-compatible.

@nkovacs nkovacs force-pushed the icu-compatibility branch from d046478 to 2f59f48 Compare July 17, 2017 14:50
Apostrophes are now handled correctly, emulating
ICU's default DOUBLE_OPTIONAL behavior.

Octothorpe is handled correctly, and can be escaped using
apostrophes, but only inside plural
(depending on strictNumberSign).

A single apostrophe only starts quoted literal
text if it immediately precedes a curly brace ({}),
or, if inside a plural, an octothorpe (#).
The parser now supports the strictNumberSign option,
since that determines whether a quoted octothorpe
is parsed as `'#'` or just `#`.

Since choice format isn't supported, the pipe symbol
never causes an apostrophe to start quoted literal text.

Fixes messageformat#2
@nkovacs nkovacs force-pushed the icu-compatibility branch from 2f59f48 to ad6c0c2 Compare July 17, 2017 14:59
@nkovacs
Copy link
Contributor Author

nkovacs commented Jul 17, 2017

\s doesn't work in peg.js. It has to be done manually: nkovacs@a0be6aa

Also, the identifier syntax in the Java implementation is [^[[:Pattern_Syntax:][:Pattern_White_Space:]]]+, where Pattern_Syntax is this: http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B:Pattern_Syntax:%5D&abb=on&g= and Pattern_White_Space is http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3APattern_White_Space%3A%5D&abb=on&g=&i=
See http://icu-project.org/apiref/icu4j/com/ibm/icu/text/MessageFormat.html, http://icu-project.org/apiref/icu4j/com/ibm/icu/text/SelectFormat.html and http://icu-project.org/apiref/icu4j/com/ibm/icu/text/PluralFormat.html

The id definition in messageformat-parser is a bit inconsistent. The first character has to be ascii alphanumeric or $ or _, but then the rest of the characters can be almost anything.

@eemeli eemeli merged commit ad6c0c2 into messageformat:master Jul 18, 2017
eemeli added a commit that referenced this pull request Jul 18, 2017
Make the parser conform to ICU MessageFormat
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Handling of apostrophes is incorrect argStyleText is not supported
3 participants