Skip to content
RY edited this page Jan 9, 2022 · 24 revisions

The Library main purpose is to provide an easy, straightforward and flexible way to map source text to a desired format with a variety of options and customizations available. Basic components which are involved in the workflow and their relationships can be demonstrated by the following diagram:

  • Grammar defines the primary source mapping units(terminals) and the set of rules(productions) describing the way they're supposed to be reduced to a target structure.
  • Each terminal is expressed by a regular expression and can be easily constructed from the library included primitives.
  • ParserBase defines an primary abstract interface for the parser which partially has to be implemented in the custom parser, while the rest is going to be generated automatically according to the grammar provided.
  • Custom parser defines a public interaction interface and also is responsible for source reduction logic.
  • Final parser is a dynamic CLR type, with both lexical and syntactic analyzers generated, which guarantees all the custom defined reducers are executed in the expected order.

Regular Expressions

Regular Expressions are constructed from nodes of the following types:

  • Text nodes to match a single position within a source which supports any combination of Unicode character codes, ranges and Unicode categories:
Rex.Char('r');
Rex.Char(@"a-z0-9");
Rex.Char(@"\u{0|10-ff|ccc|10000-10FFFF}");
Rex.Char(@"\p{Cc|Cf|Cn|Co|Cs|C|Ll|Lm|Lo|Lt|Lu|L|Mc|Me|Mn|M|Nd|Nl|No|N|Pc|Pd|Pe|Po|Ps|Pf|Pi|P|Sc|Sk|Sm|So|S|Zl|Zp|Zs|Z}");
Rex.Char(@"\a\b\t\r\n\v\f\\");

There is also a convenient way to configure a match for any character (or any set in general) excluding specific characters, ranges or categories:

Rex.Char(@"0-10ffff-[\p{L}]");
Rex.Char(@"0-10ffff-[a]");
Rex.Except(@"\p{L}");
Rex.Except('a');
  • Text nodes are later combined in the expression by OR, AND(concatenation) and REPEAT nodes:
Rex.Char('a').Then(Rex.Char('b')).Then(Rex.Char('b'));
Rex.Text("abb");
Rex.Or(Rex.Char('a'), Rex.Char('b')).NoneOrMore().Then("abb"); // (a|b)*abb
  • Special kind of a conditional expressions are also available. The idea is to define a sub-expression, which is supposed to be checked at some position in the pattern and an evaluation is transferred to the state which follows only if the sub-expression has succeed(positive lookahead) or failed(negative lookahead). Every time a conditional expression evaluation is completed current text position is reset to the state where it's started:
Rex.IfNot("-->").Then(Rex.AnyChar).NoneOrMore();
Rex.Char('a').NoneOrMore().FollowedBy('b');

Compiled Expression

An individual regular expression can be compiled in a delegate of int RexEvaluator(string content, int offset, int length) type. Accepting a text input and boundaries the dynamic method will find and return a length of the longest sub-string has been matched or -1 if none succeed:

var digit  = Rex.Char("0-9");
var number = digit.OneOrMore();
var eval   = Rex.Compile(number);
var input  = "x123y";

Console.WriteLine(eval(input, 0, input.Length));     // outputs -1
Console.WriteLine(eval(input, 1, input.Length - 1)); // outputs  3
Console.WriteLine(eval(input, 1, 2));                // outputs  2

Lexical Analysis

Multiple regular expressions(each expression is associated with an appropriate token ID) then are converted into a DFA by a lexical builder. The DFA is represented by lexical states and transitions between them. Each lexical state corresponds one or more nodes from the source expressions. When some state contains a regular expression accept node(added automatically) it's considered as a final state. If there are several acceptance nodes, the one with the lowest token ID is selected as final for the state. When a source text (represented by a sequence of UTF-16 code points) is processed the following workflow is applied to choose the next state:

images/lex-transitions.png

  • Each state contains ordered disjoint list of Unicode ranges.
  • If some range contains a pending char code then Unicode category for the char code is considered.
  • If there is a mapping between some category and the char code then the transition which is defined by this category is selected.
  • If no category is matched then the default transition for the range is selected.
  • If no range contains the char code then the default categories are tested.
  • If neither range nor default category corresponds to the char then the default state transition is selected.

Surrogates

Lexical analyzer provides natural support for the code points from the extended Unicode range(0x10000-0x10FFFF). To achieve this a supplementary surrogate state is generated when necessary. Then all codes which does not fit UTF-16 range are moved to this "extra" state. Finally, the surrogate state is connected to the root by introducing a special 'high-surrogate'(0xD800--0xDBFF) range transition.

Lookaheads

To support conditional regular expression nodes a lexical state is extended to have a tree-like structure. Each state contains False and True child state nodes. When both are null means we're dealing with a regular lexical state. But when any or both are set identifies a lookahead initial state case. If so, the lookahead state is processed individually and if a final state is encountered the current position is reset and True-state is considered as a current state. Otherwise False-state is chosen.

Clone this wiki locally