Skip to content
Nicolas "Norswap" Laurent edited this page Nov 17, 2015 · 2 revisions

Every parsing expression or rule name in an Autumn grammar can be followed by a capture suffix. This suffix indicates that the expression needs to be treated specially in order to build the parse tree.

Parse Tree Construction

Autumn does not automatically generate a parse tree. You have to explicitly select which parsing expressions are going to create a node in the parse tree. This allows you to structure your grammar as you wish without having to worry about what the parse tree will look like. Removing uneeded parse tree nodes also improves efficiency and simplifies parse tree visitors.

The parser maintains a current parse tree node. Each created node is added as a child of the current node, then becomes the current node itself. Whenever the parsing expression associated with the node finishes parsing, the old node is restored as current.

Of course, if an expression fails, all the nodes created underneath it are discarded. The parser initializes the parse tree with a root node.

Much like the current node, the parser also maintain a "current name", which is the name that will be assigned to new nodes. Whenever a new node becomes current, the current name is discarded, and restored whenever the parsing expression associated to the node finishes parsing.

Similarly, the parser maintains a current set of textual tags to apply to new nodes. This set is discarded/restored at the same time as the current name.

Capture Suffix Components

Capture suffix are made by combining 4 possible components.

Colon (capture)

A colon indicates that a node for the parsing expression must be created in the parse tree. If the colon is immediately followed by a + sign, then the source text matching the expression will be copied in the node.

This example creates a node in the parse tree:

S = A X: B ;

This example creates a node capturing the source text matching X:

S = A X:+ B ; // idem + captures the source text matching X

Only a single colon component can appear in the capture suffix. If it appears, it must appear first.

Dash (name)

A dash (-) must be followed by an identifier. If combined with a colon component, it gives the created node a name. If appearing without a colon component, it sets the default name for nodes created within the scope of the annotated parsing expression, if those are not given an explicit name.

This example creates a node named "foo":

S = A X :- foo B ;

This example creates two nodes named "foo":

S = A X - foo B ;
X = C Y: Z: D ;

Only a single dash component can appear in the capture suffix.

Tilda (tag)

A tilda (~) must be followed by an identifier. If combined with a colon component, it adds a tag to the created node. If appearing without a colon components, it specifies that all nodes created within the scope of the annotated parsing expression will receive that tag.

A node can only have a single name, but it can have multiple tags.

This example creates a node with tag "bar":

S = A X :~ bar B ;

This examples creates a node with tags "bar" and "y" and a node with tags "bar" and "z":

S = A X ~ bar B ;
X = C Y :~ y Z :~ z D ;

Multiple tilda components may appear in the capture suffix, meaning that multiple tags will be added.

Hash (group)

A hash (#) must be followed by an identifier. If combined with a colon component, it indicates that the created node must be added to a logical group whose name is the identifier. If appearing without a colon component, it indicates that all nodes appearing within the scope of the annotated parsing expression will be part of such a group.

Currently, the way to make a node part of a group is to set its name to that of the group. This means that the hash and dash components are functionally identical. However, this is subject to change, so you shouldn't rely on it.

Hash and dash express a different intent. The dash should be used to name things that are unique (e.g. the body of a while loop), while the hash should be used to name things that are going to be repeated (e.g. a parameter in a function call).

The following example defines the syntax of a while loop. We use the dash to name the condition, which is unique, and a hash to group the statements, of which there can be many.

While = "while" Cond - cond "{" (Statement # statements)* "}"

Note that disjoint hashes might still be combined, for instance:

X = (Y :# ys Z | Y :# ys | Z)* ;

In this example, there will be a single group ys containing all the Y nodes despite the fact that two hashes appear, and that they appear within a repetition.

Just like dash components, there can only be a single hash component. Hash and dash components cannot be combined.

Capture Suffix After Rule Names

In addition to appearing after any parsing expression on the right hand side of a rule definition, capture suffixes can also annotate rule definition, by appearing after rule names on the left hand side:

X :~ foo = A B C ;

This is equivalent to applying the suffix to the whole expression:

X = (A B C) :~ foo ;

Dollar Notation

In components that require an identifier, you can use a dollar symbol ($) instead. The dollar can only appear in suffixes annotating a rule reference or rule definition.

Example of dollar in a suffix annotating a rule reference:

Tag = "<" Identifier:-$ ">"

This is equivalent to:

Tag = "<" Identifier :- Identifier ">"

Example of dollar in a suffix annotating a rule definition:

Addition ~$ = Expression "+" Expression ;

This is equivalent to:

Addition = (Expression "+" Expression) ~ Addition ;

Guidelines for Using Captures Effectively

The capture system is very flexible. However, because of the scoping mechanism, it's sometimes hard to keep track of which nodes can get which names and which tags. To help stay organized, here are a couple guidelines.

Names (introduced by dashes and hashes) are best thought of as accessors: handles used to retrieve a single sub-node (dash) or a group of them (hashes). As such, dashes and hashes should usually appear on the right hand side of rules: it should be up to the containing node to choose the accessors.

Tags, on the other hand, can be thought of as indicating the nature of a node. The fact that a node can have multiple tags can be used to simulate a class hierarchy.

Here is an example that applies these advices. It's an idiomatic way to write a very simple left-associative expression language in Autumn:

Expression   ~$ = Binary | FunCall | Literal;
Binary       ~$ = Addition | Subtraction ;
Addition    :~$ = Expression - left "+" Expression - right;
Subtraction :~$ = Expression - left "-" Expression - right ;
FunCall     :~$ = Identifier "(" (Expression # parameters) , "," ")" ;
Literal         = ... ;

For those not familiar with Autumn, X , "," means "a list of X separated by commas".

A few more comments about the example.

First, you can see that we put captures (the colons) on some rule definitions. We could have achieved the same effect by putting a capture on the Expression rule definition, but it's generally a good idea to put the captures as low in the hierarchy as possible. It ensures no unwanted captures will be introduced or desired captures lost as the grammar is being refactored.

If you think of captures + names as a sort of data structure (think C struct or Java class), this way of doing is also more intuitive.

Putting the captures on rule definitions is generally what you want to do. It makes it easier to find them and see what's going on. It also alleviates the need of adding a colon component each time you use the rule.

The names of identifiers after a hash are pluralized by convention, since they represent a list.

You should strive to use the dollar notation as much as possible. But of course sometimes it isn't specific enough. This is especially true for dash components where the name is often related to the function of the node within the rule rather than to its nature.

Of course, these are guideline, not rules set in stone, and so there are exceptions!