First, we need to have a grammar, such as Math.g4
,
where we have a lexer tokens and parser rules defined. It may be necessary to use two types of labels
for certain rules; rule labels (=
) and alternative labels (#
). Let's
take a look at a rule from Math.g4
as an example:
expr
: lparen=LPAREN expr rparen=RPAREN # parensExpr
| op=( OP_ADD | OP_SUB ) expr # unaryExpr
| left=expr op=( OP_MUL | OP_DIV ) right=expr # infixExpr
| left=expr op=( OP_ADD | OP_SUB ) right=expr # infixExpr
| func=ID lparen=LPAREN expr (COMMA expr)* rparen=RPAREN # funcExpr
| value=NUM # numberExpr
| constant # constExpr
| expr fact=OP_FACT # factorialExpr
;
Note the direct left recursion that the expr
uses, this keeps this rule compact and tidy.
Each alternative of expr
has a #
label, such as # numberExpr
; this tells ANTLR to generate a
separate visit method for this alternative. Notice the rule label for NUM
in numberExpr
:
value=NUM
. Later, when we build our AST, this will allow us to easily access this token via the
rule context.
Next, we need to tell ANTLR that we want to have it generate a visitor in pom.xml
:
<plugin>
<groupId>org.antlr</groupId>
<artifactId>antlr4-maven-plugin</artifactId>
<version>4.9.1</version>
<configuration>
<visitor>true</visitor>
<listener>false</listener>
<inputEncoding>UTF-8</inputEncoding>
<sourceDirectory>
src/main/resources/parsevamath/tools/grammar
</sourceDirectory>
</configuration>
<executions>
<execution>
<goals>
<goal>antlr4</goal>
</goals>
<phase>generate-resources</phase>
</execution>
</executions>
</plugin>
Then, we generate the visitor and parser:
mvn antlr4:antlr4
In the case of parseva-math, ANTLR generates two visitors, found in
target/generated-sources
:
├── assets
├── config
├── docs
├── LICENSE
├── pom.xml
├── README.md
├── src
└── target
├── checkstyle-cachefile
├── checkstyle-checker.xml
├── checkstyle-result.xml
├── classes
├── generated-sources
│ ├── annotations
│ └── antlr4
│ ├── MathBaseVisitor.java
│ ├── ...
│ └── MathVisitor.java
├── maven-status
...
We are really only interested in MathBaseVisitor.java
. This java file generated by
ANTLR is comprised of one visit method per parser rule or alternative rule label, and we can override
these methods to build out our homogeneous AST.
An example of a method from MathBaseVisitor:
@Override
public T visitNumberExpr(MathParser.NumberExprContext ctx) { return visitChildren(ctx); }
As you can see, unless this method is overridden, it will simply return the result of visiting the children of this rule/ label. This will work to our advantage; we will not need to override ALL of these methods, since we usually do not want to create nodes at most non-terminal rules, and our visitor can simply pass through them.
A homogeneous abstract syntax tree is a syntax tree that consists of only one node type; we assign each node with a type, and other information such as value, text, etc. upon instantiation. We can also assign children nodes and set the parent node as we visit each rule context.
In parseva-math, our homogeneous node is called MathAstNode.
As our visitor passes through each rule context expression, we build MathAstNodes as we go. Here is an example of a terminal node construction:
@Override
public MathAstNode visitNumberExpr(MathParser.NumberExprContext ctx) {
final MathAstNode astNode = new MathAstNodeImpl();
astNode.setText(ctx.getText());
astNode.setTokenType(TokenTypes.NUM);
return astNode;
}
Here is an example of a non-terminal node construction:
@Override
public MathAstNode visitInfixExpr(MathParser.InfixExprContext ctx) {
final Token token = ctx.op;
final MathAstNode astNode = new MathAstNodeImpl();
astNode.setText(ctx.op.getText());
final int tokenType = switch (token.getType()) {
case MathLexer.OP_ADD -> TokenTypes.OP_ADD;
case MathLexer.OP_SUB -> TokenTypes.OP_SUB;
case MathLexer.OP_MUL -> TokenTypes.OP_MUL;
case MathLexer.OP_DIV -> TokenTypes.OP_DIV;
default -> throw new IllegalStateException(UNEXPECTED_TOKEN
+ TokenUtil.getTokenName(token.getType()));
};
astNode.setTokenType(tokenType);
astNode.addChild(visit(ctx.left));
astNode.addChild(visit(ctx.right));
astNode.getChildren()
.forEach(child -> child.setParent(astNode));
return astNode;
}
Notice that we can access parts of our alternatives through our rule labels, such
as ctx.left
and ctx.op
for the left operand and operator(respectively) of our infix
expression above. Also note that we simply call visit()
on any child rule contexts
of interest in non-terminal nodes.
The following expression:
2 + 2
Will create a homogeneous abstract syntax tree like this:
'- OP_ADD -> +
|- NUM -> 2
'- NUM -> 2
We can see that this abstract syntax tree has three nodes:
- An
OP_ADD
node, representing the addition operation token, a non-terminal node. - A
NUM
node, which is the left child of theOP_ADD
token, a terminal node. - Another
NUM
node, which is the right child of theOP_ADD
token, also a terminal node.
Internally, all nodes are of type MathAstNode, but all nodes have details, as listed above. The AST can be traversed in any direction, since we set the parents and children upon the instantiation of each node.
We can also build much more complicated trees.
For instance:
2 + sin(1.24) + sqrt(5) + pow(4,4) * 22 + sqrt(9)
Becomes:
'- OP_ADD -> +
|- OP_ADD -> +
| |- OP_ADD -> +
| | |- OP_ADD -> +
| | | |- NUM -> 2
| | | '- FUNCTION -> sin
| | | |- LPAREN -> (
| | | |- NUM -> 1.24
| | | '- RPAREN -> )
| | '- FUNCTION -> sqrt
| | |- LPAREN -> (
| | |- NUM -> 5
| | '- RPAREN -> )
| '- OP_MUL -> *
| |- FUNCTION -> pow
| | |- LPAREN -> (
| | |- NUM -> 4
| | |- COMMA -> ,
| | |- NUM -> 4
| | '- RPAREN -> )
| '- NUM -> 22
'- FUNCTION -> sqrt
|- LPAREN -> (
|- NUM -> 9
'- RPAREN -> )
Since we have written our grammar and homogeneous node, generated our visitor, and overridden any visit methods that we need to, we can now take a look at how to actually build the AST within Main.java:
public static MathAstNode buildMathAstNodeTree(String exprInput) {
// Get expression input, create character stream
final CharStream codePointCharStream = CharStreams.fromString(exprInput);
// Tokenize character stream
final MathLexer lexer = new MathLexer(codePointCharStream);
final CommonTokenStream tokenStream = new CommonTokenStream(lexer);
// Parse token stream to create rule contexts
final MathParser parser = new MathParser(tokenStream);
final MathParser.CompilationUnitContext compilationUnit = parser.compilationUnit();
// Call visit on rule contexts and build our AST
return new HomogeneousAstVisitor().visit(compilationUnit);
}
In pursuit of the best way to build a homogeneous AST with ANTLR4, I tried all of the following approaches:
- First build a heterogeneous AST, then transform it into a homogeneous AST. If there was a need for the heterogeneous tree too, I see no issue with this approach, but for parseva-math, then just seemed like an extra step.
- Use actions,
@init
and@after
blocks to keepvisit..context()
overridden methods to a minimum. This made for a very messy grammar. - The listener pattern; this was great if you do not need to set parent nodes, etc. However, for our homogeneous tree, we need to be able to traverse in any direction and track parent nodes.
I have found the above procedure to be the cleanest and most easily maintained method for creating a homogeneous AST
with ANTLR4. It is good to keep a context and action-free grammar, and handle all java code separately. The ability
to override the visit...context()
methods as necessary to maintain a simple grammar helps keep changes atomic, and
allows us to build the AST with no limitation. We can easily rearrange elements, create imaginary tokens, and traverse
the parsed rule contexts with our visitor.