Homogeneous Abstract Syntax Tree

Visitor Interface Generation

First, we need to have a grammar, such as Math.g4, where we have a lexer tokens and parser rules defined. It may be necessary to use two types of labels for certain rules; rule labels (=) and alternative labels (#). Let's take a look at a rule from Math.g4 as an example:

expr
    :   lparen=LPAREN expr rparen=RPAREN                         # parensExpr
    |   op=( OP_ADD | OP_SUB ) expr                              # unaryExpr
    |   left=expr op=( OP_MUL | OP_DIV ) right=expr              # infixExpr
    |   left=expr op=( OP_ADD | OP_SUB ) right=expr              # infixExpr
    |   func=ID lparen=LPAREN expr (COMMA expr)* rparen=RPAREN   # funcExpr
    |   value=NUM                                                # numberExpr
    |   constant                                                 # constExpr
    |   expr fact=OP_FACT                                        # factorialExpr
    ;

Note the direct left recursion that the expr uses, this keeps this rule compact and tidy.

Each alternative of expr has a # label, such as # numberExpr; this tells ANTLR to generate a separate visit method for this alternative. Notice the rule label for NUM in numberExpr: value=NUM. Later, when we build our AST, this will allow us to easily access this token via the rule context.

Next, we need to tell ANTLR that we want to have it generate a visitor in pom.xml:

<plugin>
    <groupId>org.antlr</groupId>
    <artifactId>antlr4-maven-plugin</artifactId>
    <version>4.9.1</version>
    <configuration>
        <visitor>true</visitor>
        <listener>false</listener>
        <inputEncoding>UTF-8</inputEncoding>
        <sourceDirectory>
            src/main/resources/parsevamath/tools/grammar
        </sourceDirectory>
    </configuration>
    <executions>
        <execution>
            <goals>
                <goal>antlr4</goal>
            </goals>
            <phase>generate-resources</phase>
        </execution>
    </executions>
</plugin>

Then, we generate the visitor and parser:

 mvn antlr4:antlr4

In the case of parseva-math, ANTLR generates two visitors, found in target/generated-sources:

├── assets
├── config
├── docs
├── LICENSE
├── pom.xml
├── README.md
├── src
└── target
    ├── checkstyle-cachefile
    ├── checkstyle-checker.xml
    ├── checkstyle-result.xml
    ├── classes
    ├── generated-sources
    │   ├── annotations
    │   └── antlr4
    │       ├── MathBaseVisitor.java
    │       ├── ...
    │       └── MathVisitor.java
    ├── maven-status
    ...

We are really only interested in MathBaseVisitor.java. This java file generated by ANTLR is comprised of one visit method per parser rule or alternative rule label, and we can override these methods to build out our homogeneous AST.

An example of a method from MathBaseVisitor:

@Override
public T visitNumberExpr(MathParser.NumberExprContext ctx) { return visitChildren(ctx); }

As you can see, unless this method is overridden, it will simply return the result of visiting the children of this rule/ label. This will work to our advantage; we will not need to override ALL of these methods, since we usually do not want to create nodes at most non-terminal rules, and our visitor can simply pass through them.

Homogeneous Node

A homogeneous abstract syntax tree is a syntax tree that consists of only one node type; we assign each node with a type, and other information such as value, text, etc. upon instantiation. We can also assign children nodes and set the parent node as we visit each rule context.

In parseva-math, our homogeneous node is called MathAstNode.

Building the Abstract Syntax Tree

As our visitor passes through each rule context expression, we build MathAstNodes as we go. Here is an example of a terminal node construction:

    @Override
    public MathAstNode visitNumberExpr(MathParser.NumberExprContext ctx) {
        final MathAstNode astNode = new MathAstNodeImpl();
        astNode.setText(ctx.getText());
        astNode.setTokenType(TokenTypes.NUM);
        return astNode;
    }

Here is an example of a non-terminal node construction:

    @Override
    public MathAstNode visitInfixExpr(MathParser.InfixExprContext ctx) {
        final Token token = ctx.op;
        final MathAstNode astNode = new MathAstNodeImpl();
        astNode.setText(ctx.op.getText());
        final int tokenType = switch (token.getType()) {
            case MathLexer.OP_ADD -> TokenTypes.OP_ADD;
            case MathLexer.OP_SUB -> TokenTypes.OP_SUB;
            case MathLexer.OP_MUL -> TokenTypes.OP_MUL;
            case MathLexer.OP_DIV -> TokenTypes.OP_DIV;
            default -> throw new IllegalStateException(UNEXPECTED_TOKEN
                + TokenUtil.getTokenName(token.getType()));
        };
        astNode.setTokenType(tokenType);
        astNode.addChild(visit(ctx.left));
        astNode.addChild(visit(ctx.right));

        astNode.getChildren()
            .forEach(child -> child.setParent(astNode));
        return astNode;
    }

Notice that we can access parts of our alternatives through our rule labels, such as ctx.left and ctx.op for the left operand and operator(respectively) of our infix expression above. Also note that we simply call visit() on any child rule contexts of interest in non-terminal nodes.

Abstract Syntax Tree Example

The following expression:

2 + 2

Will create a homogeneous abstract syntax tree like this:

    '- OP_ADD -> +
       |- NUM -> 2
       '- NUM -> 2

We can see that this abstract syntax tree has three nodes:

An OP_ADD node, representing the addition operation token, a non-terminal node.
A NUM node, which is the left child of the OP_ADD token, a terminal node.
Another NUM node, which is the right child of the OP_ADD token, also a terminal node.

Internally, all nodes are of type MathAstNode, but all nodes have details, as listed above. The AST can be traversed in any direction, since we set the parents and children upon the instantiation of each node.

We can also build much more complicated trees.

For instance:

2 + sin(1.24) + sqrt(5) + pow(4,4) * 22 + sqrt(9)

Becomes:

         '- OP_ADD -> +
               |- OP_ADD -> +
               |  |- OP_ADD -> +
               |  |  |- OP_ADD -> +
               |  |  |  |- NUM -> 2
               |  |  |  '- FUNCTION -> sin
               |  |  |     |- LPAREN -> (
               |  |  |     |- NUM -> 1.24
               |  |  |     '- RPAREN -> )
               |  |  '- FUNCTION -> sqrt
               |  |     |- LPAREN -> (
               |  |     |- NUM -> 5
               |  |     '- RPAREN -> )
               |  '- OP_MUL -> *
               |     |- FUNCTION -> pow
               |     |  |- LPAREN -> (
               |     |  |- NUM -> 4
               |     |  |- COMMA -> ,
               |     |  |- NUM -> 4
               |     |  '- RPAREN -> )
               |     '- NUM -> 22
               '- FUNCTION -> sqrt
                  |- LPAREN -> (
                  |- NUM -> 9
                  '- RPAREN -> )

Putting It All Together

Since we have written our grammar and homogeneous node, generated our visitor, and overridden any visit methods that we need to, we can now take a look at how to actually build the AST within Main.java:

    public static MathAstNode buildMathAstNodeTree(String exprInput) {
        // Get expression input, create character stream
        final CharStream codePointCharStream = CharStreams.fromString(exprInput);
        
        // Tokenize character stream
        final MathLexer lexer = new MathLexer(codePointCharStream);
        final CommonTokenStream tokenStream = new CommonTokenStream(lexer);
        
        // Parse token stream to create rule contexts
        final MathParser parser = new MathParser(tokenStream);
        final MathParser.CompilationUnitContext compilationUnit = parser.compilationUnit();
        
        // Call visit on rule contexts and build our AST
        return new HomogeneousAstVisitor().visit(compilationUnit);
    }

Thoughts

In pursuit of the best way to build a homogeneous AST with ANTLR4, I tried all of the following approaches:

First build a heterogeneous AST, then transform it into a homogeneous AST. If there was a need for the heterogeneous tree too, I see no issue with this approach, but for parseva-math, then just seemed like an extra step.
Use actions, @init and @after blocks to keep visit..context() overridden methods to a minimum. This made for a very messy grammar.
The listener pattern; this was great if you do not need to set parent nodes, etc. However, for our homogeneous tree, we need to be able to traverse in any direction and track parent nodes.

I have found the above procedure to be the cleanest and most easily maintained method for creating a homogeneous AST with ANTLR4. It is good to keep a context and action-free grammar, and handle all java code separately. The ability to override the visit...context() methods as necessary to maintain a simple grammar helps keep changes atomic, and allows us to build the AST with no limitation. We can easily rearrange elements, create imaginary tokens, and traverse the parsed rule contexts with our visitor.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HOMOGENEOUS_AST.md

HOMOGENEOUS_AST.md

Homogeneous Abstract Syntax Tree

Visitor Interface Generation

Homogeneous Node

Building the Abstract Syntax Tree

Abstract Syntax Tree Example

Putting It All Together

Thoughts

Files

HOMOGENEOUS_AST.md

Latest commit

History

HOMOGENEOUS_AST.md

File metadata and controls

Homogeneous Abstract Syntax Tree

Visitor Interface Generation

Homogeneous Node

Building the Abstract Syntax Tree

Abstract Syntax Tree Example

Putting It All Together

Thoughts