This is a working demonstration project / proof of concept that shows how you can create your own scripting language parser/interpreter using the Antlr4 tool, and how you can evaluate simple code fragments embedded in document content when the webpage is rendered.
So that you don't have to find out everything for yourself, I will in this README explain parts of the what, the how and sometimes even the why. In addition I have written a blog entry that explains things in a more general manner so that you can see how the parts work together. See here:
https://www.sentiatechblog.com/embedded-script-in-bloomreach-content
As usual with Bloomreach projects, you can get it up and running with the cargo run:

```
mvn clean package
mvn -P cargo.run
```
Point your browser to http://localhost:8080/site for the website and to http://localhost:8080/cms for the CMS (login is admin/admin).
The homepage contains a document that uses embedded scripting.
Here are some resources regarding antlr4:
https://tomassetti.me/antlr-mega-tutorial/
https://github.com/antlr/grammars-v4
https://github.com/antlr/antlr4/tree/4.6/doc
https://alexecollins.com/antlr4-and-maven-tutorial/
The actual definition of a language syntax usually comes in two parts:
- the lexer rules define which combinations of characters when taken together as a unit have some meaning in the language (usually such a construct is called a token). In this way you define keywords, constant values, variable names, operators and so on.
- the grammar rules define which combinations of tokens are valid. In this way you define for instance what a statement is, what a block of statements is, how parentheses are used, in what order operators, constants and variable names may appear, and so on.
Note that neither lexical nor grammatical rules say what a particular combination of characters is supposed to mean, they just tell what constitutes a lexically and syntactically valid sequence of characters. To have the language actually do something, the creator of the language must add meaning and behaviour to the various combinations of tokens that are valid according to the grammar rules.
To give an example, lexical rules may define that 'a' is an identifier, '2' is a constant and '+' is the plus operator. A grammar rule may determine that "a+2" satisfies the syntax rule for expression. When this fragment of code is evaluated, at some point a function (likely called expression) is called with as arguments the values for the plus operator, the identifier 'a' and the constant '2'.
It is you, the creator of the language, who has to intervene in the execution of the expression() callback function. You must extract the value of 'a' from somewhere, convert the constant '2' to another value, add these values together because of the plus operator and then somehow return the resulting value.
With antlr4, you write the lexer and grammar rules into two files in a format that the tool understands and run the antlr4 generator. This outputs a set of Java source files that, together with the antlr4 runtime library, form a parser for the language, plus supporting classes that let you do something with the output of the parser.
When the parser parses some input code the result is a parse tree. When you subsequently walk this tree with the BaseListener or BaseVisitor that antlr4 also generated, the various methods in these base classes are called for the various nodes in the tree. The information about where in the tree we find ourselves is contained in a context object.
Your job as creator of the language, apart from writing the lexer and grammar for the generating process, is to extend one of the generated base classes and fill in what should happen for the various nodes in the parse tree. You extract information from the context provided by the tree walker, and together with the knowledge that you are in a certain node type (because of the callback method you are overriding) you have all you need to implement meaningful behaviour for the rule this node represents, such as finding values for 'a' and '2' (from the context) and adding these (you are in the 'plus' callback method).
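To make this concrete, here is a rough sketch of what such a callback could look like. All the names in it (ToyParser, the 'plus' and 'identifier' labels, the `expression()` accessors) are made up for illustration; the real names depend on your grammar and the labels you use in it.

```java
import java.util.Map;

// Hypothetical visitor for a toy grammar with labelled rules such as:
//   expression : expression PLUS expression   # plus
//              | IDENTIFIER                   # identifier
public class ToyVisitor extends ToyParserBaseVisitor<Number> {

    private final Map<String, Number> variables;

    public ToyVisitor(Map<String, Number> variables) {
        this.variables = variables;
    }

    // Called for every node that matched the 'plus' rule: evaluate both
    // sub-expressions recursively, then give '+' its meaning by adding them.
    @Override
    public Number visitPlus(ToyParser.PlusContext ctx) {
        return visit(ctx.expression(0)).doubleValue()
             + visit(ctx.expression(1)).doubleValue();
    }

    // Called for every node that matched the 'identifier' rule: this is
    // where the value of 'a' gets extracted "from somewhere".
    @Override
    public Number visitIdentifier(ToyParser.IdentifierContext ctx) {
        return variables.get(ctx.getText());
    }
}
```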
You can find the source code for the lexer here:
calc/src/main/antlr4/nl/dimario/numbercalc/NumberLexer.g4
The syntax used for defining lexer rules in Antlr4 has a certain resemblance to the syntax of regular expressions. A tilde ~ means negation, square brackets denote sets (you can also use ranges), the pipe character | means 'or'; dot, star, question mark and plus mean the same as in a regular expression, and so on. For a more complete description please see https://github.com/antlr/antlr4/blob/4.6/doc/lexer-rules.md
Since in this case we are using an island grammar, the lexer definition contains two different sets of lexical rules for the two different modes.
The first mode (sea) is implicitly the default mode. The calculation mode is activated when the default lexer mode encounters a boundary marker token. The boundary marker token is named NUMBERCALCOPEN. The mode switch is engaged by way of the `pushMode(NUMBERCALC)` directive you see in the line defining the marker.
The other mode is named NUMBERCALC and it also defines a boundary marker, named NUMBERCALCCLOSE. When that marker is encountered while in NUMBERCALC mode, the `popMode` directive switches lexical analysis back to the default mode.
The lexical structure of the default mode is very uncomplicated: all characters encountered in that mode are added to one single token named STATICTEXT.
The definition of the STATICTEXT token effectively says "any sequence of characters up to a ${* marker should be considered STATICTEXT" (in other words, not embedded script).
The actual definition of the STATICTEXT token is a bit more complicated than you would expect, because we want to allow for a $ that is not followed by a { to be interpreted as part of a STATICTEXT, and likewise for a ${ that is not followed by a *. Only when all three characters are encountered consecutively should this count as a marker for the start of an embedded expression.
So there are just two tokens in the default mode: the NUMBERCALCOPEN marker token for the mode switch, and the STATICTEXT token for everything else.
As for lexical analysis in the NUMBERCALC mode, a bit more is happening there. Any single character that has a meaning in our syntax is given a name which will be used when defining the grammar rules. The DIGIT and ALFA fragments are used in the definition of CONSTANT and IDENTIFIER. Fragments are not whole tokens in themselves but they aid in defining them; the concept of fragments is specific to the antlr4 tool.
The lexical rule for CONSTANT says: a constant is a sequence of at least one DIGIT, possibly followed by a DOT and another sequence of at least one DIGIT. So 2 is a valid constant and so is 2.0, but .9 is not and neither is 200,000.00. Nor do we have exponential notation such as 1.334e-15, and we also do not use a 0x prefix for hexadecimal nor an L suffix for long int. We could if we wanted to, but the aim here is to keep things simple.
Note that constant values can be integers or floating point, as can other values that we use for calculations. So what happens when we mix the two in, for instance, 0.9 / 2? This is another decision that the creator of the language has to make. In our case, I decided to convert everything to floating point before performing the arithmetic, unless both operands are integer, in which case both are converted to Long.
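Sketched in Java, that decision could look like the snippet below. The project keeps this logic in a utility class; the code here is my own illustration, not a copy of it.

```java
// Illustration of the numeric widening rule: stay in the integer domain
// only when both operands are integral, otherwise fall back to Double.
public final class NumberUtil {

    private NumberUtil() { }

    public static boolean isIntegral(Number n) {
        return n instanceof Long || n instanceof Integer;
    }

    public static Number add(Number left, Number right) {
        if (isIntegral(left) && isIntegral(right)) {
            return left.longValue() + right.longValue();   // 2 + 3   -> 5 (Long)
        }
        return left.doubleValue() + right.doubleValue();   // 0.9 + 2 -> 2.9 (Double)
    }
}
```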
The rule for IDENTIFIER allows for concatenating names with dots in between, where a name may begin with a letter or an underscore, but not with a digit. This makes sense because otherwise we would have a hard time knowing whether something is a constant or an identifier. Note that nothing is said about identifiers being case sensitive or not. This is in fact a detail that will be settled when we use identifiers to obtain a value.
Thus person.name is a valid identifier, and so is x3._internal, but not 2.fast.2.furious. A borderline case is _4, which our lexer rules allow but which in any real language definition should probably be illegal.
Finally the WS token definition has a skip directive that says "if you encounter any space, tab, carriage return or newline character do not pass this on to the parser". WS stands for "whitespace".
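If you want to see the lexer in action on its own, you can feed it a string and dump the tokens it produces using the standard antlr4 runtime API. I am assuming here that the closing marker text is `*}` (the mirror image of `${*`); check the lexer file for the exact markers.

```java
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.Token;
import nl.dimario.numbercalc.NumberLexer;

public class TokenDump {
    public static void main(String[] args) {
        // Lex a fragment that mixes static text with an embedded calculation.
        NumberLexer lexer = new NumberLexer(
                CharStreams.fromString("Total: ${* price * 2 *} euro"));
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        tokens.fill();
        for (Token token : tokens.getTokens()) {
            // Print the symbolic token name and the matched text.
            // WS tokens never show up because of the skip directive.
            System.out.printf("%-15s '%s'%n",
                    NumberLexer.VOCABULARY.getSymbolicName(token.getType()),
                    token.getText());
        }
    }
}
```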
You can find the source code for the grammar here:
calc/src/main/antlr4/nl/dimario/numbercalc/NumberParser.g4
Again, you can find a full description of the grammar rule format here: https://github.com/antlr/antlr4/blob/4.6/doc/grammars.md
The `options` line tells antlr4 that this grammar expects tokens as defined by the lexer named "NumberLexer". Then the grammar starts defining syntax rules. Rules are defined top down, with higher level rules building on lower level rules. At the lowest level we have things that cannot be described in terms of other things and should therefore be able to convey some meaning in and of themselves.
In our case, the grammar defines that the thing that lives at the uppermost level in our language is called a "document", and it defines a "document" by describing the rule that determines how you can tell whether the text that you feed to the parser is indeed a valid "document". In other language definitions the thing that lives at the topmost level can have a different name, e.g. "program", "file", "molecule" and so on. See https://github.com/antlr/grammars-v4 for a large number of grammar file examples (the *.g4 files are the actual grammar definitions).
Our document rule says that a valid "document" is made up of a sequence of lower level syntax constructs that are either a "text" or a "calculation". In addition, the sequence is allowed to be totally empty.
The rule for "text" states that this construct is simply a sequence of exactly one STATICTEXT token all by itself. The 'statictext' following the hash sign '#' is not a comment but tells antlr4 how we would like the method for this rule to be named in the generated Java code. When implementing what our scripting language should be doing, we will override the generated statictext
method, extract the string that is the actual STATICTEXT, and do something useful with it (such as copying unchanged to the output).
A sequence of tokens that satisfy the "calculation" rule consists of an "expression" between an opening and closing token, and we want to name the handler method for this construct "result".
The definition of the "expression" rule says that an expression can be one of five different sequences of tokens. The first two of these rules only use lexical tokens in their definition and do not rely on lower level grammatical constructs. When you implement the code for dealing with these two grammar rules, you will find that all subdivision ends here, and there is a clear way to translate the CONSTANT or IDENTIFIER to a value that can be used in higher level calculations.
In contrast, the other three rules for "expression" define an "expression" recursively in terms of "expression" itself! This means that when writing code to deal with these rules, you have to obtain a (sub-)expression from the context, extract a value from it by invoking the method that deals with an "expression", and then process the resulting value according to which one of the "expression" rules you are currently evaluating at this level of the recursion.
Thus, by introducing recursion in the definition of the rules for "expression" we also introduce recursion in the Java code that handles the processing of these rules. In this way you can evaluate for instance `((2+3) * 4)` by allowing `2+3` to be an expression, while `(2+3)` is another expression, which in turn is part of still another expression of which `*` and `4` are the other parts. All three expressions are processed at different recursion depths by the same set of Java methods that handle the "expression" rules.
The actual generation process is performed by running `java -cp /opt/antlr-4.7.2/antlr-4.7.2-complete.jar org.antlr.v4.Tool`, first on the lexer definition. This will create, amongst other things, a `*.tokens` file which is needed when running the same command on the grammar. Note that your mileage may vary with respect to the precise location and version of the antlr4 jar.
However, generating the Java code for the parser by hand is usually not done unless you want to study what happens when you tweak the language definitions. Instead, we use the `antlr4-maven-plugin` and run the generation as part of the build. For details, see the pom.xml of the calc project. It is set up to generate base classes for both the visitor and the listener manifestation of the parse tree walker. In the project only the visitor is used, but an example using the listener is also included.
Because of the very specific location of the *.g4 files, the generated Java code will be placed in a package named nl.dimario.numbercalc. The class names used are derived from the names given on the first lines of both files. When the build process generates the source files it places them under `target/generated-sources/antlr4/`. This is a default configuration location for Maven, so this source directory is included automatically when executing the compile phase of the build.
As an aside: when you first check out the project and import it into your IDE, you'll get a lot of Java syntax errors, because the base classes that other code extends are not yet present. The errors should disappear once you have run a Maven build. The build creates the missing base classes, and your IDE is probably smart enough to figure out that it should look in the `target/generated-sources` directory for additional source code. If not, you must tweak your IDE project settings to include the generated sources.
For this part of the technical explanation, please direct your attention to
calc/src/main/java/nl/dimario/numbercalc/NumberRenderer.java
This is where the interesting stuff happens. NumberRenderer has a `render()` method that takes a parsed tree as its input and then dives into the `visit()` method of the generated base class. The visit method starts walking the parse tree, calling the relevant methods when it encounters the various types of grammatical syntax structures that were transformed into a parse tree when the input source code was parsed.
For instance, when it encounters a node that represents a `statictext` grammatical rule, it will call the `visitStatictext()` method and pass along a context object holding information about the particulars of this statictext occurrence.
Our NumberRenderer overrides the method, extracts the string that was interpreted as a STATICTEXT token, and adds it to the output buffer. The agreement about static text is that we should pass it along unchanged from input to output and that is exactly what takes place here.
To ensure that walking the token tree continues we then let the base class take over and do its thing.
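In code, the pattern looks roughly like the sketch below. I am assuming the generated base class is named NumberParserBaseVisitor and parameterizing it with Number; the actual NumberRenderer in the project may differ in detail.

```java
import java.util.Map;

// Sketch of the renderer: collect output in a StringBuilder while visiting.
public class NumberRendererSketch extends NumberParserBaseVisitor<Number> {

    private final StringBuilder output = new StringBuilder();
    private final Map<String, Number> variables;

    public NumberRendererSketch(Map<String, Number> variables) {
        this.variables = variables;
    }

    @Override
    public Number visitStatictext(NumberParser.StatictextContext ctx) {
        output.append(ctx.getText());      // copy static text through verbatim
        return super.visitStatictext(ctx); // let the base class continue the walk
    }

    public String getOutput() {
        return output.toString();
    }
}
```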
When the walk along the parse tree sees a node for a "constant" syntax rule, it calls `visitConstant()`, again with a context object that has information about this particular constant. In this case, we cannot simply add the constant to the output buffer, because the constant is part of an expression and its value must be used in the evaluation of that expression. So we get the string that was interpreted as a CONSTANT token, transform it to a data type that is usable in calculations and return the value. In this case, we don't call the base class to continue walking the nodes under the CONSTANT, because we know that this type of node is a leaf node. Its rule in the grammar does not rely on lower level syntax constructs, but only on lexical tokens to which we can assign a value directly.
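Continuing the sketch started above, the constant handler might look like this; choosing Long or Double based on the presence of a dot is my own simplification:

```java
// A CONSTANT is a leaf node, so there is no need to visit children.
@Override
public Number visitConstant(NumberParser.ConstantContext ctx) {
    String text = ctx.getText();
    return text.contains(".") ? Double.valueOf(text) : Long.valueOf(text);
}
```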
Similarly, when an `identifier` occurs in the tree, our override method extracts the string that is the identifier from the context and then uses it to look up a value in a Map. Where does this Map come from and how does it get its values? I will explain that later on.
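In the sketch, the identifier lookup could be as simple as this; how to react to an unknown name is another design decision for the language creator:

```java
// Look the variable name up in the map that was handed to the renderer.
@Override
public Number visitIdentifier(NumberParser.IdentifierContext ctx) {
    Number value = variables.get(ctx.getText());
    if (value == null) {
        throw new IllegalArgumentException("Unknown variable: " + ctx.getText());
    }
    return value;
}
```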
When the treewalk sees an expression of the form BRACEOPEN whatever BRACECLOSE, it calls the `visitBraces()` method, as we instructed it to do in the grammar definition. Here, we know that whatever the braces surround must be an expression, so we extract it from the context and then walk the expression by calling the topmost entry point for walking the syntax tree, `visit()`. Here is where recursion can take place. Suppose the expression that we extracted from the context, which in itself is a subtree of the parsed input, contains another node for a subexpression surrounded by braces? Well, then the base class walking the subtree will again call `visitBraces()` with a context for the subexpression. In short, `visitBraces()` calls `visit()`, which in turn may call `visitBraces()`, which then again calls `visit()`. There is your recursion!
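In the sketch, the braces handler is a one-liner; the `expression()` accessor name is an assumption based on the rule name:

```java
// Ignore the brace tokens themselves and evaluate whatever they surround.
// The visit() call may land in visitBraces() again: there is your recursion.
@Override
public Number visitBraces(NumberParser.BracesContext ctx) {
    return visit(ctx.expression());
}
```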
The rule for multiplying or dividing calls the `visitMultdiv()` method, and the rule for addition and subtraction calls the `visitAddsub()` method. Here we must perform some gymnastics, because the values used in these basic operations can be either integer or floating point, and we must take care to convert the operands to either floating point or integer values before using them. In order not to muddle the issue I have moved the gory details dealing with implicit conversions to a utility class. The parser method merely extracts the values that the operation is performed on, together with the information whether we need to multiply or divide in `visitMultdiv()` or whether we must add or subtract in `visitAddsub()`.
Note that both `visitMultdiv()` and `visitAddsub()` make a call to `visit()` not once but twice. This is because both "things" surrounding the operator are expressions that have a value, and that value is calculated by calling `visit()` for the expression in question.
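Here is a sketch of one of these handlers, with the conversion gymnastics inlined (using the NumberUtil helper sketched earlier) instead of hidden in the utility class. Getting at the operator via `ctx.getChild(1)` is an assumption about the shape of the rule; the real grammar may name the operator token explicitly.

```java
// Evaluate both operand expressions (two visit() calls), then pick the
// integer or floating point flavour of the operation.
@Override
public Number visitMultdiv(NumberParser.MultdivContext ctx) {
    Number left = visit(ctx.expression(0));
    Number right = visit(ctx.expression(1));
    boolean multiply = "*".equals(ctx.getChild(1).getText());
    if (NumberUtil.isIntegral(left) && NumberUtil.isIntegral(right)) {
        return multiply ? left.longValue() * right.longValue()
                        : left.longValue() / right.longValue();
    }
    return multiply ? left.doubleValue() * right.doubleValue()
                    : left.doubleValue() / right.doubleValue();
}
```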
I left the highest level syntax rule for last: the result. The result rule is detected for the whole sequence of tokens that we encounter between a NUMBERCALCOPEN and a NUMBERCALCCLOSE token. This sequence of tokens is assumed to be an expression, and thus after invoking the rule for 'expression' by calling `visit()` for that expression we are left with a value, which is the result of all the arithmetic that went on inside the open and close markers.
What to do with this value? Actually, the whole purpose of this exercise was to replace an expression in the input with its calculated value in the output, so that is exactly what we do: the numerical value is converted to String and appended to the output buffer. In the process, the two marker tokens are discarded (or rather, they are ignored, as we don't do anything with them) and thus do not show up in the output.
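Sketched out, the result handler ties it all together:

```java
// Evaluate the expression between the two markers and append the outcome
// to the output buffer; the marker tokens themselves are simply never touched.
@Override
public Number visitResult(NumberParser.ResultContext ctx) {
    Number value = visit(ctx.expression());
    output.append(value);
    return value;
}
```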
You were left wondering how to obtain the values that we must fill in whenever we encounter the name of a variable, a.k.a. an identifier. The mechanism in itself is pretty straightforward: we assume that we have a `Map<String, Number>` that we use to look up a value for any variable name we encounter. This does not explain how the values get into the map, nor how we know what names to use or what they mean.
You probably won't like the answer, but the long and short of it is: it depends.
The most obvious place to get data from would be an external service. The service could be a REST like server, or perhaps an interface to a database. Or, god forbid but shit happens, some unfortunate creature that reads Excel spreadsheets for a living. Or all of the above.
The general idea here is that you somehow obtain data from whatever source is relevant for your project, and then pass it on to the NumberRenderer in the form of a Map filled with Numbers. Obviously whoever is responsible for adding embedded expressions to the content must know what kind of data may appear in the Map as a result of the various extraneous shenanigans, and more precisely he or she should know what names to use for the variables and what the values for these names mean when used together with the other content (natural language words and sentences) in the page.
There is a little bit of practical support I can offer: when the data is returned by the service as a JSON object, you could walk this object recursively and place every attribute value in the Map, perhaps using a path-like approach where you concatenate the names of the parent objects to the child attribute with a dot in between. This is what happens in `JsonDataSource`, which I included as an example. Note that the example deals with arrays by encoding the item index in the variable name in a way that is currently not accounted for by the lexer rules. Again, I want to keep things simple, so I provide a suggestion of how things could work, but I don't clutter up the beautifully clean and simple language definitions in order to make the suggestion actually work.
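Here is a sketch of such a recursive flattener, written against the Jackson library; the actual JsonDataSource in the project may be organized differently.

```java
import java.util.HashMap;
import java.util.Map;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonFlattener {

    // Turns {"person": {"age": 42}} into the entry "person.age" -> 42.
    public static Map<String, Number> flatten(String json) throws Exception {
        Map<String, Number> result = new HashMap<>();
        walk("", new ObjectMapper().readTree(json), result);
        return result;
    }

    private static void walk(String path, JsonNode node, Map<String, Number> result) {
        if (node.isNumber()) {
            result.put(path, node.numberValue());
        } else if (node.isObject()) {
            node.fields().forEachRemaining(field ->
                    walk(path.isEmpty() ? field.getKey() : path + "." + field.getKey(),
                         field.getValue(), result));
        } else if (node.isArray()) {
            // Encode the item index in the name. As noted above, names like
            // "items.0.price" are not valid IDENTIFIERs under the lexer rules.
            for (int i = 0; i < node.size(); i++) {
                walk(path + "." + i, node.get(i), result);
            }
        }
        // Other leaves (strings, booleans, null) are silently skipped here.
    }
}
```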
Another approach is useful when you have to query a service specifically for some value or other by name: before rendering, scan the input and collect the names of all variables used. Then submit this list of names to your external service and digest the result into the aforementioned `Map<String, Number>` for the renderer. In fact I have added `NumberVariableScanner` to the package as an illustration of how to do this. Fun fact: this class implements the listener and not the visitor base class, so you get an example of that one for free!
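A sketch of what such a scanner can look like, again assuming generated class names; you run it over the parse tree with the runtime's `ParseTreeWalker` before doing the actual rendering:

```java
import java.util.HashSet;
import java.util.Set;
import org.antlr.v4.runtime.tree.ParseTree;
import org.antlr.v4.runtime.tree.ParseTreeWalker;

// Listener-based scanner: collect every identifier the walker passes.
public class VariableScannerSketch extends NumberParserBaseListener {

    private final Set<String> names = new HashSet<>();

    @Override
    public void enterIdentifier(NumberParser.IdentifierContext ctx) {
        names.add(ctx.getText());
    }

    public Set<String> getNames() {
        return names;
    }

    // Usage: collect the variable names from a parsed document in one pass.
    public static Set<String> scan(ParseTree tree) {
        VariableScannerSketch scanner = new VariableScannerSketch();
        ParseTreeWalker.DEFAULT.walk(scanner, tree);
        return scanner.getNames();
    }
}
```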
What happens when an expression entered by a content creator does not comply with the syntax rules that you have concocted for your scripting language? By default, not much. The antlr-generated parser just sort of tries to make the best of it, and the result is that we may or may not see some output for the bad expression.
Clearly, we want to be alerted when something bad happens.
For this purpose I have added a `CustomErrorListener` to both the lexer and the parser. This error listener transforms parse errors into a special class of exception, the `ParseCancellationException`, which causes the runtime library and/or the generated code to NOT try and soldier on in case of errors, but instead throw the exception upwards so that we can be made aware of it in code that we manage ourselves.
How you want to deal with exceptions from the parser is entirely up to you. The goal of the CustomErrorListener is to make sure such exceptions reach your code.
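In case you want to roll your own, the essence of such an error listener is small. The sketch below shows the general pattern; the project's CustomErrorListener may differ in detail.

```java
import org.antlr.v4.runtime.BaseErrorListener;
import org.antlr.v4.runtime.RecognitionException;
import org.antlr.v4.runtime.Recognizer;
import org.antlr.v4.runtime.misc.ParseCancellationException;

// Fail-fast error listener: replace the default "soldier on" behaviour
// with an exception that propagates up to our own code. Attach it with:
//   lexer.removeErrorListeners();  lexer.addErrorListener(new FailFastErrorListener());
// and the same two calls on the parser.
public class FailFastErrorListener extends BaseErrorListener {

    @Override
    public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol,
            int line, int charPositionInLine, String msg, RecognitionException e) {
        throw new ParseCancellationException(
                "line " + line + ":" + charPositionInLine + " " + msg);
    }
}
```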
The parsing, the obtaining of data from an external source and the rendering of output are all conveniently packaged in a high level class, the `ScriptExpander`. This class also adds the custom error listeners for you.
So your Bloomreach web application needs to create a ScriptExpander, pass the rich text containing embedded script to it, and then receive the rendered output.
The best place for doing this would be inside of a ContentRewriter.
I have created a `ScriptContentRewriter` that extends the default SimpleContentRewriter to demonstrate the use of scripting.
All it takes is three lines of code to get the ScriptExpander to do its thing.
The input is the HTML content of the document to be evaluated. This is parsed and rendered, and that's it. For good measure some data is thrown in from a resource file `data.json`, just to show how you would connect external data to the script.
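In outline, the rewriter does something like the following; `expand()` and `readResource()` are made-up names for this illustration, so check the ScriptExpander source for the real API.

```java
// Hypothetical shape of the three lines in the content rewriter.
ScriptExpander expander = new ScriptExpander();
Map<String, Number> data = JsonFlattener.flatten(readResource("data.json")); // see sketch above
String output = expander.expand(html, data); // parse, evaluate, render
```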
In order to get the Bloomreach web application to actually use your version of the ContentRewriter, you have to configure it in `/site/webapp/src/main/webapp/WEB-INF/hst-config.properties` as per the instructions in the Bloomreach documentation.