Imagine finding yourself in a situation where you have a collection of
files containing source code. If this is an unreasonable prospect for
you, stop reading and pick another tutorial. If you are still with us,
consider these files the blueprints for your software. They are still
not the award-winning, executable program you are aiming for, but by
applying a compiler to them, you can generate one (with some minor
provisos about syntactic and semantic correctness). If so inclined,
you may also run a documentation generator like Javadoc on the source,
to generate structured documentation. Or, while colleagues are not
looking, you can secretly invoke tools like lint
(C/C++) or FindBugs
(Java) to weed out common programming errors.
The compilation, documentation generation and source-code analysis are all examples of software transformations, but they are certainly not the only ones. Software transformation has been used by the mathematically inclined for generating programs from high-level specifications, by forgetful people to recover lost design and architecture from legacy code and by reverse-engineers to obtain high-level, readable code from binary files after somebody accidentally misplaced a stack of backup tapes. Specialization of a program to known inputs in order to improve performance, optimization using domain knowledge from the application domain and improving program understanding by analysing sources are also favoured topics among software transformers.
But who uses software transformation, anyway? People with a problem resembling any of those in the list below. Compilation, the translation of a program to machine code in order to make it executable, is the standard processing technique applied to get running programs out of source code. But much more can be done. There are many other kinds of processes that can be applied to programs. For example, programs can be synthesized from high-level specifications; programs can be optimized using knowledge of the application domain; documentation for understanding a program can be automatically derived from its sources; programs can be specialized to known inputs; application programs can be generated from domain-specific languages; low-level programs can be reverse engineered into high-level programs.
All too often, Real Programmers facing such problems are of the opinion that software transformation is overly complicated dark magic, and that simple regular expression hacks solve the problem just fine. Almost equally often, their ad-hoc, text-based solutions turn out to be brittle, overly complicated and acquire a status of dark magic, with the result that no other team member dares touch the stuff. Most of the time, the problem would be easily solved in a maintainable and robust way if only the right tool could be found.
- Compilers
  - Translation, e.g. Stratego into C
  - Desugaring, e.g. Java's foreach into for
  - Instruction selection
    - Maximal munch
    - BURG-style dynamic programming
  - Optimization
    - Data-flow optimization
    - Vectorization
    - GHC-style simplification
    - Deforestation
    - Domain-specific optimization
    - Partial evaluation
  - Type checking
    - Specialization of dynamic typing
- Program generators
  - Pretty-printer and signature generation from syntax definitions
  - Application generation, e.g. data format checkers from specifications
- Program migration
  - Grammar conversion, e.g. YACC to SDF
- Program understanding
  - Documentation generation, e.g. API documentation for Stratego
- Document generation/transformation
  - Web/XML programming (server-side scripts)
So what should you do if you have a mountain of source code that you have to do some transformation on? Obviously, using the right tool for the job is a good start. We don't recommend using toothpicks to move software mountains. Instead, we think using Stratego for this is a good idea. In this tutorial, we will use small, understandable and cute examples that you can try out in the comfort of your own desktop. We hope these will convince you exactly how good an idea using Stratego for software transformation really is.
Stratego/XT is a framework for implementing software transformation systems. A software transformation system is usually organized as a simple pipeline of transformation steps. At the source of the pipeline (far left), a parser reads the text of the input program and turns it into a parse tree or abstract syntax tree. Subsequently, one or several transformation components modify the tree. At the sink of the pipeline (far right), a pretty-printer turns the output tree into program text. The output program need not be in the same language as the input program. It need not even be a programming language. This allows us to create important tools such as compilers and documentation generators using Stratego/XT.
The Stratego/XT framework consists of two parts: Stratego, a language for implementing software transformations, and XT, a collection of transformation tools. The Stratego language is a powerful language for implementing the core transformations of a complete transformation system. The XT tools help with the implementation of the infrastructure required around these core transformations, such as a parser and a pretty-printer.
Stratego and XT aim at better productivity in the development of transformation systems through the use of a high-level representations of programs, domain-specific languages for the development of parts of a transformation system, and generating various aspects of a transformation system automatically.
ATerm Format
: Although some transformation systems work directly on text, in general a textual representation is not adequate for performing complex transformations. Therefore, a structured representation is used by most systems. Since programs are written as texts by programmers, parsers are needed to convert from text to structure and unparsers are needed to convert structure to text.
The basic assumptions in our approach are that programs can be
represented as trees, or terms, and that term rewrite rules are an
excellent way to formalize transformations on programs. Stratego/XT
uses the Annotated Term Format, or ATerms for short, as term
representation. The Stratego run-time system is based on the ATerm
Library which provides support for internal term representation as
well as their persistent representation in files. This makes it easy
to provide input and output for terms in Stratego, and to exchange
terms between transformation tools.
Stratego Language
: Stratego is the core of Stratego/XT. It is a language for software transformation based on the paradigm of rewriting strategies. Basic transformations are defined using conditional term rewrite rules. These are combined into full-fledged transformations by means of strategies, which control the application of rules.
Term rewrite systems are formalisations of systematic modifications
of terms or trees. A rewrite rule describes how a program fragment
matching a certain pattern is transformed into another program
fragment. Term rewriting is the exhaustive application of a set of
rules to a term.
A complex software transformation is achieved through a number of consecutive modifications of a program. At least at the level of design it is useful to distinguish transformation rules from transformation strategies. A rule defines a basic step in the transformation of a program. A strategy is a plan for achieving a complex transformation using a set of rules. For example, consider the transformation rules below.

rules

  InlineF :
    |[ let f(xs) = e in e'[f(es)] ]| -> |[ let f(xs) = e in e'[e[es/xs]] ]|

  InlineV :
    |[ let x = e in e'[x] ]| -> |[ let x = e in e'[e] ]|

  Dead :
    |[ let x = e in e' ]| -> |[ e' ]|
    where <not(in)> (x, e')

  Extract(f,xs) :
    |[ e ]| -> |[ let f(xs) = e in f(xs) ]|

  Hoist :
    |[ let x = e1 in let f(xs) = e2 in e3 ]| -> |[ let f(xs) = e2 in let x = e1 in e3 ]|
    where <not(in)> (x, <free-vars> e2)

The `Inline*` rules define inlining of function and variable definitions. The `Dead` rule eliminates an unused variable definition. The `Extract` rule abstracts an expression into a function. The `Hoist` rule defines lifting a function definition out of a variable definition if the variable is not used in the function. Using this set of rules, different transformations can be achieved. For example, a constant propagation strategy in an optimizer could use the `InlineV` and `Dead` rules to eliminate constant variable definitions:
let x = 3 in x + y
-> let x = 3 in 3 + y
-> 3 + y
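In Stratego, such a strategy is expressed by combining the rules with strategy combinators. The following is a minimal sketch, not taken from the rules above: the strategy name is ours; `<+` tries `InlineV` and falls back to `Dead`, and `innermost` applies that choice exhaustively, bottom-up, until no more rules apply.

strategies

  propagate-constants = innermost(InlineV <+ Dead)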
On the other hand, the `ExtractFunction` strategy in a refactoring
browser could use the `Extract` and `Hoist` rules to abstract
addition with `y` into a new function and lift it to top-level.
let x = 3 in x + y
-> let x = 3 in let addy(z) = z + y in addy(x)
-> let addy(z) = z + y in let x = 3 in addy(x)
Conceptually, rules could be applied interactively by a programmer
via a graphical user interface. In Stratego/XT, you can use the
Stratego Shell for doing this. More on this later. The problem with
such interactive manipulations is that the transformation is not
reproducible, since the decisions have not been recorded. We want to
be able to automate the transformation process, because we can then
apply series of basic transformations repeatedly to a program. By
generalizing the sequence of transformations, the combined
transformation can be applied to many programs. This requires a
mechanism for combining rules into more complex transformations, and
this is exactly what the Stratego language gives us.
Pure term rewriting is not adequate for the implementation of
software transformation systems, since most rewrite systems are
non-confluent and/or non-terminating. Hence, standard rewriting
strategies are not applicable. The paradigm of programmable
rewriting strategies solves this problem by supporting the
definition of strategies adapted to a specific transformation
system. This makes it possible to select which rule to apply in
which transformation stage, and using which traversal order.
SDF Language
: Converting program texts to terms for transformations requires parsers. Since Stratego programs operate on terms, they do not particularly care about the implementation of parsers. Thus, parsers can be implemented with any parsing technology, or terms can be produced by an existing compiler front-end. In practice, Stratego is mostly used together with the syntax definition formalism SDF. The Stratego compiler itself uses SDF to parse Stratego programs, and many Stratego applications have been developed with SDF as well.
The syntax definition formalism SDF supports high-level,
declarative, and modular definition of the syntax of programming
languages and data formats. The formalism integrates the definition
of lexical and context-free syntax. The modularity of the formalism
implies that it is possible to easily combine two languages or to
embed one language into another.
GPP and the Box Language
: Stratego/XT uses the pretty-printing model provided by the Generic Pretty-Printing package GPP. In this model a tree is unparsed to a Box expression, which contains text with markup for pretty-printing. A Box expression can be interpreted by different back-ends to produce formatted output for different display devices such as plain text, HTML, and LaTeX.
XT tool collection
: XT is a collection of transformation tools providing support for the generation of many infrastructural aspects of program transformation systems, including parsers, pretty-printers, parenthesizers, and format checkers.
XTC
: Parsers, pretty-printers, and transformations can be encapsulated in separate executable components, which can be reused in multiple transformation systems. Composition of such components is facilitated by the XTC transformation tool composition library. Initially this tutorial uses separate components that are glued using shell scripts, in order to improve the understanding of the separate components. The use of XTC is introduced later on.
Exactly what all this means will become clear to you as we move along in this tutorial.
This tutorial is divided into three parts. The first part introduces the XT architecture and many of the tools from the XT collection. An important point in this part is how to construct parsers using the syntax definition formalism SDF. The parser turns source code text into structured ATerms. Another point in this part is the reverse action: going from ATerms back to source code text.
The second part of the tutorial introduces the Stratego language, starting with the concept of terms and moving on to rules and strategies. After explaining how rules and strategies may be combined to create complete transformation programs, the more advanced topics of concrete syntax and dynamic rules are covered.
The third and final part of the tutorial explains the most important strategies found in the Stratego library: basic data types such as lists, strings, hashtables and sets; basic I/O functionality; the SUnit framework for unit testing. This part also explains the technical details of how to put together complete software transformation systems from XT components using the Stratego build system, using the XTC component composition model.
The Stratego/XT project distributes several packages. So let's first make clear what you actually need to install. Stratego/XT itself is a language independent toolset for constructing program transformation systems. Language-specific extensions of Stratego/XT are distributed as separate packages, so that you only have to install what you really need for your particular application.
Stratego/XT.
All Stratego/XT users need to install the ATerm Library (aterm), the SDF2 Bundle (sdf2-bundle) and Stratego/XT (strategoxt). These packages enable you to compile Stratego programs, and provide the basic infrastructure for parsing and pretty-printing source files.
Stratego Shell.
Optionally, you can install the Stratego Shell, which provides an interpreter for Stratego and an interactive command-line for experimenting with the Stratego language. The Stratego Shell is used in the Stratego part of this tutorial to demonstrate the features of the Stratego language. The Stratego Shell is also very useful for small experiments with strategies of the Stratego Library.
Extensions.
Then there are the language-specific packages. These packages provide the basic infrastructure for parsing, pretty-printing, and in some cases also analyzing source files of a specific programming language. Reusing such a package enables you to get started immediately with the implementation of an actual transformation. Examples of such packages are Java-front, Dryad, Transformers C and C++, BibTeX Tools, Prolog Tools, AspectJ Front, and SQL Front. Also, there are some demonstration packages, for example Tiger Base, which implements a compiler for the Tiger language, and Java Borg, which demonstrates the implementation of language embeddings. All these packages can separately be installed as extensions of Stratego/XT.
Examples of the Stratego/XT Manual.
All the code examples in this manual are available for separate download, so that you can experiment based on these: examples.tar.gz
First of all, you have to decide which deployment mechanism you want to use. For users of RPM-based Linux distributions (such as Red Hat, Fedora Core, and SUSE), we advise using the RPMs, which are available from the release page. For Cygwin users we provide pre-compiled binaries that can simply be unpacked. For Mac OS X users, we provide these binary packages as well, but they can also use the Nix deployment system, which will guarantee that all dependencies are installed correctly. Nix packages are also available for Linux users. Finally, it is always possible to build from source.
Next, download the required packages. Stratego/XT depends on the ATerm Library and the SDF2 Bundle, so you have to download aterm, sdf2-bundle, and strategoxt. The downloads are all available at the release page of Stratego/XT.
The following sequence of commands takes care of building and installing the aterm and sdf2-bundle packages in /usr/local.
$ tar zxf aterm-version.tar.gz
$ cd aterm-version
$ ./configure
$ make
$ make install
$ cd ..
$ tar zxf sdf2-bundle-version.tar.gz
$ cd sdf2-bundle-version
$ ./configure --with-aterm=/usr/local
$ make
$ make install
$ cd ..
If you want to install the packages at a different location (i.e. not /usr/local), you should specify a --prefix in the configure command. For example:
$ ./configure --prefix=/opt/aterm
$ ./configure --prefix=/opt/sdf2-bundle --with-aterm=/opt/aterm
In this case, it is possible that the sdf2-bundle cannot find the aterm package. To tell the sdf2-bundle where it should look for the ATerm Library, you can use the --with-aterm argument:
$ ./configure --prefix=/opt/sdf2-bundle --with-aterm=/opt/aterm
Alternatively, you can add the location of the ATerm Library to the PKG_CONFIG_PATH, which the configure script will use for searching packages. In this way, you don't need to specify the --with- arguments. More information about this is available in the pkg-config documentation (man pkg-config). For example:
$ export PKG_CONFIG_PATH=/opt/aterm/lib/pkgconfig:/opt/sdf2-bundle/lib/pkgconfig
Unpack, configure, make and install Stratego/XT using the following commands:
$ tar zxf strategoxt-version.tar.gz
$ cd strategoxt-version
$ ./configure --with-aterm=/usr/local --with-sdf=/usr/local
$ make
$ make install
If you want to install Stratego/XT at a different prefix, you should specify a --prefix. If you installed the ATerm Library and the SDF2 Bundle at a different location, you should specify their locations using --with-aterm and --with-sdf. For example:
$ ./configure --prefix=/opt/strategoxt \
--with-aterm=/opt/aterm --with-sdf=/opt/sdf2-bundle
As mentioned earlier, you can alternatively add the locations of the ATerm Library and the SDF2 Bundle to the PKG_CONFIG_PATH, which the configure script will use for searching packages.
The Stratego Shell, Java Front, Tiger Base and several other packages depend on Stratego/XT. For all these packages, you can use the following commands:
$ tar zxf package-version.tar.gz
$ cd package-version
$ ./configure
$ make
$ make install
For all these packages, you should use the --with-aterm, --with-sdf, and --with-strategoxt options if you installed these packages at non-standard locations. Alternatively, you can extend the PKG_CONFIG_PATH to include the locations of the ATerm Library, SDF2 Bundle, and Stratego/XT.
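For example, a configure invocation for such a package, with all three dependencies installed under hypothetical /opt prefixes, might look like this:

$ ./configure --prefix=/opt/stratego-shell \
    --with-aterm=/opt/aterm --with-sdf=/opt/sdf2-bundle \
    --with-strategoxt=/opt/strategoxt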
Installing binary RPMs is very easy. Install the RPMs by running rpm -i * in the directory where you have downloaded the RPMs. Use the upgrade option rpm -U * if you have already installed earlier versions of the RPMs for aterm, strategoxt or the sdf2-bundle. Of course, you can also install the RPMs one by one by specifying the filenames of the RPMs.
Using the Nix deployment system for the installation of Stratego/XT is a good idea if you need to run multiple versions of Stratego/XT on the same system, if you will need to update other Stratego/XT related packages regularly, or if there is a problem with installation from source at your system. The Nix deployment system is designed to guarantee that the Stratego/XT that we are using on our system is exactly reproduced on your system. This basically guarantees that the installation will never fail because of missing dependencies or mistakes in the configuration.
The release pages of all the packages refer to Nix packages that can be installed using nix-install-package.
In this chapter, a general overview is given of the architecture of the XT transformation tools. The technical details of the tools and languages involved will be discussed in the following chapters.
XT is a collection of components for implementing transformation systems. Some of these components generate code that can be included in a transformation system, such as a parser or pretty-printer. Other components can be used immediately, since they are generic tools. The components of XT are all executable tools: they can be used directly from the command-line.
According to the Unix philosophy, the tools that are part of XT all do just one thing (i.e. implement one aspect of a transformation system) and can be composed into a pipeline to combine them in any way you want. In a typical pipeline, a parser is first applied to a source program. This results in an abstract syntax tree, which is then transformed by a sequence of transformation tools. Finally, the tree is pretty-printed back to a source program. So, this pipeline is a source-to-source transformation system. The tools that are part of the pipeline exchange structured representations of a program, in the form of an abstract syntax tree. These structured representations can be read in any programming language you want, so the various components of a transformation system can be implemented in different programming languages. Usually, the real transformation components will be implemented in Stratego.
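As a small, concrete taste of such a pipeline, the following composition parses an expression and pretty-prints the resulting term. It reuses the sglri and pp-aterm tools and the Expression.tbl parse table that are introduced later in this tutorial, and omits the actual transformation step in the middle:

$ echo "(a + n) * 1" | sglri -p Expression.tbl | pp-aterm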
Of course, some compositions are very common. For example, you definitely don't want to enter the complete pipeline of a compiler again and again. So, you want to pre-define such a composition to make it reusable as a single tool. For this, you could of course create a shell script, but option handling in shell scripts is quite tiresome and you cannot easily add composition-specific glue code. To solve this, Stratego itself has a concise library for creating compositions of transformation tools, called XTC. We will come back to that later.
The nice thing about this approach is that all the tools can be reused in different transformation systems. A parser, pretty-printer, desugarer, optimizer, simplifier, and so on are automatically available to other transformation systems that operate on the same language. Even if a tool will typically be used in a single dominant composition (e.g. a compiler), having the tool available to different transformation systems is very useful. In other words: an XT transformation system is open and its components are reusable.
Programmers write programs as texts using text editors. Some programming environments provide more graphical (visual) interfaces for programmers to specify certain domain-specific ingredients (e.g., user interface components). But ultimately, such environments have a textual interface for specifying the details. So, a program transformation system needs to deal with programs that are in text format.
However, for all but the most trivial transformations, a structured, rather than a textual, representation is needed. Working directly on the textual representation does not give the transformation enough information about what the text actually means. To bridge the gap between the textual and the structured representation, parsers and unparsers are needed. Also, we need to know how this structured representation is actually structured.
The syntactical rules of a programming language are usually expressed in a context-free grammar. This grammar (or syntax definition) of a programming language plays a central role in Stratego/XT. Most of the XT tools work on a grammar in one way or another. The grammar is a rich source of information. For example, it can be used to generate a parser, pretty-printer, disambiguator, tree grammar, format-checker, and documentation. This central role of a grammar is the most crucial aspect of XT: all the components of a transformation system are directly or indirectly based on the grammar of a programming language.
What makes this all possible is the way syntax is defined in Stratego/XT. Stratego/XT uses the SDF language for syntax definition, which is a very high-level and declarative language for this purpose. SDF does not require the programmer to encode all kinds of properties of a programming language (such as associativity and priorities) in a grammar. Instead, SDF has declarative, concise ways of defining these properties in such a way that they can actually be understood by tools that take the grammar as an input. And, of course, creating a grammar in such a language is much more fun!
As mentioned before, the XT tools exchange a structured representation of a program: an abstract syntax tree. The structure of this abstract syntax tree is called the abstract syntax, as opposed to the ordinary textual syntax, which is called the concrete syntax. In XT, the abstract syntax is directly related to the syntax definition and can be generated from it. The result is a tree grammar that defines the format of the trees that are exchanged between the transformation tools. From the world of XML, you are probably already familiar with tree grammars: DTD, W3C XML Schema and RELAX NG are tree grammars in disguise.
Until now, we have been a bit vague about the format in which abstract syntax trees are actually exchanged between transformation tools. What is this structured program representation?
We can easily imagine an abstract syntax tree as a graphical structure. For example, a simple abstract syntax tree for the expression (a + n) * 1 has a node for the multiplication at its root, with a subtree for the addition and a leaf for the integer constant below it. This representation corresponds closely to the representation of trees in computer memory. However, drawing pictures is not a very effective way of specifying tree transformations. We need a concise, textual way to express an abstract syntax tree. Fortunately, there is a one-to-one correspondence between trees and so-called first-order prefix terms (terms, for short). A term is a constructor, i.e., an identifier, applied to zero or more terms. Strings and integer constants are terms as well. Thus, the following term corresponds to the abstract syntax tree just described.
Times(Plus(Var("a"), Var("n")), Int("1"))
In Stratego/XT, programs are exchanged between transformation tools as terms. The exact format we use for terms, is the Annotated Term Format, or ATerms for short. We will discuss this format in more detail later, but some of its features are interesting to note here.
First, ATerms are not only used to exchange programs between tools. ATerms are also used in the Stratego language itself for the representation of programs. In other words, ATerms are used for the external representation as well as the internal representation of programs in Stratego/XT. This is very convenient, since we don't have to bind the ATerms to Stratego specific data-structures. Usually, if such a data-binding is necessary, then there is always a mismatch here and there.
Second, an important feature of the implementation is that terms are represented using maximal sharing. This means that any term in use in a program is represented only once. In other words, two occurrences of the same term will be represented by pointers to the same location. This greatly reduces the amount of memory needed for representing programs.
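For example, consider the term below (a made-up example):

Plus(Var("x"), Var("x"))

With maximal sharing, the subterm Var("x") is allocated only once; both argument positions of Plus point to that single node, and testing the two occurrences for equality amounts to comparing two pointers.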
The Annotated Term Format, or ATerms for short, is heavily used in Stratego/XT. It is used for the structured representation of programs (and also data in general). Program representations are exchanged between transformation tools in the ATerm format, and the data-structures of the Stratego language itself are ATerms.
Before we start with the more interesting tools of XT, we need to take a closer look at the ATerm format. This chapter introduces the ATerm format and some tools that operate on ATerms.
The ATerm format provides a set of constructs for representing trees,
comparable to XML or abstract data types in functional programming
languages. For example, the code 4 + f(5 * x)
might be represented in a term as:
Plus(Int("4"), Call("f", [Mul(Int("5"), Var("x"))]))
ATerms are constructed from the following elements:
Integer
: An integer constant, that is, a list of decimal digits, is an ATerm. Examples: 1, 12343.
String
: A string constant, that is a list of characters between double quotes is an ATerm. Special characters such as double quotes and newlines should be escaped using a backslash. The backslash character itself should be escaped as well.
Examples: `"foobar"`, `"string with
quotes\""`, `"escaped escape character\\ and
a newline\n".`
Constructor application
: A constructor is an identifier, that is an alphanumeric string starting with a letter, or a double quoted string.
A constructor application `c(t1,...,tn)` creates a term by applying
a constructor to a list of zero or more terms. For example, the term
`Plus(Int("4"),Var("x"))` uses the constructors `Plus`, `Int`, and
`Var` to create a nested term from the strings `"4"` and `"x"`.
When a constructor application has no subterms the parentheses may
be omitted. Thus, the term `Zero` is equivalent to `Zero()`. Some
people consider it good style to explicitly write the parentheses
for nullary terms in Stratego programs. Through this rule, it is
clear that a string is really a special case of a constructor
application.
List
: A list is a term of the form [t1,...,tn], that is, a list of zero or more terms between square brackets.
While all applications of a specific constructor typically have the
same number of subterms, lists can have a variable number of
subterms. The elements of a list are typically of the same type,
while the subterms of a constructor application can vary in type.
Example: The second argument of the call to `"f"` in the term
`Call("f",[Int("5"),Var("x")])` is a list of expressions.
Tuple
: A tuple (t1,...,tn) is a constructor application without constructor. Example: (Var("x"), Type("int"))
Annotation
: The elements defined above are used to create the structural part of terms. Optionally, a term can be annotated with a list of terms. These annotations typically carry additional semantic information about the term. An annotated term has the form t{t1,...,tn}. Example: Lt(Var("n"),Int("1")){Type("bool")}. The contents of annotations is up to the application.
As a Stratego programmer you will be looking a lot at raw ATerms. Stratego pioneers would do this by opening an ATerm file in emacs and trying to get a sense of the structure by parenthesis highlighting and inserting newlines here and there. These days your life is much more pleasant thanks to the tool pp-aterm, which adds layout to a term to make it readable. For example, parsing the following program
let function fact(n : int) : int =
if n < 1 then 1 else (n * fact(n - 1))
in printint(fact(10))
end
produces the following ATerm (say in file fac.trm):
Let([FunDecs([FunDec("fact",[FArg("n",Tp(Tid("int")))],Tp(Tid("int")),
If(Lt(Var("n"),Int("1")),Int("1"),Seq([Times(Var("n"),Call(Var("fact"),
[Minus(Var("n"),Int("1"))]))])))])],[Call(Var("printint"),[Call(Var(
"fact"),[Int("10")])])])
By pretty-printing the term using pp-aterm, as in
$ pp-aterm -i fac.trm
we get a much more readable term:
Let(
[ FunDecs(
[ FunDec(
"fact"
, [FArg("n", Tp(Tid("int")))]
, Tp(Tid("int"))
, If(
Lt(Var("n"), Int("1"))
, Int("1")
, Seq(
[ Times(
Var("n")
, Call(
Var("fact")
, [Minus(Var("n"), Int("1"))]
)
)
]
)
)
)
]
)
]
, [ Call(
Var("printint")
, [Call(Var("fact"), [Int("10")])]
)
]
)
An important feature of the implementation is that terms are represented using maximal sharing. This means that any term in use in a program is represented only once. In other words, two occurrences of the same term will be represented by pointers to the same location, which turns the tree into, more accurately, a directed acyclic graph: any sub-term is represented exactly once in memory, with each occurrence pointing to the same memory location. This representation entails that term equality is a constant-time operation, since it consists of comparing pointers.
It should be noted that annotations create different terms: two terms that are identical except for their annotations are not equal.
Maximal sharing can make a big difference in the amount of bytes needed for representing programs. Therefore, we would like to preserve this maximal sharing when an ATerm is exchanged between two programs. When exporting a term using the textual exchange format, this compression is lost. Therefore, the ATerm Library also provides a binary exchange format that preserves maximal sharing.
Actually, there are three different formats:
Textual ATerm Format
: In the textual ATerm format the ATerm is written as plain text, without sharing. This format is very inefficient for the exchange of large programs, but it is readable for humans.
Binary ATerm Format
: The binary ATerm format, also known as BAF, is an extremely efficient binary encoding of an ATerm. It preserves maximal sharing and uses all kinds of tricks to represent an ATerm in as few bytes as possible.
Shared Textual ATerm Format
: In the shared, textual, format the ATerm is written as plain text, but maximal sharing is encoded in the text.
The tool baffle can be used to convert an ATerm from one format to another. Baffle, and other tools that operate on ATerms, automatically detect the format of an input ATerm.
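For example, converting the textual term from the previous section to the binary format and back might look like this. The -wb and -wt flags (write binary, write textual) are as we recall them from baffle's usage information; treat them as an assumption and check baffle -h if your version differs:

$ baffle -i fac.trm -o fac.baf -wb
$ baffle -i fac.baf -o fac2.trm -wt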
The ATerm Format is an external representation for terms that can be used to exchange structured data between programs. In order to use a term, a program needs to parse ATerms and transform them into some internal representation. To export a term after processing it, a program should transform the internal representation into the standard format. There are libraries supporting these operations for a number of languages, including C, Java, and Haskell.
The implementation of the Stratego transformation language is based on the C implementation of the library. The library provides term input and output, and an API for constructing and inspecting terms. Garbage collection is based on Boehm's conservative garbage collection algorithm.
Earlier in this tutorial we introduced the architecture of the XT transformation tools: source-to-source transformation systems based on XT consist of a pipeline of a parser, a series of transformations on a structured program representation, and a pretty-printer. We have also explained the ATerm format, which is the format we use for this structured program representation. This chapter is about the parser part of the pipeline.
Stratego/XT uses the Syntax Definition Formalism (SDF), for defining the syntax of a programming language. From a syntax definition in SDF, a parser can be generated fully automatically. There is no need for a separate lexer or scanner specification, since SDF integrates the lexical and the context-free syntax definition of a programming language in a single specification. The generated parser is based on the Scannerless Generalized-LR algorithm, but more details about that later. The parser directly produces an ATerm representation of the program, as a parse tree, or as an abstract syntax tree.
Actually, the component-based approach of XT allows you to use any tool for parsing a source program to an ATerm. So, you don't necessarily have to use the parsing tools we present in this chapter. Instead, it might sometimes be a good idea to create an ATerm backend for a parser that you already have developed (by hand or using a different parser generator), or reuse an entire, existing front-end that is provided by a third-party. However, the techniques we present in this chapter are extremely expressive, flexible, and easy to use, so for developing a new parser it would be a very good idea to use SDF and SGLR.
In this section, we review the basics of grammars, parse trees and abstract syntax trees. Although you might already be familiar with these basic concepts, the perspective from the Stratego/XT point of view might still be interesting to read.
Context-free grammars were originally introduced by Chomsky to describe
the generation of grammatically correct sentences in a language. A
context-free grammar is a set of productions of the form A0 -> A1
... An, where A0 is a non-terminal and A1 ... An is a string of
terminals and non-terminals. From this point of view, a grammar
describes a language by generating its sentences. A string is
generated by starting with the start non-terminal and repeatedly
replacing non-terminal symbols according to the productions until a
string of terminal symbols is reached.
A context-free grammar can also be used to recognize sentences in the
language. In that process, a string of terminal symbols is rewritten to
the start non-terminal, by repeatedly applying grammar productions
backwards, i.e. reducing a substring matching the right-hand side of a
production to the non-terminal on the left-hand side. To emphasize the
recognition aspect of grammars, productions are specified as A1 ...
An -> A0 in SDF.
As an example, consider the SDF productions for a small language of arithmetic expressions shown below on the left, where Id and IntConst are terminals and Var and Exp are non-terminals. Using this definition, and provided that a and n are identifiers (Id) and 1 is an IntConst, a string such as (a+n)*1 can be recognized as an expression by reducing it to Exp, as shown by the reduction sequence on the right.
Id -> Var (a + n ) * 1
Var -> Exp -> (Id + n ) * 1
IntConst -> Exp -> (Var + n ) * 1
"-" Exp -> Exp -> (Exp + n ) * 1
Exp "*" Exp -> Exp -> (Exp + Id ) * 1
Exp "+" Exp -> Exp -> (Exp + Var) * 1
Exp "-" Exp -> Exp -> (Exp + Exp) * 1
Exp "=" Exp -> Exp -> (Exp ) * 1
Exp ">" Exp -> Exp -> Exp * 1
"(" Exp ")" -> Exp -> Exp * IntConst
-> Exp * Exp
-> Exp
Recognition of a string only leads to its grammatical category, not to any other information. However, a context-free grammar not only describes a mapping from strings to sorts, but actually assigns structure to strings. A context-free grammar can be considered as a declaration of a set of trees of one level deep: each production corresponds to a one-level tree, with the sort the production produces as its root and the symbols of the production as its leaves.
Such one-level trees can be composed into larger trees by fusing trees such that the symbol at a leaf of one tree matches the root symbol of another, as in the fusion of the plus and times productions. The fusion process can continue as long as the tree has non-terminal leaves. A tree composed in this fashion is a parse tree if all leaves are terminal symbols. For example, the expression (a+n)*1, for which we showed a reduction sequence earlier, has a parse tree composed in exactly this way. This illustrates the direct correspondence between a string and its grammatical structure. The string underlying a parse tree can be obtained by concatenating the symbols at its leaves, also known as the yield.
Parse trees can be derived from the reduction sequence induced by the productions of a grammar. Each rewrite step is associated with a production, and thus with a tree fragment. Instead of replacing a symbol with the non-terminal symbol of the production, it is replaced with the tree fragment for the production fused with the trees representing the symbols being reduced. Thus, each node of a parse tree corresponds to a rewrite step in the reduction of its underlying string. This can be seen by comparing the reduction sequence shown earlier with the corresponding parse tree.
The recognition of strings described by a syntax definition, and the corresponding construction of parse trees can be done automatically by a parser, which can be generated from the productions of a syntax definition.
Parse trees contain all the details of a program including literals, whitespace, and comments. This is usually not necessary for performing transformations. A parse tree is reduced to an abstract syntax tree by eliminating irrelevant information such as literal symbols and layout. Furthermore, instead of using sort names as node labels, constructors encode the production from which a node is derived. For this purpose, the productions in a syntax definition are extended with constructor annotations. The productions below extend the syntax definition shown earlier with constructor annotations; for the same good old string (a+n)*1, the resulting abstract syntax tree is the term Times(Plus(Var("a"), Var("n")), Int("1")). Note that some identifiers are used both as sort names and as constructors. This does not lead to a conflict since sort names and constructors are derived from separate name spaces. Some productions do not have a constructor annotation, which means that these productions do not create a node in the abstract syntax tree.
Id -> Var {cons("Var")}
Var -> Exp
IntConst -> Exp {cons("Int")}
"-" Exp -> Exp {cons("Uminus")}
Exp "*" Exp -> Exp {cons("Times")}
Exp "+" Exp -> Exp {cons("Plus")}
Exp "-" Exp -> Exp {cons("Minus")}
Exp "=" Exp -> Exp {cons("Eq")}
Exp ">" Exp -> Exp {cons("Gt")}
"(" Exp ")" -> Exp
In the next section we will take a closer look at the various features of the SDF language, but before that it is useful to know your tools, so that you can immediately experiment with the various SDF features you will learn about. You don't need to fully understand the SDF fragments we use to explain the tools: we will come back to that later.
One of the nice things about SDF and the tools for it, is that the concepts that we have discussed directly translate to practice. For example, SDF supports all context-free grammars; the parser produces complete parse trees, which can even be yielded; parse trees can be converted to abstract syntax trees.
In SDF, you can split a syntax definition into multiple modules. So, a complete syntax definition consists of a set of modules. SDF modules are stored in files with the extension .sdf. The modules below define a small language of expressions. The module Lexical defines the identifiers and integer literals of the language. This module is imported by the module Expression.
module Expression
imports
  Lexical Operators

exports
  context-free start-symbols Exp
  context-free syntax
    Id          -> Exp {cons("Var")}
    IntConst    -> Exp {cons("Int")}
    "(" Exp ")" -> Exp {bracket}

module Operators
exports
  sorts Exp
  context-free syntax
    Exp "*" Exp -> Exp {left, cons("Times")}
    Exp "/" Exp -> Exp {left, cons("Div")}
    Exp "%" Exp -> Exp {left, cons("Mod")}
    Exp "+" Exp -> Exp {left, cons("Plus")}
    Exp "-" Exp -> Exp {left, cons("Minus")}
  context-free priorities
    {left:
      Exp "*" Exp -> Exp
      Exp "/" Exp -> Exp
      Exp "%" Exp -> Exp
    }
    > {left:
      Exp "+" Exp -> Exp
      Exp "-" Exp -> Exp
    }

module Lexical
exports
  sorts Id IntConst
  lexical syntax
    [\ \t\n] -> LAYOUT
    [a-zA-Z]+ -> Id
    [0-9]+ -> IntConst
Before you can invoke the parser generator to create a parser for this expression language, the modules that constitute a complete syntax definition have to be collected into a single file, usually called a definition. This file has the extension .def. Collecting SDF modules into a .def file is the job of the tool pack-sdf.
$ pack-sdf -i Expression.sdf -o Expression.def
Pack-sdf collects all modules imported by the SDF module specified using the -i parameter. This results in a combined syntax definition, which is written to the file specified with the -o parameter. Modules are looked for in the current directory and in any of the include directories indicated with the -I dir arguments (an example follows).
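Assuming extra modules live in a hypothetical ../common-syntax directory, such an invocation might look like:

$ pack-sdf -i Expression.sdf -o Expression.def -I ../common-syntax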
Pack-sdf does not analyse the contents of an SDF module to report possible errors (except for syntactical ones). The parser generator, discussed next, performs this analysis.
Tip

All tools in Stratego/XT use the -i and -o options for input and output. Also, most tools read from standard input or write to standard output if no input or output is specified.
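For example, both of the following invocations should pretty-print fac.trm, the second reading the term from standard input:

$ pp-aterm -i fac.trm
$ pp-aterm < fac.trm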
From the .def definition file (as produced by pack-sdf), the parser generator sdf2table constructs a parse table. This parse table can later on be handed to the actual parser.
$ sdf2table -i Expression.def -o Expression.tbl -m Expression
The -m option is used to specify the module for which to generate a parse table. The default module is Main, so if the syntax definition has a different main module (in this case Expression), then you need to specify this option.
The parse table is stored in a file with extension .tbl. If all you plan on doing with your grammar is parsing, the resulting .tbl file is all you need to deploy. Sdf2table could be thought of as a parser generator; however, unlike many other parsing systems, it does not construct a parser program, but a compact data representation of the parse table.
Sdf2table analyzes the SDF syntax definition to detect possible errors, such as undefined symbols, symbols for which no productions exist, deprecated features, etc. It is a good idea to try to fix these problems, although sdf2table will usually still happily generate a parse table for you.
Now that we have a parse table, we can invoke the actual parser with this parse table to parse a source program. The parser is called sglr and produces a complete parse tree in the so-called AsFix format. Usually, we are not really interested in the parse tree and want to work on an abstract syntax tree. For this, there is the somewhat easier to use tool sglri, which internally just invokes sglr.
$ cat mul.exp
(a + n) * 1
$ sglri -p Expression.tbl -i mul.exp
Times(Plus(Var("a"),Var("n")),Int("1"))
For small experiments, it is useful that sglri can also read the source file from standard input. Example:
$ echo "(a + n) * 1" | sglri -p Expression.tbl
Times(Plus(Var("a"),Var("n")),Int("1"))
Heuristic Filters.
As we will discuss later, SGLR uses disambiguation filters to select the desired derivations if there are multiple possibilities. Most of these filters are based on specifications in the original syntax definition, such as associativity, priorities, follow restrictions, reject, avoid and prefer productions. However, unfortunately there are some filters that select based on heuristics. Sometimes these heuristics are applicable to your situation, but sometimes they are not. Also, the heuristic filters make it less clear when and why there are ambiguities in a syntax definition. For this reason, they are disabled by default if you use sglri (but not yet if you use sglr).
Start Symbols.
If the original syntax definition contained multiple start symbols, then you can optionally specify the desired start symbol with the -s option. For example, if we add Id to the start-symbols of our expression language, then a single identifier is suddenly an ambiguous input (we will come back to ambiguities later):
$ echo "a" | sglri -p Expression.tbl
amb([Var("a"),"a"])
By specifying a start symbol, we can instruct the parser to give us the expression alternative, which is the first term in the list of two alternatives.
$ echo "a" | sglri -p Expression.tbl -s Exp
Var("a")
Working with Parse Trees.
: If you need a parse tree, then you can use sglr itself. These parse trees contain a lot of information, so they are huge. Usually, you really don't want to see them. Still, the structure of the parse tree is quite interesting, since you can inspect exactly how the productions from the syntax definition have been applied to the input.
$ echo "(a + n) * 1" | sglr -p Expression.tbl -2 -fi -fe | pp-aterm
parsetree( ... )
Note that we passed the options -2 -fi -fe to sglr. The -2 option specifies the variant of the AsFix parse tree format that should be used: AsFix2. The Stratego/XT tools use this variant at the moment. All variants of the AsFix format are complete and faithful representations of the derivation constructed by the parser. They include all details of the input file, including whitespace and comments, and are self-documenting, as they use the complete productions of the syntax definition to encode node labels. The AsFix2 variant preserves all the structure of the derivation. In the other variant, the structure of the lexical parts of a parse tree is not preserved. The -fi -fe options are used to disable the heuristic disambiguation filters, which are disabled by default in sglri, but not in sglr.
The parse tree can be imploded to an abstract syntax tree using the tool implode-asfix. The combination of sglr and implode-asfix has the same effect as directly invoking sglri.
$ echo "(a + n) * 1" | sglr -p Expression.tbl -2 -fi -fe | implode-asfix
Times(Plus(Var("a"),Var("n")),Int("1"))
The parse tree can also be yielded back to the original source file using the tool asfix-yield. Applying this tool shows that whitespace and comments are indeed present in the parse tree, since the source is reproduced in exactly the same way as it was!
$ echo "(a + n) * 1" | sglr -p Expression.tbl -2 -fi -fe | asfix-yield
(a + n) * 1
The rest of this chapter covers, first, the basic structure of SDF; second, how syntax is defined in SDF; third, examples of lexical and context-free syntax; and fourth, a more detailed treatment of disambiguation.
In this section, we give an overview of the basic constructs of SDF. After this section, you will know the basic idea of SDF. The next sections will discuss these constructs in more detail.
Before defining some actual syntax, we have to explain the basic structure of a module. For this, let's take a closer look at the language constructs that are used in the modules we showed earlier.
module Expression
imports
Lexical Operators
exports
context-free start-symbol Exp
context-free syntax
Id -> Exp {cons("Var")}
IntConst -> Exp {cons("Int")}
"(" Exp ")" -> Exp {bracket}
module Operators
exports
sorts Exp
context-free syntax
Exp "*" Exp -> Exp {left, cons("Times")}
Exp "/" Exp -> Exp {left, cons("Div")}
Exp "%" Exp -> Exp {left, cons("Mod")}
Exp "+" Exp -> Exp {left, cons("Plus")}
Exp "-" Exp -> Exp {left, cons("Minus")}
context-free priorities
{left:
Exp "*" Exp -> Exp
Exp "/" Exp -> Exp
Exp "%" Exp -> Exp
}
> {left:
Exp "+" Exp -> Exp
Exp "-" Exp -> Exp
}
module Lexical
exports
sorts Id IntConst
lexical syntax
[a-zA-Z]+ -> Id
[0-9]+ -> IntConst
[\ \t\n] -> LAYOUT
lexical restrictions
Id -/- [a-zA-Z]
These modules illustrate some of the constructs that are important to know before we dive into the details of defining syntax.
Modules have a name, which can be a plain identifier, such as Expression in this example. The module must be in a file with the same name and the .sdf extension. The module name can also be a path, for example java/expressions/Assignment. In this case, the module has to be in a file with name Assignment.sdf, which must be in a directory java/expressions.
A module can optionally import a number of other modules. Multiple modules can be imported with a single import, as we have done in this example, or multiple imports can be used. This is not very common: usually, modules have just a single imports declaration.
Modules are always imported by their full name. So, if the name of a module is a path, such as java/expressions/Assignment, then the module must be imported with that full name, even if it is in the directory java/expressions. If the names of the modules are long, which is typically the case if you use full paths to organize a large syntax definition in different directories, then the names are usually mentioned on separate lines, but still in a single import, as illustrated below.
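For example, the imports section of a main module of a large syntax definition might look like this (the module names are hypothetical):

imports
  java/expressions/Assignment
  java/expressions/Literals
  java/statements/Blocks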
Modules contain a number of sections, of which we now only consider the exports section. An exports section defines a number of syntactic aspects, called a grammar, that will be available to modules that import this module. This includes syntax, but also declarations of sorts, start symbols, priorities, restrictions, etc.
A module can also just import other modules and not actually define any syntactical aspects itself. This is typically the case in main modules of large syntax definitions, which only import modules for names, expressions, statements, etc.
Every syntax definition needs to define one or more start symbols; otherwise not a single input will be accepted. Start symbols are the language constructs that are allowed at the top-level of a source file. For our simple expression language, the start symbol is Exp. For a Java syntax definition, the start symbol could be CompilationUnit, which consists of package, import, and type declarations.
Every syntax definition introduces names for the syntactical sorts of a language, such as Exp and Id. These names can be declared in a sorts declaration. Declaring sorts is optional, but the SDF parser generator will give a warning if a sort that is used somewhere in the syntax definition is not declared. It is a good habit to declare all the sorts, since this makes it easier to find possible misspellings of these names.
Note that in SDF syntax definitions we do not directly use the terminology of terminal and non-terminal, since actually only single characters are terminals in SDF, and almost everything else is a non-terminal. Lexical and context-free sorts are both declared as sorts.
The actual syntax is defined in lexical and context-free syntax sections. The lexical syntax defines the syntax of language constructs like literals, comments, whitespace, and identifiers, or what is usually referred to as terminals. The context-free syntax defines the syntax of constructs like operators and statements, or what is usually referred to as non-terminals.
In other parser generators the lexical syntax is often specified in a separate scanner specification, but in SDF these lexical aspects are integrated in the definition of the context-free syntax. We will come back to that later when we discuss the definition of lexical syntax.
Productions can have attributes, specified between curly braces after the production. Some of these attributes, such as left, have a special meaning for SDF itself. Other attributes can specify information about a production that targets a different tool, such as bracket and cons, which target the tool that implodes parse trees to abstract syntax trees.
SDF supports constructs to define in a declarative way that certain kinds of derivations are not allowed, also known as disambiguation filters. In our example, there are two instances of this: we define priorities of the arithmetic expressions, and there is a lexical restriction that specifies that an identifier can never be followed by a character that is allowed in an identifier. We will explain these mechanisms later.
Usually, parsing is performed in two phases. First, a lexical analysis phase splits the input into tokens, based on a grammar for the lexical syntax of the language. This lexical grammar is usually specified by a set of regular expressions that specify the tokens of the language. Second, a parser based on a grammar for the context-free syntax of the language performs the syntactic analysis. This approach has several disadvantages for certain applications, which we won't discuss in detail for now. One of the most important disadvantages is that the combination of the two grammars is not a complete, declarative definition of the syntax of the language.
SDF integrates the definition of lexical and context-free syntax in a
single formalism, thus supporting the complete description of the
syntax of a language in a single specification. All syntax, both lexical
and context-free, is defined by productions, respectively in
lexical syntax
and context-free syntax
sections. Parsing of
languages defined in SDF is implemented by scannerless generalized-LR
parsing, which operates on individual characters instead of tokens.
Expressive Power.
: Since lexical and context-free syntax are both defined by productions, there is actually no difference in the expressive power of the lexical and context-free grammar. Hence, lexical syntax can be a context-free language, instead of being restricted to a regular grammar, which is the case when using conventional lexical analysis tools based on regular expressions. For example, this means that you can define the syntax of nested comments in SDF, which we will illustrate later. In practice, more important is that it is easier to define lexical syntax using productions than using regular expressions.
Layout.
Then, why are there two different sections for defining syntax? The difference between these two kinds of syntax sections is that in lexical syntax no layout (typically whitespace and comments) is allowed between symbols. In contrast, in context-free syntax sections layout is allowed between the symbols of a production. We will explain later how layout is defined. The allowance of layout is the only difference between the two kinds of syntax sections.
Earlier we recapped context-free grammars and productions, which have the form A0 -> A1 ... An, where A0 is a non-terminal and A1 ... An is a string of terminals and non-terminals. Also, we mentioned earlier that the distinction between terminals and non-terminals is less useful in SDF, since only single characters are terminals if the lexical and context-free syntax are defined in a single formalism. For this reason, every element of a production, i.e. A0 ... An, is called a symbol. So, productions take a list of symbols and produce another symbol.
There are two primary symbols:
Sorts
: Sorts are names for language-specific constructs, such as Exp, Id, and IntConst. These names are declared using the previously introduced sorts declaration and defined by productions.
Character Classes
: A character class is a set of characters. Character classes are specified by single characters and character ranges, and can be combined using set operators, such as complement, difference, union, and intersection. Examples: [abc], [a-z], [a-zA-Z0-9], ~[\n]. We will discuss character classes in more detail in ?.
Of course, defining an entire language using productions that can
contain only sorts and character classes would be a lot of work. For
example, programming languages usually contain all kinds of list
constructs. Specification of lists with plain context-free grammars
requires several productions for each list construct. SDF provides a
bunch of regular expression operators abbreviating these common
patterns. In the following list, A represents a symbol and c a character.
"c ... c"
: Literals are strings that must literally occur in the input, such as
keywords (if
, while
, class
), literals (null
, true
) and
operators (+
, *
). Literals can be written naturally as, for
example, "while"
. Escaping of special characters will be discussed
in ?.
A*
: Zero or more symbols A
. Examples: Stm*
, [a-zA-Z]*
A+
: One or more symbols A
. Examples: TypeDec+
, [a-zA-Z]+
{A A}*
: Zero or more symbols A
separated by A
. Examples: {Exp ","}*
,
{FormalParam ","}*
{A A}+
: One or more symbols A
separated by A
. Examples: {Id "."}+
,
{InterfaceType ","}+
A?
: Optional symbol A
. Examples: Expr?
, [fFdD]?
A | A
: Alternative of symbol A
or A
. Example:
{Expr ","}* | LocalVarDec
(A ... A)
: Sequence of symbols A ... A
.
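To give a feeling for how these operators combine in practice, here is a hypothetical production for a simple function declaration; the sorts FunDec, Id, and Stm and the constructor name are assumptions for illustration, not taken from a real definition:
context-free syntax
  "function" Id "(" {Id ","}* ")" Stm* "end" -> FunDec {cons("FunDec")}
Here, {Id ","}* describes the comma-separated formal parameters and Stm* the list of body statements.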
In order to define the syntax at the level of characters, SDF provides character classes, which represent a set of characters from which one character can be recognized during parsing. The content of a character class is a specification of single characters or character ranges (c-c). Letters and digits can be written as themselves; all other characters should be escaped using a backslash, e.g. \_. Characters can also be indicated by their decimal ASCII code, e.g. \13 for linefeed. Some often-used non-printable characters have more mnemonic names, e.g., \n for newline, \ for space and \t for tab.
Character classes can be combined using set operations. The most common one is the unary complement operator ~, e.g. ~[\n]. Binary operators are the set difference /, union \/ and intersection /\.
[0-9]
: Character class for digits: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9.
[0-9a-fA-F]
: Characters typically used in hexadecimal literals.
[fFdD]
: Characters used as a floating point type suffix, typically in C-like languages.
[\ \t\12\r\n]
: Typical character class for defining whitespace. Note that SDF does not yet support \f as an escape for form feed (ASCII code 12).
[btnfr\"\'\\]
: Character class for the set of characters that are usually allowed as escape sequences in C-like programming languages.
~[\"\\\n\r]
: The set of characters that is typically allowed in string literals.
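These set operators can also be used on character classes that occur directly in productions. As a contrived sketch (the sort Consonant is made up for illustration), the following production uses the difference operator to recognize letters that are not lowercase vowels:
lexical syntax
  [a-zA-Z] / [aeiou] -> Consonant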
Until now, we have mostly discussed the design of SDF. Now, it's about time to see how all these fancy ideas for syntax definition work out in practice. In this and the next section, we will present a series of examples that explain how typical language constructs are defined in SDF. This first section covers examples of lexical syntax constructs. The next section will be about context-free syntax.
Before we can start with the examples of lexical constructs like
identifiers and literals, you need to know the basics of defining
whitespace. In SDF, layout is a special sort, called LAYOUT. To define layout, you have to define productions that produce this LAYOUT sort. Thus, to allow whitespace we can define a production that takes all whitespace characters and produces layout. Layout is lexical syntax, so we define this in a lexical syntax section.
lexical syntax
[\ \t\r\n] -> LAYOUT
We can now also reveal how context-free syntax exactly works. In context-free syntax, layout is allowed between symbols in the left-hand side of the productions, by automatically inserting optional layout (i.e. LAYOUT?) between them.
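Conceptually, this means that a context-free production such as Exp "+" Exp -> Exp is treated by the parser as if it had been written with explicit optional layout between its members:
Exp LAYOUT? "+" LAYOUT? Exp -> Exp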
In the following examples, we will assume that whitespace is always defined in this way. So, we will not repeat this production in the examples. We will come back to the details of whitespace and comments later.
Almost every language has identifiers, so we will start with that. Defining identifiers themselves is easy, but there is some more definition of syntax required, as we will see next. First, the actual definition of identifiers. As in most languages, we want to disallow digits as the first character of an identifier, so we take a little bit more restrictive character class for that first character.
lexical syntax
[A-Za-z][A-Za-z0-9]* -> Id
If a language consisted only of identifiers, then this production would do the job. Unfortunately, life is not that easy. In practice, identifiers interact with other language constructs. The best-known interaction is that most languages do not allow keywords (such as if, while, class) and special literals (such as null, true) to be used as identifiers. In SDF, keywords and special literals are not automatically preferred over identifiers. For example, consider the following, very simple expression language (for the context-free syntax we appeal to your intuition for now).
lexical syntax
[A-Za-z][A-Za-z0-9]* -> Id
"true" -> Bool
"false" -> Bool
context-free start-symbols Exp
context-free syntax
Id -> Exp {cons("Id")}
Bool -> Exp {cons("Bool")}
The input true can now be parsed as an identifier as well as a boolean literal. Since the generalized-LR parser actually supports ambiguities, we can even try this out:
$ echo "true" | sglri -p Test.tbl
amb([Bool("true"), Id("true")])
The amb term is a representation of the ambiguity. The argument of the ambiguity is a list of alternatives. In this case, the first is the boolean literal and the second is the identifier true. So, we have to define explicitly that we do not want to allow these boolean literals as identifiers. For this purpose, we can use SDF reject productions. The intuition behind reject productions is that all derivations of a symbol for which there is a reject production are forbidden. In this example, we need to add reject productions from the boolean literals to identifiers.
lexical syntax
"true" -> Id {reject}
"false" -> Id {reject}
For true, there will now be two derivations for an Id: one using the reject production and one using the real production for identifiers. Because of that reject production, all derivations will be rejected, so true is not an identifier anymore. Indeed, if we add these productions to our syntax definition, then the true literal is no longer ambiguous:
$ echo "true" | sglri -p Test.tbl
Bool("true")
We can make the definition of these reject productions a bit more concise by just reusing the Bool sort. In the same way, we can define keywords using separate production rules and have a single reject production from keywords to identifiers.
lexical syntax
Bool -> Id {reject}
Keyword -> Id {reject}
"class" -> Keyword
"if" -> Keyword
"while" -> Keyword
Scanners usually apply a longest match policy for scanning tokens. Thus, if the next character can be included in the current token, then this will always be done, regardless of the consequences after this token. In most languages, this is indeed the required behaviour, but in some languages longest match scanning actually doesn't work. Similar to not automatically reserving keywords, SDF doesn't choose the longest match by default. Instead, you need to specify explicitly that you want to recognize the longest match.
For example, suppose that we introduce two language constructs based on the previously defined Id. The following productions define two statements: a simple goto and a construct for variable declarations, where the first Id is the type and the second the variable name.
context-free syntax
Id -> Stm {cons("Goto")}
Id Id -> Stm {cons("VarDec")}
For the input foo, which is of course intended to be a goto, the parser will now happily split up the identifier foo, which results in variable declarations. Hence, this input is ambiguous.
$ echo "foo" | sglri -p Test.tbl
amb([Goto("foo"), VarDec("f","oo"), VarDec("fo","o")])
To specify that we want the longest match of an identifier, we define a follow restriction. Such a follow restriction indicates that a string of a certain symbol cannot be followed by a character from the given character class. In this way, follow restrictions can be used to encode longest match disambiguation. In this case, we need to specify that an Id cannot be followed by one of the identifier characters:
lexical restrictions
Id -/- [A-Za-z0-9]
Indeed, the input foo is no longer ambiguous and is parsed as a goto:
$ echo "foo" | sglri -p Test.tbl
Goto("foo")
In ? we explained how to reject keywords as identifiers, so we will not repeat that here. Also, we discussed how to avoid identifiers being split up. A similar split issue arises with keywords. Usually, we want to forbid a letter immediately after a keyword, but the scannerless parser will happily start a new identifier token immediately after a keyword. To illustrate this, we need to introduce a keyword, so let's make our previous goto statement a bit more clear:
context-free syntax
"goto" Id -> Stm {cons("Goto")}
To illustrate the problem, let's take the input gotox. Of course, we don't want to allow this string to be a goto, but without a follow restriction, it will actually be parsed by starting an identifier after the goto:
$ echo "gotox" | sglri -p Test.tbl
Goto("x")
The solution is to specify a follow restriction on the "goto" literal symbol.
lexical restrictions
"goto" -/- [A-Za-z0-9]
It is not possible to define the follow restriction on the Keyword sort that we introduced earlier in the reject example. The follow restriction must be defined on the symbol that literally occurs in the production, which is not the case with the Keyword symbol. However, you can specify all the symbols in a single follow restriction, separated by spaces:
lexical restrictions
"goto" "if" -/- [A-Za-z0-9]
Compared to identifiers, integer literals are usually very easy to define, since they do not really interact with other language constructs. Just to be sure, we still define a lexical restriction. The need for this restriction depends on the language in which the integer literal is used.
lexical syntax
[0-9]+ -> IntConst
lexical restrictions
IntConst -/- [0-9]
In mainstream languages, there are often several notations for integer literals, for example decimal, hexadecimal, or octal. The alternatives are then usually prefixed with one or more characters that indicate the kind of integer literal. In Java, hexadecimal numerals start with 0x and octal with a 0 (zero). For this, we have to make the definition of decimal numerals a bit more precise, since 01234 is now an octal numeral.
lexical syntax
"0" -> DecimalNumeral
[1-9][0-9]* -> DecimalNumeral
[0][xX] [0-9a-fA-F]+ -> HexaDecimalNumeral
[0] [0-7]+ -> OctalNumeral
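These alternative notations can then be combined into a single sort for integer constants; a minimal sketch, reusing the numeral sorts defined above:
lexical syntax
  DecimalNumeral -> IntConst
  HexaDecimalNumeral -> IntConst
  OctalNumeral -> IntConst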
Until now, the productions for lexical syntax have not been very complex. In some cases, the definition of lexical syntax might even seem to be more complex in SDF, since you explicitly have to define behaviour that is implicit in existing lexical analysis tools. Fortunately, the expressiveness of lexical syntax in SDF also has important advantages, even when it is applied to languages that are designed to be processed with a separate scanner. As a first example, let's take a look at the definition of floating-point literals.
Floating-point literals consist of three elements: digits, which may include a dot, an exponent, and a float suffix (e.g. f or d). There are three optional elements in float literals: the dot, the exponent, and the float suffix. But, if you leave them all out, then the floating-point literal no longer distinguishes itself from an integer literal. So, at least one of the floating-point specific elements is required. For example, valid floating-point literals are: 1.0, 1., .1, 1f, and 1e5, but invalid are: 1 and .e5. These rules are encoded in the usual definition of floating-point literals by duplicating the production rule and making different elements optional or required in each production. For example:
lexical syntax
[0-9]+ "." [0-9]* ExponentPart? [fFdD]? -> FloatLiteral
[0-9]* "." [0-9]+ ExponentPart? [fFdD]? -> FloatLiteral
[0-9]+ ExponentPart [fFdD]? -> FloatLiteral
[0-9]+ ExponentPart? [fFdD] -> FloatLiteral
[eE] SignedInteger -> ExponentPart
[\+\-]? [0-9]+ -> SignedInteger
However, in SDF we can use reject productions to reject these special cases. So, the floating-point literals themselves can be defined more naturally in a single production ?. The reject production ? defines that there should be at least one element of a floating-point literal: it rejects plain integer literals. The reject production ? defines that the digits part of a floating-point literal is not allowed to be a single dot.
lexical syntax
FloatDigits ExponentPart? [fFdD]? -> FloatLiteral
[0-9]* "." [0-9]* -> FloatDigits
[0-9]+ -> FloatDigits
[0-9]+ -> FloatLiteral {reject}
"." -> FloatDigits {reject}
Similar to defining whitespace, comments can be allowed everywhere by defining additional LAYOUT productions. In this section, we give examples of how to define several kinds of common comments.
Most languages support end-of-line comments, which start with special characters, such as //, #, or %. After that, all characters on that line are part of the comment. Defining end-of-line comments is quite easy: after the initial characters, every character except the line-terminating ones is allowed until a line terminator.
lexical syntax
"//" ~[\n]* [\n] -> LAYOUT
Block comments (i.e. /* ... */) are a bit more tricky to define, since the content of a block comment may include an asterisk (*). Let's first take a look at a definition of block comments that does not allow an asterisk in its content:
"/*" ~[\*]* "*/" -> LAYOUT
If we allow an asterisk and a slash, the sequence */ will be allowed as well. So, the parser will accept the string /* */ */ as a comment, which is not valid in C-like languages. In general, allowing this in a language would be very inefficient, since the parser can never decide where to stop a block comment. So, we need to disallow just the specific sequence of characters */ inside a comment. We can specify this using a follow restriction: an asterisk in a block comment is allowed, but it cannot be followed by a slash (/).
But, on what symbol do we specify this follow restriction? As explained earlier, we need to specify this follow restriction on a symbol that literally occurs in the production. So, we could try to allow a "*", and introduce a follow restriction on that:
lexical syntax
"/*" (~[\*] | "*")* "*/" -> LAYOUT
lexical restrictions
"*" -/- [\/]
But the symbol "*" also occurs in other productions, for example in multiplication expressions, and we do not explicitly say here that we intend to refer to the "*" in the block comment production. To distinguish the block comment asterisk from the multiplication operator, we introduce a new sort, creatively named Asterisk, for which we can specify a follow restriction that only applies to the asterisk in a block comment.
lexical syntax
"/*" (~[\*] | Asterisk)* "*/" -> LAYOUT
[\*] -> Asterisk
lexical restrictions
Asterisk -/- [\/]
To illustrate that lexical syntax in SDF can actually be context-free, we now show an example of how to implement balanced, nested block comments, i.e. a block comment that supports block comments in its content: /* /* */ */. Defining the syntax for nested block comments is quite easy, since we can just define a production that allows a block comment inside itself ?. For performance and predictability, it is important to require that the comments are balanced correctly. So, in addition to disallowing */ inside block comments, we now also have to disallow /*. For this, we introduce a Slash sort, for which we define a follow restriction ?, similar to the Asterisk sort that we discussed in the previous section.
lexical syntax
BlockComment -> LAYOUT
"/*" CommentPart* "*/" -> BlockComment
~[\/\*] -> CommentPart
Asterisk -> CommentPart
Slash -> CommentPart
BlockComment -> CommentPart
[\/] -> Slash
[\*] -> Asterisk
lexical restrictions
Asterisk -/- [\/]
Slash -/- [\*]
Context-free syntax in SDF is syntax where layout is allowed between the symbols of the productions. Context-free syntax can be defined in a natural way, thanks to the use of generalized-LR parsing, declarative disambiguation mechanisms, and an extensive set of regular expression operators. To illustrate the definition of context-free syntax, we give examples of defining expressions and statements. Most of the time will be spent on explaining the disambiguation mechanisms.
In the following sections, we will explain the details of a slightly extended version of the SDF modules in ?, shown in ?.
module Expression
imports
Lexical
exports
context-free start-symbols Exp
context-free syntax
Id -> Var
Var -> Exp {cons("Var") }
IntConst -> Exp {cons("Int") }
"(" Exp ")" -> Exp {bracket }
"-" Exp -> Exp {cons("UnaryMinus")}
Exp "*" Exp -> Exp {cons("Times"), left }
Exp "/" Exp -> Exp {cons("Div"), left}
Exp "%" Exp -> Exp {cons("Mod"), left}
Exp "+" Exp -> Exp {cons("Plus") , left}
Exp "-" Exp -> Exp {cons("Minus"), left}
Exp "=" Exp -> Exp {cons("Eq"), non-assoc }
Exp ">" Exp -> Exp {cons("Gt"), non-assoc}
context-free priorities
"-" Exp -> Exp
> {left:
Exp "*" Exp -> Exp
Exp "/" Exp -> Exp
Exp "%" Exp -> Exp
}
> {left:
Exp "+" Exp -> Exp
Exp "-" Exp -> Exp
}
> {non-assoc:
Exp "=" Exp -> Exp
Exp ">" Exp -> Exp
}
First, it is about time to explain the constructor attribute, cons, which was already briefly mentioned in ?. In the example expression language, most productions have a constructor attribute, for example ?, ? and ?, but some do not, for example ? and ?.
The cons attribute does not have any actual meaning in the definition of the syntax, i.e. the presence or absence of a constructor does not affect the syntax that is defined in any way. The constructor only serves to specify the name of the abstract syntax tree node that is to be constructed if that production is applied. In this way, the cons attribute ? of the production for integer literals defines that an Int node should be produced for that production:
$ echo "1" | sglri -p Test.tbl
Int("1")
Note that this Int constructor takes a single argument, a string, which is the text of the integer literal. This argument of Int is a string because the production for IntConst is defined in lexical syntax and all derivations from lexical syntax productions are represented as strings, i.e. without structure. As another example, the production for addition has a Plus constructor attribute ?. This production has three symbols on the left-hand side, but the constructor takes only two arguments, since literals are not included in the abstract syntax tree.
$ echo "1+2" | sglri -p Test.tbl
Plus(Int("1"),Int("2"))
However, there are also productions that have no cons attribute, i.e. ? and ?. The production ? from Id to Var is called an injection, since it does not involve any additional syntax. Injections don't need to have a constructor attribute. If it is left out, then the application of the production will not produce a node in the abstract syntax tree. Example:
$ echo "x" | sglri -p Test.tbl
Var("x")
The production ?, on the other hand, does involve additional syntax, but does not have a constructor. In this case, the bracket attribute should be used to indicate that this is a symbol between brackets, which should be literals. Like the constructor attribute, the bracket attribute does not affect the syntax of the language. Hence, the parentheses in the following example do not introduce a node, and the Plus is a direct subterm of Times.
$ echo "(1 + 2) * 3" | sglri -p Test.tbl
Times(Plus(Int("1"),Int("2")),Int("3"))
Conventions.
In Stratego/XT, constructors are by convention CamelCase. Constructors may be overloaded, i.e. the same name can be used for several productions, but be careful with this feature: it might make it more difficult to distinguish the several cases for some tools. Usually, constructors are not overloaded for productions with the same number of arguments (arity).
Exp "+" Exp -> Exp {cons("Plus")}
Exp "-" Exp -> Exp {cons("Minus")}
Exp "*" Exp -> Exp {cons("Mul")}
Exp "/" Exp -> Exp {cons("Div")}
Syntax definitions that use only a single non-terminal for expressions are highly ambiguous. ? shows the basic arithmetic operators defined in this way. For every combination of operators, there are now multiple possible derivations. For example, the string a+b*c has two possible derivations, which we can actually see because of the use of a generalized-LR parser:
$ echo "a + b * c" | sglri -p Test3.tbl | pp-aterm
amb(
[ Times(Plus(Var("a"), Var("b")), Var("c"))
, Plus(Var("a"), Times(Var("b"), Var("c")))
]
)
These ambiguities can be solved by using the associativities and priorities of the various operators to disallow undesirable derivations. For example, of the derivations of a + b * c we usually want to disallow the second one, where the multiplication binds more weakly than the addition operator. In plain context-free grammars the associativity and priority rules of a language can be encoded in the syntax definition by introducing separate non-terminals for all the priority levels and letting every argument of a production refer to such a specific priority level. ? shows how the usual priorities and associativity of the operators of the example can be encoded in this way. For example, this definition will never allow an AddExp as an argument of a MulExp, which implies that * binds more strongly than +. Also, AddExp can only occur at the left-hand side of an AddExp, which makes the operator left-associative.
This way of dealing with associativity and priorities has several disadvantages. First, the disambiguation is not natural, since it is difficult to derive the more high-level rules of priorities and associativity from this definition. Second, it is difficult to define expressions in a modular way, since the levels need to be known and new operators might affect the carefully crafted productions for the existing ones. Third, due to all the priority levels and the productions that connect these levels, the parse trees are more complex and parsing is less efficient. For these reasons SDF features a more declarative way of defining associativity and priorities, which we discuss in the next section.
AddExp -> Exp
MulExp -> AddExp
AddExp "+" MulExp -> AddExp {cons("Plus")}
AddExp "-" MulExp -> AddExp {cons("Minus")}
PrimExp -> MulExp
MulExp "*" PrimExp -> MulExp {cons("Mul")}
MulExp "/" PrimExp -> MulExp {cons("Div")}
IntConst -> PrimExp {cons("Int")}
Id -> PrimExp {cons("Var")}
Warning
This chapter is work in progress. Not all parts have been finished yet. The latest revision of this manual may contain more material. Refer to the online version.
In order to support natural syntax definitions, SDF provides several declarative disambiguation mechanisms. Associativity declarations (left, right, non-assoc) disambiguate combinations of a binary operator with itself and with other operators. Thus, the left associativity of + entails that a+b+c is parsed as (a+b)+c. Priority declarations (>) declare the relative priority of productions. A production with lower priority cannot be a direct subtree of a production with higher priority. Thus a+b*c is parsed as a+(b*c), since the other parse (a+b)*c has a conflict between the * and + productions.
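As a minimal sketch of both mechanisms together (the sorts and constructor names are assumed for illustration), the following fragment makes both operators left-associative and gives multiplication a higher priority than addition:
context-free syntax
  Exp "+" Exp -> Exp {cons("Plus"), left}
  Exp "*" Exp -> Exp {cons("Times"), left}
context-free priorities
  Exp "*" Exp -> Exp
> Exp "+" Exp -> Exp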
...
> Exp "&" Exp -> Exp
> Exp "^" Exp -> Exp
> Exp "|" Exp -> Exp
> Exp "&&" Exp -> Exp
> Exp "||" Exp -> Exp
> ...
Solution: this is usually handled by introducing an auxiliary non-terminal, as the following productions illustrate.
context-free syntax
"new" ArrayBaseType DimExp+ Dim* -> ArrayCreationExp {cons("NewArray")}
Exp "[" Exp "]" -> Exp {cons("ArrayAccess")}
ArrayCreationExp "[" Exp "]" -> Exp {reject}
? illustrates the use of these operators in the extension of the expression language with statements and function declarations. Lists are used in numerous places, such as for the sequential composition of statements (Seq), the declarations in a let binding, and the formal and actual arguments of a function (FunDec and Call). An example function definition in this language is:
module Statements
imports Expressions
exports
sorts Dec FunDec
context-free syntax
Var ":=" Exp -> Exp {cons("Assign")}
"(" {Exp ";"}* ")" -> Exp {cons("Seq")}
"if" Exp "then" Exp "else" Exp -> Exp {cons("If")}
"while" Exp "do" Exp -> Exp {cons("While")}
"let" Dec* "in" {Exp ";"}* "end" -> Exp {cons("Let")}
"var" Id ":=" Exp -> Dec {cons("VarDec")}
FunDec+ -> Dec {cons("FunDecs")}
"function" Id "(" {Id ","}* ")"
"=" Exp -> FunDec {cons("FunDec")}
Var "(" {Exp ","}* ")" -> Exp {cons("Call")}
context-free priorities
{non-assoc:
Exp "=" Exp -> Exp
Exp ">" Exp -> Exp}
> Var ":=" Exp -> Exp
> {right:
"if" Exp "then" Exp "else" Exp -> Exp
"while" Exp "do" Exp -> Exp}
> {Exp ";"}+ ";" {Exp ";"}+ -> {Exp ";"}+
context-free start-symbols Exp
function fact(n, x) =
if n > 0 then fact(n - 1, n * x) else x
context-free restrictions
LAYOUT? -/- [\ \t\12\n\r]
Why follow restrictions on whitespace are necessary
context-free restrictions
LAYOUT? -/- [\/].[\*]
LAYOUT? -/- [\/].[\/]
Parse-unit is a tool, part of Stratego/XT, for testing SDF syntax definitions. The spirit of unit testing is implemented in parse-unit by allowing you to check that small code fragments are parsed correctly with your syntax definition.
In a parse testsuite you can define tests with an input and an expected result. You can specify that a test should succeed (succeeds, for lazy people), fail (fails) or that the abstract syntax tree should have a specific format. The input can be an inline string or the contents of a file for larger tests.
Assuming the following grammar for simple arithmetic expressions:
module Exp
exports
context-free start-symbols Exp
sorts Id IntConst Exp
lexical syntax
[\ \t\n] -> LAYOUT
[a-zA-Z]+ -> Id
[0-9]+ -> IntConst
context-free syntax
Id -> Exp {cons("Var")}
IntConst -> Exp {cons("Int")}
Exp "*" Exp -> Exp {left, cons("Mul")}
Exp "/" Exp -> Exp {left, cons("Div")}
Exp "%" Exp -> Exp {left, cons("Mod")}
Exp "+" Exp -> Exp {left, cons("Plus")}
Exp "-" Exp -> Exp {left, cons("Minus")}
context-free priorities
{left:
Exp "*" Exp -> Exp
Exp "/" Exp -> Exp
Exp "%" Exp -> Exp
}
> {left:
Exp "+" Exp -> Exp
Exp "-" Exp -> Exp
}
You could define the following parse testsuite in a file expression.testsuite:
testsuite Expressions
topsort Exp
test simple int literal
"5" -> Int("5")
test simple addition
"2 + 3" -> Plus(Int("2"), Int("3"))
test addition is left associative
"1 + 2 + 3" -> Plus(Plus(Int("1"), Int("2")), Int("3"))
test
"1 + 2 + 3" succeeds
test multiplication has higher priority than addition
"1 + 2 * 3" -> Plus(Int("1"), Mul(Int("2"), Int("3")))
test
"x" -> Var("x")
test
"x1" -> Var("x1")
test
"x1" fails
test
"1 * 2 * 3" -> Mul(Int("1"), Mul(Int("2"), Int("3")))
Running this parse testsuite with:
$ parse-unit -i expression.testsuite -p Exp.tbl
will output:
-----------------------------------------------------------------------
executing testsuite Expressions with 9 tests
-----------------------------------------------------------------------
* OK : test 1 (simple int literal)
* OK : test 2 (simple addition)
* OK : test 3 (addition is left associative)
* OK : test 4 (1 + 2 + 3)
* OK : test 5 (multiplication has higher priority than addition)
* OK : test 6 (x)
sglr: error in g_0.tmp, line 1, col 2: character `1' (\x31) unexpected
* ERROR: test 7 (x1)
- parsing failed
- expected: Var("x1")
sglr: error in h_0.tmp, line 1, col 2: character `1' (\x31) unexpected
* OK : test 8 (x1)
* ERROR: test 9 (1 * 2 * 3)
- succeeded: Mul(Mul(Int("1"),Int("2")),Int("3"))
- expected: Mul(Int("1"),Mul(Int("2"),Int("3")))
-----------------------------------------------------------------------
results testsuite Expressions
successes : 7
failures : 2
-----------------------------------------------------------------------
You cannot escape special characters because there is no need to escape them. The idea of the testsuite syntax is that test inputs typically contain many special characters, so these should not be special in the testsuite syntax and should not need escaping.
Anyhow, you still need some mechanism to make it clear where the test input stops. Therefore the testsuite syntax supports several quotation symbols. Currently you can choose from: ", "", """, and [, [[, [[[. Typically, if you need a double quote in your test input, then you use the [.
Parse-unit has an option to parse a single test and write the result to the output. In this mode ambiguities are accepted, which is useful for debugging. The option for this 'single test mode' is --single nr, where nr is the number of the test in the testsuite (printed when the testsuite is executed). The --asfix2 flag can be used to produce an asfix2 parse tree instead of an abstract syntax tree.
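For example, to re-run only the fifth test of the testsuite above in single test mode (assuming the same parse table as before):
$ parse-unit -i expression.testsuite -p Exp.tbl --single 5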
The following make rule can be used to invoke parse-unit from your build system.
%.runtestsuite : %.testsuite
$(SDF_TOOLS)/bin/parse-unit -i $< -p $(PARSE_UNIT_PTABLE) --verbose 1 -o /dev/null
A typical Makefile.am for testing your syntax definitions looks like:
EXTRA_DIST = $(wildcard *.testsuite)
TESTSUITES = \
expressions.testsuite \
identifiers.testsuite
PARSE_UNIT_PTABLE = $(top_srcdir)/syn/Foo.tbl
installcheck-local: $(TESTSUITES:.testsuite=.runtestsuite)
Warning
This chapter is work in progress. Not all parts have been finished yet. The latest revision of this manual may contain more material. Refer to the online version.
Import a module and rename symbols in the imported definition.
Not very common, but heavily used for combining syntax definitions of different languages. See concrete object syntax.
Modules can have formal parameters.
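As a minimal sketch of an import with renaming (the module and sort names are hypothetical, chosen only for illustration), an importing module can rename a sort from the imported definition as follows:
module Java-Borrowed
imports Java [ Exp => JavaExp ]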
Warning
This chapter is work in progress. Not all parts have been finished yet. The latest revision of this manual may contain more material. Refer to the online version.
Explains regular tree grammars and how to generate them. See thesis.
Typematch, API.
Explains rtg-script
Format-check analyzes whether the input ATerm conforms to the format that is specified in the RTG (Regular Tree Grammar).
Format-check verifies that the input ATerm is part of the language defined in the RTG. If this is not the case, then the ATerm contains format errors. Format-check can operate in three modes: plain, visualize and XHTML.
The plain mode is used if the other modes are not enabled. In plain mode format errors are reported and no result is written to the output (stdout or a file). Hence, if format-check is included in a pipeline, then the following tool will probably fail. If the input term is correct, then it is written to the output.
The visualize mode is enabled with the --vis option. In visualize mode format errors are reported and a pretty-printed ATerm will be written to the output. All innermost parts of the ATerm that cause format errors are printed in red, if your terminal supports control characters for colors. If you want to browse through the ATerm with less, then you should use the -r flag.
The XHTML mode is enabled with the --xhtml option. In XHTML mode format errors are reported and a report in XHTML will be written to the output. This report shows the parts of the ATerm that are not formatted correctly. Also, moving with your mouse over the nodes of the ATerm will show the non-terminals that have been inferred by format-check (do not use IE6; Firefox or Mozilla is recommended).
Format-check reports all innermost format errors. That is, only the deepest format errors are reported. A format error is reported by showing the ATerm that is not in the correct format, and the inferred types of the children of that ATerm. In XHTML and visualize mode a format error of a term in a list is presented by a red comma and term. This means that a type has been inferred for the term itself, but that it is not expected at this point in the list. If only the term is red, then no type could be inferred for the term itself.
In all modes format-check succeeds (exit code 0) if the ATerm contains no format errors. If the term does contain format errors, then format-check fails (exit code 1).
Consider the RTG generated in the example of sdf2rtg:
regular tree grammar
start Exp
productions
Exp -> Minus(Exp,Exp)
Exp -> Plus(Exp,Exp)
Exp -> Mod(Exp,Exp)
Exp -> Div(Exp,Exp)
Exp -> Mul(Exp,Exp)
Exp -> Int(IntConst)
Exp -> Var(Id)
IntConst -> <string>
Id -> <string>
The ATerm
Plus(Var("a"), Var("b"))
is part of the language defined by this RTG. Therefore, format-check succeeds:
$ format-check --rtg Exp.rtg -i exp1.trm
Plus(Var("a"),Var("b"))
$
Note that format-check outputs the input term in this case. In this way format-check can be used in a pipeline of tools. On the other hand, the ATerm
Plus(Var("a"), Var(1))
is not part of the language defined by this RTG. Therefore, the invocation of format-check fails. Format-check also reports which term caused the failure:
$ format-check --rtg Exp.rtg -i exp2.trm
error: cannot type Var(1)
inferred types of subterms:
typed 1 as <int>
$
In large ATerms it might be difficult to find the incorrect subterm. To help with this, format-check supports the --vis argument. If this argument is used, then the result will be pretty-printed (in the same way as pp-aterm) and the incorrect parts will be printed in red.
For example, consider this term:
Plus(Mul(Int(1), Var("a")), Minus(Var("b"), Div(1, Var("c"))))
format-check will visualize the errors in this ATerm by printing the incorrect parts in red.
The XHTML mode shows the errors in red as well. Moreover, if you move your mouse over the terms, then you'll see the inferred types of the terms.
$ format-check --rtg Exp.rtg -i exp3.trm --xhtml -o foo.html
You can now view the resulting file in your browser. You need a decent browser (Firefox or Mozilla; no IE).
The abstract syntax of a programming language or data format can be described by means of an algebraic signature. A signature declares for each constructor its arity m, the sorts of its arguments S1 * ... * Sm, and the sort of the resulting term S0, by means of a constructor declaration c:S1 * ... * Sm -> S0. A term can be validated against a signature by a format checker.
Signatures can be derived automatically from syntax definitions. For each production A1...An -> A0 {cons(c)} in a syntax definition, the corresponding constructor declaration is c:S1 * ... * Sm -> S0, where the Si are the sorts corresponding to the symbols Aj after leaving out literals and layout sorts. ? shows the signatures of statement and expression constructors for the example language from this chapter. The modules have been derived automatically from the syntax definitions in ? and ?.
module Statements
signature
constructors
FunDec : Id * List(Id) * Exp -> FunDec
FunDecs : List(FunDec) -> Dec
VarDec : Id * Exp -> Dec
Call : Var * List(Exp) -> Exp
Let : List(Dec) * List(Exp) -> Exp
While : Exp * Exp -> Exp
If : Exp * Exp * Exp -> Exp
Seq : List(Exp) -> Exp
Assign : Var * Exp -> Exp
Gt : Exp * Exp -> Exp
Eq : Exp * Exp -> Exp
Minus : Exp * Exp -> Exp
Plus : Exp * Exp -> Exp
Times : Exp * Exp -> Exp
UnaryMinus : Exp -> Exp
Int : IntConst -> Exp
: Var -> Exp
Var : Id -> Var
: String -> IntConst
: String -> Id
signature
constructors
Some : a -> Option(a)
None : Option(a)
signature
constructors
Cons : a * List(a) -> List(a)
Nil : List(a)
Conc : List(a) * List(a) -> List(a)
sdf2rtg -i m.def -o m.rtg -m M
: sdf2rtg derives from an SDF syntax definition m.def a regular tree grammar for the module M and all modules it imports.
rtg2sig
: rtg2sig generates from a regular tree grammar a Stratego signature.
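A typical sequence of invocations then looks as follows; the file names are hypothetical, and we assume rtg2sig follows the standard -i/-o conventions of the XT tools:
$ sdf2rtg -i Exp.def -o Exp.rtg -m Exp
$ rtg2sig -i Exp.rtg -o Exp-sig.str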
Warning
This chapter is work in progress. Not all parts have been finished yet. The latest revision of this manual may contain more material. Refer to the online version.
The GPP package is a tool suite for generic pretty-printing. GPP supports pretty-printing of parse-trees in the AsFix format with comment preservation and of abstract syntax trees. GPP supports the output formats plain text, LaTeX, and HTML. Formattings are defined in pretty print tables, which can be generated from SDF syntax definitions.
The Box language is used in the GPP framework as a language-independent intermediate representation. The input language dependent parts and the output format dependent parts of GPP are connected through this intermediate representation.
Box is a mark-up language to describe the intended layout of text and is used in pretty-print tables. A Box expression is constructed by composing sub-boxes using Box operators. These operators specify the relative ordering of boxes. Examples of Box operators are the H and V operators, which format boxes horizontally and vertically, respectively.
The exact formatting of each Box operator can be customized using Box options. For example, to control the horizontal layout between boxes the H operator supports the hs space option.
For a detailed description of Box (including a description of all available Box operators) we refer to To Reuse Or To Be Reused (Chapter 4).
Pretty-print tables are used to specify how language constructs have to be pretty-printed. Pretty-print tables are used in combination with GPP front-ends, such as ast2abox.
Pretty-print tables use Box as language to specify formatting of language constructs. A pretty-print table contains mappings from constructor names to Box expressions. For example, for the SDF production
Exp "+" Exp -> Exp {cons("Plus")}
A pretty-print entry looks like:
Plus -- H hs=1 [ _1 "+" _2]
Pretty-print tables are ordered such that pretty-print rules occurring first take precedence over overlapping pretty-print rules defined later. The syntax of pretty-print tables is available in GPP.
Pretty-print tables can be generated from SDF syntax definitions using ppgen. Generated pretty-print rules can easily be customized by overruling them in additional pretty-print tables. The tool pptable-diff notifies inconsistencies in pretty-print tables after the syntax definition has changed and can be used to bring inconsistent tables up-to-date.
To be able to specify formattings for all nested constructs that are allowed in SDF productions, so-called selectors are used in pretty-print tables to refer to specific parts of an SDF production and to define a formatting for them. For example, the SDF production
"return" Exp? ";" -> Stm {cons("Return")}
contains a nested symbol Exp?. To specify a formatting for this production, two pretty-print entries can be used:
Return -- H [ KW["return"] _1 ";" ]
Return.1:opt -- H [ _1 ]
A selector consists of a constructor name followed by a list of number+type tuples. A number selects a particular subtree of a constructor application; the type denotes the type of the selected construct (sequence, optional, separated list, etc.). This rather verbose selector mechanism allows unambiguous selection of subtrees. Its verbosity (specifying both the number of a subtree and its type) makes pretty-print tables easier to understand and, more importantly, it enables pretty-printing of ASTs, because with type information, a concrete term can be correctly reconstructed from an AST. Pretty-print tables can thus be used for formatting of both parse trees and ASTs.
Below we summarize which selector types are available:
opt
: For SDF optionals S?
iter
: For non-empty SDF lists S+
iter-star
: For possibly empty SDF lists S*
iter-sep
: For SDF separator lists {S S}+. Observe that the symbol S and the separator S are ordinary subtrees of the iter-sep construct, which can be referred to as first and second subtree, respectively.
iter-star-sep
: For SDF separator lists {S S}*. Its symbol S and separator S can be referred to as first and second subtree.
alt
: For SDF alternatives S | S | S. According to the SDF syntax, alternatives are binary operators. The pretty-printer flattens all subsequent alternatives. Pretty-print rules can be specified for each alternative individually by specifying the number of each alternative. To be able to format literals in alternatives, a special formatting rule can be defined for the construct (see the examples below).
seq
: For SDF sequences (S S S).
Below we list a simple SDF module with productions containing all above rule selectors.
module Symbols
exports
context-free syntax
A? -> S {cons("ex1")}
A+ -> S {cons("ex2")}
A* -> S {cons("ex3")}
{S1 S2}+ -> S {cons("ex4")}
{S1 S2}* -> S {cons("ex5")}
"one" | "two" | S? -> S {cons("ex6")}
("one" "two" S? ) -> S {cons("ex7")}
The following pretty-print table shows which pretty-print rules can be defined for this syntax:
[
ex1 -- _1,
ex1.1:opt -- _1,
ex2 -- _1,
ex2.1:iter -- _1,
ex3 -- _1,
ex3.1:iter-star -- _1,
ex4 -- _1,
ex4.1:iter-sep -- _1 _2,
ex5 -- _1,
ex5.1:iter-star-sep -- _1 _2,
ex6 -- _1,
ex6.1:alt -- KW["one"] KW["two"] _1,
ex6.1:alt.1:opt -- _1,
ex7 -- _1,
ex7.1:seq -- KW["one"] KW["two"] _1,
ex7.1:seq.1:opt -- _1
]
The pretty-print rule ex6.1:alt is a special case. It contains three Box expressions, one for each alternative. It is used to specify a formatting for the non-nested literals "one" and "two". During pretty-printing, one of the three Box expressions is selected, depending on which alternative contains the term being formatted.
In this section, we explain how you can generate a tool that restores all parentheses at the places where necessary according to the priorities and associativities of a language.
In this section, we explain how you can use Box in Stratego to implement a pretty-printer by hand.
After transformation, an abstract syntax tree should be turned into text again to be useful as a program. Mapping a tree into text can be seen as the inverse of parsing, and is thus called unparsing. When an unparser makes an attempt at producing human-readable, instead of just compiler-parsable, program text, it is called a pretty-printer. We use the pretty-printing model as provided by the Generic Pretty-Printing package GPP. In this model a tree is unparsed to a Box expression, which contains text with markup for pretty-printing. A Box expression can be interpreted by different back-ends to produce text for different displaying devices, such as plain ASCII text, HTML, and LaTeX.
Unparsing is the inverse of parsing composed with abstract syntax tree composition. That is, an unparser turns an abstract syntax tree into a string, such that if the resulting string is parsed again, it produces the same abstract syntax tree.
An unparser can be organized in two phases. In the first phase, each node in an abstract syntax tree is replaced with the concrete syntax tree of the corresponding grammar production. In the second phase, the strings at the leaves of the tree are concatenated into a string. ? illustrates this process. The replacement of abstract syntax tree nodes by concrete syntax patterns should be done according to the productions of the syntax definition. An unparsing table is an abstraction of a syntax definition defining the inverse mapping from constructors to concrete syntax patterns. An entry c -- s1 ... sn defines a mapping for constructor c to the sequence s1 ... sn, where each s_i is either a literal string or a parameter _i referring to the ith argument of the constructor. ? shows an unparsing table for some expression and statement constructors. Applying an unparsing mapping to an abstract syntax tree results in a tree structure with strings at the leaves, as illustrated in ?.
f(a + 10) - 3
[
Var -- _1,
Int -- _1,
Plus -- _1 "+" _2,
Minus -- _1 "-" _2,
Assign -- _1 ":=" _2,
Seq -- "(" _1 ")",
Seq.1:iter-star-sep -- _1 ";",
If -- "if" _1 "then" _2 "else" _3,
Call -- _1 "(" _2 ")",
Call.2:iter-star-sep -- _1 ","
]
Although the unparse of an abstract syntax tree is a text that can be parsed by a compiler, it is not necessarily a readable text. A pretty-printer is an unparser attempting to produce readable program text. A pretty-printer can be obtained by annotating the entries in an unparsing table with markup instructing a typesetting process. ? illustrates this process.
Box is a target-independent formatting language, providing combinators for declaring the two-dimensional positioning of boxes of text. Typical combinators are H[b_1...b_n], which combines the b_i boxes horizontally, and V[b_1...b_n], which combines the b_i boxes vertically. ? shows a pretty-print table with Box markup. A more complete overview of the Box language and the GPP tools can be found in ?.
f(a + 10) - 3
[
Var -- _1,
Int -- _1,
Plus -- H[_1 "+" _2],
Minus -- H[_1 "-" _2],
Assign -- H[_1 ":=" _2],
Seq -- H hs=0["(" V[_1] ")"],
Seq.1:iter-star-sep -- H hs=0[_1 ";"],
If -- V[V is=2[H["if" _1 "then"] _2]
V is=2["else" _3]],
Call -- H hs=0[_1 "(" H[_2] ")"],
Call.2:iter-star-sep -- H hs=0[_1 ","]
]
If(Eq(Var("n"),Int("1")),
Int("0"),
Times(Var("n"),
Call(Var("fac"),[Minus(Var("n"),Int("1"))])))
if n = 1 then
0
else
n * fac(n - 1)
Note: correct pretty-printing of an abstract syntax tree requires that it contains nodes representing parentheses in the right places. Otherwise, reparsing a pretty-printed string might get a different interpretation. The sdf2parenthesize tool generates from an SDF definition a Stratego program that places parentheses at the necessary places in the tree.
Ppgen generates from an SDF syntax definition a pretty-print table with an entry for each context-free syntax production with a constructor annotation. Typically it is necessary to edit the pretty-print table to add appropriate Box markup to the entries. The result should be saved under a different name to avoid overwriting it.
ast2abox -p m.pp -i file.ast -o file.abox
: ast2abox maps an abstract syntax tree file.ast to an abstract syntax representation file.abox of a Box term, based on a pretty-print table m.pp.
abox2text -i file.abox -o file.txt
: abox2text formats a Box term as ASCII text.
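Since the XT tools read from stdin and write to stdout by default, the two steps compose naturally in a pipeline; a sketch, assuming a pretty-print table Exp.pp and a tree prog.ast (both file names hypothetical):
$ ast2abox -p Exp.pp -i prog.ast | abox2text -o prog.txt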
pp-aterm -i file1.trm -o file2.trm
: pp-aterm formats an ATerm as an ATerm in text format, adding newlines and indentation to make the structure of the term understandable. This is a useful tool to inspect terms while debugging transformations.
term-to-dot -i file.trm -o file.dot (--tree | --graph)
: Term-to-dot is another visualization tool for terms that creates a dot graph representation, which can be visualized using the dot tool from the graphviz graph layout package. Term-to-dot can produce an expanded tree view (--tree), or a directed acyclic graph view (--graph) preserving the maximal sharing in the term. This tool was used to produce the tree visualizations in this chapter. This tool is not part of the Stratego/XT distribution, but is included in the Stratego/XT Utilities package.
Note
This manual is being written for and with Stratego 0.16; you are advised to install the latest milestone for this release. See the download page at stratego-language.org.
Program transformation is the mechanical manipulation of a program in order to improve it relative to some cost function C such that C(P) > C(tr(P)), i.e. the cost decreases as a result of applying the transformation. The cost of a program can be measured in different dimensions such as performance, memory usage, understandability, flexibility, maintainability, portability, correctness, or satisfaction of requirements. Related to these goals, program transformations are applied in different settings; e.g. compiler optimizations improve performance and refactoring tools aim at improving understandability. While transformations can be achieved by manual manipulation of programs, in general, the aim of program transformation is to increase programmer productivity by automating programming tasks, thus enabling programming at a higher-level of abstraction, and increasing maintainability and re-usability of programs. Automatic application of program transformations requires their implementation in a programming language. In order to make the implementation of transformations productive such a programming language should support abstractions for the domain of program transformation.
Stratego is a language designed for this purpose. It is a language based on the paradigm of rewriting with programmable rewriting strategies and dynamic rules.
Transformation by Rewriting.
Term rewriting is an attractive formalism for expressing basic program transformations. A rewrite rule p1 -> p2 expresses that a program fragment matching the left-hand side pattern p1 can be replaced by the instantiation of the right-hand side pattern p2. For instance, the rewrite rule
|[ i + j ]| -> |[ k ]| where <add>(i, j) => k
expresses constant folding for addition, i.e. replacing an addition of two constants by their sum. Similarly, the rule
|[ if 0 then e1 else e2 ]| -> |[ e2 ]|
defines unreachable code elimination by reducing a conditional statement to its right branch since the left branch can never be executed. Thus, rewrite rules can directly express laws derived from the semantics of the programming language, making the verification of their correctness straightforward. A correct rule can be safely applied anywhere in a program. A set of rewrite rules can be directly operationalized by rewriting to normal form, i.e. exhaustive application of the rules to a term representing a program. If the rules are confluent and terminating, the order in which they are applied is irrelevant.
Limitations of Pure Rewriting.
However, there are two limitations to the application of standard term rewriting techniques to program transformation: the need to intertwine rules and strategies in order to control the application of rewrite rules and the context-free nature of rewrite rules.
Transformation Strategies.
Exhaustive application of all rules to the entire abstract syntax tree of a program is not adequate for most transformation problems. The system of rewrite rules expressing basic transformations is often non-confluent and/or non-terminating. An ad hoc solution that is often used is to encode control over the application of rules into the rules themselves by introducing additional function symbols. This intertwining of rules and strategies obscures the underlying program equalities, incurs a programming penalty in the form of rules that define a traversal through the abstract syntax tree, and disables the reuse of rules in different transformations.
Stratego solves the problem of control over the application of rules while maintaining the separation of rules and strategies. A strategy is a little program that makes a selection from the available rules and defines the order and position in the tree for applying the rules. Thus rules remain pure, are not intertwined with the strategy, and can be reused in multiple transformations.
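As a small sketch of this separation (the rule and strategy names are hypothetical, not from a real transformation), a constant-folding rule can be kept apart from the strategy that applies it exhaustively:
rules
  EvalPlus : Plus(Int(i), Int(j)) -> Int(k) where <addS> (i, j) => k
strategies
  eval = innermost(EvalPlus)
The rule expresses a pure program equality, while innermost determines where and in what order it is applied.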
Context-Sensitive Transformation.
The second problem of rewriting is the context-free nature of rewrite rules. A rule has access only to the term it is transforming. However, transformation problems are often context-sensitive. For example, when inlining a function at a call site, the call is replaced by the body of the function in which the actual parameters have been substituted for the formal parameters. This requires that the formal parameters and the body of the function are known at the call site, but these are only available higher-up in the syntax tree. There are many similar problems in program transformation, including bound variable renaming, typechecking, data flow transformations such as constant propagation, common-subexpression elimination, and dead code elimination. Although the basic transformations in all these applications can be expressed by means of rewrite rules, these require contextual information.
Outline.
The following chapters give a tutorial for the Stratego language in which these ideas are explained and illustrated. The first three chapters outline the basic ideas of Stratego programming in bold strokes. ? introduces the term notation used to represent source programs in Stratego. ? shows how to set up, compile, and use a Stratego program. ? introduces term rewrite rules and term rewriting. ? argues the need for control over the application of rewrite rules and introduces strategies for achieving this.
The rest of the chapters in this tutorial explain the language in more detail. ? examines named rewrite rules, defines the notion of a rewriting strategy, and explains how to create reusable strategy expressions using definitions. ? introduces combinators for the combination of simple transformations into more complex transformations. ? re-examines the notion of a rule, and introduces the notions of building and matching terms, which provide the core to all manipulations of terms. It then goes on to show how these actions can be used to define higher-level constructs for expressing transformations, such as rewrite rules. ? introduces the notion of data-type specific and generic traversal combinators. ? shows how to generically define program analyses using type-unifying strategies. ? shows how to use the syntax of the source language in the patterns of transformation rules. Finally, ? introduces the notion of dynamic rules for expressing context-sensitive transformations.
Stratego programs transform terms. When using Stratego for program transformation terms typically represent the abstract syntax tree of a program. But Stratego does not much care what a term represents. Terms can just as well represent structured documents, software models, or anything else that can be rendered in a structured format. The chapters in ? show how to transform a program text into a term by means of parsing, and to turn a term into program text again by means of pretty-printing. From now on we will just assume that we have terms that should be transformed and ignore parsing and pretty-printing.
Terms in Stratego are terms in the Annotated Term Format, or ATerms for short. The ATerm format provides a set of constructs for representing trees, comparable to XML or abstract data types in functional programming languages. For example, the code 4 + f(5 * x) might be represented in a term as:
Plus(Int("4"), Call("f", [Mul(Int("5"), Var("x"))]))
ATerms are constructed from the following elements:
Integer
: An integer constant, that is a list of decimal digits, is an ATerm. Examples: 1, 12343.
String
: A string constant, that is a list of characters between double quotes, is an ATerm. Special characters such as double quotes and newlines should be escaped using a backslash. The backslash character itself should be escaped as well. Examples: "foobar", "string with quotes\"", "escaped escape character\\ and a newline\n".
Constructor application
: A constructor is an identifier, that is an alphanumeric string starting with a letter, or a double-quoted string. A constructor application c(t1,...,tn) creates a term by applying a constructor to a list of zero or more terms. For example, the term Plus(Int("4"),Var("x")) uses the constructors Plus, Int, and Var to create a nested term from the strings "4" and "x". When a constructor application has no subterms the parentheses may be omitted. Thus, the term Zero is equivalent to Zero(). Some people consider it good style to explicitly write the parentheses for nullary terms in Stratego programs. Through this rule, it is clear that a string is really a special case of a constructor application.
List
: A list is a term of the form [t1,...,tn], that is a list of zero or more terms between square brackets. While all applications of a specific constructor typically have the same number of subterms, lists can have a variable number of subterms. The elements of a list are typically of the same type, while the subterms of a constructor application can vary in type. Example: the second argument of the call to "f" in the term Call("f",[Int("5"),Var("x")]) is a list of expressions.
Tuple
: A tuple (t1,...,tn) is a constructor application without constructor. Example: (Var("x"), Type("int"))
Annotation
: The elements defined above are used to create the structural part of terms. Optionally, a term can be annotated with a list of terms. These annotations typically carry additional semantic information about the term. An annotated term has the form t{t1,...,tn}. Example: Lt(Var("n"),Int("1")){Type("bool")}. The contents of annotations are up to the application.
The term format described above is used in Stratego programs to denote terms, but is also used to exchange terms between programs. Thus, the internal format and the external format exactly coincide. Of course, internally a Stratego program uses a data-structure in memory with pointers rather than manipulating a textual representation of terms. But this is completely hidden from the Stratego programmer. There are a few facts that are useful to be aware of, though.
The internal representation used in Stratego programs maintains maximal sharing of subterms. This means that all occurrences of a subterm are represented by a pointer to the same node in memory. For example, in Plus(Var("x"),Var("x")) the two occurrences of Var("x") are represented by the same node. This makes comparing terms in Stratego for syntactic equality a very cheap operation, i.e., just a pointer comparison.
When writing a term to a file in order to exchange it with another tool there are several representations to choose from. The textual format described above is the canonical `meaning' of terms, but does not preserve maximal sharing. Therefore, there is also a Binary ATerm Format (BAF) that preserves sharing in terms. The program ? can be used to convert between the textual and binary representations.
As a Stratego programmer you will be looking a lot at raw ATerms. Stratego pioneers would do this by opening an ATerm file in emacs and trying to get a sense of the structure by parenthesis highlighting and inserting newlines here and there. These days life is much more pleasant thanks to the tool ?, which adds layout to a term to make it readable. For example, parsing the following program
let function fact(n : int) : int =
      if n < 1 then 1 else (n * fact(n - 1))
 in printint(fact(10))
end
produces the following ATerm (say in file fac.trm):
Let([FunDecs([FunDec("fact",[FArg("n",Tp(Tid("int")))],Tp(Tid("int")),
If(Lt(Var("n"),Int("1")),Int("1"),Seq([Times(Var("n"),Call(Var("fact"),
[Minus(Var("n"),Int("1"))]))])))])],[Call(Var("printint"),[Call(Var(
"fact"),[Int("10")])])])
By pretty-printing the term using pp-aterm, as in
$ pp-aterm -i fac.trm -o fac-pp.trm --max-term-size 20
we get a much more readable term:
Let(
  [ FunDecs(
      [ FunDec(
          "fact"
        , [FArg("n", Tp(Tid("int")))]
        , Tp(Tid("int"))
        , If(
            Lt(Var("n"), Int("1"))
          , Int("1")
          , Seq([ Times(Var("n"), Call(Var("fact"), [Minus(Var("n"), Int("1"))]))
                ])
          )
        )
      ]
    )
  ]
, [ Call(Var("printint"), [Call(Var("fact"), [Int("10")])])
  ]
)
To use terms in Stratego programs, their constructors should be declared in a signature. A signature declares a number of sorts and a number of constructors for these sorts. For each constructor, a signature declares the number and types of its arguments. For example, the following signature declares some typical constructors for constructing abstract syntax trees of expressions in a programming language:
signature
  sorts Id Exp
  constructors
         : String -> Id
    Var  : Id -> Exp
    Int  : Int -> Exp
    Plus : Exp * Exp -> Exp
    Mul  : Exp * Exp -> Exp
    Call : Id * List(Exp) -> Exp
Currently, the Stratego compiler only checks the arity of constructor applications against the signature. Still, it is considered good style to also declare the types of constructors in a sensible manner for the purpose of documentation. Also, a later version of the language may introduce typechecking.
Now let's see how we can actually transform terms using Stratego programs. In the rest of this chapter we will first look at the structure of Stratego programs, and how to compile and run them. In the next chapters we will then see how to define transformations.
The simplest program you can write in Stratego is the following identity.str program:
module identity
imports list-cons
strategies
  main = id
It features the following elements: Each Stratego file is a module, which has the same name as the file it is stored in, without the .str extension. A module may import other modules in order to use the definitions in those modules. A module may contain one or more strategies sections that introduce new strategy definitions. It will become clear later what strategies and strategy definitions are. Each Stratego program has one main definition, which indicates the strategy to be executed on invocation of the program. In the example, the body of this program's main definition is the identity strategy id.
Now let's see what this program means. To find that out, we first need to compile it, which we do using the Stratego compiler strc as follows:
$ strc -i identity.str
[ strc | info ] Compiling 'identity.str'
[ strc | info ] Front-end succeeded : [user/system] = [0.59s/0.56s]
[ strc | info ] Back-end succeeded : [user/system] = [0.46s/0.16s]
[ strc | info ] C compilation succeeded : [user/system] = [0.28s/0.23s]
[ strc | info ] Compilation succeeded : [user/system] = [1.35s/0.95s]
The -i option of strc indicates the module to compile. The compiler also reads all imported modules, in this case the list-cons.str module that is part of the Stratego library and that strc magically knows how to find. The compiler prints some information about what it is doing, i.e., the stages of compilation that it goes through and the times for those stages. You can turn this off using the argument --verbose 0. However, since the compiler is not very fast, it may be satisfying to see something going on.
The result of compilation is an executable named identity, after the name of the main module of the program. Just to satisfy our curiosity we inspect the file system to see what the compiler has done:
$ ls -l identity*
-rwxrwxr-x 1 7182 Sep 7 14:54 identity*
-rw------- 1 1362 Sep 7 14:54 identity.c
-rw-rw-r-- 1 200 Sep 7 14:54 identity.dep
-rw-rw-r-- 1 2472 Sep 7 14:54 identity.o
-rw-rw-r-- 1 57 Sep 7 13:03 identity.str
Here we see that in addition to the executable the compiler has produced a couple of other files. First of all, the identity.c file gives away the fact that the compiler first translates a Stratego program to C and then uses the C compiler to compile to machine code. The identity.o file is the result of compiling the generated C program. Finally, the contents of the identity.dep file will look somewhat like this:
identity: \
   /usr/local/share/stratego-lib/collection/list/cons.rtree \
   /usr/local/share/stratego-lib/list-cons.rtree \
   ./identity.str
It is a rule in the Make language that declares the dependencies of the identity program. You can include this file in a Makefile to automate its compilation. For example, the following Makefile automates the compilation of the identity program:
include identity.dep

identity : identity.str
	strc -i identity.str
Just invoke make on the command-line whenever you change something in the program.
Ok, we were digressing a bit. Let's turn back to finding out what the identity program does. When we execute the program with some arbitrary arguments on the command-line, this is what happens:
$ ./identity foo bar
["./identity","foo","bar"]
The program writes to stdout the list of command-line arguments as a list of strings in the ATerm format. So what we have learned is that a Stratego program applies its main strategy to the list of command-line arguments, and writes the resulting term to stdout. Since the strategy in the identity program is the identity transformation, it just writes the original command-line arguments (as a term).
That was instructive, but not very useful. We are not interested in transforming lists of strings, but rather programs represented as terms. So we want to read a term from a file, transform it, and write it to another file. Let's open the bag of tricks. The identity-io program improves on the previous program:
module identity-io
imports libstrategolib
strategies
  main = io-wrap(id)
The program looks similar to the previous one, but there are a couple of differences. First, instead of importing module list-cons, this module imports libstrategolib, which is the interface to the separately compiled Stratego library. This library provides a host of useful strategies that are needed in implementing program transformations. ? gives an overview of the Stratego library, and before we get there we will use some of its strategies every now and then. Right now we are interested in the io-wrap strategy used above. It implements a wrapper strategy that takes care of input and output for our program. To compile the program we need to link it with the stratego-lib library using the -la option:
$ strc -i identity-io.str -la stratego-lib
What the relation is between libstrategolib and stratego-lib will become clear later; knowing that it is needed to compile programs using libstrategolib suffices for now.
If we run the compiled identity-io program with its --help option we see the standard interface supported by the io-wrap strategy:
$ ./identity-io --help
Options:
   -i f|--input f   Read input from f
   -o f|--output f  Write output to f
   -b               Write binary output
   -S|--silent      Silent execution (same as --verbose 0)
   --verbose i      Verbosity level i (default 1)
                    ( i as a number or as a verbosity descriptor:
                      emergency, alert, critical, error,
                      warning, notice, info, debug, vomit )
   -k i | --keep i  Keep intermediates (default 0)
   --statistics i   Print statistics (default 0 = none)
   -h|-?|--help     Display usage information
   --about          Display information about this program
   --version        Same as --about
The most relevant options are the -i option for the input file and the -o option for the output file. For instance, if we have some file foo-bar.trm containing an ATerm we can apply the program to it:
$ echo "Foo(Bar())" > foo-bar.trm
$ ./identity-io -i foo-bar.trm -o foo-bar2.trm
$ cat foo-bar2.trm
Foo(Bar)
If we leave out the -i and/or -o options, input is read from stdin and output is written to stdout. Thus, we can also invoke the program in a pipe:
$ echo "Foo(Bar())" | ./identity-io
Foo(Bar)
Now it might seem that the identity-io program just copies its input file to the output file. In fact, identity-io does not just accept any input. If we try to apply the program to a text file that is not an ATerm, it protests and fails:
$ echo "+ foo bar" | ./identity-io
readFromTextFile: parse error at line 0, col 0
not a valid term
./identity-io: rewriting failed
So we have written a program to check if a file represents an ATerm.
A Stratego program based on io-wrap defines a transformation from terms to terms. Such transformations can be combined into more complex transformations by creating a chain of tool invocations. For example, if we have a Stratego program trafo-a applying some undefined transformation-a to the input term of the program
module trafo-a
imports libstrategolib
strategies
  main = io-wrap(transformation-a)
  transformation-a = ...
and we have another similar program trafo-b applying a transformation-b
module trafo-b
imports libstrategolib
strategies
  main = io-wrap(transformation-b)
  transformation-b = ...
then we can combine the transformations to transform an input file to an output file using a Unix pipe, as in
$ trafo-a -i input | trafo-b -o output
or using an intermediate file:
$ trafo-a -i input -o intermediate
$ trafo-b -i intermediate -o output
We have just learned how to write, compile, and execute Stratego programs. This is the normal mode for development of transformation systems with Stratego. Indeed, we usually do not invoke the compiler from the command-line `by hand', but have an automated build system based on (auto)make to build all programs in a project at once. For learning to use the language this can be rather laborious, however. Therefore, we have also developed the Stratego Shell, an interactive interpreter for the Stratego language. The shell allows you to type in transformation strategies on the command-line and directly see their effect on the current term. While this does not scale to developing large programs, it can be instructive to experiment while learning the language. In the following chapters we will use the Stratego Shell to illustrate various features of the language.
Here is a short session with the Stratego Shell that shows the basics of using it:
$ stratego-shell
stratego> :show
()
stratego> !Foo(Bar())
Foo(Bar)
stratego> id
Foo(Bar)
stratego> fail
command failed
stratego> :show
Foo(Bar)
stratego> :quit
Foo(Bar)
$
The shell is invoked by calling the command stratego-shell on the regular command-line. The stratego> prompt then indicates that you have entered the Stratego Shell. After the prompt you can enter strategies or special shell commands.
Strategies are the statements and functions of the Stratego language. A strategy transforms a term into a new term, or fails. The term to which a strategy is applied is called the current term. In the Stratego Shell you can see the current term with :show. In the session above we see that the current term is the empty tuple if you have just started the Stratego Shell. At the prompt of the shell you can enter strategies. If the strategy succeeds, the shell will show the transformed term, which is now the new current term. For example, in the session above the strategy !Foo(Bar()) replaces the current term with the term Foo(Bar()), which is echoed directly after applying the strategy. The next strategy that is applied is the identity strategy id that we saw before. Here it becomes clear that it just returns the term to which it is applied. Thus, we have the following general scheme of applying a strategy to the current term:
current term
stratego> strategy expression
transformed current term
stratego>
Strategies can also fail. For example, the application of the fail strategy always fails. In the case of failure, the shell will print a message and leave the current term untouched:
current term
stratego> strategy expression
command failed
stratego> :show
current term
Finally, you can leave the shell using the :quit command.
The Stratego Shell has a number of non-strategy commands to operate the shell configuration. These commands are recognizable by the : prefix. The :help command tells you what commands are available in the shell:
$ stratego-shell
stratego> :help
Rewriting
strategy rewrite the current subject term with strategy
Defining Strategies
id = strategy define a strategy (doesn't change the current subject term)
id : rule define a rule (doesn't change the current subject term)
import modname import strategy definitions from 'modname' (file system or xtc)
:undef id delete defintions of all strategies 'id'/(s,t)
:undef id(s,t) delete defintion of strategy 'id'/(s,t)
:reset delete all term bindings, all strategies, reset syntax.
Debugging
:show show the current subject term
:autoshow on|off show the current subject term after each rewrite
:binding id show term binding of id
:bindings show all term bindings
:showdef id show defintions of all strategies 'id'/(s,t)
:showdef id(s,t) show defintion of strategy 'id'/(s,t)
:showast id(s,t) show ast of defintion of strategy 'id'/(s,t)
Concrete Syntax
:syntax defname set the syntax to the sdf definition in 'defname'.
XTC
:xtc import pathname
Misc
:include file execute command in the script of `file`
:verbose int set the verbosity level (0-9)
:clear clear the screen
:exit exit the Stratego Shell
:quit same as :exit
:q same as :exit
:about information about the Stratego Shell
:help show this help information
stratego>
Let's summarize what we have learned so far about Stratego programming. First, a Stratego program is divided into modules, which reside in files with extension .str and have the following general form:
module mod0
imports libstrategolib mod1 mod2
signature
  sorts A B C
  constructors
    Foo : A -> B
    Bar : A
strategies
  main = io-wrap(foo)
  foo = id
Modules can import other modules, can define signatures for declaring the structure of terms, and can define strategies, which we do not really know much about yet. However, the io-wrap strategy can be used to handle the input and output of a Stratego program. This strategy is defined in the libstrategolib module, which provides an interface to the Stratego Library. The main module of a Stratego program should have a main strategy that defines the entry point of the program.
Next, a Stratego program is compiled to an executable program using the Stratego Compiler strc.
$ strc -i mod0 -la stratego-lib
The resulting executable applies the main strategy to the command-line arguments turned into a list-of-strings term. The io-wrap strategy interprets these command-line arguments to handle input and output using standard command-line options.
Finally, the Stratego Shell can be used to invoke strategies interactively.
$ stratego-shell
stratego> id
()
stratego>
Next up: transforming terms with rewrite rules.
In ? we saw how terms provide a structured representation for programs derived from a formal definition of the syntax of a programming language. Transforming programs then requires transformation of terms. In this chapter we show how to implement term transformations using term rewriting in Stratego. In term rewriting a term is transformed by repeated application of rewrite rules.
To see how this works we take as example the language of propositional formulae, also known as Boolean expressions:
module prop
signature
  sorts Prop
  constructors
    False : Prop
    True  : Prop
    Atom  : String -> Prop
    Not   : Prop -> Prop
    And   : Prop * Prop -> Prop
    Or    : Prop * Prop -> Prop
    Impl  : Prop * Prop -> Prop
    Eq    : Prop * Prop -> Prop
Given this signature we can write terms such as And(Impl(True,False),False) and And(Atom("p"),False). Atoms are also known as proposition letters; they are the variables in propositional formulae. That is, the truth value of an atom should be provided in order to fully evaluate an expression. Here we will evaluate expressions as far as possible, a transformation also known as constant folding. We will do this using rewrite rules that define how to simplify a single operator application.
A term pattern is a term with meta variables, which are identifiers that are not declared as (nullary) constructors. For example, And(x, True) is a term pattern with variable x. Variables in term patterns are sometimes called meta variables, to distinguish them from variables in the source language being processed. For example, while atoms in the proposition expressions are variables from the point of view of the language, they are not variables from the perspective of a Stratego program.
A term pattern p matches a term t if there is a substitution that replaces the variables in p such that it becomes equal to t. For example, the pattern And(x, True) matches the term And(Impl(True,Atom("p")),True) because replacing the variable x in the pattern by Impl(True,Atom("p")) makes the pattern equal to the term. Note that And(Atom("x"),True) does not match the term And(Impl(True,Atom("p")),True), since the subterms Atom("x") and Impl(True,Atom("p")) do not match.
An unconditional rewrite rule has the form L : p1 -> p2, where L is the name of the rule, p1 is the left-hand side and p2 the right-hand side term pattern. A rewrite rule L : p1 -> p2 applies to a term t when the pattern p1 matches t. The result is the instantiation of p2 with the variable bindings found during matching. For example, the rewrite rule
E : Eq(x, False) -> Not(x)
rewrites the term Eq(Atom("q"),False) to Not(Atom("q")), since the variable x is bound to the subterm Atom("q").
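We can try such a rule directly in the Stratego Shell, which lets us define rules at the prompt (as its :help listing showed). A session would look along these lines:
stratego> E : Eq(x, False) -> Not(x)
stratego> !Eq(Atom("q"), False())
Eq(Atom("q"),False)
stratego> E
Not(Atom("q"))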
Now we can create similar evaluation rules for all constructors of sort Prop:
module prop-eval-rules
imports prop
rules
  E : Not(True)      -> False
  E : Not(False)     -> True
  E : And(True, x)   -> x
  E : And(x, True)   -> x
  E : And(False, x)  -> False
  E : And(x, False)  -> False
  E : Or(True, x)    -> True
  E : Or(x, True)    -> True
  E : Or(False, x)   -> x
  E : Or(x, False)   -> x
  E : Impl(True, x)  -> x
  E : Impl(x, True)  -> True
  E : Impl(False, x) -> True
  E : Eq(False, x)   -> Not(x)
  E : Eq(x, False)   -> Not(x)
  E : Eq(True, x)    -> x
  E : Eq(x, True)    -> x
Note that all rules have the same name, which is allowed in Stratego.
Next we want to normalize terms with respect to a collection of rewrite rules. This entails applying all rules to all subterms until no more rules can be applied. The following module defines a rewrite system based on the rules for propositions above:
module prop-eval
imports libstrategolib prop-eval-rules
strategies
  main = io-wrap(eval)
  eval = innermost(E)
The module imports the Stratego Library (libstrategolib) and the module with the evaluation rules, and then defines the main strategy to apply innermost(E) to the input term. (See the discussion of io-wrap in ?.) The innermost strategy from the library exhaustively applies its argument transformation to the term it is applied to, starting with `inner' subterms.
We can now compile the program as discussed in ?:
$ strc -i prop-eval.str -la stratego-lib
This results in an executable prop-eval that can be used to evaluate Boolean expressions. For example, here are some applications of the program:
$ cat test1.prop
And(Impl(True,And(False,True)),True)
$ ./prop-eval -i test1.prop
False
$ cat test2.prop
And(Impl(True,And(Atom("p"),Atom("q"))),Atom("p"))
$ ./prop-eval -i test2.prop
And(And(Atom("p"),Atom("q")),Atom("p"))
We can also import these definitions in the Stratego Shell, as illustrated by the following session:
$ stratego-shell
stratego> import prop-eval
stratego> !And(Impl(True(),And(False(),True())),True())
And(Impl(True,And(False,True)),True)
stratego> eval
False
stratego> !And(Impl(True(),And(Atom("p"),Atom("q"))),Atom("p"))
And(Impl(True,And(Atom("p"),Atom("q"))),Atom("p"))
stratego> eval
And(And(Atom("p"),Atom("q")),Atom("p"))
stratego> :quit
And(And(Atom("p"),Atom("q")),Atom("p"))
$
The first command imports the prop-eval module, which recursively loads the evaluation rules and the library, thus making its definitions available in the shell. The ! commands replace the current term with a new term. (This build strategy will be properly introduced in ?.) The next commands apply the eval strategy to various terms.
Next we extend the rewrite rules above to rewrite a Boolean expression to disjunctive normal form. A Boolean expression is in disjunctive normal form if it conforms to the following signature:
signature
  sorts Or And NAtom Atom
  constructors
    Or   : Or * Or -> Or
         : And -> Or
    And  : And * And -> And
         : NAtom -> And
    Not  : Atom -> NAtom
         : Atom -> NAtom
    Atom : String -> Atom
We use this signature only to describe what a disjunctive normal form is, not in the actual Stratego program. This is not necessary, since terms conforming to the DNF signature are also Prop terms as defined before. For example, the disjunctive normal form of
And(Impl(Atom("r"),And(Atom("p"),Atom("q"))),Atom("p"))
is
Or(And(Not(Atom("r")),Atom("p")),
   And(And(Atom("p"),Atom("q")),Atom("p")))
Module prop-dnf-rules extends the rules defined in prop-eval-rules with rules to achieve disjunctive normal forms:
module prop-dnf-rules
imports prop-eval-rules
rules
  E : Impl(x, y)       -> Or(Not(x), y)
  E : Eq(x, y)         -> And(Impl(x, y), Impl(y, x))
  E : Not(Not(x))      -> x
  E : Not(And(x, y))   -> Or(Not(x), Not(y))
  E : Not(Or(x, y))    -> And(Not(x), Not(y))
  E : And(Or(x, y), z) -> Or(And(x, z), And(y, z))
  E : And(z, Or(x, y)) -> Or(And(z, x), And(z, y))
The first two rules rewrite implication (Impl) and equivalence (Eq) to combinations of And, Or, and Not. The third rule removes double negation. The fourth and fifth rules implement the well-known De Morgan laws. The last two rules define distribution of conjunction over disjunction.
We turn this set of rewrite rules into a compilable Stratego program in the same way as before:
module prop-dnf
imports libstrategolib prop-dnf-rules
strategies
  main = io-wrap(dnf)
  dnf = innermost(E)
We compile it in the usual way
$ strc -i prop-dnf.str -la stratego-lib
so that we can use it to transform terms:
$ cat test3.prop
And(Impl(Atom("r"),And(Atom("p"),Atom("q"))),Atom("p"))
$ ./prop-dnf -i test3.prop
Or(And(Not(Atom("r")),Atom("p")),And(And(Atom("p"),Atom("q")),Atom("p")))
We have seen how to define simple transformations on terms using unconditional term rewrite rules. Using the innermost strategy, rules are applied exhaustively to all subterms of the subject term. The implementation of a rewrite system in Stratego has the following form:
module mod
imports libstrategolib
signature
  sorts A B C
  constructors
    Foo : A * B -> C
rules
  R : p1 -> p2
  R : p3 -> p4
strategies
  main = io-wrap(rewr)
  rewr = innermost(R)
The ingredients of such a program can be divided over several modules. Thus, a set of rules can be used in multiple rewrite systems.
Compiling the module by means of the command
$ strc -i mod.str -la stratego-lib
produces an executable mod that can be used to transform terms.
In ? we saw how term rewriting can be used to implement transformations on programs represented by means of terms. Term rewriting involves exhaustively applying rules to subterms until no more rules apply. This requires a strategy for selecting the order in which subterms are rewritten. The innermost strategy introduced in ? applies rules automatically throughout a term from inner to outer terms, starting with the leaves. The nice thing about term rewriting is that there is no need to define traversals over the syntax tree; the rules express basic transformation steps and the strategy takes care of applying them everywhere. However, the complete normalization approach of rewriting turns out not to be adequate for program transformation, because rewrite systems for programming languages will often be non-terminating and/or non-confluent. In general, it is not desirable to apply all rules at the same time or to apply all rules under all circumstances.
Consider, for example, the following extension of prop-dnf-rules with distribution rules to achieve conjunctive normal forms:
module prop-cnf
imports prop-dnf-rules
rules
  E : Or(And(x, y), z) -> And(Or(x, z), Or(y, z))
  E : Or(z, And(x, y)) -> And(Or(z, x), Or(z, y))
strategies
  main = io-wrap(cnf)
  cnf = innermost(E)
This rewrite system is non-terminating because after applying one of the and-over-or distribution rules, the or-over-and distribution rules introduced here can be applied, and vice versa.
And(Or(Atom("p"),Atom("q")), Atom("r"))
->
Or(And(Atom("p"), Atom("r")), And(Atom("q"), Atom("r")))
->
And(Or(Atom("p"), And(Atom("q"), Atom("r"))),
    Or(Atom("r"), And(Atom("q"), Atom("r"))))
->
...
There are a number of solutions to this problem. We'll first discuss a couple of solutions within pure rewriting, and then show how programmable rewriting strategies can overcome the problems of these solutions.
The non-termination of prop-cnf is due to the fact that the and-over-or and or-over-and distribution rules interfere with each other. This can be prevented by refactoring the module structure such that the two sets of rules are not present in the same rewrite system. For example, we could split module prop-dnf-rules into prop-simplify and prop-dnf2 as follows:
module prop-simplify
imports prop-eval-rules
rules
  E : Impl(x, y)     -> Or(Not(x), y)
  E : Eq(x, y)       -> And(Impl(x, y), Impl(y, x))
  E : Not(Not(x))    -> x
  E : Not(And(x, y)) -> Or(Not(x), Not(y))
  E : Not(Or(x, y))  -> And(Not(x), Not(y))

module prop-dnf2
imports prop-simplify
rules
  E : And(Or(x, y), z) -> Or(And(x, z), And(y, z))
  E : And(z, Or(x, y)) -> Or(And(z, x), And(z, y))
strategies
  main = io-wrap(dnf)
  dnf = innermost(E)
Now we can reuse the rules from prop-simplify without the and-over-or distribution rules to create a prop-cnf2 for normalizing to conjunctive normal form:
module prop-cnf2
imports prop-simplify
rules
  E : Or(And(x, y), z) -> And(Or(x, z), Or(y, z))
  E : Or(z, And(x, y)) -> And(Or(z, x), Or(z, y))
strategies
  main = io-wrap(cnf)
  cnf = innermost(E)
Although this solves the non-termination problem, it is not an ideal solution. In the first place it is not possible to apply the two transformations in the same program. In the second place, extrapolating the approach to fine-grained selection of rules might require the definition of a single rule per module.
Another common solution to this kind of problem is to introduce additional constructors that achieve normalization under a restricted set of rules. That is, the original set of rules p1 -> p2 is transformed into rules of the form f(p1) -> p2', where f is some new constructor symbol and the right-hand side of the rule also contains such new constructors. In this style of programming, constructors such as f are called functions and are distinguished from constructors. Normal forms over such rewrite systems are assumed to be free of these `function' symbols; otherwise the function would have an incomplete definition.
To illustrate the approach we adapt the DNF rules by introducing the function symbols Dnf and DnfR. (We ignore the evaluation rules in this example.)
module prop-dnf3
imports libstrategolib prop
signature
  constructors
    Dnf  : Prop -> Prop
    DnfR : Prop -> Prop
rules
  E : Dnf(Atom(x))    -> Atom(x)
  E : Dnf(Not(x))     -> DnfR(Not(Dnf(x)))
  E : Dnf(And(x, y))  -> DnfR(And(Dnf(x), Dnf(y)))
  E : Dnf(Or(x, y))   -> Or(Dnf(x), Dnf(y))
  E : Dnf(Impl(x, y)) -> Dnf(Or(Not(x), y))
  E : Dnf(Eq(x, y))   -> Dnf(And(Impl(x, y), Impl(y, x)))

  E : DnfR(Not(Not(x)))    -> x
  E : DnfR(Not(And(x, y))) -> Or(Dnf(Not(x)), Dnf(Not(y)))
  E : DnfR(Not(Or(x, y)))  -> Dnf(And(Not(x), Not(y)))
  D : DnfR(Not(x))         -> Not(x)

  E : DnfR(And(Or(x, y), z)) -> Or(Dnf(And(x, z)), Dnf(And(y, z)))
  E : DnfR(And(z, Or(x, y))) -> Or(Dnf(And(z, x)), Dnf(And(z, y)))
  D : DnfR(And(x, y))        -> And(x, y)
strategies
  main = io-wrap(dnf)
  dnf = innermost(E <+ D)
The Dnf function mimics the innermost normalization strategy by recursively traversing terms. The auxiliary transformation function DnfR is used to encode the distribution and negation rules. The D rules are default rules that are only applied if none of the E rules apply, as specified by the strategy expression E <+ D.
In order to compute the disjunctive normal form of a term, we have to `apply' the Dnf function to it, as illustrated in the following application of the prop-dnf3 program:
$ cat test1.dnf
Dnf(And(Impl(Atom("r"),And(Atom("p"),Atom("q"))),Atom("p")))
$ ./prop-dnf3 -i test1.dnf
Or(And(Not(Atom("r")),Atom("p")),
   And(And(Atom("p"),Atom("q")),Atom("p")))
For conjunctive normal form we can create a similar definition, which can now co-exist with the definition of DNF. Indeed, we could then simultaneously rewrite one subterm to DNF and the other to CNF.
E : DC(x) -> (Dnf(x), Cnf(x))
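Such a Cnf function could be defined by mirroring the Dnf rules; the following is a sketch (the constructor names Cnf and CnfR are ours), in which only the roles of And and Or in the distribution rules are exchanged:
signature
  constructors
    Cnf  : Prop -> Prop
    CnfR : Prop -> Prop
rules
  E : Cnf(Atom(x))    -> Atom(x)
  E : Cnf(Not(x))     -> CnfR(Not(Cnf(x)))
  E : Cnf(And(x, y))  -> And(Cnf(x), Cnf(y))
  E : Cnf(Or(x, y))   -> CnfR(Or(Cnf(x), Cnf(y)))
  E : Cnf(Impl(x, y)) -> Cnf(Or(Not(x), y))
  E : Cnf(Eq(x, y))   -> Cnf(And(Impl(x, y), Impl(y, x)))

  E : CnfR(Not(Not(x)))    -> x
  E : CnfR(Not(And(x, y))) -> Cnf(Or(Not(x), Not(y)))
  E : CnfR(Not(Or(x, y)))  -> And(Cnf(Not(x)), Cnf(Not(y)))
  D : CnfR(Not(x))         -> Not(x)

  E : CnfR(Or(And(x, y), z)) -> And(Cnf(Or(x, z)), Cnf(Or(y, z)))
  E : CnfR(Or(z, And(x, y))) -> And(Cnf(Or(z, x)), Cnf(Or(z, y)))
  D : CnfR(Or(x, y))         -> Or(x, y)
Note that the negation rules had to be duplicated; only their Cnf/CnfR wrappers differ from the Dnf version.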
In the solution above, the original rules have been completely intertwined with the Dnf transformation. The rules for negation cannot be reused in the definition of normalization to conjunctive normal form. For each new transformation a new traversal function and new transformation functions have to be defined. Many additional rules had to be added to traverse the term to find the places to apply the rules. In the modular solution we had 5 basic rules, 2 additional rules for DNF, and 2 rules for CNF, 9 in total. In the functionalized version we needed 13 rules for each transformation, that is 26 rules in total.
In general, there are two problems with the functional approach to encoding the control over the application of rewrite rules, when comparing it to the original term rewriting approach: traversal overhead and loss of separation of rules and strategies.
In the first place, the functional encoding incurs a large overhead due to the explicit specification of traversal. In pure term rewriting, the strategy takes care of traversing the term in search of subterms to rewrite. In the functional approach traversal is spelled out in the definition of the function, requiring the specification of many additional rules. A traversal rule needs to be defined for each constructor in the signature and for each transformation. The overhead for transformation systems for real languages can be inferred from the number of constructors for some typical languages:
language   constructors
Tiger      65
C          140
Java       140
COBOL      300 - 1200
In the second place, rewrite rules and the strategy that defines their application are completely intertwined. Another advantage of pure term rewriting is the separation of the specification of the rules and the strategy that controls their application. Intertwining these specifications makes it more difficult to understand the specification, since rules cannot be distinguished from the transformation they are part of. Furthermore, intertwining makes it impossible to reuse the rules in a different transformation.
Stratego introduced the paradigm of programmable rewriting strategies with generic traversals, a unifying solution in which application of rules can be carefully controlled, while incurring minimal traversal overhead and preserving separation of rules and strategies.
The following are the design criteria for strategies in Stratego:
Separation of rules and strategy
: Basic transformation rules can be defined separately from the strategy that applies them, such that they can be understood independently.
Rule selection
: A transformation can select the necessary set of rules from a collection (library) of rules.
Control
: A transformation can exercise complete control over the application of rules. This control may be fine-grained or coarse-grained depending on the application.
No traversal overhead
: Transformations can be defined without overhead for the definition of traversals.
Reuse of rules
: Rules can be reused in different transformations.
Reuse of traversal schemas
: Traversal schemas can be defined generically and reused in different transformations.
In the next chapters we will examine the language constructs that Stratego provides for programming with strategies, starting with the low-level actions of building and matching terms. To get a feeling for the purpose of these constructs, we first look at a couple of typical idioms of strategic rewriting.
The basic idiom of program transformation achieved with term rewriting is that of cascading transformations. Instead of applying a single complex transformation algorithm to a program, a number of small, independent transformations are applied in combination throughout a program or program unit to achieve the desired effect. Although each individual transformation step achieves little, the cumulative effect can be significant, since each transformation feeds on the results of the ones that came before it.
One common cascading of transformations is accomplished by exhaustively applying rewrite rules to a subject term. In Stratego the definition of a cascading normalization strategy with respect to rules R1, ..., Rn can be formalized using the innermost strategy that we saw before:
simplify = innermost(R1 <+ ... <+ Rn)
The argument strategy of innermost is a selection of rules. By giving different names to rules, we can control the selection used in each transformation. There can be multiple applications of innermost to different sets of rules, such that different transformations can co-exist in the same module without interference. Thus, it is now possible to develop a large library of transformation rules that can be called upon when necessary, without having to compose a rewrite system by cutting and pasting. For example, the following module defines the normalization of proposition formulae to both disjunctive and conjunctive normal form:
module prop-laws
imports libstrategolib prop
rules
  DefI : Impl(x, y) -> Or(Not(x), y)
  DefE : Eq(x, y)   -> And(Impl(x, y), Impl(y, x))
  DN   : Not(Not(x)) -> x
  DMA  : Not(And(x, y)) -> Or(Not(x), Not(y))
  DMO  : Not(Or(x, y))  -> And(Not(x), Not(y))
  DAOL : And(Or(x, y), z) -> Or(And(x, z), And(y, z))
  DAOR : And(z, Or(x, y)) -> Or(And(z, x), And(z, y))
  DOAL : Or(And(x, y), z) -> And(Or(x, z), Or(y, z))
  DOAR : Or(z, And(x, y)) -> And(Or(z, x), Or(z, y))
strategies
  dnf = innermost(DefI <+ DefE <+ DAOL <+ DAOR <+ DN <+ DMA <+ DMO)
  cnf = innermost(DefI <+ DefE <+ DOAL <+ DOAR <+ DN <+ DMA <+ DMO)
  main-dnf = io-wrap(dnf)
  main-cnf = io-wrap(cnf)
The rules are named, and for each strategy different selections from the ruleset are made.
The module even defines two main strategies, which allows us to use one module for deriving multiple programs. Using the --main option of ? we declare which strategy to invoke as main strategy in a particular program. Using the -o option we can give a different name to each derived program.
$ strc -i prop-laws.str -la stratego-lib --main main-dnf -o prop-dnf4
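A program for the conjunctive normal form can be derived from the same module analogously (the program name prop-cnf4 is ours):
$ strc -i prop-laws.str -la stratego-lib --main main-cnf -o prop-cnf4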
Cascading transformations can be defined with other strategies as well, and these strategies need not be exhaustive, but can be simpler one-pass traversals. For example, constant folding of Boolean expressions only requires a simple one-pass bottom-up traversal. This can be achieved using the bottomup strategy according to the following scheme:
simplify = bottomup(repeat(R1 <+ ... <+ Rn))
The bottomup strategy applies its argument strategy to each subterm in a bottom-to-top traversal. The repeat strategy applies its argument strategy repeatedly to a term, until it fails.
Module prop-eval2 defines the evaluation rules for Boolean expressions and a strategy for applying them using this approach:
module prop-eval2
imports libstrategolib prop
rules
  Eval : Not(True)      -> False
  Eval : Not(False)     -> True
  Eval : And(True, x)   -> x
  Eval : And(x, True)   -> x
  Eval : And(False, x)  -> False
  Eval : And(x, False)  -> False
  Eval : Or(True, x)    -> True
  Eval : Or(x, True)    -> True
  Eval : Or(False, x)   -> x
  Eval : Or(x, False)   -> x
  Eval : Impl(True, x)  -> x
  Eval : Impl(x, True)  -> True
  Eval : Impl(False, x) -> True
  Eval : Eq(False, x)   -> Not(x)
  Eval : Eq(x, False)   -> Not(x)
  Eval : Eq(True, x)    -> x
  Eval : Eq(x, True)    -> x
strategies
  main = io-wrap(eval)
  eval = bottomup(repeat(Eval))
The strategy eval applies these rules in a bottom-up traversal over a term, using the bottomup(s) strategy. At each subterm, the rules are applied repeatedly until no more rule applies, using the repeat(s) strategy. This is sufficient for the Eval rules, since the rules never construct a term with subterms that can be rewritten.
Another typical example of the use of one-pass traversals is desugaring, that is, rewriting language constructs to more basic language constructs. Simple desugarings can usually be expressed using a single top-to-bottom traversal according to the scheme
simplify = topdown(try(R1 <+ ... <+ Rn))
The topdown strategy applies its argument strategy to a term and then traverses the resulting term. The try strategy tries to apply its argument strategy once to a term.
Module prop-desugar defines a number of desugaring rules for Boolean expressions, defining propositional operators in terms of others. For example, rule DefN defines Not in terms of Impl, and rule DefI defines Impl in terms of Or and Not. So not all rules should be applied in the same transformation, or non-termination would result.
module prop-desugar
imports prop libstrategolib
rules
  DefN  : Not(x)     -> Impl(x, False)
  DefI  : Impl(x, y) -> Or(Not(x), y)
  DefE  : Eq(x, y)   -> And(Impl(x, y), Impl(y, x))
  DefO1 : Or(x, y)   -> Impl(Not(x), y)
  DefO2 : Or(x, y)   -> Not(And(Not(x), Not(y)))
  DefA1 : And(x, y)  -> Not(Or(Not(x), Not(y)))
  DefA2 : And(x, y)  -> Not(Impl(x, Not(y)))
  IDefI : Or(Not(x), y) -> Impl(x, y)
  IDefE : And(Impl(x, y), Impl(y, x)) -> Eq(x, y)
strategies
  desugar =
    topdown(try(DefI <+ DefE))
  impl-nf =
    topdown(repeat(DefN <+ DefA2 <+ DefO1 <+ DefE))
  main-desugar =
    io-wrap(desugar)
  main-inf =
    io-wrap(impl-nf)
The strategies desugar and impl-nf define two different desugaring transformations based on these rules. The desugar strategy gets rid of the implication and equivalence operators, while the impl-nf strategy reduces an expression to implicative normal form, a format in which only implication (Impl) and False are used.
A final example of a one-pass traversal is the downup strategy, which applies its argument transformation during a traversal on the way down, and again on the way up:
simplify = downup(repeat(R1 <+ ... <+ Rn))
An application of this strategy is a more efficient implementation of constant folding for Boolean expressions:
eval = downup(repeat(Eval))
This strategy reduces terms such as
And(... big expression ..., False)
in one step (to False in this case), while the bottomup strategy defined above would first evaluate the big expression.
Cascading transformations apply a number of rules one after another to an entire tree. But in some cases this is not appropriate. For instance, two transformations may be inverses of one another, so that repeatedly applying one and then the other would lead to non-termination. To remedy this difficulty, Stratego supports the idiom of staged transformation.
In staged computation, transformations are not applied to a subject term all at once, but rather in stages. In each stage, only rules from some particular subset of the entire set of available rules are applied. In the TAMPR program transformation system this idiom is called sequence of normal forms, since a program tree is transformed in a sequence of steps, each of which performs a normalization with respect to a specified set of rules. In Stratego this idiom can be expressed directly according to the following scheme:
strategies
  simplify =
      innermost(A1 <+ ... <+ Ak)
    ; innermost(B1 <+ ... <+ Bl)
    ; ...
    ; innermost(C1 <+ ... <+ Cm)
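For example, with the rules from prop-laws above, DNF normalization could also be organized in two stages (a sketch; the name dnf-staged is ours): first eliminate implications and equivalences and push negations inward, then distribute conjunction over disjunction. Since the distribution rules cannot create new redexes for the first stage, the result is the same disjunctive normal form:
strategies
  dnf-staged =
      innermost(DefI <+ DefE <+ DN <+ DMA <+ DMO)
    ; innermost(DAOL <+ DAOR)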
In conventional program optimization, transformations are applied throughout a program. In optimizing imperative programs, for example, complex transformations are applied to entire programs. In GHC-style compilation-by-transformation, small transformation steps are applied throughout programs. Another style of transformation is a mixture of these ideas. Instead of applying a complex transformation algorithm to a program we use staged, cascading transformations to accumulate small transformation steps for large effect. However, instead of applying transformations throughout the subject program, we often wish to apply them locally, i.e., only to selected parts of the subject program. This allows us to use transformations rules that would not be beneficial if applied everywhere.
One example of a strategy which achieves such a transformation is
strategies
  transformation =
    alltd(
      trigger-transformation
      ; innermost(A1 <+ ... <+ An)
    )
The strategy alltd(s) descends into a term until a subterm is encountered for which the transformation s succeeds. In this case the strategy trigger-transformation recognizes a program fragment that should be transformed. Thus, cascading transformations are applied locally to terms for which the transformation is triggered. Of course, more sophisticated strategies can be used for finding application locations, as well as for applying the rules locally. Nevertheless, the key observation underlying this idiom remains: because the transformations to be applied are local, special knowledge about the subject program at the point of application can be used. This allows the application of rules that would not otherwise be applicable.
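As a concrete (made-up) instance with the propositional rules: the following strategy normalizes only those subformulas that are negations, leaving the rest of the formula untouched. Here ?Not(_) is a match operation, introduced in a later chapter, that succeeds exactly on Not terms and so plays the role of the trigger:
strategies
  dnf-inside-negations =
    alltd( ?Not(_)
         ; innermost(DefI <+ DefE <+ DAOL <+ DAOR <+ DN <+ DMA <+ DMO)
         )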
While term rewrite rules can express individual transformation steps, the exhaustive application of all rules to all subterms is not always desirable. The selection of rules to apply through the module system does not allow transformations to co-exist and may require very small-grained modules. The `functionalization' of rewrite rules leads to overhead in the form of traversal definitions and to the loss of separation between rules and strategy. The paradigm of rewriting with programmable strategies allows the separate definition of individual rewrite rules, which can be (re)used in different combinations with a choice of strategies to form a variety of transformations. Idioms such as cascading transformations, one-pass traversals, and staged and local transformations cater for different styles of applying strategies.
Next up: the next chapters give an in-depth overview of the constructs for composing strategies.
In the previous chapter we saw that pure term rewriting is not adequate for program transformation because of the lack of control over the application of rules. Attempts to encode such control within the pure rewriting paradigm lead to functionalized control by means of extra rules and constructors, at the expense of traversal overhead and the loss of the separation of rules and strategies. By selecting the appropriate rules and strategy for a transformation, Stratego programmers can control the application of rules, while maintaining the separation of rules and strategies and keeping traversal overhead to a minimum.
We saw that many transformation problems can be solved by alternative strategies such as one-pass bottom-up or top-down traversals. Others can be solved by selecting the rules that are applied in an innermost normalization, rather than all the rules in a specification. However, no fixed set of such alternative strategies will be sufficient for dealing with all transformation problems. Rather than providing one or a few fixed collections of rewriting strategies, Stratego supports the composition of strategies from basic building blocks with a few fundamental operators.
While we have seen rules and strategies in the previous chapters, we have been vague about what kinds of things they are. In this chapter we define the basic notions of rules and strategies, and we will see how new strategies and strategy combinators can be defined. The next chapters will then introduce the basic combinators used for composition of strategies.
A named rewrite rule is a declaration of the form
L : p1 -> p2
where L is the rule name, p1 the left-hand side term pattern, and p2 the right-hand side term pattern. A rule defines a transformation on terms. A rule can be applied through its name to a term. It will transform the term if it matches p1, and will replace the term with p2 instantiated with the variables bound during the match to p1. The application fails if the term does not match p1. Thus, a transformation is a partial function from terms to terms.
Let's look at an example. The SwapArgs rule swaps the subterms of the Plus constructor. Note that it is possible to introduce rules on the fly in the Stratego Shell.
stratego> SwapArgs : Plus(e1,e2) -> Plus(e2,e1)
Now we create a new term, and apply the SwapArgs rule to it by calling its name at the prompt. (The build !t of a term replaces the current term by t, as will be explained in ?.)
stratego> !Plus(Var("a"),Int("3"))
Plus(Var("a"),Int("3"))
stratego> SwapArgs
Plus(Int("3"),Var("a"))
The application of SwapArgs fails when applied to a term to which the left-hand side does not match. For example, since the pattern Plus(e1,e2) does not match a term constructed with Times, the following application fails:
stratego> !Times(Int("4"),Var("x"))
Times(Int("4"),Var("x"))
stratego> SwapArgs
command failed
A rule is applied at the root of a term, not at one of its subterms. Thus, the following application fails even though the term contains a Plus subterm:
stratego> !Times(Plus(Var("a"),Int("3")),Var("x"))
Times(Plus(Var("a"),Int("3")),Var("x"))
stratego> SwapArgs
command failed
Likewise, the following application only transforms the outermost occurrence of Plus, not the inner occurrence:
stratego> !Plus(Var("a"),Plus(Var("x"),Int("42")))
Plus(Var("a"),Plus(Var("x"),Int("42")))
stratego> SwapArgs
Plus(Plus(Var("x"),Int("42")),Var("a"))
Finally, there may be multiple rules with the same name. This has the effect that all rules with that name will be tried in turn until one succeeds, or all fail. The rules are tried in some undefined order. This means that it only makes sense to define rules with the same name if they are mutually exclusive, that is, if they do not have overlapping left-hand sides. For example, we can extend the definition of SwapArgs with a rule for the Times constructor, as follows:
stratego> SwapArgs : Times(e1, e2) -> Times(e2, e1)
Now the rule can be applied to terms with a Plus and a Times constructor, as illustrated by the following applications:
stratego> !Times(Int("4"),Var("x"))
Times(Int("4"),Var("x"))
stratego> SwapArgs
Times(Var("x"),Int("4"))
stratego> !Plus(Var("a"),Int("3"))
Plus(Var("a"),Int("3"))
stratego> SwapArgs
Plus(Int("3"),Var("a"))
Later we will see that a rule is nothing more than a syntactical convention for a strategy definition.
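To give a preview of what that convention expands to (the match operation ?p and the build operation !p are introduced in later chapters): a rule L : p1 -> p2 corresponds, roughly, to a definition that first matches the left-hand side and then builds the right-hand side, with the pattern variables scoped locally. For SwapArgs this would read:
SwapArgs = {e1, e2: ?Plus(e1, e2); !Plus(e2, e1)}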
A rule defines a transformation, that is, a partial function from terms to terms. A strategy expression is a combination of one or more transformations into a new transformation. So, a strategy expression also defines a transformation, i.e., a partial function from terms to terms. Strategy operators are functions from transformations to transformations.
In the previous chapter we saw some examples of strategy expressions. Let's examine these examples in the light of our new definition. First of all, rule names are basic strategy expressions. If we import module prop-laws, we have at our disposal all the rules it defines as basic strategies:
stratego> import prop-laws
stratego> !Impl(True(), Atom("p"))
Impl(True, Atom("p"))
stratego> DefI
Or(Not(True),Atom("p"))
Next, given a collection of rules we can create more complex transformations by means of strategy operators. For example, the innermost strategy creates from a collection of rules a new transformation that exhaustively applies those rules.
stratego> !Eq(Atom("p"), Atom("q"))
Eq(Atom("p"),Atom("q"))
stratego> innermost(DefI <+ DefE <+ DAOL <+ DAOR <+ DN <+ DMA <+ DMO)
Or(Or(And(Not(Atom("p")),Not(Atom("q"))),
      And(Not(Atom("p")),Atom("p"))),
   Or(And(Atom("q"),Not(Atom("q"))),
      And(Atom("q"),Atom("p"))))
(Exercise: add rules to this composition that remove tautologies or false propositions.) Here we see that the rules are first combined using the choice operator <+ into a composite transformation, which is the argument of the innermost strategy.
The innermost strategy always succeeds (but may not terminate), but this is not the case for all strategies. For example, bottomup(DefI) will not succeed, since it attempts to apply rule DefI to all subterms, which is clearly not possible. Thus, strategies extend the property of rules that they are partial functions from terms to terms.
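For instance, continuing the session above, one would expect something like:
stratego> !True()
True
stratego> bottomup(DefI)
command failed
stratego> :show
True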
Observe that in the composition innermost(...), the term to which the transformation is applied is never mentioned. The `current term', to which a transformation is applied, is often implicit in the definition of a strategy. That is, there is no variable that is bound to the current term and then passed to an argument strategy. Thus, a strategy operator such as innermost is a function from transformations to transformations.
While strategies are functions, they are not necessarily pure functions. Strategies in Stratego may have side effects such as performing input/output operations. This is of course necessary in the implementation of basic tool interaction such as provided by io-wrap, but is also useful for debugging. For example, the debug strategy prints the current term, but does not transform it. We can use it to visualize the way that innermost transforms a term.
stratego> !Not(Impl(Atom("p"), Atom("q")))
Not(Impl(Atom("p"),Atom("q")))
stratego> innermost(debug(!"in: "); (DefI <+ DefE <+ DAOL <+ DAOR <+ DN <+ DMA <+ DMO); debug(!"out: "))
in: p
in: Atom("p")
in: q
in: Atom("q")
in: Impl(Atom("p"),Atom("q"))
out: Or(Not(Atom("p")),Atom("q"))
in: p
in: Atom("p")
in: Not(Atom("p"))
in: q
in: Atom("q")
in: Or(Not(Atom("p")),Atom("q"))
in: Not(Or(Not(Atom("p")),Atom("q")))
out: And(Not(Not(Atom("p"))),Not(Atom("q")))
in: p
in: Atom("p")
in: Not(Atom("p"))
in: Not(Not(Atom("p")))
out: Atom("p")
in: p
in: Atom("p")
in: q
in: Atom("q")
in: Not(Atom("q"))
in: And(Atom("p"),Not(Atom("q")))
And(Atom("p"),Not(Atom("q")))
This session nicely shows how innermost traverses the term it transforms. The in: lines show terms to which it attempts to apply a rule, the out: lines indicate when this was successful and what the result of applying the rule was. Thus, innermost performs a post-order traversal, applying rules after transforming the subterms of a term. (Note that when applying debug to a string constant, the quotes are not printed.)
Stratego programs are about defining transformations in the form of rules and strategy expressions that combine them. Just defining strategy expressions does not scale, however. Strategy definitions are the abstraction mechanism of Stratego and allow naming and parameterization of strategy expressions for reuse.
A simple strategy definition names a strategy expression. For instance, the following module defines a combination of rules (dnf-rules), and some strategies based on it:
module dnf-strategies
imports libstrategolib prop-laws
strategies
  dnf-rules =
    DefI <+ DefE <+ DAOL <+ DAOR <+ DN <+ DMA <+ DMO
  dnf =
    innermost(dnf-rules)
  dnf-debug =
    innermost(debug(!"in: "); dnf-rules; debug(!"out: "))
  main =
    io-wrap(dnf)
Note how dnf-rules is used in the definition of dnf, and dnf itself in the definition of main.
In general, a definition of the form
f = s
introduces a new transformation f, which can be invoked by calling f in a strategy expression, with the effect of executing strategy expression s. The expression should have no free variables. That is, all strategies called in s should be defined strategies. Simple strategy definitions just introduce names for strategy expressions. Still, such strategies have an argument, namely the implicit current term.
Strategy definitions with strategy and/or term parameters can be used to define transformation schemas that can be instantiated for various situations.
A parameterized strategy definition of the form
f(x1,...,xn | y1,...,ym) = s
introduces a user-defined operator f with n strategy arguments and m term arguments. Such a user-defined strategy operator can be called as f(s1,...,sn|t1,...,tm) by providing it n argument strategies and m argument terms. The meaning of such a call is the body s of the definition in which the actual arguments have been substituted for the formal arguments. Strategy arguments and term arguments can be left out of calls and definitions. That is, a call f(|) without strategy and term arguments can be written as f(), or even just f. A call f(s1,...,sn|) without term arguments can be written as f(s1,...,sn). The same holds for definitions.
As we will see in the coming chapters, strategies such as innermost, topdown, and bottomup are not built into the language, but are defined using strategy definitions in the language itself using more basic combinators, as illustrated by the following definitions (without going into the exact meaning of these definitions):
strategies
  try(s)      = s <+ id
  repeat(s)   = try(s; repeat(s))
  topdown(s)  = s; all(topdown(s))
  bottomup(s) = all(bottomup(s)); s
Such parameterized strategy operators are invoked by providing arguments for the parameters. Specifically, strategy arguments are instantiated by means of strategy expressions. Wherever the argument is invoked in the body of the definition, the strategy expression is invoked. For example, in the previous chapter we saw the following instantiations of the topdown, try, and repeat strategies:
module prop-desugar
// ...
strategies
  desugar =
    topdown(try(DefI <+ DefE))
  impl-nf =
    topdown(repeat(DefN <+ DefA2 <+ DefO1 <+ DefE))
There can be multiple definitions with the same name but different numbers of parameters. Such definitions introduce different strategy operators.
Strategy definitions at top level are visible everywhere. Sometimes it is useful to define a local strategy operator. This can be done using a let expression of the form let d* in s end, where d* is a list of definitions. For example, in the following strategy expression, the definition of dnf-rules is only visible in the instantiation of innermost:
let dnf-rules = DefI <+ DefE <+ DAOL <+ DAOR <+ DN <+ DMA <+ DMO
in innermost(dnf-rules)
end
The current version of Stratego does not support hidden strategy definitions at the module level. Such a feature is under consideration for a future version.
As we saw in ?, a Stratego program can introduce several rules with the same name. It is even possible to extend rules across modules. This is also possible for strategy definitions. If two strategy definitions have the same name and the same number of parameters, they effectively define a single strategy that tries to apply the bodies of the definitions in some undefined order. Thus, a definition of the form
strategies
f = s1
f = s2
entails that a call to f has the effect of first trying to apply s1, and if that fails applying s2, or the other way around. Thus, the definition above is either translated to
strategies
  f = s1 <+ s2
or to
strategies
  f = s2 <+ s1
Stratego provides combinators for composing transformations and basic operators for analyzing, creating and traversing terms. However, it does not provide built-in support for other types of computation such as input/output and process control. In order to make such functionality available to Stratego programmers, the language provides access to user-definable primitive strategies through the prim construct. For example, the following call to prim invokes the SSL_printnl function from the native code of the C library:
stratego> prim("SSL_printnl", stdout(), ["foo", "bar"])
foobar
""
In general, a call prim("f", s* | t*) represents a call to a primitive function f with strategy arguments s* and term arguments t*. Note that the `current' term is not passed automatically as argument.
This mechanism allows the incorporation of mundane tasks such as arithmetic, I/O, and other tasks not directly related to transformation, but necessary for the integration of transformations with the other parts of a transformation system.
Primitive functions should take ATerms as arguments. It is not possible to use `unboxed' values, i.e., raw native types; supporting those requires writing a wrapper function in C. For example, the addition of two integers is defined via a call to a primitive prim("SSL_addi", x, y), where the arguments should represent integer ATerms, not C integers.
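For instance, in the Stratego Shell this primitive can be called with integer ATerms directly (a sketch; the exact session output may differ):
stratego> prim("SSL_addi", 1, 2)
3
Here the literals 1 and 2 are integer ATerms, so no unboxing is needed.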
Implementing Primitives.
The Stratego Library provides all the primitives for I/O, arithmetic, string processing, and process control that are usually needed in Stratego programs. However, it is possible to add new primitives to a program as well. That requires linking with the compiled program a library that implements the function. See the documentation of ? for instructions.
The Stratego Compiler is a whole program compiler. That is, the compiler includes all definitions from imported modules (transitively) into the program defined by the main module (the one being compiled). This is the reason that the compiler takes its time to compile a program. To reduce the compilation effort and the size of the resulting programs it is possible to create separately compiled libraries of Stratego definitions. The strategies that such a library provides are declared as external definitions. A declaration of the form
external f(s1 ... sn | x1 ... xm)
states that there is an externally defined strategy operator f with n strategy arguments and m term arguments. When compiling a program with external definitions, a library should be provided that implements these definitions.
The Stratego Library is provided as a separately compiled library. The libstrategolib module that we have been using in the example programs contains external definitions for all strategies in the library, as illustrated by the following excerpt:
module libstrategolib
// ...
strategies
// ...
external io-wrap(s)
external bottomup(s)
external topdown(s)
// ...
When compiling a program using the library we used the -la stratego-lib option to provide the implementation of those definitions.
External Definitions cannot be Extended.
Unlike definitions imported in the normal way, external definitions cannot be extended. If we try to compile a module extending an external definition, such as
module wrong
imports libstrategolib
strategies
bottomup(s) = fail
compilation fails:
$ strc -i wrong.str
[ strc | info ] Compiling 'wrong.str'
error: redefining external definition: bottomup/1-0
[ strc | error ] Compilation failed (3.66 secs)
Creating Libraries.
It is possible to create your own Stratego libraries as well. Currently this exposes you to a bit of C compilation gibberish; in the future this may be incorporated into the Stratego compiler. Let's create a library for the rules and strategy definitions in the prop-laws module. We do this using the --library option to indicate that a library is being built, and the -c option to indicate that we are only interested in the generated C code.
$ strc -i prop-laws.str -c -o libproplib.rtree --library
[ strc | info ] Compiling 'proplib.str'
[ strc | info ] Front-end succeeded : [user/system] = [4.71s/0.77s]
[ strc | info ] Abstract syntax in 'libproplib.rtree'
[ strc | info ] Concrete syntax in 'libproplib.str'
[ strc | info ] Export of externals succeeded : [user/system] = [2.02s/0.11s]
[ strc | info ] Back-end succeeded : [user/system] = [6.66s/0.19s]
[ strc | info ] Compilation succeeded : [user/system] = [13.4s/1.08s]
$ rm libproplib.str
The result of this compilation is a module libproplib that contains the external definitions of the module and those inherited from libstrategolib. (This module comes in two versions: one in concrete syntax, libproplib.str, and one in abstract syntax, libproplib.rtree; for some obscure reason you should throw away the .str file.)
Furthermore, the Stratego Compiler produces a C program libproplib.c with the implementation of the library. This C program should be turned into an object library using libtool, as follows:
$ libtool --mode=compile gcc -g -O -c libproplib.c -o libproplib.o -I <path/to/aterm-stratego/include>
...
$ libtool --mode=link gcc -g -O -o libproplib.la libproplib.lo
...
The result is a shared library libproplib.la that can be used in other Stratego programs. (TODO: the production of the shared library should really be incorporated into strc.)
Using Libraries.
Programmers who want to use your library can now import the module with external definitions, instead of the original module.
module dnf-tool
imports libproplib
strategies
main = main-dnf
This program can be compiled in the usual way, adding the new library to the libraries that should be linked against:
$ strc -i dnf-tool.str -la stratego-lib -la ./libproplib.la
$ cat test3.prop
And(Impl(Atom("r"),And(Atom("p"),Atom("q"))),ATom("p"))
$ ./dnf-tool -i test3.prop
Or(And(Not(Atom("r")),ATom("p")),And(And(Atom("p"),Atom("q")),ATom("p")))
To correctly deploy programs based on shared libraries requires some additional effort. ? explains how to create deployable packages for your Stratego programs.
Strategies can be called dynamically by name, i.e., where the name of the strategy is specified as a string. Such calls can be made using the call construct, which has the form:
call(f | s1, ..., sn | t1, ..., tn)
where f is a term that should evaluate to a string, which indicates the name of the strategy to be called, followed by a list of strategy arguments, and a list of term arguments.
Dynamic calls allow the name of the strategy to be computed at run-time. This is a rather recent feature of Stratego that was motivated by the need for call-backs from a separately compiled Stratego library combined with the computation of dynamic rule names. Otherwise, there is not yet much experience with the feature.
In the current version of Stratego it is necessary to 'register' a strategy to be able to call it dynamically. (This prevents the strategy from being deleted by the compiler in case it is not called explicitly somewhere in the program.) Strategies are registered by means of a dummy strategy definition DYNAMIC-CALLS with calls to the strategies that should be called dynamically.
DYNAMIC-CALLS = foo(id)
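Putting registration and invocation together, a program that selects a strategy by name at run-time might look as follows. This is a hedged sketch: inc-term is a hypothetical strategy, and here the name happens to be a literal string, but it could equally well be computed:
strategies
  inc-term = topdown(try(\ Int(i) -> Int(<addS>(i,"1")) \))
  main = io-wrap(call("inc-term" | | ))
  DYNAMIC-CALLS = inc-term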
We have learned that rules and strategies define transformations, that is, functions from terms to terms that can fail, i.e., partial functions. Rule and strategy definitions introduce names for transformations. Parameterized strategy definitions introduce new strategy operators, functions that construct transformations from transformations.
Primitive strategies are transformations that are implemented in some language other than Stratego (usually C), and are called through the prim construct. External definitions define an interface to a separately compiled library of Stratego definitions. Dynamic calls allow the name of the strategy to be called to be computed as a string.
We have seen the use of strategies to combine rules into complex transformations. Rather than providing a fixed set of high-level strategy operators such as bottomup, topdown, and innermost, Stratego provides a small set of basic combinators that can be used to create a wide variety of strategies. In ? until ? we will introduce these combinators. In this chapter we start with a set of combinators for sequential composition and choice of strategies.
The most basic operations in Stratego are id and fail. The identity strategy id always succeeds and behaves as the identity function on terms. The failure strategy fail always fails. The operations have no side effects.
stratego> !Foo(Bar())
Foo(Bar)
stratego> id
Foo(Bar)
stratego> fail
command failed
The sequential composition s1 ; s2 of the strategies s1 and s2 first applies the strategy s1 to the subject term and then s2 to the result of that first application. The strategy fails if either s1 or s2 fails.
Properties.
Sequential composition is associative. Identity is a left and right unit for sequential composition; since id always succeeds and leaves the term alone, it has no additional effect on the strategy that it is composed with. Failure is a left zero for sequential composition; since fail always fails, the next strategy will never be reached.
(s1; s2) ; s3 = s1; (s2; s3)
id; s = s
s; id = s
fail; s = fail
However, it is not the case for all strategies s that failure is a right zero for sequential composition:
s ; fail = fail (is not a law)
Although the composition s; fail will always fail, the execution of s may have side effects that are not performed by fail. For example, consider printing a term in s.
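The following shell session sketches this, using debug from the Stratego Library, which prints the current term as a side effect (the exact output may differ):
stratego> import libstratego-lib
stratego> !Foo(Bar())
Foo(Bar)
stratego> debug; fail
Foo(Bar)
command failed
The term is printed by debug even though the composition as a whole fails.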
Examples.
As an example of the use of sequential composition consider the following rewrite rules.
stratego> A : P(Z(),x) -> x
stratego> B : P(S(x),y) -> P(x,S(y))
The following session shows the effect of first applying B and then A:
stratego> !P(S(Z()), Z())
P(S(Z),Z)
stratego> B
P(Z,S(Z))
stratego> A
S(Z)
Using the sequential composition of the two rules, this effect can be achieved `in one step':
stratego> !P(S(Z()),Z())
P(S(Z),Z)
stratego> B; A
S(Z)
The following session shows that the application of a composition fails if the second strategy in the composition fails to apply to the result of the first:
stratego> !P(S(Z()),Z())
P(S(Z),Z)
stratego> B; B
command failed
Choosing between rules to apply is achieved using one of several choice combinators, all of which are based on the guarded choice combinator. The common approach is that failure to apply one strategy leads to backtracking to an alternative strategy.
The left choice or deterministic choice s1 <+ s2 tries to apply s1 and s2 in that order. That is, it first tries to apply s1, and if that succeeds the choice succeeds. However, if the application of s1 fails, s2 is applied to the original term.
Properties.
Left choice is associative. Identity is a left zero for left choice; since id always succeeds, the alternative strategy will never be tried. Failure is a left and right unit for left choice; since fail always fails, the choice will always backtrack to the alternative strategy, and use of fail as alternative strategy is pointless.
(s1 <+ s2) <+ s3 = s1 <+ (s2 <+ s3)
id <+ s = id
fail <+ s = s
s <+ fail = s
However, identity is not a right zero for left choice. That is, it is not the case for all strategies s that
s <+ id = s (is not a law)
The expression s <+ id always succeeds, even (especially) in the case that s fails, in which case the right-hand side of the equation fails.
Local Backtracking.
The left choice combinator is a local backtracking combinator. That is, the choice is committed once the left-hand side strategy has succeeded, even if the continuation strategy fails. This is expressed by the fact that the property
(s1 <+ s2); s3 = (s1; s3) <+ (s2; s3) (is not a law)
does not hold for all s1, s2, and s3. The difference is illustrated by the following applications:
stratego> !P(S(Z()),Z())
P(S(Z),Z)
stratego> (B <+ id); B
command failed
stratego> !P(S(Z()),Z())
P(S(Z),Z)
stratego> (B <+ id)
P(Z,S(Z))
stratego> B
command failed
stratego> (B; B) <+ (id; B)
P(Z,S(Z))
In the application of (B <+ id); B, the first application of B succeeds, after which the choice is committed. The subsequent application of B then fails. This is equivalent to first applying (B <+ id) and then applying B to the result. The application of (B; B) <+ (id; B), however, is successful; the application of B; B fails, after which the choice backtracks to id; B, which succeeds.
Choosing between Transformations.
The typical use of left choice is to create a composite strategy trying one from several possible transformations. If the strategies that are composed are mutually exclusive, that is, don't succeed for the same terms, their sum is a transformation that (deterministically) covers a larger set of terms. For example, consider the following two rewrite rules:
stratego> PlusAssoc : Plus(Plus(e1, e2), e3) -> Plus(e1, Plus(e2, e3))
stratego> PlusZero : Plus(Int("0"),e) -> e
These rules are mutually exclusive, since there is no term that matches the left-hand sides of both rules. Combining the rules with left choice into PlusAssoc <+ PlusZero creates a strategy that transforms terms matching either rule, as illustrated by the following applications:
stratego> !Plus(Int("0"),Int("3"))
Plus(Int("0"),Int("3"))
stratego> PlusAssoc
command failed
stratego> PlusAssoc <+ PlusZero
Int("3")
stratego> !Plus(Plus(Var("x"),Int("42")),Int("3"))
Plus(Plus(Var("x"),Int("42")),Int("3"))
stratego> PlusZero
command failed
stratego> PlusAssoc <+ PlusZero
Plus(Var("x"),Plus(Int("42"),Int("3")))
Ordering Overlapping Rules.
When two rules or strategies are mutually exclusive, the order of applying them does not matter. In cases where strategies are overlapping, that is, succeed for the same terms, the order becomes crucial to determining the semantics of the composition. For example, consider the following rewrite rules reducing applications of Mem:
stratego> Mem1 : Mem(x,[]) -> False()
stratego> Mem2 : Mem(x,[x|xs]) -> True()
stratego> Mem3 : Mem(x,[y|ys]) -> Mem(x,ys)
Rules Mem2 and Mem3 have overlapping left-hand sides. Rule Mem2 only applies if the first argument is equal to the head element of the list in the second argument. Rule Mem3 always applies if the list in the second argument is non-empty.
stratego> !Mem(1, [1,2,3])
Mem(1, [1,2,3])
stratego> Mem2
True
stratego> !Mem(1, [1,2,3])
Mem(1,[1,2,3])
stratego> Mem3
Mem(1,[2,3])
In such situations, depending on the order of the rules, different results are produced. (The rules form a non-confluent rewriting system.) By ordering the rules as Mem2 <+ Mem3, rule Mem2 is tried before Mem3, and we have a deterministic transformation strategy.
Try.
A useful application of <+ in combination with id is the reflexive closure of a strategy s:
try(s) = s <+ id
The user-defined strategy combinator try tries to apply its argument strategy s, but if that fails, just succeeds using id.
Sometimes it is not desirable to backtrack to the alternative specified in a choice. Rather, after passing a guard, the choice should be committed. This can be expressed using the guarded left choice operator s1 < s2 + s3. If s1 succeeds, s2 is applied, else s3 is applied. If s2 fails, the complete expression fails; no backtracking to s3 takes place.
Properties.
This combinator is a generalization of the left choice combinator <+.
s1 <+ s2 = s1 < id + s2
The following laws make clear that the `branches' of the choice are selected by the success or failure of the guard:
id < s2 + s3 = s2
fail < s2 + s3 = s3
If the right branch always fails, the construct reduces to the sequential composition of the guard and the left branch.
s1 < s2 + fail = s1; s2
Guarded choice is not associative:
(s1 < s2 + s3) < s4 + s5 = s1 < s2 + (s3 < s4 + s5) (not a law)
To see why, consider the possible traces of these expressions. For example, when s1 and s2 succeed subsequently, the left-hand side expression calls s4, while the right-hand side expression does not.
However, sequential composition distributes over guarded choice from left and right:
(s1 < s2 + s3); s4 = s1 < (s2; s4) + (s3; s4)
s0; (s1 < s2 + s3) = (s0; s1) < s2 + s3
Examples.
The guarded left choice operator is most useful for the implementation of higher-level control-flow strategies. For example, the negation not(s) of a strategy s succeeds if s fails, and fails when it succeeds:
not(s) = s < fail + id
Since failure discards the effect of a (successful) transformation, this has the effect of testing whether s succeeds. So we have the following laws for not:
not(id) = fail
not(fail) = id
However, side effects performed by s are not undone, of course. Therefore, the following equation does not hold:
not(not(s)) = s (not a law)
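The following session illustrates not, assuming the definition above has been entered in the shell:
stratego> not(s) = s < fail + id
stratego> !Foo(Bar())
Foo(Bar)
stratego> not(fail)
Foo(Bar)
stratego> not(id)
command failed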
Another example of the use of guarded choice is the restore-always combinator:
restore-always(s, r) = s < r + (r; fail)
It applies a `restore' strategy r after applying a strategy s, even if s fails, and preserves the success/failure behaviour of s. Since fail discards the transformation effect of r, this is mostly useful for ensuring that some side-effecting operation is done (or undone) after applying s.
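For instance, to guarantee that a trace message is emitted after s whether or not it succeeds, one could write the following sketch, where log-done is a hypothetical side-effecting strategy; wrapping it in where ensures that the restore step does not change the resulting term:
logged(s) = restore-always(s, where(log-done))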
For other applications of guarded choice, Stratego provides syntactic sugar.
The guarded choice combinator is similar to the traditional if-then-else construct of programming languages. The difference is that the `then' branch applies to the result of the application of the condition. Stratego's if s1 then s2 else s3 end construct is more like the traditional construct, since both branches apply to the original term. The condition strategy is only used to test whether it succeeds or fails; its transformation effect is undone. However, the condition strategy s1 is still applied to the current term. The if s1 then s2 end strategy is similar; if the condition fails, the strategy succeeds.
Properties.
The if-then-else-end strategy is just syntactic sugar for a combination of guarded choice and the where combinator:
if s1 then s2 else s3 end = where(s1) < s2 + s3
The strategy where(s) succeeds if s succeeds, but returns the original subject term. The implementation of the where combinator will be discussed in ?. The following laws show that the branches are selected by success or failure of the condition:
if id then s2 else s3 end = s2
if fail then s2 else s3 end = s3
The if-then-end strategy is an abbreviation for the if-then-else-end with the identity strategy as right branch:
if s1 then s2 end = where(s1) < s2 + id
Examples.
The inclusive or or(s1, s2) succeeds if one of the strategies s1 or s2 succeeds, but guarantees that both are applied, in the order s1 first, then s2:
or(s1, s2) =
if s1 then try(where(s2)) else where(s2) end
This ensures that any side effects are always performed, in contrast to s1 <+ s2, where s2 is only executed if s1 fails. (Thus, left choice implements a short-circuit Boolean or.)
Similarly, the following and(s1, s2) combinator is the non-short-circuit version of Boolean conjunction:
and(s1, s2) =
if s1 then where(s2) else where(s2); fail end
The switch construct is an n-ary branching construct similar to its counterparts in other programming languages. It is defined in terms of guarded choice. The switch construct has the following form:
switch s0
case s1 : s1'
case s2 : s2'
...
otherwise : sdef
end
The switch first applies the s0 strategy to the current term t, resulting in a term t'. Then it tries the cases in turn, applying each si to t'. As soon as this succeeds, the corresponding case is selected and si' is applied to t, the term to which the switch was applied. If none of the cases applies, the default strategy sdef from the otherwise is applied.
Properties.
The switch construct is syntactic sugar for a nested if-then-else:
{x : where(s0 => x);
if <s1> x
then s1'
else if <s2> x
then s2'
else if ...
then ...
else sdef
end
end
end}
This translation uses a couple of Stratego constructs that we haven't discussed so far.
Examples.
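A simple use of switch is to classify a term by its outermost constructor. The following is a sketch with hypothetical constructor names; since s0 is id, the cases test the subject term itself, and the selected branch builds a descriptive string:
classify =
  switch id
    case ?Plus(_, _) : !"addition"
    case ?Times(_, _) : !"multiplication"
    otherwise : !"other"
  end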
The deterministic left choice operator prescribes that the left alternative should be tried before the right alternative, and that the latter is only used if the first fails. There are applications where it is not necessary to define the order of the alternatives. In those cases non-deterministic choice can be used.
The non-deterministic choice operator s1 + s2 chooses one of the two strategies s1 or s2 to apply, such that the one it chooses succeeds. If both strategies fail, then the choice fails as well. Operationally, the choice operator first tries one strategy and, if that fails, tries the other. The order in which this is done is undefined, i.e., arbitrarily chosen by the compiler.
The + combinator is used to model modular composition of rewrite rules and strategies with the same name. Multiple definitions with the same name are merged into a single definition with that name, where the bodies are composed with +. The following transformation illustrates this:
f = s1 f = s2 ==> f = s1 + s2
This transformation is somewhat simplified; the complete transformation needs to take care of local variables and parameters.
While the + combinator is used internally by the compiler for this purpose, programmers are advised not to use this combinator, but rather to use the left choice combinator <+ to avoid surprises.
Repeated application of a strategy can be achieved with recursion. There are two styles for doing this: with a recursive definition or using the fixpoint operator rec. A recursive definition is a normal strategy definition with a recursive call in its body.
f(s) = ... f(s) ...
Another way to define recursion is using the fixpoint operator rec x(s), which recurses on applications of x within s. For example, the definition
f(s) = rec x(... x ...)
is equivalent to the one above. The advantage of the rec operator is that it allows the definition of an unnamed strategy expression to be recursive. For example, in the definition
g(s) = foo; rec x(... x ...); bar
the strategy between foo and bar is a recursive strategy that does not recurse to g(s).
Originally, the rec operator was the only way to define recursive strategies. It is still in the language, in the first place because it is widely used in many existing programs, and in the second place because it can be a concise expression of a recursive strategy, since call parameters are not included in the call. Furthermore, all free variables remain in scope.
Examples.
The repeat strategy applies a transformation s until it fails. It is defined as a recursive definition using try as follows:
try(s) = s <+ id
repeat(s) = try(s; repeat(s))
An equivalent definition using rec is:
repeat(s) = rec x(try(s; x))
The following Stratego Shell session illustrates how it works. We first define the strategies:
stratego> try(s) = s <+ id
stratego> repeat(s) = try(s; repeat(s))
stratego> A : P(Z(),x) -> x
stratego> B : P(S(x),y) -> P(x,S(y))
Then we observe that repeated application of the individual rules A and B reduces terms:
stratego> !P(S(Z()),Z())
P(S(Z),Z)
stratego> B
P(Z,S(Z))
stratego> A
S(Z)
We can automate this using the repeat strategy, which will apply the rules as often as possible:
stratego> !P(S(Z()),Z())
P(S(Z),Z)
stratego> repeat(A <+ B)
S(Z)
stratego> !P(S(S(S(Z()))),Z())
P(S(S(S(Z))),Z)
stratego> repeat(A <+ B)
S(S(S(Z)))
To illustrate the intermediate steps of the transformation we can use debug from the Stratego Library.
stratego> import libstratego-lib
stratego> !P(S(S(S(Z()))),Z())
P(S(S(S(Z))),Z)
stratego> repeat(debug; (A <+ B))
P(S(S(S(Z))),Z)
P(S(S(Z)),S(Z))
P(S(Z),S(S(Z)))
P(Z,S(S(S(Z))))
S(S(S(Z)))
S(S(S(Z)))
A Library of Iteration Strategies.
Using sequential composition, choice, and recursion a large variety of iteration strategies can be defined. The following definitions are part of the Stratego Library (in module strategy/iteration).
repeat(s) =
rec x(try(s; x))
repeat(s, c) =
(s; repeat(s, c)) <+ c
repeat1(s, c) =
s; (repeat1(s, c) <+ c)
repeat1(s) =
repeat1(s, id)
repeat-until(s, c) =
s; if c then id else repeat-until(s, c) end
while(c, s) =
if c then s; while(c, s) end
do-while(s, c) =
s; if c then do-while(s, c) end
The following equations describe some relations between these strategies:
do-while(s, c) = repeat-until(s, not(c))
do-while(s, c) = s; while(c, s)
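For example, using the rules A and B from earlier in this chapter, repeat1(s, c) applies B as often as possible and then finishes with A (a sketch, assuming the definitions above are available in the shell):
stratego> !P(S(Z()),Z())
P(S(Z),Z)
stratego> repeat1(B, A)
S(Z)
Here B is applied once, the recursive call fails to apply B again, and the choice falls back to A.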
We have seen that rules and strategies can be combined into more complex strategies by means of strategy combinators. Cumulative effects are obtained by sequential composition of strategies using the s1 ; s2 combinator. Choice combinators allow a strategy to decide between alternative transformations. While Stratego provides a variety of choice combinators, they are all based on the guarded choice combinator s1 < s2 + s3. Repetition of transformations is achieved using recursion, which can be expressed through recursive definitions or the rec combinator.
Next up: ? shows the stuff that rewrite rules are made of.
In previous chapters we have presented rewrite rules as basic transformation steps. However, rules are not really atomic transformation actions. To see this, consider what happens when the rewrite rule
DAOL : And(Or(x, y), z) -> Or(And(x, z), And(y, z))
is applied. First it matches the subject term against the pattern And(Or(x, y), z) in the left-hand side. This means that a substitution for the variables x, y, and z is sought that makes the pattern equal to the subject term. If the match fails, the rule fails. If the match succeeds, the pattern Or(And(x, z), And(y, z)) on the right-hand side is instantiated with the bindings found during the match of the left-hand side. The instantiated term then replaces the original subject term. Furthermore, the rule limits the scope of the variables occurring in the rule. That is, the variables x, y, z are local to this rule. After the rule is applied the bindings to these variables are invisible again.
Thus, rather than considering rules as the atomic actions of transformation programs, Stratego provides their constituents, that is building terms from patterns and matching terms against patterns, as atomic actions, and makes these available to the programmer. In this chapter, you will learn these basic actions and their use in the composition of more complex operations such as various flavours of rewrite rules.
The build operation !p replaces the subject term with the instantiation of the pattern p using the bindings from the environment to the variables occurring in p. For example, the strategy !Or(And(x, z), And(y, z)) replaces the subject term with the instantiation of Or(And(x, z), And(y, z)) using bindings to variables x, y and z.
stratego> !Int("10")
Int("10")
stratego> !Plus(Var("a"), Int("10"))
Plus(Var("a"), Int("10"))
It is possible to build terms with variables. We call this building a term pattern. A pattern is a term with meta-variables. The current term is replaced by an instantiation of pattern p.
stratego> :binding e
e is bound to Var("b")
stratego> !Plus(Var("a"),e)
Plus(Var("a"),Var("b"))
stratego> !e
Var("b")
Pattern matching allows the analysis of terms. The simplest case is matching against a literal term. The match operation ?t matches the subject term against the term t.
Plus(Var("a"),Int("3"))
stratego> ?Plus(Var("a"),Int("3"))
stratego> ?Plus(Int("3"),Var("b"))
command failed
Matching against a term pattern with variables binds those variables to (parts of) the current term. The match strategy ?x compares the current term t to the variable x and binds x to t in the environment. A variable can only be bound once, or to the same term.
Plus(Var("a"),Int("3"))
stratego> ?e
stratego> :binding e
e is bound to Plus(Var("a"),Int("3"))
stratego> !Int("17")
stratego> ?e
command failed
The general case is matching against an arbitrary term pattern. The match strategy ?p compares the current term to the pattern p and adds bindings for the variables in p to the environment. The wildcard _ in a match matches any term.
Plus(Var("a"),Int("3"))
stratego> ?Plus(e,_)
stratego> :binding e
e is bound to Var("a")
Plus(Var("a"),Int("3"))
Patterns may be non-linear. Multiple occurrences of the same variable may occur, and each occurrence matches the same term.
Plus(Var("a"),Int("3"))
stratego> ?Plus(e,e)
command failed
stratego> !Plus(Var("a"),Var("a"))
stratego> ?Plus(e,e)
stratego> :binding e
e is bound to Var("a")
Non-linear pattern matching is a way to test equality of terms. Indeed the equality predicates from the Stratego Library are defined using non-linear pattern matching:
equal = ?(x, x)
equal(|x) = ?x
The equal strategy tests whether the current term is a pair of the same terms. The equal(|x) strategy tests whether the current term is equal to the argument term.
stratego> equal = ?(x, x)
stratego> !("a", "a")
("a", "a")
stratego> equal
("a", "a")
stratego> !("a", "b")
("a", "b")
stratego> equal
command failed
stratego> equal(|x) = ?x
stratego> !Foo(Bar())
Foo(Bar)
stratego> equal(|Foo(Baz()))
command failed
stratego> equal(|Foo(Bar()))
Foo(Bar)
Match and build are first-class citizens in Stratego, which means that they can be used and combined just like any other strategy expressions. In particular, we can implement rewrite rules using these operations, since a rewrite rule is basically a match followed by a build. For example, consider the following combination of match and build:
Plus(Var("a"),Int("3"))
stratego> ?Plus(e1, e2); !Plus(e2, e1)
Plus(Int("3"),Var("a"))
This combination first recognizes a term, binds variables to the pattern in the match, and then replaces the current term with the instantiation of the build pattern. Note that the variable bindings are propagated from the match to the build.
Stratego provides syntactic sugar for various combinations of match and build. We'll explore these in the rest of this chapter.
An anonymous rewrite rule (p1 -> p2) transforms a term matching p1 into an instantiation of p2. Such a rule is equivalent to the sequence ?p1; !p2.
Plus(Var("a"),Int("3"))
stratego> (Plus(e1, e2) -> Plus(e2, e1))
Plus(Int("3"),Var("a"))
Once a variable is bound it cannot be rebound to a different term. Thus, once we have applied an anonymous rule once, its variables are bound and the next time it is applied it only succeeds for the same term. For example, in the next session the second application of the rule fails, because e2 is bound to Int("3") and does not match Var("b").
stratego> !Plus(Var("a"),Int("3"))
Plus(Var("a"),Int("3"))
stratego> (Plus(e1,e2) -> Plus(e2,e1))
Plus(Int("3"),Var("a"))
stratego> :binding e1
e1 is bound to Var("a")
stratego> :binding e2
e2 is bound to Int("3")
stratego> !Plus(Var("a"),Var("b"))
Plus(Var("a"),Var("b"))
stratego> (Plus(e1,e2) -> Plus(e2,e1))
command failed
To use a variable name more than once, Stratego provides term variable scope. A scope {x1,...,xn : s} locally undefines the variables xi. That is, the binding to a variable xi outside the scope is not visible inside it, nor is the binding to xi inside the scope visible outside it. For example, to continue the session above, if we wrap the anonymous swap rule in a scope for its variables, it can be applied multiple times.
stratego> !Plus(Var("a"),Int("3"))
Plus(Var("a"),Int("3"))
stratego> {e3,e4 : (Plus(e3,e4) -> Plus(e4,e3))}
Plus(Var("a"),Int("3"))
stratego> :binding e3
e3 is not bound to a term
stratego> !Plus(Var("a"),Var("b"))
Plus(Var("a"),Var("b"))
stratego> {e3,e4 : (Plus(e3,e4) -> Plus(e4,e3))}
Plus(Var("b"),Var("a"))
Of course we can name such a scoped rule using a strategy definition, and then invoke it by its name:
stratego> SwapArgs = {e1,e2 : (Plus(e1,e2) -> Plus(e2,e1))}
stratego> !Plus(Var("a"),Int("3"))
Plus(Var("a"),Int("3"))
stratego> SwapArgs
Plus(Int("3"),Var("a"))
When using match and build directly in a strategy definition, rather than in the form of a rule, the definition contains free variables. Strictly speaking such variables should be declared using a scope, that is one should write
SwapArgs = {e1,e2 : (Plus(e1,e2) -> Plus(e2,e1))}
However, since declaring all variables at the top of a definition is distracting and does not add much to the definition, such a scope declaration can be left out. Thus, one can write
SwapArgs = (Plus(e1,e2) -> Plus(e2,e1))
instead. The scope is automatically inserted by the compiler. This implies that there is no global scope for term variables. Of course, variables in inner scopes should be declared where necessary. In particular, note that variable scope is not inserted for strategy definitions in a let binding, such as
let SwapArgs = (Plus(e1,e2) -> Plus(e2,e1)) in ... end
While the variables are bound in the enclosing definition, they are not restricted to SwapArgs in this case, since in a let you typically want to use bindings to variables in the enclosing code.
Often it is useful to apply a strategy only to test whether some property holds or to compute some auxiliary result. For this purpose, Stratego provides the where(s) combinator, which applies s to the current term, but restores that term afterwards. Any bindings to variables are kept, however.
Plus(Int("14"),Int("3"))
stratego> where(?Plus(Int(i),Int(j)); <addS>(i,j) => k)
Plus(Int("14"),Int("3"))
stratego> :binding i
i is bound to "14"
stratego> :binding k
k is bound to "17"
With the match and build constructs, where(s) is in fact just syntactic sugar for {x: ?x; s; !x} with x a fresh variable not occurring in s. Thus, the current subject term is saved by binding it to a new variable x, then the strategy s is applied, and finally the original term is restored by building x.
We saw the use of where in the definition of if-then-else in ?.
A simple rewrite rule succeeds if the match of the left-hand side succeeds. Sometimes it is useful to place additional requirements on the application of a rule, or to compute some value for use in the right-hand side of the rule. This can be achieved with conditional rewrite rules. A conditional rule L: p1 -> p2 where s is a simple rule extended with an additional computation s which should succeed in order for the rule to apply. The condition can be used to test properties of terms in the left-hand side, or to compute terms to be used in the right-hand side. The latter is done by binding such new terms to variables used in the right-hand side.
For example, the EvalPlus rule in the following session uses a condition to compute the sum of i and j:
stratego> EvalPlus: Plus(Int(i),Int(j)) -> Int(k) where !(i,j); addS; ?k
stratego> !Plus(Int("14"),Int("3"))
Plus(Int("14"),Int("3"))
stratego> EvalPlus
Int("17")
A conditional rule can be desugared similarly to an unconditional rule. That is, a conditional rule of the form
L : p1 -> p2 where s
is syntactic sugar for
L = ?p1; where(s); !p2
Thus, after the match with p1 succeeds, the strategy s is applied to the subject term. Only if the application of s succeeds is the right-hand side p2 built. Note that since s is applied within a where, the build !p2 is applied to the original subject term; only variable bindings computed within s can be used in p2.
As an example, consider the following constant folding rule, which reduces an addition of two integer constants to the constant obtained by computing the addition.
EvalPlus : Add(Int(i),Int(j)) -> Int(k) where !(i,j); addS; ?k
The addition is computed by applying the strategy addS to the pair of integers (i,j) and matching the result against the variable k, which is then used in the right-hand side. This rule is desugared to
EvalPlus = ?Add(Int(i),Int(j)); where(!(i,j); addS; ?k); !Int(k)
Sometimes it is useful to define a rule anonymously within a strategy expression. The syntax for anonymous rules with scopes is a bit much since it requires enumerating all variables. A `lambda' rule of the form
\ p1 -> p2 where s \
is an anonymous rewrite rule for which the variables in the left-hand side p1 are local to the rule; that is, it is equivalent to an expression of the form
{x1,...,xn : (p1 -> p2 where s)}
with x1,...,xn the variables of p1. This means that any variables used in s and p2 that do not occur in p1 are bound in the context of the rule.
A typical example of the use of an anonymous rule is
stratego> ![(1,2),(3,4),(5,6)]
[(1,2),(3,4),(5,6)]
stratego> map(\ (x, y) -> x \ )
[1,3,5]
One frequently occurring scenario is that of applying a strategy to a term and then matching the result against a pattern. This typically occurs in the condition of a rule. In the constant folding example above we saw this scenario:
EvalPlus : Add(Int(i),Int(j)) -> Int(k) where !(i,j); addS; ?k
In the condition, first the term (i,j) is built, then the strategy addS is applied to it, and finally the result is matched against the pattern k.
To improve the readability of such expressions, the following two constructs are provided. The operation <s> p captures the notion of applying a strategy to a term, i.e., the scenario !p; s. The operation s => p captures the notion of applying a strategy to the current subject term and then matching the result against the pattern p, i.e., s; ?p. The combined operation <s> p1 => p2 thus captures the notion of applying a strategy to a term p1 and matching the result against p2, i.e., !p1; s; ?p2. Using this notation we can improve the constant folding rule above as
EvalPlus : Add(Int(i),Int(j)) -> Int(k) where <addS>(i,j) => k
Applying Strategies in Build.
Sometimes it is useful to apply a strategy directly to a subterm of a pattern, for example in the right-hand side of a rule, instead of computing a value in a condition, binding the result to a variable, and then using the variable in the build pattern. The constant folding rule above, for example, could be further simplified by directly applying the addition in the right-hand side:
EvalPlus : Add(Int(i),Int(j)) -> Int(<addS>(i,j))
This abbreviates the conditional rule above. In general, a strategy application in a build pattern can always be expressed by computing the application before the build and binding the result to a new variable, which then replaces the application in the build pattern.
Another example is the following definition of the map(s) strategy, which applies a strategy to each term in a list:
map(s) : [] -> []
map(s) : [x | xs] -> [<s> x | <map(s)> xs]
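For example, using map together with the inc strategy from the Stratego Library, which increments an integer:
stratego> import libstratego-lib
stratego> ![1,2,3]
[1,2,3]
stratego> map(inc)
[2,3,4]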
Term wrapping and projection are concise idioms for constructing terms that wrap the current term and for extracting subterms from the current term.
One often writes rules of the form \ x -> Foo(Bar(x)) \, i.e., wrapping a term pattern around the current term. Using rule syntax this is quite verbose. The syntactic abstraction of term wraps allows the concise specification of such little transformations as !Foo(Bar(<id>)).
In general, a term wrap is a build strategy !p[<s>] containing one or more strategy applications <s> that are not applied to a term. When executing the build operation, each occurrence of such a strategy application <s> is replaced with the term resulting from applying s to the current subject term, i.e., the one that is being replaced by the build. The following sessions illustrate some uses of term wraps:
3
stratego> !(<id>,<id>)
(3,3)
stratego> !(<Fst; inc>,<Snd>)
(4,3)
stratego> !"foobar"
"foobar"
stratego> !Call(<id>, [])
Call("foobar", [])
stratego> mod2 = <mod>(<id>,2)
stratego> !6
6
stratego> mod2
0
As should by now be a familiar pattern, term wraps are implemented by translation to a combination of match and build expressions. Thus, a term wrap !p[<s>] is translated to a strategy expression
{x: where(s => x); !p[x]}
where x is a fresh variable not occurring in s. In other words, the strategy s is applied to the current subject term, i.e., the term to which the build is applied.
As an example, the term wrap !Foo(Bar(<id>)) is desugared to the strategy
{x: where(id => x); !Foo(Bar(x))}
which after simplification is equivalent to {x: ?x; !Foo(Bar(x))}, i.e., exactly the original lambda rule \ x -> Foo(Bar(x)) \.
Term projections are the match dual of term wraps. Term projections can be used to project a subterm from a term pattern. For example, the expression ?And(<id>,x) matches terms of the form And(t1,t2) and reduces them to the first subterm t1. Another example is the strategy
map(?FunDec(<id>,_,_))
which reduces a list of function declarations to a list of the names of the functions, i.e., the first arguments of the FunDec constructor. Here are some more examples:
[1,2,3]
stratego> ?[_|<id>]
[2,3]
stratego> !Call("foobar", [])
Call("foobar", [])
stratego> ?Call(<id>, [])
"foobar"
Term projections can also be used to apply additional constraints to subterms in a match pattern. For example, ?Call(x, <length => 3>) matches only function calls with three arguments. A match expression ?p[