Compiler for Tiger programming language
written in Standard ML for Duke ECE/CS 553: Compiler Construction.
- Jake Derry
Ryan Piersma- Radu Vasilescu
A group of tigers is either called a "streak" or an "ambush."
Source: Archived copy of Animal group names. Zoological Society of San Diego. Archived from the original on July 4, 2013.
To build:
make
To use the compiler:
./ambush [tiger file]
This will place the output file in the same directory as the source file, with
the extension .s
appended (for example test1.tig
becomes test1.tig.s
).
To run the resulting assembly file using SPIM:
spim -file [assembly file]
To perform this whole process at once, use the test.sh
script. For
example, to build the compiler, compile test 51, and run the result using SPIM,
just execute:
tc=test51.tig ./test.sh
For more information, check the Makefile
.
**Note: This has only been tested on MacOS. If using Linux, Windows, or other OS,
please build the project manually using sml
then CM.make "sources.cm"
. **
To test the compiler, open a terminal in the main folder and run:
sml run.sml
In the resulting SML REPL, execute
Main.compile "testcases/myProgram.tig";
errormsg.sml
provides a signature for creating helpful error messagessml-style-guide.pdf
outlines a style convention for writing SML codesources.cm
is the "Makefile" for the Compilation Manager- The
testcases/
folder contains several Tiger example programs - The
.cm/
folder contains Compilation Manager auto-generated files .gitignore
is used by Git to exclude files
tiger.lex
is our ML-Lex definition file for Tigertiger.lex.sml.ours
is the auto-generated lexer from our owntiger.lex
tiger.lex.sml
is the auto-generated lexer from the TEXTBOOK authortokens.sig
andtokens.sml
are the starter code files for tokens and signatures. They are unused currently, in favor of the auto-generated tokens and signatures created by ML-Yacc.
tiger.grm
is our ML-Yacc grammar definition for Tigertiger.grm.desc
is an auto-generated file from ML-Yacctiger.grm.sig
contains auto-generated code from ML-Yacctiger.grm.sml
contains the auto-generated parser by ML-Yacc
TODO: Finish this description/list
The lexer is responsible for turning source code into tokens.
Ambush handles comments by creating a COMMENT
state which represents being
inside a comment. There is alao a variable called inComment
which keeps track
of whether the lexer is currently inside a COMMENT
state.
These are the relevant few rules:
val inComment : bool ref
<REM>"Enter comment." => (continue ());
<INITIAL>"/*" => (YYBEGIN COMMENT; inComment := true; continue ());
<REM>"Exit comment." => (continue ());
<COMMENT>"*/" => (YYBEGIN INITIAL; inComment := false; continue ());
<REM>"Ignore symbols and reserved words in comments." => (continue ());
<COMMENT>. => (continue ());
The inComment
variable is used to detect comments that are left unclosed at
the end of the file (see Error Handling section for more details).
We handle strings by using two additional states: the STRING
state which
represents being inside a string and the ESCAPE
state which represents being
inside an escape character.
After entry into the STRING
state, all characters reached are stored in a
currentString
variable. When existing the STRING
state, the currentString
is used to create the new token which represents a strin.
Additionally, we deal with escape characters after entering the ESCAPE
state, adding the appropriate escape character to the currentString
variable
after identifying the escape character with the character(s) following the
backslash.
If the lexer encounters an invalid character anywhere in any state (that is to
say, any portion of code not already matched by a different rule), it presents
an ErrorMsg
, and does not emit a token for that portion of code:
. => (ErrorMsg.error yypos
("illegal character " ^ yytext ^ "(ASCII "
^ (Int.toString (Char.ord (hd (String.explode yytext))))
^ ")");
continue ());
At EOF
(End-Of-File), we detect any unfinished strings or comments in the
eof
function. If ending in one of these non-accepting states, we raise an
error message that indicates whether the program was ended with an unfinished
string or comment. We always emit the EOF
token whether an error was reported
or not:
(* Deals with reaching the end of file. *)
fun eof () =
let val pos = hd (!linePos)
in case (!currentString, !inComment)
of ("", false) => Tokens.EOF (pos, pos)
| ("", true) => (ErrorMsg.error pos
("Expected end of comment, \
\ found EOF");
Tokens.EOF (pos, pos))
| (_, _) => (ErrorMsg.error pos
("Expected end of string, \
\ found EOF");
Tokens.EOF (pos, pos))
end
Implemented a parser that takes in a set of tokens and outputs an AST.
The parser currently has neither shift/reduce nor reduce/reduce conflicts, and seems to parse the Tiger test cases correctly. For example, the following Tiger program:
/* an array type and an array variable */
let
type arrtype = array of int
var arr1:arrtype := arrtype [10] of 0
in
arr1
end
Parses to the tree:
LetExp([
VarDec(arr1,true,SOME(arrtype),
ArrayExp(arrtype,
IntExp(10),
IntExp(0))),
TypeDec[
(arrtype,
ArrayTy(int))]],
VarExp(
SimpleVar(arr1)))
Our type checker helps the user find typing issues within their program to help debug their programs when the type checker fails. In addition, there are some conditions when the type checker fails because of an internal issue. These conditions are noted by raising an exception that causes the compiler to crash and currently, these conditions are unreachable.
After type checking the entire program (which helps users find the issues within their programs), the rest of the compilation process does not continue.
The type checker produces helpful error messages to make debugging as easy as
possible. A lot of the type checking error outputs are based on the outputs
that SML gives. For example, here is the error message produced for test22.tig
.
let
type rectype = {name:string , id:int}
var rec1 := rectype {name="Name", id=0}
in
rec1.nam := "asd"
end
Error message:
testcases/test22.tig:7.2:Type Checking Error: Could not find field nam
Expected: { nam : 'a, ...}
Actual: { id : int, name : string, }
The next stage is conversion to Intermediate Representation, a format in which the code is represented as a set of trees consisting of basic operations. For example, the following Tiger code:
let var a := 7
var b := 9
var c := 3
in c := a + b
end
Translates to the following IR (comments added for explanation):
(* Assign initial values to variables on stack *)
MOVE(MEM (BINOP (PLUS, TEMP tt25, CONST ~4)),
CONST 7)
MOVE(MEM (BINOP (PLUS, TEMP tt25, CONST ~8)),
CONST 9)
MOVE(MEM (BINOP (PLUS, TEMP tt25, CONST ~12)),
CONST 3)
(* do c <- a + b, where (a, b, c) are on stack *)
MOVE(MEM (BINOP (PLUS, TEMP tt25, CONST ~12)),
BINOP(PLUS, MEM (BINOP (PLUS, TEMP tt25, CONST ~4)),
MEM (BINOP (PLUS, TEMP tt25, CONST ~8))))
In order to produce MIPS assembly language instructions, the IR must be converted to instructions through the Instruction Selection process.
For example, Instruction Selection produces the following output for the IR above:
PROCEDURE L0
L0:
L2:
addi t0, r0, 7
sw t0, ~4(t25)
addi t1, r0, 9
sw t1, ~8(t25)
addi t2, r0, 3
sw t2, ~12(t25)
lw t4, ~4(t25)
lw t5, ~8(t25)
add t3, t4, t5
sw t3, ~12(t25)
j L1
L1:
END L0
The liveness analysis stage first builds a control-flow graph of the program, and then computes the "liveness" of each temp at every node in the graph. Then, it generates an interference graph, where every node is a temp and every edge signifies that those temps are live at the same time.
Here is an example of a Liveness Analysis on the following Tiger program:
let var a := 0
in while(a < 10) do a := a + 1; nil
end
Control-flow Graph:
Liveness Results:
Node: nid = 0 liveIn: , t25 liveOut: , t25 move: N/A
Node: nid = 1 liveIn: , t25 liveOut: , t1, t25 move: N/A
Node: nid = 2 liveIn: , t1, t25 liveOut: , t25 move: N/A
Node: nid = 3 liveIn: , t25 liveOut: , t25 move: N/A
Node: nid = 4 liveIn: , t25 liveOut: , t2, t25 move: N/A
Node: nid = 5 liveIn: , t2, t25 liveOut: , t2, t4, t25 move: N/A
Node: nid = 6 liveIn: , t2, t4, t25 liveOut: , t2, t4, t5, t25 move: N/A
Node: nid = 7 liveIn: , t2, t4, t5, t25 liveOut: , t2, t3, t25 move: N/A
Node: nid = 8 liveIn: , t2, t3, t25 liveOut: , t25 move: N/A
Node: nid = 9 liveIn: liveOut: move: N/A
Node: nid = 10 liveIn: liveOut: move: N/A
Node: nid = 11 liveIn: , t25 liveOut: , t25 move: N/A
Node: nid = 12 liveIn: , t25 liveOut: , t6, t25 move: N/A
Node: nid = 13 liveIn: , t6, t25 liveOut: , t0, t25 move: t0 <- t6
Node: nid = 14 liveIn: , t0, t25 liveOut: , t0, t8, t25 move: N/A
Node: nid = 15 liveIn: , t0, t8, t25 liveOut: , t0, t25 move: N/A
Node: nid = 16 liveIn: , t0, t25 liveOut: , t0, t9, t25 move: N/A
Node: nid = 17 liveIn: , t0, t9, t25 liveOut: , t25 move: N/A
Node: nid = 18 liveIn: , t25 liveOut: , t25 move: N/A
Node: nid = 19 liveIn: liveOut: move: N/A
Which then generates the interference graph:
The Register Allocation phase of the compiler applies a graph-coloring algorithm to the interference graph in order to "color" (assign) temps to certain physical registers. The idea is that multiple temps can be assigned to the same physical register, so long as they do not interfere with one another (share an edge in the interference graph AKA are live at the same time).
In addition, we had to keep track of pre-allocated registers such as special machine registers like the frame pointer, return address, and so on.
We have not implemented spilling or coalescing, meaning that for now, we can only handle compiling Tiger programs that use a limited number of temps-- if they try to use too many temps, we won't have room in the physical registers and are not yet able to spill to memory.
One other improvement that we made during this stage is that instead of creating a series of
MOVE
s to save and restore the caller-saved registers before and after a function call,
the compiler now saves those registers' values to local variables allocated within the
current frame, which saves a lot of register space and helps raise the limit to spilling.
Here is an example of the register allocater at work. The following Tiger program:
let var a := 0
in while(a < 10) do a := a + 1; nil
end
Compiles to the following MIPS assembly with correct physical registers allocated:
.text
j L0
.text
# PROCEDURE L0
L0:
L5:
addi $t1, $0, 0
sw $t1, -4($fp)
L2:
addi $v0, $0, 1
lw $a0, -4($fp)
addi $a1, $0, 10
slt $a0, $a0, $a1
beq $v0, $a0, L3
b L1
L1:
j L4
L3:
lw $t0, -4($fp)
addi $a3, $t0, 1
addi $a2, $a3, 0
sw $a2, -4($fp)
j L2
L4:
# END L0
To finish, we added some miscellaneous fixes throughout the project.
Functions now get their arguments/formals from the right access
es, and follow
proper calling conventions for saving and restoring registers.
We also (fixed and then) added the Tiger runtime library as provided by Professor
Hilton, and augmented it with a few of our own functions, such as print_int
,
which use MIPS syscalls to easily accomplish what would otherwise have been complicated
features to implement in plain Tiger.
We also revamped the project's build and testing system by switching to make
and ml-build
to package a heap snapshot of the SMLofNJ environment for ease
of use.
We had an idea to implement a musical compiler that produces sound output
relevant to each stage of the compilation process as it runs. Upon realizing
the sheer stupidity of this idea, it has been quarantined to its own branch.
To play with this feature, checkout the branch musical
and see its version
of the README.md
.
No guarantees are made that the musical
branch will be maintained or updated
past the state that it was as of 2020-02-27 at 12:15 PM. As of now, it should
not be considered part of our official submission.
- SML Formatter
- Tiger formatter
- Tiger VS Code extension
- Coloring formatting for (let, in, end)
- Garbage collector
- Rich error reporting
- Tiger REPL
- More/better Tiger libraries
- Math library
- Data structures library
- Optimized compiliation for different processors
- Tiger web framework (integration)? See
SML on Stilts
...
Do you want us to send the cocaine directly to your email? -- Jake Derry, 2020, somehow contextually related to this project