Long ago, I created a 32-bit addressable virtual machine with a few simple instructions to practice writing assembly. At the time I had no idea how to build a compiler: I could write assembly and I could write a parser, but I didn't know how to put a compiler together.
Years have gone by, and now I feel like I can actually make a compiler myself. So I did, and this is the result. It's cursed, but it is fun.
Before we get into the fun part, let's take a look at the basics of the language.
The only datatype in natolang is a 32-bit int (since the target VM is 32-bit addressable).
123; # => 123
1.2; # bad token '.', won't compile.
'a'; # => 97
'aa'; # => 1067 (97 * 10 + 97)
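The 'aa' result suggests that a multi-character literal is folded left to right, multiplying the running value by 10 before adding each character code. Here is a minimal Python sketch of that rule. It is inferred only from the single example above, the helper name char_literal is hypothetical, and the real compiler may behave differently for other inputs.

```python
def char_literal(chars: str) -> int:
    """Fold a character literal into one integer, one decimal shift per character."""
    value = 0
    for ch in chars:
        # Shift the running value by one decimal digit, then add the char code.
        value = value * 10 + ord(ch)
    return value


print(char_literal("a"))   # 97
print(char_literal("aa"))  # 1067, i.e. 97 * 10 + 97
```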
Maths. Like any other programming language, we need basic math.
10 + 10; # => 20 (ADD)
10 - 5; # => 5 (SUB)
5 * 2; # => 10 (MUL)
5 / 2; # => 2 (DIV)
10 % 4; # => 2 (MOD)
!0; # => 1 (NOT)
!1; # => 0 (NOT)
We got logic too. Pretty much the same as any other programming language, except we use & and | for logical AND and OR.
1 & 1; # => 1 (AND) (not bitwise)
0 & 1; # => 0 (AND) (not bitwise)
1 | 0; # => 1 (OR) (not bitwise)
1 == 1; # => 1 (EQ)
2 > 1; # => 1 (GT)
2 < 1; # => 0 (LT)
2 >= 1; # => 1 (GE)
2 <= 1; # => 0 (LE)
2 != 1; # => 1 (NE)
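To make "not bitwise" concrete, here is a small Python comparison of logical and bitwise operators. It assumes that natolang treats any non-zero value as true and yields 0 or 1, which the examples above suggest but do not fully confirm. The helper names logical_and and logical_or are hypothetical.

```python
def logical_and(a: int, b: int) -> int:
    """1 if both operands are non-zero, else 0 (assumed natolang `&` behaviour)."""
    return int(a != 0 and b != 0)


def logical_or(a: int, b: int) -> int:
    """1 if either operand is non-zero, else 0 (assumed natolang `|` behaviour)."""
    return int(a != 0 or b != 0)


# The two flavours disagree as soon as an operand is something other than 0 or 1.
print(logical_and(2, 1), 2 & 1)  # 1 0
print(logical_or(2, 0), 2 | 0)   # 1 2
```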
Variable declarations look similar to C, but the way they work is fundamentally different. We will talk about that later.
var a = 0;
var b[10] = "nawoji\n";
var c[10] = { 'n', 'a', 't', 'o', '\n', 0 };
Note that we don't have n-D arrays, only 1-D arrays. Also, initializer lists and string literals are not pointer types. There's no such thing as a pointer in natolang. More on that later.
You can use assignment operators on variables:
var a = 10;
a += 10;
a -= 2;
a *= 4;
a /= 2;
a %= 3;
And also increment/decrement:
var a = 10;
a++; ++a;
a--; --a;
Note that the good old "pre-increment (++i) is faster than post-increment (i++)" is true in natolang, since the compiler won't optimize post-increment to pre-increment when the value is not used.
You may get the address of a variable with the & operator, or use the value of a variable as an address with the * operator:
var a = 0;
var b = &a;
*b = 1;
printi(a); # prints "1"
Note that b is not a pointer type: * just allows you to use the value of a variable as an address, the & operator simply returns the address of that variable, and "arrays" are not pointers. More specifically:
var a[10] = {100, 200, 300};
printi(a); # prints "100"
*(&a+1) = 10;
printi(a[1]); # prints "10"
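A rough way to picture this is a flat list of integer cells, where & gives you an index and * reads or writes the cell at an index. The Python sketch below models only that idea. The address ADDR_A and the helpers load and store are hypothetical, not something the VM exposes.

```python
# Flat, word-addressed memory: every cell holds one integer.
memory = [0] * 16

# Hypothetical layout: the array `a` occupies cells 0..9.
ADDR_A = 0
memory[ADDR_A:ADDR_A + 3] = [100, 200, 300]  # var a[10] = {100, 200, 300};


def load(addr: int) -> int:
    """Models reading `*addr`."""
    return memory[addr]


def store(addr: int, value: int) -> None:
    """Models `*addr = value`."""
    memory[addr] = value


print(load(ADDR_A))      # printi(a)    -> 100
store(ADDR_A + 1, 10)    # *(&a+1) = 10;
print(load(ADDR_A + 1))  # printi(a[1]) -> 10
```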
We got conditionals and loops. They should work as you'd expect.
var a;
for (a = 0; a < 5; ++a) {
var b = 0;
printi(a);
if (a == 4) break;
else if (a == 3) continue;
while (b < 5) {
++b;
printi(b);
break;
}
}
And we got goto:
lbl1:
prints("Hello\n");
goto lbl1;
Function declaration looks like bash, and you access function parameters like bash:
fun fib {
# use $1, $2, ... to refer to function arguments.
if ($1 <= 1) return $1;
# the last statement/expression in a function automatically becomes return
# value.
fib($1-1) + fib($1-2);
}
fib(10); # invoke function.
The argument count can be accessed with $$, and function arguments can be accessed with $(expression), like this:
fun printvars {
var i;
prints("number of args: ");
printi($$);
printc('\n');
for (i = 1; i <= $$; ++i) printi($(i));
printc('\n');
}
printvars(1, 1, 4, 5, 1, 4); # prints "114514"
We got some built-in functions (actually, they are handled as operators):
# takes one parameter, print the result as an integer.
printi(n);
printi(1+1);
# takes one parameter, prints the result as a character.
printc(n);
printc('a' + 1);
# prints takes one parameter and prints it as a string; the parameter must be a
# variable or a string literal.
prints(n);
prints("Hello, world!\n");
# getchar. takes no parameter, get and return one char from stdin.
c = getc();
# returns the number of 32-bit memory blocks allocated for this variable.
sz = sizeof(c);
# exit.
exit();
Variables in { } are simply scoped variables, not local variables. Think of them as C static variables. For example, for the following code:
var a;
for (a = 0; a < 5; ++a) {
var b;
++b;
printi(b);
printc(' ');
}
The output will be 1 2 3 4 5. Note that the = in the variable declaration is not initialization. It is an assignment. So if you do:
var a;
for (a = 0; a < 5; ++a) {
var b = 0;
++b;
printi(b);
printc(' ');
}
You will get 1 1 1 1 1.
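The two loops above can be mimicked in Python by giving b a single storage slot that outlives the loop body. This is only a model of the behaviour described here, not of how the compiler allocates storage. The names run_loop and slot_b are hypothetical.

```python
def run_loop(assign_zero_each_time: bool) -> str:
    # `var b` gets one fixed storage slot, allocated once, not once per iteration.
    slot_b = 0
    output = []
    for a in range(5):
        if assign_zero_each_time:
            slot_b = 0  # `var b = 0;` is an assignment, executed on every pass
        slot_b += 1     # ++b;
        output.append(str(slot_b))
    return " ".join(output)


print(run_loop(False))  # 1 2 3 4 5
print(run_loop(True))   # 1 1 1 1 1
```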
Functions can be nested:
fun fa {
fun fb {
$1 + 1;
}
fb($1) + 2;
}
Nested functions are basically scoped variables. They can't be accessed from the outside (so you can't invoke fb outside fa).
And we use # for comments, if you haven't already noticed.
The first thing you need to know is that the idea of a pointer/reference does not exist in natolang, in the sense that there's no actual pointer type. (I'm lazy.)
So, when you do this:
var a = 0;
var c[10] = { 'n', 'a', 't', 'o', '\n', 0 };
The compiler actually puts nato\n\0 at the variable's location.
And arrays are just normal variables with some extra memory space appended to them (the size of the space is specified with []). So in the data region, the variables a and c shown above look like this:
Address   0   1   2   3   4   5   6   7   8   9   10
Variable  a   c   c   c   c   c   c   c   c   c   c
Value     \0  n   a   t   o   \n  \0  \0  \0  \0  \0
And this is why we don't have n-D arrays: what looks like an array is just one big variable.
Also, you can do this:
var v1 = { 1, 2, 3 };
var v2;
var v3;
prints("v1: "); printi(v1); prints(", ");
prints("v2: "); printi(v2); prints(", ");
prints("v3: "); printi(v3); prints(", ");
prints("v2[1]: "); printi(v2[1]); printc('\n');
Running it yields the following output:
v1: 1, v2: 2, v3: 3, v2[1]: 3
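One way to see why the initializer spills over is to model the data region as a flat list. The sketch assumes that variables are laid out back to back in declaration order, as the address table earlier suggests. The addresses and the helpers write_list and read are hypothetical.

```python
memory = [0] * 8

# Assumed layout: v1, v2 and v3 each get one cell, in declaration order.
addr = {"v1": 0, "v2": 1, "v3": 2}


def write_list(name: str, values: list) -> None:
    """Copy an initializer list to the variable's address, with no bounds check."""
    base = addr[name]
    for offset, value in enumerate(values):
        memory[base + offset] = value


def read(name: str, index: int = 0) -> int:
    """Models `name` (index 0) or `name[index]`."""
    return memory[addr[name] + index]


write_list("v1", [1, 2, 3])  # var v1 = { 1, 2, 3 };
print(read("v1"), read("v2"), read("v3"), read("v2", 1))  # 1 2 3 3
```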
How could this be fun? Let's take a look at how functions in this cursed language work before we get into that.
If you define a variable with ASM in it:
var asm[20] = {
40, # SRS SubRoutine Start
2, 'H', # IMM 'H' IMMediate
44, # PAC Print Accumulator Char
2, 'e', # IMM 'e' IMMediate
44, # PAC Print Accumulator Char
2, 'l', # IMM 'l' IMMediate
44, # PAC Print Accumulator Char
2, 'l', # IMM 'l' IMMediate
44, # PAC Print Accumulator Char
2, 'o', # IMM 'o' IMMediate
44, # PAC Print Accumulator Char
2, '\n', # IMM '\n' IMMediate
44, # PAC Print Accumulator Char
39, # SRE SubRoutine End
};
Guess what? You can invoke it.
asm(); # => Hello
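To show why those numbers behave like a function, here is a tiny Python interpreter for just the four opcodes used in the listing (values taken from its comments). It is a sketch, not the real VM: SRS is treated as a no-op and SRE simply stops execution, whereas the actual VM also manages the frame and stack pointers. The function name run is hypothetical.

```python
IMM, SRE, SRS, PAC = 2, 39, 40, 44  # opcode numbers from the listing above


def run(code: list) -> str:
    """Execute a code array and return everything it printed."""
    out = []
    acc = 0
    pc = 0
    while pc < len(code):
        op = code[pc]
        if op == SRS:        # subroutine start: frame setup omitted in this sketch
            pc += 1
        elif op == IMM:      # load the next cell into the accumulator
            acc = code[pc + 1]
            pc += 2
        elif op == PAC:      # print the accumulator as a character
            out.append(chr(acc))
            pc += 1
        elif op == SRE:      # subroutine end: stop here
            break
        else:
            raise ValueError(f"unsupported opcode {op} at {pc}")
    return "".join(out)


# Build the same 20 cells as the `asm` variable.
asm = [SRS]
for ch in "Hello\n":
    asm += [IMM, ord(ch), PAC]
asm.append(SRE)

print(run(asm), end="")  # Hello
```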
Yes, a function is just like any other variable. That means you can do this:
fun self_mod {
printi(1);
self_mod[2] = self_mod[2] + 1;
}
Every time you call self_mod, its output will increase by one. The self_mod function compiles to:
00 SRS
01 IMM
02 1
03 PAI
.. ...
XX SRE
By changing self_mod[2], we change the value that IMM will load into the accumulator the next time the function runs. This is a function that modifies itself. Quite fun, isn't it?
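The same effect can be reproduced in a few lines of Python. This sketch keeps only the first four cells from the listing above plus SRE, and lets the host code bump cell 2 instead of compiling the assignment into the function, so it is a simplification. PAI = 43 follows from the enum order (one before PAC = 44). The function name call is hypothetical.

```python
IMM, SRE, SRS, PAI = 2, 39, 40, 43


def call(code: list) -> list:
    """Run a code array and return the integers it printed."""
    acc, pc, printed = 0, 0, []
    while code[pc] != SRE:
        op = code[pc]
        if op == SRS:      # frame setup omitted in this sketch
            pc += 1
        elif op == IMM:    # load the operand cell into the accumulator
            acc = code[pc + 1]
            pc += 2
        elif op == PAI:    # print the accumulator as an integer
            printed.append(acc)
            pc += 1
        else:
            raise ValueError(f"unsupported opcode {op} at {pc}")
    return printed


self_mod = [SRS, IMM, 1, PAI, SRE]

outputs = []
for _ in range(3):
    outputs += call(self_mod)
    self_mod[2] += 1  # what `self_mod[2] = self_mod[2] + 1;` does to the code

print(outputs)  # [1, 2, 3]
```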
A more involved example is to search for a certain instruction in a function's code and change it:
fun do_op {
$1 + 10;
}
fun find_op {
var i;
for (i = 0; i < $2; i++) {
# looks for "IMM 10; ADD; SRE", i.e. the "+ 10" at the end of the function.
if (*($1 + i) == 2 & *($1 + i + 1) == 10 & *($1 + i + 2) == 20 & *($1 + i + 3) == 39) {
prints("'+' is at: ");
printi(i+2);
printc('\n');
break;
}
}
i + 2;
}
printi(do_op(20)); # prints "30" (20 + 10)
printc('\n');
do_op[find_op(&do_op, sizeof(do_op))] = 22; # opcode for MUL (*)
printi(do_op(5)); # prints "50" (5 * 10)
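The scan-and-patch step can be sketched on a plain Python list. Only the tail IMM 10, ADD, SRE comes from the text. The cells before it in do_op are a made-up placeholder, since the exact code the compiler emits for $1 is not shown here. Unlike the natolang version, this sketch returns -1 when the pattern is missing instead of returning an index anyway.

```python
IMM, ADD, MUL, SRE = 2, 20, 22, 39


def find_op(code: list) -> int:
    """Return the index of ADD in the first `IMM 10; ADD; SRE` run, or -1."""
    pattern = [IMM, 10, ADD, SRE]
    for i in range(len(code) - len(pattern) + 1):
        if code[i:i + len(pattern)] == pattern:
            return i + 2  # skip IMM and its operand to land on ADD
    return -1


# Hypothetical function body; the first four cells are placeholders.
do_op = [40, 0, 1, 37, IMM, 10, ADD, SRE]

index = find_op(do_op)
print(index)         # 6
do_op[index] = MUL   # turn "+ 10" into "* 10"
print(do_op[index])  # 22
```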
The virtual machine has the following instructions:
enum {
/* opcodes w/ operand */
ARG, // load imm th function param to acc
ADJ, // adjust sp
IMM, // load immediate val to acc
J, // jump to immediate val
JS, // jump to immediate val and push pc to stack
JZ, // jump to immediate val if zero
JNZ, // jump to immediate val if not zero
ADI, SBI, MUI, DII, MDI, // math op w/ immediate val
EQI, NEI, GTI, LTI, GEI, LEI, // comp
ANI, ORI, // logic
/* opcodes w/o operand */
ADD, SUB, MUL, DIV, MOD, // math op
EQ, NE, GT, LT, GE, LE, // comp
AND, OR, NOT, // logic
LD, // load val at addr in acc to acc
SV, // save acc to stacktop addr
LA, // load acc th function param to acc
PSH, // push acc to stack
POP, // pop stack to acc
SRE, // end subroutine: restore fp & sp
SRS, // start subroutine: push fp on to stack, set fp to sp
PSI, // print stacktop value as int
PSC, // print stacktop value as char
PAI, // print acc value as int
PAC, // print acc value as char
GC, // get char to acc
EXT, // exit
NOP, // no-op
};
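Since the C enum has no explicit values, its members are numbered from 0 in declaration order, which is where the raw numbers used in the earlier examples come from. The Python sketch below rebuilds the same numbering so you can look opcodes up by name. The Op class is a hypothetical helper, not part of the repo.

```python
from enum import IntEnum

# Same names, same order as the C enum above.
NAMES = """
ARG ADJ IMM J JS JZ JNZ
ADI SBI MUI DII MDI
EQI NEI GTI LTI GEI LEI
ANI ORI
ADD SUB MUL DIV MOD
EQ NE GT LT GE LE
AND OR NOT
LD SV LA PSH POP SRE SRS PSI PSC PAI PAC GC EXT NOP
""".split()

# start=0 mirrors how C numbers enum members without explicit values.
Op = IntEnum("Op", NAMES, start=0)

for name in ("IMM", "ADD", "MUL", "SRE", "SRS", "PAI", "PAC"):
    print(name, int(Op[name]))
```

This prints 2, 20, 22, 39, 40, 43 and 44 for those names, matching the numbers used in the asm array and in find_op.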
Clone the repo and run make. Then compile and run your program with ./compiler -s code.n -o code.bin && ./vm code.bin:
$ git clone https://github.com/nat-lab/natolang
$ cd natolang
$ make
$ ./compiler -s examples/bubble.n -o bubble.bin
$ ./vm bubble.bin
$ ./vm -v bubble.bin # verbose vm
UNLICENSE