Based on the original series “Let’s Build a Compiler!” by Jack Crenshaw.
Introduction
In the first 6 Chapters we have closely followed the method and the steps that Jack taught us in his original tutorial, which made it so easy to deliver a working compiler in very simple and easy to understand steps.
In Chapter VII we removed the last remaining limitations (single character tokens) and set up the necessary structure to be able to recognise any token and pass it to our parser. We also set up the tools to make our language fully configurable and to be able to define and recognise any token we want.
In Chapters VIII and IX we defined our own programming language: TINSEL, which looks like a mini C. We demonstrated that our compiler can process TINSEL programs with decent error checking.
We still have limitations on the output side, as the code we are producing is still a mixture of M68000 instructions and dummy commands. Obviously this output is not much use for any practical application.
And this is what we will fix in this Chapter: we will build a new code module, for x86-64, and will make any other necessary changes in our compiler to produce assembly code for x86-64, for Linux. That code anyone will be able pass through an assembler and linker and then run it and test it and see the result of.
Some Background
First, I’d like to set the plan and explain what exactly we will do in this Chapter. This will not be a lesson on Assembly language, however, there are certain things which have to be explained in detail so that you understand why our output code is generated in a specific way and how it works.
Please note these important points to ensure that you can actually run the code that our compiler produces:
- The output code that we will produce here is compatible with the GNU Assembler. As you may know, there are two standards in the x86 world, the Intel standard and the AT&T standard. The main difference between the two is the order in which Source – Destination appear in the instructions (plus a few more minor differences). Our compiler will produce AT&T-standard assembly code which is recognisable by the GNU assembler. If anyone wants to use Intel-standard and the NASM assembler, they will have to change the code module accordingly.
- The resulting code can be run only on Linux (has been tested on Ubuntu 20.04 and Fedora 35, 64-bit). The reason for this is that certain system calls have been used for reading, writing and exit program. These calls are specific to Linux and the 64-bit architecture and would not work on Windows. In order for the output code to work on Windows, these system calls would have to be changed to the equivalent Windows system calls.
- The output code is compatible with the x86-64-bit architecture only – it will not run on 32-bit architecture. The 64-bit registers are being used throughout the code but also the calling convention is different in x86-64. In x86-64 the first 6 parameters to a function are passed via CPU registers (as long as they are integers or pointers) while previously they were passed via the Stack. So, to make our output code work in 32-bit architecture, not only the registers have to be changed (
rax -> eax, rbx -> ebx,
etc), but also the calling sequence when a function is called must be changed. Unless you are very comfortable with x86 assembly, this can be a very delicate operation. - Given that we want to be doing input and output in our TINSEL programs, we need a run-time library that will support these functions (we don’t want to keep rewriting these in every program). I have implemented these functions in x86-64 assembly and the source code is in my Github repository. These will have to be linked with your compiled TINSEL program(s) to produce the final executable.
- Any programs you write on TINSEL can either be run autonomously or can be linked with a C program and can be invoked from there. You will have to make certain changes for this and take certain considerations into account – more details later in this chapter.
Please don’t be confused by the above. Everything will be clarified by the end of this chapter. In the end, if you are not interested at all in Assembly, you can just skip to the end of this chapter, take the sources from the Github repository and run them on your Linux box. On the other hand, I’ve always wanted to know “what’s inside” and understand “how it works”, so any opportunity to write Assembly code is a temptation I can’t resist.
For those of you who are interested, a good quick reference is here, while you can find a detailed x86-64 (AT&T flavour) reference manual in the Oracle documentation pages.
The x86-64 Calling Convention
As mentioned above, in x86-64, the function parameters are passed via CPU registers. There are sixteen 64-bit registers: %rax, %rbx, %rcx, %rdx, %rdi, %rsi, %rbp, %rsp
, and %r8-r15
. The following six are used to pass the first six integer or pointer parameters (in the below order) to a function that is called:
first six int or ptr parameters: %rdi, %rsi, %rdx, %rcx, %r8, and %r9
Any additional parameters or any large structures are passed through the stack (first parameter at the top).
The function will return its value (if there is one) in the following CPU register
return value: %rax
.
I have applied this calling convention (a) in the code module of our compiler when a function is called (at the moment this is used only when a library function is called – our Tinsel functions don’t support parameters yet) and (b) in the runtime libraries, which consist of functions that our code will be calling (see below). For this reason the programs that are produced by our TINSEL compiler are not compatible with the 32-bit architecture.
Runtime Libraries
Every programming language has a runtime system that includes libraries, which are linked with the program (statically or dynamically). These libraries provide basic functions that the program needs in order to run.
The first version of our runtime library has only two functions that are available to our TINSEL programs, read_i_
and write_i_
, which are used on the back of the TINSEL commands read
and print
respectively. There is also a number of other functions but those are used internally within the library.
The read_i_
function takes no parameters and returns the integer value read from the keyboard in %rax
. Internally it calls the read_s_
function, part of the same library, which reads a string from the standard input using the Linux system call #0 (= read). To read the string, it uses an internal buffer (defined in the same library) to store it temporarily, and then calls another library function, atoi_
, which converts the string to integer.
The write_i_
function takes one parameter (in %rdi
), which is the integer to be printed. It first converts it to string using itoa_
(another library function) and then calls write_s_
which prints the string to standard output using the Linux system call #1 (= write).
Even though at the moment our programs will be using only read_i_
and write_i
, the other library functions will come useful in later chapters when we introduce strings.
This version of our runtime library consists of two assembly files (readwrite.s
and string_utils.s
) that you can find under the libtinsel/src
directory in the GitHub repository. I will explain at the end of this chapter how to pass them through the Assembler and link them into your TINSEL programs.
Please note, you can also try these library functions from a C program on Linux (Ubuntu 20.04 or Fedora 35, 64-bit) as follows:
/* test1.c */ extern long long int read_i_(); extern void write_i_(long long int); long long int i; void main() { do write_i_(i = read_i_()); while(i); }
Remember the library routines are written for 64-bit architecture, so all integers must be declared as long long int
, otherwise negative integers will not work. You can compile this small test program and link it with our library using this command:
gcc -o test1 test1.c readwrite.s string_utils.s
and you can just run it as ./test1
. It will read an integer from the keyboard and print it on the screen in a loop until you give it a 0 (or Enter
). You may notice that each integer you enter is printed without a new line, and this is exactly what write_i
is supposed to do: write an integer on the screen. However, in order to make the output of our programs easier to read, we will address this gap with a temporary workaround later in this chapter, when we implement the code for print
, and with a more permanent solution when we introduce strings in one of the following chapters.
The x86-64 Code Module for our Compiler
This is where we will create a new code module that will produce x86-64 code. There are almost no changes required in our Lexical Scanner or the Parsers other than a couple of places where we need to call some functions to generate extra code.
Please copy the code file M68000Instructions.kt
to x86-64Instructions.kt
. Remember to also change the class name to class x86_64Instructions
. Also change the code
variable in the CompilerMain.kt
:
// the Intel x86-64 instruction set val code = x86_64Instructions()
Now, let’s first tidy up the routines that send code to the output. We need to write code (a) with or without a tab in the beginning, (b) with or without a newline at the end, (c) output a label and (d) output a comment. The routines to do this will be:
private val COMMENT = "#" /** output code */ private fun outputCode(s: String) = print(s) /** output code with newline */ fun outputCodeNl(s: String = "") = outputCode("$s\n") /** output code with tab */ fun outputCodeTab(s: String) = outputCode("\t$s") /** output code with tab and newline */ fun outputCodeTabNl(s: String) = outputCodeTab("$s\n") /** output comment */ fun outputComment(s: String) = outputCode("$COMMENT $s\n") /** output a label */ fun outputLabel(s: String) = outputCodeNl("$s:")
As you can see, only one of these functions uses print
(and the rest use the first one), so when we decide to send the compiler output to a file instead, it will be a very easy change.
Next step, we need to include some new functions that will create certain sections, symbols and other declarations that the GNU assembler needs in order to produce a complete, runnable object module.
I have decided as first line to have the program name in a comment. We then need to define the .data section, align it to 8-bit boundary and declare a copyright string. The function that does all that is:
/** initialisation code for assembler */ fun progInit(progName: String) { outputComment("program $progName") outputCodeNl(".data") outputCodeNl(".align 8") // copyright message outputCodeTabNl("tinsel_msg_: .string \"TINSEL version 1.0 for x86-84 (Linux) Dec 2021 (c) M.Pappas\\n\"") // newline outputCodeTabNl("newline_: .string \"\\n\"") }
It has to be called at the very beginning, so I have added it to the parseProgramHeader
code:
fun parseProgHeader() { inp.match(Kwd.startOfProgram) code.progInit(inp.match(Kwd.identifier).value) }
Next, we need the necessary code to declare variables. One thing to remember is that in Assembler there are no real variables, there are just labels. So effectively, our variable declaration results in defining a label. In order to treat this as a variable we need to reserve memory where this label is pointing – this is where the value of this variable will go. As our variables are only integer for now we need to allocate 8 bytes (or a quad word) for each variable. In order to reserve these 8 bytes, we have to give them a value. This means that in case of uninitialised variables (in the TINSEL program, that is) we have to choose a value and I have chosen 0. So, any global variable that is not initialised in our TINSEL program will default to 0.
/** declare int variable (64bit) */ fun declareInt (varName: String, initValue: String) { if (initValue == "") outputCodeTabNl("$varName:\t.quad 0") else outputCodeTabNl("$varName:\t.quad $initValue") }
We call this code generating function as the last line in declareVar
(that declares each of our global variables):
code.declareInt(name, initValue)
With the above, the .data
section is completed and it’s time to move to the .text
section. We first declare our program’s entry point (_start
) as global so that the linker can find it.
private val MAIN_ENTRYPOINT = "_start" /** initial code for functions */ fun funInit() { outputCodeNl("\n.text") outputCodeNl(".global $MAIN_ENTRYPOINT") }
This is called in parseProgram after we are done with variables and just before we start processing the function declarations, which is where the .text
section starts.
fun parseProgram() { parseProgHeader() if (inp.lookahead().encToken == Kwd.varDecl) parseVarDecl() code.funInit() if (inp.lookahead().encToken == Kwd.funDecl) parseFunDecl() parseMainBlock() parseProgEnd() }
We also need to implement the declareAsmFun
routine, which is nothing else but a comment with the function name and a label to declare the start of that function:
fun declareAsmFun (name: String) { outputCodeNl() outputComment("function $name") outputLabel(name) }
This is called as the last line in declareFun where each function is declared:
code.declareAsmFun(name)
Then we come to the point where our main
program starts. We mark this with a comment to make it visible in the assembly code, and we declare the _start
label. We also add a couple of lines of executable code which prints the TINSEL hello message on the screen:
fun mainInit() { outputCodeNl() outputComment("main program") outputLabel(MAIN_ENTRYPOINT) outputComment("print hello message") outputCodeTabNl("pushq\t%rbx") outputCodeTabNl("lea\ttinsel_msg_(%rip), %rdi") outputCodeTabNl("call\twrite_s_\n") }
At the end of main, apart from a comment to mark it, we also need the necessary code to stop our program and return to the operating system. We achieve this by using the Linux system call #60 (exit).
fun mainEnd() { outputCodeNl() outputComment("end of main") outputCodeTabNl("popq\t%rbx") outputCodeTabNl("movq\t$60, %rax\t\t$COMMENT exit system call") outputCodeTabNl("xorq\t%rdi, %rdi\t\t$COMMENT exit code 0") outputCodeTabNl("syscall") }
These two functions are called in the beginning and at the end of parseMainBlock
respectively:
fun parseMainBlock() { labelPrefix = "main" // set label prefix and label index labelIndx = 0 inp.match(Kwd.mainToken) code.mainInit() parseBlock() code.mainEnd() }
You may have noticed the addition of labelPrefix
which is set to “main
” above. It’s also set to the function name in parseFunDecl
. This is to help have more meaningful names for the labels in our assembly program (as opposed to them running from L0:
to Lxxx:
)
/** create a unique label*/ fun newLabel(): String = "${labelPrefix}_L${labelIndx++}_"
And with that we come to the end of the program where I have added a comment line to mark the end:
fun parseProgEnd() { inp.match(Kwd.endOfProgram) code.outputCodeNl() code.outputComment("end program") }
Having done the above, it is then straight forward to translate the M68000 instructions to x86-64 instructions. A couple of points to note: I have used %rax
(obviously) as the accumulator, and when an operation between the top of stack and %rax
is to be done, I pop the value from the stack into %rbx
:
/** add top of stack to accumulator */ fun addToAccumulator() { outputCodeTabNl("popq\t%rbx") outputCodeTabNl("addq\t%rbx, %rax") }
Also, all variable references are relative to the program counter, same as in Jack’s M68000 code:
/** set variable to accumulator */ fun assignment(identifier: String) = outputCodeTabNl("movq\t%rax, ${identifier}(%rip)")
Finally, below follows a special mention of how the cmp
instruction and the flags are used. Jack spent quite a few paragraphs explaining this and it’s time I do the same.
A Note on CMP
If you are like me and want to know “what’s inside”, this is the topic to pay attention to. Here’s how I have implemented the is equal (e.g. x == 1
):
fun compareEquals() { outputCodeTabNl("popq\t%rbx") outputCodeTabNl("cmp\t%rax, %rbx") outputCodeTabNl("sete\t%al") // set AL to 1 if comparison is == outputCodeTabNl("andq\t$1, %rax") // zero the rest of rax and set flags - Z flag set = FALSE }
The first two lines are obvious – we pop the top of stack into %rbx
and compare with %rax
. But what do the two lines that follow do?
To understand this, we need first to go back to chapter 6, where we decided that any integer can be treated as boolean and vice-versa. And we also decided that the value of 1 means true and the value of 0 means false. We also need to remember that the conditional commands in assembly (e.g. je
= jump if equal) work based on the values of the flags (or flags register).
Let’s see now how cmp
works:
cmp %rax, %rbx
It subtracts the value of %rax
from the value of %rbx
and, depending on the result, it sets the flags respectively. The most important flags when we are dealing with signed integers are ZF and SF (the Zero and the Signed Flag respectively). The result of this subtraction is then thrown away and %rbx
retains its original value.
So, if we compare x == 1
(and let’s assume that x is actually 1), then the result of the above subtraction will be 0 and the Zero Flag will be set. So if after this we wanted to do a jump if true command, we would do a jz
(jump if Zero Flag is set). So in this case jump if true = jz.
Now let’s assume that we fetch a boolean variable in %rax
, i.e. an interger that would have a value of either 0 or 1. And assume we set the flags according to the contents of %rax
. Same as above, let’s assume that the value is true, i.e. the value of %rax
is 1. And if we wanted to do a jump if true command we would have to do jnz (jump if Zero Flag not set). So in this case jump if true = jnz.
You can see the discrepancy above. In addition, the above cmp instruction just sets the flags and we have no boolean result (0 or 1) to store as a boolean variable.
So, in order to solve this problem I have followed exactly the method Jack used. After the comparison and the setting of the flags, set the value of %rax
to 0 or 1 based on the convention 1=true and 0=false. And as a following step, set the flags again based on the new value of %rax
. This is exactly what the extra two lines of code in compareEquals
do:
sete %al
sets %al
(the lowest 8 bits of %rax
) to 1 if the comparison result is true (if the two numbers are equal, that is) and to 0 if false.
andq $1, %rax
does two jobs (a) performs a bitwise and of the contents of %rax
and the constant $1
, i.e. sets all bits of %rax
to 0 apart from the first bit so that %rax
can take the values of 1 or 0 only and (b) sets the flags (ZF will be set if %rax
is 0).
With this we have satisfied both scenarios: (a) %rax
takes the value of 1 if the condition is true and 0 if false, so it can be used a boolean variable and (b) the Zero Flag is set when %rax
is 0 (=false) which means that jz
always means jump if false.
One more detail to complete this topic. In the family of compare*
functions we are getting the final setting of the Zero Flag for free as it is done by the and
instruction. We need to remember though that we must ensure the Flags register is set every time we set %rax
to a value. I have done this by using the test
instruction in setAccumulator
and setAccumulatorToVar
(in all other cases when %rax is the result of a numeric or boolean operation the Flag register is also set automatically):
fun setAccumulator(value: String) { outputCodeTabNl("movq\t$${value}, %rax") outputCodeTabNl("testq\t%rax, %rax") // also set flags - Z flag set = FALSE } fun setAccumulatorToVar(identifier: String) { outputCodeTabNl("movq\t${identifier}(%rip), %rax") outputCodeTabNl("testq\t%rax, %rax") // also set flags - Z flag set = FALSE }
This way we can write TINSEL code like
while (true) { ...
or
while (nextIteration) { ...
or
print x==1
Before we leave this topic let’s add two more parse functions, parseLessEqual
and parseGreaterEqual
and their corresponding assembly code functions compareLessEqual
and compareGreaterEqual
. These are based on the their sister functions parseLess
, parseGreater
, compareLess
and compareGreater
.
parseLessEqual
and parseGreaterEqual
must be triggered in parseRelation
when the appropriate relational operators are found.
Read and Print
The implementation of parseRead
and parsePrint
that support the read
and print
commands is much more straightforward than that of cmp
in the previous section. We only support integers for now, so parseRead
calls readInt
in our code module, which in turn calls read_i_
from our runtime library, and then assigns the value returned to the variable we are reading.
Similarly, parsePrint
calculates the given expression and then calls printInt
, which in turn calls write_i_
from our library to print the result on the screen. In addition parsePrint
calls println
that prints a newline, otherwise our output would be unreadable. This is temporary workaround until we implement strings, at which point we will implement proper print
and println
in TINSEL.
/**
* parse read statement
* <read> :: = read <identifier> [ , <identifier> ] *
*/
fun parseRead() {
var varName: String
do {
inp.match()
varName = inp.match(Kwd.identifier).value
code.readInt(varName)
code.assignment(varName)
} while (inp.lookahead().encToken == Kwd.commaToken)
}
/**
* parse print statement
* <print> ::= print <b-expression> [ , <b-expression> ] *
*/
fun parsePrint() {
do {
inp.match()
parseBooleanExpression()
code.printInt()
code.println()
} while (inp.lookahead().encToken == Kwd.commaToken)
}
And with this we are done, we are now ready to write a fully functional TINSEL program and run it.
Running our Code
And that’s it! We can now write and compile a complete TINSEL program and can run it on our Linux workstation. This is how you do it:
- First prepare the runtime library (you do this only once and keep the .o files that you will need in every TINSEL program:
as -o readwrite.o readwrite.s
as -o string_utils.o string_utils.s
- Write a TINSEL program, e.g.
prog1.tnsl
- Compile
prog1.tnsl
and copy the Compiler output (the assembly code) to a file:prog1.s
- Pass the Compiler output through the assembler:
as -o prog1.o prog1.s
- Link the code to produce the final executable:
ld -o prog1 prog1.o readwrite.o string_utils.o
- Run your program:
./prog1
If you don’t fancy this multiple step process, you can always use gcc and do it all in one step:
gcc -o prog1 prog1.s readwrite.s string_utils.s
This will link your TINSEL program with the C runtime libraries, so for this to work you must change the entry point of the program to: val MAIN_ENTRYPOINT = "main"
.
And to enjoy the result of our work try this TINSEL program:
program myprog1
var i = 0, n
fun i_square() {
if (i <= 0) { return -1 }
return i * i
}
main {
read n
repeat {
i = i + 1
print i_square()
} until (i >= n)
}
endprogram
You give it an integer (n
) as input and it will print the squares of all the integers from 1 to n. Try it on your Linux box. (please ignore the if(i<=0)...
line in function i_square
– it’s there just to test the correct generation of the labels)
I will stop here. This has been a long chapter, but I did not want to interrupt until we got to this point. We have reached a major milestone, where we can actually run our TINSEL programs on a real computer and see the real result on our screen.
Enjoy!
And as always, you can find the code in my GitHub repository.
Coming up next: Semicolons and Comments