The next stage of the Compiler project, the TINSEL Compiler for the Raspberry Pi is here.
Chapter XVI: Local Variables and Recursion
Based on the original series “Let’s Build a Compiler!” by Jack Crenshaw.
Introduction
In the previous chapter I introduced functions with parameters, which made TINSEL a much more mature programming language. In this chapter I will go one more step further and introduce local variables.
Until now, the only variables TINSEL recognises are global variables that must be declared immediately after the program <name>
line using the var
keyword. In this chapter I will introduce local variables, i.e. variables that have limited scope, within the block of code that they have been declared.
Local Variables Declaration and Scope
In TINSEL I will let a local variable be declared anywhere in a block of code. By block of code I mean any part of code surrounded by the block_start and block_end delimiters { }
. This could be the body of a function, an if-block, a while-block, or any other block of code surrounded by the block delimiters.
The local variables will be declared exactly in the same way as the global variables:
var i: int var a: string(20)
Their scope will be within the block they have been declared in, i.e. they will be recognised from the line they have been declared in, to the closing block delimiter.
Unlike global variables, which are instantiated at compile-time and live in the .data
section of the assembly output, the local variables will be instantiated at run-time and will live in the stack.
Our compiler has so far all the tools to implement the above enhancements, so implementing local variables should be a relatively easy job.
Declaring Local Variables
Starting top-down, we want to be able to declare a local variable anywhere in a block. This means that we need to recognise and process the var
keyword as part of the processing of the block. It looks like a good place to do this is in parseStatement
which is the heart of the parseBlock
function. For this we need to add one line:
fun parseStatement(breakLabel: String, continueLabel: String, blockName: String) { when (inp.lookahead().encToken) { Kwd.varDecl -> parseLocalVars(blockName) Kwd.startBlock -> parseBlock(breakLabel, continueLabel) Kwd.ifToken -> parseIf(breakLabel, continueLabel) ...
This will recognise a local var declaration anywhere in the block. Let’s see what is does:
fun parseLocalVars(blockName: String) {
parseVarDecl(VarScope.local, blockName)
}
Very simple: it calls the existing parseVarDecl
function (that until now used to process global vars) with two parameters: the flag to tell it to process a local variable and the name of the block (more about this further down). The only change required in parseValDecl
is to pass these parameters on to parseOneValDecl
:
fun parseVarDecl(scope: VarScope = VarScope.global, blockName: String = "") { while (inp.lookahead().encToken == Kwd.varDecl) { do { inp.match() parseOneVarDecl(scope, blockName) } while (inp.lookahead().encToken == Kwd.commaToken) } }
and this is where we will start doing some real work:
fun parseOneVarDecl(scope: VarScope, blockName: String) { val varName = inp.match(Kwd.identifier).value inp.match(Kwd.colonToken) when (inp.lookahead().encToken) { Kwd.intType -> parseOneIntDecl(varName, scope) Kwd.stringType -> parseOneStringDecl(varName, scope) else -> inp.expected("variable type (int or string)") } if (scope == VarScope.local) { // add any local vars to the local vars map for this block val localVarsList: MutableList<String> = localVarsMap[blockName] ?: mutableListOf() localVarsList.add(varName) localVarsMap[blockName] = localVarsList } }
First, the scope
is passed further on to parseOneIntDecl
and parseOneStringDecl
. In addition we have a new section added above that maintains a list of all the local variables declared in each block in the localVarsMap
. The key to this map is the name of the block, which is set automatically by the compiler in the beginning of each block:
fun parseBlock(breakLabel: String = "", continueLabel: String = "") { inp.match(Kwd.startBlock) mustRestoreSP = true val blockName = "$BLOCK_NAME${blockId++}" // blockName is used as key to the local vars map for this block while (inp.lookahead().type != TokType.endOfBlock && !inp.isEndOfProgram()) { parseStatement(breakLabel, continueLabel, blockName) } releaseLocalVars(blockName, mustRestoreSP) inp.match(Kwd.endBlock) }
Going back to parseOneIntDecl
and parseOneStringDecl
, they just pass the scope
on to declareVar
:
fun parseOneIntDecl(varName: String, scope: VarScope) { ... declareVar(varName, DataType.int, initValue, INT_SIZE, scope) }
declareVar
simply checks the scope and calls one of the two functions, declareGlobalVar
(which is our old declareVar) and declareLocalVar
, which is where we need to do the work to declare the local variables:
fun declareVar(name: String, type: DataType, initValue: String, length: Int, scope: VarScope) { // check for duplicate var declaration if (identifiersMap[name] != null) abort ("line ${inp.currentLineNumber}: identifier $name already declared") when (scope) { VarScope.global -> declareGlobalVar(name, type, initValue, length) VarScope.local -> declareLocalVar(name, type, initValue, length) } }
So, until now, apart from the introduction of the block name, the scope and the local vars map, we have only made minimum changes to the compiler code.
fun declareLocalVar(name: String, type: DataType, initValue: String, length: Int) { val stackOffset: Int val lengthRoundedTo64bits = (length / 8 + 1) * 8 when (type) { DataType.int -> { stackOffset = code.allocateStackVar(INT_SIZE) initLocalIntVar(stackOffset, initValue) } DataType.string -> { stackOffset = code.allocateStackVar(STRPTR_SIZE) initLocalStringVar(name, stackOffset, initValue, lengthRoundedTo64bits) } else -> return } identifiersMap[name] = IdentifierDecl( TokType.variable, type, initialised = true, size = lengthRoundedTo64bits, isStackVar = true, stackOffset = stackOffset ) }
As you can see above, declareLocalVar
checks the type of the variable and calls first the function to allocate the space in the stack for an integer or a string pointer (which both happen to be 8 bytes by the way) and then initialises that stack variable. Finally, it adds the local variable to the identifiersMap
so that all the details are at hand for when this variable is referenced. Please note that in the beginning of this function, the size of the local variable is rounded up to the nearest 8-byte size, in order to keep the stack in good shape.
Also, regarding local vars initialisation, in case of an int, then the local variable may or may not be initialised with the declaration. In case of string though, the local variable must be always initialised, either by specifying the length of the string or the initial value. The reason for this is the way TINSEL handles strings, which requires each string pointer to point to a certain memory area where the contents of the string lives. Here’s the code that takes care of the initialisation of the string local variables:
fun initLocalStringVar(name: String, stackOffset: Int, initValue: String, length: Int) {
if (initValue.isEmpty() && length == 0)
abort ("line ${inp.currentLineNumber}: local variable $name is not initialised")
var constStringAddress = ""
// check for the constant string init value
stringConstants.forEach { (k, v) -> if (v == initValue) constStringAddress = k }
if (constStringAddress == "") { // if not found
// save the string in the map of constant strings
constStringAddress = STRING_CONST_PREFIX + (++stringCnstIndx).toString()
stringConstants[constStringAddress] = initValue
}
val stringDataOffset = code.allocateStackVar(length)
code.initStackVarString(stackOffset, stringDataOffset, constStringAddress)
}
This function first checks whether the local string var is initialised and aborts otherwise (exactly same as the global string vars) and then checks for the scenario where an initial value for the string is given. In that case checks to see if the value already exists in the stringConstants
map, and if not, it adds it. Finally, it allocates space for the contents of the string in the stack and calls the function that will produce the code to initialise it at run time:
fun initStackVarString(stackOffset: Int, stringDataOffset: Int, constStrAddress: String) {
outputCodeTabNl("lea\t$stringDataOffset(%rbp), %rax")
outputCodeTab("movq\t%rax, $stackOffset(%rbp)\t\t")
outputCommentNl("initialise local var string address")
if (constStrAddress.isNotEmpty()) {
outputCodeTabNl("lea\t$constStrAddress(%rip), %rsi")
outputCodeTabNl("movq\t$stackOffset(%rbp), %rdi")
outputCodeTab("call\tstrcpy_\t\t")
outputCommentNl("initialise local var string")
}
}
This function first sets the value of the string pointer to the block in the stack where the contents of the string lives and then checks whether there is initial value, which is then copied onto the string.
Cleaning up
This is the only bit left to be done at the end of the block:
fun parseBlock(breakLabel: String = "", continueLabel: String = "") { inp.match(Kwd.startBlock) mustRestoreSP = true val blockName = "$BLOCK_NAME${blockId++}" // blockName is used as key to the local vars map for this block while (inp.lookahead().type != TokType.endOfBlock && !inp.isEndOfProgram()) { parseStatement(breakLabel, continueLabel, blockName) } releaseLocalVars(blockName, mustRestoreSP) inp.match(Kwd.endBlock) } fun releaseLocalVars(blockName: String, restoreSP: Boolean) { var localVarSize = 0 localVarsMap[blockName]?.forEach { localVarSize += when (identifiersMap[it]?.type) { DataType.int-> INT_SIZE DataType.string-> STRPTR_SIZE + identifiersMap[it]?.size!! else-> INT_SIZE } identifiersMap.remove(it) } if (localVarSize > 0 && restoreSP) code.releaseStackVar(localVarSize) }
The function releaseLocalVars
is called at the end of the block and does two things: (a) calculates the stack space allocated by all the local vars in this block and releases it and (b) removes these local vars from the identifiersMap
so that they cannot be accessed outside the block.
You may have noticed that releaseLocalVars
has an additional parameter (apart from the block name), restoreSP
, which tells it to release the stack space used by the local vars in this block or not. This parameter is set to true
in the beginning of the block so that at the end of it the stack space will be released. With one exception: when we have a return
statement, then this parameter is set to false
. The reason is that when we return from a function, the whole stack frame is restored to its previous state, so any space allocated to local vars is automatically released. No need to increase the stack pointer.
fun parseReturn() { inp.match() ... code.returnFromCall() mustRestoreSP = false }
And that was it. TINSEL now supports local variables. And with this, we can write a professional-looking recursive version of the n-factorial program (strictly speaking it does not require local variables but it does require functions with parameters, which we covered in chapter xv). It works nicely until it hits integer overflow for input values above 25… And of course, now that we have recursion, we need to be careful not to overflow the stack.
And as always, you can find the code in my GitHub repository.
Coming up next – the next big step for TINSEL: TINSEL for Raspberry Pi
Chapter XV: Functions with Parameters
Based on the original series “Let’s Build a Compiler!” by Jack Crenshaw.
Introduction
Those of you who tried the “diamond” program from the last chapter (diamond.tnsl) must have noticed a certain weakness: when the print_line
function is called from inside the for
loop in the main program, it needs to know the value of i
so that it can print the correct pattern. And the only way we could do that in the previous chapter, was by assigning a global variable with the value of the for
loop control variable. This is not very neat, so let’s go and improve it by adding parameters to our functions.
The Ground Rules
In the x86-64 assembly standard the first 6 parameters to a function are passed via the registers %rdi, %rsi, %rdx, %rcx, %r8,
and %r9
. Any additional parameters or any parameters with size over 64 bits are passed via the stack. TINSEL will support up to 6 parameters (which are too many anyway) so only these 6 registers will be used for passing function parameters.
TINSEL supports int
and string
variables. In case of an int
parameter, the value of the variable will be passed and in case of a string
parameter the address of the string variable will be passed (i.e. int
parameters will be passed by value and string
parameters will be passed by reference).
As a side note, passing function parameters directly via the CPU registers may sound simpler than passing them via the stack (which is the traditional method), however, it introduces certain complexity, which we will address further down in this chapter.
Declaring Function with Parameters
This should be a relatively easy job. The function declaration will look like this:
<function declaration> ::= fun <identifier> : <type> ( [ <param declaration> ] * ) <block>
<param declaration> ::= <identrifier> : <type>
So, at first we need to add the code to parse any parameter declarations between the '('
and the ')'
. The code below does exactly this:
fun parseFunDecl() { while (inp.lookahead().encToken == Kwd.funDecl) { inp.match() val functionName = inp.match(Kwd.identifier).value ... inp.match(Kwd.leftParen) parseFunParams(functionName) // <---- inp.match(Kwd.rightParen) inp.match(Kwd.colonToken) ... inp.match() declareFun(functionName, funType) storeParamsToStack(functionName) // <---- parseFunctionBlock() } }
As you can see the first item that was added to our parseFunDecl
function is to parse the parameters:
fun parseFunParams(functionName: String) { var paramCount = 0 val paramTypesList = mutableListOf<FunctionParameter>() if (inp.lookahead().encToken == Kwd.identifier) { do { if (paramCount++ >= code.MAX_FUN_PARAMS) abort("line ${inp.currentLineNumber}: a function can have only up to ${code.MAX_FUN_PARAMS} parameters maximum") if (inp.lookahead().encToken == Kwd.commaToken) inp.match() paramTypesList.add(parseOneFunParam()) } while (inp.lookahead().encToken == Kwd.commaToken) } funParamsMap[functionName] = paramTypesList } fun parseOneFunParam(): FunctionParameter { val paramName = inp.match(Kwd.identifier).value if (identifiersMap[paramName] != null) abort("line ${inp.currentLineNumber}: parameter name $paramName has already been declared") inp.match(Kwd.colonToken) var paramType = DataType.none when (inp.lookahead().encToken) { Kwd.intType -> paramType = DataType.int Kwd.stringType -> paramType = DataType.string else -> inp.expected("variable type (int or string)") } inp.match() return FunctionParameter(paramName, paramType) }
parseFunParams
is a loop that processes a comma-separated parameter list. For each parameter it calls parseOneFunParam
.
parseOneFunParam
first checks whether the name has already been declared (which could be either as a global variable or as another parameter in this function) and if not, it processes the declaration by checking the type of the parameter. Finally it returns an object that contains the name and type of that parameter.
parseFunParams
builds a list of all the function parameters with their names and types and saves it in the funParamsMap
. This map will be used in two places: (a) in the beginning of the code for the function block and (b) when a function is called and the parameters must be assigned values.
Let’s look at (a) first.
You may have noticed above that there is one more line added to parseFunDecl
: it’s the call to storeParamsToStack
, which does exactly what it’s name says: stores the values of the parameters from the registers (up to six) passed by the caller to the stack. Why? Here’s the reason:
Based on the x86-64 architecture, the 6 registers used to pass the function parameters, are part of the so-called caller-save group of registers, which means that their value is not necessarily retained across function calls. Even though it would have been nice to keep the 6 parameters in these 6 registers and have them handy to use them through the function code, it’s very likely that we will have the scenario where another function is called that modifies e.g. %rdi
or %rsi
or %rcx
or %rdx
or %r8
or %r9
(the first 4 are very commonly used – they are also used in the TINSEL runtime library), and this means that the value of one (or more) parameter(s) would be corrupted.
The solution to the problem is to store the values in the stack as stack variables. We have used the stack to store a variable already in chapter 13 for the for-loop control variable. When the parameters are needed later in the function code, their value will be retrieved from the stack. This, by the way, is exactly how the GNU C compiler has dealt with this problem.
fun storeParamsToStack(functionName: String) { val paramsList = funParamsMap[functionName] ?: listOf() for (i in paramsList.indices) { val paramVarOffs = code.allocateStackVar(INT_SIZE) identifiersMap[paramsList[i].name] = IdentifierDecl( TokType.variable, paramsList[i].type, initialised = true, size = INT_SIZE, isStackVar = true, stackOffset = paramVarOffs, canAssign = false ) code.storeFunParamToStack(i, paramVarOffs) } }
This function foes the following for each parameter: allocates space in the stack, creates an entry in the identifiersMap
and finally saves the value of the parameter in the stack, which is done by the below assembly code:
val funInpParamsCpuRegisters = arrayOf("%rdi", "%rsi", "%rdx", "%rcx", "%r8", "%r9")
fun storeFunParamToStack(paramIndx: Int, stackOffset: Int) {
outputCodeTabNl("movq\t${funInpParamsCpuRegisters[paramIndx]}, $stackOffset(%rbp)")
}
There is one more thing required to be done just before the end of the function. We need to remove the parameters from the identifiersMap
, otherwise a subsequent function declaration would not be able to use parameters with the same names. This is done at the end of parseFunctionBlock
:
fun parseFunctionBlock() { hasReturn = false parseBlock() if (!hasReturn) abort("line ${inp.currentLineNumber}: function $funName has no ${inp.decodeToken(Kwd.retToken)}") // clean up declarations of parameters so that the names can be reused in other functions funParamsMap[funName]?.forEach { identifiersMap.remove(it.name) } }
Please note, there is no need here to release the space that was allocated in the stack for the function parameters, because the code produced on the back of the return
instruction takes care of this:
private fun funEnd() {
restoreStackFrame()
outputCodeTab("popq\t%rbx\t\t")
outputCommentNl("restore \"callee\"-save registers")
}
private fun restoreStackFrame() {
outputCodeTab("movq\t%rbp, %rsp\t\t")
outputCommentNl("restore stack frame")
outputCodeTabNl("popq\t%rbp")
}
Calling a Function with Parameters
So, most of the work is already done. Let’s go now and make the necessary changes to pass parameters when we call a function.
The value of each parameter will be the result of a boolean expression, which, as you remember from the previous chapters (and also from the BNF definitions TINSEL_BNF.md), can be a boolean result (0 or 1), a result of a numberical expression or even a result of string expression.
As per the x86-64 standard we have to set the CPU registers mentioned above with the value of each parameter. Here we have the same problem mentioned in the previous section, which is that these registers are not presevred across function calls. This means that we may have a scenario where we set %rdi to the value of the first parameter, we set %rsi to the vallue of the second parameter, and then we call a function to set %rdx to the value of the third parameter, which function changes the values of %rdi and %rsi corrupting the first two parameters.
Again, I will follow the excample of the GNU C compiler and do this:
First, I will set the registers %rbx, %r12-%r15
and %rax
to the values of the parameters (that is, to the values of the parameters that have been declared – up to 6). The first 5 of these registers are guaranteed to maintain their value across function calls so the above issue goes away. %rax
obviously does not maintain its value across function calls, but will be the last one set, so we will not lose the value of the sixth parameter.
Second, once all the values of the parameters have been set into these temporary registers, I will move them to the formal x86 parameter registers: %rdi, %rsi, %rdx, %rcx, %r8,
and %r9
.
And here’s how this is done:
fun parseFunctionCall(): DataType { val funcName = inp.match(Kwd.identifier).value inp.match(Kwd.leftParen) parseAssignFunParams(funcName) setInpFunParams(funcName) inp.match(Kwd.rightParen) code.callFunction(funcName) restoreParamRegisters(funcName) return getType(funcName) }
As you can see there are three new functions called by parseFunctionCall
:
The first one, parseAssignFunParams
, first produces the necessary code that will return the value for each parameter and checks that the value will be of the same type as the parameter, and then sets the temporary registers to the value of the parameters as mentioned above. As you can see in setIntTempFunParam
, each egister is first saved in the stack – based on the x86-64 standard it is our responsibility to ensure that these registers will retain their values when we return / exit.
fun parseAssignFunParams(functionName: String) { val paramTypeList = funParamsMap[functionName] ?: listOf() for (i in paramTypeList.indices) { if (i > 0) inp.match(Kwd.commaToken) val paramExprType = parseBooleanExpression() if (paramExprType != paramTypeList[i].type) abort ("line ${inp.currentLineNumber}: parameter #${i+1} must be type ${paramTypeList[i].type}, found $paramExprType") when (paramExprType) { DataType.int -> code.setIntTempFunParam(i) // the same code is used both for int and for string parameters DataType.string -> code.setIntTempFunParam(i) // i.e. moves %rax to the appropriate register for this parameter else -> {} } } }
fun setIntTempFunParam(paramIndx: Int) {
if (funTempParamsCpuRegisters[paramIndx] == "%rax")
return
outputCodeTabNl("pushq\t${funTempParamsCpuRegisters[paramIndx]}")
outputCodeTabNl("movq\t%rax, ${funTempParamsCpuRegisters[paramIndx]}")
}
The second one, setInpFunParams
, simply transfers the value of each temporary parameter register to the corresponding formal parameter register as per x86-64 standard:
fun setInpFunParams(functionName: String) {
val paramTypeList = funParamsMap[functionName] ?: listOf()
for (i in paramTypeList.indices)
code.setFunParamReg(i)
}
// these registers hold the fun params at the time of the call val funInpParamsCpuRegisters = arrayOf("%rdi", "%rsi", "%rdx", "%rcx", "%r8", "%r9") // during the assignment of the parameters, their values are saved temporarily here, // so that they are not corrupted by function calls executed during the assignment of the parameters val funTempParamsCpuRegisters = arrayOf("%rbx", "%r12", "%r13", "%r14", "%r15", "%rax") // 6 params maximum allowed val MAX_FUN_PARAMS = funInpParamsCpuRegisters.size fun setFunParamReg(paramIndx: Int) { outputCodeTabNl("movq\t${funTempParamsCpuRegisters[paramIndx]}, ${funInpParamsCpuRegisters[paramIndx]}") }
Finally, restoreParamRegisters
, restores the values of the temporary registers from the stack so that we are compliant with the x86-64 standard for register usage:
fun restoreParamRegisters(functionName: String) { val paramTypeList = funParamsMap[functionName] ?: listOf() for (i in paramTypeList.indices) code.restoreFunTempParamReg(paramTypeList.size - i - 1) }
fun restoreFunTempParamReg(paramIndx: Int) {
if (funTempParamsCpuRegisters[paramIndx] == "%rax")
return
outputCodeTabNl("popq\t${funTempParamsCpuRegisters[paramIndx]}")
}
Plase note %rax
is neither saved nor restored, as it is not required to maintain its value across function calls.
A Point to note about Function Return Values
As we are in the subject of calling a function, I’d like to point out the following scenario:
Suppose we have a TINSEL function (let’s call it test_alpha()
) that returns an integer value, which we want to treat as boolean (0 for false, non-zero for true). We would want to use it like this:
if (test_alpha()) { // do something }
Please note that this code simply works because of the way the expressions are calculated and the flags register is set. So, when at some point in function test_alpha
there is a return
statement
return <some_expression>
the result of some_expression
is calculated and is placed in %rax
. At the same time the Z-flag
is set either by the instructions executed as part of the expression calculation (addq
, subq
, etc) or by specifically executing
testq %rax, %rax
which sets the flags based on the contents of %rax
, when %rax
is set by a movq
instruction from the contents of a memory location.
This means that a function not only returns a value in register %rax
, but also returns the flags register, which is set based on the contents of %rax
. And this is what makes the above code work.
Now TINSEL looks a bit more like a grown up language (although there is still a long way to go) and you can modify the diamond.tnsl example to use a parameter for the print_line()
function instead of the global variable k
.
And as always, you can find the code in my GitHub repository.
Coming up next: Local Variables and Recursion
Chapter XIV: Strings – Hello, World!
Based on the original series “Let’s Build a Compiler!” by Jack Crenshaw.
Introduction
In the last chapter I included an example TINSEL program that demonstrated the use of nested for
loops. Those of you who tried it on your Linux box, I’m sure you are convinced it works. However, it brought up another limitation of the current version of TINSEL. The lack of string handling, which made the printout of that program look like an awkward series of numbers that we had to try hard to follow. Not to mention that we have no way to print a prompt to the user when the program is waiting for input for the variable n
. To rectify these limitations I will introduce strings and the associated handling in this chapter. I consider this a major enhancement, given that until now TINSEL supported only integers, hence it will give us TINSEL v2.0. There will be significant changes across many parts of the compiler, so please bear with me.
Defining Different Types
Until now everything in TINSEL was of type integer and this made things very simple in terms of declaring variables, processing expressions, executing numerical operations, calling functions, retrieving their return value, etc. The introduction of Strings means that we need to support Types. So, let’s start with this part. A new file DataTypesAndDecl.kt
is created where all the necessary structures to support the handling of various types will go:
The following enumeration is first declared; it defines the different types supported in TINSEL v2.0:
enum class DataType { int, string, void, none }
The class that defines the identifiers (variables/functions) and the map that actually holds the mapping of all these identifiers also go there:
/** the declaration space (variables and functions) */ class IdentifierDecl(var fv: TokType, var type: DataType, var initialised: Boolean = false, var size: Int = 0, var stackVar: Boolean = false, var stackOffset: Int = 0, var canAssign: Boolean = true) // the identifiers space map val identifiersMap = mutableMapOf<String,IdentifierDecl>()
The declareVar
and declareFun
functions that produce the assembly code for a variable or a function also go there, as well as two new functions, getType
and getCanAssign
that return the type of an identifier and whether it can be assigned a new value (if you remember from the previous chapter, the control variable in a for
loop cannot be assigned a value in an assignment statement).
Variable Declarations
The next thing that has to change is how we define our variables. We now have to tell our compiler what type each variable is. The Integer variables must now be defined like this:
var i: int
and may optionally be followed by an initialisation clause:
var i: int = -1
As you may remember form previous chapters, TINSEL uses 64-bit integers, so each integer variable declaration will allocate 8 bytes in the assembly output for that variable.
Let’s get to Strings now. I have decided to deal with strings in the same way as C does, i.e. treat a string as a pointer in memory where the contents of the string is. Also, same as in C, strings in TINSEL will be null-terminated. With this in mind, when we declare a sting variable, we must allocate the necessary space in memory where its value will be stored. This is done in one of two ways:
– initialise the string with a specific value:
var s1: string = "1234567890"
This will allocate 11 bytes in memory containing the above 10 characters plus the null
terminator, while s1
will be pointing to the first of those bytes.
– or, allocate a number of bytes in memory
var s2: string(20)
This will allocate 20 bytes in memory with s2 pointing to the first of those bytes.
Please note that these memory blocks allocated for the string variables are fixed size and cannot grow. This is important when we get to string expressions.
Function Declarations
Similarly to strings, our function declarations must now include a type:
fun square(): int fun alower(): string
This type declaration will be checked later when we parse the return statement for each function, to ensure it returns the correct type (more on this later in this chapter).
Expressions
And now it’s time for the heavy lifting: we need to change the way we process expressions in order to support processing of strings.
First, let me summarise how string expressions will look like:
- adding strings
s1 + s2
will result in an operation same as the Cstrcat(s1,s2)
i.e. will append the contents of the memory area where s2 is pointing to the end of the memory area where s1 is pointing (the end of the string being the position of thenull
character)
- comparing strings
s1 == s2
will result in an operation similar to the Cstrcmp(s1,s2)
that will return 1 if the strings are equal and 0 if not.
- comparing strings
s1 != s2
will result in the same operation as above but with exactly the opposite result.
- assigning a string variable
s = "abcd"
ors = s1 + s2
will result in an operation same as the C strcpy(s, “abcd”) or strcpy(s, s1) (after s2 has been concatenated at the end of s1).
And these are the only string operations that TINSEL 2.0 will support for now.
Please remember the point about declaring string variables: each string variable is allocated a fixed memory space at declaration time, so it’s important to ensure this space is adequate to hold the result of the strcat or the strcpy, otherwise other variables will be corrupted or the program will crash (exactly the same as in C).
Now it’s time to think how we deal with expressions that can potentially contain different types of variables or constants. We need to be able to validate whether an operation is valid of that specific type of variable or whether the two sides of an operation are valid for that operation. E.g. we can compare a string to a string and assign the result to an integer variable (remember TINSEL treats boolean vars as integers – same as C), we can obviously add or multiply an integer with an integer, we can add a string to a string as mentioned above, but we cannot multiply a string with a string or a string with an integer, we cannot do boolean xor
between a string and a string, etc.
To achieve this level of control we need to change how the various parse
functions deal with the expressions, the terms and the factors. Starting from the bottom, we have already the identifiersMap
that contains the type of each identifier (int
or string
). In addition, when scan
finds a quote (the start of a string), it will call the new function getString
, that will return the string literal as the next token:
private function scan(): Token { .... if (checkQuote()) return Token(getString(), Kwd.string, TokType.none) ....
/** get a string literal */
private fun getString(): String {
var value = ""
if (nextChar != '"')
return value
getNextChar()
while (nextChar != '"') {
value += nextChar.toString()
getNextChar()
}
getNextChar()
return value
}
So now, when we get a string literal or a number, we can tell what the type is by looking at the type
of the Token.
Working our way up, parseFactor
now checks the type of the next token and calls parseNumber
or parseStringiteral
respectively.
fun parseFactor(): DataType { when (inp.lookahead().encToken) { Kwd.leftParen -> return parseParenExpression() Kwd.identifier -> return parseIdentifier() Kwd.number -> return parseNumber() Kwd.string -> return parseStringLiteral() else -> inp.expected("valid factor (expression, number or string)") } return DataType.void // dummy instruction }
In addition, it now returns the type of the factor as you can see above (int or string). This will be used when it comes to validating whether a specific type or type combination is allowed in a specific operation.
The parseNumber
function is what we used to have
fun parseNumber(): DataType {
code.setAccumulator(inp.match(Kwd.number).value)
return DataType.int
}
The parseStringLiteral
however, is a bit more complicated:
fun parseStringLiteral(): DataType { val stringValue = inp.match(Kwd.string).value // check if this string exists already in our map of constant strings and add it if not var stringAddress = "" stringConstants.forEach { (k, v) -> if (v == stringValue) stringAddress = k } if (stringAddress == "") { // if not found // save the string in the map of constant strings stringAddress = STRING_CONST_PREFIX + (++stringCnstIndx).toString() stringConstants[stringAddress] = stringValue } code.getStringAddress(stringAddress) return DataType.string }
Apart from producing the necessary assembly code, to load the Accumulator with the address of the string literal, it has to find some place to store that string literal. For this purpose it builds a table of all the string literals (stringConstants
) which at the end of the compilation is added to the end of the assembly output in a separate .data
section. As you can see above, if the same string literal is found multiple times, it is stored in the table only once and is reused.
Similarly, parseTerm
has been modified to get the type of the first factor and pass it on as a parameter to multiply
or divide
. It also returns the type to the caller:
fun parseTerm(): DataType {
val typeF1 = parseSignedFactor()
while (inp.lookahead().type == TokType.mulOps) {
if (typeF1 == DataType.string)
code.saveString()
else
code.saveAccumulator()
when (inp.lookahead().encToken) {
Kwd.mulOp -> multiply(typeF1)
Kwd.divOp -> divide(typeF1)
else -> inp.expected("multiply or divide operator")
}
}
return typeF1
}
And in the same way, parseExpression
has been modified to get the type of the first term and pass it on as a parameter to add
or subtract
and, of course, to return it to the caller:
fun parseExpression(): DataType {
val typeT1 = parseTerm()
while (inp.lookahead().type == TokType.addOps) {
if (typeT1 == DataType.string)
code.saveString()
else
code.saveAccumulator()
when (inp.lookahead().encToken) {
Kwd.addOp -> add(typeT1)
Kwd.subOp -> subtract(typeT1)
else -> inp.expected("add or subtract operator")
}
}
return typeT1
}
So, now, parseAssignment
knows what it’s dealing with and can make a decision whether a string literal can be assigned to an integer variable or not:
fun parseAssignment() { val identName: String = inp.match(Kwd.identifier).value checkCanAssign(identName) val typeVar = getType(identName) inp.match(Kwd.equalsOp) val typeExp = parseBooleanExpression() checkOperandTypeCompatibility(typeVar, typeExp, ASSIGN) when (typeVar) { DataType.int -> parseNumAssignment(identName) DataType.string -> parseStringAssignment(identName) else -> {} } }
You can see that it now calls parseNumAssignment or parseStringAssignment depending on the type of the variable being assigned. The former is simply what we used to have as assignment:
fun parseNumAssignment(varName: String) {
code.assignment(varName)
}
However, the latter, even though still one line at the parser level, is a bit more complex when it comes to the code we generate as it involves a call to our own strpcy_
(see above).
fun parseStringAssignment(identName: String) {
code.assignmentString(identName)
}
fun assignmentString(identifier: String) {
outputCodeTab("movq\t%rax, %rsi\t\t")
outputCommentNl("assign string - strcpy_(identifier, %rax)")
outputCodeTabNl("lea\t${identifier}(%rip), %rdi")
outputCodeTabNl("call\tstrcpy_")
}
The key change in our upgraded parseAssignemnt
is the call to checkOperandTypeCompatibility
with the two types and the operation as parameters. This is where the decision is made whether these two types are allowed in this operation. And this is how it works:
At the end of the new file (DataTypesAndDecl
) you will see a map that defines the allowed combinations for each operation
val typesCompatibility = mapOf( // int with int allowed for all operations TypesAndOpsCombi(DataType.int, DataType.int, ALL_OPS) to true, TypesAndOpsCombi(DataType.int, DataType.none, ALL_OPS) to true, // string with string allowed only for assign, add, compare_eq compare_ne TypesAndOpsCombi(DataType.string, DataType.string, ASSIGN) to true, TypesAndOpsCombi(DataType.string, DataType.string, ADD) to true, TypesAndOpsCombi(DataType.string, DataType.string, COMPARE_EQ) to true, TypesAndOpsCombi(DataType.string, DataType.string, COMPARE_NE) to true, TypesAndOpsCombi(DataType.string, DataType.none, PRINT) to true, // all other combinations forbidden unless set here )
In order to make this easier to manage, we have a “wildcard” operation, ALL_OPS
. So, the first two lines say that all binary operations between integers are allowed and all unary operations on an integer are allowed. The following five lines say that we can assign a string to a string variable, we can add two strings, we can compare (equal or not-equal) two strings and we can print a string. No other combination is allowed. With this definition, the logic in checkOperandTypeCompatibility
is very simple:
fun checkOperandTypeCompatibility(t1: DataType, t2: DataType, operation: String) {
var typesAreCompatible = typesCompatibility[TypesAndOpsCombi(t1, t2, operation)] ?: false
if (!typesAreCompatible)
typesAreCompatible = typesCompatibility[TypesAndOpsCombi(t1, t2, ALL_OPS)] ?: false
if (!typesAreCompatible) {
var message = "line ${inp.currentLineNumber}: $operation $t1 "
if (t2 != DataType.none)
message += "with $t2 "
message += "not supported"
abort(message)
}
}
This function is now called in every parse
function that performs a binary or unary operation to decide whether the type(s) is/are compatible. If they are not, a compilation error is thrown.
Operations
Given that the execution of an operation is now different, depending on the type(s) involved, we have separate functions for operations on integers and on strings (which are called after checking the type compatibility):
fun add(typeT1: DataType) {
inp.match()
val typeT2 = parseTerm()
checkOperandTypeCompatibility(typeT1, typeT2, ADD)
when (typeT1) {
DataType.int -> addNumber()
DataType.string -> addString()
else -> {}
}
}
The same goes for subtract, multiply, and all the other operations.
These functions are now placed in two separate files OpsNumeric.kt
and OpsStrings.kt
.
Boolean Expressions
Similarly to the Numerical Expressions, our Boolean expressions module has to be changed to support strings. As mentioned early in this chapter, strings can only be used with the relational operators ==
and !=
and this is defined in the typesCompatibilityMap
.
Our parseBooleanExpression
, parseBooleanTerm
, parseNotFactor
, parseBooleanFactor
and parseRelation
have been changed to return the type of the entry they processed. Also, all the boolean operations functions are now called with a parameter that’s telling them what the type of the first operand was. This way they can check the type compatibility after retrieving the type of the second operand.
Notably, parseRelation
, parseEquals
and parseNotEquals
have had their logic changed as this is where strings are getting into the picture:
fun parseRelation(): DataType {
val typeE1 = parseExpression()
if (inp.lookahead().type == TokType.relOps) {
if (typeE1 == DataType.string)
code.saveString()
else
code.saveAccumulator()
when (inp.lookahead().encToken) {
Kwd.isEqual -> parseEquals(typeE1)
Kwd.isNotEqual -> parseNotEquals(typeE1)
Kwd.isLess -> parseLess(typeE1)
Kwd.isLessOrEq -> parseLessEqual(typeE1)
Kwd.isGreater -> parseGreater(typeE1)
Kwd.isGreaterOrEq -> parseGreaterEqual(typeE1)
else -> inp.expected("relational operator")
}
return DataType.int
}
else
return typeE1
}
fun parseEquals(typeE1: DataType) { inp.match() val typeE2 = parseExpression() checkOperandTypeCompatibility(typeE1, typeE2, COMPARE_EQ) when (typeE1) { DataType.int -> code.compareEquals() DataType.string -> code.compareStringEquals() else -> {} } }
These changes are consistent with the changes in the Numeric expressions parsing, where different routines have to be called to execute the operation depending on whether it’s integer or string operation.
Return from Function
Given that our functions have to return a specific type, we need to check whether the return statement is returning the correct type. This is done quite easily with the below code:
fun parseReturn() { inp.match() if (labelPrefix == MAIN_BLOCK) abort("line ${inp.currentLineNumber}: return is not allowed in [main]") hasReturn = true // set the return flag for this function val funType = getType(funName) if (funType != DataType.void) { val expType = parseExpression() if (expType != funType) abort("line ${inp.currentLineNumber}: $funType function cannot return $expType") } code.returnFromCall() }
I’ve also taken the opportunity to simplify the logic that checks whether a function has a return statement in its block. If you remember, the initial implementation (Chapter 12) was quite complex: each block parsing function was returning a boolean value to the caller, which indicated whether it found a return or not, and when that value reached the top of the recursion, which is the parseBlock
called by the parseFunctionBlock
, it was checked and generated an error if that function had no return statement.
Even though that logic worked, it was unnecessarily complicated, because we can only declare a function at the top level only – we cannot declare a function within a function. This means that a global variable hasReturn
, which is reset in the beginning of parseFunctionBlock
, set in parseReturn
(see code above) and checked at the end of parseFunctionBlock
, can do the job in a much simpler way.
fun parseFunctionBlock() {
hasReturn = false
parseBlock()
if (!hasReturn)
abort("line ${inp.currentLineNumber}: function $funName has no ${inp.decodeToken(Kwd.retToken)}")
}
Code Generator
There is one more change that was needed in our code generator this time, in order to support string operations (+, ==, !=
).
Let’s go back to how we calculate this expression: A + ( B + C )
We set the accumulator to the value of A and then push it to the stack. Then, we set the accumulator to the value of B and, given that we have parenthesis, push it also to the stack. Then, we set the accumulator to the value of C, pop the top of stack and add it to the accumulator (this gives us B+C). Finally, we pop the top of stack again and add it to the accumulator (and this gives us A + (B+C) ).
The above works well with numbers. In case of strings we will have to do the same – right? But, what is the value of A, B or C if these are strings? It’s the address of each string that points to the memory area where the actual contents of the string is stored. So, we are pushing the address of the string to the stack, but what about its contents? Technically we could push it to the stack too and that would have been fully consistent with the handling of integers. However, this would introduce unnecessary complexity that has to do with the fact that the strings are variable-length plus they are made of bytes, which means that a string is most likely not an exact multiple of a 64 or a 32-bit word. Even though we could use padding to make a string multiple of a 64 or 32-bit word and use the null terminator to know that we have reached the end of the string to stop popping from the stack, this is totally unnecessary for the simple reason, because A + (B + C) is the same as A + B + C.
The answer is that we will not allow parenthesis in strings expressions (try it and see the compiler error generated). This means that we don’t really need a stack to store the value of all the strings involved in the expression. We just need a temporary buffer to store the contents of the last string processed in the expression and retrieve it to add it (strcat_
) to the next one and store the new result to the buffer again. Please note, the value of the accumulator (string address) is still stored in the stack to keep things simple. You can see how this logic works in the saveString
function which is called when a Term is processed:
fun saveString() {
outputCodeTab("movq\t%rax, %rsi\t\t")
outputCommentNl("save string - strcpy_(string_buffer, %rax)")
outputCodeTabNl("lea\t$STRING_BUFFER(%rip), %rdi")
outputCodeTabNl("call\tstrcpy_")
outputCodeTabNl("pushq\t%rax")
includeStringBuffer = true
}
The buffer is a dedicated memory space called string_buffer_
that has a default length of 1024 bytes. This means that the maximum length of any string in our program can be 1023 characters (+1 for the null terminator). Any string operation that resulted in a string longer than that would most likely crash the program or, best case, would overwrite other variables. If however you need a string of higher length, you can simply call the compiler with this option:
-maxstring nnnnn
This will allocate a string_buffer of nnnnn
bytes to prevent overflow and crash of the program.
If you write a program that involves string operations you will see the string_buffer
space allocated at the end of the compiler output:
... .data .align 8 # buffer for string operations - max str length limit string_buffer_: .space 1024
Runtime Library
Finally, the TINSEL runtime library has also been enhanced to support Version 2.0. The following functions have been added:
- strcpy_
- strcat_
- strlen_
- streq_
These functions are not directly visible to the TINSEL program. They are called from the assembly program that the compiler produces.
You will need to download the files string_utils.s
and readwrite.s
from the directory libtinel/src in the repository and assemble them using the GNU assembler so that they can be linked to your TINSEL output program (see below and also Chapter 10).
And with that, we have completed the implementation of the first version of string support in TINSEL.
This chapter was quite extensive and this reflects the extent of the changes in our compiler. I have not described every little change in every detail here, because that would double the size of this chapter. I have described the concept though and I hope you can follow the changes by looking at the code.
We have made a significant improvement to our TINSEL language and can now, finally, write the “hello world” program! Here’s a slightly more “complex” version of it:
program hello_world var s: string(20), s1: string(20), s2: string(20) main { s1 = "Hello" print "please enter your name (or Enter to continue): " read s2 if (s2 == "") { s2 = "World" } s = s1 + ", " + s2 println s, "!" } endprogram
You can also try a more complex example here: diamond.tnsl
Remember, to run your own TINSEL programs, you need this compiler and a Linux computer. You will need to download the source libraries from the repository, which you will have to assemble and link with your TINSEL program as described at the end of Chapter 10 – Running our Code: Let’s Build a Compiler – Chapter X, TINSEL for x86-64
And as always, you can find the code in my GitHub repository.
Coming up next: Functions with Parameters
Chapter XIII: Fixing the For Loop
Based on the original series “Let’s Build a Compiler!” by Jack Crenshaw.
Introduction
Our newly introduced programming language, TINSEL, has now taken shape and we can write a decent program. We have improved our error checking and it gives us meaningful error messages pointing to the line where the error was. There is still one thing missing though – the FOR loop. If you look at parseFor
, you will see that it still produces dummy code, so, let’s go and fix it in this chapter and produce our TINSEL V1.2.
Fixing the For Loop
Jack had pointed out in his tutorial that “the for loop is a bear to translate” and I would agree. There are a lot of moving parts: (a) the control variable – where do we keep it? in the stack? in a register? (b) the start value (c) the end value – do we recalculate it at the end of each iteration or only once in the beginning? (d) the step. So, let’s set the scene and define first what we want to do and then decide how we do it.
This will be my specification of the for loop:
<for> ::= for ( <identifier> = <expression1> [ down ] to <expression2> [ step <expression3> ] ) <block>
It will work like this:
- Set the control variable to expression1
- Check if it has exceeded the limit (expression2) and if yes exit the loop
- Execute the block
- Update the control variable
- Loop back to point 2.
In addition I will support an optional down
keyword, to count down as opposed to up. If it’s there, then the control variable will be decreased by 1 at the end of each iteration (while, if it isn’t there, it will obviously be increased by 1).
Finally there is an optional step
keyword followed by an expression. If this is present, then the value of expression3 will be added to the control variable at the end of each iteration (as opposed to the control variable increased or decreased by 1).
It is important to remember that it will be the responsibility of the programmer to set the step
expression (but also the initial and the to
expressions) to the right values so that the control variable will eventually reach the limit and the loop will be correctly terminated. Please note, if down
is used, then the result of the step
expression must be a negative number.
Furthermore, in my implementation, expression2 and expression3 (if it exists) are evaluated only once when the loop is entered and not in every iteration.
The Control Variable
The control variable must be a new variable not previously defined in that function or main program. It exists only within the for
block and cannot be assigned a value in an assignment statement – it is assigned a new value automatically at the end of each iteration. However, it’s value can be used on the right-hand side of an assignment within the for
block.
The next question is “where do we store it?”. Jack had decided to use the global variables space in his implementation of the for loop. That would not work with TINSEL, e.g. in the scenario where we have a for
loop in the main
block that calls a function that has a for
loop and uses the same name for the control variable. This is valid scenario and TINSEL should support it.
I would be tempted to use a CPU register given that the x86 architecture has a few spare ones (e.g. %r8 – %r15). This would be a really easy option, but apart from the above, it presents an additional problem: how do we handle nested for
loops? We would have to use a new register in each level and also keep track of which level of for
we are in and use the appropriate register when we want to access the control variable. This would introduce unnecessary complexity plus we would soon run out of registers… Hence, the answer is: the stack. It’s the most appropriate place, as the nature of the control variable is exactly the same as of a local variable.
Using the Stack for Storing a Variable
In the x86 architecture there are standard techniques for using the stack for temporary variable storage (e.g. for local variables or, as in our case, for the for
loop control variable). The first step is to create a new stack frame in the beginning of each function. This means that the old stack frame must be restored at the of the function. I do the same in the beginning and at the end of the main block as well.
To create a new stack frame we just need to save the Base Pointer in the stack and then give it the value of the Stack Pointer (i.e. have the new stack frame start where the stack pointer is pointing after the last push):
pushq %rbp movq %rsp, %rbp
Consequently, to restore the old stack frame we need to restore the Stack Pointer from the Base Pointer and then restore the Base Pointer from the stack:
movq %rbp, %rsp popq %rbp
I have implemented the two functions newStackFrame
and restoreStackFrame
in the code module that produce the code for these two operations. They are called in the beginning and at the end (where we have the return
statement) of each function block and of the main block.
With our new stack frame in place, we can allocate space in the stack for a temporary variable by reducing the stack pointer by the appropriate number of bytes. E.g. to allocate space for a 64-bit integer (i.e. 8 bytes) we just do:
subq $8, %rsp
In addition we need to keep track of the offset of each variable we allocate from the base of the stack (don’t forget, rbp
had the value of rsp
in the beginning of the block). This means that when we define a new stack frame, we start with an offset of 0, and this is decreased by the appropriate amount of bytes every time space for a variable is allocated in the stack. E.g. the offset of the first 64-bit integer we allocated is -8, and we can access it like this:
movq -8(%rbp), %rax
The current offset from the Base Pointer is managed within the code module.
To release the space that has just been allocated at the top of the stack, we just add the variable’s size in bytes to the stack pointer, e.g.:
addq $8, %rsp
For these two operations I have implemented allocateStackVar(size: Int)
and releaseStackVar(size: Int)
in the code module.
With all that clarified, here’s what’s needed in our compiler:
In the IdentifierDecl
map that holds all the variables and functions declarations I have added two more parameters:
var stackVar: Boolean = false
which is true
when the variable is stored in the stack and false
if it’s a global variable
var stackOffset: Int = 0
which holds the offset from the base pointer for a stack variable.
And to actually save or retrieve a stack variable, I have implemented the functions assignmentLocalVar(offset: Int)
and setAccumulatorToLocalVar(offset: Int)
respectively in the code module.
With these tools in our toolbox we can now go ahead and code the parseFor
function
The New Shape of parseFor
In my first attempt, I started rewriting the for
parser implementing all the various aspects (the control variable in the stack, the initial expression, the to
expression, the optional down
keyword and the optional step
keyword) and the outcome was a function that was over 80 lines long… I had to break that down in smaller pieces so that I could read it and debug it. So, here’s the new parseFor
function:
fun parseFor(): Boolean {
inp.match()
parseForLine()
presetCtrlVar()
val label1 = newLabel()
val label2 = newLabel()
postLabel(label1) // actual start of the loop
stepAndCheck() // increase (or decrease) ctrl var and check
code.branchIfFalse(label2) // if limit reached, exit
val foundReturn = parseBlock(label2) // the FOR block
code.branch(label1) // loop back to the beginning of the loop
postLabel(label2) // exit point of the loop
cleanUpStack()
return foundReturn
}
I will go through this at high level and will let you check all the details by looking at the code in the repository. Due to the complexity and the number of lines, I have put the for
parser and the related functions in a separate file and actually a separate class: ProgParser1aControlFor.kt, class ForParser
.
And now let’s go through the parts of parseFor
:
First comes parseForLine
, which parses the actual for
line from the for
keyword to the closing parenthesis.
It first gets the control variable name and checks if it’s been declared already. If not, it allocates space for it in the stack and adds it to the variables’ map marking it as stack variable and saving its offset from the base pointer. It then parses the “from” expression and produces the code to set the initial value of the control variable.
Next, it checks for a down
keyword and sets the relevant flag if it finds it.
It then producers the code for the to
expression and saves its value also in the stack, but without assigning a variable name to it; it’s not needed. However, it has to save the offset from rbp
and for this a class variable is used.
And the last bit of this first part is to process the step
expression (if there is any), set a flag if there is one, and produce the code to calculate the value of the expression and save it also in the stack. This is not saved as a variable in the variables’ map either, so a class variable is also used to save the offset from rbp
.
Next we have presetCtrlVar
, which pre-decrements (or pre-increments if we have down
) the control variable to get it ready for the first “increment-and-check” (or decrement-and-check respectively).
Then we produce the two labels that every loop needs: label1
, which marks the beginning of the loop and label2
, which marks the first instruction after the end of the loop.
Here starts the loop: we first call stepAndCheck
. This function first increments the control variable by 1 (or decrements it by 1 if we have down
). Please note that if we have step, then the step value is always added to the control variable, as it will be positive if there is no down
and negative if there is down
.
It then invokes the appropriate comparison, >= or <= depending on the presence of down
or not.
And this is where we know whether we have to stop the loop by executing branchIfFalse
, same as in the while and repeat implementations.
Having established whether we can continue with the next iteration, we execute the actual for
block (parseBlock
) and then go back to the beginning of the loop to check all over again.
The final part in our for parser, after we are done with the loop, is to clean up the stack and the variables’ space: cleanUpStack
. It first removes the entry for the control variable from the variables’ map so that it can be potentially reused in a new for
block.
It then releases the stack space allocated for the step
value (if it was there) plus the space for the to
value and the control variable.
And that was it – this is our for
parser done. This class is almost 150 lines long but at least one can read it and understand what each function does.
We also need to change the line in parseStatement
where for
is called:
Kwd.forToken -> return ForParser().parseFor() // in separate module due to increased complexity
Now you can try this nested loop, compile it and run it on your Linux machine and see the output:
/*-- test nested for loops */ program TestFor var n = 0 main { read n for (i=n down to 1 step 2) { print i for (j=1 to i) { print i*j } print 999 } } endprogram
Continue
Now that we are in the topic of “for”, it’s a good opportunity to add continue
. We have break
already, implemented based on the method that Jack has shown us, so it will be quite easy to add the necessary processing for continue
.
If you remember, every loop parser generates two labels: label1
and label2
. Label1 marks the start of the iteration, while label2 marks the next instruction following the loop (the one that is executed next, after the loop is terminated).
And it’s label2
that is passed as a parameter to parseBlock
, and then on to parseStatement
and then on to parseIf
, and on and on down the recursion tree, so that if and when a break
statement is encountered, it will produce an unconditional jump instruction to that label in order to break the loop.
We will use exactly the same technique to implement continue. We will add a second parameter to parseBlock
, parseStatement
and parseIf
, which will be the label to be used in parseContinue
:
fun parseBlock(breakLabel: String = "", continueLabel: String = ""): Boolean
The “continue” label will be passed down the recursion tree alongside the “break” label and if and when a continue
statement is found, it will use it:
Kwd.continueToken -> { parseContinue(continueLabel); return false }
And the actual parseContinue
function is effectively a copy of parseBreak
, as it does exactly the same thing (generates an unconditional jump to the label it has received as parameter). In fact we could have used the very same function, but in the interest of clean code I will keep them separate:
fun parseContinue(label: String) { inp.match() if (label == "") abort("line ${inp.currentLineNumber}: no loop to continue") code.jump(label) }
Don’t forget, we need to add the continueToken
in our LanguageTokens
file.
And that was it – with this minor change in our code, we can now support continue
, and this is thanks to Jack’s method, which resulted in code that is so easy to expand.
In this version I have also introduced two small enhancements in the code module:
First, a constant CODE_ID
that contains an identification of the flavour of assembly code produced. It is currently set to “x86-64 Assembly Code – AT&T format” and is printed as a comment at the top of the compiler output file. This is to help the reader identify easily what kind of code has been produced in that file. When a new code module is developed (e.g. to support ARM architecture) this string will be updated.
Second, I have added a line counter for the assembly code produced: outputLines
. This is being incremented based on the number of new lines in the output string in outputCode
and is printed on the console when the compiler finishes (together with the number of source lines).
And always, you can find the code in my GitHub repository.
Coming up next: Strings
Chapter XII: Improving our Error Checking
Based on the original series “Let’s Build a Compiler!” by Jack Crenshaw.
Introduction
We now have a fully working compiler that can compile TINSEL v1.0 programs, including comments, which we can run on Linux. So, let’s work on some details to close some small gaps and also improve our error checking.
More Input after the EndProgram Token
Let’s start with an easy one. Our compiler stops compiling when it sees endprogram
which is the token that terminates our program. It will ignore anything that follows “endprogram” and this is fine – everything works. However, having extra characters in the input stream after the end of the actual program, which are being ignored, is not a clean scenario. Let’s add a little check so that our compiler flags this as an error. After the “endprogram” token we would expect the end of input, so all we need is to add one line at the end of parseProgEnd
:
fun parseProgEnd() { inp.match(Kwd.endOfProgram) code.progEnd() inp.match(Kwd.endOfInput) }
With this simple check, our compiler will not allow anything after the “endprogram” token other than white spaces (and new lines).
As you may have noticed, I’ve also moved the two lines that were printing a newline and the comment “end of program” to the output, to our code module into a function called progEnd
, similar to progInit
– it’s cleaner this way.
Output to a File
Now that we can directly execute our programs on Linux, it does not feel right to have to copy and paste our compiler’s output to a file so that we can then pass it through the GNU assembler. Let’s make our compiler do this for us.
So far we have been using only one mandatory command line argument, which was the input file. I will add one more option, to tell our compiler to create a file for the code that it produces:
CompilerMain [-o output_file] input_file
For this to work, let’s declare a new variable in CompilerMain.kt
that will hold the name of the output file:
var outFile = ""
We will need to pass this variable as a parameter to the x86_64Instructions
class constructor:
class x86_64Instructions(outFile: String = "") { ....
In this class we will declare our output stream that will be initialised by default to the standard output:
private var outStream: PrintStream = out
What we need now is some initialisation code for this class, which will check the value of outFile
, and if not empty, will set the output stream to point to that file:
/** initialisation code - class InputProgramScanner */
init {
if (outFile != "") {
try {
outStream = PrintStream(File(outFile))
} catch (e: Exception) {
err.println("could not create output file - $e")
err.println("output code will be sent to stdout")
}
}
}
As you can see, if the output file cannot be created/opened, the compiler will print an error message and will send the output to stdout. By the way, if the output file exists already, it will be overwritten.
And to make the output go to the output stream as opposed to stdout, a small change is required in outputCode
. This is the only place where we need to change the call to print
, as this is where the code is actually written to the output. All the other functions (code with a newline, labels, comments, etc.) end up calling outputCode
.
fun outputCode(s: String) = outStream.print(s)
And that was it in terms of sending the code either to a file or to stdout, however, what’s still missing is how to set the value of the output file name (outFile). Here’s how we will do this:
We need a new function to parse the command line arguments. We will call it (what else?) processCmdLineArgs
.
In my years of programming I’ve seen (and written) numerous different ways to process command line arguments. A good one that I recently came across, is to do it in two stages: (a) convert all the -x options and their parameters to a map with the option “x” as key and the parameters as data, and (b) process that map, i.e. process each option and set things accordingly. This method works very well when we have a large number of options (think of those complex Linux commands that have 20 or 30 different options). However it would be an overkill in our case as we only have one: “-o output_file” (and maybe we can add one more: “-?” for usage).
Another possible option would be to use the techniques we have learnt in the scanner implementation (by following Jack’s tutorial). After all, the command line arguments are a series of tokens that have to be parsed and actions must be taken based on their values. I have implemented this, as it was too tempting to resist. It appears to be very flexible and very thorough and can deal with anything you can throw at it (same as our compiler). However, the resulting parseCmdLineArgs
function is too lengthy to use just to parse a -o output_file option followed by input_file. I want to retain this solution though for future use, so I have saved the supporting class and a use case in this repository.
So, for now, I have used a simple, short, straightforward function that parses the args
list. It loops through the list using the helper function getNextArg
, and first checks for a "-"
to determine if the arg is an option. If yes, it checks for "-o"
or "-?"
and if "-o"
, it gets the name of the outfile from the next argument. If not "-"
, then it assigns the value of the arg to infile. All that is done with the appropriate error checking of course :
const val USAGE = "usage: CompilerMain [-o output_file] input_file" /** process command line arguments */ fun getNextArg(args: Array<String>, index: Int, argName: String = ""): String { if (index >= args.size) exit("missing argument(s) $argName, $USAGE") return args[index] } fun processCmdLineArgs(args: Array<String>) { var argIndx = -1 while (++argIndx < args.size) { val arg = getNextArg(args, argIndx) if (arg[0] == '-') when (arg) { "-?", "-h", "-H" -> exit(USAGE) "-o", "-O" -> { outFile = getNextArg(args, ++argIndx, "output_file"); continue } else -> exit("invalid option [$arg]\n$USAGE") } else inFile = arg } if (inFile == "") exit("missing argument input_file, $USAGE") }
To activate this function a one-line change is needed in the main initialisation function:
fun initCompiler(args: Array<String>) { processCmdLineArgs(args) // initialise the code module code = x86_64Instructions(outFile) // initialise the scanner module inp = InputProgramScanner(inFile) }
While we are on this topic, now that the compiler can (and most of the times will) send the resulting code to a file, the console would be totally silent. So, I have added one “Hello” message in the beginning and one “Finished” message at the end:
fun main(args: Array<String>) { println("TINSEL(c) compiler v1.1 Jan 2022, Copyright M.Pappas\n") initCompiler(args) parseProgram() println("Successful compilation, ${inp.currentLineNumber-1} source lines") // -1 is needed as an extra new line was added when the input was read }
Finally, I have added an extra comment at the top of the assembly output showing the date and time it was compiled:
fun progInit(progName: String) { outputCommentNl("program $progName") outputCommentNl("compiled on ${Date()}") .... }
Undeclared Identifiers
You may have noticed that our compiler will happily accept any variable name in an expression without checking whether it’s been declared first. This will cause a failure when our compiler’s output is sent to the assembler, so let’s fix this by adding an undeclared variable check. Actually, the same applies to functions.
In addition our compiler will allow the following:
var a ... fun a() { ...
Even though our compiler can cope with this, declaring a variable and a function with the same name, first is not good programming practice, and second it will cause an issue in our assembly code as both will translate to the same symbol:
a: ... a: ...
and this will cause an assembler error.
So, first, let’s combine the check “variable already defined” with the check “function already defined”. For this we need to combine the two maps:
// the variables space
val variablesSpace = mutableMapOf<String,VariableDecl>()
// the functions space
val functionsSpace = mutableMapOf<String,VarType>()
into one:
// the identifiers space map val identifiersSpace = mutableMapOf<String,IdentifierDecl>() /** the declaration space (variables and functions) */ class IdentifierDecl(var fv: TokType, var type: VarType, var initialised: Boolean = false)
With this, our declareVar
and declareFun
functions can check against one single map (identifiersSpace
) whether an identifier has been declared, and if not, they can update this map setting the fv
field accordingly (for a variable or function – the corresponding values have already been defined in the TokType
enumeration).
/** declare a variable */
fun declareVar(name: String, initValue: String) {
// check for duplicate var declaration
if (identifiersSpace[name] != null)
abort ("line ${inp.inLineNumber()}: identifier $name already declared")
identifiersSpace[name] = IdentifierDecl(TokType.variable, VarType.int, initValue!="")
code.declareInt(name, initValue)
}
/** process a function declaration */
fun declareFun(name: String) {
if (identifiersSpace[name] != null)
abort ("line ${inp.inLineNumber()}: identifier $name already declared")
identifiersSpace[name] = IdentifierDecl(TokType.function, VarType.int)
code.declareAsmFun(name)
}
Now that we have joined the declarations space for our variables and functions and can easily tell whether an identifier is a function or a variable, let’s go back to scan
. When scan
sees an alphabetic token, which is not a keyword, it returns exactly that: a token of type identifier without any additional information. And it’s down to our parser to decide whether that identifier is a variable or a function depending on whether there is a "("
following. It looks like we can do this in a smarter way. Let’s change scan
to return the type of the token as well (function, variable or none).
/** get the next token and advance the "cursor" */ private fun scan(): Token { newLine = skipWhite() if (checkEndofInput()) return Token(END_OF_INPUT, Kwd.endOfInput, TokType.none) if (checkNumeric()) return Token(getNumber(), Kwd.number, TokType.none) if (checkAlpha()) return keywordOrFunctionOrVariable(getName()) if (checkSpecialToken()) return getSpecialToken() return getInvalidToken() } /** return a keyword or identifier token based on the keyword tokens list */ private fun keywordOrFunctionOrVariable(name: String): Token { val indx = isKeyword(name) return (if (indx >= 0) languageTokens[indx] // keyword found else // function, variable or other identifier found (determined by Token type) Token(name, Kwd.identifier, identifiersSpace[name]?.fv ?: TokType.none)) }
And with this additional information in the token coming from the scanner, we can improve our parsers. Until now, our compiler cannot process a plain function call (not part of an assignment):
fun a() { ... } main { ... a() <-- this will produce an error ("=" expected), which should not be the case ... }
To fix this let’s first update the BNF definition:
<block> ::= { <statement> * } <statement> ::= <block> | <if> | <while> | <repeat> | <for> | <break> | <return> | <read> | <print> | <assignment> | <function_call> | null [ ; ] <if> ::= if ( <b-expression> ) <block> [ else <block> ] <while> ::= while ( <b-expression> ) <block> <repeat> ::= repeat <block> until ( <b-expression> ) <for> ::= ( <identifier> = <b-expression> to <b-expression> ) <block> <break> ::= break <return> ::= return <b-expression> <assignment> ::= <identifier> = <b-expression> <function_call> ::= <function_name> ( ) <read> :: = read <identifier> [ , <identifier> ] * <print> ::= print <b-expression> [ , <b-expression> ] *
And based on this, the parseStatement
becomes:
fun parseStatement(breakLabel: String) { when (inp.lookahead().encToken) { Kwd.startBlock -> parseBlock(breakLabel) Kwd.ifToken -> parseIf(breakLabel) Kwd.whileToken -> parseWhile() Kwd.repeatToken -> parseRepeat() Kwd.forToken -> parseFor() Kwd.breakToken -> parseBreak(breakLabel) Kwd.retTok -> parseReturn() Kwd.readToken -> parseRead() Kwd.printToken -> parsePrint() Kwd.identifier -> if (inp.lookahead().type == TokType.variable) parseAssignment() else if (inp.lookahead().type == TokType.function) parseFunctionCall() else abort("line ${inp.inLineNumber()}: identifier ${inp.lookahead().value} not declared") Kwd.semiColon -> inp.match() // semicolons are simply ignored else -> inp.expected("valid keyword, semicolon or identifier") } }
Lastly, the parseIdentfier
function can now change to:
fun parseIdentifier() {
when (inp.lookahead().type) {
TokType.variable -> parseVariable()
TokType.function -> parseFunctionCall()
else -> abort("line ${inp.inLineNumber()}: identifier ${inp.lookahead().value} not declared")
}
}
/**
* parse a function call
* <function_call> ::= <function_name> ( )
*/
fun parseFunctionCall() {
val funcName = inp.match(Kwd.identifier).value
inp.match(Kwd.leftParen)
inp.match(Kwd.rightParen)
code.callFunction(funcName)
}
/** parse a reference to a variable */
fun parseVariable() {
val varName = inp.match(Kwd.identifier).value
code.setAccumulatorToVar(varName)
}
With these changes, our compiler now supports the “plain” function call and also checks every time an identifier is mentioned, whether it has been declared (as a variable or a function) and produces an error if not.
Uninitialised Variables
It would be good to have a check for uninitialised variables but we need to think how this will work. For the moment we have the flag initialised in our identifiers space map, which is set to true when a variable is initialised at declaration time. Also need to remember that for now we only use global variables.
So, regardless of initialisation at declaration time, we can set the initialised flag to true when the variable is assigned a value, which is when it appears on the left hand side of an assignment or in a read statement. And then, every time the variable appears on the right hand side of an assignment, we can check the initialised flag and trigger a compile error if not initialised. So far, so good.
However, we are talking about global variables here. They can be modified by any function in our program, even by functions in other files (when TINSEL supports this). This means it’s almost impossible for the compiler to determine at compile time whether a variable has been initialised or not when it appears on the right hand side of an assignment.
So, it looks like we will abandon the idea of checking for uninitialised variables as far as the global variables go, but we will come back to it when we introduce local variables. And that’s what the commercial compilers do when it comes to uninitialised variables.
Finally, don’t forget that global variables that are not initialised in our compiler when they are declared, will be initialised to 0 (zero) by default (see chapter 10).
Return
As mentioned in Chapter VIII, each function must have a return statement. So, to be thorough, our compiler must check this and produce an error if that return statement is not found. The assembler would not complain about the absence of a return statement, however this scenario would lead to unpredictable results, as the code would continue executing whatever bytes it found after the end of our function (most likely some other function or the main program itself from the beginning…). A decent compiler would also check that the return statement is actually reachable and would produce a fatal error otherwise. We will keep it simple though and will only check for the existence of the return statement.
The first question is, where do we do this? The most obvious answer must be in parseBlock
, as this is where the body of the function is processed. However, parseBlock
is called in quite a few other cases like from parseIf
, parseWhile
, parseRepeat
, etc. and even from parseMain
. So, let’s create a new version of parseBlock
, that will be called only form parseFuncDecl
to parse the function block. And let’s assume for a moment that parseBlock
returns a Boolean value indicating whether the block included a return
statement or not (we will see in a minute how we will do this). Our new parseFunctionBlock
will now check the return value of parseBlock
and if false, it knows there was no return
statement:
fun parseFunctionBlock(funName: String) {
val hasReturn = parseBlock()
if (!hasReturn)
abort("line ${inp.currentLineNumber}: function $funName has no ${inp.decodeToken(Kwd.retTok)}")
}
The next question is, how will parseBlock
check for the return statement? parseBlock
is recursive: it calls parseStatement
, which calls parseWhile
, which calls parseBlock
, which calls parseStatement
, which calls parseIf
, which calls parseBlock
, etc. and at some point in this chain there might be a return
statement.
Let’s take a step back and think of how we have implemented return
. It is done by a parseReturn
function that has been added to the parseStatement
where
block. It is at this level where we know that we have a return
statement. And it looks like there is an “easy” way to check, very similar to how Jack implemented the breakLabel
, which is passed down the recursion
chain as a parameter. I will change parseStatement
, to return a Boolean value indicating whether the statement it executed has found a return
statement or not. This return value will be returned further on, by the previous function, up the recursion chain until it reaches the top level parseBlock
(the one that was called from parseFunctionBlock
). This may sound more complicated than it really is, but it boils down to this: each parser that calls parseBlock
, simply reads its return value and returns it to the caller. In the end this value reaches parseFunctionBlock
. And here’s the code for it:
fun parseBlock(breakLabel: String = ""): Boolean {
var foundReturn = false
inp.match(Kwd.startBlock)
while (inp.lookahead().type != TokType.endOfBlock && !inp.isEndOfProgram()) {
if (parseStatement(breakLabel))
foundReturn = true
}
inp.match(Kwd.endBlock)
return foundReturn
}
fun parseStatement(breakLabel: String): Boolean { when (inp.lookahead().encToken) { Kwd.startBlock -> return parseBlock(breakLabel) Kwd.ifToken -> return parseIf(breakLabel) Kwd.whileToken -> return parseWhile() Kwd.repeatToken -> return parseRepeat() Kwd.forToken -> return parseFor() Kwd.breakToken -> { parseBreak(breakLabel); return false } Kwd.retTok -> { parseReturn(); return true } Kwd.readToken -> { parseRead(); return false } Kwd.printToken -> { parsePrint(); return false } Kwd.identifier -> { if (inp.lookahead().type == TokType.variable) parseAssignment() else if (inp.lookahead().type == TokType.function) parseFunctionCall() else abort("line ${inp.currentLineNumber}: identifier ${inp.lookahead().value} not declared") return false } Kwd.semiColon -> { inp.match(); return false } // semicolons are simply ignored else -> { inp.expected("valid keyword, semicolon or identifier"); return false } } }
As you can see, parseStament
now returns a Boolean value indicating whether the statement it executed has found a return statement
or not. parseBlock
checks this return value and sets foundReturn
to true if at least one statement has found a return. At the end of the block, this Boolean variable has the answer to the question “was there a return statement in this block?”. At the same time the foundReturn
value is also returned to the caller, as this block could have been called by a parseStatement
or a parseIf
, etc, and they need to return that value up the recursion chain. This value is checked at the end of the chain by parseFunctionBlock
and produces a compiler error if false.
Looking at parseStatement
, we can see that it returns true on when it executes parseReturn
(as this is where the return
statement is found). It returns false if it executes parseBreak, parseRead, parsePrint, parseFunctionCall
and parseAssignment
(no return
statement there). For all the other statements (parseIf, parseWhile, parseRepeat, parseFor
) it returns their return value. These parsers must record the return value of the parseBlock
that they call and return that to the caller. Where parseBlock
is called more than once (like in parseIf
) then all the return values must be OR
‘ed as the return statement can be in any of these blocks. Easy, isn’t it!!
On a side note, you may have noticed a new variable above: currentLineNumber
. If you remember, when I introduced line numbers in Chapter IX, I had some doubts about the accuracy of the line number. And there is a good reason for this. The lineNumber
variable is increased by 1 with every new line in function getNextChar
, which is called by skipWhite
, which is called by scan
, which is called at the end of match
to advance to the next token. So, lineNumber
simply points to the line where the next token is, not the one we are processing right now.
To fix this, I have introduced the new variable currentLineNumber
, which is set to the lineNumber
at the beginning of match
(i.e. in the beginning of the next cycle). This way currentLineNumber
points always to the line of the current token and can be safely used in our error messages. Don’t forget, we need to set it both in the beginning of match
and also in the beginning of lookahead
, as these are the two points that our parsers call to check the current token.
fun match(keyWord: Kwd = Kwd.any): Token { currentLineNumber = lineNumber ....
fun lookahead(): Token {
currentLineNumber = lineNumber
return nextToken
}
Finally, I have added a -debug
option to the main program that activates a debug function (debugCompiler
) instead of the standard parseProgram
function. In this debug mode our compiler will scan the input and will print each token, next token, line number and other details so that we can easily verify how (and if) our scanner works.
And as always, you can find the code in my GitHub repository.
Coming up next: Fixing the For Loop
Chapter XI: Semicolons and Comments
Based on the original series “Let’s Build a Compiler!” by Jack Crenshaw.
Introduction
In the previous chapter we produced our first real running program written in TINSEL, our own programming language. We could give it input from the keyboard and and see its output on the screen. Yes, I admit, it’s a bit basic, but we’ve come a long way from the one-character tokens and single-line programs from the beginning of the series that were producing a mixture of M68000 instructions and pseudo-code. This has been an amazing journey that we’ve managed to travel following on Jack’s footsteps. And in this chapter we will add a couple of features that will make TINSEL even better.
Semicolons
Same as Jack, I had to use semicolons since I learnt Algol! And I continued using them with Pascal, C and Java. But, guess what – I don’t have to use them anymore now that I have started programming in Kotlin!
So, does a modern language really need semicolons or not? The answer is “yes” and “no”. As Jack explained in detail, one school says that they are just a nuisance – they are adding nothing to the syntax, and a modern language can function perfectly well without them (and Kotlin does actually). Don’t forget that our compiler can compile the below statements correctly, without the need of any semicolons:
a = b * 2 + (x+y) / z i = 1
It can tell where each statement finishes and where the next one starts, even if they extend across multiple lines, so, as the minimalists would say, why use semicolons? Why add an extra mandatory keystroke in every statement, which will be a source of another potential mistake?
The other side of the coin is that semicolons add clarity and make the code less cryptic and much easier to read. Think of the case where we have 2 or 3 short statements in one line:
i=1; a=10; b=a+1;
Without semicolons this line would look a bit “funny”:
i=1 a=10 b=a+1
But, again, including multiple statements in one line is not good programming practice, and should we make semicolons mandatory for that rare case?
So, I will do exactly what Jack did: make them optional. The compiler will process one when it sees one, but will not ask for one if it’s not there.
Based on this, our BNF definition of statement changes as per below:
<statement> ::= <block> | <if> | <while> | <repeat> | <for> | <break> | <return> | <read> | <print> | <assignment> | null [ ; ]
which says that any statement can be optionally terminated by a semicolon and we can also have an empty statement. And the code to implement this is the following three lines in parseStatement
:
... Kwd.identifier -> parseAssignment() Kwd.semiColon -> inp.match() // semicolons are simply ignored else -> inp.expected("valid keyword, semicolon or identifier") }
The changes here are: (a) we only trigger parseAssignment
when we actually see an identifier (and not in the else
clause as before), (b) when we see a semicolon we recognise it and ignore it and (c) we catch any invalid tokens in the else
clause. And that was it!
Now, if you want to make semicolons mandatory, you will need to add the line
inp.match(Kwd.semiColon)
at the end of the parser functions for break, return, read, print and assignment. In that case, you may also want to add them to the parsers for ProgHeader, VarDecl, and ProgEnd for consistency. I’ll let you decide whether you want to make semicolons mandatory in your own version of TINSEL or not.
Comments
Some people are against comments, some are in favour. Some say the code speaks for itself, some see the need of a comment in certain cases to explain what is happening and why, as no matter how “clean” the code is, every now and then there will be an obscure point that needs a comment to clarify. I have decided to include comments in TINSEL so that you can use them if you want to.
The processing of comments can be as easy or as complicated as we make it.
Here’s my spec: I want both block comments like /* comment here... */
and also inline comments like this // comment inline
.
I will also add a little twist: I will define an extra comment token that will transfer those comments from the source code to the output code. For this I will use the tokens /*--
and //--
<block-comment> ::= /* <comment> */ | /*-- <comment> */ <inline-comment> ::= // <comment> <newline> | //-- <comment> <newline>
With these specs in mind, the next question is: where is the best place to process comments?
Jack has put his skipComment
function as low in the scanner as possible, inside skipWhite
. And this is an approach that served him well, especially in the case where he was working with single-character tokens. I tried this in my code but I had two issues with it: (a) it resulted in a very complicated function to detect the start of a comment (and the kind of the comment) – 27 lines – the reason for this is that by working inside skipWhite
, I had to go back to string comparisons and could not use the power of scan
that translates the character sequences to tokens, and (b) the comments that are transferred to the output code are printed in the wrong place (I will explain this in more detail shortly). So based on this, I abandoned the idea. If you want to see how I implemented it, I have saved it in the file InputProgramScanner_1.kt
in the GitHub repository.
Another idea would be to process comments as high as possible, in the parser. That would mean a parseComment
function that would have to be triggered when a start-of-comment token was returned by scan
(or actually match
). However, that parse function would have to be called in all the other parser functions where a comment could be seen and this is too many places. Also, we would have another issue in case of an in-line comment that would have to be transferred to the output – our parseComment
would not know where to stop in that case, as the newline is not visible at the parser level (it is skipped in skipWhite
). To resolve this we would have to treat newline as another token and have special logic in scan
to decide when to skip it and when to return it as the next token. For these reasons I abandoned this idea too.
So, I have decided to go for a solution in the middle – process comments in match
. This is the place where we match each token and then we advance to the next one (including skipping white spaces in scan
), so naturally this would be the right place to process any comments if the next token is the start of one. Normally we would want to process the comments in the beginning of match so that any comments would be processed (and printed in the right place) and then nextToken
would be advanced to the next actionable token to be matched. And this would be a good approach apart from one issue: match
is not the only place where we check the next token. We also check it via lookahead
, which returns the next token without advancing the cursor. So we have a discrepancy here where match
would point to the next actionable token skipping any comments, while lookahead
would point to the next token even if that was a comment. To fix this we would have to process the comments in lookahead
as well, but that would mean that lookahead would also advance the cursor and this would cause another discrepancy, as lookahead
is supposed to not move the cursor.
It looks like the processing of the comments has to be done at the end of match
. And this makes sense based on the logic of my code, given that this is where the previous token has been matched, we have called scan
, skipped white spaces and advanced to the next token. In my mind, this is the right place to check for any comments and process them, so that nextToken
will point to the next actionable token. We still have one issue though. Imagine this scenario:
program prog1 var a //-- function to calculate square fun square() { ...
Here we are processing the var
declaration, we have matched “var” and are calling match
to get the variable called “a”. This call, after matching “a” would process (and print) the comment that follows and then advance to the fun
token, ready for the next match. Given that the assembly code is transferred to the output after match
returns, while the comment would be printed from inside match
, the assembly code would look like this:
# program prog1 .data .align 8 # # function to calculate square a: .quad 0 .text .align 8 .global _start square: ...
As you can see the comment is transferred to the output code in the wrong place – it is printed just before the declaration of var a. There is an easy way to fix it. If we have a printable comment, we don’t actually print it, but keep it in a variable instead. We then check the contents of that variable in the beginning of match and print its contents if any. This way each comment that has been processed is printed in the output code in the beginning of the next match cycle, which is where it should be.
So, here’s the new function getComment
, which branches into two functions that process block and inline comments respectively.
/** get a comment */ private fun getComment() { while (nextToken.type == TokType.commentStart) when (nextToken.encToken) { Kwd.blockComment -> getCommentBlock() Kwd.blockCommentOut -> getCommentBlock(true) Kwd.inlineComment -> getCommentInline() Kwd.inlineCommentOut -> getCommentInline(true) else -> expected("start of comment") } } /** get a block comment */ private fun getCommentBlock(printToOut: Boolean = false) { var localCommentString = code.COMMENT val endComment: String = decodeToken(Kwd.commentEnd) while (!inputProgram.substring(cursor).startsWith(endComment) && nextChar != endOfInput) { localCommentString += nextChar if (nextChar == '\n') localCommentString += code.COMMENT getNextChar() } localCommentString += '\n' nextToken = scan() // nextToken now points to endComment or endOfInput if (nextToken.encToken == Kwd.endOfInput) expected(endComment) nextToken = scan() // nextToken now points to the next token after the comment if (printToOut) commentString += localCommentString } /** get an in-line comment */ private fun getCommentInline(printToOut: Boolean = false) { var localCommentString = code.COMMENT while (nextChar != '\n' && nextChar != endOfInput) { localCommentString += nextChar getNextChar() } localCommentString += '\n' nextToken = scan() if (printToOut) commentString += localCommentString }
getCommentBlock
reads the input until it finds the end-of-commend token, while getCommentInline
reads the input until it gets a newline. Both transfer the comment to the commentString
variable which is printed in the beginning of match
in the next call. This is the modified match
function:
fun match(keyWord: Kwd = Kwd.any): Token { printComment() // any comments found in the previous call must be printed in the output code now if (keyWord != Kwd.any && nextToken.encToken != keyWord) // check keyword to match expected(decodeToken(keyWord)) val thisToken = nextToken nextToken = scan() // advance to next token getComment() // process any comments return thisToken } private fun printComment() { if (commentString != "") { code.outputCode(commentString) commentString = "" } }
The only thing to add is that any comments before our first token (which is program
if you remember) will not be processed here. To fix this we just need to add one more call to getComment
at the end of the init
section of our scanner. This will take care of any comments before the program...
Also, please note, this version does not support nested comments.
I did not expect the processing of comments to be so complex, but it is there now and it’s working. As I spent quite some time on the scanner in this chapter, I have taken the opportunity and improved the scan
function, which is the brains and engine of our scanner. I have broken it down to smaller functions and made it easier to read (full details in the GitHub repository):
private fun scan(): Token { skipWhite() if (checkEndofInput()) return Token(END_OF_INPUT, Kwd.endOfInput, TokType.none) if (checkNumeric()) return Token(getNumber(), Kwd.number, TokType.none) if (checkAlpha()) return keywordOrIdentifier(getName()) if (checkSpecialToken()) return getSpecialToken() return getInvalidToken() }
And in the spirit of tidying up, I have also removed the field info
from the Token
class. It has not been used and I don’t think it will be needed at all. With these enhancements in place, you can rewrite the TINSEL program from chapter X like this:
/*-- My First TINSEL program demonstrates the basic capabilities of the language Accepts an integer n as input and prints the squares of all the integers from 1 to n */ program tinselExample //-- variables declarations var i = 0, n /*-- calculate the square of i only for positive i */ fun i_square() { if (i <= 0) { return -1 } return i * i; } // main entry point main { read n; //-- main loop repeat { i = i + 1 print i_square(); } until (i >= n) } endprogram
As I said earlier in this chapter, it’s up to you how much you want to use comments in your code and whether you want to use semicolons and when.
And as always, you can find the code in my GitHub repository.
Coming up next: Improving our Error Checking
Chapter X: TINSEL for x86-64
Based on the original series “Let’s Build a Compiler!” by Jack Crenshaw.
Introduction
In the first 6 Chapters we have closely followed the method and the steps that Jack taught us in his original tutorial, which made it so easy to deliver a working compiler in very simple and easy to understand steps.
In Chapter VII we removed the last remaining limitations (single character tokens) and set up the necessary structure to be able to recognise any token and pass it to our parser. We also set up the tools to make our language fully configurable and to be able to define and recognise any token we want.
In Chapters VIII and IX we defined our own programming language: TINSEL, which looks like a mini C. We demonstrated that our compiler can process TINSEL programs with decent error checking.
We still have limitations on the output side, as the code we are producing is still a mixture of M68000 instructions and dummy commands. Obviously this output is not much use for any practical application.
And this is what we will fix in this Chapter: we will build a new code module, for x86-64, and will make any other necessary changes in our compiler to produce assembly code for x86-64, for Linux. That code anyone will be able pass through an assembler and linker and then run it and test it and see the result of.
Some Background
First, I’d like to set the plan and explain what exactly we will do in this Chapter. This will not be a lesson on Assembly language, however, there are certain things which have to be explained in detail so that you understand why our output code is generated in a specific way and how it works.
Please note these important points to ensure that you can actually run the code that our compiler produces:
- The output code that we will produce here is compatible with the GNU Assembler. As you may know, there are two standards in the x86 world, the Intel standard and the AT&T standard. The main difference between the two is the order in which Source – Destination appear in the instructions (plus a few more minor differences). Our compiler will produce AT&T-standard assembly code which is recognisable by the GNU assembler. If anyone wants to use Intel-standard and the NASM assembler, they will have to change the code module accordingly.
- The resulting code can be run only on Linux (has been tested on Ubuntu 20.04 and Fedora 35, 64-bit). The reason for this is that certain system calls have been used for reading, writing and exit program. These calls are specific to Linux and the 64-bit architecture and would not work on Windows. In order for the output code to work on Windows, these system calls would have to be changed to the equivalent Windows system calls.
- The output code is compatible with the x86-64-bit architecture only – it will not run on 32-bit architecture. The 64-bit registers are being used throughout the code but also the calling convention is different in x86-64. In x86-64 the first 6 parameters to a function are passed via CPU registers (as long as they are integers or pointers) while previously they were passed via the Stack. So, to make our output code work in 32-bit architecture, not only the registers have to be changed (
rax -> eax, rbx -> ebx,
etc), but also the calling sequence when a function is called must be changed. Unless you are very comfortable with x86 assembly, this can be a very delicate operation. - Given that we want to be doing input and output in our TINSEL programs, we need a run-time library that will support these functions (we don’t want to keep rewriting these in every program). I have implemented these functions in x86-64 assembly and the source code is in my Github repository. These will have to be linked with your compiled TINSEL program(s) to produce the final executable.
- Any programs you write on TINSEL can either be run autonomously or can be linked with a C program and can be invoked from there. You will have to make certain changes for this and take certain considerations into account – more details later in this chapter.
Please don’t be confused by the above. Everything will be clarified by the end of this chapter. In the end, if you are not interested at all in Assembly, you can just skip to the end of this chapter, take the sources from the Github repository and run them on your Linux box. On the other hand, I’ve always wanted to know “what’s inside” and understand “how it works”, so any opportunity to write Assembly code is a temptation I can’t resist.
For those of you who are interested, a good quick reference is here, while you can find a detailed x86-64 (AT&T flavour) reference manual in the Oracle documentation pages.
The x86-64 Calling Convention
As mentioned above, in x86-64, the function parameters are passed via CPU registers. There are sixteen 64-bit registers: %rax, %rbx, %rcx, %rdx, %rdi, %rsi, %rbp, %rsp
, and %r8-r15
. The following six are used to pass the first six integer or pointer parameters (in the below order) to a function that is called:
first six int or ptr parameters: %rdi, %rsi, %rdx, %rcx, %r8, and %r9
Any additional parameters or any large structures are passed through the stack (first parameter at the top).
The function will return its value (if there is one) in the following CPU register
return value: %rax
.
I have applied this calling convention (a) in the code module of our compiler when a function is called (at the moment this is used only when a library function is called – our Tinsel functions don’t support parameters yet) and (b) in the runtime libraries, which consist of functions that our code will be calling (see below). For this reason the programs that are produced by our TINSEL compiler are not compatible with the 32-bit architecture.
Runtime Libraries
Every programming language has a runtime system that includes libraries, which are linked with the program (statically or dynamically). These libraries provide basic functions that the program needs in order to run.
The first version of our runtime library has only two functions that are available to our TINSEL programs, read_i_
and write_i_
, which are used on the back of the TINSEL commands read
and print
respectively. There is also a number of other functions but those are used internally within the library.
The read_i_
function takes no parameters and returns the integer value read from the keyboard in %rax
. Internally it calls the read_s_
function, part of the same library, which reads a string from the standard input using the Linux system call #0 (= read). To read the string, it uses an internal buffer (defined in the same library) to store it temporarily, and then calls another library function, atoi_
, which converts the string to integer.
The write_i_
function takes one parameter (in %rdi
), which is the integer to be printed. It first converts it to string using itoa_
(another library function) and then calls write_s_
which prints the string to standard output using the Linux system call #1 (= write).
Even though at the moment our programs will be using only read_i_
and write_i
, the other library functions will come useful in later chapters when we introduce strings.
This version of our runtime library consists of two assembly files (readwrite.s
and string_utils.s
) that you can find under the libtinsel/src
directory in the GitHub repository. I will explain at the end of this chapter how to pass them through the Assembler and link them into your TINSEL programs.
Please note, you can also try these library functions from a C program on Linux (Ubuntu 20.04 or Fedora 35, 64-bit) as follows:
/* test1.c */ extern long long int read_i_(); extern void write_i_(long long int); long long int i; void main() { do write_i_(i = read_i_()); while(i); }
Remember the library routines are written for 64-bit architecture, so all integers must be declared as long long int
, otherwise negative integers will not work. You can compile this small test program and link it with our library using this command:
gcc -o test1 test1.c readwrite.s string_utils.s
and you can just run it as ./test1
. It will read an integer from the keyboard and print it on the screen in a loop until you give it a 0 (or Enter
). You may notice that each integer you enter is printed without a new line, and this is exactly what write_i
is supposed to do: write an integer on the screen. However, in order to make the output of our programs easier to read, we will address this gap with a temporary workaround later in this chapter, when we implement the code for print
, and with a more permanent solution when we introduce strings in one of the following chapters.
The x86-64 Code Module for our Compiler
This is where we will create a new code module that will produce x86-64 code. There are almost no changes required in our Lexical Scanner or the Parsers other than a couple of places where we need to call some functions to generate extra code.
Please copy the code file M68000Instructions.kt
to x86-64Instructions.kt
. Remember to also change the class name to class x86_64Instructions
. Also change the code
variable in the CompilerMain.kt
:
// the Intel x86-64 instruction set val code = x86_64Instructions()
Now, let’s first tidy up the routines that send code to the output. We need to write code (a) with or without a tab in the beginning, (b) with or without a newline at the end, (c) output a label and (d) output a comment. The routines to do this will be:
private val COMMENT = "#" /** output code */ private fun outputCode(s: String) = print(s) /** output code with newline */ fun outputCodeNl(s: String = "") = outputCode("$s\n") /** output code with tab */ fun outputCodeTab(s: String) = outputCode("\t$s") /** output code with tab and newline */ fun outputCodeTabNl(s: String) = outputCodeTab("$s\n") /** output comment */ fun outputComment(s: String) = outputCode("$COMMENT $s\n") /** output a label */ fun outputLabel(s: String) = outputCodeNl("$s:")
As you can see, only one of these functions uses print
(and the rest use the first one), so when we decide to send the compiler output to a file instead, it will be a very easy change.
Next step, we need to include some new functions that will create certain sections, symbols and other declarations that the GNU assembler needs in order to produce a complete, runnable object module.
I have decided as first line to have the program name in a comment. We then need to define the .data section, align it to 8-bit boundary and declare a copyright string. The function that does all that is:
/** initialisation code for assembler */ fun progInit(progName: String) { outputComment("program $progName") outputCodeNl(".data") outputCodeNl(".align 8") // copyright message outputCodeTabNl("tinsel_msg_: .string \"TINSEL version 1.0 for x86-84 (Linux) Dec 2021 (c) M.Pappas\\n\"") // newline outputCodeTabNl("newline_: .string \"\\n\"") }
It has to be called at the very beginning, so I have added it to the parseProgramHeader
code:
fun parseProgHeader() { inp.match(Kwd.startOfProgram) code.progInit(inp.match(Kwd.identifier).value) }
Next, we need the necessary code to declare variables. One thing to remember is that in Assembler there are no real variables, there are just labels. So effectively, our variable declaration results in defining a label. In order to treat this as a variable we need to reserve memory where this label is pointing – this is where the value of this variable will go. As our variables are only integer for now we need to allocate 8 bytes (or a quad word) for each variable. In order to reserve these 8 bytes, we have to give them a value. This means that in case of uninitialised variables (in the TINSEL program, that is) we have to choose a value and I have chosen 0. So, any global variable that is not initialised in our TINSEL program will default to 0.
/** declare int variable (64bit) */ fun declareInt (varName: String, initValue: String) { if (initValue == "") outputCodeTabNl("$varName:\t.quad 0") else outputCodeTabNl("$varName:\t.quad $initValue") }
We call this code generating function as the last line in declareVar
(that declares each of our global variables):
code.declareInt(name, initValue)
With the above, the .data
section is completed and it’s time to move to the .text
section. We first declare our program’s entry point (_start
) as global so that the linker can find it.
private val MAIN_ENTRYPOINT = "_start" /** initial code for functions */ fun funInit() { outputCodeNl("\n.text") outputCodeNl(".global $MAIN_ENTRYPOINT") }
This is called in parseProgram after we are done with variables and just before we start processing the function declarations, which is where the .text
section starts.
fun parseProgram() { parseProgHeader() if (inp.lookahead().encToken == Kwd.varDecl) parseVarDecl() code.funInit() if (inp.lookahead().encToken == Kwd.funDecl) parseFunDecl() parseMainBlock() parseProgEnd() }
We also need to implement the declareAsmFun
routine, which is nothing else but a comment with the function name and a label to declare the start of that function:
fun declareAsmFun (name: String) { outputCodeNl() outputComment("function $name") outputLabel(name) }
This is called as the last line in declareFun where each function is declared:
code.declareAsmFun(name)
Then we come to the point where our main
program starts. We mark this with a comment to make it visible in the assembly code, and we declare the _start
label. We also add a couple of lines of executable code which prints the TINSEL hello message on the screen:
fun mainInit() { outputCodeNl() outputComment("main program") outputLabel(MAIN_ENTRYPOINT) outputComment("print hello message") outputCodeTabNl("pushq\t%rbx") outputCodeTabNl("lea\ttinsel_msg_(%rip), %rdi") outputCodeTabNl("call\twrite_s_\n") }
At the end of main, apart from a comment to mark it, we also need the necessary code to stop our program and return to the operating system. We achieve this by using the Linux system call #60 (exit).
fun mainEnd() { outputCodeNl() outputComment("end of main") outputCodeTabNl("popq\t%rbx") outputCodeTabNl("movq\t$60, %rax\t\t$COMMENT exit system call") outputCodeTabNl("xorq\t%rdi, %rdi\t\t$COMMENT exit code 0") outputCodeTabNl("syscall") }
These two functions are called in the beginning and at the end of parseMainBlock
respectively:
fun parseMainBlock() { labelPrefix = "main" // set label prefix and label index labelIndx = 0 inp.match(Kwd.mainToken) code.mainInit() parseBlock() code.mainEnd() }
You may have noticed the addition of labelPrefix
which is set to “main
” above. It’s also set to the function name in parseFunDecl
. This is to help have more meaningful names for the labels in our assembly program (as opposed to them running from L0:
to Lxxx:
)
/** create a unique label*/ fun newLabel(): String = "${labelPrefix}_L${labelIndx++}_"
And with that we come to the end of the program where I have added a comment line to mark the end:
fun parseProgEnd() { inp.match(Kwd.endOfProgram) code.outputCodeNl() code.outputComment("end program") }
Having done the above, it is then straight forward to translate the M68000 instructions to x86-64 instructions. A couple of points to note: I have used %rax
(obviously) as the accumulator, and when an operation between the top of stack and %rax
is to be done, I pop the value from the stack into %rbx
:
/** add top of stack to accumulator */ fun addToAccumulator() { outputCodeTabNl("popq\t%rbx") outputCodeTabNl("addq\t%rbx, %rax") }
Also, all variable references are relative to the program counter, same as in Jack’s M68000 code:
/** set variable to accumulator */ fun assignment(identifier: String) = outputCodeTabNl("movq\t%rax, ${identifier}(%rip)")
Finally, below follows a special mention of how the cmp
instruction and the flags are used. Jack spent quite a few paragraphs explaining this and it’s time I do the same.
A Note on CMP
If you are like me and want to know “what’s inside”, this is the topic to pay attention to. Here’s how I have implemented the is equal (e.g. x == 1
):
fun compareEquals() { outputCodeTabNl("popq\t%rbx") outputCodeTabNl("cmp\t%rax, %rbx") outputCodeTabNl("sete\t%al") // set AL to 1 if comparison is == outputCodeTabNl("andq\t$1, %rax") // zero the rest of rax and set flags - Z flag set = FALSE }
The first two lines are obvious – we pop the top of stack into %rbx
and compare with %rax
. But what do the two lines that follow do?
To understand this, we need first to go back to chapter 6, where we decided that any integer can be treated as boolean and vice-versa. And we also decided that the value of 1 means true and the value of 0 means false. We also need to remember that the conditional commands in assembly (e.g. je
= jump if equal) work based on the values of the flags (or flags register).
Let’s see now how cmp
works:
cmp %rax, %rbx
It subtracts the value of %rax
from the value of %rbx
and, depending on the result, it sets the flags respectively. The most important flags when we are dealing with signed integers are ZF and SF (the Zero and the Signed Flag respectively). The result of this subtraction is then thrown away and %rbx
retains its original value.
So, if we compare x == 1
(and let’s assume that x is actually 1), then the result of the above subtraction will be 0 and the Zero Flag will be set. So if after this we wanted to do a jump if true command, we would do a jz
(jump if Zero Flag is set). So in this case jump if true = jz.
Now let’s assume that we fetch a boolean variable in %rax
, i.e. an interger that would have a value of either 0 or 1. And assume we set the flags according to the contents of %rax
. Same as above, let’s assume that the value is true, i.e. the value of %rax
is 1. And if we wanted to do a jump if true command we would have to do jnz (jump if Zero Flag not set). So in this case jump if true = jnz.
You can see the discrepancy above. In addition, the above cmp instruction just sets the flags and we have no boolean result (0 or 1) to store as a boolean variable.
So, in order to solve this problem I have followed exactly the method Jack used. After the comparison and the setting of the flags, set the value of %rax
to 0 or 1 based on the convention 1=true and 0=false. And as a following step, set the flags again based on the new value of %rax
. This is exactly what the extra two lines of code in compareEquals
do:
sete %al
sets %al
(the lowest 8 bits of %rax
) to 1 if the comparison result is true (if the two numbers are equal, that is) and to 0 if false.
andq $1, %rax
does two jobs (a) performs a bitwise and of the contents of %rax
and the constant $1
, i.e. sets all bits of %rax
to 0 apart from the first bit so that %rax
can take the values of 1 or 0 only and (b) sets the flags (ZF will be set if %rax
is 0).
With this we have satisfied both scenarios: (a) %rax
takes the value of 1 if the condition is true and 0 if false, so it can be used a boolean variable and (b) the Zero Flag is set when %rax
is 0 (=false) which means that jz
always means jump if false.
One more detail to complete this topic. In the family of compare*
functions we are getting the final setting of the Zero Flag for free as it is done by the and
instruction. We need to remember though that we must ensure the Flags register is set every time we set %rax
to a value. I have done this by using the test
instruction in setAccumulator
and setAccumulatorToVar
(in all other cases when %rax is the result of a numeric or boolean operation the Flag register is also set automatically):
fun setAccumulator(value: String) { outputCodeTabNl("movq\t$${value}, %rax") outputCodeTabNl("testq\t%rax, %rax") // also set flags - Z flag set = FALSE } fun setAccumulatorToVar(identifier: String) { outputCodeTabNl("movq\t${identifier}(%rip), %rax") outputCodeTabNl("testq\t%rax, %rax") // also set flags - Z flag set = FALSE }
This way we can write TINSEL code like
while (true) { ...
or
while (nextIteration) { ...
or
print x==1
Before we leave this topic let’s add two more parse functions, parseLessEqual
and parseGreaterEqual
and their corresponding assembly code functions compareLessEqual
and compareGreaterEqual
. These are based on the their sister functions parseLess
, parseGreater
, compareLess
and compareGreater
.
parseLessEqual
and parseGreaterEqual
must be triggered in parseRelation
when the appropriate relational operators are found.
Read and Print
The implementation of parseRead
and parsePrint
that support the read
and print
commands is much more straightforward than that of cmp
in the previous section. We only support integers for now, so parseRead
calls readInt
in our code module, which in turn calls read_i_
from our runtime library, and then assigns the value returned to the variable we are reading.
Similarly, parsePrint
calculates the given expression and then calls printInt
, which in turn calls write_i_
from our library to print the result on the screen. In addition parsePrint
calls println
that prints a newline, otherwise our output would be unreadable. This is temporary workaround until we implement strings, at which point we will implement proper print
and println
in TINSEL.
/**
* parse read statement
* <read> :: = read <identifier> [ , <identifier> ] *
*/
fun parseRead() {
var varName: String
do {
inp.match()
varName = inp.match(Kwd.identifier).value
code.readInt(varName)
code.assignment(varName)
} while (inp.lookahead().encToken == Kwd.commaToken)
}
/**
* parse print statement
* <print> ::= print <b-expression> [ , <b-expression> ] *
*/
fun parsePrint() {
do {
inp.match()
parseBooleanExpression()
code.printInt()
code.println()
} while (inp.lookahead().encToken == Kwd.commaToken)
}
And with this we are done, we are now ready to write a fully functional TINSEL program and run it.
Running our Code
And that’s it! We can now write and compile a complete TINSEL program and can run it on our Linux workstation. This is how you do it:
- First prepare the runtime library (you do this only once and keep the .o files that you will need in every TINSEL program:
as -o readwrite.o readwrite.s
as -o string_utils.o string_utils.s
- Write a TINSEL program, e.g.
prog1.tnsl
- Compile
prog1.tnsl
and copy the Compiler output (the assembly code) to a file:prog1.s
- Pass the Compiler output through the assembler:
as -o prog1.o prog1.s
- Link the code to produce the final executable:
ld -o prog1 prog1.o readwrite.o string_utils.o
- Run your program:
./prog1
If you don’t fancy this multiple step process, you can always use gcc and do it all in one step:
gcc -o prog1 prog1.s readwrite.s string_utils.s
This will link your TINSEL program with the C runtime libraries, so for this to work you must change the entry point of the program to: val MAIN_ENTRYPOINT = "main"
.
And to enjoy the result of our work try this TINSEL program:
program myprog1
var i = 0, n
fun i_square() {
if (i <= 0) { return -1 }
return i * i
}
main {
read n
repeat {
i = i + 1
print i_square()
} until (i >= n)
}
endprogram
You give it an integer (n
) as input and it will print the squares of all the integers from 1 to n. Try it on your Linux box. (please ignore the if(i<=0)...
line in function i_square
– it’s there just to test the correct generation of the labels)
I will stop here. This has been a long chapter, but I did not want to interrupt until we got to this point. We have reached a major milestone, where we can actually run our TINSEL programs on a real computer and see the real result on our screen.
Enjoy!
And as always, you can find the code in my GitHub repository.
Coming up next: Semicolons and Comments
Chapter IX: More TINSEL
Based on the original series “Let’s Build a Compiler!” by Jack Crenshaw.
Introduction
In the previous chapter we Introduced the basic structure of TINSEL, our mini programming language. Here, we will continue defining these structures to make it even friendlier. The aim is to write a complete working program for x86 in the next chapter.
Commas
Looking at the 6 lines that were needed to define 6 variables in the TINSEL example at the end of the previous chapter, it is obvious we would benefit a lot by processing a comma separated list of variables:
<variable declaration> ::= var <identifier> [ = <value> ] [ , <identifier> [ = <value> ] ] *
It should be very simple to extend our current parseVarDecl
to support this. Given that this function must process at least one variable declaration, it makes sense to use a do - while
loop:
/** * parse variables declarations * <variable declarations> ::= var <identifier> [ = <value> ] [ , <identifier> [ = <value> ] ] * */ fun parseVarDecl() { while (inp.lookahead().encToken == Kwd.varDecl) { do { inp.match() parseOneVarDecl() } while (inp.lookahead().encToken == Kwd.commaToken) } } /** parse one variable declaration */ fun parseOneVarDecl() { val varName = inp.match(Kwd.identifier).value var initValue = "" if (inp.lookahead().encToken == Kwd.equalsOp) { inp.match() initValue = inp.match(Kwd.number).value } declareVar(varName, initValue) }
Don’t forget that in order to activate a new token, its encoded value must be added to the Kwd
enumeration and the actual symbol/keyword (the ','
in this case) must be added to the languageTokens
list.
enum class Kwd { ... commaToken, ... } languageTokens.add(Token(",", Kwd.commaToken, TokInfo.keyword, TokType.none))
That was a quick one! Change the previous TINSEL example so that all the vars or combinations of some vars are in a comma separated list and make sure it works.
Line Numbers
I believe you will agree with me that our compiler has decent error checking. But as our programs can now grow in size, it might be difficult to spot in our TINSEL input program where exactly the error was. I think a line number in the error message might be really helpful.
First we will need a global variable in the scanner, which will be increased by 1 every time we see a newline. I believe the best place to do this is in skipNextChar
as this is the place where our code goes through the whole input program one character at a time (including white spaces and newlines):
// input program line number var lineNumber = 1
/** set the lookahead character to the next char from input */ private fun skipNextChar() { if (nextChar == '\n') ++lineNumber nextChar = if (cursor < inputProgram.length-1) inputProgram[++cursor] else endOfInput }
With this done, we just need to print this line number when we print an error message:
/** report what was expected and abort */ fun expected(expMsg: String) { abort("line $lineNumber: expected [$expMsg] found [${nextToken.value}]") }
That was easy! But is it working? Let’s feed our compiler deliberately wrong input e.g. by removing an opening {
or adding an unrecognised symbol at the end of a line. Now observe the error message and the line number. Is it correct? Well, it’s not far off, but it’s actually pointing to the line where the next token is instead of the line where the error occurred.
And there is a reason for this: every time we extract a token from the input (notably in getName, getNumber and getSpecSeq
) we skip all the following whitespaces and newlines until we reach the next token, ready for the next match
. And this is why our line number is a bit off.
There is an easy solution to this. Instead of advancing the cursor after we read each token and have it pointing to the next token, we will skip the white spaces and newlines just before we read the next token. This way the line number will always be pointing to the line of the last token we’ve just read.
To achieve this, remove the call to skipWhite
from getName, getNumber, getNextChar
, getSpecSeq
and init
, and add it just once at the top of scan
as this is the function we call every time we want to return a token:
/** get the next token and advance the "cursor" */ private fun scan(): Token { // first skip white space skipWhite() if (nextChar == endOfInput) // check for end of input ....
Try it again. You will see that the line number is now pointing to the line where the error is. There might be a few outlier cases where the line number might be off, but I believe in most cases it will be accurate.
After this change, if you look carefully, you will see that getNextChar
has now been reduced to just one line that calls skipNextChar
. Based on this, we can remove this version of getNextChar
, and rename skipNextChar
to getNextChar
(which is exactly how Jack had it originally designed!).
In addition, if we look now at getName
and getNumber
, they can be further simplified. Both of them are called from only one place, from scan
, and only when scan
knows that there is an alpha or a numeric character ahead:
if (isNumeric(nextChar)) // check for number return Token(getNumber(), ... ... if (isAlpha(nextChar)) { // check for name val name = getName() ...
With this in mind, the check in the beginning of getName
and getNumber
, to abort if they don’t see an alpha or a numeric character respectively, is now redundant and can be removed.
With this change, all of our error checking at the scanner level is now done in match
(with the help of scan
of course). The syntax of our whole language is checked in just one line and any error is caught there. This is really neat and is based on the design from Jack.
I’ll let you manage these minor code changes, but if you need to verify, please check the source in the GitHub repository.
Input / Output
No program is any use unless it can communicate with the user. I will implement this through read
and print
.
read
will be reading integer values from the standard input into a comma separated list of variables and print
will be printing onto the standard output the integer result(s) of a comma separated list of expressions:
/**
* parse read statement
* <read> :: = read var [ , var ] *
*/
fun parseRead() {
var varName: String
do {
inp.match()
varName = inp.match(Kwd.identifier).value
code.dummyInstr("Read value into $varName")
code.assignment(varName)
} while (inp.lookahead().encToken == Kwd.commaToken)
}
/**
* parse print statement
* <print> ::= print expression [ , expression ] *
*/
fun parsePrint() {
do {
inp.match()
parseBooleanExpression()
code.dummyInstr("Print value")
} while (inp.lookahead().encToken == Kwd.commaToken)
}
Here you see some more dummy instructions but these will go away in the next chapter.
Remember, we also have to define the two new keywords (read
and print
) in the LanguageTokens
file and also to recognise them and point to the two new parsers in parseStatement
.
/** parse a statement */
fun parseStatement(breakLabel: String) {
when (inp.lookahead().encToken) {
Kwd.ifToken -> parseIf(breakLabel)
Kwd.whileToken -> parseWhile()
Kwd.repeatToken -> parseRepeat()
Kwd.forToken -> parseFor()
Kwd.breakToken -> parseBreak(breakLabel)
Kwd.retTok -> parseReturn()
Kwd.readToken -> parseRead()
Kwd.printToken -> parsePrint()
else -> parseAssignment()
}
}
It’s amazing how with an extra 7-8 lines of code we can add functionality like reading and printing a series of variables and expressions respectively. This is the beauty in Jack’s method.
Case Sensitive Names
To close this chapter, I’d like to ensure all our names and keywords are case-sensitive, same as with all modern languages. This can be achieved with some minor changes:
Remove the uppercase()
conversion in getName
:
token += nextChar.uppercase()
Remove the ignoreCase
parameter (currently set to true
) when we check for a keyword in isKeyword
. By default it’s false
, so the comparison is case-sensitive:
if (languageTokens[i].value.equals(name, true)) // check for keyword match
It can also be written like this (in Kotlin):
if (languageTokens[i].value == name) // check for keyword match
With this, program
is a valid keyword, while Program
isn’t and the variable Alpha
is different to alpha
.
In the last three chapters I put emphasis on reshaping the lexical scanner (to make it recognise all possible tokens and also make it configurable), and also on defining the structures of our “toy” language to ensure we can write programs one can read and understand and which will be able to do something useful and produce some output. As a result I have neglected the part where the output code is generated, hence you see a number of dummy instructions. This will go away in the next chapter where I will start adapting the code module for x86-64.
And as always, you can find the code in my GitHub repository.
Coming up next: TINSEL for x86-64
Chapter VIII: Introducing TINSEL
Based on the original series “Let’s Build a Compiler!” by Jack Crenshaw.
Introduction
In the previous chapter we changed our lexical scanner so that it can now recognise any token and any operator that we have defined in our programming language definition. And we verified this, by seeing the input stream broken down into distinct tokens that were encoded and categorised and ready to be passed to the parser.
In this chapter we will use this capability and define the full specification of our own programming language that we will call TINSEL (Technology Integration and Next Step in Evolutionary Learning). I believe this is an appropriate name for our mini-language, given that this chapter is published in the beginning of December.
At the end of this chapter we will be able to write a complete TINSEL program that will be as easy to read as C or Java or Kotlin, and which we will be able to compile with decent error checking included.
But first, let’s complete the change we started in the previous chapter, by making the necessary changes in the 3 parser modules to make them compatible with the new scanner.
Connecting Parser and Scanner
Now that our tokens consist of multiple characters and are being returned as an enumerated value, our parser routines have to be changed accordingly. Please take the three files that were part of Chapter 7 and create a new package under your project together with the three ProgParser* and the M68000Instructions files from Chapter 6.
Please make the following changes:
- Where we check against the value of a token (e.g.
'addOp'
or'ifToken'
) the corresponding enumerated value must be used (e.g.Kwd.addOp
orKwd.ifToken
) - Where we check the value of
inp.nextChar
, we must now use the value ofinp.lookahead().encToken
and must check against the corresponding enumerated value (see point above) - Where
match
is called with achar
parameter, it must now be called with the corresponding enumerated value (inp.match(endIfToken)
becomesinp.match(Kwd.endIfToken)
) - Where the type of token is checked by using e.g.
isEndOfBlock
orisAddop
orisOrOp
thetype
attribute of the token must be checked instead: e.g.inp.lookahead().type == TokType.endOfBlock
orinp.lookahead().type == TokType.addOps
- Where we check for an Alpha or a Numeric token with
inp.isAlpha(inp.nextChar)
orinp.isNumber(inp.nextChar)
, we must now useinp.lookahead().encToken == Kwd.identifier
orinp.lookahead().encToken == Kwd.number
- The same goes for
inp.isLeftParen
and all the other similar checks - Replace any call to
inp.getName()
withinp.match(Kwd.identifier).value
and any call toinp.getNumber()
withinp.match(Kwd.number).value
; these will fetch an identifier or a number from the input with the appropriate error checking - Replace the line with the call to
inp.getBoolean()
inparseBooleanFactor()
withif (inp.match(Kwd.booleanLit).value == BOOLEAN_TRUE)
- Reinstate the line
val code = M68000Instructions()
inCompilerMain.kt
For example, here’s how parseBlock
and parseFactor
will look after these changes:
/** * parse a block * <block> ::= [ <statement> ] * */ fun parseBlock(breakLabel: String = "") { while (inp.lookahead().type != TokType.endOfBlock && !inp.isEndOfProgram()) { when (inp.lookahead().encToken) { Kwd.ifToken -> parseIf(breakLabel) Kwd.whileToken -> parseWhile() Kwd.repeatToken -> parseRepeat() Kwd.forToken -> parseFor() Kwd.breakToken -> parseBreak(breakLabel) else -> parseAssignment() } } }
/**
* parse a factor
* <factor> ::= ( <expression> ) | <number> | <identifier>
*/
fun parseFactor() {
if (inp.lookahead().encToken == Kwd.leftParen) {
// ( Expression )
inp.match()
parseExpression()
inp.match(Kwd.rightParen)
}
else
if (inp.lookahead().encToken == Kwd.identifier)
// Identifier
parseIdentifier()
else
// Number
code.setAccumulator(inp.match(Kwd.number).value)
}
These changes would normally come to mind naturally, based on the new definition of Token
and match()
. However, this may turn into a delicate operation, so please compare the outcome of your changes with the files from GitHub (url at the bottom as usual) just in case you have missed something.
Let’s Define our Language
We will now start defining our language, TINSEL. We will take a top-down approach and will start from the top level, which is the Program itself. We will then continue down the hierarchy and define the building blocks one by one, until we reach the Statement level, which is what we can support already.
As Jack has shown us, the best tool for this is the BNF notation, which, as we’ve seen in previous chapters, we can translate directly to code in our parser.
The Program
The first programming language I learnt was FORTRAN, so I’d like to have a program header in the beginning, something like program calculate_differences
, and an end of program mark at the end, something like endprogram
.
Having clarified this, let’s think what other structural blocks our program will consist of.
Immediately after program...
I’d like to have an optional section where the global variables are declared. These will all be integer for now.
After this, I would like to see a section (also optional) where functions can be declared. They will also return an integer for now.
Finally, the only mandatory section of our program will be the main section, which will be clearly marked (e.g. with the token main
)
Based on this our program can be defined as:
<program> ::= <prog header> [ <var declarations> ] [ <fun declarations> ] <main block> <prog end>
And the function to process this will naturally be:
fun parseProgram() {
parseProgHeader()
if (inp.lookahead().encToken == Kwd.varDecl)
parseVarDecl()
if (inp.lookahead().encToken == Kwd.funDecl)
parseFunDecl()
parseMainBlock()
parseProgEnd()
}
The functions to process the program header and program end will be:
fun parseProgHeader() { inp.match(Kwd.startOfProgram) println("program ${inp.match(Kwd.identifier).value}\n") }
and
fun parseProgEnd() { inp.match(Kwd.endOfProgram) println("\nend program") }
You will notice that the Kotlin compiler is now giving us a warning on all those when
blocks. The reason is that we now use enumerations in the when
block and it is advisable that we include an else
case, even though it’s not 100% needed as long as the definitions of our tokens are correct. If however, there is an error in our tokens definitions, this “catch-all” else
case will help us identify it. E.g. in parseExpression()
:
else -> inp.expected("add or subtract operator")
As a result, the function expected
must now become public
.
Please note, we will include all these functions in a new file called ProgParser0TopLevel.kt
, so please remove the parseProgram()
function from ProgParser1ControlStruct.kt
.
Global Variables Declarations
Given that we have only numeric variables, we can declare them just by using the keyword var
followed by the name (no need to specify the type – everything is integer for now). In addition we could have an optional initialisation of the variable. To do this properly we should be able to calculate the result of a numeric expression, which would potentially include variables that had been declared and initialised before. However, our compiler is not able to perform calculations at this stage (see chapter 4 – Interpreters) so for now we will keep it simple and support only simple numeric values. Our variable declaration will look like this:
<variable declaration> ::= var <identifier> [ = <value> ]
We will keep processing variables declarations in a loop as long as we see the var
token:
/** parse variables declarations */ fun parseVarDecl() { while (inp.lookahead().encToken == Kwd.varDecl) { inp.match() val varName = inp.match(Kwd.identifier).value var initValue = "" if (inp.lookahead().encToken == Kwd.equalsOp) { inp.match() initValue = inp.match(Kwd.number).value } declareVar(varName, initValue) } }
You may have noticed the function declareVar
above. This is how this will work: we will maintain all the declared variables in a Map
with the variable name as the key. The data for each variable will be it’s type
(something’s telling me that we may want later to extend our language to support more than just int
) and an initialised
flag. All modern compilers point out references to uninitialised variables, so why not prepare the ground for this? If our variable declaration contains an initialisation part, we will set this flag to true
, otherwise it will be false
.
Regarding variables initialisation, please note that even though the initialisation part is optional in TINSEL, every global variable has to have an initial value – more details on this in chapter 10. So for now let’s say that the initialised
flag will be true
if a variable has been specifically initialised by the programmer in the TINSEL program, and false
if it hasn’t been initialised by the programmer but instead is initialised to the default value by the Compiler.
In declareVar
we first check in case the variable has already been declared, and if not, we process the new declaration together with the initialisation. The initialised
flag is set accordingly.
// the variables space val variablesSpace = mutableMapOf<String,VariableDecl>() /** our variable types */ enum class VarType { int } /** our variables declaration */ class VariableDecl(var type: VarType, var initialised: Boolean = false) /** declare a variable */ fun declareVar(name: String, initValue: String) { // check for duplicate var declaration if (variablesSpace[name] != null) abort ("variable $name already declared") variablesSpace[name] = VariableDecl(VarType.int, initValue!="") code.dummyInstr("declare variable $name initValue $initValue") }
You may also have noticed the use of dummy code here. The actual assembly code for the variables declarations depends (a) on the specific assembly language and (b) on the specific assembler. So, I will give you the detailed full code for this in chapter 10, when we start producing actual code for x86.
Functions Declarations
Similarly to the variables, our functions will only return numeric result, so no type needed. We will also keep it simple by having no parameters for now. A function declaration will look like this:
<function declaration> ::= fun <identifier> ( ) <block>
There are a couple or things to note above:
- To parse the function body I’m using
<block>
which we already have a parser function for. We will see shortly whether our existingparseBlock
will need any changes to support the parsing of the function body - A function must contain a
return
statement. We will see later on how our compiler will verify this
The parser for a function declaration will look like this:
/** parse functions declarations */ fun parseFunDecl() { while (inp.lookahead().encToken == Kwd.funDecl) { inp.match() val funName = inp.match(Kwd.identifier).value inp.match(Kwd.leftParen) inp.match(Kwd.rightParen) declareFun(funName) parseBlock() } }
You can see that same as with variables, we are processing the function declarations in a loop as long as we see a fun
keyword. Same as with declareVar
above, we need a declareFun
, which will produce the necessary assembly code for the function declaration. And same as with var
, this function uses dummy code for now until chapter 10.
// the functions space val functionsSpace = mutableMapOf<String,VarType>() /** process a function declaration */ fun declareFun(name: String) { if (functionsSpace[name] != null) abort ("function $name already declared") functionsSpace[name] = VarType.int code.dummyInstr("declare function $name") }
Return Statement
The return statement that must be included in every function, will require some new routines to process it. Its format will be:
<return> ::= return <expression>
The parser for this will be:
fun parseReturn() { inp.match() parseExpression() code.returnFromCall() }
And we need to activate it in the when
block in parseBlock
when the return
keyword is recognised:
Kwd.retTok -> parseReturn()
We will also need the return statement in our assembly code module:
/** return from function call */
fun returnFromCall() = outputCodeNl("RET")
The Main Program
I would like to create a resemblance to C, so I will use the token main
to mark the start of the main program bock:
<main block> ::= main <block>
That was an easy one. The parser will be like this:
fun parseMainBlock() { inp.match(Kwd.mainToken) code.outputLabel("main") parseBlock() code.dummyInstr("stop running") }
As you can see in the main block, I have added a label called main
for now to mark the entry point to the main program and also the final instruction, the one that tells the target machine to stop running the program. The exact instructions will also be clarified in chapter 10.
The Block
Looking at our current definition of <block>, which is
<block> ::= [ <statement> ] *
it looks like it’s not adequate for it’s new upgraded role. It will at least need a set of delimiters for start-block and end-block. I’d like to expand the definition to:
<block> ::= { [ <statement> ] * }
<statement> ::= <block> | if | while | repeat | for | break | return | <assignment>
which says that a block consists of 0 or more <statements> and, Yes, it’s surrounded by curly brackets. So, now we have a distinct terminator for the block and don’t need the ENDIF, ENDWHILE or ENDFOR anymore. The statement can be one of the statements we have already worked on or actually a block.
With the above definition in mind I would like to split our existing parseBlock
as follows:
/** parse a block */ fun parseBlock(breakLabel: String = "") { inp.match(Kwd.startBlock) while (inp.lookahead().type != TokType.endOfBlock && !inp.isEndOfProgram()) { parseStatement(breakLabel) } inp.match(Kwd.endBlock) } /** parse a statement */ fun parseStatement(breakLabel: String) { when (inp.lookahead().encToken) { Kwd.startBlock -> parseBlock(breakLabel) Kwd.ifToken -> parseIf(breakLabel) Kwd.whileToken -> parseWhile() Kwd.repeatToken -> parseRepeat() Kwd.forToken -> parseFor() Kwd.breakToken -> parseBreak(breakLabel) Kwd.retTok -> parseReturn() else -> parseAssignment() } }
By adding the block-start and block-end tokens (the curly brackets) inside parseBlock
, we only do it once and don’t have to do it in every parser function that calls parseBlock
(and by now, there’s quite a lot of them).
If – While – Repeat
In order to maintain the resemblance to C, I would like to have the condition in the if
, while
and repeat
blocks surrounded by parentheses. This is a small change in the three parser functions to add the corresponding match
calls before and after the call to parseBooleanExpression()
. I’m leaving the for
loop for later due to its complexity. I will fix this separately, together with the code that is needed to execute it properly (if you remember, for
currently generates dummy code).
... inp.match(Kwd.leftParen) parseBooleanExpression() inp.match(Kwd.rightParen) ...
With these changes, it looks like we are ready to write a program like this:
Program TestTinsel1 var a = 2 var b = 5 var x var y var i var result fun alpha() { x = a * b - (y-1)*y i = 1 while (true) { i = i + 1 if (i > 10) { return x+i } } } main { result = alpha() a = 20 b = 80 result = alpha() } EndProgram
Finally change the main function as per below and try it out!
/** main function */
fun main(args: Array<String>) {
initCompiler(args)
parseProgram()
}
The output code will still be a mixture of M68000 assembly and dummy instructions but I will soon produce x86 code that anyone will be able to run and test on Linux (and Windows with some minor modifications).
Try other structures like for
and repeat
, if - else
, break
, and try removing or changing things and see how the compiler will respond to invalid syntax.
Before I leave you, I realised that a minor bug has crawled in our scanner – if you run the compiler with an empty file as input, or if there is no newline character at the end of the last line of the program, it will crash with “string index out of bounds exception”. To stop this, we will bring back the extra newline character that we had added at the end of our program in one of the early chapters:
// read the whole program into a string // add a newline at the end to deal with end of input easier inputProgram = f.readText() + '\n'
Have fun.
And as always, you can find the code in my GitHub repository.
Coming up next: More TINSEL