Chapter XVI: Local Variables and Recursion

Based on the original series “Let’s Build a Compiler!” by Jack Crenshaw.

Introduction

In the previous chapter I introduced functions with parameters, which made TINSEL a much more mature programming language. In this chapter I will go one more step further and introduce local variables.

Until now, the only variables TINSEL recognises are global variables that must be declared immediately after the program <name> line using the var keyword. In this chapter I will introduce local variables, i.e. variables that have limited scope, within the block of code that they have been declared.

Local Variables Declaration and Scope

In TINSEL I will let a local variable be declared anywhere in a block of code. By block of code I mean any part of code surrounded by the block_start and block_end delimiters { } . This could be the body of a function, an if-block, a while-block, or any other block of code surrounded by the block delimiters.

The local variables will be declared exactly in the same way as the global variables:

    var i: int
    var a: string(20)

Their scope will be within the block they have been declared in, i.e. they will be recognised from the line they have been declared in, to the closing block delimiter.

Unlike global variables, which are instantiated at compile-time and live in the .data section of the assembly output, the local variables will be instantiated at run-time and will live in the stack.

Our compiler has so far all the tools to implement the above enhancements, so implementing local variables should be a relatively easy job.

Declaring Local Variables

Starting top-down, we want to be able to declare a local variable anywhere in a block. This means that we need to recognise and process the var keyword as part of the processing of the block. It looks like a good place to do this is in parseStatement which is the heart of the parseBlock function. For this we need to add one line:

fun parseStatement(breakLabel: String, continueLabel: String, blockName: String) {
    when (inp.lookahead().encToken) {
        Kwd.varDecl -> parseLocalVars(blockName)
        Kwd.startBlock -> parseBlock(breakLabel, continueLabel)
        Kwd.ifToken -> parseIf(breakLabel, continueLabel)
        ...

This will recognise a local var declaration anywhere in the block. Let’s see what is does:

fun parseLocalVars(blockName: String) {
parseVarDecl(VarScope.local, blockName)
}

Very simple: it calls the existing parseVarDecl function (that until now used to process global vars) with two parameters: the flag to tell it to process a local variable and the name of the block (more about this further down). The only change required in parseValDecl is to pass these parameters on to parseOneValDecl:

fun parseVarDecl(scope: VarScope = VarScope.global, blockName: String = "") {
    while (inp.lookahead().encToken == Kwd.varDecl) {
        do {
            inp.match()
            parseOneVarDecl(scope, blockName)
        } while (inp.lookahead().encToken == Kwd.commaToken)
    }
}

and this is where we will start doing some real work:

fun parseOneVarDecl(scope: VarScope, blockName: String) {
    val varName = inp.match(Kwd.identifier).value
    inp.match(Kwd.colonToken)
    when (inp.lookahead().encToken) {
        Kwd.intType -> parseOneIntDecl(varName, scope)
        Kwd.stringType -> parseOneStringDecl(varName, scope)
        else -> inp.expected("variable type (int or string)")
    }
    if (scope == VarScope.local) {      // add any local vars to the local vars map for this block
        val localVarsList: MutableList<String> = localVarsMap[blockName] ?: mutableListOf()
        localVarsList.add(varName)
        localVarsMap[blockName] = localVarsList
    }
}

First, the scope is passed further on to parseOneIntDecl and parseOneStringDecl. In addition we have a new section added above that maintains a list of all the local variables declared in each block in the localVarsMap. The key to this map is the name of the block, which is set automatically by the compiler in the beginning of each block:

fun parseBlock(breakLabel: String = "", continueLabel: String = "") {
    inp.match(Kwd.startBlock)
    mustRestoreSP = true
    val blockName = "$BLOCK_NAME${blockId++}"       // blockName is used as key to the local vars map for this block
    while (inp.lookahead().type != TokType.endOfBlock && !inp.isEndOfProgram()) {
        parseStatement(breakLabel, continueLabel, blockName)
    }
    releaseLocalVars(blockName, mustRestoreSP)
    inp.match(Kwd.endBlock)
}

Going back to parseOneIntDecl and parseOneStringDecl, they just pass the scope on to declareVar:

fun parseOneIntDecl(varName: String, scope: VarScope) {
    ...
    declareVar(varName, DataType.int, initValue, INT_SIZE, scope)
}

declareVar simply checks the scope and calls one of the two functions, declareGlobalVar (which is our old declareVar) and declareLocalVar, which is where we need to do the work to declare the local variables:

fun declareVar(name: String, type: DataType, initValue: String, length: Int, scope: VarScope) {
    // check for duplicate var declaration
    if (identifiersMap[name] != null)
        abort ("line ${inp.currentLineNumber}: identifier $name already declared")
    when (scope) {
        VarScope.global -> declareGlobalVar(name, type, initValue, length)
        VarScope.local -> declareLocalVar(name, type, initValue, length)
    }
}

So, until now, apart from the introduction of the block name, the scope and the local vars map, we have only made minimum changes to the compiler code.

fun declareLocalVar(name: String, type: DataType, initValue: String, length: Int) {
    val stackOffset: Int
    val lengthRoundedTo64bits = (length / 8 + 1) * 8
    when (type) {
        DataType.int -> {
            stackOffset = code.allocateStackVar(INT_SIZE)
            initLocalIntVar(stackOffset, initValue)
        }
        DataType.string -> {
            stackOffset = code.allocateStackVar(STRPTR_SIZE)
            initLocalStringVar(name, stackOffset, initValue, lengthRoundedTo64bits) 
        }
        else -> return
    }
    identifiersMap[name] = IdentifierDecl(
        TokType.variable, type, initialised = true, size = lengthRoundedTo64bits, isStackVar = true, stackOffset = stackOffset
    )
}

As you can see above, declareLocalVar checks the type of the variable and calls first the function to allocate the space in the stack for an integer or a string pointer (which both happen to be 8 bytes by the way) and then initialises that stack variable. Finally, it adds the local variable to the identifiersMap so that all the details are at hand for when this variable is referenced. Please note that in the beginning of this function, the size of the local variable is rounded up to the nearest 8-byte size, in order to keep the stack in good shape.

Also, regarding local vars initialisation, in case of an int, then the local variable may or may not be initialised with the declaration. In case of string though, the local variable must be always initialised, either by specifying the length of the string or the initial value. The reason for this is the way TINSEL handles strings, which requires each string pointer to point to a certain memory area where the contents of the string lives. Here’s the code that takes care of the initialisation of the string local variables:

fun initLocalStringVar(name: String, stackOffset: Int, initValue: String, length: Int) {
if (initValue.isEmpty() && length == 0)
abort ("line ${inp.currentLineNumber}: local variable $name is not initialised")
var constStringAddress = ""
// check for the constant string init value
stringConstants.forEach { (k, v) -> if (v == initValue) constStringAddress = k }
if (constStringAddress == "") { // if not found
// save the string in the map of constant strings
constStringAddress = STRING_CONST_PREFIX + (++stringCnstIndx).toString()
stringConstants[constStringAddress] = initValue
}
val stringDataOffset = code.allocateStackVar(length)
code.initStackVarString(stackOffset, stringDataOffset, constStringAddress)
}

This function first checks whether the local string var is initialised and aborts otherwise (exactly same as the global string vars) and then checks for the scenario where an initial value for the string is given. In that case checks to see if the value already exists in the stringConstants map, and if not, it adds it. Finally, it allocates space for the contents of the string in the stack and calls the function that will produce the code to initialise it at run time:

fun initStackVarString(stackOffset: Int, stringDataOffset: Int, constStrAddress: String) {
outputCodeTabNl("lea\t$stringDataOffset(%rbp), %rax")
outputCodeTab("movq\t%rax, $stackOffset(%rbp)\t\t")
outputCommentNl("initialise local var string address")
if (constStrAddress.isNotEmpty()) {
outputCodeTabNl("lea\t$constStrAddress(%rip), %rsi")
outputCodeTabNl("movq\t$stackOffset(%rbp), %rdi")
outputCodeTab("call\tstrcpy_\t\t")
outputCommentNl("initialise local var string")
}
}

This function first sets the value of the string pointer to the block in the stack where the contents of the string lives and then checks whether there is initial value, which is then copied onto the string.

Cleaning up

This is the only bit left to be done at the end of the block:

fun parseBlock(breakLabel: String = "", continueLabel: String = "") {
    inp.match(Kwd.startBlock)
    mustRestoreSP = true
    val blockName = "$BLOCK_NAME${blockId++}"       // blockName is used as key to the local vars map for this block
    while (inp.lookahead().type != TokType.endOfBlock && !inp.isEndOfProgram()) {
        parseStatement(breakLabel, continueLabel, blockName)
    }
    releaseLocalVars(blockName, mustRestoreSP)
    inp.match(Kwd.endBlock)
}

fun releaseLocalVars(blockName: String, restoreSP: Boolean) {
    var localVarSize = 0
    localVarsMap[blockName]?.forEach {
        localVarSize +=
            when (identifiersMap[it]?.type) {
                DataType.int-> INT_SIZE
                DataType.string-> STRPTR_SIZE + identifiersMap[it]?.size!!
                else-> INT_SIZE
            }
        identifiersMap.remove(it)
    }
    if (localVarSize > 0 && restoreSP)
        code.releaseStackVar(localVarSize)
}

The function releaseLocalVars is called at the end of the block and does two things: (a) calculates the stack space allocated by all the local vars in this block and releases it and (b) removes these local vars from the identifiersMap so that they cannot be accessed outside the block.

You may have noticed that releaseLocalVars has an additional parameter (apart from the block name), restoreSP, which tells it to release the stack space used by the local vars in this block or not. This parameter is set to true in the beginning of the block so that at the end of it the stack space will be released. With one exception: when we have a return statement, then this parameter is set to false. The reason is that when we return from a function, the whole stack frame is restored to its previous state, so any space allocated to local vars is automatically released. No need to increase the stack pointer.

fun parseReturn() {
    inp.match()
    ...
    code.returnFromCall()
    mustRestoreSP = false
}

And that was it. TINSEL now supports local variables. And with this, we can write a professional-looking recursive version of the n-factorial program (strictly speaking it does not require local variables but it does require functions with parameters, which we covered in chapter xv). It works nicely until it hits integer overflow for input values above 25… And of course, now that we have recursion, we need to be careful not to overflow the stack.

And as always, you can find the code in my GitHub repository.

Coming up next – the next big step for TINSEL: TINSEL for Raspberry Pi

Chapter XV: Functions with Parameters

Based on the original series “Let’s Build a Compiler!” by Jack Crenshaw.

Introduction

Those of you who tried the “diamond” program from the last chapter (diamond.tnsl) must have noticed a certain weakness: when the print_line function is called from inside the for loop in the main program, it needs to know the value of i so that it can print the correct pattern. And the only way we could do that in the previous chapter, was by assigning a global variable with the value of the for loop control variable. This is not very neat, so let’s go and improve it by adding parameters to our functions.

The Ground Rules

In the x86-64 assembly standard the first 6 parameters to a function are passed via the registers %rdi, %rsi, %rdx, %rcx, %r8, and %r9. Any additional parameters or any parameters with size over 64 bits are passed via the stack. TINSEL will support up to 6 parameters (which are too many anyway) so only these 6 registers will be used for passing function parameters.

TINSEL supports int and string variables. In case of an int parameter, the value of the variable will be passed and in case of a string parameter the address of the string variable will be passed (i.e. int parameters will be passed by value and string parameters will be passed by reference).

As a side note, passing function parameters directly via the CPU registers may sound simpler than passing them via the stack (which is the traditional method), however, it introduces certain complexity, which we will address further down in this chapter.

Declaring Function with Parameters

This should be a relatively easy job. The function declaration will look like this:

<function declaration> ::= fun <identifier> : <type> ( [ <param declaration> ] * ) <block>
<param declaration> ::= <identrifier> : <type>

So, at first we need to add the code to parse any parameter declarations between the '(' and the ')'. The code below does exactly this:

fun parseFunDecl() {
    while (inp.lookahead().encToken == Kwd.funDecl) {
        inp.match()
        val functionName = inp.match(Kwd.identifier).value
        ...
        inp.match(Kwd.leftParen)
        parseFunParams(functionName)       // <----
        inp.match(Kwd.rightParen)
        inp.match(Kwd.colonToken)
        ...
        inp.match()
        declareFun(functionName, funType)
        storeParamsToStack(functionName)   // <----
        parseFunctionBlock()
    }
}

As you can see the first item that was added to our parseFunDecl function is to parse the parameters:

fun parseFunParams(functionName: String) {
    var paramCount = 0
    val paramTypesList = mutableListOf<FunctionParameter>()
    if (inp.lookahead().encToken == Kwd.identifier) {
        do {
            if (paramCount++ >= code.MAX_FUN_PARAMS)
                abort("line ${inp.currentLineNumber}: a function can have only up to ${code.MAX_FUN_PARAMS} parameters maximum")
            if (inp.lookahead().encToken == Kwd.commaToken)
                inp.match()
            paramTypesList.add(parseOneFunParam())
        } while (inp.lookahead().encToken == Kwd.commaToken)
    }
    funParamsMap[functionName] = paramTypesList
}


fun parseOneFunParam(): FunctionParameter {
    val paramName = inp.match(Kwd.identifier).value
    if (identifiersMap[paramName] != null)
        abort("line ${inp.currentLineNumber}: parameter name $paramName has already been declared")
    inp.match(Kwd.colonToken)
    var paramType = DataType.none
    when (inp.lookahead().encToken) {
        Kwd.intType -> paramType = DataType.int
        Kwd.stringType -> paramType = DataType.string
        else -> inp.expected("variable type (int or string)")
    }
    inp.match()
    return FunctionParameter(paramName, paramType)
}

parseFunParams is a loop that processes a comma-separated parameter list. For each parameter it calls parseOneFunParam.

parseOneFunParam first checks whether the name has already been declared (which could be either as a global variable or as another parameter in this function) and if not, it processes the declaration by checking the type of the parameter. Finally it returns an object that contains the name and type of that parameter.

parseFunParams builds a list of all the function parameters with their names and types and saves it in the funParamsMap. This map will be used in two places: (a) in the beginning of the code for the function block and (b) when a function is called and the parameters must be assigned values.

Let’s look at (a) first.

You may have noticed above that there is one more line added to parseFunDecl: it’s the call to storeParamsToStack, which does exactly what it’s name says: stores the values of the parameters from the registers (up to six) passed by the caller to the stack. Why? Here’s the reason:

Based on the x86-64 architecture, the 6 registers used to pass the function parameters, are part of the so-called caller-save group of registers, which means that their value is not necessarily retained across function calls. Even though it would have been nice to keep the 6 parameters in these 6 registers and have them handy to use them through the function code, it’s very likely that we will have the scenario where another function is called that modifies e.g. %rdi or %rsi or %rcx or %rdx or %r8 or %r9 (the first 4 are very commonly used – they are also used in the TINSEL runtime library), and this means that the value of one (or more) parameter(s) would be corrupted.

The solution to the problem is to store the values in the stack as stack variables. We have used the stack to store a variable already in chapter 13 for the for-loop control variable. When the parameters are needed later in the function code, their value will be retrieved from the stack. This, by the way, is exactly how the GNU C compiler has dealt with this problem.

fun storeParamsToStack(functionName: String) {
    val paramsList = funParamsMap[functionName] ?: listOf()
    for (i in paramsList.indices) {
        val paramVarOffs = code.allocateStackVar(INT_SIZE)
        identifiersMap[paramsList[i].name] = IdentifierDecl(
            TokType.variable, paramsList[i].type, initialised = true, size = INT_SIZE,
            isStackVar = true, stackOffset = paramVarOffs, canAssign = false
        )
        code.storeFunParamToStack(i, paramVarOffs)
    }
}

This function foes the following for each parameter: allocates space in the stack, creates an entry in the identifiersMap and finally saves the value of the parameter in the stack, which is done by the below assembly code:

val funInpParamsCpuRegisters = arrayOf("%rdi", "%rsi", "%rdx", "%rcx", "%r8", "%r9")
fun storeFunParamToStack(paramIndx: Int, stackOffset: Int) {
outputCodeTabNl("movq\t${funInpParamsCpuRegisters[paramIndx]}, $stackOffset(%rbp)")
}

There is one more thing required to be done just before the end of the function. We need to remove the parameters from the identifiersMap, otherwise a subsequent function declaration would not be able to use parameters with the same names. This is done at the end of parseFunctionBlock:

fun parseFunctionBlock() {
    hasReturn = false
    parseBlock()
    if (!hasReturn)
        abort("line ${inp.currentLineNumber}: function $funName has no ${inp.decodeToken(Kwd.retToken)}")
    // clean up declarations of parameters so that the names can be reused in other functions
    funParamsMap[funName]?.forEach { identifiersMap.remove(it.name) }
}

Please note, there is no need here to release the space that was allocated in the stack for the function parameters, because the code produced on the back of the return instruction takes care of this:

private fun funEnd() {
restoreStackFrame()
outputCodeTab("popq\t%rbx\t\t")
outputCommentNl("restore \"callee\"-save registers")
}
private fun restoreStackFrame() {
outputCodeTab("movq\t%rbp, %rsp\t\t")
outputCommentNl("restore stack frame")
outputCodeTabNl("popq\t%rbp")
}

Calling a Function with Parameters

So, most of the work is already done. Let’s go now and make the necessary changes to pass parameters when we call a function.

The value of each parameter will be the result of a boolean expression, which, as you remember from the previous chapters (and also from the BNF definitions TINSEL_BNF.md), can be a boolean result (0 or 1), a result of a numberical expression or even a result of string expression.

As per the x86-64 standard we have to set the CPU registers mentioned above with the value of each parameter. Here we have the same problem mentioned in the previous section, which is that these registers are not presevred across function calls. This means that we may have a scenario where we set %rdi to the value of the first parameter, we set %rsi to the vallue of the second parameter, and then we call a function to set %rdx to the value of the third parameter, which function changes the values of %rdi and %rsi corrupting the first two parameters.

Again, I will follow the excample of the GNU C compiler and do this:

First, I will set the registers %rbx, %r12-%r15 and %rax to the values of the parameters (that is, to the values of the parameters that have been declared – up to 6). The first 5 of these registers are guaranteed to maintain their value across function calls so the above issue goes away. %rax obviously does not maintain its value across function calls, but will be the last one set, so we will not lose the value of the sixth parameter.

Second, once all the values of the parameters have been set into these temporary registers, I will move them to the formal x86 parameter registers: %rdi, %rsi, %rdx, %rcx, %r8, and %r9.

And here’s how this is done:

fun parseFunctionCall(): DataType {
    val funcName = inp.match(Kwd.identifier).value
    inp.match(Kwd.leftParen)
    parseAssignFunParams(funcName)
    setInpFunParams(funcName)
    inp.match(Kwd.rightParen)
    code.callFunction(funcName)
    restoreParamRegisters(funcName)
    return getType(funcName)
}

As you can see there are three new functions called by parseFunctionCall:

The first one, parseAssignFunParams, first produces the necessary code that will return the value for each parameter and checks that the value will be of the same type as the parameter, and then sets the temporary registers to the value of the parameters as mentioned above. As you can see in setIntTempFunParam, each egister is first saved in the stack – based on the x86-64 standard it is our responsibility to ensure that these registers will retain their values when we return / exit.

fun parseAssignFunParams(functionName: String) {
    val paramTypeList = funParamsMap[functionName] ?: listOf()
    for (i in paramTypeList.indices) {
        if (i > 0)
            inp.match(Kwd.commaToken)
        val paramExprType = parseBooleanExpression()
        if (paramExprType != paramTypeList[i].type)
            abort ("line ${inp.currentLineNumber}: parameter #${i+1} must be type ${paramTypeList[i].type}, found $paramExprType")
        when (paramExprType) {
            DataType.int -> code.setIntTempFunParam(i)      // the same code is used both for int and for string parameters
            DataType.string -> code.setIntTempFunParam(i)   // i.e. moves %rax to the appropriate register for this parameter
            else -> {}
        }
    }
}
fun setIntTempFunParam(paramIndx: Int) {
if (funTempParamsCpuRegisters[paramIndx] == "%rax")
return
outputCodeTabNl("pushq\t${funTempParamsCpuRegisters[paramIndx]}")
outputCodeTabNl("movq\t%rax, ${funTempParamsCpuRegisters[paramIndx]}")
}

The second one, setInpFunParams, simply transfers the value of each temporary parameter register to the corresponding formal parameter register as per x86-64 standard:

fun setInpFunParams(functionName: String) {
val paramTypeList = funParamsMap[functionName] ?: listOf()
for (i in paramTypeList.indices)
code.setFunParamReg(i)
}
// these registers hold the fun params at the time of the call
val funInpParamsCpuRegisters = arrayOf("%rdi", "%rsi", "%rdx", "%rcx", "%r8", "%r9")
// during the assignment of the parameters, their values are saved temporarily here,
// so that they are not corrupted by function calls executed during the assignment of the parameters
val funTempParamsCpuRegisters = arrayOf("%rbx", "%r12", "%r13", "%r14", "%r15", "%rax")
// 6 params maximum allowed
val MAX_FUN_PARAMS = funInpParamsCpuRegisters.size

fun setFunParamReg(paramIndx: Int) {
    outputCodeTabNl("movq\t${funTempParamsCpuRegisters[paramIndx]}, ${funInpParamsCpuRegisters[paramIndx]}")
}

Finally, restoreParamRegisters, restores the values of the temporary registers from the stack so that we are compliant with the x86-64 standard for register usage:

fun restoreParamRegisters(functionName: String) {
    val paramTypeList = funParamsMap[functionName] ?: listOf()
    for (i in paramTypeList.indices)
        code.restoreFunTempParamReg(paramTypeList.size - i - 1)
} 
fun restoreFunTempParamReg(paramIndx: Int) {
if (funTempParamsCpuRegisters[paramIndx] == "%rax")
return
outputCodeTabNl("popq\t${funTempParamsCpuRegisters[paramIndx]}")
}

Plase note %rax is neither saved nor restored, as it is not required to maintain its value across function calls.

A Point to note about Function Return Values

As we are in the subject of calling a function, I’d like to point out the following scenario:

Suppose we have a TINSEL function (let’s call it test_alpha() ) that returns an integer value, which we want to treat as boolean (0 for false, non-zero for true). We would want to use it like this:

if (test_alpha()) {
    // do something
}

Please note that this code simply works because of the way the expressions are calculated and the flags register is set. So, when at some point in function test_alpha there is a return statement

    return <some_expression>

the result of some_expression is calculated and is placed in %rax. At the same time the Z-flag is set either by the instructions executed as part of the expression calculation (addq, subq, etc) or by specifically executing

    testq   %rax, %rax

which sets the flags based on the contents of %rax, when %rax is set by a movq instruction from the contents of a memory location.

This means that a function not only returns a value in register %rax, but also returns the flags register, which is set based on the contents of %rax. And this is what makes the above code work.

Now TINSEL looks a bit more like a grown up language (although there is still a long way to go) and you can modify the diamond.tnsl example to use a parameter for the print_line() function instead of the global variable k.

And as always, you can find the code in my GitHub repository.

Coming up next: Local Variables and Recursion

Chapter XIV: Strings – Hello, World!

Based on the original series “Let’s Build a Compiler!” by Jack Crenshaw.

Introduction

In the last chapter I included an example TINSEL program that demonstrated the use of nested for loops. Those of you who tried it on your Linux box, I’m sure you are convinced it works. However, it brought up another limitation of the current version of TINSEL. The lack of string handling, which made the printout of that program look like an awkward series of numbers that we had to try hard to follow. Not to mention that we have no way to print a prompt to the user when the program is waiting for input for the variable n. To rectify these limitations I will introduce strings and the associated handling in this chapter. I consider this a major enhancement, given that until now TINSEL supported only integers, hence it will give us TINSEL v2.0. There will be significant changes across many parts of the compiler, so please bear with me.

Defining Different Types

Until now everything in TINSEL was of type integer and this made things very simple in terms of declaring variables, processing expressions, executing numerical operations, calling functions, retrieving their return value, etc. The introduction of Strings means that we need to support Types. So, let’s start with this part. A new file DataTypesAndDecl.kt is created where all the necessary structures to support the handling of various types will go:

The following enumeration is first declared; it defines the different types supported in TINSEL v2.0:

enum class DataType { int, string, void, none }

The class that defines the identifiers (variables/functions) and the map that actually holds the mapping of all these identifiers also go there:

/** the declaration space (variables and functions) */
class IdentifierDecl(var fv: TokType, var type: DataType, var initialised: Boolean = false, var size: Int = 0, var stackVar: Boolean = false, var stackOffset: Int = 0, var canAssign: Boolean = true)

// the identifiers space map
val identifiersMap = mutableMapOf<String,IdentifierDecl>()

The declareVar and declareFun functions that produce the assembly code for a variable or a function also go there, as well as two new functions, getType and getCanAssign that return the type of an identifier and whether it can be assigned a new value (if you remember from the previous chapter, the control variable in a for loop cannot be assigned a value in an assignment statement).

Variable Declarations

The next thing that has to change is how we define our variables. We now have to tell our compiler what type each variable is. The Integer variables must now be defined like this:

var i: int

and may optionally be followed by an initialisation clause:

var i: int = -1

As you may remember form previous chapters, TINSEL uses 64-bit integers, so each integer variable declaration will allocate 8 bytes in the assembly output for that variable.

Let’s get to Strings now. I have decided to deal with strings in the same way as C does, i.e. treat a string as a pointer in memory where the contents of the string is. Also, same as in C, strings in TINSEL will be null-terminated. With this in mind, when we declare a sting variable, we must allocate the necessary space in memory where its value will be stored. This is done in one of two ways:

– initialise the string with a specific value:

var s1: string = "1234567890"

This will allocate 11 bytes in memory containing the above 10 characters plus the null terminator, while s1 will be pointing to the first of those bytes.

– or, allocate a number of bytes in memory

var s2: string(20)

This will allocate 20 bytes in memory with s2 pointing to the first of those bytes.

Please note that these memory blocks allocated for the string variables are fixed size and cannot grow. This is important when we get to string expressions.

Function Declarations

Similarly to strings, our function declarations must now include a type:

fun square(): int
fun alower(): string

This type declaration will be checked later when we parse the return statement for each function, to ensure it returns the correct type (more on this later in this chapter).

Expressions

And now it’s time for the heavy lifting: we need to change the way we process expressions in order to support processing of strings.

First, let me summarise how string expressions will look like:

  • adding strings s1 + s2 will result in an operation same as the C strcat(s1,s2) i.e. will append the contents of the memory area where s2 is pointing to the end of the memory area where s1 is pointing (the end of the string being the position of the null character)
  • comparing strings s1 == s2 will result in an operation similar to the C strcmp(s1,s2) that will return 1 if the strings are equal and 0 if not.
  • comparing strings s1 != s2 will result in the same operation as above but with exactly the opposite result.
  • assigning a string variable s = "abcd" or s = s1 + s2 will result in an operation same as the C strcpy(s, “abcd”) or strcpy(s, s1) (after s2 has been concatenated at the end of s1).

And these are the only string operations that TINSEL 2.0 will support for now.

Please remember the point about declaring string variables: each string variable is allocated a fixed memory space at declaration time, so it’s important to ensure this space is adequate to hold the result of the strcat or the strcpy, otherwise other variables will be corrupted or the program will crash (exactly the same as in C).

Now it’s time to think how we deal with expressions that can potentially contain different types of variables or constants. We need to be able to validate whether an operation is valid of that specific type of variable or whether the two sides of an operation are valid for that operation. E.g. we can compare a string to a string and assign the result to an integer variable (remember TINSEL treats boolean vars as integers – same as C), we can obviously add or multiply an integer with an integer, we can add a string to a string as mentioned above, but we cannot multiply a string with a string or a string with an integer, we cannot do boolean xor between a string and a string, etc.

To achieve this level of control we need to change how the various parse functions deal with the expressions, the terms and the factors. Starting from the bottom, we have already the identifiersMap that contains the type of each identifier (int or string). In addition, when scan finds a quote (the start of a string), it will call the new function getString, that will return the string literal as the next token:

private function scan(): Token {
    ....    
    if (checkQuote())
        return Token(getString(), Kwd.string, TokType.none)
    ....
/** get a string literal */
private fun getString(): String {
var value = ""
if (nextChar != '"')
return value
getNextChar()
while (nextChar != '"') {
value += nextChar.toString()
getNextChar()
}
getNextChar()
return value
}

So now, when we get a string literal or a number, we can tell what the type is by looking at the type of the Token.

Working our way up, parseFactor now checks the type of the next token and calls parseNumber or parseStringiteral respectively.

fun parseFactor(): DataType {
    when (inp.lookahead().encToken) {
        Kwd.leftParen -> return parseParenExpression()
        Kwd.identifier -> return parseIdentifier()
        Kwd.number -> return parseNumber()
        Kwd.string -> return parseStringLiteral()
        else -> inp.expected("valid factor (expression, number or string)")
    }
    return DataType.void    // dummy instruction
}

In addition, it now returns the type of the factor as you can see above (int or string). This will be used when it comes to validating whether a specific type or type combination is allowed in a specific operation.

The parseNumber function is what we used to have

fun parseNumber(): DataType {
code.setAccumulator(inp.match(Kwd.number).value)
return DataType.int
}

The parseStringLiteral however, is a bit more complicated:

fun parseStringLiteral(): DataType {
    val stringValue = inp.match(Kwd.string).value
    // check if this string exists already in our map of constant strings and add it if not
    var stringAddress = ""
    stringConstants.forEach { (k, v) -> if (v == stringValue) stringAddress = k }
    if (stringAddress == "") {  // if not found
        // save the string in the map of constant strings
        stringAddress = STRING_CONST_PREFIX + (++stringCnstIndx).toString()
        stringConstants[stringAddress] = stringValue
    }
    code.getStringAddress(stringAddress)
    return DataType.string
}

Apart from producing the necessary assembly code, to load the Accumulator with the address of the string literal, it has to find some place to store that string literal. For this purpose it builds a table of all the string literals (stringConstants) which at the end of the compilation is added to the end of the assembly output in a separate .data section. As you can see above, if the same string literal is found multiple times, it is stored in the table only once and is reused.

Similarly, parseTerm has been modified to get the type of the first factor and pass it on as a parameter to multiply or divide. It also returns the type to the caller:

fun parseTerm(): DataType {
val typeF1 = parseSignedFactor()
while (inp.lookahead().type == TokType.mulOps) {
if (typeF1 == DataType.string)
code.saveString()
else
code.saveAccumulator()
when (inp.lookahead().encToken) {
Kwd.mulOp -> multiply(typeF1)
Kwd.divOp -> divide(typeF1)
else -> inp.expected("multiply or divide operator")
}
}
return typeF1
}

And in the same way, parseExpression has been modified to get the type of the first term and pass it on as a parameter to add or subtract and, of course, to return it to the caller:

fun parseExpression(): DataType {
val typeT1 = parseTerm()
while (inp.lookahead().type == TokType.addOps) {
if (typeT1 == DataType.string)
code.saveString()
else
code.saveAccumulator()
when (inp.lookahead().encToken) {
Kwd.addOp -> add(typeT1)
Kwd.subOp -> subtract(typeT1)
else -> inp.expected("add or subtract operator")
}
}
return typeT1
}

So, now, parseAssignment knows what it’s dealing with and can make a decision whether a string literal can be assigned to an integer variable or not:

fun parseAssignment() {
    val identName: String = inp.match(Kwd.identifier).value
    checkCanAssign(identName)
    val typeVar = getType(identName)
    inp.match(Kwd.equalsOp)
    val typeExp = parseBooleanExpression()
    checkOperandTypeCompatibility(typeVar, typeExp, ASSIGN)
    when (typeVar) {
        DataType.int -> parseNumAssignment(identName)
        DataType.string -> parseStringAssignment(identName)
        else -> {}
    }
}

You can see that it now calls parseNumAssignment or parseStringAssignment depending on the type of the variable being assigned. The former is simply what we used to have as assignment:

fun parseNumAssignment(varName: String) {
code.assignment(varName)
}

However, the latter, even though still one line at the parser level, is a bit more complex when it comes to the code we generate as it involves a call to our own strpcy_ (see above).

fun parseStringAssignment(identName: String) {
code.assignmentString(identName)
}
fun assignmentString(identifier: String) {
outputCodeTab("movq\t%rax, %rsi\t\t")
outputCommentNl("assign string - strcpy_(identifier, %rax)")
outputCodeTabNl("lea\t${identifier}(%rip), %rdi")
outputCodeTabNl("call\tstrcpy_")
}

The key change in our upgraded parseAssignemnt is the call to checkOperandTypeCompatibility with the two types and the operation as parameters. This is where the decision is made whether these two types are allowed in this operation. And this is how it works:

At the end of the new file (DataTypesAndDecl) you will see a map that defines the allowed combinations for each operation

val typesCompatibility = mapOf(
    // int with int allowed for all operations
    TypesAndOpsCombi(DataType.int, DataType.int, ALL_OPS) to true,
    TypesAndOpsCombi(DataType.int, DataType.none, ALL_OPS) to true,
    // string with string allowed only for assign, add, compare_eq compare_ne
    TypesAndOpsCombi(DataType.string, DataType.string, ASSIGN) to true,
    TypesAndOpsCombi(DataType.string, DataType.string, ADD) to true,
    TypesAndOpsCombi(DataType.string, DataType.string, COMPARE_EQ) to true,
    TypesAndOpsCombi(DataType.string, DataType.string, COMPARE_NE) to true,
    TypesAndOpsCombi(DataType.string, DataType.none, PRINT) to true,
    // all other combinations forbidden unless set here
)

In order to make this easier to manage, we have a “wildcard” operation, ALL_OPS. So, the first two lines say that all binary operations between integers are allowed and all unary operations on an integer are allowed. The following five lines say that we can assign a string to a string variable, we can add two strings, we can compare (equal or not-equal) two strings and we can print a string. No other combination is allowed. With this definition, the logic in checkOperandTypeCompatibility is very simple:

fun checkOperandTypeCompatibility(t1: DataType, t2: DataType, operation: String) {
var typesAreCompatible = typesCompatibility[TypesAndOpsCombi(t1, t2, operation)] ?: false
if (!typesAreCompatible)
typesAreCompatible = typesCompatibility[TypesAndOpsCombi(t1, t2, ALL_OPS)] ?: false
if (!typesAreCompatible) {
var message = "line ${inp.currentLineNumber}: $operation $t1 "
if (t2 != DataType.none)
message += "with $t2 "
message += "not supported"
abort(message)
}
}

This function is now called in every parse function that performs a binary or unary operation to decide whether the type(s) is/are compatible. If they are not, a compilation error is thrown.

Operations

Given that the execution of an operation is now different, depending on the type(s) involved, we have separate functions for operations on integers and on strings (which are called after checking the type compatibility):

fun add(typeT1: DataType) {
inp.match()
val typeT2 = parseTerm()
checkOperandTypeCompatibility(typeT1, typeT2, ADD)
when (typeT1) {
DataType.int -> addNumber()
DataType.string -> addString()
else -> {}
}
}

The same goes for subtract, multiply, and all the other operations.

These functions are now placed in two separate files OpsNumeric.kt and OpsStrings.kt.

Boolean Expressions

Similarly to the Numerical Expressions, our Boolean expressions module has to be changed to support strings. As mentioned early in this chapter, strings can only be used with the relational operators == and != and this is defined in the typesCompatibilityMap.

Our parseBooleanExpression, parseBooleanTerm, parseNotFactor, parseBooleanFactor and parseRelation have been changed to return the type of the entry they processed. Also, all the boolean operations functions are now called with a parameter that’s telling them what the type of the first operand was. This way they can check the type compatibility after retrieving the type of the second operand.

Notably, parseRelation, parseEquals and parseNotEquals have had their logic changed as this is where strings are getting into the picture:

fun parseRelation(): DataType {
val typeE1 = parseExpression()
if (inp.lookahead().type == TokType.relOps) {
if (typeE1 == DataType.string)
code.saveString()
else
code.saveAccumulator()
when (inp.lookahead().encToken) {
Kwd.isEqual -> parseEquals(typeE1)
Kwd.isNotEqual -> parseNotEquals(typeE1)
Kwd.isLess -> parseLess(typeE1)
Kwd.isLessOrEq -> parseLessEqual(typeE1)
Kwd.isGreater -> parseGreater(typeE1)
Kwd.isGreaterOrEq -> parseGreaterEqual(typeE1)
else -> inp.expected("relational operator")
}
return DataType.int
}
else
return typeE1
}
fun parseEquals(typeE1: DataType) {
    inp.match()
    val typeE2 = parseExpression()
    checkOperandTypeCompatibility(typeE1, typeE2, COMPARE_EQ)
    when (typeE1) {
        DataType.int -> code.compareEquals()
        DataType.string -> code.compareStringEquals()
        else -> {}
    }
}

These changes are consistent with the changes in the Numeric expressions parsing, where different routines have to be called to execute the operation depending on whether it’s integer or string operation.

Return from Function

Given that our functions have to return a specific type, we need to check whether the return statement is returning the correct type. This is done quite easily with the below code:

fun parseReturn() {
    inp.match()
    if (labelPrefix == MAIN_BLOCK)
        abort("line ${inp.currentLineNumber}: return is not allowed in [main]")
    hasReturn = true       // set the return flag for this function
    val funType = getType(funName)
    if (funType != DataType.void) {
        val expType = parseExpression()
        if (expType != funType)
            abort("line ${inp.currentLineNumber}: $funType function cannot return $expType")
    }
    code.returnFromCall()
}

I’ve also taken the opportunity to simplify the logic that checks whether a function has a return statement in its block. If you remember, the initial implementation (Chapter 12) was quite complex: each block parsing function was returning a boolean value to the caller, which indicated whether it found a return or not, and when that value reached the top of the recursion, which is the parseBlock called by the parseFunctionBlock, it was checked and generated an error if that function had no return statement.

Even though that logic worked, it was unnecessarily complicated, because we can only declare a function at the top level only – we cannot declare a function within a function. This means that a global variable hasReturn, which is reset in the beginning of parseFunctionBlock, set in parseReturn (see code above) and checked at the end of parseFunctionBlock, can do the job in a much simpler way.

fun parseFunctionBlock() {
hasReturn = false
parseBlock()
if (!hasReturn)
abort("line ${inp.currentLineNumber}: function $funName has no ${inp.decodeToken(Kwd.retToken)}")
}

Code Generator

There is one more change that was needed in our code generator this time, in order to support string operations (+, ==, !=).

Let’s go back to how we calculate this expression: A + ( B + C )

We set the accumulator to the value of A and then push it to the stack. Then, we set the accumulator to the value of B and, given that we have parenthesis, push it also to the stack. Then, we set the accumulator to the value of C, pop the top of stack and add it to the accumulator (this gives us B+C). Finally, we pop the top of stack again and add it to the accumulator (and this gives us A + (B+C) ).

The above works well with numbers. In case of strings we will have to do the same – right? But, what is the value of A, B or C if these are strings? It’s the address of each string that points to the memory area where the actual contents of the string is stored. So, we are pushing the address of the string to the stack, but what about its contents? Technically we could push it to the stack too and that would have been fully consistent with the handling of integers. However, this would introduce unnecessary complexity that has to do with the fact that the strings are variable-length plus they are made of bytes, which means that a string is most likely not an exact multiple of a 64 or a 32-bit word. Even though we could use padding to make a string multiple of a 64 or 32-bit word and use the null terminator to know that we have reached the end of the string to stop popping from the stack, this is totally unnecessary for the simple reason, because A + (B + C) is the same as A + B + C.

The answer is that we will not allow parenthesis in strings expressions (try it and see the compiler error generated). This means that we don’t really need a stack to store the value of all the strings involved in the expression. We just need a temporary buffer to store the contents of the last string processed in the expression and retrieve it to add it (strcat_) to the next one and store the new result to the buffer again. Please note, the value of the accumulator (string address) is still stored in the stack to keep things simple. You can see how this logic works in the saveString function which is called when a Term is processed:

fun saveString() {
outputCodeTab("movq\t%rax, %rsi\t\t")
outputCommentNl("save string - strcpy_(string_buffer, %rax)")
outputCodeTabNl("lea\t$STRING_BUFFER(%rip), %rdi")
outputCodeTabNl("call\tstrcpy_")
outputCodeTabNl("pushq\t%rax")
includeStringBuffer = true
}

The buffer is a dedicated memory space called string_buffer_ that has a default length of 1024 bytes. This means that the maximum length of any string in our program can be 1023 characters (+1 for the null terminator). Any string operation that resulted in a string longer than that would most likely crash the program or, best case, would overwrite other variables. If however you need a string of higher length, you can simply call the compiler with this option:

-maxstring nnnnn

This will allocate a string_buffer of nnnnn bytes to prevent overflow and crash of the program.

If you write a program that involves string operations you will see the string_buffer space allocated at the end of the compiler output:

...

.data
   .align 8
# buffer for string operations - max str length limit
   string_buffer_:    .space 1024

Runtime Library

Finally, the TINSEL runtime library has also been enhanced to support Version 2.0. The following functions have been added:

  • strcpy_
  • strcat_
  • strlen_
  • streq_

These functions are not directly visible to the TINSEL program. They are called from the assembly program that the compiler produces.

You will need to download the files string_utils.s and readwrite.s from the directory libtinel/src in the repository and assemble them using the GNU assembler so that they can be linked to your TINSEL output program (see below and also Chapter 10).

And with that, we have completed the implementation of the first version of string support in TINSEL.

This chapter was quite extensive and this reflects the extent of the changes in our compiler. I have not described every little change in every detail here, because that would double the size of this chapter. I have described the concept though and I hope you can follow the changes by looking at the code.

We have made a significant improvement to our TINSEL language and can now, finally, write the “hello world” program! Here’s a slightly more “complex” version of it:

program hello_world
var s: string(20), s1: string(20), s2: string(20)
main {
    s1 = "Hello"
    print "please enter your name (or Enter to continue): "
    read s2
    if (s2 == "") { s2 = "World" }
    s = s1 + ", " + s2
    println s, "!"
}
endprogram

You can also try a more complex example here: diamond.tnsl

Remember, to run your own TINSEL programs, you need this compiler and a Linux computer. You will need to download the source libraries from the repository, which you will have to assemble and link with your TINSEL program as described at the end of Chapter 10 – Running our Code: Let’s Build a Compiler – Chapter X, TINSEL for x86-64

And as always, you can find the code in my GitHub repository.

Coming up next: Functions with Parameters

Chapter XIII: Fixing the For Loop

Based on the original series “Let’s Build a Compiler!” by Jack Crenshaw.

Introduction

Our newly introduced programming language, TINSEL, has now taken shape and we can write a decent program. We have improved our error checking and it gives us meaningful error messages pointing to the line where the error was. There is still one thing missing though – the FOR loop. If you look at parseFor, you will see that it still produces dummy code, so, let’s go and fix it in this chapter and produce our TINSEL V1.2.

Fixing the For Loop

Jack had pointed out in his tutorial that “the for loop is a bear to translate” and I would agree. There are a lot of moving parts: (a) the control variable – where do we keep it? in the stack? in a register? (b) the start value (c) the end value – do we recalculate it at the end of each iteration or only once in the beginning? (d) the step. So, let’s set the scene and define first what we want to do and then decide how we do it.

This will be my specification of the for loop:

    <for> ::= for ( <identifier> = <expression1> [ down ] to <expression2> [ step <expression3> ] ) <block>

It will work like this:

  1. Set the control variable to expression1
  2. Check if it has exceeded the limit (expression2) and if yes exit the loop
  3. Execute the block
  4. Update the control variable
  5. Loop back to point 2.

In addition I will support an optional down keyword, to count down as opposed to up. If it’s there, then the control variable will be decreased by 1 at the end of each iteration (while, if it isn’t there, it will obviously be increased by 1).

Finally there is an optional step keyword followed by an expression. If this is present, then the value of expression3 will be added to the control variable at the end of each iteration (as opposed to the control variable increased or decreased by 1).

It is important to remember that it will be the responsibility of the programmer to set the step expression (but also the initial and the to expressions) to the right values so that the control variable will eventually reach the limit and the loop will be correctly terminated. Please note, if down is used, then the result of the step expression must be a negative number.

Furthermore, in my implementation, expression2 and expression3 (if it exists) are evaluated only once when the loop is entered and not in every iteration.

The Control Variable

The control variable must be a new variable not previously defined in that function or main program. It exists only within the for block and cannot be assigned a value in an assignment statement – it is assigned a new value automatically at the end of each iteration. However, it’s value can be used on the right-hand side of an assignment within the for block.

The next question is “where do we store it?”. Jack had decided to use the global variables space in his implementation of the for loop. That would not work with TINSEL, e.g. in the scenario where we have a for loop in the main block that calls a function that has a for loop and uses the same name for the control variable. This is valid scenario and TINSEL should support it.

I would be tempted to use a CPU register given that the x86 architecture has a few spare ones (e.g. %r8 – %r15). This would be a really easy option, but apart from the above, it presents an additional problem: how do we handle nested for loops? We would have to use a new register in each level and also keep track of which level of for we are in and use the appropriate register when we want to access the control variable. This would introduce unnecessary complexity plus we would soon run out of registers… Hence, the answer is: the stack. It’s the most appropriate place, as the nature of the control variable is exactly the same as of a local variable.

Using the Stack for Storing a Variable

In the x86 architecture there are standard techniques for using the stack for temporary variable storage (e.g. for local variables or, as in our case, for the for loop control variable). The first step is to create a new stack frame in the beginning of each function. This means that the old stack frame must be restored at the of the function. I do the same in the beginning and at the end of the main block as well.

To create a new stack frame we just need to save the Base Pointer in the stack and then give it the value of the Stack Pointer (i.e. have the new stack frame start where the stack pointer is pointing after the last push):

    pushq %rbp
    movq %rsp, %rbp

Consequently, to restore the old stack frame we need to restore the Stack Pointer from the Base Pointer and then restore the Base Pointer from the stack:

    movq %rbp, %rsp
    popq %rbp

I have implemented the two functions newStackFrame and restoreStackFrame in the code module that produce the code for these two operations. They are called in the beginning and at the end (where we have the return statement) of each function block and of the main block.

With our new stack frame in place, we can allocate space in the stack for a temporary variable by reducing the stack pointer by the appropriate number of bytes. E.g. to allocate space for a 64-bit integer (i.e. 8 bytes) we just do:

    subq $8, %rsp

In addition we need to keep track of the offset of each variable we allocate from the base of the stack (don’t forget, rbp had the value of rsp in the beginning of the block). This means that when we define a new stack frame, we start with an offset of 0, and this is decreased by the appropriate amount of bytes every time space for a variable is allocated in the stack. E.g. the offset of the first 64-bit integer we allocated is -8, and we can access it like this:

    movq -8(%rbp), %rax

The current offset from the Base Pointer is managed within the code module.

To release the space that has just been allocated at the top of the stack, we just add the variable’s size in bytes to the stack pointer, e.g.:

    addq $8, %rsp

For these two operations I have implemented allocateStackVar(size: Int) and releaseStackVar(size: Int) in the code module.

With all that clarified, here’s what’s needed in our compiler:

In the IdentifierDecl map that holds all the variables and functions declarations I have added two more parameters:

var stackVar: Boolean = false

which is true when the variable is stored in the stack and false if it’s a global variable

var stackOffset: Int = 0

which holds the offset from the base pointer for a stack variable.

And to actually save or retrieve a stack variable, I have implemented the functions assignmentLocalVar(offset: Int) and setAccumulatorToLocalVar(offset: Int) respectively in the code module.

With these tools in our toolbox we can now go ahead and code the parseFor function

The New Shape of parseFor

In my first attempt, I started rewriting the for parser implementing all the various aspects (the control variable in the stack, the initial expression, the to expression, the optional down keyword and the optional step keyword) and the outcome was a function that was over 80 lines long… I had to break that down in smaller pieces so that I could read it and debug it. So, here’s the new parseFor function:

fun parseFor(): Boolean {
inp.match()
parseForLine()
presetCtrlVar()
val label1 = newLabel()
val label2 = newLabel()
postLabel(label1) // actual start of the loop
stepAndCheck() // increase (or decrease) ctrl var and check
code.branchIfFalse(label2) // if limit reached, exit
val foundReturn = parseBlock(label2) // the FOR block
code.branch(label1) // loop back to the beginning of the loop
postLabel(label2) // exit point of the loop
cleanUpStack()
return foundReturn
}

I will go through this at high level and will let you check all the details by looking at the code in the repository. Due to the complexity and the number of lines, I have put the for parser and the related functions in a separate file and actually a separate class: ProgParser1aControlFor.kt, class ForParser.

And now let’s go through the parts of parseFor:

First comes parseForLine, which parses the actual for line from the for keyword to the closing parenthesis.

It first gets the control variable name and checks if it’s been declared already. If not, it allocates space for it in the stack and adds it to the variables’ map marking it as stack variable and saving its offset from the base pointer. It then parses the “from” expression and produces the code to set the initial value of the control variable.

Next, it checks for a down keyword and sets the relevant flag if it finds it.

It then producers the code for the to expression and saves its value also in the stack, but without assigning a variable name to it; it’s not needed. However, it has to save the offset from rbp and for this a class variable is used.

And the last bit of this first part is to process the step expression (if there is any), set a flag if there is one, and produce the code to calculate the value of the expression and save it also in the stack. This is not saved as a variable in the variables’ map either, so a class variable is also used to save the offset from rbp.

Next we have presetCtrlVar, which pre-decrements (or pre-increments if we have down) the control variable to get it ready for the first “increment-and-check” (or decrement-and-check respectively).

Then we produce the two labels that every loop needs: label1, which marks the beginning of the loop and label2, which marks the first instruction after the end of the loop.

Here starts the loop: we first call stepAndCheck. This function first increments the control variable by 1 (or decrements it by 1 if we have down). Please note that if we have step, then the step value is always added to the control variable, as it will be positive if there is no down and negative if there is down.

It then invokes the appropriate comparison, >= or <= depending on the presence of down or not.

And this is where we know whether we have to stop the loop by executing branchIfFalse, same as in the while and repeat implementations.

Having established whether we can continue with the next iteration, we execute the actual for block (parseBlock) and then go back to the beginning of the loop to check all over again.

The final part in our for parser, after we are done with the loop, is to clean up the stack and the variables’ space: cleanUpStack. It first removes the entry for the control variable from the variables’ map so that it can be potentially reused in a new for block.

It then releases the stack space allocated for the step value (if it was there) plus the space for the to value and the control variable.

And that was it – this is our for parser done. This class is almost 150 lines long but at least one can read it and understand what each function does.

We also need to change the line in parseStatement where for is called:

    Kwd.forToken -> return ForParser().parseFor()   // in separate module due to increased complexity

Now you can try this nested loop, compile it and run it on your Linux machine and see the output:

/*--
 test nested for loops
*/
program TestFor

var n = 0

main {
    read n
    for (i=n down to 1 step 2) {
        print i
        for (j=1 to i) {
            print i*j
        }
        print 999
    }
}

endprogram

Continue

Now that we are in the topic of “for”, it’s a good opportunity to add continue. We have break already, implemented based on the method that Jack has shown us, so it will be quite easy to add the necessary processing for continue.

If you remember, every loop parser generates two labels: label1 and label2. Label1 marks the start of the iteration, while label2 marks the next instruction following the loop (the one that is executed next, after the loop is terminated).

And it’s label2 that is passed as a parameter to parseBlock, and then on to parseStatement and then on to parseIf, and on and on down the recursion tree, so that if and when a break statement is encountered, it will produce an unconditional jump instruction to that label in order to break the loop.

We will use exactly the same technique to implement continue. We will add a second parameter to parseBlock, parseStatement and parseIf, which will be the label to be used in parseContinue:

fun parseBlock(breakLabel: String = "", continueLabel: String = ""): Boolean

The “continue” label will be passed down the recursion tree alongside the “break” label and if and when a continue statement is found, it will use it:

   Kwd.continueToken -> { parseContinue(continueLabel); return false }

And the actual parseContinue function is effectively a copy of parseBreak, as it does exactly the same thing (generates an unconditional jump to the label it has received as parameter). In fact we could have used the very same function, but in the interest of clean code I will keep them separate:

fun parseContinue(label: String) {
    inp.match()
    if (label == "")
        abort("line ${inp.currentLineNumber}: no loop to continue")
    code.jump(label)
}

Don’t forget, we need to add the continueToken in our LanguageTokens file.

And that was it – with this minor change in our code, we can now support continue, and this is thanks to Jack’s method, which resulted in code that is so easy to expand.

In this version I have also introduced two small enhancements in the code module:

First, a constant CODE_ID that contains an identification of the flavour of assembly code produced. It is currently set to “x86-64 Assembly Code – AT&T format” and is printed as a comment at the top of the compiler output file. This is to help the reader identify easily what kind of code has been produced in that file. When a new code module is developed (e.g. to support ARM architecture) this string will be updated.

Second, I have added a line counter for the assembly code produced: outputLines. This is being incremented based on the number of new lines in the output string in outputCode and is printed on the console when the compiler finishes (together with the number of source lines).

And always, you can find the code in my GitHub repository.

Coming up next: Strings

Chapter XII: Improving our Error Checking

Based on the original series “Let’s Build a Compiler!” by Jack Crenshaw.

Introduction

We now have a fully working compiler that can compile TINSEL v1.0 programs, including comments, which we can run on Linux. So, let’s work on some details to close some small gaps and also improve our error checking.

More Input after the EndProgram Token

Let’s start with an easy one. Our compiler stops compiling when it sees endprogram which is the token that terminates our program. It will ignore anything that follows “endprogram” and this is fine – everything works. However, having extra characters in the input stream after the end of the actual program, which are being ignored, is not a clean scenario. Let’s add a little check so that our compiler flags this as an error. After the “endprogram” token we would expect the end of input, so all we need is to add one line at the end of parseProgEnd:

fun parseProgEnd() {
    inp.match(Kwd.endOfProgram)
    code.progEnd()
    inp.match(Kwd.endOfInput)
}

With this simple check, our compiler will not allow anything after the “endprogram” token other than white spaces (and new lines).

As you may have noticed, I’ve also moved the two lines that were printing a newline and the comment “end of program” to the output, to our code module into a function called progEnd, similar to progInit – it’s cleaner this way.

Output to a File

Now that we can directly execute our programs on Linux, it does not feel right to have to copy and paste our compiler’s output to a file so that we can then pass it through the GNU assembler. Let’s make our compiler do this for us.

So far we have been using only one mandatory command line argument, which was the input file. I will add one more option, to tell our compiler to create a file for the code that it produces:

CompilerMain [-o output_file] input_file

For this to work, let’s declare a new variable in CompilerMain.kt that will hold the name of the output file:

var outFile = ""

We will need to pass this variable as a parameter to the x86_64Instructions class constructor:

class x86_64Instructions(outFile: String = "") {
    ....

In this class we will declare our output stream that will be initialised by default to the standard output:

private var outStream: PrintStream = out

What we need now is some initialisation code for this class, which will check the value of outFile, and if not empty, will set the output stream to point to that file:

/** initialisation code - class InputProgramScanner */
init {
if (outFile != "") {
try {
outStream = PrintStream(File(outFile))
} catch (e: Exception) {
err.println("could not create output file - $e")
err.println("output code will be sent to stdout")
}
}
}

As you can see, if the output file cannot be created/opened, the compiler will print an error message and will send the output to stdout. By the way, if the output file exists already, it will be overwritten.

And to make the output go to the output stream as opposed to stdout, a small change is required in outputCode. This is the only place where we need to change the call to print, as this is where the code is actually written to the output. All the other functions (code with a newline, labels, comments, etc.) end up calling outputCode.

fun outputCode(s: String) = outStream.print(s)

And that was it in terms of sending the code either to a file or to stdout, however, what’s still missing is how to set the value of the output file name (outFile). Here’s how we will do this:

We need a new function to parse the command line arguments. We will call it (what else?) processCmdLineArgs.

In my years of programming I’ve seen (and written) numerous different ways to process command line arguments. A good one that I recently came across, is to do it in two stages: (a) convert all the -x options and their parameters to a map with the option “x” as key and the parameters as data, and (b) process that map, i.e. process each option and set things accordingly. This method works very well when we have a large number of options (think of those complex Linux commands that have 20 or 30 different options). However it would be an overkill in our case as we only have one: “-o output_file” (and maybe we can add one more: “-?” for usage).

Another possible option would be to use the techniques we have learnt in the scanner implementation (by following Jack’s tutorial). After all, the command line arguments are a series of tokens that have to be parsed and actions must be taken based on their values. I have implemented this, as it was too tempting to resist. It appears to be very flexible and very thorough and can deal with anything you can throw at it (same as our compiler). However, the resulting parseCmdLineArgs function is too lengthy to use just to parse a -o output_file option followed by input_file. I want to retain this solution though for future use, so I have saved the supporting class and a use case in this repository.

So, for now, I have used a simple, short, straightforward function that parses the args list. It loops through the list using the helper function getNextArg, and first checks for a "-" to determine if the arg is an option. If yes, it checks for "-o" or "-?" and if "-o", it gets the name of the outfile from the next argument. If not "-", then it assigns the value of the arg to infile. All that is done with the appropriate error checking of course :

const val USAGE = "usage: CompilerMain [-o output_file] input_file"

/** process command line arguments */
fun getNextArg(args: Array<String>, index: Int, argName: String = ""): String {
    if (index >= args.size)
        exit("missing argument(s) $argName, $USAGE")
    return args[index]
}
fun processCmdLineArgs(args: Array<String>) {
    var argIndx = -1
    while (++argIndx < args.size) {
        val arg = getNextArg(args, argIndx)
        if (arg[0] == '-')
            when (arg) {
                "-?", "-h", "-H" -> exit(USAGE)
                "-o", "-O" -> { outFile = getNextArg(args, ++argIndx, "output_file"); continue }
                else -> exit("invalid option [$arg]\n$USAGE")
            }
        else
            inFile = arg
    }
    if (inFile == "")
        exit("missing argument input_file, $USAGE")
}

To activate this function a one-line change is needed in the main initialisation function:

fun initCompiler(args: Array<String>) {
    processCmdLineArgs(args)
    // initialise the code module
    code = x86_64Instructions(outFile)
    // initialise the scanner module
    inp = InputProgramScanner(inFile)
}

While we are on this topic, now that the compiler can (and most of the times will) send the resulting code to a file, the console would be totally silent. So, I have added one “Hello” message in the beginning and one “Finished” message at the end:

fun main(args: Array<String>) {
    println("TINSEL(c) compiler v1.1 Jan 2022, Copyright M.Pappas\n")
    initCompiler(args)
    parseProgram()
    println("Successful compilation, ${inp.currentLineNumber-1} source lines") // -1 is needed as an extra new line was added when the input was read
}

Finally, I have added an extra comment at the top of the assembly output showing the date and time it was compiled:

fun progInit(progName: String) {
    outputCommentNl("program $progName")
    outputCommentNl("compiled on ${Date()}")
    ....
}

Undeclared Identifiers

You may have noticed that our compiler will happily accept any variable name in an expression without checking whether it’s been declared first. This will cause a failure when our compiler’s output is sent to the assembler, so let’s fix this by adding an undeclared variable check. Actually, the same applies to functions.

In addition our compiler will allow the following:

var a
...
fun a() {
... 

Even though our compiler can cope with this, declaring a variable and a function with the same name, first is not good programming practice, and second it will cause an issue in our assembly code as both will translate to the same symbol:

a:
...
a:
... 

and this will cause an assembler error.

So, first, let’s combine the check “variable already defined” with the check “function already defined”. For this we need to combine the two maps:

// the variables space
val variablesSpace = mutableMapOf<String,VariableDecl>()

// the functions space
val functionsSpace = mutableMapOf<String,VarType>()

into one:

// the identifiers space map
val identifiersSpace = mutableMapOf<String,IdentifierDecl>()

/** the declaration space (variables and functions) */
class IdentifierDecl(var fv: TokType, var type: VarType, var initialised: Boolean = false)

With this, our declareVar and declareFun functions can check against one single map (identifiersSpace) whether an identifier has been declared, and if not, they can update this map setting the fv field accordingly (for a variable or function – the corresponding values have already been defined in the TokType enumeration).

/** declare a variable */
fun declareVar(name: String, initValue: String) {
// check for duplicate var declaration
if (identifiersSpace[name] != null)
abort ("line ${inp.inLineNumber()}: identifier $name already declared")
identifiersSpace[name] = IdentifierDecl(TokType.variable, VarType.int, initValue!="")
code.declareInt(name, initValue)
}

/** process a function declaration */
fun declareFun(name: String) {
if (identifiersSpace[name] != null)
abort ("line ${inp.inLineNumber()}: identifier $name already declared")
identifiersSpace[name] = IdentifierDecl(TokType.function, VarType.int)
code.declareAsmFun(name)
}

Now that we have joined the declarations space for our variables and functions and can easily tell whether an identifier is a function or a variable, let’s go back to scan. When scan sees an alphabetic token, which is not a keyword, it returns exactly that: a token of type identifier without any additional information. And it’s down to our parser to decide whether that identifier is a variable or a function depending on whether there is a "(" following. It looks like we can do this in a smarter way. Let’s change scan to return the type of the token as well (function, variable or none).

/** get the next token and advance the "cursor" */
private fun scan(): Token {
    newLine = skipWhite()
    if (checkEndofInput())
        return Token(END_OF_INPUT, Kwd.endOfInput, TokType.none)
    if (checkNumeric())
        return Token(getNumber(), Kwd.number, TokType.none)
    if (checkAlpha())
        return keywordOrFunctionOrVariable(getName())
    if (checkSpecialToken())
        return getSpecialToken()
    return getInvalidToken()
}

/** return a keyword or identifier token based on the keyword tokens list */
private fun keywordOrFunctionOrVariable(name: String): Token {
    val indx = isKeyword(name)
    return (if (indx >= 0)
                languageTokens[indx]  // keyword found
            else
                // function, variable or other identifier found (determined by Token type)
                Token(name, Kwd.identifier, identifiersSpace[name]?.fv ?: TokType.none))
}

And with this additional information in the token coming from the scanner, we can improve our parsers. Until now, our compiler cannot process a plain function call (not part of an assignment):

fun a() {
    ...
}
main {
    ...
    a()     <-- this will produce an error ("=" expected), which should not be the case
    ...
}

To fix this let’s first update the BNF definition:

<block> ::= { <statement> * }
<statement> ::= <block> | <if> | <while> | <repeat> | <for> | <break> |
                <return> | <read> | <print> | <assignment> | <function_call> | null [ ; ]
<if> ::= if ( <b-expression> ) <block> [ else <block> ]
<while> ::= while ( <b-expression> ) <block>
<repeat> ::= repeat <block> until ( <b-expression> )
<for> ::= ( <identifier> = <b-expression> to <b-expression> ) <block>
<break> ::= break
<return> ::= return <b-expression>
<assignment> ::= <identifier> = <b-expression>
<function_call> ::= <function_name> ( )
<read> :: = read <identifier> [ , <identifier> ] *
<print> ::= print <b-expression> [ , <b-expression> ] *

And based on this, the parseStatement becomes:

fun parseStatement(breakLabel: String) {
    when (inp.lookahead().encToken) {
        Kwd.startBlock -> parseBlock(breakLabel)
        Kwd.ifToken -> parseIf(breakLabel)
        Kwd.whileToken -> parseWhile()
        Kwd.repeatToken -> parseRepeat()
        Kwd.forToken -> parseFor()
        Kwd.breakToken -> parseBreak(breakLabel)
        Kwd.retTok -> parseReturn()
        Kwd.readToken -> parseRead()
        Kwd.printToken -> parsePrint()
        Kwd.identifier ->
            if (inp.lookahead().type == TokType.variable) parseAssignment()
            else if (inp.lookahead().type == TokType.function) parseFunctionCall()
            else abort("line ${inp.inLineNumber()}: identifier ${inp.lookahead().value} not declared")
        Kwd.semiColon -> inp.match()    // semicolons are simply ignored
        else -> inp.expected("valid keyword, semicolon or identifier")
    }
}

Lastly, the parseIdentfier function can now change to:

fun parseIdentifier() {
when (inp.lookahead().type) {
TokType.variable -> parseVariable()
TokType.function -> parseFunctionCall()
else -> abort("line ${inp.inLineNumber()}: identifier ${inp.lookahead().value} not declared")
}
}

/**
* parse a function call
* <function_call> ::= <function_name> ( )
*/
fun parseFunctionCall() {
val funcName = inp.match(Kwd.identifier).value
inp.match(Kwd.leftParen)
inp.match(Kwd.rightParen)
code.callFunction(funcName)
}

/** parse a reference to a variable */
fun parseVariable() {
val varName = inp.match(Kwd.identifier).value
code.setAccumulatorToVar(varName)
}

With these changes, our compiler now supports the “plain” function call and also checks every time an identifier is mentioned, whether it has been declared (as a variable or a function) and produces an error if not.

Uninitialised Variables

It would be good to have a check for uninitialised variables but we need to think how this will work. For the moment we have the flag initialised in our identifiers space map, which is set to true when a variable is initialised at declaration time. Also need to remember that for now we only use global variables.

So, regardless of initialisation at declaration time, we can set the initialised flag to true when the variable is assigned a value, which is when it appears on the left hand side of an assignment or in a read statement. And then, every time the variable appears on the right hand side of an assignment, we can check the initialised flag and trigger a compile error if not initialised. So far, so good.

However, we are talking about global variables here. They can be modified by any function in our program, even by functions in other files (when TINSEL supports this). This means it’s almost impossible for the compiler to determine at compile time whether a variable has been initialised or not when it appears on the right hand side of an assignment.

So, it looks like we will abandon the idea of checking for uninitialised variables as far as the global variables go, but we will come back to it when we introduce local variables. And that’s what the commercial compilers do when it comes to uninitialised variables.

Finally, don’t forget that global variables that are not initialised in our compiler when they are declared, will be initialised to 0 (zero) by default (see chapter 10).

Return

As mentioned in Chapter VIII, each function must have a return statement. So, to be thorough, our compiler must check this and produce an error if that return statement is not found. The assembler would not complain about the absence of a return statement, however this scenario would lead to unpredictable results, as the code would continue executing whatever bytes it found after the end of our function (most likely some other function or the main program itself from the beginning…). A decent compiler would also check that the return statement is actually reachable and would produce a fatal error otherwise. We will keep it simple though and will only check for the existence of the return statement.

The first question is, where do we do this? The most obvious answer must be in parseBlock, as this is where the body of the function is processed. However, parseBlock is called in quite a few other cases like from parseIf, parseWhile, parseRepeat, etc. and even from parseMain. So, let’s create a new version of parseBlock, that will be called only form parseFuncDecl to parse the function block. And let’s assume for a moment that parseBlock returns a Boolean value indicating whether the block included a return statement or not (we will see in a minute how we will do this). Our new parseFunctionBlock will now check the return value of parseBlock and if false, it knows there was no return statement:

fun parseFunctionBlock(funName: String) {
val hasReturn = parseBlock()
if (!hasReturn)
abort("line ${inp.currentLineNumber}: function $funName has no ${inp.decodeToken(Kwd.retTok)}")
}

The next question is, how will parseBlock check for the return statement? parseBlock is recursive: it calls parseStatement, which calls parseWhile, which calls parseBlock, which calls parseStatement, which calls parseIf, which calls parseBlock, etc. and at some point in this chain there might be a return statement.

Let’s take a step back and think of how we have implemented return. It is done by a parseReturn function that has been added to the parseStatement where block. It is at this level where we know that we have a return statement. And it looks like there is an “easy” way to check, very similar to how Jack implemented the breakLabel, which is passed down the recursion chain as a parameter. I will change parseStatement, to return a Boolean value indicating whether the statement it executed has found a return statement or not. This return value will be returned further on, by the previous function, up the recursion chain until it reaches the top level parseBlock (the one that was called from parseFunctionBlock). This may sound more complicated than it really is, but it boils down to this: each parser that calls parseBlock, simply reads its return value and returns it to the caller. In the end this value reaches parseFunctionBlock. And here’s the code for it:

fun parseBlock(breakLabel: String = ""): Boolean {
var foundReturn = false
inp.match(Kwd.startBlock)
while (inp.lookahead().type != TokType.endOfBlock && !inp.isEndOfProgram()) {
if (parseStatement(breakLabel))
foundReturn = true
}
inp.match(Kwd.endBlock)
return foundReturn
}
fun parseStatement(breakLabel: String): Boolean {
    when (inp.lookahead().encToken) {
        Kwd.startBlock -> return parseBlock(breakLabel)
        Kwd.ifToken -> return parseIf(breakLabel)
        Kwd.whileToken -> return parseWhile()
        Kwd.repeatToken -> return parseRepeat()
        Kwd.forToken -> return parseFor()
        Kwd.breakToken -> { parseBreak(breakLabel); return false }
        Kwd.retTok -> { parseReturn(); return true }
        Kwd.readToken -> { parseRead(); return false }
        Kwd.printToken -> { parsePrint(); return false }
        Kwd.identifier -> {
            if (inp.lookahead().type == TokType.variable) parseAssignment()
            else if (inp.lookahead().type == TokType.function) parseFunctionCall()
            else abort("line ${inp.currentLineNumber}: identifier ${inp.lookahead().value} not declared")
            return false
        }
        Kwd.semiColon -> { inp.match(); return false }    // semicolons are simply ignored
        else -> { inp.expected("valid keyword, semicolon or identifier"); return false }
    }
}

As you can see, parseStament now returns a Boolean value indicating whether the statement it executed has found a return statement or not. parseBlock checks this return value and sets foundReturn to true if at least one statement has found a return. At the end of the block, this Boolean variable has the answer to the question “was there a return statement in this block?”. At the same time the foundReturn value is also returned to the caller, as this block could have been called by a parseStatement or a parseIf, etc, and they need to return that value up the recursion chain. This value is checked at the end of the chain by parseFunctionBlock and produces a compiler error if false.

Looking at parseStatement, we can see that it returns true on when it executes parseReturn (as this is where the return statement is found). It returns false if it executes parseBreak, parseRead, parsePrint, parseFunctionCall and parseAssignment (no return statement there). For all the other statements (parseIf, parseWhile, parseRepeat, parseFor) it returns their return value. These parsers must record the return value of the parseBlock that they call and return that to the caller. Where parseBlock is called more than once (like in parseIf) then all the return values must be OR‘ed as the return statement can be in any of these blocks. Easy, isn’t it!!

On a side note, you may have noticed a new variable above: currentLineNumber. If you remember, when I introduced line numbers in Chapter IX, I had some doubts about the accuracy of the line number. And there is a good reason for this. The lineNumber variable is increased by 1 with every new line in function getNextChar, which is called by skipWhite, which is called by scan, which is called at the end of match to advance to the next token. So, lineNumber simply points to the line where the next token is, not the one we are processing right now.

To fix this, I have introduced the new variable currentLineNumber, which is set to the lineNumber at the beginning of match (i.e. in the beginning of the next cycle). This way currentLineNumber points always to the line of the current token and can be safely used in our error messages. Don’t forget, we need to set it both in the beginning of match and also in the beginning of lookahead, as these are the two points that our parsers call to check the current token.

fun match(keyWord: Kwd = Kwd.any): Token {
    currentLineNumber = lineNumber
    ....
fun lookahead(): Token {
currentLineNumber = lineNumber
return nextToken
}

Finally, I have added a -debug option to the main program that activates a debug function (debugCompiler) instead of the standard parseProgram function. In this debug mode our compiler will scan the input and will print each token, next token, line number and other details so that we can easily verify how (and if) our scanner works.

And as always, you can find the code in my GitHub repository.

Coming up next: Fixing the For Loop

Chapter XI: Semicolons and Comments

Based on the original series “Let’s Build a Compiler!” by Jack Crenshaw.

Introduction

In the previous chapter we produced our first real running program written in TINSEL, our own programming language. We could give it input from the keyboard and and see its output on the screen. Yes, I admit, it’s a bit basic, but we’ve come a long way from the one-character tokens and single-line programs from the beginning of the series that were producing a mixture of M68000 instructions and pseudo-code. This has been an amazing journey that we’ve managed to travel following on Jack’s footsteps. And in this chapter we will add a couple of features that will make TINSEL even better.

Semicolons

Same as Jack, I had to use semicolons since I learnt Algol! And I continued using them with Pascal, C and Java. But, guess what – I don’t have to use them anymore now that I have started programming in Kotlin!

So, does a modern language really need semicolons or not? The answer is “yes” and “no”. As Jack explained in detail, one school says that they are just a nuisance – they are adding nothing to the syntax, and a modern language can function perfectly well without them (and Kotlin does actually). Don’t forget that our compiler can compile the below statements correctly, without the need of any semicolons:

a = b * 2 
    + (x+y) / z
i = 1

It can tell where each statement finishes and where the next one starts, even if they extend across multiple lines, so, as the minimalists would say, why use semicolons? Why add an extra mandatory keystroke in every statement, which will be a source of another potential mistake?

The other side of the coin is that semicolons add clarity and make the code less cryptic and much easier to read. Think of the case where we have 2 or 3 short statements in one line:

    i=1; a=10; b=a+1;

Without semicolons this line would look a bit “funny”:

    i=1 a=10 b=a+1    

But, again, including multiple statements in one line is not good programming practice, and should we make semicolons mandatory for that rare case?

So, I will do exactly what Jack did: make them optional. The compiler will process one when it sees one, but will not ask for one if it’s not there.

Based on this, our BNF definition of statement changes as per below:

<statement> ::= <block> | <if> | <while> | <repeat> | <for> | <break> |
                <return> | <read> | <print> | <assignment> | null [ ; ]

which says that any statement can be optionally terminated by a semicolon and we can also have an empty statement. And the code to implement this is the following three lines in parseStatement:

    ...
    Kwd.identifier -> parseAssignment()
    Kwd.semiColon -> inp.match()    // semicolons are simply ignored
    else -> inp.expected("valid keyword, semicolon or identifier")
}

The changes here are: (a) we only trigger parseAssignment when we actually see an identifier (and not in the else clause as before), (b) when we see a semicolon we recognise it and ignore it and (c) we catch any invalid tokens in the else clause. And that was it!

Now, if you want to make semicolons mandatory, you will need to add the line

    inp.match(Kwd.semiColon)

at the end of the parser functions for break, return, read, print and assignment. In that case, you may also want to add them to the parsers for ProgHeader, VarDecl, and ProgEnd for consistency. I’ll let you decide whether you want to make semicolons mandatory in your own version of TINSEL or not.

Comments

Some people are against comments, some are in favour. Some say the code speaks for itself, some see the need of a comment in certain cases to explain what is happening and why, as no matter how “clean” the code is, every now and then there will be an obscure point that needs a comment to clarify. I have decided to include comments in TINSEL so that you can use them if you want to.

The processing of comments can be as easy or as complicated as we make it.

Here’s my spec: I want both block comments like /* comment here... */ and also inline comments like this // comment inline.

I will also add a little twist: I will define an extra comment token that will transfer those comments from the source code to the output code. For this I will use the tokens /*-- and //--

<block-comment> ::= /* <comment> */ | /*-- <comment> */
<inline-comment> ::= // <comment> <newline> | //-- <comment> <newline>

With these specs in mind, the next question is: where is the best place to process comments?

Jack has put his skipComment function as low in the scanner as possible, inside skipWhite. And this is an approach that served him well, especially in the case where he was working with single-character tokens. I tried this in my code but I had two issues with it: (a) it resulted in a very complicated function to detect the start of a comment (and the kind of the comment) – 27 lines – the reason for this is that by working inside skipWhite, I had to go back to string comparisons and could not use the power of scan that translates the character sequences to tokens, and (b) the comments that are transferred to the output code are printed in the wrong place (I will explain this in more detail shortly). So based on this, I abandoned the idea. If you want to see how I implemented it, I have saved it in the file InputProgramScanner_1.kt in the GitHub repository.

Another idea would be to process comments as high as possible, in the parser. That would mean a parseComment function that would have to be triggered when a start-of-comment token was returned by scan (or actually match). However, that parse function would have to be called in all the other parser functions where a comment could be seen and this is too many places. Also, we would have another issue in case of an in-line comment that would have to be transferred to the output – our parseComment would not know where to stop in that case, as the newline is not visible at the parser level (it is skipped in skipWhite). To resolve this we would have to treat newline as another token and have special logic in scan to decide when to skip it and when to return it as the next token. For these reasons I abandoned this idea too.

So, I have decided to go for a solution in the middle – process comments in match. This is the place where we match each token and then we advance to the next one (including skipping white spaces in scan), so naturally this would be the right place to process any comments if the next token is the start of one. Normally we would want to process the comments in the beginning of match so that any comments would be processed (and printed in the right place) and then nextToken would be advanced to the next actionable token to be matched. And this would be a good approach apart from one issue: match is not the only place where we check the next token. We also check it via lookahead, which returns the next token without advancing the cursor. So we have a discrepancy here where match would point to the next actionable token skipping any comments, while lookahead would point to the next token even if that was a comment. To fix this we would have to process the comments in lookahead as well, but that would mean that lookahead would also advance the cursor and this would cause another discrepancy, as lookahead is supposed to not move the cursor.

It looks like the processing of the comments has to be done at the end of match. And this makes sense based on the logic of my code, given that this is where the previous token has been matched, we have called scan, skipped white spaces and advanced to the next token. In my mind, this is the right place to check for any comments and process them, so that nextToken will point to the next actionable token. We still have one issue though. Imagine this scenario:

program prog1
var a
//-- function to calculate square
fun square() {
...

Here we are processing the var declaration, we have matched “var” and are calling match to get the variable called “a”. This call, after matching “a” would process (and print) the comment that follows and then advance to the fun token, ready for the next match. Given that the assembly code is transferred to the output after match returns, while the comment would be printed from inside match, the assembly code would look like this:

# program prog1
.data
.align 8
#
# function to calculate square 
	a:    .quad 0
.text
.align 8
.global _start
square: 
...

As you can see the comment is transferred to the output code in the wrong place – it is printed just before the declaration of var a. There is an easy way to fix it. If we have a printable comment, we don’t actually print it, but keep it in a variable instead. We then check the contents of that variable in the beginning of match and print its contents if any. This way each comment that has been processed is printed in the output code in the beginning of the next match cycle, which is where it should be.

So, here’s the new function getComment, which branches into two functions that process block and inline comments respectively.

/** get a comment */
private fun getComment() {
    while (nextToken.type == TokType.commentStart)
        when (nextToken.encToken) {
            Kwd.blockComment -> getCommentBlock()
            Kwd.blockCommentOut -> getCommentBlock(true)
            Kwd.inlineComment -> getCommentInline()
            Kwd.inlineCommentOut -> getCommentInline(true)
            else -> expected("start of comment")
        }
}

/** get a block comment */
private fun getCommentBlock(printToOut: Boolean = false) {
    var localCommentString = code.COMMENT
    val endComment: String = decodeToken(Kwd.commentEnd)
    while (!inputProgram.substring(cursor).startsWith(endComment) && nextChar != endOfInput) {
        localCommentString += nextChar
        if (nextChar == '\n')
            localCommentString += code.COMMENT
        getNextChar()
    }
    localCommentString += '\n'
    nextToken = scan()      // nextToken now points to endComment or endOfInput
    if (nextToken.encToken == Kwd.endOfInput)
        expected(endComment)
    nextToken = scan()      // nextToken now points to the next token after the comment
    if (printToOut)
        commentString += localCommentString
}

/** get an in-line comment */
private fun getCommentInline(printToOut: Boolean = false) {
    var localCommentString = code.COMMENT
    while (nextChar != '\n' && nextChar != endOfInput) {
        localCommentString += nextChar
        getNextChar()
    }
    localCommentString += '\n'
    nextToken = scan()
    if (printToOut)
        commentString += localCommentString
}

getCommentBlock reads the input until it finds the end-of-commend token, while getCommentInline reads the input until it gets a newline. Both transfer the comment to the commentString variable which is printed in the beginning of match in the next call. This is the modified match function:

fun match(keyWord: Kwd = Kwd.any): Token {
    printComment()  // any comments found in the previous call must be printed in the output code now
    if (keyWord != Kwd.any && nextToken.encToken != keyWord)    // check keyword to match
        expected(decodeToken(keyWord))
    val thisToken = nextToken
    nextToken = scan()  // advance to next token
    getComment()    // process any comments
    return thisToken
}

private fun printComment() {
    if (commentString != "") {
        code.outputCode(commentString)
        commentString = ""
    }
}

The only thing to add is that any comments before our first token (which is program if you remember) will not be processed here. To fix this we just need to add one more call to getComment at the end of the init section of our scanner. This will take care of any comments before the program... Also, please note, this version does not support nested comments.

I did not expect the processing of comments to be so complex, but it is there now and it’s working. As I spent quite some time on the scanner in this chapter, I have taken the opportunity and improved the scan function, which is the brains and engine of our scanner. I have broken it down to smaller functions and made it easier to read (full details in the GitHub repository):

private fun scan(): Token {
    skipWhite()
    if (checkEndofInput())
        return Token(END_OF_INPUT, Kwd.endOfInput, TokType.none)
    if (checkNumeric())
        return Token(getNumber(), Kwd.number, TokType.none)
    if (checkAlpha())
        return keywordOrIdentifier(getName())
    if (checkSpecialToken())
        return getSpecialToken()
    return getInvalidToken()
}

And in the spirit of tidying up, I have also removed the field info from the Token class. It has not been used and I don’t think it will be needed at all. With these enhancements in place, you can rewrite the TINSEL program from chapter X like this:

/*--
 My First TINSEL program
 demonstrates the basic capabilities of the language
 Accepts an integer n as input
 and prints the squares of all the integers from 1 to n
 */
program tinselExample

//-- variables declarations
var i = 0, n

/*--
 calculate the square of i
 only for positive i
 */
fun i_square() {
    if (i <= 0) { return -1 }
    return i * i;
}

// main entry point
main {
    read n;
    //-- main loop
    repeat {
        i = i + 1
        print i_square();
    } until (i >= n)
}

endprogram

As I said earlier in this chapter, it’s up to you how much you want to use comments in your code and whether you want to use semicolons and when.

And as always, you can find the code in my GitHub repository.

Coming up next: Improving our Error Checking

Chapter X: TINSEL for x86-64

Based on the original series “Let’s Build a Compiler!” by Jack Crenshaw.

Introduction

In the first 6 Chapters we have closely followed the method and the steps that Jack taught us in his original tutorial, which made it so easy to deliver a working compiler in very simple and easy to understand steps.

In Chapter VII we removed the last remaining limitations (single character tokens) and set up the necessary structure to be able to recognise any token and pass it to our parser. We also set up the tools to make our language fully configurable and to be able to define and recognise any token we want.

In Chapters VIII and IX we defined our own programming language: TINSEL, which looks like a mini C. We demonstrated that our compiler can process TINSEL programs with decent error checking.

We still have limitations on the output side, as the code we are producing is still a mixture of M68000 instructions and dummy commands. Obviously this output is not much use for any practical application.

And this is what we will fix in this Chapter: we will build a new code module, for x86-64, and will make any other necessary changes in our compiler to produce assembly code for x86-64, for Linux. That code anyone will be able pass through an assembler and linker and then run it and test it and see the result of.

Some Background

First, I’d like to set the plan and explain what exactly we will do in this Chapter. This will not be a lesson on Assembly language, however, there are certain things which have to be explained in detail so that you understand why our output code is generated in a specific way and how it works.

Please note these important points to ensure that you can actually run the code that our compiler produces:

  1. The output code that we will produce here is compatible with the GNU Assembler. As you may know, there are two standards in the x86 world, the Intel standard and the AT&T standard. The main difference between the two is the order in which Source – Destination appear in the instructions (plus a few more minor differences). Our compiler will produce AT&T-standard assembly code which is recognisable by the GNU assembler. If anyone wants to use Intel-standard and the NASM assembler, they will have to change the code module accordingly.
  2. The resulting code can be run only on Linux (has been tested on Ubuntu 20.04 and Fedora 35, 64-bit). The reason for this is that certain system calls have been used for reading, writing and exit program. These calls are specific to Linux and the 64-bit architecture and would not work on Windows. In order for the output code to work on Windows, these system calls would have to be changed to the equivalent Windows system calls.
  3. The output code is compatible with the x86-64-bit architecture only – it will not run on 32-bit architecture. The 64-bit registers are being used throughout the code but also the calling convention is different in x86-64. In x86-64 the first 6 parameters to a function are passed via CPU registers (as long as they are integers or pointers) while previously they were passed via the Stack. So, to make our output code work in 32-bit architecture, not only the registers have to be changed (rax -> eax, rbx -> ebx, etc), but also the calling sequence when a function is called must be changed. Unless you are very comfortable with x86 assembly, this can be a very delicate operation.
  4. Given that we want to be doing input and output in our TINSEL programs, we need a run-time library that will support these functions (we don’t want to keep rewriting these in every program). I have implemented these functions in x86-64 assembly and the source code is in my Github repository. These will have to be linked with your compiled TINSEL program(s) to produce the final executable.
  5. Any programs you write on TINSEL can either be run autonomously or can be linked with a C program and can be invoked from there. You will have to make certain changes for this and take certain considerations into account – more details later in this chapter.

Please don’t be confused by the above. Everything will be clarified by the end of this chapter. In the end, if you are not interested at all in Assembly, you can just skip to the end of this chapter, take the sources from the Github repository and run them on your Linux box. On the other hand, I’ve always wanted to know “what’s inside” and understand “how it works”, so any opportunity to write Assembly code is a temptation I can’t resist.

For those of you who are interested, a good quick reference is here, while you can find a detailed x86-64 (AT&T flavour) reference manual in the Oracle documentation pages.

The x86-64 Calling Convention

As mentioned above, in x86-64, the function parameters are passed via CPU registers. There are sixteen 64-bit registers: %rax, %rbx, %rcx, %rdx, %rdi, %rsi, %rbp, %rsp, and %r8-r15. The following six are used to pass the first six integer or pointer parameters (in the below order) to a function that is called:

first six int or ptr parameters: %rdi, %rsi, %rdx, %rcx, %r8, and %r9

Any additional parameters or any large structures are passed through the stack (first parameter at the top).

The function will return its value (if there is one) in the following CPU register

return value: %rax.

I have applied this calling convention (a) in the code module of our compiler when a function is called (at the moment this is used only when a library function is called – our Tinsel functions don’t support parameters yet) and (b) in the runtime libraries, which consist of functions that our code will be calling (see below). For this reason the programs that are produced by our TINSEL compiler are not compatible with the 32-bit architecture.

Runtime Libraries

Every programming language has a runtime system that includes libraries, which are linked with the program (statically or dynamically). These libraries provide basic functions that the program needs in order to run.

The first version of our runtime library has only two functions that are available to our TINSEL programs, read_i_ and write_i_, which are used on the back of the TINSEL commands read and print respectively. There is also a number of other functions but those are used internally within the library.

The read_i_ function takes no parameters and returns the integer value read from the keyboard in %rax. Internally it calls the read_s_ function, part of the same library, which reads a string from the standard input using the Linux system call #0 (= read). To read the string, it uses an internal buffer (defined in the same library) to store it temporarily, and then calls another library function, atoi_, which converts the string to integer.

The write_i_ function takes one parameter (in %rdi), which is the integer to be printed. It first converts it to string using itoa_ (another library function) and then calls write_s_ which prints the string to standard output using the Linux system call #1 (= write).

Even though at the moment our programs will be using only read_i_ and write_i, the other library functions will come useful in later chapters when we introduce strings.

This version of our runtime library consists of two assembly files (readwrite.s and string_utils.s) that you can find under the libtinsel/src directory in the GitHub repository. I will explain at the end of this chapter how to pass them through the Assembler and link them into your TINSEL programs.

Please note, you can also try these library functions from a C program on Linux (Ubuntu 20.04 or Fedora 35, 64-bit) as follows:

/* test1.c */
extern long long int read_i_();
extern void write_i_(long long int);
long long int i;
void main() {
    do
        write_i_(i = read_i_());
    while(i);
}

Remember the library routines are written for 64-bit architecture, so all integers must be declared as long long int, otherwise negative integers will not work. You can compile this small test program and link it with our library using this command:

gcc -o test1 test1.c readwrite.s string_utils.s

and you can just run it as ./test1. It will read an integer from the keyboard and print it on the screen in a loop until you give it a 0 (or Enter). You may notice that each integer you enter is printed without a new line, and this is exactly what write_i is supposed to do: write an integer on the screen. However, in order to make the output of our programs easier to read, we will address this gap with a temporary workaround later in this chapter, when we implement the code for print, and with a more permanent solution when we introduce strings in one of the following chapters.

The x86-64 Code Module for our Compiler

This is where we will create a new code module that will produce x86-64 code. There are almost no changes required in our Lexical Scanner or the Parsers other than a couple of places where we need to call some functions to generate extra code.

Please copy the code file M68000Instructions.kt to x86-64Instructions.kt. Remember to also change the class name to class x86_64Instructions. Also change the code variable in the CompilerMain.kt:

// the Intel x86-64 instruction set
val code = x86_64Instructions()

Now, let’s first tidy up the routines that send code to the output. We need to write code (a) with or without a tab in the beginning, (b) with or without a newline at the end, (c) output a label and (d) output a comment. The routines to do this will be:

    private val COMMENT = "#"

    /** output code */
    private fun outputCode(s: String) = print(s)

    /** output code with newline */
    fun outputCodeNl(s: String = "") = outputCode("$s\n")

    /** output code with tab */
    fun outputCodeTab(s: String) = outputCode("\t$s")

    /** output code with tab and newline */
    fun outputCodeTabNl(s: String) = outputCodeTab("$s\n")

    /** output comment */
    fun outputComment(s: String) = outputCode("$COMMENT $s\n")

    /** output a label */
    fun outputLabel(s: String) = outputCodeNl("$s:")

As you can see, only one of these functions uses print (and the rest use the first one), so when we decide to send the compiler output to a file instead, it will be a very easy change.

Next step, we need to include some new functions that will create certain sections, symbols and other declarations that the GNU assembler needs in order to produce a complete, runnable object module.

I have decided as first line to have the program name in a comment. We then need to define the .data section, align it to 8-bit boundary and declare a copyright string. The function that does all that is:

    /** initialisation code for assembler */
    fun progInit(progName: String) {
        outputComment("program $progName")
        outputCodeNl(".data")
        outputCodeNl(".align 8")
        // copyright message
        outputCodeTabNl("tinsel_msg_: .string \"TINSEL version 1.0 for x86-84 (Linux) Dec 2021 (c) M.Pappas\\n\"")
        // newline
        outputCodeTabNl("newline_: .string \"\\n\"")
    }

It has to be called at the very beginning, so I have added it to the parseProgramHeader code:

fun parseProgHeader() {
    inp.match(Kwd.startOfProgram)
    code.progInit(inp.match(Kwd.identifier).value)
} 

Next, we need the necessary code to declare variables. One thing to remember is that in Assembler there are no real variables, there are just labels. So effectively, our variable declaration results in defining a label. In order to treat this as a variable we need to reserve memory where this label is pointing – this is where the value of this variable will go. As our variables are only integer for now we need to allocate 8 bytes (or a quad word) for each variable. In order to reserve these 8 bytes, we have to give them a value. This means that in case of uninitialised variables (in the TINSEL program, that is) we have to choose a value and I have chosen 0. So, any global variable that is not initialised in our TINSEL program will default to 0.

    /** declare int variable (64bit) */
    fun declareInt (varName: String, initValue: String) {
        if (initValue == "")
            outputCodeTabNl("$varName:\t.quad 0")
        else
            outputCodeTabNl("$varName:\t.quad $initValue")
    }

We call this code generating function as the last line in declareVar (that declares each of our global variables):

    code.declareInt(name, initValue)

With the above, the .data section is completed and it’s time to move to the .text section. We first declare our program’s entry point (_start) as global so that the linker can find it.

    private val MAIN_ENTRYPOINT = "_start"

    /** initial code for functions */
    fun funInit() {
        outputCodeNl("\n.text")
        outputCodeNl(".global $MAIN_ENTRYPOINT")
    }

This is called in parseProgram after we are done with variables and just before we start processing the function declarations, which is where the .text section starts.

fun parseProgram() {
    parseProgHeader()
    if (inp.lookahead().encToken == Kwd.varDecl)
        parseVarDecl()
    code.funInit()
    if (inp.lookahead().encToken == Kwd.funDecl)
        parseFunDecl()
    parseMainBlock()
    parseProgEnd()
}

We also need to implement the declareAsmFun routine, which is nothing else but a comment with the function name and a label to declare the start of that function:

    fun declareAsmFun (name: String) {
        outputCodeNl()
        outputComment("function $name")
        outputLabel(name)
    }

This is called as the last line in declareFun where each function is declared:

    code.declareAsmFun(name)

Then we come to the point where our main program starts. We mark this with a comment to make it visible in the assembly code, and we declare the _start label. We also add a couple of lines of executable code which prints the TINSEL hello message on the screen:

    fun mainInit() {
        outputCodeNl()
        outputComment("main program")
        outputLabel(MAIN_ENTRYPOINT)
        outputComment("print hello message")
        outputCodeTabNl("pushq\t%rbx")
        outputCodeTabNl("lea\ttinsel_msg_(%rip), %rdi")
        outputCodeTabNl("call\twrite_s_\n")
    }

At the end of main, apart from a comment to mark it, we also need the necessary code to stop our program and return to the operating system. We achieve this by using the Linux system call #60 (exit).

    fun mainEnd() {
        outputCodeNl()
        outputComment("end of main")
        outputCodeTabNl("popq\t%rbx")
        outputCodeTabNl("movq\t$60, %rax\t\t$COMMENT exit system call")
        outputCodeTabNl("xorq\t%rdi, %rdi\t\t$COMMENT exit code 0")
        outputCodeTabNl("syscall")
    }

These two functions are called in the beginning and at the end of parseMainBlock respectively:

fun parseMainBlock() {
    labelPrefix = "main"        // set label prefix and label index
    labelIndx = 0
    inp.match(Kwd.mainToken)
    code.mainInit()
    parseBlock()
    code.mainEnd()
} 

You may have noticed the addition of labelPrefix which is set to “main” above. It’s also set to the function name in parseFunDecl. This is to help have more meaningful names for the labels in our assembly program (as opposed to them running from L0: to Lxxx:)

/** create a unique label*/
fun newLabel(): String = "${labelPrefix}_L${labelIndx++}_"

And with that we come to the end of the program where I have added a comment line to mark the end:

fun parseProgEnd() {
    inp.match(Kwd.endOfProgram)
    code.outputCodeNl()
    code.outputComment("end program")
} 

Having done the above, it is then straight forward to translate the M68000 instructions to x86-64 instructions. A couple of points to note: I have used %rax (obviously) as the accumulator, and when an operation between the top of stack and %rax is to be done, I pop the value from the stack into %rbx:

    /** add top of stack to accumulator */
    fun addToAccumulator() {
        outputCodeTabNl("popq\t%rbx")
        outputCodeTabNl("addq\t%rbx, %rax")
    }

Also, all variable references are relative to the program counter, same as in Jack’s M68000 code:

    /** set variable to accumulator */
    fun assignment(identifier: String) = outputCodeTabNl("movq\t%rax, ${identifier}(%rip)") 

Finally, below follows a special mention of how the cmp instruction and the flags are used. Jack spent quite a few paragraphs explaining this and it’s time I do the same.

A Note on CMP

If you are like me and want to know “what’s inside”, this is the topic to pay attention to. Here’s how I have implemented the is equal (e.g. x == 1):

    fun compareEquals() {
        outputCodeTabNl("popq\t%rbx")
        outputCodeTabNl("cmp\t%rax, %rbx")
        outputCodeTabNl("sete\t%al")        // set AL to 1 if comparison is ==
        outputCodeTabNl("andq\t$1, %rax")   // zero the rest of rax and set flags - Z flag set = FALSE
    } 

The first two lines are obvious – we pop the top of stack into %rbx and compare with %rax. But what do the two lines that follow do?

To understand this, we need first to go back to chapter 6, where we decided that any integer can be treated as boolean and vice-versa. And we also decided that the value of 1 means true and the value of 0 means false. We also need to remember that the conditional commands in assembly (e.g. je = jump if equal) work based on the values of the flags (or flags register).

Let’s see now how cmp works:

    cmp    %rax, %rbx 

It subtracts the value of %rax from the value of %rbx and, depending on the result, it sets the flags respectively. The most important flags when we are dealing with signed integers are ZF and SF (the Zero and the Signed Flag respectively). The result of this subtraction is then thrown away and %rbx retains its original value.

So, if we compare x == 1 (and let’s assume that x is actually 1), then the result of the above subtraction will be 0 and the Zero Flag will be set. So if after this we wanted to do a jump if true command, we would do a jz (jump if Zero Flag is set). So in this case jump if true = jz.

Now let’s assume that we fetch a boolean variable in %rax, i.e. an interger that would have a value of either 0 or 1. And assume we set the flags according to the contents of %rax. Same as above, let’s assume that the value is true, i.e. the value of %rax is 1. And if we wanted to do a jump if true command we would have to do jnz (jump if Zero Flag not set). So in this case jump if true = jnz.

You can see the discrepancy above. In addition, the above cmp instruction just sets the flags and we have no boolean result (0 or 1) to store as a boolean variable.

So, in order to solve this problem I have followed exactly the method Jack used. After the comparison and the setting of the flags, set the value of %rax to 0 or 1 based on the convention 1=true and 0=false. And as a following step, set the flags again based on the new value of %rax. This is exactly what the extra two lines of code in compareEquals do:

sete %al sets %al (the lowest 8 bits of %rax) to 1 if the comparison result is true (if the two numbers are equal, that is) and to 0 if false.

andq $1, %rax does two jobs (a) performs a bitwise and of the contents of %rax and the constant $1, i.e. sets all bits of %rax to 0 apart from the first bit so that %rax can take the values of 1 or 0 only and (b) sets the flags (ZF will be set if %rax is 0).

With this we have satisfied both scenarios: (a) %rax takes the value of 1 if the condition is true and 0 if false, so it can be used a boolean variable and (b) the Zero Flag is set when %rax is 0 (=false) which means that jz always means jump if false.

One more detail to complete this topic. In the family of compare* functions we are getting the final setting of the Zero Flag for free as it is done by the and instruction. We need to remember though that we must ensure the Flags register is set every time we set %rax to a value. I have done this by using the test instruction in setAccumulator and setAccumulatorToVar (in all other cases when %rax is the result of a numeric or boolean operation the Flag register is also set automatically):

    fun setAccumulator(value: String) {
        outputCodeTabNl("movq\t$${value}, %rax")
        outputCodeTabNl("testq\t%rax, %rax")    // also set flags - Z flag set = FALSE
    }

    fun setAccumulatorToVar(identifier: String) {
        outputCodeTabNl("movq\t${identifier}(%rip), %rax")
        outputCodeTabNl("testq\t%rax, %rax")    // also set flags - Z flag set = FALSE
    }

This way we can write TINSEL code like

    while (true) {
        ... 

or

    while (nextIteration) {
        ...

or

    print x==1 

Before we leave this topic let’s add two more parse functions, parseLessEqual and parseGreaterEqual and their corresponding assembly code functions compareLessEqual and compareGreaterEqual. These are based on the their sister functions parseLess, parseGreater, compareLess and compareGreater.

parseLessEqual and parseGreaterEqual must be triggered in parseRelation when the appropriate relational operators are found.

Read and Print

The implementation of parseRead and parsePrint that support the read and print commands is much more straightforward than that of cmp in the previous section. We only support integers for now, so parseRead calls readInt in our code module, which in turn calls read_i_ from our runtime library, and then assigns the value returned to the variable we are reading.

Similarly, parsePrint calculates the given expression and then calls printInt, which in turn calls write_i_ from our library to print the result on the screen. In addition parsePrint calls println that prints a newline, otherwise our output would be unreadable. This is temporary workaround until we implement strings, at which point we will implement proper print and println in TINSEL.

/**
* parse read statement
* <read> :: = read <identifier> [ , <identifier> ] *
*/
fun parseRead() {
var varName: String
do {
inp.match()
varName = inp.match(Kwd.identifier).value
code.readInt(varName)
code.assignment(varName)
} while (inp.lookahead().encToken == Kwd.commaToken)
}

/**
* parse print statement
* <print> ::= print <b-expression> [ , <b-expression> ] *
*/
fun parsePrint() {
do {
inp.match()
parseBooleanExpression()
code.printInt()
code.println()
} while (inp.lookahead().encToken == Kwd.commaToken)
}

And with this we are done, we are now ready to write a fully functional TINSEL program and run it.

Running our Code

And that’s it! We can now write and compile a complete TINSEL program and can run it on our Linux workstation. This is how you do it:

  • First prepare the runtime library (you do this only once and keep the .o files that you will need in every TINSEL program:
    • as -o readwrite.o readwrite.s
    • as -o string_utils.o string_utils.s
  • Write a TINSEL program, e.g. prog1.tnsl
  • Compile prog1.tnsl and copy the Compiler output (the assembly code) to a file: prog1.s
  • Pass the Compiler output through the assembler:
    • as -o prog1.o prog1.s
  • Link the code to produce the final executable:
    • ld -o prog1 prog1.o readwrite.o string_utils.o
  • Run your program:
    • ./prog1

If you don’t fancy this multiple step process, you can always use gcc and do it all in one step:

  • gcc -o prog1 prog1.s readwrite.s string_utils.s

This will link your TINSEL program with the C runtime libraries, so for this to work you must change the entry point of the program to: val MAIN_ENTRYPOINT = "main".

And to enjoy the result of our work try this TINSEL program:

program myprog1

var i = 0, n

fun i_square() {
if (i <= 0) { return -1 }
return i * i
}

main {
read n
repeat {
i = i + 1
print i_square()
} until (i >= n)
}

endprogram

You give it an integer (n) as input and it will print the squares of all the integers from 1 to n. Try it on your Linux box. (please ignore the if(i<=0)... line in function i_square – it’s there just to test the correct generation of the labels)

I will stop here. This has been a long chapter, but I did not want to interrupt until we got to this point. We have reached a major milestone, where we can actually run our TINSEL programs on a real computer and see the real result on our screen.

Enjoy!

And as always, you can find the code in my GitHub repository.

Coming up next: Semicolons and Comments

Chapter IX: More TINSEL

Based on the original series “Let’s Build a Compiler!” by Jack Crenshaw.

Introduction

In the previous chapter we Introduced the basic structure of TINSEL, our mini programming language. Here, we will continue defining these structures to make it even friendlier. The aim is to write a complete working program for x86 in the next chapter.

Commas

Looking at the 6 lines that were needed to define 6 variables in the TINSEL example at the end of the previous chapter, it is obvious we would benefit a lot by processing a comma separated list of variables:

    <variable declaration> ::= var <identifier> [ = <value> ] [ , <identifier> [ = <value> ] ] *

It should be very simple to extend our current parseVarDecl to support this. Given that this function must process at least one variable declaration, it makes sense to use a do - while loop:

/**
 * parse variables declarations
 * <variable declarations> ::= var <identifier> [ = <value> ] [ , <identifier> [ = <value> ] ] *
 */
fun parseVarDecl() {
    while (inp.lookahead().encToken == Kwd.varDecl) {
        do {
            inp.match()
            parseOneVarDecl()
        } while (inp.lookahead().encToken == Kwd.commaToken)
    }
}

/** parse one variable declaration */
fun parseOneVarDecl() {
    val varName = inp.match(Kwd.identifier).value
    var initValue = ""
    if (inp.lookahead().encToken == Kwd.equalsOp) {
        inp.match()
        initValue = inp.match(Kwd.number).value
    }
    declareVar(varName, initValue)
}

Don’t forget that in order to activate a new token, its encoded value must be added to the Kwd enumeration and the actual symbol/keyword (the ',' in this case) must be added to the languageTokens list.

enum class Kwd {
    ...
    commaToken,
    ...
}

languageTokens.add(Token(",", Kwd.commaToken, TokInfo.keyword, TokType.none))

That was a quick one! Change the previous TINSEL example so that all the vars or combinations of some vars are in a comma separated list and make sure it works.

Line Numbers

I believe you will agree with me that our compiler has decent error checking. But as our programs can now grow in size, it might be difficult to spot in our TINSEL input program where exactly the error was. I think a line number in the error message might be really helpful.

First we will need a global variable in the scanner, which will be increased by 1 every time we see a newline. I believe the best place to do this is in skipNextChar as this is the place where our code goes through the whole input program one character at a time (including white spaces and newlines):

// input program line number
var lineNumber = 1
/** set the lookahead character to the next char from input */
private fun skipNextChar() {
    if (nextChar == '\n')
        ++lineNumber
    nextChar = if (cursor < inputProgram.length-1)
        inputProgram[++cursor]
    else
        endOfInput
}

With this done, we just need to print this line number when we print an error message:

/** report what was expected and abort */
fun expected(expMsg: String) {
    abort("line $lineNumber: expected [$expMsg] found [${nextToken.value}]")
}

That was easy! But is it working? Let’s feed our compiler deliberately wrong input e.g. by removing an opening { or adding an unrecognised symbol at the end of a line. Now observe the error message and the line number. Is it correct? Well, it’s not far off, but it’s actually pointing to the line where the next token is instead of the line where the error occurred.

And there is a reason for this: every time we extract a token from the input (notably in getName, getNumber and getSpecSeq) we skip all the following whitespaces and newlines until we reach the next token, ready for the next match. And this is why our line number is a bit off.

There is an easy solution to this. Instead of advancing the cursor after we read each token and have it pointing to the next token, we will skip the white spaces and newlines just before we read the next token. This way the line number will always be pointing to the line of the last token we’ve just read.

To achieve this, remove the call to skipWhite from getName, getNumber, getNextChar, getSpecSeq and init, and add it just once at the top of scan as this is the function we call every time we want to return a token:

/** get the next token and advance the "cursor" */
private fun scan(): Token {
    // first skip white space
    skipWhite()

    if (nextChar == endOfInput)  // check for end of input
    ....

Try it again. You will see that the line number is now pointing to the line where the error is. There might be a few outlier cases where the line number might be off, but I believe in most cases it will be accurate.

After this change, if you look carefully, you will see that getNextChar has now been reduced to just one line that calls skipNextChar. Based on this, we can remove this version of getNextChar, and rename skipNextChar to getNextChar (which is exactly how Jack had it originally designed!).

In addition, if we look now at getName and getNumber, they can be further simplified. Both of them are called from only one place, from scan, and only when scan knows that there is an alpha or a numeric character ahead:

if (isNumeric(nextChar))            // check for number
    return Token(getNumber(), ...
...
if (isAlpha(nextChar)) {            // check for name
    val name = getName()
    ...

With this in mind, the check in the beginning of getName and getNumber, to abort if they don’t see an alpha or a numeric character respectively, is now redundant and can be removed.

With this change, all of our error checking at the scanner level is now done in match (with the help of scan of course). The syntax of our whole language is checked in just one line and any error is caught there. This is really neat and is based on the design from Jack.

I’ll let you manage these minor code changes, but if you need to verify, please check the source in the GitHub repository.

Input / Output

No program is any use unless it can communicate with the user. I will implement this through read and print.

read will be reading integer values from the standard input into a comma separated list of variables and print will be printing onto the standard output the integer result(s) of a comma separated list of expressions:

/**
* parse read statement
* <read> :: = read var [ , var ] *
*/
fun parseRead() {
var varName: String
do {
inp.match()
varName = inp.match(Kwd.identifier).value
code.dummyInstr("Read value into $varName")
code.assignment(varName)
} while (inp.lookahead().encToken == Kwd.commaToken)
}

/**
* parse print statement
* <print> ::= print expression [ , expression ] *
*/
fun parsePrint() {
do {
inp.match()
parseBooleanExpression()
code.dummyInstr("Print value")
} while (inp.lookahead().encToken == Kwd.commaToken)
}

Here you see some more dummy instructions but these will go away in the next chapter.

Remember, we also have to define the two new keywords (read and print) in the LanguageTokens file and also to recognise them and point to the two new parsers in parseStatement.

/** parse a statement */
fun parseStatement(breakLabel: String) {
when (inp.lookahead().encToken) {
Kwd.ifToken -> parseIf(breakLabel)
Kwd.whileToken -> parseWhile()
Kwd.repeatToken -> parseRepeat()
Kwd.forToken -> parseFor()
Kwd.breakToken -> parseBreak(breakLabel)
Kwd.retTok -> parseReturn()
Kwd.readToken -> parseRead()
Kwd.printToken -> parsePrint()
else -> parseAssignment()
}
}

It’s amazing how with an extra 7-8 lines of code we can add functionality like reading and printing a series of variables and expressions respectively. This is the beauty in Jack’s method.

Case Sensitive Names

To close this chapter, I’d like to ensure all our names and keywords are case-sensitive, same as with all modern languages. This can be achieved with some minor changes:

Remove the uppercase() conversion in getName:

token += nextChar.uppercase()

Remove the ignoreCase parameter (currently set to true) when we check for a keyword in isKeyword. By default it’s false, so the comparison is case-sensitive:

if (languageTokens[i].value.equals(name, true))  // check for keyword match

It can also be written like this (in Kotlin):

if (languageTokens[i].value == name)  // check for keyword match

With this, program is a valid keyword, while Program isn’t and the variable Alpha is different to alpha.

In the last three chapters I put emphasis on reshaping the lexical scanner (to make it recognise all possible tokens and also make it configurable), and also on defining the structures of our “toy” language to ensure we can write programs one can read and understand and which will be able to do something useful and produce some output. As a result I have neglected the part where the output code is generated, hence you see a number of dummy instructions. This will go away in the next chapter where I will start adapting the code module for x86-64.

And as always, you can find the code in my GitHub repository.

Coming up next: TINSEL for x86-64

Chapter VIII: Introducing TINSEL

Based on the original series “Let’s Build a Compiler!” by Jack Crenshaw.

Introduction

In the previous chapter we changed our lexical scanner so that it can now recognise any token and any operator that we have defined in our programming language definition. And we verified this, by seeing the input stream broken down into distinct tokens that were encoded and categorised and ready to be passed to the parser.

In this chapter we will use this capability and define the full specification of our own programming language that we will call TINSEL (Technology Integration and Next Step in Evolutionary Learning). I believe this is an appropriate name for our mini-language, given that this chapter is published in the beginning of December.

At the end of this chapter we will be able to write a complete TINSEL program that will be as easy to read as C or Java or Kotlin, and which we will be able to compile with decent error checking included.

But first, let’s complete the change we started in the previous chapter, by making the necessary changes in the 3 parser modules to make them compatible with the new scanner.

Connecting Parser and Scanner

Now that our tokens consist of multiple characters and are being returned as an enumerated value, our parser routines have to be changed accordingly. Please take the three files that were part of Chapter 7 and create a new package under your project together with the three ProgParser* and the M68000Instructions files from Chapter 6.

Please make the following changes:

  • Where we check against the value of a token (e.g. 'addOp' or 'ifToken') the corresponding enumerated value must be used (e.g. Kwd.addOp or Kwd.ifToken)
  • Where we check the value of inp.nextChar, we must now use the value of inp.lookahead().encToken and must check against the corresponding enumerated value (see point above)
  • Where match is called with a char parameter, it must now be called with the corresponding enumerated value (inp.match(endIfToken) becomes inp.match(Kwd.endIfToken))
  • Where the type of token is checked by using e.g. isEndOfBlock or isAddop or isOrOp the type attribute of the token must be checked instead: e.g. inp.lookahead().type == TokType.endOfBlock or inp.lookahead().type == TokType.addOps
  • Where we check for an Alpha or a Numeric token with inp.isAlpha(inp.nextChar) or inp.isNumber(inp.nextChar), we must now use inp.lookahead().encToken == Kwd.identifier or inp.lookahead().encToken == Kwd.number
  • The same goes for inp.isLeftParen and all the other similar checks
  • Replace any call to inp.getName() with inp.match(Kwd.identifier).value and any call to inp.getNumber() with inp.match(Kwd.number).value; these will fetch an identifier or a number from the input with the appropriate error checking
  • Replace the line with the call to inp.getBoolean() in parseBooleanFactor() with if (inp.match(Kwd.booleanLit).value == BOOLEAN_TRUE)
  • Reinstate the line val code = M68000Instructions() in CompilerMain.kt

For example, here’s how parseBlock and parseFactor will look after these changes:

/**
 * parse a block
 * <block> ::= [ <statement> ] *
 */
fun parseBlock(breakLabel: String = "") {
    while (inp.lookahead().type != TokType.endOfBlock && !inp.isEndOfProgram()) {
        when (inp.lookahead().encToken) {
            Kwd.ifToken -> parseIf(breakLabel)
            Kwd.whileToken -> parseWhile()
            Kwd.repeatToken -> parseRepeat()
            Kwd.forToken -> parseFor()
            Kwd.breakToken -> parseBreak(breakLabel)
            else -> parseAssignment()
        }
    }
}
/**
* parse a factor
* <factor> ::= ( <expression> ) | <number> | <identifier>
*/
fun parseFactor() {
if (inp.lookahead().encToken == Kwd.leftParen) {
// ( Expression )
inp.match()
parseExpression()
inp.match(Kwd.rightParen)
}
else
if (inp.lookahead().encToken == Kwd.identifier)
// Identifier
parseIdentifier()
else
// Number
code.setAccumulator(inp.match(Kwd.number).value)
}

These changes would normally come to mind naturally, based on the new definition of Token and match(). However, this may turn into a delicate operation, so please compare the outcome of your changes with the files from GitHub (url at the bottom as usual) just in case you have missed something.

Let’s Define our Language

We will now start defining our language, TINSEL. We will take a top-down approach and will start from the top level, which is the Program itself. We will then continue down the hierarchy and define the building blocks one by one, until we reach the Statement level, which is what we can support already.

As Jack has shown us, the best tool for this is the BNF notation, which, as we’ve seen in previous chapters, we can translate directly to code in our parser.

The Program

The first programming language I learnt was FORTRAN, so I’d like to have a program header in the beginning, something like program calculate_differences, and an end of program mark at the end, something like endprogram.

Having clarified this, let’s think what other structural blocks our program will consist of.

Immediately after program... I’d like to have an optional section where the global variables are declared. These will all be integer for now.

After this, I would like to see a section (also optional) where functions can be declared. They will also return an integer for now.

Finally, the only mandatory section of our program will be the main section, which will be clearly marked (e.g. with the token main)

Based on this our program can be defined as:

    <program> ::= <prog header> [ <var declarations> ] [ <fun declarations> ] <main block> <prog end>

And the function to process this will naturally be:

fun parseProgram() {
parseProgHeader()
if (inp.lookahead().encToken == Kwd.varDecl)
parseVarDecl()
if (inp.lookahead().encToken == Kwd.funDecl)
parseFunDecl()
parseMainBlock()
parseProgEnd()
}

The functions to process the program header and program end will be:

fun parseProgHeader() {
    inp.match(Kwd.startOfProgram)
    println("program ${inp.match(Kwd.identifier).value}\n")
}

and

fun parseProgEnd() {
    inp.match(Kwd.endOfProgram)
    println("\nend program")
}

You will notice that the Kotlin compiler is now giving us a warning on all those when blocks. The reason is that we now use enumerations in the when block and it is advisable that we include an else case, even though it’s not 100% needed as long as the definitions of our tokens are correct. If however, there is an error in our tokens definitions, this “catch-all” else case will help us identify it. E.g. in parseExpression():

else -> inp.expected("add or subtract operator")

As a result, the function expected must now become public.

Please note, we will include all these functions in a new file called ProgParser0TopLevel.kt, so please remove the parseProgram() function from ProgParser1ControlStruct.kt.

Global Variables Declarations

Given that we have only numeric variables, we can declare them just by using the keyword var followed by the name (no need to specify the type – everything is integer for now). In addition we could have an optional initialisation of the variable. To do this properly we should be able to calculate the result of a numeric expression, which would potentially include variables that had been declared and initialised before. However, our compiler is not able to perform calculations at this stage (see chapter 4 – Interpreters) so for now we will keep it simple and support only simple numeric values. Our variable declaration will look like this:

    <variable declaration> ::= var <identifier> [ = <value> ]

We will keep processing variables declarations in a loop as long as we see the var token:

/** parse variables declarations */
fun parseVarDecl() {
    while (inp.lookahead().encToken == Kwd.varDecl) {
        inp.match()
        val varName = inp.match(Kwd.identifier).value
        var initValue = ""
        if (inp.lookahead().encToken == Kwd.equalsOp) {
            inp.match()
            initValue = inp.match(Kwd.number).value
        }
        declareVar(varName, initValue)
    }
}

You may have noticed the function declareVar above. This is how this will work: we will maintain all the declared variables in a Map with the variable name as the key. The data for each variable will be it’s type (something’s telling me that we may want later to extend our language to support more than just int) and an initialised flag. All modern compilers point out references to uninitialised variables, so why not prepare the ground for this? If our variable declaration contains an initialisation part, we will set this flag to true, otherwise it will be false.

Regarding variables initialisation, please note that even though the initialisation part is optional in TINSEL, every global variable has to have an initial value – more details on this in chapter 10. So for now let’s say that the initialised flag will be true if a variable has been specifically initialised by the programmer in the TINSEL program, and false if it hasn’t been initialised by the programmer but instead is initialised to the default value by the Compiler.

In declareVar we first check in case the variable has already been declared, and if not, we process the new declaration together with the initialisation. The initialised flag is set accordingly.

// the variables space
val variablesSpace = mutableMapOf<String,VariableDecl>()

/** our variable types */
enum class VarType { int }

/** our variables declaration */
class VariableDecl(var type: VarType, var initialised: Boolean = false)

/** declare a variable */
fun declareVar(name: String, initValue: String) {
    // check for duplicate var declaration
    if (variablesSpace[name] != null)
        abort ("variable $name already declared")
    variablesSpace[name] = VariableDecl(VarType.int, initValue!="")
    code.dummyInstr("declare variable $name initValue $initValue")
}

You may also have noticed the use of dummy code here. The actual assembly code for the variables declarations depends (a) on the specific assembly language and (b) on the specific assembler. So, I will give you the detailed full code for this in chapter 10, when we start producing actual code for x86.

Functions Declarations

Similarly to the variables, our functions will only return numeric result, so no type needed. We will also keep it simple by having no parameters for now. A function declaration will look like this:

    <function declaration> ::= fun <identifier> ( ) <block>

There are a couple or things to note above:

  • To parse the function body I’m using <block> which we already have a parser function for. We will see shortly whether our existing parseBlock will need any changes to support the parsing of the function body
  • A function must contain a return statement. We will see later on how our compiler will verify this

The parser for a function declaration will look like this:

/** parse functions declarations */
fun parseFunDecl() {
    while (inp.lookahead().encToken == Kwd.funDecl) {
        inp.match()
        val funName = inp.match(Kwd.identifier).value
        inp.match(Kwd.leftParen)
        inp.match(Kwd.rightParen)
        declareFun(funName)
        parseBlock()
    }
}

You can see that same as with variables, we are processing the function declarations in a loop as long as we see a fun keyword. Same as with declareVar above, we need a declareFun, which will produce the necessary assembly code for the function declaration. And same as with var, this function uses dummy code for now until chapter 10.

// the functions space
val functionsSpace = mutableMapOf<String,VarType>()

/** process a function declaration */
fun declareFun(name: String) {
    if (functionsSpace[name] != null)
        abort ("function $name already declared")
    functionsSpace[name] = VarType.int
    code.dummyInstr("declare function $name")
}

Return Statement

The return statement that must be included in every function, will require some new routines to process it. Its format will be:

    <return> ::= return <expression>

The parser for this will be:

fun parseReturn() {
    inp.match()
    parseExpression()
    code.returnFromCall()
}

And we need to activate it in the when block in parseBlock when the return keyword is recognised:

Kwd.retTok -> parseReturn()

We will also need the return statement in our assembly code module:

/** return from function call */
fun returnFromCall() = outputCodeNl("RET")

The Main Program

I would like to create a resemblance to C, so I will use the token main to mark the start of the main program bock:

    <main block> ::= main <block>

That was an easy one. The parser will be like this:

fun parseMainBlock() {
    inp.match(Kwd.mainToken)
    code.outputLabel("main")
    parseBlock()
    code.dummyInstr("stop running")
}

As you can see in the main block, I have added a label called main for now to mark the entry point to the main program and also the final instruction, the one that tells the target machine to stop running the program. The exact instructions will also be clarified in chapter 10.

The Block

Looking at our current definition of <block>, which is

    <block> ::= [ <statement> ] *

it looks like it’s not adequate for it’s new upgraded role. It will at least need a set of delimiters for start-block and end-block. I’d like to expand the definition to:

    <block> ::= { [ <statement> ] * }
    <statement> ::= <block> | if | while | repeat | for | break | 
                     return | <assignment>

which says that a block consists of 0 or more <statements> and, Yes, it’s surrounded by curly brackets. So, now we have a distinct terminator for the block and don’t need the ENDIF, ENDWHILE or ENDFOR anymore. The statement can be one of the statements we have already worked on or actually a block.

With the above definition in mind I would like to split our existing parseBlock as follows:

/** parse a block */
fun parseBlock(breakLabel: String = "") {
    inp.match(Kwd.startBlock)
    while (inp.lookahead().type != TokType.endOfBlock && !inp.isEndOfProgram()) {
        parseStatement(breakLabel)
    }
    inp.match(Kwd.endBlock)
}

/** parse a statement */
fun parseStatement(breakLabel: String) {
    when (inp.lookahead().encToken) {
        Kwd.startBlock -> parseBlock(breakLabel)
        Kwd.ifToken -> parseIf(breakLabel)
        Kwd.whileToken -> parseWhile()
        Kwd.repeatToken -> parseRepeat()
        Kwd.forToken -> parseFor()
        Kwd.breakToken -> parseBreak(breakLabel)
        Kwd.retTok -> parseReturn()
        else -> parseAssignment()
    }
}

By adding the block-start and block-end tokens (the curly brackets) inside parseBlock, we only do it once and don’t have to do it in every parser function that calls parseBlock (and by now, there’s quite a lot of them).

If – While – Repeat

In order to maintain the resemblance to C, I would like to have the condition in the if, while and repeat blocks surrounded by parentheses. This is a small change in the three parser functions to add the corresponding match calls before and after the call to parseBooleanExpression(). I’m leaving the for loop for later due to its complexity. I will fix this separately, together with the code that is needed to execute it properly (if you remember, for currently generates dummy code).

...
inp.match(Kwd.leftParen)
parseBooleanExpression()
inp.match(Kwd.rightParen)
...

With these changes, it looks like we are ready to write a program like this:

Program TestTinsel1

var a = 2
var b = 5
var x
var y
var i
var result

fun alpha() {
    x = a * b - (y-1)*y
    i = 1
    while (true) {
        i = i + 1
        if (i > 10) {
            return x+i
        }
    }
}

main {
    result = alpha()
    a = 20
    b = 80
    result = alpha()
}

EndProgram

Finally change the main function as per below and try it out!

/** main function */
fun main(args: Array<String>) {
initCompiler(args)
parseProgram()
}

The output code will still be a mixture of M68000 assembly and dummy instructions but I will soon produce x86 code that anyone will be able to run and test on Linux (and Windows with some minor modifications).

Try other structures like for and repeat, if - else, break, and try removing or changing things and see how the compiler will respond to invalid syntax.

Before I leave you, I realised that a minor bug has crawled in our scanner – if you run the compiler with an empty file as input, or if there is no newline character at the end of the last line of the program, it will crash with “string index out of bounds exception”. To stop this, we will bring back the extra newline character that we had added at the end of our program in one of the early chapters:

// read the whole program into a string
// add a newline at the end to deal with end of input easier
inputProgram = f.readText() + '\n'

Have fun.

And as always, you can find the code in my GitHub repository.

Coming up next: More TINSEL

Chapter VII: Lexical Scanning

Based on the original series “Let’s Build a Compiler!” by Jack Crenshaw.

Introduction

In the previous 6 chapters we have covered the basic structures of a programming language, like numerical expressions, Boolean expressions and control structures, and by now we have managed to build a basic compiler that can parse something that looks like a programming language. The biggest restriction we still have in place is the single-character tokens, which make our programming language look a bit funny. And this restriction we will get rid of in this chapter.

Until now I have followed closely Jack’s original tutorial (apart from some minor exceptions) and that has given us a great benefit in terms of delivering a working “product”, by using small blocks and continuously building on what we’ve done so far, following Jack’s step-by-step approach, while at the same time, keeping it simple.

In this chapter I will change approach and will introduce some of my own enhancements. The reasons for this are (a) having benefited and learnt from Jack’s tutorial, I feel I can start making my own experiments and see where this can lead us (b) I’m keen to use the capabilities that a language like Kotlin gives us and (c) I want to start focusing on my original goal, which is to produce code for the TI-59. Hence the different numbering style from this chapter on, using Latin numbers.

Nevertheless, I will continue using ideas and concepts from Jack’s following chapters, and will be mentioning these clearly in my text. I strongly believe that the original “Let’s Build a Compiler!” series is a masterpiece, and this is proven by the fact that it has been adapted and reproduced numerous times.

The Various Parts of our Compiler

You may have noticed that our compiler code is split in various files. This is not by accident – the split represents the 3 basic modules a compiler consists of: the Lexical Scanner, the Parser and the Code Generator.

The file InputProgramScanner.kt has all the functions that implement the lexical scanning, i.e. the breaking down of the input stream to distinct tokens, which are then passed to the next stage. It is here that we deal with all the detail about handling white spaces, newlines, and other input format specifics.

The three files ProgParser*.kt implement the Parser. I have split this in 3 sub-modules to make it more structured and easier to follow (one for numerical expressions, one for Boolean expressions and one for control structures). It is in the parser where the language structures are recognised and processed so that they can be translated to object code in the next stage.

Finally, the file M68000Instructions.kt implements the code generator. this is where the specific instructions for the target machine are produced.

I used this separation in these 3 modules from the beginning by using common sense and by following the modular approach that I always like to follow – write small pieces of code that interact with each other, each of which does a small part of the whole job, and if this little part had to be changed, then only that corresponding small module would need need to be changed. E.g. if we want to produce code for 80×86 instead of M68000, we just have to replace the code generator, while everything else remains as is. Similarly, if we want to start using '.AND.' instead of '&' for the Boolean “and” operator we just have to change this in our lexical scanner so that it can recognise the correct token (and we will be doing this in the next chapter).

The Approach

The main concept in Jack’s tutorial is the lookahead character (where “character” means “token” in the single-character token world). By doing a simple check on this lookahead character, our parser knows the next token and can make a decision as to what follows and therefore take the appropriate path. And I intend to keep this concept that Jack taught us.

In order to convert form single-character to multi-character tokens we obviously need to change the value of the token from char to string. But by doing this, we are losing the benefit that the single character gave us, which was by doing one single comparison (character comparison) we knew exactly what followed (which keyword, which operator, identifier, number, etc). If we try to do this with a string instead, we will end up with quite complex and convoluted IF statements which would be against the concept of simplicity. E.g. if you see the character 'I' then you would naturally check the following character to see if it’s an 'F' so you have the string 'IF'. But in this case how do you know it’s the keyword 'IF' and not the beginning of a variable name called 'IFILE'? Similarly, if you see a '+', how do you know it’s the operator '+' and not '++' or '+='?

So, it looks like we need to extend the definition of our token, and, apart from it’s value (string), include other attributes, like an encoded value to make the comparison easier, and also a category to help us group tokens together (e.g. addops). This thinking leads us to the use of enumerated values for these attributes and also points towards using an object for the token as opposed to a single value (string).

This is more or less what Jack has done by using a pair of variables Token and Value. In Token he holds the encoded token (e.g. number, identifier, if-token, else-token, operator) while in Value he holds the actual value of it (e.g. the identifier name, or the actual number)

Token Redefined

So, based on this thinking, let’s see how our token definition will look like:

class Token(val value: String = NO_TOKEN,
            val encToken: Kwd = Kwd.none,
            val info: TokInfo = TokInfo.none,
            val type: TokType = TokType.none)

val NO_TOKEN = "No Token"

The value is holding the string value of the token as it’s received (actually in our case it’s been converted to upper case). The encToken is holding the encoded value of the token that is used to drive the decisions in the parser. The enumeration for this can be something like:

enum class Kwd { addOp, subOp, mulOp, divOp, equalsOp, ifToken, elseToken, endifToken, whileToken, endwhileToken, repeatToken, untilToken, ... ..., endOfInput, any, invalid, none }

By using this enumeration, a single comparison (similar to the character comparison we have used so far) is good enough for the parser, so that it knows exactly what follows.

The info will be something like:

enum class TokInfo { identifier, keyword, operator, number, invalid, none }

While the type can be something like:

enum class TokType { addOps, mulOps, booleanLit, endOfBlock, endOfPRogram, endOfInput, invalid, none }

You can see that by using the TokType enumeration, functions like isAddop, isEndOfBlock, etc. can now be replaced by a simple check of the type attribute.

At this stage I’m not entirely convinced we need the info attribute – we will figure this out by the end of the next chapter.

Having defined how our tokens will look like from now on (until now they were single characters), let’s see what we need to do in order to identify them. I’m definitely in favour of Jack’s design, and intend to keep match as our main function of getting (and checking) tokens in the parser. However, match has to be changed now to compare the encToken instead of the single character that it used to compare:

/**
 * get the next token from the input stream and advance the cursor
 * match this token against a specific given token 'x'
 * also produce a match if called with no token or if token is "any"
 * return the token object that has been matched
 * called by the parser functions
 */
fun match(x: Kwd = Kwd.any): Token {
    if (nextToken.encToken != x && x != Kwd.any)
        expected(decodeToken(x))
    val thisToken = nextToken
    nextToken = scan()
    return thisToken
}

In this version of the scanner, nextToken replaces nextChar (as a concept , that is – nextChar is still used internally in the scanner):

// the next token is here so that we can look ahead
private lateinit var nextToken: Token

There are two more changes in this version of match (apart from comparing the encToken instead of char): The first one is that it now returns the token that it matched (or any token found if it has been asked not to match any). The reason for this is that by returning the token to the caller, match can now be used as the main interface between the parser and the scanner (which was also Jack’s original intention). The second difference is that it now calls scan (instead of getNextChar) to advance to the next token. This was necessary as our tokens are not single characters anymore, so we need to do a bit more work in order to get the next token. I will speak about scan shortly – this is where the intelligence in identifying the next token is going to be.

We have also seen that the parser needs to occasionally check the value of the next token without advancing the “cursor” and so far this has been done by checking the value of nextChar. I would not want to directly access nextToken from the parser, I would prefer this, from the structural point of view, to be a private variable in our scanner class. So we need one more function in our scanner to return the value of the next token, to fulfil the lookahead function without advancing the cursor:

/** lookahead function - returns next token without advancing the cursor */
fun lookahead(): Token = nextToken

So with this, the only two points of interaction between the parser and the scanner are match (to match a token and advance to the next one or just advance to the next one) and lookahead (to check the next token but do not advance).

Scan

And now we are getting to scan which is the major change in our scanner. I’m using the same name and same concept as Jack here but my implementation of scan is slightly different, as my intention is to make it as generic as possible, so that it covers all possible scenarios.

Let’s go down the list of scenarios that scan has to go through, starting from the simplest case and going to more and more complex ones.

The first case is when we encounter the end of the input stream (marked by char(0) in nextChar if you remember). In this scenario scan returns a special token that indicates end-of-input:

if (nextChar == endOfInput)  // check for end of input
    return Token(END_OF_INPUT, Kwd.endOfInput, TokInfo.none, TokType.none)

The next obvious case is when nextChar is a number, in which case scan reads and returns that numeric value:

if (isNumeric(nextChar))            // check for number
    return Token(getNumber(), Kwd.number, TokInfo.number, TokType.none)

Then we have the case where the next character is alphabetic. In this case scan will read that alphanumeric token, but then what? Our gut feel tells that we should check first to see if it is one of the reserved keywords, and if not, it will be an identifier:

val indx: Int
if (isAlpha(nextChar)) {            // check for name
    val name = getName()
    indx = isKeyword(name)
    return (if (indx >= 0) languageTokens[indx]  /* keyword found */
            else Token(name, Kwd.identifier, TokInfo.identifier, TokType.none))  /* identifier found */
}

Having gone through these cases, we are now across a special character, most likely an operator but how can we identify it? The most generic approach I could think of (but maybe not the simplest one) is to check the input stream (starting at the next character) against the predefined reserved sequences and see if any of them appears at the start of what remains of the input stream. We need to compare first against the longest sequences so that if our next character is e.g. '/' we check first for '/=' and also for '/*' (if this is our comment delimiter) and then we check against '/' which will match. Same goes for '==' vs '=', '<=' vs '<' etc. And this is the most complex part of scan:

indx = specSeqPresent(languageTokens)  // we have a special character - check for special sequence
if (indx >= 0)
    return(getSpecSeq(languageTokens, indx))

I have tried to simplify this part by moving the checking for a special sequence and the retrieval of that special sequence, if one is found, to different routines (listed further down).

Finally, if everything so far has failed, then we’ve come across an invalid token and this is what scan will return:

// at this point we have an invalid token
val thisChar = nextChar
getNextChar()
return Token(thisChar.toString(), Kwd.invalid, TokInfo.invalid, TokType.invalid)

Here’s the full listing of scan and the three supporting functions isKeyword, specSeqPresent and getSpecSeq:

/** get the next token and advance the "cursor" */
private fun scan(): Token {
    if (nextChar == endOfInput)  // check for end of input
        return Token(END_OF_INPUT, Kwd.endOfInput, TokInfo.none, TokType.none)

    if (isNumeric(nextChar))            // check for number
        return Token(getNumber(), Kwd.number, TokInfo.number, TokType.none)

    val indx: Int
    if (isAlpha(nextChar)) {            // check for name
        val name = getName()
        indx = isKeyword(name)
        return (if (indx >= 0) languageTokens[indx]  /* keyword found */
            else Token(name, Kwd.identifier, TokInfo.identifier, TokType.none))  /* identifier found */
    }

    indx = specSeqPresent()            // we have a special character - check for special sequence
    if (indx >= 0)
        return(getSpecSeq(indx))

    val thisChar = nextChar           // at this point we have an invalid token
    getNextChar()
    return Token(thisChar.toString(), Kwd.invalid, TokInfo.invalid, TokType.invalid)
}

/** check if a specific name is a keyword */
private fun isKeyword(name: String): Int {
    if (cursor >= inputProgram.length)         // check for end of input
        return -1
    for (i in languageTokens.indices) {
        if (languageTokens[i].value.equals(name, true))  // check for keyword match
            return i
    }
    return -1
}

/**
 * check the beginning of the remaining input for special sequence (e.g. operator)
 * returns the index in our keywords list if found or -1 if not
 */
private fun specSeqPresent(): Int {
    if (cursor >= inputProgram.length)         // check for end of input
        return -1
    for (i in languageTokens.indices) {
        val tokenValue = languageTokens[i].value
        if (inputProgram.substring(cursor).startsWith(tokenValue))  // check for keyword match
            return i
    }
    return -1
}

/** get a special sequence from input (keyword or operator  */
private fun getSpecSeq(indx: Int): Token {
    if (indx >= languageTokens.size)
        return Token(NO_TOKEN, Kwd.none, TokInfo.none, TokType.none)
    val t = languageTokens[indx]
    cursor = min(cursor+t.value.length, inputProgram.length)
    nextChar = inputProgram[cursor]
    skipWhite()
    return t
}

You may have noticed a new variable above, languageTokens. This is the list of all the keywords, operators and other special sequences used in our language. This is fully configurable and by changing the contents of this list we can change the look and feel of the language.

var languageTokens = mutableListOf<Token>()

The declaration and initialisation of this list is done in LanguageTokens.kt, which is effectively where we define all the keywords and operators of our language.

The Final Touches

On top of what we just mentioned, there are a couple of changes still required in our scanner:

Add this statement in the import section:

import kotlin.math.min

Modify slightly the init function of the scanner class:

/** initialisation code class InputProgramScanner */
init {
    try {
        val f = File(inputFile)
        // read the whole program into a string
        inputProgram = f.readText()
        // init the list of tokens for our language
        initKeywords()
        initOperators()
        // set the lookahead character to the first input char and skip any white spaces
        nextChar = inputProgram[0]
        skipWhite()
        // get the first token from input
        nextToken = scan()
    } catch (e: Exception) {
        abort("$e")
    }
}

Rename the indx variable to cursor (I think it’s more appropriate) and change the skipNextChar function as follows (cursor now pointing to the current character that’s coming up in the input stream as opposed to the one after):

/** set the lookahead character to the next char from input */
private fun skipNextChar() {
    if (cursor < inputProgram.length-1)
        nextChar = inputProgram[++cursor]
    else
        nextChar = endOfInput
}

Remove all the global definitions of all the keywords and operators and the relevant functions to recognise them. Also remove the declaration of the blockTerminators set.

Move the endOfInput variable from the global section to inside the InputProgramScanner class and make it private. Also make all other variables in this class private.

// end of input mark
private val endOfInput = nullChar.toChar()

Modify the function that checks for end of Program:

/** check for end of program - called by parseBlock */
fun isEndOfProgram(): Boolean = nextToken.encToken == Kwd.endOfProgram ||
nextToken.encToken == Kwd.endOfInput

Remove the function that checks for end of block – not needed anymore.

Add the decodeToken function that helps expected print the expected token value in case of error:

/** decode an encoded token to token name */
private fun decodeToken(token: Kwd): String {
for (i in languageTokens.indices)
if (languageTokens[i].encToken == token)
return languageTokens[i].value
return "*******"
}

Move the function expected from CompilerMain to InputProgramScanner as this the only place it is used. This function is now simplified:

/** report what was expected and abort */
private fun expected(expMsg: String) {
abort("Expected $expMsg \n found [${nextToken.value}]")
}

Remove the getBoolean function (it’s functionality is now covered by scan). Also isBoolean is not needed either.

To test these changes you can use only these three files CompilerMain.kt, InputProgramScanner.kt. If you have managed to follow these incremental changes I’ve described here, these two files should be in good shape – if not, you can take them from the GitHub repository. The third file is LanguageTokens.kt. Please take this from GitHub (URL below). Finally, you will need to change the main function as follows:

/** main function */
fun main(args: Array<String>) {
    initCompiler(args)
    while(true) {
        val t: Token = inp.lookahead()
        println("[${t.value}], [${t.encToken}], [${t.info}], [${t.type}]")
        if (t.encToken == Kwd.endOfInput)
            break
        inp.match()
    }
}

You can also comment out the following line in CompilerMain.kt as we will not be producing code during this test.

//val code = M68000Instructions()

You can now compile it and run it with any input consisting of any combination of numbers, words, operators and other special characters. It will recognise the tokens in this input sequence as per definitions in LanguageTokens and will print them one by one. Give it an input like this:

Program Test

if condition1
    alpha = 5
    beta = 6
else
    x = 1
    y = 2
endif

pi=3.14

EndProgram

It should produce this output:

[PROGRAM], [startOfProgram], [keyword], [none]
[TEST], [identifier], [identifier], [none]
[IF], [ifToken], [keyword], [none]
[CONDITION1], [identifier], [identifier], [none]
[ALPHA], [identifier], [identifier], [none]
[=], [equalsOp], [operator], [mulOps]
[5], [number], [number], [none]
[BETA], [identifier], [identifier], [none]
[=], [equalsOp], [operator], [mulOps]
[6], [number], [number], [none]
[ELSE], [elseToken], [keyword], [endOfBlock]
[X], [identifier], [identifier], [none]
[=], [equalsOp], [operator], [mulOps]
[1], [number], [number], [none]
[Y], [identifier], [identifier], [none]
[=], [equalsOp], [operator], [mulOps]
[2], [number], [number], [none]
[ENDIF], [endifToken], [keyword], [endOfBlock]
[PI], [identifier], [identifier], [none]
[=], [equalsOp], [operator], [mulOps]
[3], [number], [number], [none]
[.], [invalid], [invalid], [invalid]
[14], [number], [number], [none]
[ENDPROGRAM], [endOfProgram], [keyword], [endOfPRogram]
[End of Input], [endOfInput], [none], [none]

You can see how each token is recognised and categorised and how it is translated from string to an encoded token so that it can be easily processed by the parser. Play with it, feel free to change the token definitions in LanguageTokens.kt and see how the tokens you define will be recognised. Try any combinations you can think of to see if you can break it.

I will pause here to let you digest and experiment with this and will be back shortly.

And as always, you can find the code in my GitHub repository.

Coming up next: Introducing TINSEL