Chapter XII: Improving our Error Checking

Based on the original series “Let’s Build a Compiler!” by Jack Crenshaw.

Introduction

We now have a fully working compiler that can compile TINSEL v1.0 programs, including comments, which we can run on Linux. So, let’s work on some details to close some small gaps and also improve our error checking.

More Input after the EndProgram Token

Let’s start with an easy one. Our compiler stops compiling when it sees endprogram which is the token that terminates our program. It will ignore anything that follows “endprogram” and this is fine – everything works. However, having extra characters in the input stream after the end of the actual program, which are being ignored, is not a clean scenario. Let’s add a little check so that our compiler flags this as an error. After the “endprogram” token we would expect the end of input, so all we need is to add one line at the end of parseProgEnd:

fun parseProgEnd() {
    inp.match(Kwd.endOfProgram)
    code.progEnd()
    inp.match(Kwd.endOfInput)
}

With this simple check, our compiler will not allow anything after the “endprogram” token other than white spaces (and new lines).

As you may have noticed, I’ve also moved the two lines that were printing a newline and the comment “end of program” to the output, to our code module into a function called progEnd, similar to progInit – it’s cleaner this way.

Output to a File

Now that we can directly execute our programs on Linux, it does not feel right to have to copy and paste our compiler’s output to a file so that we can then pass it through the GNU assembler. Let’s make our compiler do this for us.

So far we have been using only one mandatory command line argument, which was the input file. I will add one more option, to tell our compiler to create a file for the code that it produces:

CompilerMain [-o output_file] input_file

For this to work, let’s declare a new variable in CompilerMain.kt that will hold the name of the output file:

var outFile = ""

We will need to pass this variable as a parameter to the x86_64Instructions class constructor:

class x86_64Instructions(outFile: String = "") {
    ....

In this class we will declare our output stream that will be initialised by default to the standard output:

private var outStream: PrintStream = out

What we need now is some initialisation code for this class, which will check the value of outFile, and if not empty, will set the output stream to point to that file:

/** initialisation code - class InputProgramScanner */
init {
if (outFile != "") {
try {
outStream = PrintStream(File(outFile))
} catch (e: Exception) {
err.println("could not create output file - $e")
err.println("output code will be sent to stdout")
}
}
}

As you can see, if the output file cannot be created/opened, the compiler will print an error message and will send the output to stdout. By the way, if the output file exists already, it will be overwritten.

And to make the output go to the output stream as opposed to stdout, a small change is required in outputCode. This is the only place where we need to change the call to print, as this is where the code is actually written to the output. All the other functions (code with a newline, labels, comments, etc.) end up calling outputCode.

fun outputCode(s: String) = outStream.print(s)

And that was it in terms of sending the code either to a file or to stdout, however, what’s still missing is how to set the value of the output file name (outFile). Here’s how we will do this:

We need a new function to parse the command line arguments. We will call it (what else?) processCmdLineArgs.

In my years of programming I’ve seen (and written) numerous different ways to process command line arguments. A good one that I recently came across, is to do it in two stages: (a) convert all the -x options and their parameters to a map with the option “x” as key and the parameters as data, and (b) process that map, i.e. process each option and set things accordingly. This method works very well when we have a large number of options (think of those complex Linux commands that have 20 or 30 different options). However it would be an overkill in our case as we only have one: “-o output_file” (and maybe we can add one more: “-?” for usage).

Another possible option would be to use the techniques we have learnt in the scanner implementation (by following Jack’s tutorial). After all, the command line arguments are a series of tokens that have to be parsed and actions must be taken based on their values. I have implemented this, as it was too tempting to resist. It appears to be very flexible and very thorough and can deal with anything you can throw at it (same as our compiler). However, the resulting parseCmdLineArgs function is too lengthy to use just to parse a -o output_file option followed by input_file. I want to retain this solution though for future use, so I have saved the supporting class and a use case in this repository.

So, for now, I have used a simple, short, straightforward function that parses the args list. It loops through the list using the helper function getNextArg, and first checks for a "-" to determine if the arg is an option. If yes, it checks for "-o" or "-?" and if "-o", it gets the name of the outfile from the next argument. If not "-", then it assigns the value of the arg to infile. All that is done with the appropriate error checking of course :

const val USAGE = "usage: CompilerMain [-o output_file] input_file"

/** process command line arguments */
fun getNextArg(args: Array<String>, index: Int, argName: String = ""): String {
    if (index >= args.size)
        exit("missing argument(s) $argName, $USAGE")
    return args[index]
}
fun processCmdLineArgs(args: Array<String>) {
    var argIndx = -1
    while (++argIndx < args.size) {
        val arg = getNextArg(args, argIndx)
        if (arg[0] == '-')
            when (arg) {
                "-?", "-h", "-H" -> exit(USAGE)
                "-o", "-O" -> { outFile = getNextArg(args, ++argIndx, "output_file"); continue }
                else -> exit("invalid option [$arg]\n$USAGE")
            }
        else
            inFile = arg
    }
    if (inFile == "")
        exit("missing argument input_file, $USAGE")
}

To activate this function a one-line change is needed in the main initialisation function:

fun initCompiler(args: Array<String>) {
    processCmdLineArgs(args)
    // initialise the code module
    code = x86_64Instructions(outFile)
    // initialise the scanner module
    inp = InputProgramScanner(inFile)
}

While we are on this topic, now that the compiler can (and most of the times will) send the resulting code to a file, the console would be totally silent. So, I have added one “Hello” message in the beginning and one “Finished” message at the end:

fun main(args: Array<String>) {
    println("TINSEL(c) compiler v1.1 Jan 2022, Copyright M.Pappas\n")
    initCompiler(args)
    parseProgram()
    println("Successful compilation, ${inp.currentLineNumber-1} source lines") // -1 is needed as an extra new line was added when the input was read
}

Finally, I have added an extra comment at the top of the assembly output showing the date and time it was compiled:

fun progInit(progName: String) {
    outputCommentNl("program $progName")
    outputCommentNl("compiled on ${Date()}")
    ....
}

Undeclared Identifiers

You may have noticed that our compiler will happily accept any variable name in an expression without checking whether it’s been declared first. This will cause a failure when our compiler’s output is sent to the assembler, so let’s fix this by adding an undeclared variable check. Actually, the same applies to functions.

In addition our compiler will allow the following:

var a
...
fun a() {
... 

Even though our compiler can cope with this, declaring a variable and a function with the same name, first is not good programming practice, and second it will cause an issue in our assembly code as both will translate to the same symbol:

a:
...
a:
... 

and this will cause an assembler error.

So, first, let’s combine the check “variable already defined” with the check “function already defined”. For this we need to combine the two maps:

// the variables space
val variablesSpace = mutableMapOf<String,VariableDecl>()

// the functions space
val functionsSpace = mutableMapOf<String,VarType>()

into one:

// the identifiers space map
val identifiersSpace = mutableMapOf<String,IdentifierDecl>()

/** the declaration space (variables and functions) */
class IdentifierDecl(var fv: TokType, var type: VarType, var initialised: Boolean = false)

With this, our declareVar and declareFun functions can check against one single map (identifiersSpace) whether an identifier has been declared, and if not, they can update this map setting the fv field accordingly (for a variable or function – the corresponding values have already been defined in the TokType enumeration).

/** declare a variable */
fun declareVar(name: String, initValue: String) {
// check for duplicate var declaration
if (identifiersSpace[name] != null)
abort ("line ${inp.inLineNumber()}: identifier $name already declared")
identifiersSpace[name] = IdentifierDecl(TokType.variable, VarType.int, initValue!="")
code.declareInt(name, initValue)
}

/** process a function declaration */
fun declareFun(name: String) {
if (identifiersSpace[name] != null)
abort ("line ${inp.inLineNumber()}: identifier $name already declared")
identifiersSpace[name] = IdentifierDecl(TokType.function, VarType.int)
code.declareAsmFun(name)
}

Now that we have joined the declarations space for our variables and functions and can easily tell whether an identifier is a function or a variable, let’s go back to scan. When scan sees an alphabetic token, which is not a keyword, it returns exactly that: a token of type identifier without any additional information. And it’s down to our parser to decide whether that identifier is a variable or a function depending on whether there is a "(" following. It looks like we can do this in a smarter way. Let’s change scan to return the type of the token as well (function, variable or none).

/** get the next token and advance the "cursor" */
private fun scan(): Token {
    newLine = skipWhite()
    if (checkEndofInput())
        return Token(END_OF_INPUT, Kwd.endOfInput, TokType.none)
    if (checkNumeric())
        return Token(getNumber(), Kwd.number, TokType.none)
    if (checkAlpha())
        return keywordOrFunctionOrVariable(getName())
    if (checkSpecialToken())
        return getSpecialToken()
    return getInvalidToken()
}

/** return a keyword or identifier token based on the keyword tokens list */
private fun keywordOrFunctionOrVariable(name: String): Token {
    val indx = isKeyword(name)
    return (if (indx >= 0)
                languageTokens[indx]  // keyword found
            else
                // function, variable or other identifier found (determined by Token type)
                Token(name, Kwd.identifier, identifiersSpace[name]?.fv ?: TokType.none))
}

And with this additional information in the token coming from the scanner, we can improve our parsers. Until now, our compiler cannot process a plain function call (not part of an assignment):

fun a() {
    ...
}
main {
    ...
    a()     <-- this will produce an error ("=" expected), which should not be the case
    ...
}

To fix this let’s first update the BNF definition:

<block> ::= { <statement> * }
<statement> ::= <block> | <if> | <while> | <repeat> | <for> | <break> |
                <return> | <read> | <print> | <assignment> | <function_call> | null [ ; ]
<if> ::= if ( <b-expression> ) <block> [ else <block> ]
<while> ::= while ( <b-expression> ) <block>
<repeat> ::= repeat <block> until ( <b-expression> )
<for> ::= ( <identifier> = <b-expression> to <b-expression> ) <block>
<break> ::= break
<return> ::= return <b-expression>
<assignment> ::= <identifier> = <b-expression>
<function_call> ::= <function_name> ( )
<read> :: = read <identifier> [ , <identifier> ] *
<print> ::= print <b-expression> [ , <b-expression> ] *

And based on this, the parseStatement becomes:

fun parseStatement(breakLabel: String) {
    when (inp.lookahead().encToken) {
        Kwd.startBlock -> parseBlock(breakLabel)
        Kwd.ifToken -> parseIf(breakLabel)
        Kwd.whileToken -> parseWhile()
        Kwd.repeatToken -> parseRepeat()
        Kwd.forToken -> parseFor()
        Kwd.breakToken -> parseBreak(breakLabel)
        Kwd.retTok -> parseReturn()
        Kwd.readToken -> parseRead()
        Kwd.printToken -> parsePrint()
        Kwd.identifier ->
            if (inp.lookahead().type == TokType.variable) parseAssignment()
            else if (inp.lookahead().type == TokType.function) parseFunctionCall()
            else abort("line ${inp.inLineNumber()}: identifier ${inp.lookahead().value} not declared")
        Kwd.semiColon -> inp.match()    // semicolons are simply ignored
        else -> inp.expected("valid keyword, semicolon or identifier")
    }
}

Lastly, the parseIdentfier function can now change to:

fun parseIdentifier() {
when (inp.lookahead().type) {
TokType.variable -> parseVariable()
TokType.function -> parseFunctionCall()
else -> abort("line ${inp.inLineNumber()}: identifier ${inp.lookahead().value} not declared")
}
}

/**
* parse a function call
* <function_call> ::= <function_name> ( )
*/
fun parseFunctionCall() {
val funcName = inp.match(Kwd.identifier).value
inp.match(Kwd.leftParen)
inp.match(Kwd.rightParen)
code.callFunction(funcName)
}

/** parse a reference to a variable */
fun parseVariable() {
val varName = inp.match(Kwd.identifier).value
code.setAccumulatorToVar(varName)
}

With these changes, our compiler now supports the “plain” function call and also checks every time an identifier is mentioned, whether it has been declared (as a variable or a function) and produces an error if not.

Uninitialised Variables

It would be good to have a check for uninitialised variables but we need to think how this will work. For the moment we have the flag initialised in our identifiers space map, which is set to true when a variable is initialised at declaration time. Also need to remember that for now we only use global variables.

So, regardless of initialisation at declaration time, we can set the initialised flag to true when the variable is assigned a value, which is when it appears on the left hand side of an assignment or in a read statement. And then, every time the variable appears on the right hand side of an assignment, we can check the initialised flag and trigger a compile error if not initialised. So far, so good.

However, we are talking about global variables here. They can be modified by any function in our program, even by functions in other files (when TINSEL supports this). This means it’s almost impossible for the compiler to determine at compile time whether a variable has been initialised or not when it appears on the right hand side of an assignment.

So, it looks like we will abandon the idea of checking for uninitialised variables as far as the global variables go, but we will come back to it when we introduce local variables. And that’s what the commercial compilers do when it comes to uninitialised variables.

Finally, don’t forget that global variables that are not initialised in our compiler when they are declared, will be initialised to 0 (zero) by default (see chapter 10).

Return

As mentioned in Chapter VIII, each function must have a return statement. So, to be thorough, our compiler must check this and produce an error if that return statement is not found. The assembler would not complain about the absence of a return statement, however this scenario would lead to unpredictable results, as the code would continue executing whatever bytes it found after the end of our function (most likely some other function or the main program itself from the beginning…). A decent compiler would also check that the return statement is actually reachable and would produce a fatal error otherwise. We will keep it simple though and will only check for the existence of the return statement.

The first question is, where do we do this? The most obvious answer must be in parseBlock, as this is where the body of the function is processed. However, parseBlock is called in quite a few other cases like from parseIf, parseWhile, parseRepeat, etc. and even from parseMain. So, let’s create a new version of parseBlock, that will be called only form parseFuncDecl to parse the function block. And let’s assume for a moment that parseBlock returns a Boolean value indicating whether the block included a return statement or not (we will see in a minute how we will do this). Our new parseFunctionBlock will now check the return value of parseBlock and if false, it knows there was no return statement:

fun parseFunctionBlock(funName: String) {
val hasReturn = parseBlock()
if (!hasReturn)
abort("line ${inp.currentLineNumber}: function $funName has no ${inp.decodeToken(Kwd.retTok)}")
}

The next question is, how will parseBlock check for the return statement? parseBlock is recursive: it calls parseStatement, which calls parseWhile, which calls parseBlock, which calls parseStatement, which calls parseIf, which calls parseBlock, etc. and at some point in this chain there might be a return statement.

Let’s take a step back and think of how we have implemented return. It is done by a parseReturn function that has been added to the parseStatement where block. It is at this level where we know that we have a return statement. And it looks like there is an “easy” way to check, very similar to how Jack implemented the breakLabel, which is passed down the recursion chain as a parameter. I will change parseStatement, to return a Boolean value indicating whether the statement it executed has found a return statement or not. This return value will be returned further on, by the previous function, up the recursion chain until it reaches the top level parseBlock (the one that was called from parseFunctionBlock). This may sound more complicated than it really is, but it boils down to this: each parser that calls parseBlock, simply reads its return value and returns it to the caller. In the end this value reaches parseFunctionBlock. And here’s the code for it:

fun parseBlock(breakLabel: String = ""): Boolean {
var foundReturn = false
inp.match(Kwd.startBlock)
while (inp.lookahead().type != TokType.endOfBlock && !inp.isEndOfProgram()) {
if (parseStatement(breakLabel))
foundReturn = true
}
inp.match(Kwd.endBlock)
return foundReturn
}
fun parseStatement(breakLabel: String): Boolean {
    when (inp.lookahead().encToken) {
        Kwd.startBlock -> return parseBlock(breakLabel)
        Kwd.ifToken -> return parseIf(breakLabel)
        Kwd.whileToken -> return parseWhile()
        Kwd.repeatToken -> return parseRepeat()
        Kwd.forToken -> return parseFor()
        Kwd.breakToken -> { parseBreak(breakLabel); return false }
        Kwd.retTok -> { parseReturn(); return true }
        Kwd.readToken -> { parseRead(); return false }
        Kwd.printToken -> { parsePrint(); return false }
        Kwd.identifier -> {
            if (inp.lookahead().type == TokType.variable) parseAssignment()
            else if (inp.lookahead().type == TokType.function) parseFunctionCall()
            else abort("line ${inp.currentLineNumber}: identifier ${inp.lookahead().value} not declared")
            return false
        }
        Kwd.semiColon -> { inp.match(); return false }    // semicolons are simply ignored
        else -> { inp.expected("valid keyword, semicolon or identifier"); return false }
    }
}

As you can see, parseStament now returns a Boolean value indicating whether the statement it executed has found a return statement or not. parseBlock checks this return value and sets foundReturn to true if at least one statement has found a return. At the end of the block, this Boolean variable has the answer to the question “was there a return statement in this block?”. At the same time the foundReturn value is also returned to the caller, as this block could have been called by a parseStatement or a parseIf, etc, and they need to return that value up the recursion chain. This value is checked at the end of the chain by parseFunctionBlock and produces a compiler error if false.

Looking at parseStatement, we can see that it returns true on when it executes parseReturn (as this is where the return statement is found). It returns false if it executes parseBreak, parseRead, parsePrint, parseFunctionCall and parseAssignment (no return statement there). For all the other statements (parseIf, parseWhile, parseRepeat, parseFor) it returns their return value. These parsers must record the return value of the parseBlock that they call and return that to the caller. Where parseBlock is called more than once (like in parseIf) then all the return values must be OR‘ed as the return statement can be in any of these blocks. Easy, isn’t it!!

On a side note, you may have noticed a new variable above: currentLineNumber. If you remember, when I introduced line numbers in Chapter IX, I had some doubts about the accuracy of the line number. And there is a good reason for this. The lineNumber variable is increased by 1 with every new line in function getNextChar, which is called by skipWhite, which is called by scan, which is called at the end of match to advance to the next token. So, lineNumber simply points to the line where the next token is, not the one we are processing right now.

To fix this, I have introduced the new variable currentLineNumber, which is set to the lineNumber at the beginning of match (i.e. in the beginning of the next cycle). This way currentLineNumber points always to the line of the current token and can be safely used in our error messages. Don’t forget, we need to set it both in the beginning of match and also in the beginning of lookahead, as these are the two points that our parsers call to check the current token.

fun match(keyWord: Kwd = Kwd.any): Token {
    currentLineNumber = lineNumber
    ....
fun lookahead(): Token {
currentLineNumber = lineNumber
return nextToken
}

Finally, I have added a -debug option to the main program that activates a debug function (debugCompiler) instead of the standard parseProgram function. In this debug mode our compiler will scan the input and will print each token, next token, line number and other details so that we can easily verify how (and if) our scanner works.

And as always, you can find the code in my GitHub repository.

Coming up next: Fixing the For Loop

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s