Chapter 3: More Expressions – Variables, Functions

Based on the original series “Let’s Build a Compiler!” by Jack Crenshaw.

Introduction

So far this tutorial has shown us how to parse and translate numeric expressions up to arbitrary levels of complexity. However, we still have a few restrictions:

  • no variables are allowed – only numeric factors
  • the numeric factors were limited to single digits
  • no white spaces allowed

In this third chapter we will remove these restrictions and will also add support for assignments and simple calls to functions. Please remember that the second restriction is self-imposed and I fully agree with Jack that it helps simplify the code so that we can focus on the important points first. When we are ready to get rid of it, it will be very easy.

Variables

Most expressions in real life programming involve variables such as:

b * b + 4 * a + c

If we go back to our current definition of factor

<factor> ::= ( <expression> ) | <number>

we can easily conclude that the variable references in the above expression are just factors. So the factor definition becomes now:

<factor> ::= ( <expression> ) | <number> | <variable>

So, from this we can see that we have an easy choice: if the next character is a digit we have a number, if it’s a letter we have a variable. Same as with the number, we will just have to set the Accumulator to the value of the variable. Sticking with the 68000 assembly code for now, we are learning form Jack’s tutorial that this is done in a position-independent manner, i.e. it’s relative to the program counter:

MOVE X(PC),D0

where X is the variable name. Based on the above the method parseFactor now becomes:

/**
 * parse a factor
 * <factor> ::= ( <expression> ) | <number> | <variable>
 */
fun parseFactor() {
        if (inp.isLeftParen(inp.nextChar)) {
                // ( Expression )
                inp.getNextChar()
                parseExpression()
                inp.match(rightParen)
        }
        else if (inp.isAlpha(inp.nextChar))
                // Variable
                code.setAccumulatorToVar(inp.getName())
        else
                // Number
                code.setAccumulator(inp.getNumber())
}

Please remember to add the new 68000 instruction in M68000Instructions.kt:

/** set accumulator to variable */
fun setAccumulatorToVar(identifier: String) = outputCodeNl("MOVE ${identifier}(PC),D0")

So, with just 2 extra lines of code in the parser we have added support for variables! And as before our parseFactor function matches exactly the BNF definition. As usual, try it with various input “programs” and ensure it does proper error checking.

Functions

Apparently functions is one of the few common things that is supported by many languages. Even though this is too early stage to introduce functions, because we have not addressed the question about passing parameters, Jack includes simple function calls without parameters here, as it brings up an important point.

Until now, the parsing was predictive, i.e. by looking at the next character we could tell what follows. However, with the introduction of functions, if the next character is a letter, we don’t really know whether it’s a variable or a function. This is solved by following the C language standard where a function name is always followed by () – empty if there are no parameters.

Given that we cannot handle parameters yet, our function calls will have the form:

f()

Without parameters to take care of, there is nothing else to do than call that function. Given that now we have two possibilities if we encounter an “alpha” in parseFactor, this will be dealt with in a separate parse function.

Please modify parseFactor as below and add the new function parseIdentifier:

/**
 * parse a factor
 * <factor> ::= ( <expression> ) | <number> | <identifier>
 */
fun parseFactor() {
        if (inp.isLeftParen(inp.nextChar)) {
                // ( Expression )
                inp.getNextChar()
                parseExpression()
                inp.match(rightParen)
        }
        else
        if (inp.isAlpha(inp.nextChar))
                // Identifier
                parseIdentifier()
        else
                // Number
                code.setAccumulator(inp.getNumber())
}

/**
 * parse an identifier
 * <identifier> ::= <variable> | <function>
 */
fun parseIdentifier() {
        val identName: String = inp.getName()
        if (inp.isLeftParen(inp.nextChar)) {
                // function
                inp.getNextChar()
                inp.match(rightParen)
                code.callSubroutine(identName)
        }
        else
                // variable
                code.setAccumulatorToVar(identName)
}

And of course add the new instruction into M68000Instructions.kt

/** call a subroutine */
fun callSubroutine(subroutine: String) = outputCodeNl("BSR ${subroutine}")

Please compile and run this version and as always test the error handling.

What Jack has shown us here is that with very little added complexity we have increased the scope of our compiler tremendously. Where the predictive approach cannot tell us what follows (at the point where parseFactor sees a letter coming next) we just branch to another function and let it deal with it by applying predictive approach further down the line (read that identifier name and check if the next character is a left parenthesis). And this technique can extend many levels down.

More on Error Handling

At this point let me point out that our compiler is performing decent error handling, catching invalid expressions and aborts the compilation in these scenarios. As Jack mentioned, we don’t have any direct error checks in any of the parser functions, so we seem to be getting this error handling almost “for free”. And this is indeed the case. Our parser relies on the error handling done in match, getNumber and getName.

Having said that, those of you who are good at breaking programs must have noticed that the expressions:

1+2<space>3*4 or

1+2@3*4

both stop after processing 1+2. And if you think about it, this is what we have told our compiler to do: process an Expression and stop at the end of it (i.e. when the next character is not a recognised as part of the expression).

We temporarily can fix this gap in our error handling by agreeing that our expressions should end with a newline. To enforce this add the following code into our main function just after the call to parseExpression:

/** main function */
fun main(args: Array<String>) {
initCompiler(args)
parseExpression()
if (!inp.isEndOfLine(inp.nextChar))
expected("Newline")
}

Don’t forget to add the corresponding function into InputProgramScanner.kt:

/** check for end of line */
fun isEndOfLine(c: Char): Boolean = c == '\n' || c == '\r'

As you can see the function that checks for end of line checks for either CR or LF to cover both Windows and Unix. However, you are free to fine-tune it to make sure it suits your environment.

As usual, compile it and try it. It should now recognise and flag expressions not properly terminated with a new line.

Assignment Statements

So, now that we have successfully parsed complex numerical expressions, let’s take the natural next step and save the output of the expression, which brings us to assignments. An assignment will usually have the form:

<assignment> ::= <identifier> = <expression>

In order to parse this we need another function that replicates the above BNF:

/**
* parse assignment
* <assignment> ::= <identifier> = <expression>
*/
fun parseAssignment() {
var identName: String = inp.getName()
inp.match(equalsOp)
parseExpression()
code.assignment(identName)
}

Also need to add one more global declaration in our InputProgramScanner.kt:

// equals
val equalsOp = '='

And of course the necessary 68000 code (this is very CPU specific so I trust Jack here):

/** assignment var to value */
fun assignment(identifier: String) {
outputCodeNl("LEA $identifier(PC),A0")
outputCodeNl("MOVE D0,(A0)")
}

Finally, replace the call to parseExpression in function main with a call to parseAssignement:

/** main function */
fun main(args: Array<String>) {
initCompiler(args)
parseAssignment()
if (!inp.isEndOfLine(inp.nextChar))
expected("Newline")
}

Once again, by adding 6 lines of code in the parser we have extended the functionality of our compiler dramatically! And as before, error checking comes for free as the new parsing function relies on match and getName to do the error checking.

If expressions was the only thing we would need for a program then we would be done. However, programs also need control statements, if-then-else, loops, etc. These are covered in chapter 5. What we have achieved up to now though is quite important, as expressions are one of the most challenging topics in a programming language.

Multi-Character Tokens

As I have said already, I’m following Jack’s approach by sticking to the “keep it simple” concept, hence the restrictions of single character tokens. This approach is used in some of the following chapters as well in order to avoid complexity and focus on the important issue at hand. It is very easy to be removed though as you can see below.

So, if we think of the identifier as a letter followed by alphanumeric characters (letters or numbers) or '_', and a number as a series of numeric digits, then we can easily modify getName and getNumber as per below:

  • Rename getNextChar to skipNextChar and define a new function getNextChar that for now just calls skipNextChar. This will be used in the following topic (white spaces):
/** get the next character from input */
fun getNextChar() {
skipNextChar()
}

/** set the lookahead character to the next char from input */
fun skipNextChar() {
if (indx < inputProgram.length)
nextChar = inputProgram[indx++]
else
exit("end of input")
}
  • Change getName and getNumber as per below:
/** get an identifier */
fun getName(): String {
    var token: String = ""
    if (!isAlpha(nextChar))
        expected("Identifier")
    else {
        while (isAlphanumeric(nextChar)) {
            token += nextChar.uppercase()
            skipNextChar()
        }
    }
    return token
}

/**
 * get a number
 * <number ::= [ <digit> ] +
 */
fun getNumber(): String {
    var value: String = ""
    if (!isNumeric(nextChar)) {
        expected("Number")
    } else {
        while (isNumeric(nextChar)) {
            value += nextChar.toString()
            skipNextChar()
        }
    }
    return value
}
  • Add the following as well to our InputProgramScanner class:
/** check for alphanumeric */
fun isAlphanumeric(c: Char): Boolean = isAlpha(c) || isNumeric(c) || c == '_'

That seemed really easy. Compile and test this code. Now you can parse expressions like

var1=1234*x1+95-46*y1

Make sure it flags invalid characters in the expression.

White Spaces

The white space restriction is also one of those little things that annoy us, so let’s make our compiler more user-friendly by removing it. Not surprisingly Jack is suggesting that we find a simple rule to skip the white space and apply it everywhere. Until now, the lookahead character contained the next usable character in the input stream and this has always been true as there are no white spaces in our input code. The lookahead character is set by getNextChar.

So, naturally, we can enhance getNextChar to skip any white spaces and get us the next usable character from the input ignoring any white spaces.

  • Please add the new functions isWhite and skipWhite and modify getNextChar as below:
/** get the next character from input after skipping white spaces */
fun getNextChar() {
    skipNextChar()
    skipWhite()
}

/** skip white spaces */
fun skipWhite() {
    while (isWhite(nextChar))
        skipNextChar()
}

/** check for a white space */
fun isWhite(c: Char): Boolean = c == ' ' || c == '\t'
  • Add a call to skipWhite in getName and getNumber at the end of the loop as skipNextChar, which is used there, does not skip white spaces:
/**
 * get an identifier
 * <identifier> ::= <alpha> [ <alphanumeric> | <_> ] *
 */
fun getName(): String {
    var token: String = ""
    if (!isAlpha(nextChar))
        expected("Identifier")
    else {
        while (isAlphanumeric(nextChar)) {
            token += nextChar.uppercase()
            skipNextChar()
        }
        skipWhite()
    }
    return token
}

/**
 * get a number
 * <number> ::= [ <digit> ] +
 */
fun getNumber(): String {
    var value: String = ""
    if (!isNumeric(nextChar)) {
        expected("Number")
    } else {
        while (isNumeric(nextChar)) {
            value += nextChar.toString()
            skipNextChar()
        }
        skipWhite()
    }
    return value
}

Here there is another difference between my code and Jack’s. I have used getNextChar where match was used when there was no need to match anything (in add, subtract, multiply and divide) and this has led to the need to create skipNextChar used in getNumber and getName. This function does not skip white spaces but the new version of getNextChar does, which means that by using skipWhite in it, I don’t have to do it in all the places where getNextChar is used. Given that I need to skip any white spaces every time getNextChar is called, in order to get the next usable character, it makes sense in my mind to do this inside getNextChar once. Of course the price I have to pay for this, is that I need two flavours of this function: skipNextChar that does not skip white spaces and getNexrtChar that skips white spaces.

Coming up next: Interpreters

Jack chose to dedicate a chapter to Interpreters, “even if you have no interest in interpreters”. With this next chapter he is giving us quite useful insight that will be used later in the following chapters of our compiler.

Given that there were quite a few changes in many parts of the compiler in this chapter, I’m listing the full code here.

If you look at the below code carefully, you will see that I have moved the initial call to getNextChar (the one that sets the first lookahead character to get our compiler going) from the initialisation code of our main program in CompilerMain.kt to the initialisation code of the class InputProgramScanner. I feel it’s place is there because this is the class that deals with our input stream so this where the lookahead character should be initiliased.

CompilerMain.kt

package you.set.this

import kotlin.system.exitProcess

/**
 * A Simple Compiler
 * Based on the Let's Build a Compiler! series by Jack Crenshaw
 * This version produces Assembly language code for the 68000 microprocessor
 * A future version will produce code for the TI-59 Calculator
 * It will produce output that uses the TI-59 mnemonics that can be keyed directly to the calculator
 * Version 1.0 01.10.2021
 */

// the input program scanner
lateinit var inp: InputProgramScanner
// the Motorola 68000 instruction set
val code = M68000Instructions()

/** report an error */
fun error(errMsg: String) {
    System.err.println("Error: $errMsg")
}

/** abort compilation */
fun abort(errMsg: String) {
    error(errMsg)
    exitProcess(1)
}

/** print message and exit */
fun exit(msg: String) {
    System.err.println(msg)
    exitProcess(0)
}

/** report what was expected and abort */
fun expected(expMsg: String) = abort("Expected $expMsg \n found [${inp.nextChar}]")

/** compiler initialisation */
fun initCompiler(args: Array<String>) {
    if (args.isEmpty())
        abort("no input file specified")
    // get a new input program scanner object - initialise from input file
    inp = InputProgramScanner(args[0])
}

/** main function */
fun main(args: Array<String>) {
    initCompiler(args)
    parseAssignment()
    if (!inp.isEndOfLine(inp.nextChar))
        expected("Newline")
}

InputProgramScanner.kt

package you.set.this

import java.io.File

/**
 * The input program scanner class
 * Performs the lexical scanner functions
 * Processes the char-by-char input and returns the tokens from the input stream
 */

/////////////////////////////////////////////////////////
// global special characters definitions
/////////////////////////////////////////////////////////
// numeric operators
val addOp: Char = '+'
val subOp: Char = '-'
val mulOp: Char = '*'
val divOp: Char = '/'
// parentheses
val leftParen: Char = '('
val rightParen: Char = ')'
// equals
val equalsOp = '='
/////////////////////////////////////////////////////////

/** this class implements the lexical scanner */
class InputProgramScanner(inputFile: String = "") {

    // the input program as string
    var inputProgram: String = ""
    var indx = 0

    // the next character from input
    // this is our lookahead character
    var nextChar: Char = ' '

    /** initialisation code */
    init {
        try {
            val f = File(inputFile)
            // read the whole program into a string
            // add a dummy \n at the end
            // this way the lookahead will work properly when we reach the end of input
            inputProgram = f.readText() + "\n"
            // set the lookahead character to the first input char
            getNextChar()
        } catch (e: Exception) {
            abort("could not open file [$inputFile]")
        }
    }

    /** get the next character from input after skipping white spaces */
    fun getNextChar() {
        skipNextChar()
        skipWhite()
    }

    /** set the lookahead character to the next char from input */
    fun skipNextChar() {
        if (indx < inputProgram.length)
            nextChar = inputProgram[indx++]
        else
            exit("end of input")
    }

    /** skip white spaces */
    fun skipWhite() {
        while (isWhite(nextChar))
            skipNextChar()
    }

    /** match a specific input char */
    fun match(x: Char) {
        // return a match when the next char matches x
        if (nextChar == x)
            getNextChar()
        else
            expected("'$x'")
    }

    /**
     * get an identifier
     * <identifier> ::= <alpha> [ <alphanumeric> | <_> ] *
     */
    fun getName(): String {
        var token: String = ""
        if (!isAlpha(nextChar))
            expected("Identifier")
        else {
            while (isAlphanumeric(nextChar)) {
                token += nextChar.uppercase()
                skipNextChar()
            }
            skipWhite()
        }
        return token
    }

    /**
     * get a number
     * <number> ::= [ <digit> ] +
     */
    fun getNumber(): String {
        var value: String = ""
        if (!isNumeric(nextChar)) {
            expected("Number")
        } else {
            while (isNumeric(nextChar)) {
                value += nextChar.toString()
                skipNextChar()
            }
            skipWhite()
        }
        return value
    }

    /** check for an alpha char */
    fun isAlpha(c: Char): Boolean = c.uppercaseChar() in 'A'..'Z'

    /** check for a numeric char */
    fun isNumeric(c: Char): Boolean = c in '0'..'9'

    /** check for alphanumeric */
    fun isAlphanumeric(c: Char): Boolean = isAlpha(c) || isNumeric(c) || c == '_'

    /** check for an "addop" (+,-) */
    fun isAddop(c: Char): Boolean = c == addOp || c == subOp

    /** check for a "mulop" (*,/) */
    fun isMulop(c: Char): Boolean = c == mulOp || c == divOp

    /** check for left parenthesis */
    fun isLeftParen(c: Char): Boolean = c == leftParen

    /** check for end of line */
    fun isEndOfLine(c: Char): Boolean = c == '\n' || c == '\r'

    /** check for a white space */
    fun isWhite(c: Char): Boolean = c == ' ' || c == '\t'
}

ProgramParser.kt

package you.set.this

/**
 * Program parsing - module
 *
 */

/**
 * parse assignment
 * <assignment> ::= <identifier> = <expression>
 */
fun parseAssignment() {
        var identName: String = inp.getName()
        inp.match(equalsOp)
        parseExpression()
        code.assignment(identName)
}

/**
 * parse a numeric expression
 * <expression> ::= <term> [ <addop> <term> ] *
 */
fun parseExpression() {
        if (inp.isAddop(inp.nextChar))
        // "trick" to deal with -Expression or +Expression
                code.clearAccumulator()
        else
                parseTerm()
        while (inp.isAddop(inp.nextChar)) {
                code.saveAccumulator()
                when (inp.nextChar) {
                        addOp -> add()
                        subOp -> subtract()
                }
        }
}

/**
 * parse a term
 * <term> ::= <factor> [ <mulop> <factor> ] *
 */
fun parseTerm() {
        parseFactor()
        while (inp.isMulop(inp.nextChar)) {
                code.saveAccumulator()
                when (inp.nextChar) {
                        mulOp -> multiply()
                        divOp -> divide()
                }
        }
}

/**
 * parse a factor
 * <factor> ::= ( <expression> ) | <number> | <identifier>
 */
fun parseFactor() {
        if (inp.isLeftParen(inp.nextChar)) {
                // ( Expression )
                inp.getNextChar()
                parseExpression()
                inp.match(rightParen)
        }
        else
        if (inp.isAlpha(inp.nextChar))
                // Identifier
                parseIdentifier()
        else
                // Number
                code.setAccumulator(inp.getNumber())
}

/**
 * parse an identifier
 * <identifier> ::= <variable> | <function>
 */
fun parseIdentifier() {
        val identName: String = inp.getName()
        if (inp.isLeftParen(inp.nextChar)) {
                // function
                inp.getNextChar()
                inp.match(rightParen)
                code.callSubroutine(identName)
        }
        else
                // variable
                code.setAccumulatorToVar(identName)
}

/** parse an addition */
fun add() {
        inp.getNextChar()
        parseTerm()
        code.addToAccumulator()
}

/** parse a subtraction */
fun subtract() {
        inp.getNextChar()
        parseTerm()
        code.subFromAccumulator()
}

/** parse a multiplication */
fun multiply() {
        inp.getNextChar()
        parseFactor()
        code.multiplyAccumulator()
}

/** parse a division */
fun divide() {
        inp.getNextChar()
        parseFactor()
        code.divideAccumulator()
}

M68000Instructions.kt

package you.set.this

class M68000Instructions {

    /** output code */
    fun outputCode(s: String) = print("\t$s")

    /** output code with newline */
    fun outputCodeNl(s: String) = outputCode("$s\n")

    /** set accumulator to a value */
    fun setAccumulator(value: String) = outputCodeNl("MOVE #${value},D0")

    /** clear accumulator */
    fun clearAccumulator() = outputCodeNl("CLR D0")

    /** push accumulator to the stack */
    fun saveAccumulator() = outputCodeNl("MOVE D0,-(SP)")

    /** add top of stack to accumulator */
    fun addToAccumulator() = outputCodeNl("ADD (SP)+,D0")

    /** subtract top of stack from accumulator */
    fun subFromAccumulator() {
        outputCodeNl("SUB (SP)+,D0")
        // negate accumulator as the result is the wrong way round
        outputCodeNl("NEG D0")
    }

    /** multiply accumulator by top of stack */
    fun multiplyAccumulator() = outputCodeNl("MULS (SP)+,D0")

    /** divide accumulator by top of stack */
    fun divideAccumulator() {
        outputCodeNl("MOVE (SP)+,D1")
        outputCodeNl("DIVS D1,D0")
    }

    /** set accumulator to variable */
    fun setAccumulatorToVar(identifier: String) = outputCodeNl("MOVE ${identifier}(PC),D0")

    /** call a subroutine */
    fun callSubroutine(subroutine: String) = outputCodeNl("BSR ${subroutine}")

    /** assignment var to value */
    fun assignment(identifier: String) {
        outputCodeNl("LEA $identifier(PC),A0")
        outputCodeNl("MOVE D0,(A0)")
    }
}

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s