Based on the original series “Let’s Build a Compiler!” by Jack Crenshaw.
Introduction
So far this tutorial has shown us how to parse and translate numeric expressions up to arbitrary levels of complexity. However, we still have a few restrictions:
- no variables are allowed – only numeric factors
- the numeric factors were limited to single digits
- no white spaces allowed
In this third chapter we will remove these restrictions and will also add support for assignments and simple calls to functions. Please remember that the second restriction is self-imposed and I fully agree with Jack that it helps simplify the code so that we can focus on the important points first. When we are ready to get rid of it, it will be very easy.
Variables
Most expressions in real life programming involve variables such as:
b * b + 4 * a + c
If we go back to our current definition of factor
<factor> ::= ( <expression> ) | <number>
we can easily conclude that the variable references in the above expression are just factors. So the factor definition becomes now:
<factor> ::= ( <expression> ) | <number> | <variable>
So, from this we can see that we have an easy choice: if the next character is a digit we have a number, if it’s a letter we have a variable. Same as with the number, we will just have to set the Accumulator to the value of the variable. Sticking with the 68000 assembly code for now, we are learning form Jack’s tutorial that this is done in a position-independent manner, i.e. it’s relative to the program counter:
MOVE X(PC),D0
where X
is the variable name. Based on the above the method parseFactor
now becomes:
/** * parse a factor * <factor> ::= ( <expression> ) | <number> | <variable> */ fun parseFactor() { if (inp.isLeftParen(inp.nextChar)) { // ( Expression ) inp.getNextChar() parseExpression() inp.match(rightParen) } else if (inp.isAlpha(inp.nextChar)) // Variable code.setAccumulatorToVar(inp.getName()) else // Number code.setAccumulator(inp.getNumber()) }
Please remember to add the new 68000 instruction in M68000Instructions.kt
:
/** set accumulator to variable */
fun setAccumulatorToVar(identifier: String) = outputCodeNl("MOVE ${identifier}(PC),D0")
So, with just 2 extra lines of code in the parser we have added support for variables! And as before our parseFactor
function matches exactly the BNF definition. As usual, try it with various input “programs” and ensure it does proper error checking.
Functions
Apparently functions is one of the few common things that is supported by many languages. Even though this is too early stage to introduce functions, because we have not addressed the question about passing parameters, Jack includes simple function calls without parameters here, as it brings up an important point.
Until now, the parsing was predictive, i.e. by looking at the next character we could tell what follows. However, with the introduction of functions, if the next character is a letter, we don’t really know whether it’s a variable or a function. This is solved by following the C language standard where a function name is always followed by () – empty if there are no parameters.
Given that we cannot handle parameters yet, our function calls will have the form:
f()
Without parameters to take care of, there is nothing else to do than call that function. Given that now we have two possibilities if we encounter an “alpha” in parseFactor
, this will be dealt with in a separate parse function.
Please modify parseFactor
as below and add the new function parseIdentifier
:
/** * parse a factor * <factor> ::= ( <expression> ) | <number> | <identifier> */ fun parseFactor() { if (inp.isLeftParen(inp.nextChar)) { // ( Expression ) inp.getNextChar() parseExpression() inp.match(rightParen) } else if (inp.isAlpha(inp.nextChar)) // Identifier parseIdentifier() else // Number code.setAccumulator(inp.getNumber()) } /** * parse an identifier * <identifier> ::= <variable> | <function> */ fun parseIdentifier() { val identName: String = inp.getName() if (inp.isLeftParen(inp.nextChar)) { // function inp.getNextChar() inp.match(rightParen) code.callSubroutine(identName) } else // variable code.setAccumulatorToVar(identName) }
And of course add the new instruction into M68000Instructions.kt
/** call a subroutine */
fun callSubroutine(subroutine: String) = outputCodeNl("BSR ${subroutine}")
Please compile and run this version and as always test the error handling.
What Jack has shown us here is that with very little added complexity we have increased the scope of our compiler tremendously. Where the predictive approach cannot tell us what follows (at the point where parseFactor
sees a letter coming next) we just branch to another function and let it deal with it by applying predictive approach further down the line (read that identifier name and check if the next character is a left parenthesis). And this technique can extend many levels down.
More on Error Handling
At this point let me point out that our compiler is performing decent error handling, catching invalid expressions and aborts the compilation in these scenarios. As Jack mentioned, we don’t have any direct error checks in any of the parser functions, so we seem to be getting this error handling almost “for free”. And this is indeed the case. Our parser relies on the error handling done in match
, getNumber
and getName
.
Having said that, those of you who are good at breaking programs must have noticed that the expressions:
1+2<space>3*4 or
1+2@3*4
both stop after processing 1+2. And if you think about it, this is what we have told our compiler to do: process an Expression and stop at the end of it (i.e. when the next character is not a recognised as part of the expression).
We temporarily can fix this gap in our error handling by agreeing that our expressions should end with a newline. To enforce this add the following code into our main
function just after the call to parseExpression
:
/** main function */
fun main(args: Array<String>) {
initCompiler(args)
parseExpression()
if (!inp.isEndOfLine(inp.nextChar))
expected("Newline")
}
Don’t forget to add the corresponding function into InputProgramScanner.kt
:
/** check for end of line */
fun isEndOfLine(c: Char): Boolean = c == '\n' || c == '\r'
As you can see the function that checks for end of line checks for either CR or LF to cover both Windows and Unix. However, you are free to fine-tune it to make sure it suits your environment.
As usual, compile it and try it. It should now recognise and flag expressions not properly terminated with a new line.
Assignment Statements
So, now that we have successfully parsed complex numerical expressions, let’s take the natural next step and save the output of the expression, which brings us to assignments. An assignment will usually have the form:
<assignment> ::= <identifier> = <expression>
In order to parse this we need another function that replicates the above BNF:
/**
* parse assignment
* <assignment> ::= <identifier> = <expression>
*/
fun parseAssignment() {
var identName: String = inp.getName()
inp.match(equalsOp)
parseExpression()
code.assignment(identName)
}
Also need to add one more global declaration in our InputProgramScanner.kt
:
// equals
val equalsOp = '='
And of course the necessary 68000 code (this is very CPU specific so I trust Jack here):
/** assignment var to value */
fun assignment(identifier: String) {
outputCodeNl("LEA $identifier(PC),A0")
outputCodeNl("MOVE D0,(A0)")
}
Finally, replace the call to parseExpression
in function main
with a call to parseAssignement
:
/** main function */
fun main(args: Array<String>) {
initCompiler(args)
parseAssignment()
if (!inp.isEndOfLine(inp.nextChar))
expected("Newline")
}
Once again, by adding 6 lines of code in the parser we have extended the functionality of our compiler dramatically! And as before, error checking comes for free as the new parsing function relies on match
and getName
to do the error checking.
If expressions was the only thing we would need for a program then we would be done. However, programs also need control statements, if-then-else, loops, etc. These are covered in chapter 5. What we have achieved up to now though is quite important, as expressions are one of the most challenging topics in a programming language.
Multi-Character Tokens
As I have said already, I’m following Jack’s approach by sticking to the “keep it simple” concept, hence the restrictions of single character tokens. This approach is used in some of the following chapters as well in order to avoid complexity and focus on the important issue at hand. It is very easy to be removed though as you can see below.
So, if we think of the identifier as a letter followed by alphanumeric characters (letters or numbers) or '_'
, and a number as a series of numeric digits, then we can easily modify getName
and getNumber
as per below:
- Rename
getNextChar
toskipNextChar
and define a new functiongetNextChar
that for now just callsskipNextChar
. This will be used in the following topic (white spaces):
/** get the next character from input */
fun getNextChar() {
skipNextChar()
}
/** set the lookahead character to the next char from input */
fun skipNextChar() {
if (indx < inputProgram.length)
nextChar = inputProgram[indx++]
else
exit("end of input")
}
- Change getName and getNumber as per below:
/** get an identifier */ fun getName(): String { var token: String = "" if (!isAlpha(nextChar)) expected("Identifier") else { while (isAlphanumeric(nextChar)) { token += nextChar.uppercase() skipNextChar() } } return token } /** * get a number * <number ::= [ <digit> ] + */ fun getNumber(): String { var value: String = "" if (!isNumeric(nextChar)) { expected("Number") } else { while (isNumeric(nextChar)) { value += nextChar.toString() skipNextChar() } } return value }
- Add the following as well to our InputProgramScanner class:
/** check for alphanumeric */
fun isAlphanumeric(c: Char): Boolean = isAlpha(c) || isNumeric(c) || c == '_'
That seemed really easy. Compile and test this code. Now you can parse expressions like
var1=1234*x1+95-46*y1
Make sure it flags invalid characters in the expression.
White Spaces
The white space restriction is also one of those little things that annoy us, so let’s make our compiler more user-friendly by removing it. Not surprisingly Jack is suggesting that we find a simple rule to skip the white space and apply it everywhere. Until now, the lookahead character contained the next usable character in the input stream and this has always been true as there are no white spaces in our input code. The lookahead character is set by getNextChar
.
So, naturally, we can enhance getNextChar to skip any white spaces and get us the next usable character from the input ignoring any white spaces.
- Please add the new functions
isWhite
andskipWhite
and modifygetNextChar
as below:
/** get the next character from input after skipping white spaces */ fun getNextChar() { skipNextChar() skipWhite() } /** skip white spaces */ fun skipWhite() { while (isWhite(nextChar)) skipNextChar() } /** check for a white space */ fun isWhite(c: Char): Boolean = c == ' ' || c == '\t'
- Add a call to
skipWhite
ingetName
andgetNumber
at the end of the loop asskipNextChar,
which is used there, does not skip white spaces:
/** * get an identifier * <identifier> ::= <alpha> [ <alphanumeric> | <_> ] * */ fun getName(): String { var token: String = "" if (!isAlpha(nextChar)) expected("Identifier") else { while (isAlphanumeric(nextChar)) { token += nextChar.uppercase() skipNextChar() } skipWhite() } return token } /** * get a number * <number> ::= [ <digit> ] + */ fun getNumber(): String { var value: String = "" if (!isNumeric(nextChar)) { expected("Number") } else { while (isNumeric(nextChar)) { value += nextChar.toString() skipNextChar() } skipWhite() } return value }
Here there is another difference between my code and Jack’s. I have used getNextChar
where match
was used when there was no need to match anything (in add
, subtract
, multiply
and divide
) and this has led to the need to create skipNextChar
used in getNumber
and getName
. This function does not skip white spaces but the new version of getNextChar
does, which means that by using skipWhite
in it, I don’t have to do it in all the places where getNextChar
is used. Given that I need to skip any white spaces every time getNextChar
is called, in order to get the next usable character, it makes sense in my mind to do this inside getNextChar
once. Of course the price I have to pay for this, is that I need two flavours of this function: skipNextChar
that does not skip white spaces and getNexrtChar
that skips white spaces.
Coming up next: Interpreters
Jack chose to dedicate a chapter to Interpreters, “even if you have no interest in interpreters”. With this next chapter he is giving us quite useful insight that will be used later in the following chapters of our compiler.
Given that there were quite a few changes in many parts of the compiler in this chapter, I’m listing the full code here.
If you look at the below code carefully, you will see that I have moved the initial call to getNextChar
(the one that sets the first lookahead character to get our compiler going) from the initialisation code of our main program in CompilerMain.kt
to the initialisation code of the class InputProgramScanner
. I feel it’s place is there because this is the class that deals with our input stream so this where the lookahead character should be initiliased.
CompilerMain.kt
package you.set.this import kotlin.system.exitProcess /** * A Simple Compiler * Based on the Let's Build a Compiler! series by Jack Crenshaw * This version produces Assembly language code for the 68000 microprocessor * A future version will produce code for the TI-59 Calculator * It will produce output that uses the TI-59 mnemonics that can be keyed directly to the calculator * Version 1.0 01.10.2021 */ // the input program scanner lateinit var inp: InputProgramScanner // the Motorola 68000 instruction set val code = M68000Instructions() /** report an error */ fun error(errMsg: String) { System.err.println("Error: $errMsg") } /** abort compilation */ fun abort(errMsg: String) { error(errMsg) exitProcess(1) } /** print message and exit */ fun exit(msg: String) { System.err.println(msg) exitProcess(0) } /** report what was expected and abort */ fun expected(expMsg: String) = abort("Expected $expMsg \n found [${inp.nextChar}]") /** compiler initialisation */ fun initCompiler(args: Array<String>) { if (args.isEmpty()) abort("no input file specified") // get a new input program scanner object - initialise from input file inp = InputProgramScanner(args[0]) } /** main function */ fun main(args: Array<String>) { initCompiler(args) parseAssignment() if (!inp.isEndOfLine(inp.nextChar)) expected("Newline") }
InputProgramScanner.kt
package you.set.this import java.io.File /** * The input program scanner class * Performs the lexical scanner functions * Processes the char-by-char input and returns the tokens from the input stream */ ///////////////////////////////////////////////////////// // global special characters definitions ///////////////////////////////////////////////////////// // numeric operators val addOp: Char = '+' val subOp: Char = '-' val mulOp: Char = '*' val divOp: Char = '/' // parentheses val leftParen: Char = '(' val rightParen: Char = ')' // equals val equalsOp = '=' ///////////////////////////////////////////////////////// /** this class implements the lexical scanner */ class InputProgramScanner(inputFile: String = "") { // the input program as string var inputProgram: String = "" var indx = 0 // the next character from input // this is our lookahead character var nextChar: Char = ' ' /** initialisation code */ init { try { val f = File(inputFile) // read the whole program into a string // add a dummy \n at the end // this way the lookahead will work properly when we reach the end of input inputProgram = f.readText() + "\n" // set the lookahead character to the first input char getNextChar() } catch (e: Exception) { abort("could not open file [$inputFile]") } } /** get the next character from input after skipping white spaces */ fun getNextChar() { skipNextChar() skipWhite() } /** set the lookahead character to the next char from input */ fun skipNextChar() { if (indx < inputProgram.length) nextChar = inputProgram[indx++] else exit("end of input") } /** skip white spaces */ fun skipWhite() { while (isWhite(nextChar)) skipNextChar() } /** match a specific input char */ fun match(x: Char) { // return a match when the next char matches x if (nextChar == x) getNextChar() else expected("'$x'") } /** * get an identifier * <identifier> ::= <alpha> [ <alphanumeric> | <_> ] * */ fun getName(): String { var token: String = "" if (!isAlpha(nextChar)) expected("Identifier") else { while (isAlphanumeric(nextChar)) { token += nextChar.uppercase() skipNextChar() } skipWhite() } return token } /** * get a number * <number> ::= [ <digit> ] + */ fun getNumber(): String { var value: String = "" if (!isNumeric(nextChar)) { expected("Number") } else { while (isNumeric(nextChar)) { value += nextChar.toString() skipNextChar() } skipWhite() } return value } /** check for an alpha char */ fun isAlpha(c: Char): Boolean = c.uppercaseChar() in 'A'..'Z' /** check for a numeric char */ fun isNumeric(c: Char): Boolean = c in '0'..'9' /** check for alphanumeric */ fun isAlphanumeric(c: Char): Boolean = isAlpha(c) || isNumeric(c) || c == '_' /** check for an "addop" (+,-) */ fun isAddop(c: Char): Boolean = c == addOp || c == subOp /** check for a "mulop" (*,/) */ fun isMulop(c: Char): Boolean = c == mulOp || c == divOp /** check for left parenthesis */ fun isLeftParen(c: Char): Boolean = c == leftParen /** check for end of line */ fun isEndOfLine(c: Char): Boolean = c == '\n' || c == '\r' /** check for a white space */ fun isWhite(c: Char): Boolean = c == ' ' || c == '\t' }
ProgramParser.kt
package you.set.this /** * Program parsing - module * */ /** * parse assignment * <assignment> ::= <identifier> = <expression> */ fun parseAssignment() { var identName: String = inp.getName() inp.match(equalsOp) parseExpression() code.assignment(identName) } /** * parse a numeric expression * <expression> ::= <term> [ <addop> <term> ] * */ fun parseExpression() { if (inp.isAddop(inp.nextChar)) // "trick" to deal with -Expression or +Expression code.clearAccumulator() else parseTerm() while (inp.isAddop(inp.nextChar)) { code.saveAccumulator() when (inp.nextChar) { addOp -> add() subOp -> subtract() } } } /** * parse a term * <term> ::= <factor> [ <mulop> <factor> ] * */ fun parseTerm() { parseFactor() while (inp.isMulop(inp.nextChar)) { code.saveAccumulator() when (inp.nextChar) { mulOp -> multiply() divOp -> divide() } } } /** * parse a factor * <factor> ::= ( <expression> ) | <number> | <identifier> */ fun parseFactor() { if (inp.isLeftParen(inp.nextChar)) { // ( Expression ) inp.getNextChar() parseExpression() inp.match(rightParen) } else if (inp.isAlpha(inp.nextChar)) // Identifier parseIdentifier() else // Number code.setAccumulator(inp.getNumber()) } /** * parse an identifier * <identifier> ::= <variable> | <function> */ fun parseIdentifier() { val identName: String = inp.getName() if (inp.isLeftParen(inp.nextChar)) { // function inp.getNextChar() inp.match(rightParen) code.callSubroutine(identName) } else // variable code.setAccumulatorToVar(identName) } /** parse an addition */ fun add() { inp.getNextChar() parseTerm() code.addToAccumulator() } /** parse a subtraction */ fun subtract() { inp.getNextChar() parseTerm() code.subFromAccumulator() } /** parse a multiplication */ fun multiply() { inp.getNextChar() parseFactor() code.multiplyAccumulator() } /** parse a division */ fun divide() { inp.getNextChar() parseFactor() code.divideAccumulator() }
M68000Instructions.kt
package you.set.this class M68000Instructions { /** output code */ fun outputCode(s: String) = print("\t$s") /** output code with newline */ fun outputCodeNl(s: String) = outputCode("$s\n") /** set accumulator to a value */ fun setAccumulator(value: String) = outputCodeNl("MOVE #${value},D0") /** clear accumulator */ fun clearAccumulator() = outputCodeNl("CLR D0") /** push accumulator to the stack */ fun saveAccumulator() = outputCodeNl("MOVE D0,-(SP)") /** add top of stack to accumulator */ fun addToAccumulator() = outputCodeNl("ADD (SP)+,D0") /** subtract top of stack from accumulator */ fun subFromAccumulator() { outputCodeNl("SUB (SP)+,D0") // negate accumulator as the result is the wrong way round outputCodeNl("NEG D0") } /** multiply accumulator by top of stack */ fun multiplyAccumulator() = outputCodeNl("MULS (SP)+,D0") /** divide accumulator by top of stack */ fun divideAccumulator() { outputCodeNl("MOVE (SP)+,D1") outputCodeNl("DIVS D1,D0") } /** set accumulator to variable */ fun setAccumulatorToVar(identifier: String) = outputCodeNl("MOVE ${identifier}(PC),D0") /** call a subroutine */ fun callSubroutine(subroutine: String) = outputCodeNl("BSR ${subroutine}") /** assignment var to value */ fun assignment(identifier: String) { outputCodeNl("LEA $identifier(PC),A0") outputCodeNl("MOVE D0,(A0)") } }