Based on the original series “Let’s Build a Compiler!” by Jack Crenshaw.
Getting Started
Those of you who have already read Chapter 1 will know what this is about. You will also have copied the cradle code in your Kotlin IDE, complied it and ready to go.
In this chapter Jack teaches us how to parse and translate numerical expressions. For purposes of definition a numerical expression is the right hand of an equation, as in:
x = 2 * y + 3 / (4 * z)
I will follow Jack’s method and will take this in *very* small steps.
Single Digits
In keeping with the simplicity theme, we are starting with the absolutely simplest case that we can think of: a numerical expression consisting of a single digit.
Let’s add this code:
- Add the line
parseExpression
to themain
function
/** main function */
fun main(args: Array<String>) {
initCompiler(args)
parseExpression()
}
- Add the
code
variable to the main programCompilerMain.kt
in the global variables definition section. This object will have all the necessary functions to produce the assembly code for our target cpu.
// the input program scanner lateinit var inp: InputProgramScanner // the Motorola 68000 instruction set val code = M68000Instructions()
- New file
ProgParser.kt
We will add all the expression parsing functions into this file.
package mpdev.compiler.m68kcompiler /** * Program parsing - module * */ /** parse a numeric expression */ fun parseExpression() { code.setAccumulator(inp.getNumber()) }
- New file
M68000Instructions.kt
This is where we will produce the 68000 Assembly code for each operation. If we do this properly, we should be able to change the functions in this module and produce code for some other microprocessor (or the TI59 which is the end-goal of this blog).
As always I will keep with Jack’s standard and use the D0 register of the 68000 as the register that holds the result, which going forward I will call “Accumulator” to keep with the Z80 architecture.
Please note the functions outputCode
and outputCodeNl :
they send the resulting code to the output stream without or with a newline respectively.
package mpdev.compiler.m68kcompiler class M68000Instructions { /** output code */ fun outputCode(s: String) = print("\t$s") /** output code with newline */ fun outputCodeNl(s: String) = outputCode("\t$s\n") /** set accumulator to a value */ fun setAccumulator(value: String) = outputCodeNl("MOVE #${value},D0") }
Now compile and run our compiler with a text file as the only command line argument containing our “program”: any single decimal digit. Observe the corresponding MOVE
assembly instructions that will be produced on your screen. Also, try it with an input file that contains a letter, a special character or a space to test the error handling.
So, we now have a working translator!
OK, it’s pretty limited, but don’t be too quick to dismiss it. It does, on a very limited scale, what a proper compiler should do: it correctly recognises legal statements in the input language as per our definition, it produces correct, executable assembler code, suitable for assembling into object code, and, just as importantly, it correctly recognises statements that are not legal and produces a meaningful error message.
Binary (Two-operand) Expressions
Let’s expand a bit. Admittedly, an “expression” consisting of only one character is not going to take us anywhere far, so let’s see how we can extend it. Naturally we would want to handle expressions like:
1+2
5-3
or similarly to Backus-Naur Form:
<term> + | - <term>
For this, we will need a function that recognises a term and leaves the result somewhere, and a function that recognises and distinguishes between '+'
and '-'
and produces the appropriate code. Jack has chosen the 68000 register D1 for holding that temporary result.
So, here’s what’s needed:
- in
ProgParser.kt
rename the currentparseExpression
function toparseTerm
and create a new functionparseExpression
:
/** parse a term */ fun parseTerm() { code.setAccumulator(inp.getNumber()) } /** * parse a numeric expression * <term> +/- <term> */ fun parseExpression() { parseTerm() code.saveAccumulator() when (inp.nextChar) { addOp -> add() subOp -> subtract() else -> expected("Addop") } }
- In the same file create the two functions that will implement addition and subtraction:
/** parse an addition */ fun add() { inp.getNextChar() parseTerm() code.addToAccumulator() } /** parse a subtraction */ fun subtract() { inp.getNextChar() parseTerm() code.subFromAccumulator() }
- In
InputProgramScanned.kt
add these global variables outside ofclass InputProgramScanner
:
///////////////////////////////////////////////////////// // global special characters definitions ///////////////////////////////////////////////////////// // numeric operators val addOp: Char = '+' val subOp: Char = '-' ///////////////////////////////////////////////////////// /** this class implements the lexical scanner */ class InputProgramScanner(inputFile: String = "") { .......
- Add this to file
M68000Instructions.kt
/** save accumulator */
fun saveAccumulator() = outputCodeNl("MOVE D0,D1")
/** add temp result to accumulator */
fun addToAccumulator() = outputCodeNl("ADD D1,D0")
/** subtract temp result from accumulator */
fun subFromAccumulator()= outputCodeNl("SUB D1,D0")
Now run it. Try it with various expressions (combinations of two single digits separated by a + or a -) in your input file:
8+5 should give you:
MOVE #8,D0
MOVE D0,D1
MOVE #5,D0
ADD D1,D0
9-2 should give you
MOVE #9,D0
MOVE D0,D1
MOVE #2,D0
SUB D1,D0
Also, try some invalid expressions like 5+a or @-2 etc and see the error you are getting.
One thing to note here, is that our code is a bit inefficient. Storing 8 to D0 and then moving D0 to D1 could have been done in just one instruction (store 8 to D1). However, this is the drawback of the simple approach that Jack has chosen and I am following, which focuses on teaching us and on producing code that works.
Speaking of which, our code does not quite work! If you look at the subtraction example above and follow the assembly language instructions, you will see that it actually executes 2-9. We have to reverse the sign of the result in order to make it work. Please change the function subFromAccumulator
as below:
/** subtract temporay result from accumulator */ fun subFromAccumulator() { outputCodeNl("SUB D1,D0") // negate accumulator as the result is the wrong way round outputCodeNl("NEG D0") }
Try it now with a subtraction as input and verify the output.
At this point we have a parser that can recognise the sum or difference between two digits. This a big step ahead compared to earlier where we could only recognise a single digit. Try running it again with a single digit expression in the input file. Well… that did not work, did it? And why would it, we just told our compiler that an expression is two single-digit terms separated by a '+'
or a '-'
.
But in real life expressions can consist of one or more terms separated by addops ('+' or '-')
. In BNF:
<expression> ::= <term> [ <addop> <term> ] *
Looks like the function parseExpression needs to be modified with addition of a loop to match the above definition:
/** * parse a numeric expression * <expression> ::= <term> [ <addop> <term> ] * */ fun parseExpression() { parseTerm() while (inp.isAddop(inp.nextChar)) { code.saveAccumulator() when (inp.nextChar) { addOp -> add() subOp -> subtract() } } }
You may have noticed that the line else -> expected("Addop")
has been removed in this version. This is because it would have been an unreachable instruction.
Also note how well the new version of parseExpression
matches the BNF definition and this is the beauty in Jack’s method, which I personally find very powerful. Also don’t forget to add this function into our InputProgramScanner.kt
file:
/** check for an "addop" (+,-) */ fun isAddop(c: Char): Boolean = c == addOp || c == subOp
Compile this new version and try it. It should process any combination of any number of single-digit integers separated by '+'
or '-'
. Also verify that it handles properly error conditions.
Using the Stack
This is a great example of how powerful Jack’s method is, as he leads us from the simplest possible concept to more and more complex aspects.
So, imagine the expression:
2 + (3 + 4 * ( 8 – 6 ) / 5 )
In this example, we store the first intermediate result (2) in D1. Where will we store the 3 and so on? We will soon run out of CPU registers. The answer is, of course, the Stack.
To achieve this we will modify the functions saveAccumulator
, addToAccumulator
and subFromAccumulator
as per below:
/** push accumulator to the stack */ fun saveAccumulator() = outputCodeNl("MOVE D0,-(SP)") /** add top of stack to accumulator */ fun addToAccumulator() = outputCodeNl("ADD (SP)+,D0") /** subtract top of stack from accumulator */ fun subFromAccumulator() { outputCodeNl("SUB (SP)+,D0") // negate accumulator as the result is the wrong way round outputCodeNl("NEG D0") }
Please try the “compiler” again, observe the output and make sure we have not broken anything.
Multiplication and Division
And now it’s time to get more serious. As a primary school kid would point out, expressions can also have multiplications and divisions. They would also point out that we should execute these first and then any additions and subtractions. So, this raises the question of priority.
So in
2 + 3 * 4
we know that we need to multiply first and then add. And as you can guess, here’s where the stack comes handy.
Jack’s top-down parsing technique is ideal to support operator precedence. So, if we extend our current simplified definition of “term” from a single digit to “product of factors” or in BNF
<term> ::= <factor> [ <mulop> <factor> ] *
where factor here is what term used to be until now (i.e. a single digit), then we can see the amazing symmetry between term and expression. This makes it really simple to apply some copy-and-paste and produce the next set of functions to parse and handle multiplication and division.
- Rename function
parseTerm
toparseFactor
/** parse a factor */ fun parseFactor() { code.setAccumulator(inp.getNumber()) }
- Add the following to
ProgParser.kt
:
/** * parse a term * <term> ::= <factor> [ <mulop> <factor> ] * */ fun parseTerm() { parseFactor() while (inp.isMulop(inp.nextChar)) { code.saveAccumulator() when (inp.nextChar) { mulOp -> multiply() divOp -> divide() } } } /** parse a multiplication */ fun multiply() { inp.getNextChar() parseFactor() code.multiplyAccumulator() } /** parse a division */ fun divide() { inp.getNextChar() parseFactor() code.divideAccumulator() }
- Add this to
InputProgramScanner.kt
in the global vars section
val mulOp: Char = '*' val divOp: Char = '/'
- And inside
class InputProgramScanner
:
/** check for a "mulop" (*,/) */ fun isMulop(c: Char): Boolean = c == mulOp || c == divOp
- Add this to file
M68000Instructions.kt
/** multiply accumulator by top of stack */ fun multiplyAccumulator() = outputCodeNl("MULS (SP)+,D0") /** divide accumulator by top of stack */ fun divideAccumulator() { outputCodeNl("MOVE (SP)+,D1") outputCodeNl("DIVS D1,D0") }
Now try it out, e.g. with input “program” 3 + 4 * 2 – 6 / 3 and verify the assembly output. Amazing result – with a few more lines that we added in around 10 minutes, we have doubled the scope of our “compiler”.
Parenthesis
At this stage Jack wraps up this part of the parser by adding parentheses. As we all know, parentheses are a mechanism to force a desired operator precedence:
2 * ( 3 + 4 )
is telling us to add first and then multiply. With parenthesis we can create expressions of any degree of complexity such as:
( 1 + 2 ) / ( ( 3 + 4 ) + ( 5 – 6 ) )
The key to easy incorporate parenthesis into the parser, as Jack pointed out, is to realise that no matter how complicated the expression enclosed by the parentheses may be, to the outside world it just looks like a factor. So, one of the forms of a factor can be:
<factor> ::= ( <expression> ) | <number>
And this is where the recursion comes into the picture: an expression can contain a term that can contain a factor, which contains another expression that contains another term that contains another factor, etc., to infinity (or more realistically, until your Kotlin runtime environment runs out of memory).
Please modify the function parseFactor
as below:
/** * parse a factor * <factor> ::= ( <expression> ) | <number> */ fun parseFactor() { if (inp.isLeftParenthesis(inp.nextChar)) { inp.getNextChar() parseExpression() inp.match(rightParen) } else code.setAccumulator(inp.getNumber()) }
Also add this to InputProgramScanner.kt
in the global vars section:
// parentheses
val leftParen: Char = '('
val rightParen: Char = ')'
- And inside
class InputProgramScanner
:
/** check for left parenthesis */ fun isLeftParenthesis(c: Char): Boolean = c == leftParen
Once more, we can clearly see the direct analogy between BNF and our (i.e. Jack’s) code. As usual, try it out and give it lots of wrong input to see if it works.
Unary Minus
Before pausing until the next chapter, we need to fix one thing: those of you who have done rigorous testing, would have noticed that our compiler does not recognise -1 as a valid expression. Our parseExpression
function expects an expression to start with an integer. By the way, +2 does not work either.
A quick way (but not necessarily the best, as Jack admits) to fix this is to stick an imaginary leading zero in front of this kind of expressions, so that -1 becomes 0-1 and +2 becomes 0+2.
Please modify parseExpression
as per below:
/** * parse a numeric expression * <expression> ::= <term> [ <addop> <term> ] * */ fun parseExpression() { if (inp.isAddop(inp.nextChar)) // "trick" to deal with -Expression or +Expression code.clearAccumulator() else parseTerm() while (inp.isAddop(inp.nextChar)) { code.saveAccumulator() when (inp.nextChar) { addOp -> add() subOp -> subtract() } } }
Also add the following function to our 68k Instruction set:
/** clear accumulator */
fun clearAccumulator() = outputCodeNl("CLR D0")
When you test your code please remember what it’s supposed to be doing at this stage: it can process any numerical expression (but remember, only one) consisting of a combination of single-digit integers with addition, subtraction, multiplication and division, including parentheses.
A note on my programming style. First, let say once more that I fully agree with Jack’s approach of keeping things simple. Nevertheless, I’ve made a couple of changes in his code:
- I’ve declared global variables for the special symbols like numerical operators, parenthesis, etc. Even though this adds lines of code, I personally prefer not to have any direct reference to constant values anywhere in my program other than once as a global declaration. This way if e.g. you decide to use
'&'
instead of'+'
or'{}'
instead of'()'
you just change it once in the global declaration and your compiler will start recognising'&'
as the addition operator and'{'
as the left parenthesis. - I have changed the call to
match
to a call togetNextChar
in the beginning ofadd
,subtract
,multiply
, etc. The reason is that the call tomatch
was partly redundant as the symbol (e.g. the'+', '-',
etc) has already been matched and the call tomatch
only served the purpose of advancing to the next input character. Hence, I have replaced it withgetNextChar
. I would not want the reader to start wondering why we are callingmatch
there and whether he or she is missing something. Jack had already mentioned this point in his original tutorial and made it clear that he prefers to always use match for standardisation. I hope my decision to change this will not cause any issues with following Jack’s code at a later stage. If this proves not to be the case then I would be more than happy to reverse my decision and reinstate the call tomatch
.
And that’s it for this round. At this point Jack gave us in the original blog “a word about optimisation”, which I would invite you to read as it’s quite useful reading that adds more colour into the purpose and goals of Jack’s tutorial.
Coming up next: More Expressions – Variables, Functions