Chapter 2: Numerical Expression Parsing

Based on the original series “Let’s Build a Compiler!” by Jack Crenshaw.

Getting Started

Those of you who have already read Chapter 1 will know what this is about. You will also have copied the cradle code in your Kotlin IDE, complied it and ready to go.

In this chapter Jack teaches us how to parse and translate numerical expressions. For purposes of definition a numerical expression is the right hand of an equation, as in:

x = 2 * y + 3 / (4 * z)

I will follow Jack’s method and will take this in *very* small steps.

Single Digits

In keeping with the simplicity theme, we are starting with the absolutely simplest case that we can think of: a numerical expression consisting of a single digit.

Let’s add this code:

  • Add the line parseExpression to the main function
/** main function */
fun main(args: Array<String>) {
initCompiler(args)
parseExpression()
}
  • Add the code variable to the main program CompilerMain.kt in the global variables definition section. This object will have all the necessary functions to produce the assembly code for our target cpu.
// the input program scanner
lateinit var inp: InputProgramScanner
// the Motorola 68000 instruction set
val code = M68000Instructions()
  • New file ProgParser.kt

We will add all the expression parsing functions into this file.

package mpdev.compiler.m68kcompiler
/**
 * Program parsing - module
 *
 */

/** parse a numeric expression */
fun parseExpression() {
        code.setAccumulator(inp.getNumber())
}
  • New file M68000Instructions.kt

This is where we will produce the 68000 Assembly code for each operation. If we do this properly, we should be able to change the functions in this module and produce code for some other microprocessor (or the TI59 which is the end-goal of this blog).

As always I will keep with Jack’s standard and use the D0 register of the 68000 as the register that holds the result, which going forward I will call “Accumulator” to keep with the Z80 architecture.

Please note the functions outputCode and outputCodeNl : they send the resulting code to the output stream without or with a newline respectively.

package mpdev.compiler.m68kcompiler

class M68000Instructions {

    /** output code */
    fun outputCode(s: String) = print("\t$s")

    /** output code with newline */
    fun outputCodeNl(s: String) = outputCode("\t$s\n")

    /** set accumulator to a value */
    fun setAccumulator(value: String) = outputCodeNl("MOVE #${value},D0")

}

Now compile and run our compiler with a text file as the only command line argument containing our “program”: any single decimal digit. Observe the corresponding MOVE assembly instructions that will be produced on your screen. Also, try it with an input file that contains a letter, a special character or a space to test the error handling.

So, we now have a working translator!

OK, it’s pretty limited, but don’t be too quick to dismiss it. It does, on a very limited scale, what a proper compiler should do: it correctly recognises legal statements in the input language as per our definition, it produces correct, executable assembler code, suitable for assembling into object code, and, just as importantly, it correctly recognises statements that are not legal and produces a meaningful error message.

Binary (Two-operand) Expressions

Let’s expand a bit. Admittedly, an “expression” consisting of only one character is not going to take us anywhere far, so let’s see how we can extend it. Naturally we would want to handle expressions like:

1+2

5-3

or similarly to Backus-Naur Form:

<term> + | - <term>

For this, we will need a function that recognises a term and leaves the result somewhere, and a function that recognises and distinguishes between '+' and '-' and produces the appropriate code. Jack has chosen the 68000 register D1 for holding that temporary result.

So, here’s what’s needed:

  • in ProgParser.kt rename the current parseExpression function to parseTerm and create a new function parseExpression :
/** parse a term */
fun parseTerm() {
        code.setAccumulator(inp.getNumber())
}

/**
 * parse a numeric expression
 * <term> +/- <term>
 */
fun parseExpression() {
        parseTerm()
        code.saveAccumulator()
        when (inp.nextChar) {
                addOp -> add()
                subOp -> subtract()
                else -> expected("Addop")
        }
}
  • In the same file create the two functions that will implement addition and subtraction:
/** parse an addition */
fun add() {
        inp.getNextChar()
        parseTerm()
        code.addToAccumulator()
}

/** parse a subtraction */
fun subtract() {
        inp.getNextChar()
        parseTerm()
        code.subFromAccumulator()
}
  • In InputProgramScanned.kt add these global variables outside of class InputProgramScanner:
/////////////////////////////////////////////////////////
// global special characters definitions
/////////////////////////////////////////////////////////
// numeric operators
val addOp: Char = '+'
val subOp: Char = '-'
/////////////////////////////////////////////////////////

/** this class implements the lexical scanner */
class InputProgramScanner(inputFile: String = "") {
.......
  • Add this to file M68000Instructions.kt
/** save accumulator */
fun saveAccumulator() = outputCodeNl("MOVE D0,D1")

/** add temp result to accumulator */
fun addToAccumulator() = outputCodeNl("ADD D1,D0")

/** subtract temp result from accumulator */
fun subFromAccumulator()= outputCodeNl("SUB D1,D0")

Now run it. Try it with various expressions (combinations of two single digits separated by a + or a -) in your input file:

8+5 should give you:

    MOVE #8,D0
    MOVE D0,D1
    MOVE #5,D0
    ADD D1,D0

9-2 should give you

    MOVE #9,D0
    MOVE D0,D1
    MOVE #2,D0
    SUB D1,D0

Also, try some invalid expressions like 5+a or @-2 etc and see the error you are getting.

One thing to note here, is that our code is a bit inefficient. Storing 8 to D0 and then moving D0 to D1 could have been done in just one instruction (store 8 to D1). However, this is the drawback of the simple approach that Jack has chosen and I am following, which focuses on teaching us and on producing code that works.

Speaking of which, our code does not quite work! If you look at the subtraction example above and follow the assembly language instructions, you will see that it actually executes 2-9. We have to reverse the sign of the result in order to make it work. Please change the function subFromAccumulator as below:

/** subtract temporay result from accumulator */
fun subFromAccumulator() {
    outputCodeNl("SUB D1,D0")
    // negate accumulator as the result is the wrong way round
    outputCodeNl("NEG D0")
}

Try it now with a subtraction as input and verify the output.

At this point we have a parser that can recognise the sum or difference between two digits. This a big step ahead compared to earlier where we could only recognise a single digit. Try running it again with a single digit expression in the input file. Well… that did not work, did it? And why would it, we just told our compiler that an expression is two single-digit terms separated by a '+' or a '-'.

But in real life expressions can consist of one or more terms separated by addops ('+' or '-'). In BNF:

<expression> ::= <term> [ <addop> <term> ] *

Looks like the function parseExpression needs to be modified with addition of a loop to match the above definition:

/**
 * parse a numeric expression
 * <expression> ::= <term> [ <addop> <term> ] *
 */
fun parseExpression() {
        parseTerm()
        while (inp.isAddop(inp.nextChar)) {
                code.saveAccumulator()
                when (inp.nextChar) {
                        addOp -> add()
                        subOp -> subtract()
                }
        }
}

You may have noticed that the line else -> expected("Addop") has been removed in this version. This is because it would have been an unreachable instruction.

Also note how well the new version of parseExpression matches the BNF definition and this is the beauty in Jack’s method, which I personally find very powerful. Also don’t forget to add this function into our InputProgramScanner.kt file:

/** check for an "addop" (+,-) */
fun isAddop(c: Char): Boolean = c == addOp || c == subOp 

Compile this new version and try it. It should process any combination of any number of single-digit integers separated by '+' or '-'. Also verify that it handles properly error conditions.

Using the Stack

This is a great example of how powerful Jack’s method is, as he leads us from the simplest possible concept to more and more complex aspects.

So, imagine the expression:

2 + (3 + 4 * ( 8 – 6 ) / 5 )

In this example, we store the first intermediate result (2) in D1. Where will we store the 3 and so on? We will soon run out of CPU registers. The answer is, of course, the Stack.

To achieve this we will modify the functions saveAccumulator, addToAccumulator and subFromAccumulator as per below:

/** push accumulator to the stack */
fun saveAccumulator() = outputCodeNl("MOVE D0,-(SP)")

/** add top of stack to accumulator */
fun addToAccumulator() = outputCodeNl("ADD (SP)+,D0")

/** subtract top of stack from accumulator */
fun subFromAccumulator() {
    outputCodeNl("SUB (SP)+,D0")
    // negate accumulator as the result is the wrong way round
    outputCodeNl("NEG D0")
}

Please try the “compiler” again, observe the output and make sure we have not broken anything.

Multiplication and Division

And now it’s time to get more serious. As a primary school kid would point out, expressions can also have multiplications and divisions. They would also point out that we should execute these first and then any additions and subtractions. So, this raises the question of priority.

So in

2 + 3 * 4

we know that we need to multiply first and then add. And as you can guess, here’s where the stack comes handy.

Jack’s top-down parsing technique is ideal to support operator precedence. So, if we extend our current simplified definition of “term” from a single digit to “product of factors” or in BNF

<term> ::= <factor> [ <mulop> <factor> ] *

where factor here is what term used to be until now (i.e. a single digit), then we can see the amazing symmetry between term and expression. This makes it really simple to apply some copy-and-paste and produce the next set of functions to parse and handle multiplication and division.

  • Rename function parseTerm to parseFactor
/** parse a factor */
fun parseFactor() {
        code.setAccumulator(inp.getNumber())
}
  • Add the following to ProgParser.kt:
/**
 * parse a term
 * <term> ::= <factor> [ <mulop> <factor> ] *
 */
fun parseTerm() {
        parseFactor()
        while (inp.isMulop(inp.nextChar)) {
                code.saveAccumulator()
                when (inp.nextChar) {
                        mulOp -> multiply()
                        divOp -> divide()
                }
        }
}

/** parse a multiplication */
fun multiply() {
        inp.getNextChar()
        parseFactor()
        code.multiplyAccumulator()
}

/** parse a division */
fun divide() {
        inp.getNextChar()
        parseFactor()
        code.divideAccumulator()
}
  • Add this to InputProgramScanner.kt in the global vars section
val mulOp: Char = '*'
val divOp: Char = '/'
  • And inside class InputProgramScanner:
/** check for a "mulop" (*,/) */
fun isMulop(c: Char): Boolean = c == mulOp || c == divOp
  • Add this to file M68000Instructions.kt
/** multiply accumulator by top of stack */
fun multiplyAccumulator() = outputCodeNl("MULS (SP)+,D0")

/** divide accumulator by top of stack */
fun divideAccumulator() {
    outputCodeNl("MOVE (SP)+,D1")
    outputCodeNl("DIVS D1,D0")
}

Now try it out, e.g. with input “program” 3 + 4 * 2 – 6 / 3 and verify the assembly output. Amazing result – with a few more lines that we added in around 10 minutes, we have doubled the scope of our “compiler”.

Parenthesis

At this stage Jack wraps up this part of the parser by adding parentheses. As we all know, parentheses are a mechanism to force a desired operator precedence:

2 * ( 3 + 4 )

is telling us to add first and then multiply. With parenthesis we can create expressions of any degree of complexity such as:

( 1 + 2 ) / ( ( 3 + 4 ) + ( 5 – 6 ) )

The key to easy incorporate parenthesis into the parser, as Jack pointed out, is to realise that no matter how complicated the expression enclosed by the parentheses may be, to the outside world it just looks like a factor. So, one of the forms of a factor can be:

<factor> ::= ( <expression> ) | <number> 

And this is where the recursion comes into the picture: an expression can contain a term that can contain a factor, which contains another expression that contains another term that contains another factor, etc., to infinity (or more realistically, until your Kotlin runtime environment runs out of memory).

Please modify the function parseFactor as below:

/**
 * parse a factor
 * <factor> ::= ( <expression> ) | <number>
 */
fun parseFactor() {
        if (inp.isLeftParenthesis(inp.nextChar)) {
              inp.getNextChar()
              parseExpression()
              inp.match(rightParen)  
        }
        else
              code.setAccumulator(inp.getNumber())
}

Also add this to InputProgramScanner.kt in the global vars section:

// parentheses
val leftParen: Char = '('
val rightParen: Char = ')'
  • And inside class InputProgramScanner:
/** check for left parenthesis */
fun isLeftParenthesis(c: Char): Boolean = c == leftParen

Once more, we can clearly see the direct analogy between BNF and our (i.e. Jack’s) code. As usual, try it out and give it lots of wrong input to see if it works.

Unary Minus

Before pausing until the next chapter, we need to fix one thing: those of you who have done rigorous testing, would have noticed that our compiler does not recognise -1 as a valid expression. Our parseExpression function expects an expression to start with an integer. By the way, +2 does not work either.

A quick way (but not necessarily the best, as Jack admits) to fix this is to stick an imaginary leading zero in front of this kind of expressions, so that -1 becomes 0-1 and +2 becomes 0+2.

Please modify parseExpression as per below:

/**
 * parse a numeric expression
 * <expression> ::= <term> [ <addop> <term> ] *
 */
fun parseExpression() {
        if (inp.isAddop(inp.nextChar))
                // "trick" to deal with -Expression or +Expression
                code.clearAccumulator()
        else
                parseTerm()
        while (inp.isAddop(inp.nextChar)) {
                code.saveAccumulator()
                when (inp.nextChar) {
                        addOp -> add()
                        subOp -> subtract()
                }
        }
}

Also add the following function to our 68k Instruction set:

/** clear accumulator */
fun clearAccumulator() = outputCodeNl("CLR D0")

When you test your code please remember what it’s supposed to be doing at this stage: it can process any numerical expression (but remember, only one) consisting of a combination of single-digit integers with addition, subtraction, multiplication and division, including parentheses.

A note on my programming style. First, let say once more that I fully agree with Jack’s approach of keeping things simple. Nevertheless, I’ve made a couple of changes in his code:

  • I’ve declared global variables for the special symbols like numerical operators, parenthesis, etc. Even though this adds lines of code, I personally prefer not to have any direct reference to constant values anywhere in my program other than once as a global declaration. This way if e.g. you decide to use '&' instead of '+' or '{}' instead of '()' you just change it once in the global declaration and your compiler will start recognising '&' as the addition operator and '{' as the left parenthesis.
  • I have changed the call to match to a call to getNextChar in the beginning of add, subtract, multiply, etc. The reason is that the call to match was partly redundant as the symbol (e.g. the '+', '-', etc) has already been matched and the call to match only served the purpose of advancing to the next input character. Hence, I have replaced it with getNextChar. I would not want the reader to start wondering why we are calling match there and whether he or she is missing something. Jack had already mentioned this point in his original tutorial and made it clear that he prefers to always use match for standardisation. I hope my decision to change this will not cause any issues with following Jack’s code at a later stage. If this proves not to be the case then I would be more than happy to reverse my decision and reinstate the call to match.

And that’s it for this round. At this point Jack gave us in the original blog “a word about optimisation”, which I would invite you to read as it’s quite useful reading that adds more colour into the purpose and goals of Jack’s tutorial.

Coming up next: More Expressions – Variables, Functions

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s