Chapter VII: Lexical Scanning

30th Nov 202129th Nov 2021 / Marinos / Leave a comment

Based on the original series “Let’s Build a Compiler!” by Jack Crenshaw.

Introduction

In the previous 6 chapters we have covered the basic structures of a programming language, like numerical expressions, Boolean expressions and control structures, and by now we have managed to build a basic compiler that can parse something that looks like a programming language. The biggest restriction we still have in place is the single-character tokens, which make our programming language look a bit funny. And this restriction we will get rid of in this chapter.

Until now I have followed closely Jack’s original tutorial (apart from some minor exceptions) and that has given us a great benefit in terms of delivering a working “product”, by using small blocks and continuously building on what we’ve done so far, following Jack’s step-by-step approach, while at the same time, keeping it simple.

In this chapter I will change approach and will introduce some of my own enhancements. The reasons for this are (a) having benefited and learnt from Jack’s tutorial, I feel I can start making my own experiments and see where this can lead us (b) I’m keen to use the capabilities that a language like Kotlin gives us and (c) I want to start focusing on my original goal, which is to produce code for the TI-59. Hence the different numbering style from this chapter on, using Latin numbers.

Nevertheless, I will continue using ideas and concepts from Jack’s following chapters, and will be mentioning these clearly in my text. I strongly believe that the original “Let’s Build a Compiler!” series is a masterpiece, and this is proven by the fact that it has been adapted and reproduced numerous times.

The Various Parts of our Compiler

You may have noticed that our compiler code is split in various files. This is not by accident – the split represents the 3 basic modules a compiler consists of: the Lexical Scanner, the Parser and the Code Generator.

The file InputProgramScanner.kt has all the functions that implement the lexical scanning, i.e. the breaking down of the input stream to distinct tokens, which are then passed to the next stage. It is here that we deal with all the detail about handling white spaces, newlines, and other input format specifics.

The three files ProgParser*.kt implement the Parser. I have split this in 3 sub-modules to make it more structured and easier to follow (one for numerical expressions, one for Boolean expressions and one for control structures). It is in the parser where the language structures are recognised and processed so that they can be translated to object code in the next stage.

Finally, the file M68000Instructions.kt implements the code generator. this is where the specific instructions for the target machine are produced.

I used this separation in these 3 modules from the beginning by using common sense and by following the modular approach that I always like to follow – write small pieces of code that interact with each other, each of which does a small part of the whole job, and if this little part had to be changed, then only that corresponding small module would need need to be changed. E.g. if we want to produce code for 80×86 instead of M68000, we just have to replace the code generator, while everything else remains as is. Similarly, if we want to start using '.AND.' instead of '&' for the Boolean “and” operator we just have to change this in our lexical scanner so that it can recognise the correct token (and we will be doing this in the next chapter).

The Approach

The main concept in Jack’s tutorial is the lookahead character (where “character” means “token” in the single-character token world). By doing a simple check on this lookahead character, our parser knows the next token and can make a decision as to what follows and therefore take the appropriate path. And I intend to keep this concept that Jack taught us.

In order to convert form single-character to multi-character tokens we obviously need to change the value of the token from char to string. But by doing this, we are losing the benefit that the single character gave us, which was by doing one single comparison (character comparison) we knew exactly what followed (which keyword, which operator, identifier, number, etc). If we try to do this with a string instead, we will end up with quite complex and convoluted IF statements which would be against the concept of simplicity. E.g. if you see the character 'I' then you would naturally check the following character to see if it’s an 'F' so you have the string 'IF'. But in this case how do you know it’s the keyword 'IF' and not the beginning of a variable name called 'IFILE'? Similarly, if you see a '+', how do you know it’s the operator '+' and not '++' or '+='?

So, it looks like we need to extend the definition of our token, and, apart from it’s value (string), include other attributes, like an encoded value to make the comparison easier, and also a category to help us group tokens together (e.g. addops). This thinking leads us to the use of enumerated values for these attributes and also points towards using an object for the token as opposed to a single value (string).

This is more or less what Jack has done by using a pair of variables Token and Value. In Token he holds the encoded token (e.g. number, identifier, if-token, else-token, operator) while in Value he holds the actual value of it (e.g. the identifier name, or the actual number)

Token Redefined

So, based on this thinking, let’s see how our token definition will look like:

class Token(val value: String = NO_TOKEN,
            val encToken: Kwd = Kwd.none,
            val info: TokInfo = TokInfo.none,
            val type: TokType = TokType.none)

val NO_TOKEN = "No Token"

The value is holding the string value of the token as it’s received (actually in our case it’s been converted to upper case). The encToken is holding the encoded value of the token that is used to drive the decisions in the parser. The enumeration for this can be something like:

enum class Kwd { addOp, subOp, mulOp, divOp, equalsOp, ifToken, elseToken, endifToken, whileToken, endwhileToken, repeatToken, untilToken, ... ..., endOfInput, any, invalid, none }

By using this enumeration, a single comparison (similar to the character comparison we have used so far) is good enough for the parser, so that it knows exactly what follows.

The info will be something like:

enum class TokInfo { identifier, keyword, operator, number, invalid, none }

While the type can be something like:

enum class TokType { addOps, mulOps, booleanLit, endOfBlock, endOfPRogram, endOfInput, invalid, none }

You can see that by using the TokType enumeration, functions like isAddop, isEndOfBlock, etc. can now be replaced by a simple check of the type attribute.

At this stage I’m not entirely convinced we need the info attribute – we will figure this out by the end of the next chapter.

Having defined how our tokens will look like from now on (until now they were single characters), let’s see what we need to do in order to identify them. I’m definitely in favour of Jack’s design, and intend to keep match as our main function of getting (and checking) tokens in the parser. However, match has to be changed now to compare the encToken instead of the single character that it used to compare:

/**
 * get the next token from the input stream and advance the cursor
 * match this token against a specific given token 'x'
 * also produce a match if called with no token or if token is "any"
 * return the token object that has been matched
 * called by the parser functions
 */
fun match(x: Kwd = Kwd.any): Token {
    if (nextToken.encToken != x && x != Kwd.any)
        expected(decodeToken(x))
    val thisToken = nextToken
    nextToken = scan()
    return thisToken
}

In this version of the scanner, nextToken replaces nextChar (as a concept , that is – nextChar is still used internally in the scanner):

// the next token is here so that we can look ahead
private lateinit var nextToken: Token

There are two more changes in this version of match (apart from comparing the encToken instead of char): The first one is that it now returns the token that it matched (or any token found if it has been asked not to match any). The reason for this is that by returning the token to the caller, match can now be used as the main interface between the parser and the scanner (which was also Jack’s original intention). The second difference is that it now calls scan (instead of getNextChar) to advance to the next token. This was necessary as our tokens are not single characters anymore, so we need to do a bit more work in order to get the next token. I will speak about scan shortly – this is where the intelligence in identifying the next token is going to be.

We have also seen that the parser needs to occasionally check the value of the next token without advancing the “cursor” and so far this has been done by checking the value of nextChar. I would not want to directly access nextToken from the parser, I would prefer this, from the structural point of view, to be a private variable in our scanner class. So we need one more function in our scanner to return the value of the next token, to fulfil the lookahead function without advancing the cursor:

/** lookahead function - returns next token without advancing the cursor */
fun lookahead(): Token = nextToken

So with this, the only two points of interaction between the parser and the scanner are match (to match a token and advance to the next one or just advance to the next one) and lookahead (to check the next token but do not advance).

Scan

And now we are getting to scan which is the major change in our scanner. I’m using the same name and same concept as Jack here but my implementation of scan is slightly different, as my intention is to make it as generic as possible, so that it covers all possible scenarios.

Let’s go down the list of scenarios that scan has to go through, starting from the simplest case and going to more and more complex ones.

The first case is when we encounter the end of the input stream (marked by char(0) in nextChar if you remember). In this scenario scan returns a special token that indicates end-of-input:

if (nextChar == endOfInput)  // check for end of input
    return Token(END_OF_INPUT, Kwd.endOfInput, TokInfo.none, TokType.none)

The next obvious case is when nextChar is a number, in which case scan reads and returns that numeric value:

if (isNumeric(nextChar))            // check for number
    return Token(getNumber(), Kwd.number, TokInfo.number, TokType.none)

Then we have the case where the next character is alphabetic. In this case scan will read that alphanumeric token, but then what? Our gut feel tells that we should check first to see if it is one of the reserved keywords, and if not, it will be an identifier:

val indx: Int
if (isAlpha(nextChar)) {            // check for name
    val name = getName()
    indx = isKeyword(name)
    return (if (indx >= 0) languageTokens[indx]  /* keyword found */
            else Token(name, Kwd.identifier, TokInfo.identifier, TokType.none))  /* identifier found */
}

Having gone through these cases, we are now across a special character, most likely an operator but how can we identify it? The most generic approach I could think of (but maybe not the simplest one) is to check the input stream (starting at the next character) against the predefined reserved sequences and see if any of them appears at the start of what remains of the input stream. We need to compare first against the longest sequences so that if our next character is e.g. '/' we check first for '/=' and also for '/*' (if this is our comment delimiter) and then we check against '/' which will match. Same goes for '==' vs '=', '<=' vs '<' etc. And this is the most complex part of scan:

indx = specSeqPresent(languageTokens)  // we have a special character - check for special sequence
if (indx >= 0)
    return(getSpecSeq(languageTokens, indx))

I have tried to simplify this part by moving the checking for a special sequence and the retrieval of that special sequence, if one is found, to different routines (listed further down).

Finally, if everything so far has failed, then we’ve come across an invalid token and this is what scan will return:

// at this point we have an invalid token
val thisChar = nextChar
getNextChar()
return Token(thisChar.toString(), Kwd.invalid, TokInfo.invalid, TokType.invalid)

Here’s the full listing of scan and the three supporting functions isKeyword, specSeqPresent and getSpecSeq:

/** get the next token and advance the "cursor" */
private fun scan(): Token {
    if (nextChar == endOfInput)  // check for end of input
        return Token(END_OF_INPUT, Kwd.endOfInput, TokInfo.none, TokType.none)

    if (isNumeric(nextChar))            // check for number
        return Token(getNumber(), Kwd.number, TokInfo.number, TokType.none)

    val indx: Int
    if (isAlpha(nextChar)) {            // check for name
        val name = getName()
        indx = isKeyword(name)
        return (if (indx >= 0) languageTokens[indx]  /* keyword found */
            else Token(name, Kwd.identifier, TokInfo.identifier, TokType.none))  /* identifier found */
    }

    indx = specSeqPresent()            // we have a special character - check for special sequence
    if (indx >= 0)
        return(getSpecSeq(indx))

    val thisChar = nextChar           // at this point we have an invalid token
    getNextChar()
    return Token(thisChar.toString(), Kwd.invalid, TokInfo.invalid, TokType.invalid)
}

/** check if a specific name is a keyword */
private fun isKeyword(name: String): Int {
    if (cursor >= inputProgram.length)         // check for end of input
        return -1
    for (i in languageTokens.indices) {
        if (languageTokens[i].value.equals(name, true))  // check for keyword match
            return i
    }
    return -1
}

/**
 * check the beginning of the remaining input for special sequence (e.g. operator)
 * returns the index in our keywords list if found or -1 if not
 */
private fun specSeqPresent(): Int {
    if (cursor >= inputProgram.length)         // check for end of input
        return -1
    for (i in languageTokens.indices) {
        val tokenValue = languageTokens[i].value
        if (inputProgram.substring(cursor).startsWith(tokenValue))  // check for keyword match
            return i
    }
    return -1
}

/** get a special sequence from input (keyword or operator  */
private fun getSpecSeq(indx: Int): Token {
    if (indx >= languageTokens.size)
        return Token(NO_TOKEN, Kwd.none, TokInfo.none, TokType.none)
    val t = languageTokens[indx]
    cursor = min(cursor+t.value.length, inputProgram.length)
    nextChar = inputProgram[cursor]
    skipWhite()
    return t
}

You may have noticed a new variable above, languageTokens. This is the list of all the keywords, operators and other special sequences used in our language. This is fully configurable and by changing the contents of this list we can change the look and feel of the language.

var languageTokens = mutableListOf<Token>()

The declaration and initialisation of this list is done in LanguageTokens.kt, which is effectively where we define all the keywords and operators of our language.

The Final Touches

On top of what we just mentioned, there are a couple of changes still required in our scanner:

Add this statement in the import section:

import kotlin.math.min

Modify slightly the init function of the scanner class:

/** initialisation code class InputProgramScanner */
init {
    try {
        val f = File(inputFile)
        // read the whole program into a string
        inputProgram = f.readText()
        // init the list of tokens for our language
        initKeywords()
        initOperators()
        // set the lookahead character to the first input char and skip any white spaces
        nextChar = inputProgram[0]
        skipWhite()
        // get the first token from input
        nextToken = scan()
    } catch (e: Exception) {
        abort("$e")
    }
}

Rename the indx variable to cursor (I think it’s more appropriate) and change the skipNextChar function as follows (cursor now pointing to the current character that’s coming up in the input stream as opposed to the one after):

/** set the lookahead character to the next char from input */
private fun skipNextChar() {
    if (cursor < inputProgram.length-1)
        nextChar = inputProgram[++cursor]
    else
        nextChar = endOfInput
}

Remove all the global definitions of all the keywords and operators and the relevant functions to recognise them. Also remove the declaration of the blockTerminators set.

Move the endOfInput variable from the global section to inside the InputProgramScanner class and make it private. Also make all other variables in this class private.

// end of input mark
private val endOfInput = nullChar.toChar()

Modify the function that checks for end of Program:

/** check for end of program - called by parseBlock */
fun isEndOfProgram(): Boolean = nextToken.encToken == Kwd.endOfProgram ||
    nextToken.encToken == Kwd.endOfInput

Remove the function that checks for end of block – not needed anymore.

Add the decodeToken function that helps expected print the expected token value in case of error:

/** decode an encoded token to token name */
private fun decodeToken(token: Kwd): String {
    for (i in languageTokens.indices)
        if (languageTokens[i].encToken == token)
            return languageTokens[i].value
    return "*******"
}

Move the function expected from CompilerMain to InputProgramScanner as this the only place it is used. This function is now simplified:

/** report what was expected and abort */
private fun expected(expMsg: String) {
    abort("Expected $expMsg \n found [${nextToken.value}]")
}

Remove the getBoolean function (it’s functionality is now covered by scan). Also isBoolean is not needed either.

To test these changes you can use only these three files CompilerMain.kt, InputProgramScanner.kt. If you have managed to follow these incremental changes I’ve described here, these two files should be in good shape – if not, you can take them from the GitHub repository. The third file is LanguageTokens.kt. Please take this from GitHub (URL below). Finally, you will need to change the main function as follows:

/** main function */
fun main(args: Array<String>) {
    initCompiler(args)
    while(true) {
        val t: Token = inp.lookahead()
        println("[${t.value}], [${t.encToken}], [${t.info}], [${t.type}]")
        if (t.encToken == Kwd.endOfInput)
            break
        inp.match()
    }
}

You can also comment out the following line in CompilerMain.kt as we will not be producing code during this test.

//val code = M68000Instructions()

You can now compile it and run it with any input consisting of any combination of numbers, words, operators and other special characters. It will recognise the tokens in this input sequence as per definitions in LanguageTokens and will print them one by one. Give it an input like this:

Program Test

if condition1
    alpha = 5
    beta = 6
else
    x = 1
    y = 2
endif

pi=3.14

EndProgram

It should produce this output:

[PROGRAM], [startOfProgram], [keyword], [none]
[TEST], [identifier], [identifier], [none]
[IF], [ifToken], [keyword], [none]
[CONDITION1], [identifier], [identifier], [none]
[ALPHA], [identifier], [identifier], [none]
[=], [equalsOp], [operator], [mulOps]
[5], [number], [number], [none]
[BETA], [identifier], [identifier], [none]
[=], [equalsOp], [operator], [mulOps]
[6], [number], [number], [none]
[ELSE], [elseToken], [keyword], [endOfBlock]
[X], [identifier], [identifier], [none]
[=], [equalsOp], [operator], [mulOps]
[1], [number], [number], [none]
[Y], [identifier], [identifier], [none]
[=], [equalsOp], [operator], [mulOps]
[2], [number], [number], [none]
[ENDIF], [endifToken], [keyword], [endOfBlock]
[PI], [identifier], [identifier], [none]
[=], [equalsOp], [operator], [mulOps]
[3], [number], [number], [none]
[.], [invalid], [invalid], [invalid]
[14], [number], [number], [none]
[ENDPROGRAM], [endOfProgram], [keyword], [endOfPRogram]
[End of Input], [endOfInput], [none], [none]

You can see how each token is recognised and categorised and how it is translated from string to an encoded token so that it can be easily processed by the parser. Play with it, feel free to change the token definitions in LanguageTokens.kt and see how the tokens you define will be recognised. Try any combinations you can think of to see if you can break it.

I will pause here to let you digest and experiment with this and will be back shortly.

And as always, you can find the code in my GitHub repository.

Coming up next: Introducing TINSEL

Chapter 6: Boolean Expressions

23rd Nov 202111th Nov 2021 / Marinos / Leave a comment

Based on the original series “Let’s Build a Compiler!” by Jack Crenshaw.

Introduction

In the previous chapter we completed the handling of Control Structures and developed a rich set of parsing functions that can translate them and produce object code. However, that left a big gap in our compiler’s capabilities: we did not parse the conditions. To temporarily close that gap, we used a dummy parseCondition function that is just a placeholder for the real thing.

The General Plan

So far, Jack has lead us straightaway into writing code that deals with the issue at hand, without spending too much time in planning. And this worked well so far, as the topics covered were well defined: we know exactly what a '+' means and what we are supposed to do with it. The same is true for a while or a repeat structure: we know exactly how to handle it.

The way programming languages implement Boolean logic however, can vary quite a bit from language to language, so Jack’s advice at this point is, before we start any serious coding, let’s first make decisions as to what we want. And the way to do this, is at the level of the BNF syntax rules (the Grammar), which, as we’ve seen already, translates almost directly to code.

The Grammar

In chapters 2 and 3 we have been mentioning the BNF syntax for the various numerical expressions. It would be a good idea to summarize them here again:

    <expression> ::= [ <unary op> ] <term> [ <addop> <term> ] *
    <term>       ::= <factor> [ <mulop> <factor> ] *
    <factor>     ::= <integer> | <identifier> | ( <expression> )

The good thing about this grammar is that it enforces the operator precedence hierarchy as we would expect. At this point Jack wants us to amend this grammar and suggests that it’s better written like this:

    <expression>     ::= <term> [ <addop> <term> ] *
    <term>           ::= <signed factor> [ <mulop> <factor> ] *
    <signed factor>  ::= [ <addop> ] <factor>
    <factor>         ::= <integer> | <identifier> | ( <expression> )

This way the task to handle the unary minus falls on parseFactor, which is where it really belongs. This will be the syntax that we will be using from now on and we will adapt the parsing routines as necessary.

Now, if we try to define the corresponding grammar for Boolean algebra, the result will be very similar to the above:

    <b-expresssion>  ::= <b-term> [ <orop> <b-term> ] *
    <b-term>         ::= <not-factor> [ <andop> <not-factor> ] *
    <not-factor>     ::= [ NOT ] <b-factor>
    <b-factor>       ::= <b-literal> | <b-identifier> | ( <b-expression> )

In this grammar the <orop> (e.g. OR or XOR) is analogous to '+', the <andop> (e.g. AND) is analogous to '*' and the NOT is analogous to unary minus. There is a little difference though between the way the NOT and the unary minus are handled. In a numerical expression, the unary minus goes with the whole term, so it can only appear once in a give term. This means that an expression like

    a * -b

or even worse

    a - -b

are not allowed. However a Boolean expression like the below

a AND NOT b

makes perfect sense. And this is reflected in the difference between the <term> and <b-term> definitions above.

Relops

So far we have defined the syntax of numeric and Boolean expressions. The question is what happens when we combine the two. In case you are wondering, the reason why we need to combine them is the conditions in our control structures. In some scenarios the condition can consist of a pure Boolean expression like:

    IF a AND NOT b THEN ...

However, more often we will see something like:

    IF (x >=0) AND (x <=100) THEN ...

In this case the two terms in parenthesis are Boolean expressions but the individual terms compared (x, 0, 100) are numeric. The relational operators (>= and <=) are the catalysts (as Jack called them) that help Boolean and Numeric ingredients merge together.

In the above example the two terms compared are just terms, but generally they could be expressions. So a relation in NBNF could be defined as:

    <relation>   ::= <expression> <relop> <expression>

A relop is any of the well know relational operators: ==, !=, <, >, <=, >=

Jack points out here that this relation has a single Boolean value TRUE or FALSE, so it’s effectively a Boolean factor. With this in mind we can expand the definition of b-factor:

    <b-factor>    ::= <b-literal> | <b-identifier> | ( <b-expression> ) | <relation>

And this is exactly where the connection between Numerical and Boolean expressions is! As the BNF notation implies a hierarchy, where the arithmetic expressions have higher precedence than the Boolean expressions, let’s list clearly the precedence levels:

    Level   Syntax Element         Operator
      0     factor                 literal, variable
      1     signed factor          unary minus
      2     term                   * , /
      3     expression             + , -
      4     b-factor               b-literal, b-variable, relop
      5     not-factor             NOT
      6     b-term                 AND
      7     b-expression           OR, XOR

At this point I was really excited by this crystal-clear definition and was getting ready to start writing my parsing routines, when Jack brought me back to the ground by pointing out that this will not work! Even though this grammar would be good in theory, it would not be any good for a top-down parser like ours (actually Jack’s). The problem comes from cases like this:

    IF ((((A + B) < 0) AND ...

When the parser sees the IF it expects a Boolean expression (please go and check how our parseIf calls parseCondition). So parseCondition would start parsing that Boolean expression but as it would start going down the list of '(' expecting to find a b-expression as per the above grammar, it would come across a numerical expression which of course would not be able to process. The only way to get out of this situation without changing the grammar would be to accept an amount of backtracking so that our parser could work its way out of bad guesses like this. This however, would be a nightmare.

To deal with this we will have to make a compromise so that our simple parser can handle the grammar without backtracking.

Fixing the Grammar

The problem we’ve just encountered is caused by the fact that both the numerical <factor> and the Boolean <b-factor> allow for parenthesised expressions. And this becomes even more challenging as, through recursion, we can go dawn multiple levels of parentheses, at which point our parser has no idea whether it’s dealing with a numerical or a Boolean expression.

And the obvious solution is to allow parenthesis only in one kind of factor – and this will be the numerical factor. There will be no parenthesis allowed in Boolean factors. With this in mind the combined numerical and Boolean grammar can be written like this:

    <b-expression>  ::= <b-term> [ <orop> <b-term> ] *
    <b-term>        ::= <not-factor> [ <andop> <not-factor> ] *
    <not-factor>    ::= [ NOT ] <b-factor>
    <b-factor>      ::= <b-literal> | <b-identifier> | <relation>
    <relation>      ::= <expression> [ <relop> <expression> ]
    <expression>    ::= <term> [ <addop> <term> ] *
    <term>          ::= <signed factor> [ <mulop> <factor> ] *
    <signed factor> ::= [ <addop> ] <factor>
    <factor>        ::= <integer> | <identifier> | ( <expression> )

With this grammar we have the same 7 levels of precedence as mentioned above. The main difference is that there is no parenthesised b-expression as an option for the b-factor. There is also one subtle but crucial difference, which makes the whole thing work. Notice the square brackets in the definition of relation. This is telling us that a relation can also be a plain numerical expression. The result of this is that everything can potentially be treated as a Boolean expression. Our parser will start looking for a Boolean expression and as it goes down the grammar, it can settle for a numeric one. And this is how C works: any integer can be treated as Boolean and vice-versa. By the way, as Jack reminded us in the original, C has actually no less than 17 levels of precedence as it has many more operators (++, --, <<, >>, etc).

So, with this plan now agreed, let’s go and build the parsing functions.

The Parser

Please start with a copy of the code as it evolved at the end of Chapter 5. Also copy the numeric parser functions that you have from the end of Chapter 3. I have gathered these numeric parsing functions in a file called ProgParser2NumericExpr.kt. Please create a new file ProgParser3BooleanExpr.kt where we will write all the Boolean parsing functions.

Jack guides us through the same steps as he did for the numeric parsing functions. So, first things first, let’s recognise a Boolean literal and let’s fetch one from the input stream (in our InputPtogramScanner.kt):

/** check for a Boolean value (T,F) */
fun isBoolean(c: Char): Boolean = c.uppercaseChar() == 'T' || c.uppercaseChar() == 'F'

/** get a boolean literal */
fun getBoolean(): Boolean {
    var value = false
    if (!isBoolean(nextChar))
        expected("Boolean Literal")
    else
        value = nextChar.uppercaseChar() == 'T'
    skipNextChar()
    skipWhite()
    return value
}

To test these two routines please change parseOther in ProgParser1ControlStruct.kt as below:

/** dummy parse function for anything else than control statements */
fun parseOther() {
    println("\t${inp.getBoolean()}")
}

Run our compiler with a file containing a series of Ts and Fs as input. The compiler should properly recognise and print these Boolean values and will stop with an error message at any other unrecognised name or number or other unrecognised character (please remember to use upper case F for “false” as the lower case “f” will trigger the parser function for the FOR loop – our compiler still recognises the control structures!).

Next step is to set the Accumulator (register D0 in the case of the 68000) to our Boolean value. Jack chose 0 for False and FFFF hex for True, which is more Assembly-language-friendly (the bitwise NOT also becomes a Boolean NOT). I will use 0 and 1 as it is more appropriate for the decimal arithmetic that the TI-59 is using. The first cut of our Boolean expression parser is:

/** parse a Boolean expression */
fun parseBooleanExpression() {
    if (inp.getBoolean())
        code.setAccumulator("1")
    else
        code.clearAccumulator()
}

To test it, change again the parseOther function, this time to:

/** dummy parse function for anything else than control statements */
fun parseOther() {
    parseBooleanExpression()
}

Our compiler will now produce the correct assembly instruction for each T and F in the input to set D0 to 1 or 0 respectively.

With the basics out of the way, we can now crack on and start writing the necessary parse functions for each level of our Boolean grammar. We just have to copy the corresponding numerical parsers and make some minor adjustments. For Boolean operators, given that we are still constraint by single-character tokens we will use '|' for OR, '~' for XOR, '&' for AND, '!' for NOT, '?' for is equal to, '#' for is not equal to, '<' for is less than and '>' for is greater than (we will expand to “less than or equal to” and “greater than or equal to” in the next chapter where we will be able to process multi-character operators).

We will start with the parsing of Boolean expressions and the “orops” OR and XOR. Please rename the version of parseBooleanExpression to parseBooleanTerm:

/** parse a boolean term */
fun parseBooleanTerm() {
    if (inp.getBoolean())
        code.setAccumulator("1")
    else
        code.clearAccumulator()
}

This is the new version of ParseBooleanExpression and also the or and xor functions:

/** parse a Boolean expression */
fun parseBooleanExpression() {
    parseBooleanTerm()
    while (inp.isOrop(inp.nextChar)) {
        code.saveAccumulator()
        when (inp.nextChar) {
            orOp -> boolOr()
            xorOp -> boolXor()
        }
    }
}

/** parse boolean or */
fun boolOr() {
    inp.match()
    parseBooleanTerm()
    code.orAccumulator()
}

/** parse boolean xor */
fun boolXor() {
    inp.match()
    parseBooleanTerm()
    code.xorAccumulator()
}

We need to define the two Boolean operators in our global definitions section and write the function that checks for them:

val orOp: Char = '|'
val xorOp: Char = '~'

/** check for an "orop" (|,~) */
fun isOrop(c: Char): Boolean = c == orOp || c == xorOp

Don’t forget the two new instructions in our code module:

/** or top of stack with accumulator */
fun orAccumulator() = outputCodeNl("OR (SP)+,D0")

/** exclusive or top of stack with accumulator */
fun xorAccumulator() = outputCodeNl("EOR (SP)+,D0")

Our compiler now will recognise OR and/or XOR operations between Boolean literals.

Next step, let’s add the parsing for a Boolean term and for the AND operation. First, rename the current version of parseBooleanTerm to parseNotFactor and write this version of parseBooleanTerm:

/** parse a boolean term */
fun parseBooleanTerm() {
    parseNotFactor()
    while (inp.isAndop(inp.nextChar)) {
        code.saveAccumulator()
        when (inp.nextChar) {
            andOp -> boolAnd()
        }
    }
}

/** parse a not factor */
fun parseNotFactor() {
    if (inp.getBoolean())
        code.setAccumulator("1")
    else
        code.clearAccumulator()
}

Also, need the function to execute the Boolean AND:

/** parse boolean and */
fun boolAnd() {
    inp.match()
    parseNotFactor()
    code.andAccumulator()
}

You may want to point out that we have just one “andop” (which is actually the AND operator) so why complicate parseBooleanTerm by checking which andop we have and why call a separate function (boolAnd) to perform the AND operation instead of doing it all inside parseBooleanTerm? And you would be right. I just chose to keep the standard we have followed in parseTerm. Also, by putting a placeholder that checks which “andop” we have, if we later decide that we have another “andop” at the same precedence as the AND, it will be just one more line in parseBooleanTerm and one new function like boolAnd to activate it.

Same as before we also need to define the AND operator and the function that recognises it:

val andOp: Char = '&'

/** check for an "andop" (&) */
fun isAndop(c: Char): Boolean = c == andOp

And of course the new instruction:

/** and top of stack with accumulator */
fun andAccumulator() = outputCodeNl("AND (SP)+,D0")

Verify it works, it processes a series of T and F literals and does OR, XOR and AND with the right level of precedence.

Next, let’s add the NOT operator. Please rename parseNotFactor to parseBooleanFactor and write the new version of parseNotFactor:

/** parse a not factor */
fun parseNotFactor() {
    if (inp.nextChar == notOp) {
        inp.match()
        parseBooleanFactor()
        code.booleanNotAccumulator()
    }
    else
        parseBooleanFactor()
}

/** parse a boolean factor */
fun parseBooleanFactor() {
    if (inp.getBoolean())
        code.setAccumulator("1")
    else
        code.clearAccumulator()
}

We also need to define the NOT operator:

val notOp: Char = '!'

and the corresponding instruction:

/** boolean not accumulator */
fun booleanNotAccumulator() = outputCodeNl("EOR #1,D0")

Now you can add some NOT factors in your test cases and make sure it works.

And now it’s time to add the relations. First we need to modify parseBooleanFactor:

/** parse a boolean factor */
fun parseBooleanFactor() {
    if (inp.isBoolean(inp.nextChar)) {
        if (inp.getBoolean())
            code.setAccumulator("1")
        else
            code.clearAccumulator()
    }
    else
        parseRelation()
}

This is how to parse a relation:

/** parse a relation */
fun parseRelation() {
    parseExpression()
    if (inp.isRelop(inp.nextChar)) {
        code.saveAccumulator()
        when (inp.nextChar) {
            isEqual -> parseEquals()
            isNotequal -> parseNotEquals()
            isLess -> parseLess()
            isGreater -> parseGreater()
        }
    }
}

Please note the call to parseExpression above. This is where the circle closes, where our Boolean parser meets the Numerical one.

The 4 parsers for equals to, not equals to, less than and great than are here:

/** parse is equal to */
fun parseEquals() {
    inp.match()
    parseExpression()
    code.compareEquals()
}

/** parse is not equal to */
fun parseNotEquals() {
    inp.match()
    parseExpression()
    code.compareNotEquals()
}

/** parse is less than */
fun parseLess() {
    inp.match()
    parseExpression()
    code.compareLess()
}

/** parse is greater than */
fun parseGreater() {
    inp.match()
    parseExpression()
    code.compareGreater()
}

We need to define the relational operators in our global definitions section and write the function to recognise them:

val isEqual: Char = '?'
val isNotequal: Char = '#'
val isGreater: Char = '>'
val isLess: Char = '<'

/** check for a relop (=, #, <, >) */
fun isRelop(c: Char): Boolean = c == isEqual || c == isNotequal || c == isLess || c == isGreater

At this point Jack explains in detail how these operations should be translated to M68000 Assembly. The key here is that both the CPU flags must be set as they are used to control the flow in a control statement, but also the “Accumulator” must be set to the corresponding Boolean value as it can be used a Boolean variable. I will skip that detail, but feel free to read it in the original. I will just copy the Assembly instructions from Jack’s tutorial:

/** compare and set accumulator and flags - is equal to */
fun compareEquals() {
    outputCodeNl("CMP (SP)+,D0")
    outputCodeNl("SEQ D0")
    outputCodeNl("TST D0")
}

/** compare and set accumulator and flags - is not equal to */
fun compareNotEquals() {
    outputCodeNl("CMP (SP)+,D0")
    outputCodeNl("SNE D0")
    outputCodeNl("TST D0")
}

/** compare and set accumulator and flags - is less than */
fun compareLess() {
    outputCodeNl("CMP (SP)+,D0")
    outputCodeNl("SGE D0")
    outputCodeNl("TST D0")
}

/** compare and set accumulator and flags - is greater than */
fun compareGreater() {
    outputCodeNl("CMP (SP)+,D0")
    outputCodeNl("SLE D0")
    outputCodeNl("TST D0")
}

And with this we have the complete Grammar covered. Our compiler can now parse Boolean expressions like

x > 0 & x < 100

and produce the right Assembly code for them. Remember, no parentheses are needed in the above expression. If you use (x > 0) & (x < 100) it will stop at the '>' with an error as it will be processed by the numeric parser which of course does not accept a '>'.

Adapting the Numeric Parsers to the New Grammar

If you remember, our grammar for the numeric parsing was slightly changed in the beginning of this chapter, by moving the handling of the unary minus from the Term to the Factor. Let’s modify our parsing routines to follow this improved grammar.

First parseExpression is changed back to what it was initially before we added the “trick” for the unary minus:

/**
 * parse a numeric expression
 * <expression> ::= <term> [ <addop> <term> ] *
 */
fun parseExpression() {
        parseTerm()
        while (inp.isAddop(inp.nextChar)) {
                code.saveAccumulator()
                when (inp.nextChar) {
                        addOp -> add()
                        subOp -> subtract()
                }
        }
}

Next parseTerm is changed to use parseSignedFactor as per grammar:

/**
 * parse a term
 * <term> ::= <signed factor> [ <mulop> <factor> ] *
 */
fun parseTerm() {
        parseSignedFactor()
        while (inp.isMulop(inp.nextChar)) {
                code.saveAccumulator()
                when (inp.nextChar) {
                        mulOp -> multiply()
                        divOp -> divide()
                }
        }
}

We need the new function parseSignedFactor:

/**
 * parse a signed factor
 * this can be only the first factor in a term
 * <signed factor> ::= [ addop ] <factor>
 */
fun parseSignedFactor() {
        if (inp.nextChar == addOp)
                inp.match()
        if (inp.nextChar == subOp) {
                inp.match()
                if (inp.isNumeric(inp.nextChar))
                        code.setAccumulator("-${inp.getNumber()}")
                else {
                        parseFactor()
                        code.negateAccumulator()
                }
        }
        else
                parseFactor()
}

This function checks first for a leading '+' and ignores it. Then it checks for a '-number' which is processed straight away in one instruction. '-(expression)' and '-identifier' are processed by parseFactor as usual and then negated. This way the output code is much more efficient.

We also need to add the new instruction to negate the accumulator:

/** negate accumulator */
fun negateAccumulator() = outputCodeNl("NEG D0")

And with that our numeric parser is now compliant with the new improved grammar. We will test this after we complete the merging of this code with the rest of the structures further down in this chapter.

Merging with Control Structures

Now it’s time to go back to the ProgPraser1ControlStruct.kt that’s holding all the control structures parser functions. All we need to do is replace all the calls to parseCondition with calls to parseBooleanExpression. ParseCondition, which was just a dummy placeholder, has served its purpose and can now be deleted.

Adding Assignments

And finally, to merge the numeric functions that we developed in chapters 2 and 3 we need to replace the call to parseOther in praseBlock with a call to parseAssignment. ParseOther can also be deleted.

You also need to replace the two calls to parseOther in the parseFor function with calls to parseExpression for now.

ParseAssignment needs a minor change as it now has to call parseBooleanExpression.

/**
 * parse assignment
 * <assignment> ::= <identifier> = <b-expression>
 */
fun parseAssignment() {
        val identName: String = inp.getName()
        inp.match(equalsOp)
        parseBooleanExpression()
        code.assignment(identName)
}

This change is necessary to properly support our new grammar. If you remember, we realised above that any integer can be treated as Boolean and vice versa. Our numeric expression parser will start looking for a Boolean expression, and if it does not find one, it can settle for a numeric one.

And now we can write a (-n almost) proper program:

x = set_x()
w x > 0 & x < 100
   x = x + 1
   i check_y() ? 0
      b
   e
e
a = -(x+1) * 5
.

The output will be 46(!) lines of Assembly code but it will do the job. This program is now looking very realistic, the only “funny” bit about it being the single-character tokens.

As the code has significantly grown by now, you can find it in my GitHub repository.

Coming up next: Lexical Scanning

In the next chapter we will build a more advanced lexical scanner that will help us eliminate the single-character token constraint.

Chapter 5b: Control Structures – part 2

12th Nov 202110th Nov 2021 / Marinos / Leave a comment

Based on the original series “Let’s Build a Compiler!” by Jack Crenshaw.

Introduction

This is the second part of the Control Structures chapter. Without further ado, we will go straight to the REPEAT statement.

REPEAT Statement

Having done WHILE, it should be very easy to process the REPEAT statement. The syntax of REPEAT in our language will be:

    REPEAT <block> UNTIL <condition>

The result of this statement will be again a loop like while, with the only difference that the condition is executed at the end of each iteration. In syntax-directed translation:

    REPEAT         { L = NewLabel
                     PostLabel(L) }
    <block>
    UNTIL
    <condition>    { BranchIfFalse(L) }

And the corresponding parse function can be written very easily. We will use 'r' for REPEAT and 'u' for UNTIL:

/**
 * parse repeat statement
 * REPEAT <block> UNTIL <condition>
 */
fun parseRepeat() {
    inp.getNextChar()
    val label = newLabel()
    postLabel(label)
    parseBlock()
    inp.match(untilToken)
    parseCondition()
    code.branchIfFalse(label)
}

We also need to define our ‘repeat’ and ‘until’ tokens:

val repeatToken = 'r'
val untilToken = 'u'

The function isEndBlock is now becoming a long list of comparisons which is a bit messy. Why not use a Kotlin Set object to define all the possible block terminators:

// the set of tokens that are block terminators
private val blockTerminators = setOf(endifToken, elseToken, endwhileToken, untilToken)

And with this, isEndBlock now looks much better:

/** check for end of block */
fun isEndBlock(c: Char) = blockTerminators.contains(c) || isEndProgram(c)

Finally, need to change parseBlock and tell it to recognise the repeat statement:

/**
 * parse a block
 * <block> ::= [ <statement> ] * end
 */
fun parseBlock() {
    while (!inp.isEndBlock(inp.nextChar)) {
        when (inp.nextChar) {
            ifToken -> parseIf()
            whileToken -> parseWhile()
            repeatToken -> parseRepeat()
            else -> parseOther()
        }
    }
}

Compile it, verify it and make sure it works.

BREAK Statement

This is a quite useful statement that can simplify a program a lot. The challenge with it is when you have nested loops, which of them do you break out of? The answer is of course the inner-most loop, so the parser for break statement needs to know the exit address of the loop it has been called from, in order to produce the correct assembly branch code.

In our implementation, each loop iteration is processed by parseBlock, which also processes the IF block, which is where the break statement is most likely to appear. Based on this train of thought, it sounds like we would need the following:

when there is an IF command in a loop (parseWhile or parseRepeat), the call to parseIf must pass the exit label for that loop
parseIf must also pass that label to both of its blocks (if and else) as this is where the BREAK would be (if any)
remember that parseIf will be called by parseBlock, so parseBlock must receive the exit label when it is called by parseWhile or parseRepeat in order to pass it down to parseIf

You can see the recursion here, where parseBlock calls parseWhile, which calls parseBlock, which calls parseIf, which calls parseBlock, and over again, and at some point, one of those parseBlock instances will call parseBreak. The stack in Kotlin (that we are getting for free) ensures that the value of the exit label parameter that is being passed down the chain is always the most recent.

This may sound a bit complicated but it’s a really clever yet simple solution from Jack, and here’s the code for it:

/** parse break statement */
fun parseBreak(label: String) {
    inp.getNextChar()
    if (label == "")
        abort("no loop to break of")
    else
        code.branch(label)
}

We need to define our break token:

val breakToken = 'b'

And of course tell parseBlock to deal with BREAK as well:

/**
 * parse a block
 * <block> ::= [ <statement> ] *
 */
fun parseBlock(breakLabel: String = "") {
    while (!inp.isEndBlock(inp.nextChar)) {
        when (inp.nextChar) {
            ifToken -> parseIf(breakLabel)
            whileToken -> parseWhile()
            repeatToken -> parseRepeat()
            breakToken -> parseBreak(breakLabel)
            else -> parseOther()
        }
    }
}

This is the modified code for ifParse that passes the break Label it receives from the caller, to the two blocks further down:

/**
 * parse if statement
 * IF <condition> <block> [ ELSE <block> ] ENDIF
 */
fun parseIf(breakLabel: String) {
    inp.getNextChar()
    parseCondition()
    val label1 = newLabel()
    code.branchIfFalse(label1)
    parseBlock(breakLabel)
    if (inp.nextChar == elseToken) {
        inp.getNextChar()
        val label2 = newLabel()
        code.branch(label2)
        postLabel(label1)
        parseBlock(breakLabel)
        postLabel(label2)
    }
    else
        postLabel(label1)
    inp.match(endifToken)
}

The version of parseWhile that passes the break Label is below:

/**
 * parse while statement
 * WHILE <condition> <block> ENDWHILE
 */
fun parseWhile() {
    inp.getNextChar()
    val label1 = newLabel()
    val label2 = newLabel()
    postLabel(label1)
    parseCondition()
    code.branchIfFalse(label2)
    parseBlock(label2)
    inp.match(endwhileToken)
    code.branch(label1)
    postLabel(label2)
}

And the new version of parseRepeat is:

/**
 * parse repeat statement
 * REPEAT <block> UNTIL <condition>
 */
fun parseRepeat() {
    inp.getNextChar()
    val label1 = newLabel()
    val label2 = newLabel()
    postLabel(label1)
    parseBlock(label2)
    inp.match(untilToken)
    parseCondition()
    code.branchIfFalse(label1)
    postLabel(label2)
}

As usual, compile it, verify it and try to break it (pun intended and copied from the original).

FOR Statement

I’ve left the FOR loop last for a reason. Jack in his original chapter 5, explained how complex it is to produce M68000 assembly code to process a FOR loop while having the value of the counter accessible within the loop. In fact the simplest he could do was 14 lines of assembly code just for the FOR loop, and that’s without counting the code to calculate the <from> and <to> expressions and without the code that is executed inside the loop. This is 2x or 3x the size of the while or repeat code and this is also reflected in the parser function.

I will take a different approach. I will analyse and process the FOR statement, but given that my goal is not to learn or to teach M68000 assembly, I will just produce dummy pseudo-code just to demonstrate the concept as Jack analysed it. When we get to convert our compiler to produce TI-59 code, I will make it produce real TI-59 code for the FOR loop.

So, the FOR in our language is:

    FOR <variable> = <expression1> TO <expression2> <block> ENDFOR

This is equivalent to our familiar WHILE loop:

<variable> = <expression1>
TEMP = <expression2>
WHILE <variable> <= TEMP
<block>
++<variable>
ENDWHILE

Please note, the assumption here is that the step is always +1 and the value of expression2 is calculated only once, before the loop starts.

I will translate this into code, with one difference. I will increase the value of variable just before the condition is evaluated, instead of increasing it at the end of the iteration. This means that I have to decrease it by 1 just after I set it to expression1 in the beginning, to get the first time right.

/** 
 * parse for statement 
 * dummy version using pseudo-code
 * focuses on parsing of the structure only - not on producing code
 */
fun parseFor() {
        inp.getNextChar()
        val counterName = inp.getName()
        inp.match(equalsOp)
        code.dummyInstr("Allocate space for $counterName and set value to:")
        parseOther() // this is the FROM expression
        code.dummyInstr("Decrease $counterName by 1")
        inp.match(toToken)
        code.dummyInstr("Allocate space for TO variable and set value to:")
        parseOther() // this is the TO expression
        val label1 = newLabel()
        val label2 = newLabel()
        // actual start of the loop
        postLabel(label1)
        // increase counter and check the condition
        code.dummyInstr("Increase $counterName by 1")
        code.dummyInstr("Is $counterName <= TO?")
        code.branchIfFalse(label2)
        // execute the body of the loop
        parseBlock(label2)
        inp.match(endforToken)
        // back to the beginning of the loop
        code.branch(label1)
        // exit point of the loop
        postLabel(label2)
        code.dummyInstr("Release space held for $counterName and TO")
}

I have also added a dummy instruction to our M68000Instructions.kt:

/** dummy instruction */
fun dummyInstr(cmd: String) = outputCodeNl(cmd)

We need to define the tokens for FOR, TO and ENDFOR in our global definitions and add ENDFOR to the list of block terminators:

val forToken = 'f'
val toToken = 't'
val endforToken = 'e'

// the set of tokens that are block terminators
private val blockTerminators = setOf(endifToken, elseToken, endwhileToken, untilToken, endforToken)

And of course we need to recognise the FOR token in parseBlock:

/**
 * parse a block
 * <block> ::= [ <statement> ] *
 */
fun parseBlock(breakLabel: String = "") {
    while (!inp.isEndBlock(inp.nextChar)) {
        when (inp.nextChar) {
            ifToken -> parseIf(breakLabel)
            whileToken -> parseWhile()
            repeatToken -> parseRepeat()
            forToken -> parseFor()
            breakToken -> parseBreak(breakLabel)
            else -> parseOther()
        }
    }
}

Compile it and try it with an input program like:

f counter = expr1 t expr2
    xloop_block
e
.

If you feel adventurous you can add a STEP token as well (but don’t forget that this logic assumes a positive step as the exit condition checks for Counter <= TO). Also, please remember, for the block inside the FOR loop use for now a letter that has not been used as a token before (e.g. the word block will be interpreted as break – but as discussed this will be dealt with in chapter 7 where we will have multi-character tokens).

At this point I would like to make one more little improvement: in the various parser functions I have used getNextChar to advance to the next token (in our case single-character token) where Jack was using match. He did that for consistency and used match everywhere, even when he did not have to actually match the token as it had already been matched immediately above. In these cases I have used getNextChar in order to avoid the unnecessary call to match that would cause confusion to the reader.

However I really appreciate the importance of consistent code and I’d really like to use match everywhere – but only when a match is needed. So how do I reconcile the two? I have changed slightly function match to (a) match the given token and advance to the next character or abort if not matched and (b) just advance to the next character if no token is passed as a parameter. This can be easily achieved in Kotlin by using the default value for the parameter to the function: when there is no parameter provided by the caller, the value of the parameter in the function will be the default value I have defined (null character(0)) and this can be used as our wildcard.

/**
 * match a specific input char
 * and advance to next character
 * also produce a match if called with no input or if input is null character
 */
fun match(x: Char = nullChar.toChar()) {
    // return a match when the next char matches x
    // or when x is null character
    if (nextChar == x || x == nullChar.toChar())
        getNextChar()
    else
        expected("'$x'")
}

With this new version of match, I can now use inp.match(endforToken) where I need to match a token and advance to the next one, or inp.match() where I just need to skip the token and advance to the next one (the match has already happened before this point). Please replace all the calls to getNextChar in the parser functions with match.

And that’s it for this chapter.

Coming up next: Boolean Expressions

Below is the full code for this version of our Compiler. As you can see I have gathered all the parser functions in a new file called ProgParser1ControlStruct.kt to separate them from the numerical expressions parser functions:

CompilerMain.kt

package you.set.this

import kotlin.system.exitProcess

/**
 * A Simple Compiler
 * Based on the Let's Build a Compiler! series by Jack Crenshaw
 * This version produces Assembly language code for the 68000 microprocessor
 * A future version will produce code for the TI-59 Calculator
 * It will produce output that uses the TI-59 mnemonics that can be keyed directly to the calculator
 * Version 1.0 01.10.2021
 */

// the input program scanner
lateinit var inp: InputProgramScanner
// the Motorola 68000 instruction set
val code = M68000Instructions()

/** report an error */
fun error(errMsg: String) {
    System.err.println("Error: $errMsg")
}

/** abort compilation */
fun abort(errMsg: String) {
    error(errMsg)
    exitProcess(1)
}

/** print message and exit */
fun exit(msg: String) {
    System.err.println(msg)
    exitProcess(0)
}

/** report what was expected and abort */
fun expected(expMsg: String) {
    // setup a meaningful message if we have reached end of input
    val foundStr: String = if (inp.nextChar==endOfInput) "end of input" else inp.nextChar.toString()
    abort("Expected $expMsg \n found [${foundStr}]")
}

/** compiler initialisation */
fun initCompiler(args: Array<String>) {
    if (args.isEmpty())
        abort("no input file specified")
    // get a new input program scanner object - initialise from input file
    inp = InputProgramScanner(args[0])
}

/** main function */
fun main(args: Array<String>) {
    initCompiler(args)
    parseProgram()
}

InputProgramScanner.kt

package you.set.this

import java.io.File

/**
 * The input program scanner class
 * Performs the lexical scanner functions
 * Processes the char-by-char input and returns the tokens from the input stream
 */

/////////////////////////////////////////////////////////
// global special characters definitions
/////////////////////////////////////////////////////////
// numeric operators
val addOp: Char = '+'
val subOp: Char = '-'
val mulOp: Char = '*'
val divOp: Char = '/'
// parentheses
val leftParen: Char = '('
val rightParen: Char = ')'
// equals
val equalsOp = '='
// tokens
val endProg = '.'
val ifToken = 'i'
val endifToken = 'e'
val elseToken = 'l'
val whileToken = 'w'
val endwhileToken = 'e'
val repeatToken = 'r'
val untilToken = 'u'
val forToken = 'f'
val toToken = 't'
val endforToken = 'e'
val breakToken = 'b'
// end of input mark
val nullChar = 0
val endOfInput = nullChar.toChar()
/////////////////////////////////////////////////////////

/** this class implements the lexical scanner */
class InputProgramScanner(inputFile: String = "") {

    // the input program as string
    var inputProgram: String = ""
    var indx = 0

    // the next character from input
    // this is our lookahead character
    var nextChar: Char = ' '

    // the set of tokens that are block terminators
    private val blockTerminators = setOf(endifToken, elseToken, endwhileToken, untilToken, endforToken)

    /** initialisation code class InputProgramScanner */
    init {
        try {
            val f = File(inputFile)
            // read the whole program into a string
            inputProgram = f.readText()
            // set the lookahead character to the first input char
            getNextChar()
        } catch (e: Exception) {
            abort("could not open file [$inputFile]")
        }
    }

    /** get the next character from input after skipping white spaces */
    fun getNextChar() {
        skipNextChar()
        skipWhite()
    }

    /** set the lookahead character to the next char from input */
    fun skipNextChar() {
        if (indx < inputProgram.length)
            nextChar = inputProgram[indx++]
        else
            nextChar = endOfInput
    }

    /** skip white spaces */
    fun skipWhite() {
        while (isWhite(nextChar))
            skipNextChar()
    }

    /**
     * match a specific input char
     * and advance to next character
     * also produce a match if called with no input or if input is null character
     */
    fun match(x: Char = nullChar.toChar()) {
        // return a match when the next char matches x
        // or when x is null character
        if (nextChar == x || x == nullChar.toChar())
            getNextChar()
        else
            expected("'$x'")
    }

    /**
     * get an identifier
     * <identifier> ::= <alpha> [ <alphanumeric> | <_> ] *
     */
    fun getName(): String {
        var token = ""
        if (!isAlpha(nextChar))
            expected("Identifier")
        else {
            while (isAlphanumeric(nextChar)) {
                token += nextChar.uppercase()
                skipNextChar()
            }
            skipWhite()
        }
        return token
    }

    /**
     * get a number
     * <number> ::= [ <digit> ] +
     */
    fun getNumber(): String {
        var value = ""
        if (!isNumeric(nextChar)) {
            expected("Number")
        } else {
            while (isNumeric(nextChar)) {
                value += nextChar.toString()
                skipNextChar()
            }
            skipWhite()
        }
        return value
    }

    /** check for an alpha char */
    fun isAlpha(c: Char): Boolean = c.uppercaseChar() in 'A'..'Z'

    /** check for a numeric char */
    fun isNumeric(c: Char): Boolean = c in '0'..'9'

    /** check for alphanumeric */
    fun isAlphanumeric(c: Char): Boolean = isAlpha(c) || isNumeric(c) || c == '_'

    /** check for an "addop" (+,-) */
    fun isAddop(c: Char): Boolean = c == addOp || c == subOp

    /** check for a "mulop" (*,/) */
    fun isMulop(c: Char): Boolean = c == mulOp || c == divOp

    /** check for left parenthesis */
    fun isLeftParen(c: Char): Boolean = c == leftParen

    /** check for end of line */
    fun isEndOfLine(c: Char): Boolean = c == '\n' || c == '\r'

    /** check for a white space */
    fun isWhite(c: Char): Boolean = c == ' ' || c == '\t' || isEndOfLine(c)

    /** check for end of block */
    fun isEndBlock(c: Char) = blockTerminators.contains(c) || isEndProgram(c)

    /** check for end of program */
    fun isEndProgram(c: Char): Boolean = c == endProg || c == endOfInput
}

ProgParser1ControlStruct.kt

package you.set.this

/**
 * Program parsing - module 1
 * Control Structures
 */

// global vars
var labelIndx: Int = 0

/** create a unique label*/
fun newLabel(): String = "L${labelIndx++}"

/** post a label to output */
fun postLabel(label: String) = code.outputLabel(label)

////////////////////////////////////////////////////////////

/**
 * parse a program
 * <program> ::= <block> end
 */
fun parseProgram() {
    parseBlock()
    inp.match(endProg)
    println("\nEND PROGRAM")
}

/**
 * parse a block
 * <block> ::= [ <statement> ] *
 */
fun parseBlock(breakLabel: String = "") {
    while (!inp.isEndBlock(inp.nextChar)) {
        when (inp.nextChar) {
            ifToken -> parseIf(breakLabel)
            whileToken -> parseWhile()
            repeatToken -> parseRepeat()
            forToken -> parseFor()
            breakToken -> parseBreak(breakLabel)
            else -> parseOther()
        }
    }
}

/**
 * parse if statement
 * IF <condition> <block> [ ELSE <block> ] ENDIF
 */
fun parseIf(breakLabel: String) {
    inp.match()
    parseCondition()
    val label1 = newLabel()
    code.branchIfFalse(label1)
    parseBlock(breakLabel)
    if (inp.nextChar == elseToken) {
        inp.match()
        val label2 = newLabel()
        code.branch(label2)
        postLabel(label1)
        parseBlock(breakLabel)
        postLabel(label2)
    }
    else
        postLabel(label1)
    inp.match(endifToken)
}

/**
 * parse while statement
 * WHILE <condition> <block> ENDWHILE
 */
fun parseWhile() {
    inp.match()
    val label1 = newLabel()
    val label2 = newLabel()
    postLabel(label1)
    parseCondition()
    code.branchIfFalse(label2)
    parseBlock(label2)
    inp.match(endwhileToken)
    code.branch(label1)
    postLabel(label2)
}

/**
 * parse repeat statement
 * REPEAT <block> UNTIL <condition>
 */
fun parseRepeat() {
    inp.match()
    val label1 = newLabel()
    val label2 = newLabel()
    postLabel(label1)
    parseBlock(label2)
    inp.match(untilToken)
    parseCondition()
    code.branchIfFalse(label1)
    postLabel(label2)
}

/**
 * parse for statement
 * dummy version using pseudo-code
 * focuses on parsing of the structure only - not on producing code
 */
fun parseFor() {
    inp.match()
    val counterName = inp.getName()
    inp.match(equalsOp)
    code.dummyInstr("Allocate space for $counterName and set value to:")
    parseOther() // this is the FROM expression
    code.dummyInstr("Decrease $counterName by 1")
    inp.match(toToken)
    code.dummyInstr("Allocate space for TO variable and set value to:")
    parseOther() // this is the TO expression
    val label1 = newLabel()
    val label2 = newLabel()
    // actual start of the loop
    postLabel(label1)
    // increase counter and check the condition
    code.dummyInstr("Increase $counterName by 1")
    code.dummyInstr("Is $counterName <= TO?")
    code.branchIfFalse(label2)
    // execute the body of the loop
    parseBlock(label2)
    inp.match(endforToken)
    // back to the beginning of the loop
    code.branch(label1)
    // exit point of the loop
    postLabel(label2)
    code.dummyInstr("Release space held for $counterName and TO")
}

/** parse break statement */
fun parseBreak(label: String) {
    inp.match()
    if (label == "")
        abort("no loop to break of")
    else
        code.branch(label)
}

/** dummy parse condition */
fun parseCondition() {
    println("\t${inp.getName()}")
}

/** dummy parse function for anything else than control statements */
fun parseOther() {
    println("\t${inp.getName()}")
}

M68000Instructions.kt

package you.set.this

class M68000Instructions {

    /** output code */
    fun outputCode(s: String) = print("\t$s")

    /** output code with newline */
    fun outputCodeNl(s: String) = outputCode("$s\n")

    /** output a label */
    fun outputLabel(s: String) = println("$s:")

    /** set accumulator to a value */
    fun setAccumulator(value: String) = outputCodeNl("MOVE #${value},D0")

    /** clear accumulator */
    fun clearAccumulator() = outputCodeNl("CLR D0")

    /** push accumulator to the stack */
    fun saveAccumulator() = outputCodeNl("MOVE D0,-(SP)")

    /** add top of stack to accumulator */
    fun addToAccumulator() = outputCodeNl("ADD (SP)+,D0")

    /** subtract top of stack from accumulator */
    fun subFromAccumulator() {
        outputCodeNl("SUB (SP)+,D0")
        // negate accumulator as the result is the wrong way round
        outputCodeNl("NEG D0")
    }

    /** multiply accumulator by top of stack */
    fun multiplyAccumulator() = outputCodeNl("MULS (SP)+,D0")

    /** divide accumulator by top of stack */
    fun divideAccumulator() {
        outputCodeNl("MOVE (SP)+,D1")
        outputCodeNl("DIVS D1,D0")
    }

    /** set accumulator to variable */
    fun setAccumulatorToVar(identifier: String) = outputCodeNl("MOVE ${identifier}(PC),D0")

    /** call a subroutine */
    fun callSubroutine(subroutine: String) = outputCodeNl("BSR ${subroutine}")

    /** assignment var to value */
    fun assignment(identifier: String) {
        outputCodeNl("LEA $identifier(PC),A0")
        outputCodeNl("MOVE D0,(A0)")
    }

    /** branch if false */
    fun branchIfFalse(label: String) = outputCodeNl("BEQ $label")

    /** branch */
    fun branch(label: String) = outputCodeNl("BRA $label")

    /** dummy instruction */
    fun dummyInstr(cmd: String) = outputCodeNl(cmd)
}

Chapter 5a: Control Structures – part 1

5th Nov 202110th Nov 2021 / Marinos / Leave a comment

Based on the original series “Let’s Build a Compiler!” by Jack Crenshaw.

Introduction

This chapter brings us back to our compiler. Keep the interpreter that you have built in chapter 4 somewhere safe, as going forward you may have to refer to it and use some of those routines.

So far we have covered numerical expressions, so this is the natural next step where we add ifs, loops, breaks, etc. to make our experimental programming language a bit more structured. Jack promised in his original tutorial that having covered the expressions parsing, we will find control structures parsing very easy! Let’s see if you will agree with him.

As I have done in the first 4 chapters, I will stick with Jack’s approach: I will use single character tokens for the control statements to make things simpler while we concentrate on the parsing. I will however keep the handling of white spaces in the code, as it will make it a bit easier for us to construct the various input programs to test our compiler. Please note, I will also keep the handling of multi-character strings in getName and getNumber – they can process single-character tokens equally well as long as they are separated by white space or other special character. I will also keep the skipping of newlines as we have done in chapter 4, so that I can feed our compiler a multi-line program which will help us understand a bit better the flow of a control structure.

This chapter turned out to be quite long, so I will publish it in 2 parts.

The General Plan

We will use the version of our compiler from the end of chapter 3.

Given that we are sticking with single-character tokens for now, the input “code” will look a bit funny. We will use 'i' for IF, 'w' for WHILE, etc. To cheer you up a bit this restriction will be removed in chapter 7, at which point our programming language will be much easier to read.

Jack wants us to concentrate strictly on control statements in this chapter, so we will not be parsing any other statements here. For this purpose we need a dummy parse function that will print the next alphanumeric token to the output just for us to follow the code:

/** dummy parse function for anything else than control statements */
fun parseOther() {
    println("\t${inp.getName()}")
}

We now have to extend our scanner to be able to deal with more than one lines. This was done in the Interpreter chapter by adding the call to skipNewline to the main loop. We will do it a bit more formally here. Consider the following definitions:

<program> ::= <block> end

<block> ::= [ <statement> ] * end

…which say that a program is a block followed by the keyword 'end' while a block is any number of statements (including no statements at all – empty block) followed by the 'end' keyword. These definitions can be easily converted to parsing functions as we have been doing so far:

/**
 * parse a program
 * <program> ::= <block> end
 */
fun parseProgram() {
    parseBlock()
    inp.match(endProg)
    println("\nEND PROGRAM")
}

/**
 * parse a block
 * <block> ::= [ <statement> ] * end
 */
fun parseBlock() {
    while (!inp.isEndProgram(inp.nextChar))
        parseOther()
}

And with these new parsing functions in place, our main loop now becomes

/** main function */
fun main(args: Array<String>) {
    initCompiler(args)
    parseProgram()
}

We also need to add the global definition for our program terminator ('.' is used here – this will be changed to something more user-friendly in chapter 7) and the function that checks for end of program.

// tokens
val endProg = '.'

/** check for end of program */
fun isEndProgram(c: Char): Boolean = c == endProg

Finally, in order to process a multi-line program, we need to deal with newline characters. There are two schools: if we want our programming language to be line-oriented (e.g. like BASIC where the end of each line is also the end of the statement) then we would have to make changes in many parts of the code and add processing of the newline character. If however, we decide to make our language free-field (like C or Kotlin) then we have an easy solution: we just need to change the function skipWhite so it will skip both white spaces and line terminators. I will follow the latter, so in other words, I will be treating the line terminator as a white space. We will see later if we can keep this approach or may have to change the handling of newlines.

/** check for a white space */
fun isWhite(c: Char): Boolean = c == ' ' || c == '\t' || isEndOfLine(c)

With these changes our compiler will process an input program of as many lines as you want and will recognise and print any alphanumeric tokens. It will flag any number, operator or other special character as an error. It will also flag the absence of the end-program token ('.') as an error.

IF Statement

The syntax for the IF statement that we will support here is as follows:

IF <condition> <block> ENDIF

As you can see there is a specific token that terminates the IF-block. This will be the approach in this chapter where each of the control statements will have its own termination token.

Thinking of the flow around the IF statement and trying to convert this to Assembly-friendly format (we will convert to TI-59 at a later stage) then the below IF block

IF <condition> A ENDIF B …

becomes

        Branch if NOT <condition> to L
        A
L:      B
        ...

Our assembly code that will be produced here, will be branching around the IF-block if the condition is false.

With the relaxed restrictions, as mentioned above, a block like IF Condition A ENDIF B ... can be written in our programming “language”:

i condition
      a
e
b
.

which is much easier to read than icaeb. (this is what our test “program” would look like if we did not process white spaces and newlines and returned single characters in getName)

First, the functions needed to produce the unique labels:

// global vars
var labelIndx: Int = 0

/** create a unique label*/
fun newLabel(): String = "L${labelIndx++}"

/** post a label to output */
fun postLabel(label: String) = code.outputLabel(label)

And in our M68000Instructions.kt:

/** output a label */
fun outputLabel(s: String) = println("$s:")

If we think of the actions that we need to take for each of the keywords in the IF-block, we have something like this:

IF: First, get the condition and issue the code for it.
Then, create a unique label and issue the code to branch to it if the condition is false
ENDIF: output the label

At this point Jack introduces us to what is called “syntax-directed translation”, which we have been doing anyway, but here’s how to be clear about it:

    IF
    <condition>    { condition
                     L = NewLabel
                     Code (Branch if False to L) }
    <block>
    ENDIF          { PostLabel (L) }

Based on this, the code for our IF parser will look like this (please note we are using a dummy function for the condition as well – this will be covered fully in the next chapter):

/**
 * parse if statement
 * IF <condition> <block> ENDIF
 */
fun parseIf() {
    inp.getNextChar()
    parseCondition()
    val label = newLabel()
    code.branchIfFalse(label)
    parseBlock()
    inp.match(endifToken)
    postLabel(label)
}

/** dummy parse condition */
fun parseCondition() {
    println("\t${inp.getName()}")
}

We need to add the new instruction to M68000Instructions.kt:

/** branch if false */
fun branchIfFalse(label: String) = outputCodeNl("BEQ $label")

And parseBlock has to be changed to call parseIf when it sees the 'i' token:

/**
 * parse a block
 * <block> ::= [ <statement> ] *
 */
fun parseBlock() {
    while (inp.nextChar != endifToken && !inp.isEndProgram(inp.nextChar)) {
        when (inp.nextChar) {
            ifToken -> parseIf()
            else -> parseOther()
        }
    }
}

Don’t forget to define the tokens for IF and ENDIF in our global definitions section:

val ifToken = 'i'
val endifToken = 'e'

Compile this code and run it with this IF statement as input:

i condition1
    statement1
e
statement2
.

It produces the following output:

	CONDITION1
	BEQ L0
	STATEMENT1
L0:
	STATEMENT2

Note how the code branches around statement1 if condition1 is false. I’m not quite satisfied with our error checking in this version though. But I will address this very soon once we complete our IF statement with the addition of ELSE. In BNF this is:

IF <condition> <block> [ ELSE <block> ] ENDIF

And in syntax-directed translation we can write this as follows:

    IF
    <condition>    { L1 = NewLabel
                     L2 = NewLabel
                     code (Branch if False to L1) }
    <block>
    <ELSE>         { code (Branch to L2)
                     PostLabel (L1) }
    <block>
    <ENDIF>        { PostLabel (L2) }

If we combine this with the ELSE-less case that we dealt with further up then our parseIf function becomes (please remember, we are using 'i' for IF, 'e' for ENDIF and now 'l' for ELSE):

/**
 * parse if statement
 * IF <condition> <block> [ ELSE <block> ] ENDIF
 */
fun parseIf() {
    inp.getNextChar()
    parseCondition()
    val label1 = newLabel()
    code.branchIfFalse(label1)
    parseBlock()
    if (inp.nextChar == elseToken) {
        inp.getNextChar()
        val label2 = newLabel()
        code.branch(label2)
        postLabel(label1)
        parseBlock()
        postLabel(label2)
    }
    else
        postLabel(label1)
    inp.match(endifToken)
}

We need to remember to terminate the parseBlock loop when it sees the else token as well:

/**
 * parse a block
 * <block> ::= [ <statement> ] *
 */
fun parseBlock() {
    while (inp.nextChar != endifToken &&
           inp.nextChar != elseToken && !inp.isEndProgram(inp.nextChar)) {
        when (inp.nextChar) {
            ifToken -> parseIf()
            else -> parseOther()
        }
    }
}

Don’t forget to define the else token in our global definitions:

val elseToken = 'l'

And of course the new instruction for our assembly code:

/** branch */
fun branch(label: String) = outputCodeNl("BRA $label")

Compile it and try it out. Give it the below input:

i condition1
    statement1
l
    statement2
e
statement3
.

It will give you this output:

	CONDITION1
	BEQ L0
	STATEMENT1
	BRA L1
L0:
	STATEMENT2
L1:
	STATEMENT3

As expected, the code branches around statement1 if false, and executes statement2 and statement3, while if true, it executes statement1 and then goes to statement3. Try it also without else to make sure this case is not broken. Also try it with nested ifs and verify it works. As an exercise you can try to add THEN as a mandatory keyword. I believe it can be done with just one line of code!

Error Handling

At this point I would like to improve our error handling. If you try it without the endif token, our error handling will correctly pick it up. If however you try it without the endif token and without the end-program token, it will just say “end of input” without pointing out the error. In fact, the end-program token (our '.' in this case) is not processed at all, unless there is some other non-white space character following it. Something is wrong here. It looks like the issue is with skipNextChar, which when it reaches the end of input, just aborts the compilation, so match does not get its chance to do its error checking.

To improve this, I’m changing skipNextChar to make it always return something in nextChar, even when it’s run out of characters in the input stream. For this I’ve picked the null character – this will be our end-of-input token.

// end of input mark
val nullChar = 0
val endOfInput = nullChar.toChar()

Our end of program check now becomes:

/** check for end of program */
fun isEndProgram(c: Char): Boolean = c == endProg || c == endOfInput

I will also define a function to check for end of block:

/** check for end of block */
fun isEndBlock(c: Char) = c == endifToken || c == elseToken || isEndProgram(c)

With this, the exit condition in the parseBlock loop can be changed to:

/**
 * parse a block
 * <block> ::= [ <statement> ] *
 */
fun parseBlock() {
    while (!inp.isEndBlock(inp.nextChar)) {
        when (inp.nextChar) {
            ifToken -> parseIf()
            else -> parseOther()
        }
    }
}

I’m also improving function expected to print a more meaningful message than trying to print the null character when we encounter end of input:

/** report what was expected and abort */
fun expected(expMsg: String) {
    // setup a meaningful message if we have reached end of input
    val foundStr: String = if (inp.nextChar==endOfInput) "end of input" else inp.nextChar.toString()
    abort("Expected $expMsg \n found [${foundStr}]")
}

Note how in Kotlin the if statement actually returns a value, so there is no need for the ternary operator in this language.

This is the new version of skipNextChar:

/** set the lookahead character to the next char from input */
fun skipNextChar() {
    if (indx < inputProgram.length)
        nextChar = inputProgram[indx++]
    else
        nextChar = endOfInput
}

And with this improvement we don’t need to add a '\n' at the end of our input program that was meant to help with handling of the end of input:

/** initialisation code class InputProgramScanner */
init {
    try {
        val f = File(inputFile)
        // read the whole program into a string
        inputProgram = f.readText()
        // set the lookahead character to the first input char
        getNextChar()
    } catch (e: Exception) {
        abort("could not open file [$inputFile]")
    }
}

Try it again. It looks like the error handling is much improved now.

WHILE Statement

With the IF out of the way, it should be now easy to process a WHILE structure. The syntax of WHILE in our language will be:

WHILE <condition> <block> ENDWHILE

It’s not really necessary to use different end-tokens for each control statement. It does add clarity to the code though. However, at this stage where we are using single-character tokens and we will be using 'e' for both endif and endwhile.

Looking at the above definition, we will need a Label at the beginning of the loop to branch to, when we start a new iteration, and a Label just after the end of the loop to branch to, when the condition is false. This in syntax-directed translation becomes:

    WHILE          { L1 = NewLabel
                     PostLabel(L1) }
    <condition>    { (Branch if False to L2) }
    <block>
    ENDWHILE       { (Branch to L1) 
                     PostLabel(L2) }

And the parse function for while is like this:

/**
 * parse while statement
 * WHILE <condition> <block> ENDWHILE
 */
fun parseWhile() {
    inp.getNextChar()
    val label1 = newLabel()
    val label2 = newLabel()
    postLabel(label1)
    parseCondition()
    code.branchIfFalse(label2)
    parseBlock()
    inp.match(endwhileToken)
    code.branch(label1)
    postLabel(label2)
}

We need to remember to declare our while and endwhile tokens and to terminate the loop in parseBlock when we see the endwhile token. This is done easily by updating isEndBlock:

val whileToken = 'w'
val endwhileToken = 'e'

/** check for end of block */
fun isEndBlock(c: Char) = c == endifToken || c == elseToken || c == endwhileToken || isEndProgram(c)

And of course we need to tell parseBlock to check for a while statement:

/**
 * parse a block
 * <block> ::= [ <statement> ] *
 */
fun parseBlock() {
    while (!inp.isEndBlock(inp.nextChar)) {
        when (inp.nextChar) {
            ifToken -> parseIf()
            whileToken -> parseWhile()
            else -> parseOther()
        }
    }
}

As usual compile it and try it out.

An example of an input program with nested while loop can be:

w condition1
    statement1
    w condition2
        statement2
        statement3
    e
e

Give it various combinations of nested loops and ifs to make sure it works. Also give it invalid input, with missing ENDIF or ENDWHILE.

At this point I will pause, to give you time to experiment with these concepts and will come back soon with the 2nd part of Control Structures.

Chapter 4: Interpreters

2nd Nov 202110th Nov 2021 / Marinos / Leave a comment

Based on the original series “Let’s Build a Compiler!” by Jack Crenshaw.

Introduction

As we all know, a compiler parses and translates an input language (in our case numerical expressions and assignments so far) to code that is executed on a target CPU or virtual machine. An interpreter on the other hand, generates no output code. Instead, the arithmetic is computed immediately as the parsing is going on.

As Jack says, our compiler is what is called a “pure” compiler, i.e. every time a construct is recognised, the code is produced immediately. And this is the reason for the not so efficient code. At the other end of the spectrum, a “pure” interpreter will do no translation on the source code. And these are the two “extremes” of translation. I’m sure your gut feel is telling you that in the real world it might be beneficial for a compiler to user techniques used in interpreters and vice-versa.

We know for a fact, that the Microsoft Basic interpreter translated the source code to some intermediate code (tokenizing) to make it easier to parse it in real time. Also an Assembler, looks for constant expressions, calculates them and uses the result instead of producing object code for them.

Let’s think of the below expression:

x = x + 3 – 2 – ( 5 – 4 )

Run it as input to our compiler. It produces 18 lines of object code! But if you look at it better and do the maths, you will see that it translates to

x = x + 0

x = x

which effectively tells us to do nothing at all.

This is the concept of “lazy” translation, which means that you don’t produce code at every action. You only produce code when you have to, and in some cases like above, you don’t produce code at all. Obviously, to implement this concept every parsing routine would have to make decisions whether to produce code or not and this would complicate these routines significantly. So, clearly, this technique is not named “lazy” because it’s easier on the compiler programmer…

Now that we have set the scene, I will present the Kotlin version of Jack’s interpreter.

The Interpreter

Please open a new project in your IDE and copy the InputProgramScanner.kt file as it evolved at the end of chapter 3 and the basic CompilerMain.kt from chapter 1 (which I have renamed to InterpreterMain.kt) to start with. Don’t forget to remove the call to getNextChar from the main program, as this has now been moved to the Init code of class InputProgramScanner. Also in CompilerMain please rename the function initCompiler to initInterpreter.

I’m using the InputProgramScanner from chapter 3 as it already supports white spaces and multi-character identifiers and this makes it a bit more user-friendly.

Given that we will be performing arithmetic, the first thing we need to do is to change the getNumber function to return an integer by changing the declaration and by modifying the return line at the end:

/**
 * get a number
 * <number> ::= [ <digit> ] +
 */
fun getNumber(): Int {
    var value: String = ""
    if (!isNumeric(nextChar)) {
        expected("Number")
    } else {
        while (isNumeric(nextChar)) {
            value += nextChar.toString()
            skipNextChar()
        }
        skipWhite()
    }
    return value.toInt()
}

And let’s have a simplified version of parseExpression, which will also have to return an integer:

/** parse a numeric expression */
fun parseExpression(): Int {
    return inp.getNumber()
}

To make this simple interpreter work, add this line in the main function:

/** main function */
fun main(args: Array<String>) {
    initInterpreter(args)
    println(parseExpression())
}

Compile this and run with an input program consisting of one integer; it will recognise and print that integer.

Once you have this simple version working, let’s move on and extend it with addition and subtraction. Here’s the new parseExpression:

/** parse a numeric expression */
fun parseExpression(): Int {
    var value: Int
    // provision for -Expression and +Expression
    if (inp.isAddop(inp.nextChar))
        value = 0
    else
        value = inp.getNumber()
    while (inp.isAddop(inp.nextChar)) {
        when (inp.nextChar) {
            addOp -> {
                inp.getNextChar()
                value += inp.getNumber()
            }
            subOp -> {
                inp.getNextChar()
                value -= inp.getNumber()
            }
        }
    }
    return value
}

This is of course very similar to the parseExpression from chapter 3, with one little difference. Given that we need both sides of the operation to do the the arithmetic, it’s simpler to just execute the addition or subtraction in-line in this function than define separate add and subtract functions as we did in the previous chapters. If we had chosen to keep them as separate functions, we would have to pass the current value of value to them and they would have to return the result back to the caller, which would add unnecessary complexity. This also shows that the way the compiler has been written in the first 3 chapters is not suitable for lazy evaluation.

Compile it and make sure it works. It can process any expression of integers that includes any number of additions and/or subtractions.

I’m sure you will agree that it will be very easy to add support for multiplication and division. Just add the below parseTerm function and replace every call to getNumber in parseExpression with a call to parseTerm:

/** parse e term */
fun parseTerm(): Int {
    var value: Int = inp.getNumber()
    while (inp.isMulop(inp.nextChar)) {
        when (inp.nextChar) {
            mulOp -> {
                inp.getNextChar()
                value *= inp.getNumber()
            }
            divOp -> {
                inp.getNextChar()
                value /= inp.getNumber()
            }
        }
    }
    return value
}

As usual compile it and run it with any combination of integer additions, subtractions, multiplications and divisions that you can think of, including erroneous input.

And to finish this first part of the interpreter we can add a new function parseFactor and replace all the calls to getNumber in parseTerm with calls to parseFactor. This will enable us to process parenthesised expressions:

/** parse a factor */
fun parseFactor(): Int {
    val value: Int
    if (inp.isLeftParen(inp.nextChar)) {
        inp.getNextChar()
        value = parseExpression()
        inp.match(rightParen)
    }
    else
        value = inp.getNumber()
    return value
}

Please compile it and try it out, make sure it handle any complex parenthesised expression and try to break it by giving it wrong input.

At this stage Jack talks about “a little philosophy” where he is giving us his thoughts about using the stack in expression parsing and how we are getting this for free here by using recursion in a high-level language like Kotlin (or Pascal which is what he used). I would invite you to read the original.

Using Variables

When we introduced variables in our compiler, it was a very simple matter as we just passed the variable name to the Assembler to deal with it. Here it’s slightly more complicated as we have to find a place to store the name of each variable and it’s value. Those of you who know Java, must have guessed that a good solution is a Map with the variable name as the key and the variable’s value as the data. So, here goes:

Add the Map that will hold all our variables:

// the variables' space
var variable: MutableMap<String, Int> = mutableMapOf()

Modify parseFactor to use the values stored in this map when a variable is referenced:

/** parse a factor */
fun parseFactor(): Int {
    var value: Int = 0
    if (inp.isLeftParen(inp.nextChar)) {
        inp.getNextChar()
        value = parseExpression()
        inp.match(rightParen)
    }
    else if (inp.isAlpha(inp.nextChar)) {
        val varName = inp.getName()
        val temp: Int? = variable[varName]
        if (temp == null)
            abort("variable ${varName} not initialised")
        else
            value = temp
    }
    else
        value = inp.getNumber()
    return value
}

Add a parseAssignment function to set a variable to a value:

/** parse an assignment */
fun parseAssignment(): Int {
    val identifier = inp.getName()
    inp.match(equalsOp)
    variable[identifier] = parseExpression()
    // return value used for testing purposes only
    return variable[identifier] ?: 0
}

You can only partly verify parseFactor as it will now fail in the first mention of any variable (as it’s unassigned). You can test parseAssignement if you change the main function to print the return value of parseAssignemt and run the interpreter with an assignment statement using only constant values.

And this brings us naturally to processing of newlines. This will enable us to feed a multi-line program to our interpreter and verify the result of each line. All we need to do is to add a function that will skip any newline character and change the main function to parse assignments in a loop, while skipping new lines (please remember also to skip any newlines in the initialisation code in InputProgramScanner.kt to get rid of any empty lines in the beginning of your program):

/** skip newline */
fun skipNewline() {
    while (isEndOfLine(nextChar))
        getNextChar()
}

and

/** main function */
fun main(args: Array<String>) {
    initInterpreter(args)
    while (true) {
        println(parseAssignment())
        inp.skipNewline()
    }
}

As you can see, in this test version, main is an endless loop, so how do we get out of it? In my code the loop terminates via the “end of input” test in getNextChar. Jack in his code had used a program terminator character ('.') and broke the loop when he saw that char.

A note on Kotlin:

Those of you who know Java would know that you can update a map or retrieve a value from a map using:

map.put(key, value) and

value = map.get(key)

As you can see above we can now do in Kotlin

map[key] = value

and value = map[key]

…and those of you who happen to know Pascal, may have noticed the similarity with Jack’s Turbo Pascal code:

Table: Array['A'..'Z'] of integer;
...
Table['A'] := 25;
...
Value := Table['A'];

Also, Kotlin, unlike Java, does not normally allow null values. One of Kotlin’s aims was to eliminate the NullPointerException that has caused so many program crashes and so much pain to so many of us. If however a variable may have a null value for a good reason (as in the return value from the map in case the variable name has not been saved before) then it has to be declared as above Int? as opposed to Int. You can read more about Null Safety in Kotlin here.

With these powerful features in mind, we can rewrite parseFactor and make it a bit more compact using the Elvis Operator:

/** parse a factor */
fun parseFactor(): Int {
    val value: Int
    if (inp.isLeftParen(inp.nextChar)) {
        inp.getNextChar()
        value = parseExpression()
        inp.match(rightParen)
    }
    else if (inp.isAlpha(inp.nextChar)) {
        val varName = inp.getName()
        value = variable[varName] ?: abort("variable ${varName} not initialised")
    }
    else
        value = inp.getNumber()
    return value
}

In order to use the abort function at the right-hand side of the Elvis operator I have used the trick to declare it as Int. The Kotlin compiler does not complain about the fact that abort does not return any value, because the last statement in it is exitProcess (so no one would care about whatever value it returned).

/** abort compilation */
fun abort(errMsg: String): Int {
    error(errMsg)
    exitProcess(1)
}

Input and Output

To make our interpreter a bit more interactive, Jack concluded his original chapter 4 by adding input and output functions. The token '?' is the “read” statement and the token '!' is the “write” statement. Below are the respective functions:

/** input function */
fun input() {
    inp.getNextChar()
    val varName = inp.getName()
    println("enter value for [${varName}]")
    val s: String? = readLine()
    variable[varName] = s?.toInt() ?: 0
}

/** output function */
fun output() {
    inp.getNextChar()
    val varName = inp.getName()
    println("value of [${varName}] = ${variable[varName] ?: abort("variable ${varName} not initialised")}")
}

I could not resist the “temptation” to make them a tiny bit more complex than in Jack’s version, so I have added prompts to tell us which variable we are inputting the value for, or seeing the value of. Also, please note the safeguards against null values and the use of the Elvis operator (in case it has gone unnoticed, you can read more about Null Safety in Kotlin here).

And here’s the loop in the main function, which now becomes:

/** main function */
fun main(args: Array<String>) {
    initInterpreter(args)
    while (true) {
        when (inp.nextChar) {
            inToken -> input();
            outToken -> output()
            else -> parseAssignment()
        }
        inp.skipNewline()
    }
}

We also need to add to our global definitions:

// input / output
val inToken = '?'
val outToken = '!'

Try it out. Use for example the below as your input program:

? BaseLength
? BaseWidth
BaseArea = BaseLength * BaseWidth
! BaseArea
? Height
Volume = BaseArea * Height
! Volume

Wow! We now have a fully functional calculator that can do addition, subtraction, multiplication and division, and process a series of any kind of complex numerical expressions including parenthesis. It also has a practically infinite number of “memory” locations where we can store anything we want during our calculation. Feel free to add real numbers, power of, square root, logarithms, sine, cosine, etc… and you have a proper scientific calculator. Those of you inclined towards mobile apps, can take this code, add a proper graphical user interface and create your own calculator app for Android and put it up in the AppStore (not for profit of course). If you would like to do this, please ensure you contact me first, and also mention my name and of course Jack’s name in your credits.

Coming up next: Control Structures

Below is the full code for this version of our Interpreter:

InterpreterMain.kt

package you.set.this

import kotlin.system.exitProcess

/**
 * A Simple Interpreter
 * Based on the Let's Build a Compiler! series by Jack Crenshaw
 * Version 1.0 21.10.2021
 */

// the input program scanner
lateinit var inp: InputProgramScanner

/** report an error */
fun error(errMsg: String) {
    System.err.println("Error: $errMsg")
}

/** abort compilation */
fun abort(errMsg: String): Int {
    error(errMsg)
    exitProcess(1)
}

/** print message and exit */
fun exit(msg: String) {
    System.err.println(msg)
    exitProcess(0)
}

/** report what was expected and abort */
fun expected(expMsg: String) = abort("Expected $expMsg \n found [${inp.nextChar}]")

/** compiler initialisation */
fun initInterpreter(args: Array<String>) {
    if (args.isEmpty())
        abort("no input file specified")
    // get a new input program scanner object - initialise from input file
    inp = InputProgramScanner(args[0])
}

/** main function */
fun main(args: Array<String>) {
    initInterpreter(args)
    while (true) {
        when (inp.nextChar) {
            inToken -> input();
            outToken -> output()
            else -> parseAssignment()
        }
        inp.skipNewline()
    }
}

InputProgramScanner.kt

package you.set.this

import java.io.File

/**
 * The input program scanner class
 * Performs the lexical scanner functions
 * Processes the char-by-char input and returns the tokens from the input stream
 */

/////////////////////////////////////////////////////////
// global special characters definitions
/////////////////////////////////////////////////////////
// numeric operators
val addOp: Char = '+'
val subOp: Char = '-'
val mulOp: Char = '*'
val divOp: Char = '/'
// parentheses
val leftParen: Char = '('
val rightParen: Char = ')'
// equals
val equalsOp = '='
// input / output
val inToken = '?'
val outToken = '!'
/////////////////////////////////////////////////////////

/** this class implements the lexical scanner */
class InputProgramScanner(inputFile: String = "") {

    // the input program as string
    var inputProgram: String = ""
    var indx = 0

    // the next character from input
    // this is our lookahead character
    var nextChar: Char = ' '

    /** initialisation code */
    init {
        try {
            val f = File(inputFile)
            // read the whole program into a string
            // add a dummy \n at the end
            // this way the lookahead will work properly when we reach the end of input
            inputProgram = f.readText() + "\n"
            // set the lookahead character to the first input char
            getNextChar()
            skipNewline()
        } catch (e: Exception) {
            abort("could not open file [$inputFile]")
        }
    }

    /** get the next character from input after skipping white spaces */
    fun getNextChar() {
        skipNextChar()
        skipWhite()
    }

    /** set the lookahead character to the next char from input */
    fun skipNextChar() {
        if (indx < inputProgram.length)
            nextChar = inputProgram[indx++]
        else
            exit("end of input")
    }

    /** skip white spaces */
    fun skipWhite() {
        while (isWhite(nextChar))
            skipNextChar()
    }

    /** skip newline */
    fun skipNewline() {
        while (isEndOfLine(nextChar))
            getNextChar()
    }

    /** match a specific input char */
    fun match(x: Char) {
        // return a match when the next char matches x
        if (nextChar == x)
            getNextChar()
        else
            expected("'$x'")
    }

    /**
     * get an identifier
     * <identifier> ::= <alpha> [ <alphanumeric> | <_> ] *
     */
    fun getName(): String {
        var token: String = ""
        if (!isAlpha(nextChar))
            expected("Identifier")
        else {
            while (isAlphanumeric(nextChar)) {
                token += nextChar.uppercase()
                skipNextChar()
            }
            skipWhite()
        }
        return token
    }

    /**
     * get a number
     * <number> ::= [ <digit> ] +
     */
    fun getNumber(): Int {
        var value: String = ""
        if (!isNumeric(nextChar)) {
            expected("Number")
        } else {
            while (isNumeric(nextChar)) {
                value += nextChar.toString()
                skipNextChar()
            }
            skipWhite()
        }
        return value.toInt()
    }

    /** check for an alpha char */
    fun isAlpha(c: Char): Boolean = c.uppercaseChar() in 'A'..'Z'

    /** check for a numeric char */
    fun isNumeric(c: Char): Boolean = c in '0'..'9'

    /** check for alphanumeric */
    fun isAlphanumeric(c: Char): Boolean = isAlpha(c) || isNumeric(c) || c == '_'

    /** check for an "addop" (+,-) */
    fun isAddop(c: Char): Boolean = c == addOp || c == subOp

    /** check for a "mulop" (*,/) */
    fun isMulop(c: Char): Boolean = c == mulOp || c == divOp

    /** check for left parenthesis */
    fun isLeftParen(c: Char): Boolean = c == leftParen

    /** check for a white space */
    fun isWhite(c: Char): Boolean = c == ' ' || c == '\t'

    /** check for end of line */
    fun isEndOfLine(c: Char): Boolean = c == '\n' || c == '\r'
}

ProgramParser.kt

package you.set.this

// the variables' space
var variable: MutableMap<String, Int> = mutableMapOf()

/** parse an assignment */
fun parseAssignment(): Int {
    val identifier = inp.getName()
    inp.match(equalsOp)
    variable[identifier] = parseExpression()
    // return value used for testing purposes only
    return variable[identifier] ?: 0
}

/** parse a numeric expression */
fun parseExpression(): Int {
    var value: Int
    // provision for -Expression and +Expression
    if (inp.isAddop(inp.nextChar))
        value = 0
    else
        value = parseTerm()
    while (inp.isAddop(inp.nextChar)) {
        when (inp.nextChar) {
            addOp -> {
                inp.getNextChar()
                value += parseTerm()
            }
            subOp -> {
                inp.getNextChar()
                value -= parseTerm()
            }
        }
    }
    return value
}

/** parse e term */
fun parseTerm(): Int {
    var value: Int = parseFactor()
    while (inp.isMulop(inp.nextChar)) {
        when (inp.nextChar) {
            mulOp -> {
                inp.getNextChar()
                value *= parseFactor()
            }
            divOp -> {
                inp.getNextChar()
                value /= parseFactor()
            }
        }
    }
    return value
}

/** parse a factor */
fun parseFactor(): Int {
    val value: Int
    if (inp.isLeftParen(inp.nextChar)) {
        inp.getNextChar()
        value = parseExpression()
        inp.match(rightParen)
    }
    else if (inp.isAlpha(inp.nextChar)) {
        val varName = inp.getName()
        value = variable[varName] ?: abort("variable ${varName} not initialised")
    }
    else
        value = inp.getNumber()
    return value
}

/** input function */
fun input() {
    inp.getNextChar()
    val varName = inp.getName()
    println("enter value for [${varName}]")
    val s: String? = readLine()
    variable[varName] = s?.toInt() ?: 0
}

/** output function */
fun output() {
    inp.getNextChar()
    val varName = inp.getName()
    println("value of [${varName}] = ${variable[varName] ?: abort("variable ${varName} not initialised")}")
}

Chapter 3: More Expressions – Variables, Functions

27th Oct 202110th Nov 2021 / Marinos / Leave a comment

Based on the original series “Let’s Build a Compiler!” by Jack Crenshaw.

Introduction

So far this tutorial has shown us how to parse and translate numeric expressions up to arbitrary levels of complexity. However, we still have a few restrictions:

no variables are allowed – only numeric factors
the numeric factors were limited to single digits
no white spaces allowed

In this third chapter we will remove these restrictions and will also add support for assignments and simple calls to functions. Please remember that the second restriction is self-imposed and I fully agree with Jack that it helps simplify the code so that we can focus on the important points first. When we are ready to get rid of it, it will be very easy.

Variables

Most expressions in real life programming involve variables such as:

b * b + 4 * a + c

If we go back to our current definition of factor

<factor> ::= ( <expression> ) | <number>

we can easily conclude that the variable references in the above expression are just factors. So the factor definition becomes now:

<factor> ::= ( <expression> ) | <number> | <variable>

So, from this we can see that we have an easy choice: if the next character is a digit we have a number, if it’s a letter we have a variable. Same as with the number, we will just have to set the Accumulator to the value of the variable. Sticking with the 68000 assembly code for now, we are learning form Jack’s tutorial that this is done in a position-independent manner, i.e. it’s relative to the program counter:

MOVE X(PC),D0

where X is the variable name. Based on the above the method parseFactor now becomes:

/**
 * parse a factor
 * <factor> ::= ( <expression> ) | <number> | <variable>
 */
fun parseFactor() {
        if (inp.isLeftParen(inp.nextChar)) {
                // ( Expression )
                inp.getNextChar()
                parseExpression()
                inp.match(rightParen)
        }
        else if (inp.isAlpha(inp.nextChar))
                // Variable
                code.setAccumulatorToVar(inp.getName())
        else
                // Number
                code.setAccumulator(inp.getNumber())
}

Please remember to add the new 68000 instruction in M68000Instructions.kt:

/** set accumulator to variable */
fun setAccumulatorToVar(identifier: String) = outputCodeNl("MOVE ${identifier}(PC),D0")

So, with just 2 extra lines of code in the parser we have added support for variables! And as before our parseFactor function matches exactly the BNF definition. As usual, try it with various input “programs” and ensure it does proper error checking.

Functions

Apparently functions is one of the few common things that is supported by many languages. Even though this is too early stage to introduce functions, because we have not addressed the question about passing parameters, Jack includes simple function calls without parameters here, as it brings up an important point.

Until now, the parsing was predictive, i.e. by looking at the next character we could tell what follows. However, with the introduction of functions, if the next character is a letter, we don’t really know whether it’s a variable or a function. This is solved by following the C language standard where a function name is always followed by () – empty if there are no parameters.

Given that we cannot handle parameters yet, our function calls will have the form:

f()

Without parameters to take care of, there is nothing else to do than call that function. Given that now we have two possibilities if we encounter an “alpha” in parseFactor, this will be dealt with in a separate parse function.

Please modify parseFactor as below and add the new function parseIdentifier:

/**
 * parse a factor
 * <factor> ::= ( <expression> ) | <number> | <identifier>
 */
fun parseFactor() {
        if (inp.isLeftParen(inp.nextChar)) {
                // ( Expression )
                inp.getNextChar()
                parseExpression()
                inp.match(rightParen)
        }
        else
        if (inp.isAlpha(inp.nextChar))
                // Identifier
                parseIdentifier()
        else
                // Number
                code.setAccumulator(inp.getNumber())
}

/**
 * parse an identifier
 * <identifier> ::= <variable> | <function>
 */
fun parseIdentifier() {
        val identName: String = inp.getName()
        if (inp.isLeftParen(inp.nextChar)) {
                // function
                inp.getNextChar()
                inp.match(rightParen)
                code.callSubroutine(identName)
        }
        else
                // variable
                code.setAccumulatorToVar(identName)
}

And of course add the new instruction into M68000Instructions.kt

/** call a subroutine */
fun callSubroutine(subroutine: String) = outputCodeNl("BSR ${subroutine}")

Please compile and run this version and as always test the error handling.

What Jack has shown us here is that with very little added complexity we have increased the scope of our compiler tremendously. Where the predictive approach cannot tell us what follows (at the point where parseFactor sees a letter coming next) we just branch to another function and let it deal with it by applying predictive approach further down the line (read that identifier name and check if the next character is a left parenthesis). And this technique can extend many levels down.

More on Error Handling

At this point let me point out that our compiler is performing decent error handling, catching invalid expressions and aborts the compilation in these scenarios. As Jack mentioned, we don’t have any direct error checks in any of the parser functions, so we seem to be getting this error handling almost “for free”. And this is indeed the case. Our parser relies on the error handling done in match, getNumber and getName.

Having said that, those of you who are good at breaking programs must have noticed that the expressions:

1+2<space>3*4 or

1+2@3*4

both stop after processing 1+2. And if you think about it, this is what we have told our compiler to do: process an Expression and stop at the end of it (i.e. when the next character is not a recognised as part of the expression).

We temporarily can fix this gap in our error handling by agreeing that our expressions should end with a newline. To enforce this add the following code into our main function just after the call to parseExpression:

/** main function */
fun main(args: Array<String>) {
    initCompiler(args)
    parseExpression()
    if (!inp.isEndOfLine(inp.nextChar))
        expected("Newline")
}

Don’t forget to add the corresponding function into InputProgramScanner.kt:

/** check for end of line */
fun isEndOfLine(c: Char): Boolean = c == '\n' || c == '\r'

As you can see the function that checks for end of line checks for either CR or LF to cover both Windows and Unix. However, you are free to fine-tune it to make sure it suits your environment.

As usual, compile it and try it. It should now recognise and flag expressions not properly terminated with a new line.

Assignment Statements

So, now that we have successfully parsed complex numerical expressions, let’s take the natural next step and save the output of the expression, which brings us to assignments. An assignment will usually have the form:

<assignment> ::= <identifier> = <expression>

In order to parse this we need another function that replicates the above BNF:

/**
 * parse assignment
 * <assignment> ::= <identifier> = <expression>
 */
fun parseAssignment() {
        var identName: String = inp.getName()
        inp.match(equalsOp)
        parseExpression()
        code.assignment(identName)
}

Also need to add one more global declaration in our InputProgramScanner.kt:

// equals
val equalsOp = '='

And of course the necessary 68000 code (this is very CPU specific so I trust Jack here):

/** assignment var to value */
fun assignment(identifier: String) {
    outputCodeNl("LEA $identifier(PC),A0")
    outputCodeNl("MOVE D0,(A0)")
}

Finally, replace the call to parseExpression in function main with a call to parseAssignement:

/** main function */
fun main(args: Array<String>) {
    initCompiler(args)
    parseAssignment()
    if (!inp.isEndOfLine(inp.nextChar))
        expected("Newline")
}

Once again, by adding 6 lines of code in the parser we have extended the functionality of our compiler dramatically! And as before, error checking comes for free as the new parsing function relies on match and getName to do the error checking.

If expressions was the only thing we would need for a program then we would be done. However, programs also need control statements, if-then-else, loops, etc. These are covered in chapter 5. What we have achieved up to now though is quite important, as expressions are one of the most challenging topics in a programming language.

Multi-Character Tokens

As I have said already, I’m following Jack’s approach by sticking to the “keep it simple” concept, hence the restrictions of single character tokens. This approach is used in some of the following chapters as well in order to avoid complexity and focus on the important issue at hand. It is very easy to be removed though as you can see below.

So, if we think of the identifier as a letter followed by alphanumeric characters (letters or numbers) or '_', and a number as a series of numeric digits, then we can easily modify getName and getNumber as per below:

Rename getNextChar to skipNextChar and define a new function getNextChar that for now just calls skipNextChar. This will be used in the following topic (white spaces):

/** get the next character from input */
fun getNextChar() {
    skipNextChar()
}

/** set the lookahead character to the next char from input */
fun skipNextChar() {
    if (indx < inputProgram.length)
        nextChar = inputProgram[indx++]
    else
        exit("end of input")
}

Change getName and getNumber as per below:

/** get an identifier */
fun getName(): String {
    var token: String = ""
    if (!isAlpha(nextChar))
        expected("Identifier")
    else {
        while (isAlphanumeric(nextChar)) {
            token += nextChar.uppercase()
            skipNextChar()
        }
    }
    return token
}

/**
 * get a number
 * <number ::= [ <digit> ] +
 */
fun getNumber(): String {
    var value: String = ""
    if (!isNumeric(nextChar)) {
        expected("Number")
    } else {
        while (isNumeric(nextChar)) {
            value += nextChar.toString()
            skipNextChar()
        }
    }
    return value
}

Add the following as well to our InputProgramScanner class:

/** check for alphanumeric */
fun isAlphanumeric(c: Char): Boolean = isAlpha(c) || isNumeric(c) || c == '_'

That seemed really easy. Compile and test this code. Now you can parse expressions like

var1=1234*x1+95-46*y1

Make sure it flags invalid characters in the expression.

White Spaces

The white space restriction is also one of those little things that annoy us, so let’s make our compiler more user-friendly by removing it. Not surprisingly Jack is suggesting that we find a simple rule to skip the white space and apply it everywhere. Until now, the lookahead character contained the next usable character in the input stream and this has always been true as there are no white spaces in our input code. The lookahead character is set by getNextChar.

So, naturally, we can enhance getNextChar to skip any white spaces and get us the next usable character from the input ignoring any white spaces.

Please add the new functions isWhite and skipWhite and modify getNextChar as below:

/** get the next character from input after skipping white spaces */
fun getNextChar() {
    skipNextChar()
    skipWhite()
}

/** skip white spaces */
fun skipWhite() {
    while (isWhite(nextChar))
        skipNextChar()
}

/** check for a white space */
fun isWhite(c: Char): Boolean = c == ' ' || c == '\t'

Add a call to skipWhite in getName and getNumber at the end of the loop as skipNextChar, which is used there, does not skip white spaces:

/**
 * get an identifier
 * <identifier> ::= <alpha> [ <alphanumeric> | <_> ] *
 */
fun getName(): String {
    var token: String = ""
    if (!isAlpha(nextChar))
        expected("Identifier")
    else {
        while (isAlphanumeric(nextChar)) {
            token += nextChar.uppercase()
            skipNextChar()
        }
        skipWhite()
    }
    return token
}

/**
 * get a number
 * <number> ::= [ <digit> ] +
 */
fun getNumber(): String {
    var value: String = ""
    if (!isNumeric(nextChar)) {
        expected("Number")
    } else {
        while (isNumeric(nextChar)) {
            value += nextChar.toString()
            skipNextChar()
        }
        skipWhite()
    }
    return value
}

Here there is another difference between my code and Jack’s. I have used getNextChar where match was used when there was no need to match anything (in add, subtract, multiply and divide) and this has led to the need to create skipNextChar used in getNumber and getName. This function does not skip white spaces but the new version of getNextChar does, which means that by using skipWhite in it, I don’t have to do it in all the places where getNextChar is used. Given that I need to skip any white spaces every time getNextChar is called, in order to get the next usable character, it makes sense in my mind to do this inside getNextChar once. Of course the price I have to pay for this, is that I need two flavours of this function: skipNextChar that does not skip white spaces and getNexrtChar that skips white spaces.

Coming up next: Interpreters

Jack chose to dedicate a chapter to Interpreters, “even if you have no interest in interpreters”. With this next chapter he is giving us quite useful insight that will be used later in the following chapters of our compiler.

Given that there were quite a few changes in many parts of the compiler in this chapter, I’m listing the full code here.

If you look at the below code carefully, you will see that I have moved the initial call to getNextChar (the one that sets the first lookahead character to get our compiler going) from the initialisation code of our main program in CompilerMain.kt to the initialisation code of the class InputProgramScanner. I feel it’s place is there because this is the class that deals with our input stream so this where the lookahead character should be initiliased.

CompilerMain.kt

package you.set.this

import kotlin.system.exitProcess

/**
 * A Simple Compiler
 * Based on the Let's Build a Compiler! series by Jack Crenshaw
 * This version produces Assembly language code for the 68000 microprocessor
 * A future version will produce code for the TI-59 Calculator
 * It will produce output that uses the TI-59 mnemonics that can be keyed directly to the calculator
 * Version 1.0 01.10.2021
 */

// the input program scanner
lateinit var inp: InputProgramScanner
// the Motorola 68000 instruction set
val code = M68000Instructions()

/** report an error */
fun error(errMsg: String) {
    System.err.println("Error: $errMsg")
}

/** abort compilation */
fun abort(errMsg: String) {
    error(errMsg)
    exitProcess(1)
}

/** print message and exit */
fun exit(msg: String) {
    System.err.println(msg)
    exitProcess(0)
}

/** report what was expected and abort */
fun expected(expMsg: String) = abort("Expected $expMsg \n found [${inp.nextChar}]")

/** compiler initialisation */
fun initCompiler(args: Array<String>) {
    if (args.isEmpty())
        abort("no input file specified")
    // get a new input program scanner object - initialise from input file
    inp = InputProgramScanner(args[0])
}

/** main function */
fun main(args: Array<String>) {
    initCompiler(args)
    parseAssignment()
    if (!inp.isEndOfLine(inp.nextChar))
        expected("Newline")
}

InputProgramScanner.kt

package you.set.this

import java.io.File

/**
 * The input program scanner class
 * Performs the lexical scanner functions
 * Processes the char-by-char input and returns the tokens from the input stream
 */

/////////////////////////////////////////////////////////
// global special characters definitions
/////////////////////////////////////////////////////////
// numeric operators
val addOp: Char = '+'
val subOp: Char = '-'
val mulOp: Char = '*'
val divOp: Char = '/'
// parentheses
val leftParen: Char = '('
val rightParen: Char = ')'
// equals
val equalsOp = '='
/////////////////////////////////////////////////////////

/** this class implements the lexical scanner */
class InputProgramScanner(inputFile: String = "") {

    // the input program as string
    var inputProgram: String = ""
    var indx = 0

    // the next character from input
    // this is our lookahead character
    var nextChar: Char = ' '

    /** initialisation code */
    init {
        try {
            val f = File(inputFile)
            // read the whole program into a string
            // add a dummy \n at the end
            // this way the lookahead will work properly when we reach the end of input
            inputProgram = f.readText() + "\n"
            // set the lookahead character to the first input char
            getNextChar()
        } catch (e: Exception) {
            abort("could not open file [$inputFile]")
        }
    }

    /** get the next character from input after skipping white spaces */
    fun getNextChar() {
        skipNextChar()
        skipWhite()
    }

    /** set the lookahead character to the next char from input */
    fun skipNextChar() {
        if (indx < inputProgram.length)
            nextChar = inputProgram[indx++]
        else
            exit("end of input")
    }

    /** skip white spaces */
    fun skipWhite() {
        while (isWhite(nextChar))
            skipNextChar()
    }

    /** match a specific input char */
    fun match(x: Char) {
        // return a match when the next char matches x
        if (nextChar == x)
            getNextChar()
        else
            expected("'$x'")
    }

    /**
     * get an identifier
     * <identifier> ::= <alpha> [ <alphanumeric> | <_> ] *
     */
    fun getName(): String {
        var token: String = ""
        if (!isAlpha(nextChar))
            expected("Identifier")
        else {
            while (isAlphanumeric(nextChar)) {
                token += nextChar.uppercase()
                skipNextChar()
            }
            skipWhite()
        }
        return token
    }

    /**
     * get a number
     * <number> ::= [ <digit> ] +
     */
    fun getNumber(): String {
        var value: String = ""
        if (!isNumeric(nextChar)) {
            expected("Number")
        } else {
            while (isNumeric(nextChar)) {
                value += nextChar.toString()
                skipNextChar()
            }
            skipWhite()
        }
        return value
    }

    /** check for an alpha char */
    fun isAlpha(c: Char): Boolean = c.uppercaseChar() in 'A'..'Z'

    /** check for a numeric char */
    fun isNumeric(c: Char): Boolean = c in '0'..'9'

    /** check for alphanumeric */
    fun isAlphanumeric(c: Char): Boolean = isAlpha(c) || isNumeric(c) || c == '_'

    /** check for an "addop" (+,-) */
    fun isAddop(c: Char): Boolean = c == addOp || c == subOp

    /** check for a "mulop" (*,/) */
    fun isMulop(c: Char): Boolean = c == mulOp || c == divOp

    /** check for left parenthesis */
    fun isLeftParen(c: Char): Boolean = c == leftParen

    /** check for end of line */
    fun isEndOfLine(c: Char): Boolean = c == '\n' || c == '\r'

    /** check for a white space */
    fun isWhite(c: Char): Boolean = c == ' ' || c == '\t'
}

ProgramParser.kt

package you.set.this

/**
 * Program parsing - module
 *
 */

/**
 * parse assignment
 * <assignment> ::= <identifier> = <expression>
 */
fun parseAssignment() {
        var identName: String = inp.getName()
        inp.match(equalsOp)
        parseExpression()
        code.assignment(identName)
}

/**
 * parse a numeric expression
 * <expression> ::= <term> [ <addop> <term> ] *
 */
fun parseExpression() {
        if (inp.isAddop(inp.nextChar))
        // "trick" to deal with -Expression or +Expression
                code.clearAccumulator()
        else
                parseTerm()
        while (inp.isAddop(inp.nextChar)) {
                code.saveAccumulator()
                when (inp.nextChar) {
                        addOp -> add()
                        subOp -> subtract()
                }
        }
}

/**
 * parse a term
 * <term> ::= <factor> [ <mulop> <factor> ] *
 */
fun parseTerm() {
        parseFactor()
        while (inp.isMulop(inp.nextChar)) {
                code.saveAccumulator()
                when (inp.nextChar) {
                        mulOp -> multiply()
                        divOp -> divide()
                }
        }
}

/**
 * parse a factor
 * <factor> ::= ( <expression> ) | <number> | <identifier>
 */
fun parseFactor() {
        if (inp.isLeftParen(inp.nextChar)) {
                // ( Expression )
                inp.getNextChar()
                parseExpression()
                inp.match(rightParen)
        }
        else
        if (inp.isAlpha(inp.nextChar))
                // Identifier
                parseIdentifier()
        else
                // Number
                code.setAccumulator(inp.getNumber())
}

/**
 * parse an identifier
 * <identifier> ::= <variable> | <function>
 */
fun parseIdentifier() {
        val identName: String = inp.getName()
        if (inp.isLeftParen(inp.nextChar)) {
                // function
                inp.getNextChar()
                inp.match(rightParen)
                code.callSubroutine(identName)
        }
        else
                // variable
                code.setAccumulatorToVar(identName)
}

/** parse an addition */
fun add() {
        inp.getNextChar()
        parseTerm()
        code.addToAccumulator()
}

/** parse a subtraction */
fun subtract() {
        inp.getNextChar()
        parseTerm()
        code.subFromAccumulator()
}

/** parse a multiplication */
fun multiply() {
        inp.getNextChar()
        parseFactor()
        code.multiplyAccumulator()
}

/** parse a division */
fun divide() {
        inp.getNextChar()
        parseFactor()
        code.divideAccumulator()
}

M68000Instructions.kt

package you.set.this

class M68000Instructions {

    /** output code */
    fun outputCode(s: String) = print("\t$s")

    /** output code with newline */
    fun outputCodeNl(s: String) = outputCode("$s\n")

    /** set accumulator to a value */
    fun setAccumulator(value: String) = outputCodeNl("MOVE #${value},D0")

    /** clear accumulator */
    fun clearAccumulator() = outputCodeNl("CLR D0")

    /** push accumulator to the stack */
    fun saveAccumulator() = outputCodeNl("MOVE D0,-(SP)")

    /** add top of stack to accumulator */
    fun addToAccumulator() = outputCodeNl("ADD (SP)+,D0")

    /** subtract top of stack from accumulator */
    fun subFromAccumulator() {
        outputCodeNl("SUB (SP)+,D0")
        // negate accumulator as the result is the wrong way round
        outputCodeNl("NEG D0")
    }

    /** multiply accumulator by top of stack */
    fun multiplyAccumulator() = outputCodeNl("MULS (SP)+,D0")

    /** divide accumulator by top of stack */
    fun divideAccumulator() {
        outputCodeNl("MOVE (SP)+,D1")
        outputCodeNl("DIVS D1,D0")
    }

    /** set accumulator to variable */
    fun setAccumulatorToVar(identifier: String) = outputCodeNl("MOVE ${identifier}(PC),D0")

    /** call a subroutine */
    fun callSubroutine(subroutine: String) = outputCodeNl("BSR ${subroutine}")

    /** assignment var to value */
    fun assignment(identifier: String) {
        outputCodeNl("LEA $identifier(PC),A0")
        outputCodeNl("MOVE D0,(A0)")
    }
}

Chapter 2: Numerical Expression Parsing

20th Oct 202110th Nov 2021 / Marinos / Leave a comment

Based on the original series “Let’s Build a Compiler!” by Jack Crenshaw.

Getting Started

Those of you who have already read Chapter 1 will know what this is about. You will also have copied the cradle code in your Kotlin IDE, complied it and ready to go.

In this chapter Jack teaches us how to parse and translate numerical expressions. For purposes of definition a numerical expression is the right hand of an equation, as in:

x = 2 * y + 3 / (4 * z)

I will follow Jack’s method and will take this in *very* small steps.

Single Digits

In keeping with the simplicity theme, we are starting with the absolutely simplest case that we can think of: a numerical expression consisting of a single digit.

Let’s add this code:

Add the line parseExpression to the main function

/** main function */
fun main(args: Array<String>) {
    initCompiler(args)
    parseExpression()
}

Add the code variable to the main program CompilerMain.kt in the global variables definition section. This object will have all the necessary functions to produce the assembly code for our target cpu.

// the input program scanner
lateinit var inp: InputProgramScanner
// the Motorola 68000 instruction set
val code = M68000Instructions()

New file ProgParser.kt

We will add all the expression parsing functions into this file.

package mpdev.compiler.m68kcompiler
/**
 * Program parsing - module
 *
 */

/** parse a numeric expression */
fun parseExpression() {
        code.setAccumulator(inp.getNumber())
}

New file M68000Instructions.kt

This is where we will produce the 68000 Assembly code for each operation. If we do this properly, we should be able to change the functions in this module and produce code for some other microprocessor (or the TI59 which is the end-goal of this blog).

As always I will keep with Jack’s standard and use the D0 register of the 68000 as the register that holds the result, which going forward I will call “Accumulator” to keep with the Z80 architecture.

Please note the functions outputCode and outputCodeNl : they send the resulting code to the output stream without or with a newline respectively.

package mpdev.compiler.m68kcompiler

class M68000Instructions {

    /** output code */
    fun outputCode(s: String) = print("\t$s")

    /** output code with newline */
    fun outputCodeNl(s: String) = outputCode("\t$s\n")

    /** set accumulator to a value */
    fun setAccumulator(value: String) = outputCodeNl("MOVE #${value},D0")

}

Now compile and run our compiler with a text file as the only command line argument containing our “program”: any single decimal digit. Observe the corresponding MOVE assembly instructions that will be produced on your screen. Also, try it with an input file that contains a letter, a special character or a space to test the error handling.

So, we now have a working translator!

OK, it’s pretty limited, but don’t be too quick to dismiss it. It does, on a very limited scale, what a proper compiler should do: it correctly recognises legal statements in the input language as per our definition, it produces correct, executable assembler code, suitable for assembling into object code, and, just as importantly, it correctly recognises statements that are not legal and produces a meaningful error message.

Binary (Two-operand) Expressions

Let’s expand a bit. Admittedly, an “expression” consisting of only one character is not going to take us anywhere far, so let’s see how we can extend it. Naturally we would want to handle expressions like:

1+2

5-3

or similarly to Backus-Naur Form:

<term> + | - <term>

For this, we will need a function that recognises a term and leaves the result somewhere, and a function that recognises and distinguishes between '+' and '-' and produces the appropriate code. Jack has chosen the 68000 register D1 for holding that temporary result.

So, here’s what’s needed:

in ProgParser.kt rename the current parseExpression function to parseTerm and create a new function parseExpression :

/** parse a term */
fun parseTerm() {
        code.setAccumulator(inp.getNumber())
}

/**
 * parse a numeric expression
 * <term> +/- <term>
 */
fun parseExpression() {
        parseTerm()
        code.saveAccumulator()
        when (inp.nextChar) {
                addOp -> add()
                subOp -> subtract()
                else -> expected("Addop")
        }
}

In the same file create the two functions that will implement addition and subtraction:

/** parse an addition */
fun add() {
        inp.getNextChar()
        parseTerm()
        code.addToAccumulator()
}

/** parse a subtraction */
fun subtract() {
        inp.getNextChar()
        parseTerm()
        code.subFromAccumulator()
}

In InputProgramScanned.kt add these global variables outside of class InputProgramScanner:

/////////////////////////////////////////////////////////
// global special characters definitions
/////////////////////////////////////////////////////////
// numeric operators
val addOp: Char = '+'
val subOp: Char = '-'
/////////////////////////////////////////////////////////

/** this class implements the lexical scanner */
class InputProgramScanner(inputFile: String = "") {
.......

Add this to file M68000Instructions.kt

/** save accumulator */
fun saveAccumulator() = outputCodeNl("MOVE D0,D1")

/** add temp result to accumulator */
fun addToAccumulator() = outputCodeNl("ADD D1,D0")

/** subtract temp result from accumulator */
fun subFromAccumulator()= outputCodeNl("SUB D1,D0")

Now run it. Try it with various expressions (combinations of two single digits separated by a + or a -) in your input file:

8+5 should give you:

    MOVE #8,D0
    MOVE D0,D1
    MOVE #5,D0
    ADD D1,D0

9-2 should give you

    MOVE #9,D0
    MOVE D0,D1
    MOVE #2,D0
    SUB D1,D0

Also, try some invalid expressions like 5+a or @-2 etc and see the error you are getting.

One thing to note here, is that our code is a bit inefficient. Storing 8 to D0 and then moving D0 to D1 could have been done in just one instruction (store 8 to D1). However, this is the drawback of the simple approach that Jack has chosen and I am following, which focuses on teaching us and on producing code that works.

Speaking of which, our code does not quite work! If you look at the subtraction example above and follow the assembly language instructions, you will see that it actually executes 2-9. We have to reverse the sign of the result in order to make it work. Please change the function subFromAccumulator as below:

/** subtract temporay result from accumulator */
fun subFromAccumulator() {
    outputCodeNl("SUB D1,D0")
    // negate accumulator as the result is the wrong way round
    outputCodeNl("NEG D0")
}

Try it now with a subtraction as input and verify the output.

At this point we have a parser that can recognise the sum or difference between two digits. This a big step ahead compared to earlier where we could only recognise a single digit. Try running it again with a single digit expression in the input file. Well… that did not work, did it? And why would it, we just told our compiler that an expression is two single-digit terms separated by a '+' or a '-'.

But in real life expressions can consist of one or more terms separated by addops ('+' or '-'). In BNF:

<expression> ::= <term> [ <addop> <term> ] *

Looks like the function parseExpression needs to be modified with addition of a loop to match the above definition:

/**
 * parse a numeric expression
 * <expression> ::= <term> [ <addop> <term> ] *
 */
fun parseExpression() {
        parseTerm()
        while (inp.isAddop(inp.nextChar)) {
                code.saveAccumulator()
                when (inp.nextChar) {
                        addOp -> add()
                        subOp -> subtract()
                }
        }
}

You may have noticed that the line else -> expected("Addop") has been removed in this version. This is because it would have been an unreachable instruction.

Also note how well the new version of parseExpression matches the BNF definition and this is the beauty in Jack’s method, which I personally find very powerful. Also don’t forget to add this function into our InputProgramScanner.kt file:

/** check for an "addop" (+,-) */
fun isAddop(c: Char): Boolean = c == addOp || c == subOp

Compile this new version and try it. It should process any combination of any number of single-digit integers separated by '+' or '-'. Also verify that it handles properly error conditions.

Using the Stack

This is a great example of how powerful Jack’s method is, as he leads us from the simplest possible concept to more and more complex aspects.

So, imagine the expression:

2 + (3 + 4 * ( 8 – 6 ) / 5 )

In this example, we store the first intermediate result (2) in D1. Where will we store the 3 and so on? We will soon run out of CPU registers. The answer is, of course, the Stack.

To achieve this we will modify the functions saveAccumulator, addToAccumulator and subFromAccumulator as per below:

/** push accumulator to the stack */
fun saveAccumulator() = outputCodeNl("MOVE D0,-(SP)")

/** add top of stack to accumulator */
fun addToAccumulator() = outputCodeNl("ADD (SP)+,D0")

/** subtract top of stack from accumulator */
fun subFromAccumulator() {
    outputCodeNl("SUB (SP)+,D0")
    // negate accumulator as the result is the wrong way round
    outputCodeNl("NEG D0")
}

Please try the “compiler” again, observe the output and make sure we have not broken anything.

Multiplication and Division

And now it’s time to get more serious. As a primary school kid would point out, expressions can also have multiplications and divisions. They would also point out that we should execute these first and then any additions and subtractions. So, this raises the question of priority.

So in

2 + 3 * 4

we know that we need to multiply first and then add. And as you can guess, here’s where the stack comes handy.

Jack’s top-down parsing technique is ideal to support operator precedence. So, if we extend our current simplified definition of “term” from a single digit to “product of factors” or in BNF

<term> ::= <factor> [ <mulop> <factor> ] *

where factor here is what term used to be until now (i.e. a single digit), then we can see the amazing symmetry between term and expression. This makes it really simple to apply some copy-and-paste and produce the next set of functions to parse and handle multiplication and division.

Rename function parseTerm to parseFactor

/** parse a factor */
fun parseFactor() {
        code.setAccumulator(inp.getNumber())
}

Add the following to ProgParser.kt:

/**
 * parse a term
 * <term> ::= <factor> [ <mulop> <factor> ] *
 */
fun parseTerm() {
        parseFactor()
        while (inp.isMulop(inp.nextChar)) {
                code.saveAccumulator()
                when (inp.nextChar) {
                        mulOp -> multiply()
                        divOp -> divide()
                }
        }
}

/** parse a multiplication */
fun multiply() {
        inp.getNextChar()
        parseFactor()
        code.multiplyAccumulator()
}

/** parse a division */
fun divide() {
        inp.getNextChar()
        parseFactor()
        code.divideAccumulator()
}

Add this to InputProgramScanner.kt in the global vars section

val mulOp: Char = '*'
val divOp: Char = '/'

And inside class InputProgramScanner:

/** check for a "mulop" (*,/) */
fun isMulop(c: Char): Boolean = c == mulOp || c == divOp

Add this to file M68000Instructions.kt

/** multiply accumulator by top of stack */
fun multiplyAccumulator() = outputCodeNl("MULS (SP)+,D0")

/** divide accumulator by top of stack */
fun divideAccumulator() {
    outputCodeNl("MOVE (SP)+,D1")
    outputCodeNl("DIVS D1,D0")
}

Now try it out, e.g. with input “program” 3 + 4 * 2 – 6 / 3 and verify the assembly output. Amazing result – with a few more lines that we added in around 10 minutes, we have doubled the scope of our “compiler”.

Parenthesis

At this stage Jack wraps up this part of the parser by adding parentheses. As we all know, parentheses are a mechanism to force a desired operator precedence:

2 * ( 3 + 4 )

is telling us to add first and then multiply. With parenthesis we can create expressions of any degree of complexity such as:

( 1 + 2 ) / ( ( 3 + 4 ) + ( 5 – 6 ) )

The key to easy incorporate parenthesis into the parser, as Jack pointed out, is to realise that no matter how complicated the expression enclosed by the parentheses may be, to the outside world it just looks like a factor. So, one of the forms of a factor can be:

<factor> ::= ( <expression> ) | <number>

And this is where the recursion comes into the picture: an expression can contain a term that can contain a factor, which contains another expression that contains another term that contains another factor, etc., to infinity (or more realistically, until your Kotlin runtime environment runs out of memory).

Please modify the function parseFactor as below:

/**
 * parse a factor
 * <factor> ::= ( <expression> ) | <number>
 */
fun parseFactor() {
        if (inp.isLeftParenthesis(inp.nextChar)) {
              inp.getNextChar()
              parseExpression()
              inp.match(rightParen)  
        }
        else
              code.setAccumulator(inp.getNumber())
}

Also add this to InputProgramScanner.kt in the global vars section:

// parentheses
val leftParen: Char = '('
val rightParen: Char = ')'

And inside class InputProgramScanner:

/** check for left parenthesis */
fun isLeftParenthesis(c: Char): Boolean = c == leftParen

Once more, we can clearly see the direct analogy between BNF and our (i.e. Jack’s) code. As usual, try it out and give it lots of wrong input to see if it works.

Unary Minus

Before pausing until the next chapter, we need to fix one thing: those of you who have done rigorous testing, would have noticed that our compiler does not recognise -1 as a valid expression. Our parseExpression function expects an expression to start with an integer. By the way, +2 does not work either.

A quick way (but not necessarily the best, as Jack admits) to fix this is to stick an imaginary leading zero in front of this kind of expressions, so that -1 becomes 0-1 and +2 becomes 0+2.

Please modify parseExpression as per below:

/**
 * parse a numeric expression
 * <expression> ::= <term> [ <addop> <term> ] *
 */
fun parseExpression() {
        if (inp.isAddop(inp.nextChar))
                // "trick" to deal with -Expression or +Expression
                code.clearAccumulator()
        else
                parseTerm()
        while (inp.isAddop(inp.nextChar)) {
                code.saveAccumulator()
                when (inp.nextChar) {
                        addOp -> add()
                        subOp -> subtract()
                }
        }
}

Also add the following function to our 68k Instruction set:

/** clear accumulator */
fun clearAccumulator() = outputCodeNl("CLR D0")

When you test your code please remember what it’s supposed to be doing at this stage: it can process any numerical expression (but remember, only one) consisting of a combination of single-digit integers with addition, subtraction, multiplication and division, including parentheses.

A note on my programming style. First, let say once more that I fully agree with Jack’s approach of keeping things simple. Nevertheless, I’ve made a couple of changes in his code:

I’ve declared global variables for the special symbols like numerical operators, parenthesis, etc. Even though this adds lines of code, I personally prefer not to have any direct reference to constant values anywhere in my program other than once as a global declaration. This way if e.g. you decide to use '&' instead of '+' or '{}' instead of '()' you just change it once in the global declaration and your compiler will start recognising '&' as the addition operator and '{' as the left parenthesis.
I have changed the call to match to a call to getNextChar in the beginning of add, subtract, multiply, etc. The reason is that the call to match was partly redundant as the symbol (e.g. the '+', '-', etc) has already been matched and the call to match only served the purpose of advancing to the next input character. Hence, I have replaced it with getNextChar. I would not want the reader to start wondering why we are calling match there and whether he or she is missing something. Jack had already mentioned this point in his original tutorial and made it clear that he prefers to always use match for standardisation. I hope my decision to change this will not cause any issues with following Jack’s code at a later stage. If this proves not to be the case then I would be more than happy to reverse my decision and reinstate the call to match.

And that’s it for this round. At this point Jack gave us in the original blog “a word about optimisation”, which I would invite you to read as it’s quite useful reading that adds more colour into the purpose and goals of Jack’s tutorial.

Coming up next: More Expressions – Variables, Functions

Chapter 1: Introduction

11th Oct 202110th Nov 2021 / Marinos / Leave a comment

Based on the original series “Let’s Build a Compiler!” by Jack Crenshaw.

Introduction

In this Kotlin adaptation of Jack’s original series I will mostly follow the structure that he followed, through which he teaches us in a comprehensive step-by-step tutorial how to design a new programming language and build a working compiler.

I will also follow Jack’s approach and will not publish any elements of compiler theory (I’m not qualified for this anyway). Instead, I will focus on the practical aspects that one needs to know to build a simple, working system.

The original series was written in Turbo Pascal 4.0, which at the time was one of the best software development tools. In this adaptation, the code is written in Kotlin. The IDE used is IntelliJ Community Edition by JetBrains, which comes with a Kotlin (as well as Java) compiler.

As Jack said, this is a “learn-by-doing” series. It is also *one* way of doing things, that works, but of course is not the only or the best way. The reader is encouraged to take this code and explore further options.

The big benefit of following Jack’s approach, is that the reader, once he/she has completed the series, will have good understanding of the internal mechanics of the compiler, as opposed to just copying and pasting a large block of code and using it without understanding of how it works. This way the reader can actually learn and then go off on their own and improve this code.

As this is a quite ambitious project, it cannot be completed in a couple of pages. I will again follow Jack’s approach and publish this work in a series of chapters, as I adapt Jack’s chapters one by one, so please be patient.

One point to note is that even if the final goal of this Blog is to build a compiler for the TI-59, in the initial stages our compiler will be producing Assembly language code for the Motorola 68000 microprocessor, exactly as in the original series. Once the compiler is completed, we will then change the output module (plus some specific bits in the expression scanning logic) to make it produce TI-59 code.

On this note, the Assembly code produced is not the world’s tightest code, but it works and this is where the emphasis of this series is. One of the late chapters however, deals with optimisation.

Finally, please note that everything in this Blog follows the key principle of Keep it Simple, exactly as Jack’s series did. In the interest of simplicity, we will use only single-character tokens and single-digit integers in the early stages of the design, without any embedded white spaces. As Jack wrote, if we can get our compiler to recognise and deal with I-T-L then we can easily get it to do the same with IF-THEN-ELSE.

The parsing of the source code is done in a top-down recursive approach which is a decent, simple and easy to implement methodology for a small project like this.

The error handling is effective (the compiler does not crash) but also simple and the compilation aborts at the first error that it encounters.

The Cradle

Every program requires some boiler plate… I/O routines, error handling, etc. As Jack did, I will keep it to a minimum in order to concentrate on the important stuff. The code below is the minimum that we need to get anything done. It consists of some I/O functions, a couple of error handling functions and a skeleton main function. This will be our Cradle. Please make a copy of these two files, save them, compile them and make sure you can run the main program.

Our compiler must be called with one command line argument, which is the text file containing the program to be compiled. So, please check the various cases:

run it without an argument
run it with a file name that does not exist as an argument
run it with an empty file as an argument
finally run it with a text file that contains something (just anything) as an argument

and see how it behaves in each case. This will test the error handling with regard to the input file.

CompilerMain.kt

This is our main program. Please remember to change the package name (and the corresponding directory structure) to something that suits you, or for simplicity, remove it altogether if you are having issues.

The inp variable is an object of the InputProgramScanner class which holds the program to be compiled and performs the lexical scanning (see below). Please note the lateinit modifier, as all Kotlin variables must be normally initialised at the time they are declared.

The functions error, abort and exit are self-explanatory.

The expected function prints a message with what was expected vs what was found in the input stream and aborts the compilation.

The function initCompiler first checks if there is at least one command line argument. It then creates a new object of the InputProgramScanner class and calls getNextChar to setup the lookahead character (see below).

package mpdev.compiler.m68kcompiler

import kotlin.system.exitProcess

/**
 * A Simple Compiler
 * Based on the Let's Build a Compiler! series by Jack Crenshaw
 * This version produces Assembly language code for the 68000 microprocessor
 * A future version will produce code for the TI-59 Calculator
 * It will produce output that uses the TI-59 mnemonics that can be keyed directly to the calculator
 * Version 1.0 01.10.2021
 */

// the input program scanner
lateinit var inp: InputProgramScanner

/** report an error */
fun error(errMsg: String) {
    System.err.println("Error: $errMsg")
}

/** abort compilation */
fun abort(errMsg: String) {
    error(errMsg)
    exitProcess(1)
}

/** print message and exit */
fun exit(msg: String) {
    System.err.println(msg)
    exitProcess(0)
}

/** report what was expected and abort */
fun expected(expMsg: String) = abort("Expected $expMsg \n found [${inp.nextChar}]")

/** output code */
fun outputCode(s: String) = print("\t$s")

/** output code with newline */
fun outputCodeNl(s: String) = outputCode("\t$s\n")

/** compiler initialisation */
fun initCompiler(args: Array<String>) {
    if (args.isEmpty())
        abort("no input file specified")
    // get a new input program scanner object - initialise from input file
    inp = InputProgramScanner(args[0])
    inp.getNextChar()
}

/** main function */
fun main(args: Array<String>) {
    initCompiler(args)
}

InputProgramScanner.kt

This class performs the lexical scanning functions. The key to keeping this simple is the use the lookahead character. This is easily achieved by reading the whole input program into a String variable (inputProgram) and using an index to point to the next character (nextChar) in the input. The function getNextChar advances the index and sets our lookahead character to the next character in the input.

The functions isAlpha and isNumeric are self-explanatory.

The function match checks if the the lookahead character matches a specific character that we expect and aborts the program if not.

The getName function returns the next alphabetic token, which at this stage is effectively the next single-character name, while getNumber returns the next integer, which at this stage is a single digit. Both functions have the necessary error checking.

package mpdev.compiler.m68kcompiler

import java.io.File

/**
 * The input program scanner class
 * Performs the lexical scanner functions
 * Processes the char-by-char input and returns the tokens from the input stream
 */
class InputProgramScanner(inputFile: String = "") {

    // the input program as string
    var inputProgram: String = "";
    var indx = 0;

    // the next character from input
    // this is our lookahead character
    var nextChar: Char = ' '

    /** initialisation code */
    init {
        try {
            val f = File(inputFile)
            // read the whole program into a string
            // add a dummy \n at the end
            // this way the lookahead will work properly when we reach the end of input
            inputProgram = f.readText() + "\n"    
        } catch (e: Exception) {
            abort("could not open file [$inputFile]")
        }
    }

    /** get next token from input */
    fun getNextChar() {
        if (indx < inputProgram.length)
            nextChar = inputProgram[indx++]
        else
            exit("end of input")
    }

    /** match a specific input char */
    fun match(x: Char) {
        // return a match when the next char matches x
        if (nextChar == x)
            getNextChar()
        else
            expected("'$x'")
    }

    /** get an identifier */
    fun getName(): String {
        var token: String = ""
        if (!isAlpha(nextChar))
            expected("Identifier")
        else {
            token = nextChar.uppercase()
            getNextChar()
        }
        return token
    }

    /**
     * get a number
     * <number ::= [ <digit> | decimal_dot ] +
     */
    fun getNumber(): String {
        var value: String = ""
        if (!isNumeric(nextChar)) {
            expected("Number")
        } else {
            value = nextChar.toString()
            getNextChar()
        }
        return value
    }

    /** check for an alpha char */
    fun isAlpha(c: Char): Boolean = c.uppercaseChar() in 'A'..'Z'

    /** check for a numeric char */
    fun isNumeric(c: Char): Boolean = c in '0'..'9'

}

Coming up next: Numerical Expression Parsing