Chapter VII: Lexical Scanning

Based on the original series “Let’s Build a Compiler!” by Jack Crenshaw.

Introduction

In the previous six chapters we have covered the basic structures of a programming language, like numerical expressions, Boolean expressions and control structures, and by now we have managed to build a basic compiler that can parse something that looks like a programming language. The biggest restriction still in place is the single-character tokens, which make our programming language look a bit funny. It is this restriction that we will get rid of in this chapter.

Until now I have followed Jack’s original tutorial closely (apart from some minor exceptions), and that has given us a great benefit in terms of delivering a working “product”: building in small blocks, continuously adding to what we have done so far, following Jack’s step-by-step approach, while at the same time keeping things simple.

In this chapter I will change approach and will introduce some of my own enhancements. The reasons for this are: (a) having benefited and learnt from Jack’s tutorial, I feel I can start making my own experiments and see where they can lead us; (b) I’m keen to use the capabilities that a language like Kotlin gives us; and (c) I want to start focusing on my original goal, which is to produce code for the TI-59. Hence the different numbering style from this chapter on, using Roman numerals.

Nevertheless, I will continue using ideas and concepts from Jack’s following chapters, and will be mentioning these clearly in my text. I strongly believe that the original “Let’s Build a Compiler!” series is a masterpiece, and this is proven by the fact that it has been adapted and reproduced numerous times.

The Various Parts of our Compiler

You may have noticed that our compiler code is split across a number of files. This is not by accident – the split reflects the three basic modules a compiler consists of: the Lexical Scanner, the Parser and the Code Generator.

The file InputProgramScanner.kt has all the functions that implement the lexical scanning, i.e. the breaking down of the input stream into distinct tokens, which are then passed to the next stage. It is here that we deal with all the details of handling white space, newlines and other input format specifics.

The three files ProgParser*.kt implement the Parser. I have split this into three sub-modules to make it more structured and easier to follow (one for numerical expressions, one for Boolean expressions and one for control structures). It is in the parser that the language structures are recognised and processed so that they can be translated to object code in the next stage.

Finally, the file M68000Instructions.kt implements the code generator. This is where the specific instructions for the target machine are produced.

I used this separation into these three modules from the beginning by using common sense and by following the modular approach that I always like to follow: write small pieces of code that interact with each other, each of which does a small part of the whole job, so that if one little part has to be changed, only the corresponding small module needs to change. E.g. if we want to produce code for 80×86 instead of M68000, we just have to replace the code generator, while everything else remains as is. Similarly, if we want to start using '.AND.' instead of '&' for the Boolean “and” operator, we just have to change this in our lexical scanner so that it can recognise the correct token (and we will be doing this in the next chapter).

The Approach

The main concept in Jack’s tutorial is the lookahead character (where “character” means “token” in the single-character token world). By doing a simple check on this lookahead character, our parser knows the next token and can make a decision as to what follows and therefore take the appropriate path. And I intend to keep this concept that Jack taught us.

In order to convert from single-character to multi-character tokens we obviously need to change the value of the token from char to string. But by doing this, we lose the benefit that the single character gave us: with one single comparison (a character comparison) we knew exactly what followed (which keyword, which operator, identifier, number, etc.). If we try to do this with a string instead, we will end up with quite complex and convoluted IF statements, which would go against the concept of simplicity. E.g. if you see the character 'I' then you would naturally check the following character to see if it’s an 'F', so you have the string 'IF'. But in this case how do you know it’s the keyword 'IF' and not the beginning of a variable name called 'IFILE'? Similarly, if you see a '+', how do you know it’s the operator '+' and not '++' or '+='?

So, it looks like we need to extend the definition of our token and, apart from its value (string), include other attributes, like an encoded value to make the comparison easier, and also a category to help us group tokens together (e.g. addops). This thinking leads us to the use of enumerated values for these attributes and also points towards using an object for the token as opposed to a single value (string).

This is more or less what Jack has done by using a pair of variables, Token and Value. In Token he holds the encoded token (e.g. number, identifier, if-token, else-token, operator) while in Value he holds the actual value of it (e.g. the identifier name, or the actual number).

Token Redefined

So, based on this thinking, let’s see what our token definition will look like:

class Token(val value: String = NO_TOKEN,
            val encToken: Kwd = Kwd.none,
            val info: TokInfo = TokInfo.none,
            val type: TokType = TokType.none)

val NO_TOKEN = "No Token"

The value holds the string value of the token as it is received (actually, in our case it has been converted to upper case). The encToken holds the encoded value of the token, which is used to drive the decisions in the parser. The enumeration for this can be something like:

enum class Kwd { addOp, subOp, mulOp, divOp, equalsOp, ifToken, elseToken, endifToken, whileToken, endwhileToken, repeatToken, untilToken, ..., endOfInput, any, invalid, none }

By using this enumeration, a single comparison (similar to the character comparison we have used so far) is good enough for the parser, so that it knows exactly what follows.
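Just to illustrate the point (the actual parser changes will come in the next chapter), a decision in the parser can now be driven by a single comparison along these lines, where doIf() and doWhile() are purely illustrative stand-ins for the real parsing routines:

when (token.encToken) {
    Kwd.ifToken    -> doIf()      // a single comparison tells us an IF statement follows
    Kwd.whileToken -> doWhile()   // same for WHILE
    else           -> { }         // and so on for the other tokens
}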

The info will be something like:

enum class TokInfo { identifier, keyword, operator, number, invalid, none }

While the type can be something like:

enum class TokType { addOps, mulOps, booleanLit, endOfBlock, endOfPRogram, endOfInput, invalid, none }

You can see that by using the TokType enumeration, functions like isAddop, isEndOfBlock, etc. can now be replaced by a simple check of the type attribute.
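As a quick illustration (sketch only), where the parser used to call isAddop(look) or isEndOfBlock(look), it can now simply test this attribute:

if (token.type == TokType.addOps) {
    // handle an addop ('+' or '-') - this replaces the old isAddop check
}
if (token.type == TokType.endOfBlock) {
    // we have reached the end of the current block - this replaces the old isEndOfBlock check
}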

At this stage I’m not entirely convinced we need the info attribute – we will figure this out by the end of the next chapter.

Having defined what our tokens will look like from now on (until now they were single characters), let’s see what we need to do in order to identify them. I’m definitely in favour of Jack’s design, and intend to keep match as our main function for getting (and checking) tokens in the parser. However, match has to change now to compare the encToken instead of the single character that it used to compare:

/**
 * get the next token from the input stream and advance the cursor
 * match this token against a specific given token 'x'
 * also produce a match if called with no token or if token is "any"
 * return the token object that has been matched
 * called by the parser functions
 */
fun match(x: Kwd = Kwd.any): Token {
    if (nextToken.encToken != x && x != Kwd.any)
        expected(decodeToken(x))
    val thisToken = nextToken
    nextToken = scan()
    return thisToken
}

In this version of the scanner, nextToken replaces nextChar (as a concept, that is – nextChar is still used internally in the scanner):

// the next token is here so that we can look ahead
private lateinit var nextToken: Token

There are two more changes in this version of match (apart from comparing the encToken instead of a char). The first one is that it now returns the token that it matched (or whatever token it found, if it was not asked to match a specific one). The reason for this is that by returning the token to the caller, match can now be used as the main interface between the parser and the scanner (which was also Jack’s original intention). The second difference is that it now calls scan (instead of getNextChar) to advance to the next token. This was necessary as our tokens are not single characters anymore, so we need to do a bit more work in order to get the next token. I will speak about scan shortly – this is where the intelligence in identifying the next token is going to be.
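To illustrate what returning the matched token buys us, a parser routine could look something like this sketch – parseExpression() and code.assignment() are just hypothetical stand-ins here:

/** hypothetical sketch of a parser routine using the token returned by match */
fun parseAssignment() {
    val identName = inp.match(Kwd.identifier).value   // match checks the token and returns it
    inp.match(Kwd.equalsOp)                           // consume the '='
    parseExpression()                                 // stand-in for the expression parser
    code.assignment(identName)                        // stand-in for the code generator call
}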

We have also seen that the parser occasionally needs to check the value of the next token without advancing the “cursor”, and so far this has been done by checking the value of nextChar. I do not want to access nextToken directly from the parser; from a structural point of view, I would prefer it to remain a private variable in our scanner class. So we need one more function in our scanner that returns the next token, fulfilling the lookahead role without advancing the cursor:

/** lookahead function - returns next token without advancing the cursor */
fun lookahead(): Token = nextToken

So with this, the only two points of interaction between the parser and the scanner are match (to match a token and advance to the next one or just advance to the next one) and lookahead (to check the next token but do not advance).
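Putting the two together, here’s a sketch of how a typical parser routine might use them – parseBooleanExpression() is an illustrative stand-in, while parseBlock() is the function we already have from the previous chapters:

/** sketch of a parser routine using match and lookahead */
fun parseIf() {
    inp.match(Kwd.ifToken)                             // consume the IF
    parseBooleanExpression()
    parseBlock()
    if (inp.lookahead().encToken == Kwd.elseToken) {   // peek: is there an ELSE clause?
        inp.match()                                    // yes - consume the ELSE and carry on
        parseBlock()
    }
    inp.match(Kwd.endifToken)                          // consume the ENDIF
}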

Scan

And now we are getting to scan which is the major change in our scanner. I’m using the same name and same concept as Jack here but my implementation of scan is slightly different, as my intention is to make it as generic as possible, so that it covers all possible scenarios.

Let’s go down the list of scenarios that scan has to go through, starting from the simplest case and going to more and more complex ones.

The first case is when we encounter the end of the input stream (marked by char(0) in nextChar if you remember). In this scenario scan returns a special token that indicates end-of-input:

if (nextChar == endOfInput)  // check for end of input
    return Token(END_OF_INPUT, Kwd.endOfInput, TokInfo.none, TokType.none)

The next obvious case is when nextChar is a number, in which case scan reads and returns that numeric value:

if (isNumeric(nextChar))            // check for number
    return Token(getNumber(), Kwd.number, TokInfo.number, TokType.none)

Then we have the case where the next character is alphabetic. In this case scan will read that alphanumeric token, but then what? Our gut feeling tells us to check first whether it is one of the reserved keywords, and if not, it must be an identifier:

val indx: Int
if (isAlpha(nextChar)) {            // check for name
    val name = getName()
    indx = isKeyword(name)
    return (if (indx >= 0) languageTokens[indx]  /* keyword found */
            else Token(name, Kwd.identifier, TokInfo.identifier, TokType.none))  /* identifier found */
}
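The getNumber and getName functions are the ones from the previous chapters, extended to read a whole number or word instead of a single character. In case you have not extended them yet, here’s a minimal sketch of what they could look like (the versions in the GitHub repository may differ slightly):

/** minimal sketch - read a whole word from input and return it in upper case */
private fun getName(): String {
    var name = ""
    while (isAlpha(nextChar) || isNumeric(nextChar)) {
        name += nextChar.uppercaseChar()
        skipNextChar()
    }
    skipWhite()
    return name
}

/** minimal sketch - read a whole number from input */
private fun getNumber(): String {
    var number = ""
    while (isNumeric(nextChar)) {
        number += nextChar
        skipNextChar()
    }
    skipWhite()
    return number
}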

Having gone through these cases, we have now come across a special character, most likely an operator, but how can we identify it? The most generic approach I could think of (but maybe not the simplest one) is to check the input stream (starting at the next character) against the predefined reserved sequences and see if any of them appears at the start of what remains of the input stream. We need to compare against the longest sequences first, so that if our next character is e.g. '/' we first check for '/=' and for '/*' (if this is our comment delimiter) and only then against '/', which will match. The same goes for '==' vs '=', '<=' vs '<', etc. And this is the most complex part of scan:

indx = specSeqPresent()                // we have a special character - check for special sequence
if (indx >= 0)
    return(getSpecSeq(indx))

I have tried to simplify this part by moving the checking for a special sequence and the retrieval of that special sequence, if one is found, to different routines (listed further down).

Finally, if everything so far has failed, then we’ve come across an invalid token and this is what scan will return:

// at this point we have an invalid token
val thisChar = nextChar
getNextChar()
return Token(thisChar.toString(), Kwd.invalid, TokInfo.invalid, TokType.invalid)

Here’s the full listing of scan and the three supporting functions isKeyword, specSeqPresent and getSpecSeq:

/** get the next token and advance the "cursor" */
private fun scan(): Token {
    if (nextChar == endOfInput)  // check for end of input
        return Token(END_OF_INPUT, Kwd.endOfInput, TokInfo.none, TokType.none)

    if (isNumeric(nextChar))            // check for number
        return Token(getNumber(), Kwd.number, TokInfo.number, TokType.none)

    val indx: Int
    if (isAlpha(nextChar)) {            // check for name
        val name = getName()
        indx = isKeyword(name)
        return (if (indx >= 0) languageTokens[indx]  /* keyword found */
            else Token(name, Kwd.identifier, TokInfo.identifier, TokType.none))  /* identifier found */
    }

    indx = specSeqPresent()            // we have a special character - check for special sequence
    if (indx >= 0)
        return(getSpecSeq(indx))

    val thisChar = nextChar           // at this point we have an invalid token
    getNextChar()
    return Token(thisChar.toString(), Kwd.invalid, TokInfo.invalid, TokType.invalid)
}

/** check if a specific name is a keyword */
private fun isKeyword(name: String): Int {
    if (cursor >= inputProgram.length)         // check for end of input
        return -1
    for (i in languageTokens.indices) {
        if (languageTokens[i].value.equals(name, true))  // check for keyword match
            return i
    }
    return -1
}

/**
 * check the beginning of the remaining input for special sequence (e.g. operator)
 * returns the index in our keywords list if found or -1 if not
 */
private fun specSeqPresent(): Int {
    if (cursor >= inputProgram.length)         // check for end of input
        return -1
    for (i in languageTokens.indices) {
        val tokenValue = languageTokens[i].value
        if (inputProgram.substring(cursor).startsWith(tokenValue))  // check for keyword match
            return i
    }
    return -1
}

/** get a special sequence from input (keyword or operator) */
private fun getSpecSeq(indx: Int): Token {
    if (indx >= languageTokens.size)
        return Token(NO_TOKEN, Kwd.none, TokInfo.none, TokType.none)
    val t = languageTokens[indx]
    cursor = min(cursor+t.value.length, inputProgram.length)
    nextChar = inputProgram[cursor]
    skipWhite()
    return t
}

You may have noticed a new variable above, languageTokens. This is the list of all the keywords, operators and other special sequences used in our language. This is fully configurable and by changing the contents of this list we can change the look and feel of the language.

var languageTokens = mutableListOf<Token>()

The declaration and initialisation of this list is done in LanguageTokens.kt, which is effectively where we define all the keywords and operators of our language.
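I will not reproduce the whole of LanguageTokens.kt here, but just to give a flavour of it, the entries could be added along these lines (a sketch only – Kwd.lessOrEqualOp and Kwd.lessOp are illustrative names that do not appear in the enumeration above). Note the ordering in initOperators: because of the way specSeqPresent scans the list, the longer sequences must come before their single-character prefixes:

/** sketch only - the actual definitions are in LanguageTokens.kt in the repository */
fun initKeywords() {
    languageTokens.add(Token("IF",    Kwd.ifToken,    TokInfo.keyword, TokType.none))
    languageTokens.add(Token("ELSE",  Kwd.elseToken,  TokInfo.keyword, TokType.endOfBlock))
    languageTokens.add(Token("ENDIF", Kwd.endifToken, TokInfo.keyword, TokType.endOfBlock))
    // ... and so on for the rest of the keywords
}

fun initOperators() {
    // "<=" must come before "<", "==" before "=", etc.,
    // otherwise specSeqPresent would match the shorter sequence first
    languageTokens.add(Token("<=", Kwd.lessOrEqualOp, TokInfo.operator, TokType.none))   // illustrative
    languageTokens.add(Token("<",  Kwd.lessOp,        TokInfo.operator, TokType.none))   // illustrative
    languageTokens.add(Token("+",  Kwd.addOp,         TokInfo.operator, TokType.addOps))
    languageTokens.add(Token("-",  Kwd.subOp,         TokInfo.operator, TokType.addOps))
    // ... and so on for the rest of the operators
}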

The Final Touches

On top of what we have just seen, there are a few more changes required in our scanner:

Add this statement in the import section:

import kotlin.math.min

Modify slightly the init function of the scanner class:

/** initialisation code for class InputProgramScanner */
init {
    try {
        val f = File(inputFile)
        // read the whole program into a string
        inputProgram = f.readText()
        // init the list of tokens for our language
        initKeywords()
        initOperators()
        // set the lookahead character to the first input char and skip any white spaces
        nextChar = inputProgram[0]
        skipWhite()
        // get the first token from input
        nextToken = scan()
    } catch (e: Exception) {
        abort("$e")
    }
}

Rename the indx variable to cursor (I think it’s more appropriate) and change the skipNextChar function as follows (cursor now points to the current character coming up in the input stream, as opposed to the one after it):

/** set the lookahead character to the next char from input */
private fun skipNextChar() {
    if (cursor < inputProgram.length-1)
        nextChar = inputProgram[++cursor]
    else
        nextChar = endOfInput
}

Remove all the global definitions of the keywords and operators and the relevant functions that recognise them. Also remove the declaration of the blockTerminators set.

Move the endOfInput variable from the global section to inside the InputProgramScanner class and make it private. Also make all other variables in this class private.

// end of input mark
private val endOfInput = nullChar.toChar()

Modify the function that checks for end of program:

/** check for end of program - called by parseBlock */
fun isEndOfProgram(): Boolean =
    nextToken.encToken == Kwd.endOfProgram || nextToken.encToken == Kwd.endOfInput

Remove the function that checks for end of block – not needed anymore.

Add the decodeToken function that helps expected print the expected token value in case of error:

/** decode an encoded token to token name */
private fun decodeToken(token: Kwd): String {
    for (i in languageTokens.indices)
        if (languageTokens[i].encToken == token)
            return languageTokens[i].value
    return "*******"
}

Move the function expected from CompilerMain to InputProgramScanner, as this is the only place it is used. This function is now simplified:

/** report what was expected and abort */
private fun expected(expMsg: String) {
    abort("Expected $expMsg \n found [${nextToken.value}]")
}

Remove the getBoolean function (its functionality is now covered by scan). The isBoolean function is not needed either.

To test these changes you only need three files. If you have managed to follow the incremental changes I have described here, CompilerMain.kt and InputProgramScanner.kt should be in good shape – if not, you can take them from the GitHub repository. The third file is LanguageTokens.kt – please take this one from GitHub (URL below). Finally, you will need to change the main function as follows:

/** main function */
fun main(args: Array<String>) {
    initCompiler(args)
    while(true) {
        val t: Token = inp.lookahead()
        println("[${t.value}], [${t.encToken}], [${t.info}], [${t.type}]")
        if (t.encToken == Kwd.endOfInput)
            break
        inp.match()
    }
}

You can also comment out the following line in CompilerMain.kt as we will not be producing code during this test.

//val code = M68000Instructions()

You can now compile it and run it with any input consisting of any combination of numbers, words, operators and other special characters. It will recognise the tokens in this input sequence as per the definitions in LanguageTokens.kt and will print them one by one. Give it an input like this:

Program Test

if condition1
    alpha = 5
    beta = 6
else
    x = 1
    y = 2
endif

pi=3.14

EndProgram

It should produce this output:

[PROGRAM], [startOfProgram], [keyword], [none]
[TEST], [identifier], [identifier], [none]
[IF], [ifToken], [keyword], [none]
[CONDITION1], [identifier], [identifier], [none]
[ALPHA], [identifier], [identifier], [none]
[=], [equalsOp], [operator], [mulOps]
[5], [number], [number], [none]
[BETA], [identifier], [identifier], [none]
[=], [equalsOp], [operator], [mulOps]
[6], [number], [number], [none]
[ELSE], [elseToken], [keyword], [endOfBlock]
[X], [identifier], [identifier], [none]
[=], [equalsOp], [operator], [mulOps]
[1], [number], [number], [none]
[Y], [identifier], [identifier], [none]
[=], [equalsOp], [operator], [mulOps]
[2], [number], [number], [none]
[ENDIF], [endifToken], [keyword], [endOfBlock]
[PI], [identifier], [identifier], [none]
[=], [equalsOp], [operator], [mulOps]
[3], [number], [number], [none]
[.], [invalid], [invalid], [invalid]
[14], [number], [number], [none]
[ENDPROGRAM], [endOfProgram], [keyword], [endOfPRogram]
[End of Input], [endOfInput], [none], [none]

You can see how each token is recognised and categorised and how it is translated from string to an encoded token so that it can be easily processed by the parser. Play with it, feel free to change the token definitions in LanguageTokens.kt and see how the tokens you define will be recognised. Try any combinations you can think of to see if you can break it.

I will pause here to let you digest and experiment with this and will be back shortly.

And as always, you can find the code in my GitHub repository.

Coming up next: Introducing TINSEL
