Chapter IX: More TINSEL

Based on the original series “Let’s Build a Compiler!” by Jack Crenshaw.

Introduction

In the previous chapter we Introduced the basic structure of TINSEL, our mini programming language. Here, we will continue defining these structures to make it even friendlier. The aim is to write a complete working program for x86 in the next chapter.

Commas

Looking at the 6 lines that were needed to define 6 variables in the TINSEL example at the end of the previous chapter, it is obvious we would benefit a lot by processing a comma separated list of variables:

    <variable declaration> ::= var <identifier> [ = <value> ] [ , <identifier> [ = <value> ] ] *

It should be very simple to extend our current parseVarDecl to support this. Given that this function must process at least one variable declaration, it makes sense to use a do - while loop:

/**
 * parse variables declarations
 * <variable declarations> ::= var <identifier> [ = <value> ] [ , <identifier> [ = <value> ] ] *
 */
fun parseVarDecl() {
    while (inp.lookahead().encToken == Kwd.varDecl) {
        do {
            inp.match()
            parseOneVarDecl()
        } while (inp.lookahead().encToken == Kwd.commaToken)
    }
}

/** parse one variable declaration */
fun parseOneVarDecl() {
    val varName = inp.match(Kwd.identifier).value
    var initValue = ""
    if (inp.lookahead().encToken == Kwd.equalsOp) {
        inp.match()
        initValue = inp.match(Kwd.number).value
    }
    declareVar(varName, initValue)
}

Don’t forget that in order to activate a new token, its encoded value must be added to the Kwd enumeration and the actual symbol/keyword (the ',' in this case) must be added to the languageTokens list.

enum class Kwd {
    ...
    commaToken,
    ...
}

languageTokens.add(Token(",", Kwd.commaToken, TokInfo.keyword, TokType.none))

That was a quick one! Change the previous TINSEL example so that all the vars or combinations of some vars are in a comma separated list and make sure it works.

Line Numbers

I believe you will agree with me that our compiler has decent error checking. But as our programs can now grow in size, it might be difficult to spot in our TINSEL input program where exactly the error was. I think a line number in the error message might be really helpful.

First we will need a global variable in the scanner, which will be increased by 1 every time we see a newline. I believe the best place to do this is in skipNextChar as this is the place where our code goes through the whole input program one character at a time (including white spaces and newlines):

// input program line number
var lineNumber = 1
/** set the lookahead character to the next char from input */
private fun skipNextChar() {
    if (nextChar == '\n')
        ++lineNumber
    nextChar = if (cursor < inputProgram.length-1)
        inputProgram[++cursor]
    else
        endOfInput
}

With this done, we just need to print this line number when we print an error message:

/** report what was expected and abort */
fun expected(expMsg: String) {
    abort("line $lineNumber: expected [$expMsg] found [${nextToken.value}]")
}

That was easy! But is it working? Let’s feed our compiler deliberately wrong input e.g. by removing an opening { or adding an unrecognised symbol at the end of a line. Now observe the error message and the line number. Is it correct? Well, it’s not far off, but it’s actually pointing to the line where the next token is instead of the line where the error occurred.

And there is a reason for this: every time we extract a token from the input (notably in getName, getNumber and getSpecSeq) we skip all the following whitespaces and newlines until we reach the next token, ready for the next match. And this is why our line number is a bit off.

There is an easy solution to this. Instead of advancing the cursor after we read each token and have it pointing to the next token, we will skip the white spaces and newlines just before we read the next token. This way the line number will always be pointing to the line of the last token we’ve just read.

To achieve this, remove the call to skipWhite from getName, getNumber, getNextChar, getSpecSeq and init, and add it just once at the top of scan as this is the function we call every time we want to return a token:

/** get the next token and advance the "cursor" */
private fun scan(): Token {
    // first skip white space
    skipWhite()

    if (nextChar == endOfInput)  // check for end of input
    ....

Try it again. You will see that the line number is now pointing to the line where the error is. There might be a few outlier cases where the line number might be off, but I believe in most cases it will be accurate.

After this change, if you look carefully, you will see that getNextChar has now been reduced to just one line that calls skipNextChar. Based on this, we can remove this version of getNextChar, and rename skipNextChar to getNextChar (which is exactly how Jack had it originally designed!).

In addition, if we look now at getName and getNumber, they can be further simplified. Both of them are called from only one place, from scan, and only when scan knows that there is an alpha or a numeric character ahead:

if (isNumeric(nextChar))            // check for number
    return Token(getNumber(), ...
...
if (isAlpha(nextChar)) {            // check for name
    val name = getName()
    ...

With this in mind, the check in the beginning of getName and getNumber, to abort if they don’t see an alpha or a numeric character respectively, is now redundant and can be removed.

With this change, all of our error checking at the scanner level is now done in match (with the help of scan of course). The syntax of our whole language is checked in just one line and any error is caught there. This is really neat and is based on the design from Jack.

I’ll let you manage these minor code changes, but if you need to verify, please check the source in the GitHub repository.

Input / Output

No program is any use unless it can communicate with the user. I will implement this through read and print.

read will be reading integer values from the standard input into a comma separated list of variables and print will be printing onto the standard output the integer result(s) of a comma separated list of expressions:

/**
* parse read statement
* <read> :: = read var [ , var ] *
*/
fun parseRead() {
var varName: String
do {
inp.match()
varName = inp.match(Kwd.identifier).value
code.dummyInstr("Read value into $varName")
code.assignment(varName)
} while (inp.lookahead().encToken == Kwd.commaToken)
}

/**
* parse print statement
* <print> ::= print expression [ , expression ] *
*/
fun parsePrint() {
do {
inp.match()
parseBooleanExpression()
code.dummyInstr("Print value")
} while (inp.lookahead().encToken == Kwd.commaToken)
}

Here you see some more dummy instructions but these will go away in the next chapter.

Remember, we also have to define the two new keywords (read and print) in the LanguageTokens file and also to recognise them and point to the two new parsers in parseStatement.

/** parse a statement */
fun parseStatement(breakLabel: String) {
when (inp.lookahead().encToken) {
Kwd.ifToken -> parseIf(breakLabel)
Kwd.whileToken -> parseWhile()
Kwd.repeatToken -> parseRepeat()
Kwd.forToken -> parseFor()
Kwd.breakToken -> parseBreak(breakLabel)
Kwd.retTok -> parseReturn()
Kwd.readToken -> parseRead()
Kwd.printToken -> parsePrint()
else -> parseAssignment()
}
}

It’s amazing how with an extra 7-8 lines of code we can add functionality like reading and printing a series of variables and expressions respectively. This is the beauty in Jack’s method.

Case Sensitive Names

To close this chapter, I’d like to ensure all our names and keywords are case-sensitive, same as with all modern languages. This can be achieved with some minor changes:

Remove the uppercase() conversion in getName:

token += nextChar.uppercase()

Remove the ignoreCase parameter (currently set to true) when we check for a keyword in isKeyword. By default it’s false, so the comparison is case-sensitive:

if (languageTokens[i].value.equals(name, true))  // check for keyword match

It can also be written like this (in Kotlin):

if (languageTokens[i].value == name)  // check for keyword match

With this, program is a valid keyword, while Program isn’t and the variable Alpha is different to alpha.

In the last three chapters I put emphasis on reshaping the lexical scanner (to make it recognise all possible tokens and also make it configurable), and also on defining the structures of our “toy” language to ensure we can write programs one can read and understand and which will be able to do something useful and produce some output. As a result I have neglected the part where the output code is generated, hence you see a number of dummy instructions. This will go away in the next chapter where I will start adapting the code module for x86-64.

And as always, you can find the code in my GitHub repository.

Coming up next: TINSEL for x86-64

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s