Chapter VIII: Introducing TINSEL

Based on the original series “Let’s Build a Compiler!” by Jack Crenshaw.

Introduction

In the previous chapter we changed our lexical scanner so that it can now recognise any token and any operator that we have defined in our programming language definition. And we verified this, by seeing the input stream broken down into distinct tokens that were encoded and categorised and ready to be passed to the parser.

In this chapter we will use this capability and define the full specification of our own programming language that we will call TINSEL (Technology Integration and Next Step in Evolutionary Learning). I believe this is an appropriate name for our mini-language, given that this chapter is published in the beginning of December.

At the end of this chapter we will be able to write a complete TINSEL program that will be as easy to read as C or Java or Kotlin, and which we will be able to compile with decent error checking included.

But first, let’s complete the change we started in the previous chapter, by making the necessary changes in the 3 parser modules to make them compatible with the new scanner.

Connecting Parser and Scanner

Now that our tokens consist of multiple characters and are being returned as an enumerated value, our parser routines have to be changed accordingly. Please take the three files that were part of Chapter 7 and create a new package under your project together with the three ProgParser* and the M68000Instructions files from Chapter 6.

Please make the following changes:

  • Where we check against the value of a token (e.g. 'addOp' or 'ifToken') the corresponding enumerated value must be used (e.g. Kwd.addOp or Kwd.ifToken)
  • Where we check the value of inp.nextChar, we must now use the value of inp.lookahead().encToken and must check against the corresponding enumerated value (see point above)
  • Where match is called with a char parameter, it must now be called with the corresponding enumerated value (inp.match(endIfToken) becomes inp.match(Kwd.endIfToken))
  • Where the type of token is checked by using e.g. isEndOfBlock or isAddop or isOrOp the type attribute of the token must be checked instead: e.g. inp.lookahead().type == TokType.endOfBlock or inp.lookahead().type == TokType.addOps
  • Where we check for an Alpha or a Numeric token with inp.isAlpha(inp.nextChar) or inp.isNumber(inp.nextChar), we must now use inp.lookahead().encToken == Kwd.identifier or inp.lookahead().encToken == Kwd.number
  • The same goes for inp.isLeftParen and all the other similar checks
  • Replace any call to inp.getName() with inp.match(Kwd.identifier).value and any call to inp.getNumber() with inp.match(Kwd.number).value; these will fetch an identifier or a number from the input with the appropriate error checking
  • Replace the line with the call to inp.getBoolean() in parseBooleanFactor() with if (inp.match(Kwd.booleanLit).value == BOOLEAN_TRUE)
  • Reinstate the line val code = M68000Instructions() in CompilerMain.kt

For example, here’s how parseBlock and parseFactor will look after these changes:

/**
 * parse a block
 * <block> ::= [ <statement> ] *
 */
fun parseBlock(breakLabel: String = "") {
    while (inp.lookahead().type != TokType.endOfBlock && !inp.isEndOfProgram()) {
        when (inp.lookahead().encToken) {
            Kwd.ifToken -> parseIf(breakLabel)
            Kwd.whileToken -> parseWhile()
            Kwd.repeatToken -> parseRepeat()
            Kwd.forToken -> parseFor()
            Kwd.breakToken -> parseBreak(breakLabel)
            else -> parseAssignment()
        }
    }
}
/**
* parse a factor
* <factor> ::= ( <expression> ) | <number> | <identifier>
*/
fun parseFactor() {
if (inp.lookahead().encToken == Kwd.leftParen) {
// ( Expression )
inp.match()
parseExpression()
inp.match(Kwd.rightParen)
}
else
if (inp.lookahead().encToken == Kwd.identifier)
// Identifier
parseIdentifier()
else
// Number
code.setAccumulator(inp.match(Kwd.number).value)
}

These changes would normally come to mind naturally, based on the new definition of Token and match(). However, this may turn into a delicate operation, so please compare the outcome of your changes with the files from GitHub (url at the bottom as usual) just in case you have missed something.

Let’s Define our Language

We will now start defining our language, TINSEL. We will take a top-down approach and will start from the top level, which is the Program itself. We will then continue down the hierarchy and define the building blocks one by one, until we reach the Statement level, which is what we can support already.

As Jack has shown us, the best tool for this is the BNF notation, which, as we’ve seen in previous chapters, we can translate directly to code in our parser.

The Program

The first programming language I learnt was FORTRAN, so I’d like to have a program header in the beginning, something like program calculate_differences, and an end of program mark at the end, something like endprogram.

Having clarified this, let’s think what other structural blocks our program will consist of.

Immediately after program... I’d like to have an optional section where the global variables are declared. These will all be integer for now.

After this, I would like to see a section (also optional) where functions can be declared. They will also return an integer for now.

Finally, the only mandatory section of our program will be the main section, which will be clearly marked (e.g. with the token main)

Based on this our program can be defined as:

    <program> ::= <prog header> [ <var declarations> ] [ <fun declarations> ] <main block> <prog end>

And the function to process this will naturally be:

fun parseProgram() {
parseProgHeader()
if (inp.lookahead().encToken == Kwd.varDecl)
parseVarDecl()
if (inp.lookahead().encToken == Kwd.funDecl)
parseFunDecl()
parseMainBlock()
parseProgEnd()
}

The functions to process the program header and program end will be:

fun parseProgHeader() {
    inp.match(Kwd.startOfProgram)
    println("program ${inp.match(Kwd.identifier).value}\n")
}

and

fun parseProgEnd() {
    inp.match(Kwd.endOfProgram)
    println("\nend program")
}

You will notice that the Kotlin compiler is now giving us a warning on all those when blocks. The reason is that we now use enumerations in the when block and it is advisable that we include an else case, even though it’s not 100% needed as long as the definitions of our tokens are correct. If however, there is an error in our tokens definitions, this “catch-all” else case will help us identify it. E.g. in parseExpression():

else -> inp.expected("add or subtract operator")

As a result, the function expected must now become public.

Please note, we will include all these functions in a new file called ProgParser0TopLevel.kt, so please remove the parseProgram() function from ProgParser1ControlStruct.kt.

Global Variables Declarations

Given that we have only numeric variables, we can declare them just by using the keyword var followed by the name (no need to specify the type – everything is integer for now). In addition we could have an optional initialisation of the variable. To do this properly we should be able to calculate the result of a numeric expression, which would potentially include variables that had been declared and initialised before. However, our compiler is not able to perform calculations at this stage (see chapter 4 – Interpreters) so for now we will keep it simple and support only simple numeric values. Our variable declaration will look like this:

    <variable declaration> ::= var <identifier> [ = <value> ]

We will keep processing variables declarations in a loop as long as we see the var token:

/** parse variables declarations */
fun parseVarDecl() {
    while (inp.lookahead().encToken == Kwd.varDecl) {
        inp.match()
        val varName = inp.match(Kwd.identifier).value
        var initValue = ""
        if (inp.lookahead().encToken == Kwd.equalsOp) {
            inp.match()
            initValue = inp.match(Kwd.number).value
        }
        declareVar(varName, initValue)
    }
}

You may have noticed the function declareVar above. This is how this will work: we will maintain all the declared variables in a Map with the variable name as the key. The data for each variable will be it’s type (something’s telling me that we may want later to extend our language to support more than just int) and an initialised flag. All modern compilers point out references to uninitialised variables, so why not prepare the ground for this? If our variable declaration contains an initialisation part, we will set this flag to true, otherwise it will be false.

Regarding variables initialisation, please note that even though the initialisation part is optional in TINSEL, every global variable has to have an initial value – more details on this in chapter 10. So for now let’s say that the initialised flag will be true if a variable has been specifically initialised by the programmer in the TINSEL program, and false if it hasn’t been initialised by the programmer but instead is initialised to the default value by the Compiler.

In declareVar we first check in case the variable has already been declared, and if not, we process the new declaration together with the initialisation. The initialised flag is set accordingly.

// the variables space
val variablesSpace = mutableMapOf<String,VariableDecl>()

/** our variable types */
enum class VarType { int }

/** our variables declaration */
class VariableDecl(var type: VarType, var initialised: Boolean = false)

/** declare a variable */
fun declareVar(name: String, initValue: String) {
    // check for duplicate var declaration
    if (variablesSpace[name] != null)
        abort ("variable $name already declared")
    variablesSpace[name] = VariableDecl(VarType.int, initValue!="")
    code.dummyInstr("declare variable $name initValue $initValue")
}

You may also have noticed the use of dummy code here. The actual assembly code for the variables declarations depends (a) on the specific assembly language and (b) on the specific assembler. So, I will give you the detailed full code for this in chapter 10, when we start producing actual code for x86.

Functions Declarations

Similarly to the variables, our functions will only return numeric result, so no type needed. We will also keep it simple by having no parameters for now. A function declaration will look like this:

    <function declaration> ::= fun <identifier> ( ) <block>

There are a couple or things to note above:

  • To parse the function body I’m using <block> which we already have a parser function for. We will see shortly whether our existing parseBlock will need any changes to support the parsing of the function body
  • A function must contain a return statement. We will see later on how our compiler will verify this

The parser for a function declaration will look like this:

/** parse functions declarations */
fun parseFunDecl() {
    while (inp.lookahead().encToken == Kwd.funDecl) {
        inp.match()
        val funName = inp.match(Kwd.identifier).value
        inp.match(Kwd.leftParen)
        inp.match(Kwd.rightParen)
        declareFun(funName)
        parseBlock()
    }
}

You can see that same as with variables, we are processing the function declarations in a loop as long as we see a fun keyword. Same as with declareVar above, we need a declareFun, which will produce the necessary assembly code for the function declaration. And same as with var, this function uses dummy code for now until chapter 10.

// the functions space
val functionsSpace = mutableMapOf<String,VarType>()

/** process a function declaration */
fun declareFun(name: String) {
    if (functionsSpace[name] != null)
        abort ("function $name already declared")
    functionsSpace[name] = VarType.int
    code.dummyInstr("declare function $name")
}

Return Statement

The return statement that must be included in every function, will require some new routines to process it. Its format will be:

    <return> ::= return <expression>

The parser for this will be:

fun parseReturn() {
    inp.match()
    parseExpression()
    code.returnFromCall()
}

And we need to activate it in the when block in parseBlock when the return keyword is recognised:

Kwd.retTok -> parseReturn()

We will also need the return statement in our assembly code module:

/** return from function call */
fun returnFromCall() = outputCodeNl("RET")

The Main Program

I would like to create a resemblance to C, so I will use the token main to mark the start of the main program bock:

    <main block> ::= main <block>

That was an easy one. The parser will be like this:

fun parseMainBlock() {
    inp.match(Kwd.mainToken)
    code.outputLabel("main")
    parseBlock()
    code.dummyInstr("stop running")
}

As you can see in the main block, I have added a label called main for now to mark the entry point to the main program and also the final instruction, the one that tells the target machine to stop running the program. The exact instructions will also be clarified in chapter 10.

The Block

Looking at our current definition of <block>, which is

    <block> ::= [ <statement> ] *

it looks like it’s not adequate for it’s new upgraded role. It will at least need a set of delimiters for start-block and end-block. I’d like to expand the definition to:

    <block> ::= { [ <statement> ] * }
    <statement> ::= <block> | if | while | repeat | for | break | 
                     return | <assignment>

which says that a block consists of 0 or more <statements> and, Yes, it’s surrounded by curly brackets. So, now we have a distinct terminator for the block and don’t need the ENDIF, ENDWHILE or ENDFOR anymore. The statement can be one of the statements we have already worked on or actually a block.

With the above definition in mind I would like to split our existing parseBlock as follows:

/** parse a block */
fun parseBlock(breakLabel: String = "") {
    inp.match(Kwd.startBlock)
    while (inp.lookahead().type != TokType.endOfBlock && !inp.isEndOfProgram()) {
        parseStatement(breakLabel)
    }
    inp.match(Kwd.endBlock)
}

/** parse a statement */
fun parseStatement(breakLabel: String) {
    when (inp.lookahead().encToken) {
        Kwd.startBlock -> parseBlock(breakLabel)
        Kwd.ifToken -> parseIf(breakLabel)
        Kwd.whileToken -> parseWhile()
        Kwd.repeatToken -> parseRepeat()
        Kwd.forToken -> parseFor()
        Kwd.breakToken -> parseBreak(breakLabel)
        Kwd.retTok -> parseReturn()
        else -> parseAssignment()
    }
}

By adding the block-start and block-end tokens (the curly brackets) inside parseBlock, we only do it once and don’t have to do it in every parser function that calls parseBlock (and by now, there’s quite a lot of them).

If – While – Repeat

In order to maintain the resemblance to C, I would like to have the condition in the if, while and repeat blocks surrounded by parentheses. This is a small change in the three parser functions to add the corresponding match calls before and after the call to parseBooleanExpression(). I’m leaving the for loop for later due to its complexity. I will fix this separately, together with the code that is needed to execute it properly (if you remember, for currently generates dummy code).

...
inp.match(Kwd.leftParen)
parseBooleanExpression()
inp.match(Kwd.rightParen)
...

With these changes, it looks like we are ready to write a program like this:

Program TestTinsel1

var a = 2
var b = 5
var x
var y
var i
var result

fun alpha() {
    x = a * b - (y-1)*y
    i = 1
    while (true) {
        i = i + 1
        if (i > 10) {
            return x+i
        }
    }
}

main {
    result = alpha()
    a = 20
    b = 80
    result = alpha()
}

EndProgram

Finally change the main function as per below and try it out!

/** main function */
fun main(args: Array<String>) {
initCompiler(args)
parseProgram()
}

The output code will still be a mixture of M68000 assembly and dummy instructions but I will soon produce x86 code that anyone will be able to run and test on Linux (and Windows with some minor modifications).

Try other structures like for and repeat, if - else, break, and try removing or changing things and see how the compiler will respond to invalid syntax.

Before I leave you, I realised that a minor bug has crawled in our scanner – if you run the compiler with an empty file as input, or if there is no newline character at the end of the last line of the program, it will crash with “string index out of bounds exception”. To stop this, we will bring back the extra newline character that we had added at the end of our program in one of the early chapters:

// read the whole program into a string
// add a newline at the end to deal with end of input easier
inputProgram = f.readText() + '\n'

Have fun.

And as always, you can find the code in my GitHub repository.

Coming up next: More TINSEL

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s