Chapter 5a: Control Structures – part 1

Based on the original series “Let’s Build a Compiler!” by Jack Crenshaw.

Introduction

This chapter brings us back to our compiler. Keep the interpreter that you have built in chapter 4 somewhere safe, as going forward you may have to refer to it and use some of those routines.

So far we have covered numerical expressions, so this is the natural next step where we add ifs, loops, breaks, etc. to make our experimental programming language a bit more structured. Jack promised in his original tutorial that having covered the expressions parsing, we will find control structures parsing very easy! Let’s see if you will agree with him.

As I have done in the first 4 chapters, I will stick with Jack’s approach: I will use single character tokens for the control statements to make things simpler while we concentrate on the parsing. I will however keep the handling of white spaces in the code, as it will make it a bit easier for us to construct the various input programs to test our compiler. Please note, I will also keep the handling of multi-character strings in getName and getNumber – they can process single-character tokens equally well as long as they are separated by white space or other special character. I will also keep the skipping of newlines as we have done in chapter 4, so that I can feed our compiler a multi-line program which will help us understand a bit better the flow of a control structure.

This chapter turned out to be quite long, so I will publish it in 2 parts.

The General Plan

We will use the version of our compiler from the end of chapter 3.

Given that we are sticking with single-character tokens for now, the input “code” will look a bit funny. We will use 'i' for IF, 'w' for WHILE, etc. To cheer you up a bit this restriction will be removed in chapter 7, at which point our programming language will be much easier to read.

Jack wants us to concentrate strictly on control statements in this chapter, so we will not be parsing any other statements here. For this purpose we need a dummy parse function that will print the next alphanumeric token to the output just for us to follow the code:

/** dummy parse function for anything else than control statements */
fun parseOther() {
    println("\t${inp.getName()}")
}

We now have to extend our scanner to be able to deal with more than one lines. This was done in the Interpreter chapter by adding the call to skipNewline to the main loop. We will do it a bit more formally here. Consider the following definitions:

<program> ::= <block> end

<block> ::= [ <statement> ] * end

…which say that a program is a block followed by the keyword 'end' while a block is any number of statements (including no statements at all – empty block) followed by the 'end' keyword. These definitions can be easily converted to parsing functions as we have been doing so far:

/**
 * parse a program
 * <program> ::= <block> end
 */
fun parseProgram() {
    parseBlock()
    inp.match(endProg)
    println("\nEND PROGRAM")
}

/**
 * parse a block
 * <block> ::= [ <statement> ] * end
 */
fun parseBlock() {
    while (!inp.isEndProgram(inp.nextChar))
        parseOther()
}

And with these new parsing functions in place, our main loop now becomes

/** main function */
fun main(args: Array<String>) {
    initCompiler(args)
    parseProgram()
}

We also need to add the global definition for our program terminator ('.' is used here – this will be changed to something more user-friendly in chapter 7) and the function that checks for end of program.

// tokens
val endProg = '.'

/** check for end of program */
fun isEndProgram(c: Char): Boolean = c == endProg

Finally, in order to process a multi-line program, we need to deal with newline characters. There are two schools: if we want our programming language to be line-oriented (e.g. like BASIC where the end of each line is also the end of the statement) then we would have to make changes in many parts of the code and add processing of the newline character. If however, we decide to make our language free-field (like C or Kotlin) then we have an easy solution: we just need to change the function skipWhite so it will skip both white spaces and line terminators. I will follow the latter, so in other words, I will be treating the line terminator as a white space. We will see later if we can keep this approach or may have to change the handling of newlines.

/** check for a white space */
fun isWhite(c: Char): Boolean = c == ' ' || c == '\t' || isEndOfLine(c)

With these changes our compiler will process an input program of as many lines as you want and will recognise and print any alphanumeric tokens. It will flag any number, operator or other special character as an error. It will also flag the absence of the end-program token ('.') as an error.

IF Statement

The syntax for the IF statement that we will support here is as follows:

IF <condition> <block> ENDIF

As you can see there is a specific token that terminates the IF-block. This will be the approach in this chapter where each of the control statements will have its own termination token.

Thinking of the flow around the IF statement and trying to convert this to Assembly-friendly format (we will convert to TI-59 at a later stage) then the below IF block

IF <condition> A ENDIF B

becomes

        Branch if NOT <condition> to L
        A
L:      B
        ...

Our assembly code that will be produced here, will be branching around the IF-block if the condition is false.

With the relaxed restrictions, as mentioned above, a block like IF Condition A ENDIF B ... can be written in our programming “language”:

i condition
      a
e
b
.

which is much easier to read than icaeb. (this is what our test “program” would look like if we did not process white spaces and newlines and returned single characters in getName)

First, the functions needed to produce the unique labels:

// global vars
var labelIndx: Int = 0

/** create a unique label*/
fun newLabel(): String = "L${labelIndx++}"

/** post a label to output */
fun postLabel(label: String) = code.outputLabel(label)

And in our M68000Instructions.kt:

/** output a label */
fun outputLabel(s: String) = println("$s:")

If we think of the actions that we need to take for each of the keywords in the IF-block, we have something like this:

  • IF: First, get the condition and issue the code for it.
  • Then, create a unique label and issue the code to branch to it if the condition is false
  • ENDIF: output the label

At this point Jack introduces us to what is called “syntax-directed translation”, which we have been doing anyway, but here’s how to be clear about it:

    IF
    <condition>    { condition
                     L = NewLabel
                     Code (Branch if False to L) }
    <block>
    ENDIF          { PostLabel (L) } 

Based on this, the code for our IF parser will look like this (please note we are using a dummy function for the condition as well – this will be covered fully in the next chapter):

/**
 * parse if statement
 * IF <condition> <block> ENDIF
 */
fun parseIf() {
    inp.getNextChar()
    parseCondition()
    val label = newLabel()
    code.branchIfFalse(label)
    parseBlock()
    inp.match(endifToken)
    postLabel(label)
}

/** dummy parse condition */
fun parseCondition() {
    println("\t${inp.getName()}")
}

We need to add the new instruction to M68000Instructions.kt:

/** branch if false */
fun branchIfFalse(label: String) = outputCodeNl("BEQ $label")

And parseBlock has to be changed to call parseIf when it sees the 'i' token:

/**
 * parse a block
 * <block> ::= [ <statement> ] *
 */
fun parseBlock() {
    while (inp.nextChar != endifToken && !inp.isEndProgram(inp.nextChar)) {
        when (inp.nextChar) {
            ifToken -> parseIf()
            else -> parseOther()
        }
    }
}

Don’t forget to define the tokens for IF and ENDIF in our global definitions section:

val ifToken = 'i'
val endifToken = 'e'

Compile this code and run it with this IF statement as input:

i condition1
    statement1
e
statement2
.

It produces the following output:

	CONDITION1
	BEQ L0
	STATEMENT1
L0:
	STATEMENT2 

Note how the code branches around statement1 if condition1 is false. I’m not quite satisfied with our error checking in this version though. But I will address this very soon once we complete our IF statement with the addition of ELSE. In BNF this is:

IF <condition> <block> [ ELSE <block> ] ENDIF

And in syntax-directed translation we can write this as follows:

    IF
    <condition>    { L1 = NewLabel
                     L2 = NewLabel
                     code (Branch if False to L1) }
    <block>
    <ELSE>         { code (Branch to L2)
                     PostLabel (L1) }
    <block>
    <ENDIF>        { PostLabel (L2) }

If we combine this with the ELSE-less case that we dealt with further up then our parseIf function becomes (please remember, we are using 'i' for IF, 'e' for ENDIF and now 'l' for ELSE):

/**
 * parse if statement
 * IF <condition> <block> [ ELSE <block> ] ENDIF
 */
fun parseIf() {
    inp.getNextChar()
    parseCondition()
    val label1 = newLabel()
    code.branchIfFalse(label1)
    parseBlock()
    if (inp.nextChar == elseToken) {
        inp.getNextChar()
        val label2 = newLabel()
        code.branch(label2)
        postLabel(label1)
        parseBlock()
        postLabel(label2)
    }
    else
        postLabel(label1)
    inp.match(endifToken)
}

We need to remember to terminate the parseBlock loop when it sees the else token as well:

/**
 * parse a block
 * <block> ::= [ <statement> ] *
 */
fun parseBlock() {
    while (inp.nextChar != endifToken &&
           inp.nextChar != elseToken && !inp.isEndProgram(inp.nextChar)) {
        when (inp.nextChar) {
            ifToken -> parseIf()
            else -> parseOther()
        }
    }
}

Don’t forget to define the else token in our global definitions:

val elseToken = 'l'

And of course the new instruction for our assembly code:

/** branch */
fun branch(label: String) = outputCodeNl("BRA $label")

Compile it and try it out. Give it the below input:

i condition1
    statement1
l
    statement2
e
statement3
.

It will give you this output:

	CONDITION1
	BEQ L0
	STATEMENT1
	BRA L1
L0:
	STATEMENT2
L1:
	STATEMENT3

As expected, the code branches around statement1 if false, and executes statement2 and statement3, while if true, it executes statement1 and then goes to statement3. Try it also without else to make sure this case is not broken. Also try it with nested ifs and verify it works. As an exercise you can try to add THEN as a mandatory keyword. I believe it can be done with just one line of code!

Error Handling

At this point I would like to improve our error handling. If you try it without the endif token, our error handling will correctly pick it up. If however you try it without the endif token and without the end-program token, it will just say “end of input” without pointing out the error. In fact, the end-program token (our '.' in this case) is not processed at all, unless there is some other non-white space character following it. Something is wrong here. It looks like the issue is with skipNextChar, which when it reaches the end of input, just aborts the compilation, so match does not get its chance to do its error checking.

To improve this, I’m changing skipNextChar to make it always return something in nextChar, even when it’s run out of characters in the input stream. For this I’ve picked the null character – this will be our end-of-input token.

// end of input mark
val nullChar = 0
val endOfInput = nullChar.toChar()

Our end of program check now becomes:

/** check for end of program */
fun isEndProgram(c: Char): Boolean = c == endProg || c == endOfInput

I will also define a function to check for end of block:

/** check for end of block */
fun isEndBlock(c: Char) = c == endifToken || c == elseToken || isEndProgram(c)

With this, the exit condition in the parseBlock loop can be changed to:

/**
 * parse a block
 * <block> ::= [ <statement> ] *
 */
fun parseBlock() {
    while (!inp.isEndBlock(inp.nextChar)) {
        when (inp.nextChar) {
            ifToken -> parseIf()
            else -> parseOther()
        }
    }
}

I’m also improving function expected to print a more meaningful message than trying to print the null character when we encounter end of input:

/** report what was expected and abort */
fun expected(expMsg: String) {
// setup a meaningful message if we have reached end of input
val foundStr: String = if (inp.nextChar==endOfInput) "end of input" else inp.nextChar.toString()
abort("Expected $expMsg \n found [${foundStr}]")
}

Note how in Kotlin the if statement actually returns a value, so there is no need for the ternary operator in this language.

This is the new version of skipNextChar:

/** set the lookahead character to the next char from input */
fun skipNextChar() {
if (indx < inputProgram.length)
nextChar = inputProgram[indx++]
else
nextChar = endOfInput
}

And with this improvement we don’t need to add a '\n' at the end of our input program that was meant to help with handling of the end of input:

/** initialisation code class InputProgramScanner */
init {
    try {
        val f = File(inputFile)
        // read the whole program into a string
        inputProgram = f.readText()
        // set the lookahead character to the first input char
        getNextChar()
    } catch (e: Exception) {
        abort("could not open file [$inputFile]")
    }
}

Try it again. It looks like the error handling is much improved now.

WHILE Statement

With the IF out of the way, it should be now easy to process a WHILE structure. The syntax of WHILE in our language will be:

WHILE <condition> <block> ENDWHILE

It’s not really necessary to use different end-tokens for each control statement. It does add clarity to the code though. However, at this stage where we are using single-character tokens and we will be using 'e' for both endif and endwhile.

Looking at the above definition, we will need a Label at the beginning of the loop to branch to, when we start a new iteration, and a Label just after the end of the loop to branch to, when the condition is false. This in syntax-directed translation becomes:

    WHILE          { L1 = NewLabel
                     PostLabel(L1) }
    <condition>    { (Branch if False to L2) }
    <block>
    ENDWHILE       { (Branch to L1) 
                     PostLabel(L2) }

And the parse function for while is like this:

/**
 * parse while statement
 * WHILE <condition> <block> ENDWHILE
 */
fun parseWhile() {
    inp.getNextChar()
    val label1 = newLabel()
    val label2 = newLabel()
    postLabel(label1)
    parseCondition()
    code.branchIfFalse(label2)
    parseBlock()
    inp.match(endwhileToken)
    code.branch(label1)
    postLabel(label2)
}

We need to remember to declare our while and endwhile tokens and to terminate the loop in parseBlock when we see the endwhile token. This is done easily by updating isEndBlock:

val whileToken = 'w'
val endwhileToken = 'e'

/** check for end of block */
fun isEndBlock(c: Char) = c == endifToken || c == elseToken || c == endwhileToken || isEndProgram(c)

And of course we need to tell parseBlock to check for a while statement:

/**
 * parse a block
 * <block> ::= [ <statement> ] *
 */
fun parseBlock() {
    while (!inp.isEndBlock(inp.nextChar)) {
        when (inp.nextChar) {
            ifToken -> parseIf()
            whileToken -> parseWhile()
            else -> parseOther()
        }
    }
}

As usual compile it and try it out.

An example of an input program with nested while loop can be:

w condition1
    statement1
    w condition2
        statement2
        statement3
    e
e

Give it various combinations of nested loops and ifs to make sure it works. Also give it invalid input, with missing ENDIF or ENDWHILE.

At this point I will pause, to give you time to experiment with these concepts and will come back soon with the 2nd part of Control Structures.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s