Based on the original series “Let’s Build a Compiler!” by Jack Crenshaw.
Introduction
This chapter brings us back to our compiler. Keep the interpreter that you have built in chapter 4 somewhere safe, as going forward you may have to refer to it and use some of those routines.
So far we have covered numerical expressions, so this is the natural next step where we add ifs, loops, breaks, etc. to make our experimental programming language a bit more structured. Jack promised in his original tutorial that having covered the expressions parsing, we will find control structures parsing very easy! Let’s see if you will agree with him.
As I have done in the first 4 chapters, I will stick with Jack’s approach: I will use single character tokens for the control statements to make things simpler while we concentrate on the parsing. I will however keep the handling of white spaces in the code, as it will make it a bit easier for us to construct the various input programs to test our compiler. Please note, I will also keep the handling of multi-character strings in getName
and getNumber
– they can process single-character tokens equally well as long as they are separated by white space or other special character. I will also keep the skipping of newlines as we have done in chapter 4, so that I can feed our compiler a multi-line program which will help us understand a bit better the flow of a control structure.
This chapter turned out to be quite long, so I will publish it in 2 parts.
The General Plan
We will use the version of our compiler from the end of chapter 3.
Given that we are sticking with single-character tokens for now, the input “code” will look a bit funny. We will use 'i'
for IF, 'w'
for WHILE, etc. To cheer you up a bit this restriction will be removed in chapter 7, at which point our programming language will be much easier to read.
Jack wants us to concentrate strictly on control statements in this chapter, so we will not be parsing any other statements here. For this purpose we need a dummy parse function that will print the next alphanumeric token to the output just for us to follow the code:
/** dummy parse function for anything else than control statements */ fun parseOther() { println("\t${inp.getName()}") }
We now have to extend our scanner to be able to deal with more than one lines. This was done in the Interpreter chapter by adding the call to skipNewline
to the main loop. We will do it a bit more formally here. Consider the following definitions:
<program> ::= <block> end
<block> ::= [ <statement> ] * end
…which say that a program is a block followed by the keyword 'end'
while a block is any number of statements (including no statements at all – empty block) followed by the 'end'
keyword. These definitions can be easily converted to parsing functions as we have been doing so far:
/** * parse a program * <program> ::= <block> end */ fun parseProgram() { parseBlock() inp.match(endProg) println("\nEND PROGRAM") } /** * parse a block * <block> ::= [ <statement> ] * end */ fun parseBlock() { while (!inp.isEndProgram(inp.nextChar)) parseOther() }
And with these new parsing functions in place, our main loop now becomes
/** main function */ fun main(args: Array<String>) { initCompiler(args) parseProgram() }
We also need to add the global definition for our program terminator ('.'
is used here – this will be changed to something more user-friendly in chapter 7) and the function that checks for end of program.
// tokens val endProg = '.' /** check for end of program */ fun isEndProgram(c: Char): Boolean = c == endProg
Finally, in order to process a multi-line program, we need to deal with newline characters. There are two schools: if we want our programming language to be line-oriented (e.g. like BASIC where the end of each line is also the end of the statement) then we would have to make changes in many parts of the code and add processing of the newline character. If however, we decide to make our language free-field (like C or Kotlin) then we have an easy solution: we just need to change the function skipWhite
so it will skip both white spaces and line terminators. I will follow the latter, so in other words, I will be treating the line terminator as a white space. We will see later if we can keep this approach or may have to change the handling of newlines.
/** check for a white space */ fun isWhite(c: Char): Boolean = c == ' ' || c == '\t' || isEndOfLine(c)
With these changes our compiler will process an input program of as many lines as you want and will recognise and print any alphanumeric tokens. It will flag any number, operator or other special character as an error. It will also flag the absence of the end-program token ('.'
) as an error.
IF Statement
The syntax for the IF statement that we will support here is as follows:
IF <condition> <block> ENDIF
As you can see there is a specific token that terminates the IF-block. This will be the approach in this chapter where each of the control statements will have its own termination token.
Thinking of the flow around the IF statement and trying to convert this to Assembly-friendly format (we will convert to TI-59 at a later stage) then the below IF block
IF <condition> A ENDIF B
…
becomes
Branch if NOT <condition> to L A L: B ...
Our assembly code that will be produced here, will be branching around the IF-block if the condition is false.
With the relaxed restrictions, as mentioned above, a block like IF Condition A ENDIF B ...
can be written in our programming “language”:
i condition a e b .
which is much easier to read than icaeb.
(this is what our test “program” would look like if we did not process white spaces and newlines and returned single characters in getName
)
First, the functions needed to produce the unique labels:
// global vars var labelIndx: Int = 0 /** create a unique label*/ fun newLabel(): String = "L${labelIndx++}" /** post a label to output */ fun postLabel(label: String) = code.outputLabel(label)
And in our M68000Instructions.kt
:
/** output a label */ fun outputLabel(s: String) = println("$s:")
If we think of the actions that we need to take for each of the keywords in the IF-block, we have something like this:
- IF: First, get the condition and issue the code for it.
- Then, create a unique label and issue the code to branch to it if the condition is false
- ENDIF: output the label
At this point Jack introduces us to what is called “syntax-directed translation”, which we have been doing anyway, but here’s how to be clear about it:
IF <condition> { condition L = NewLabel Code (Branch if False to L) } <block> ENDIF { PostLabel (L) }
Based on this, the code for our IF parser will look like this (please note we are using a dummy function for the condition as well – this will be covered fully in the next chapter):
/** * parse if statement * IF <condition> <block> ENDIF */ fun parseIf() { inp.getNextChar() parseCondition() val label = newLabel() code.branchIfFalse(label) parseBlock() inp.match(endifToken) postLabel(label) } /** dummy parse condition */ fun parseCondition() { println("\t${inp.getName()}") }
We need to add the new instruction to M68000Instructions.kt
:
/** branch if false */
fun branchIfFalse(label: String) = outputCodeNl("BEQ $label")
And parseBlock
has to be changed to call parseIf
when it sees the 'i'
token:
/** * parse a block * <block> ::= [ <statement> ] * */ fun parseBlock() { while (inp.nextChar != endifToken && !inp.isEndProgram(inp.nextChar)) { when (inp.nextChar) { ifToken -> parseIf() else -> parseOther() } } }
Don’t forget to define the tokens for IF and ENDIF in our global definitions section:
val ifToken = 'i'
val endifToken = 'e'
Compile this code and run it with this IF statement as input:
i condition1 statement1 e statement2 .
It produces the following output:
CONDITION1 BEQ L0 STATEMENT1 L0: STATEMENT2
Note how the code branches around statement1 if condition1 is false. I’m not quite satisfied with our error checking in this version though. But I will address this very soon once we complete our IF statement with the addition of ELSE. In BNF this is:
IF <condition> <block> [ ELSE <block> ] ENDIF
And in syntax-directed translation we can write this as follows:
IF <condition> { L1 = NewLabel L2 = NewLabel code (Branch if False to L1) } <block> <ELSE> { code (Branch to L2) PostLabel (L1) } <block> <ENDIF> { PostLabel (L2) }
If we combine this with the ELSE-less case that we dealt with further up then our parseIf
function becomes (please remember, we are using 'i'
for IF, 'e'
for ENDIF and now 'l'
for ELSE):
/** * parse if statement * IF <condition> <block> [ ELSE <block> ] ENDIF */ fun parseIf() { inp.getNextChar() parseCondition() val label1 = newLabel() code.branchIfFalse(label1) parseBlock() if (inp.nextChar == elseToken) { inp.getNextChar() val label2 = newLabel() code.branch(label2) postLabel(label1) parseBlock() postLabel(label2) } else postLabel(label1) inp.match(endifToken) }
We need to remember to terminate the parseBlock
loop when it sees the else token as well:
/** * parse a block * <block> ::= [ <statement> ] * */ fun parseBlock() { while (inp.nextChar != endifToken && inp.nextChar != elseToken && !inp.isEndProgram(inp.nextChar)) { when (inp.nextChar) { ifToken -> parseIf() else -> parseOther() } } }
Don’t forget to define the else token in our global definitions:
val elseToken = 'l'
And of course the new instruction for our assembly code:
/** branch */
fun branch(label: String) = outputCodeNl("BRA $label")
Compile it and try it out. Give it the below input:
i condition1 statement1 l statement2 e statement3 .
It will give you this output:
CONDITION1 BEQ L0 STATEMENT1 BRA L1 L0: STATEMENT2 L1: STATEMENT3
As expected, the code branches around statement1 if false, and executes statement2 and statement3, while if true, it executes statement1 and then goes to statement3. Try it also without else to make sure this case is not broken. Also try it with nested ifs and verify it works. As an exercise you can try to add THEN as a mandatory keyword. I believe it can be done with just one line of code!
Error Handling
At this point I would like to improve our error handling. If you try it without the endif token, our error handling will correctly pick it up. If however you try it without the endif token and without the end-program token, it will just say “end of input” without pointing out the error. In fact, the end-program token (our '.'
in this case) is not processed at all, unless there is some other non-white space character following it. Something is wrong here. It looks like the issue is with skipNextChar
, which when it reaches the end of input, just aborts the compilation, so match
does not get its chance to do its error checking.
To improve this, I’m changing skipNextChar
to make it always return something in nextChar
, even when it’s run out of characters in the input stream. For this I’ve picked the null character – this will be our end-of-input token.
// end of input mark val nullChar = 0 val endOfInput = nullChar.toChar()
Our end of program check now becomes:
/** check for end of program */ fun isEndProgram(c: Char): Boolean = c == endProg || c == endOfInput
I will also define a function to check for end of block:
/** check for end of block */ fun isEndBlock(c: Char) = c == endifToken || c == elseToken || isEndProgram(c)
With this, the exit condition in the parseBlock
loop can be changed to:
/** * parse a block * <block> ::= [ <statement> ] * */ fun parseBlock() { while (!inp.isEndBlock(inp.nextChar)) { when (inp.nextChar) { ifToken -> parseIf() else -> parseOther() } } }
I’m also improving function expected
to print a more meaningful message than trying to print the null character when we encounter end of input:
/** report what was expected and abort */
fun expected(expMsg: String) {
// setup a meaningful message if we have reached end of input
val foundStr: String = if (inp.nextChar==endOfInput) "end of input" else inp.nextChar.toString()
abort("Expected $expMsg \n found [${foundStr}]")
}
Note how in Kotlin the if
statement actually returns a value, so there is no need for the ternary operator in this language.
This is the new version of skipNextChar
:
/** set the lookahead character to the next char from input */
fun skipNextChar() {
if (indx < inputProgram.length)
nextChar = inputProgram[indx++]
else
nextChar = endOfInput
}
And with this improvement we don’t need to add a '\n'
at the end of our input program that was meant to help with handling of the end of input:
/** initialisation code class InputProgramScanner */ init { try { val f = File(inputFile) // read the whole program into a string inputProgram = f.readText() // set the lookahead character to the first input char getNextChar() } catch (e: Exception) { abort("could not open file [$inputFile]") } }
Try it again. It looks like the error handling is much improved now.
WHILE Statement
With the IF out of the way, it should be now easy to process a WHILE structure. The syntax of WHILE in our language will be:
WHILE <condition> <block> ENDWHILE
It’s not really necessary to use different end-tokens for each control statement. It does add clarity to the code though. However, at this stage where we are using single-character tokens and we will be using 'e'
for both endif and endwhile.
Looking at the above definition, we will need a Label at the beginning of the loop to branch to, when we start a new iteration, and a Label just after the end of the loop to branch to, when the condition is false. This in syntax-directed translation becomes:
WHILE { L1 = NewLabel PostLabel(L1) } <condition> { (Branch if False to L2) } <block> ENDWHILE { (Branch to L1) PostLabel(L2) }
And the parse function for while is like this:
/** * parse while statement * WHILE <condition> <block> ENDWHILE */ fun parseWhile() { inp.getNextChar() val label1 = newLabel() val label2 = newLabel() postLabel(label1) parseCondition() code.branchIfFalse(label2) parseBlock() inp.match(endwhileToken) code.branch(label1) postLabel(label2) }
We need to remember to declare our while and endwhile tokens and to terminate the loop in parseBlock
when we see the endwhile token. This is done easily by updating isEndBlock
:
val whileToken = 'w' val endwhileToken = 'e' /** check for end of block */ fun isEndBlock(c: Char) = c == endifToken || c == elseToken || c == endwhileToken || isEndProgram(c)
And of course we need to tell parseBlock
to check for a while statement:
/** * parse a block * <block> ::= [ <statement> ] * */ fun parseBlock() { while (!inp.isEndBlock(inp.nextChar)) { when (inp.nextChar) { ifToken -> parseIf() whileToken -> parseWhile() else -> parseOther() } } }
As usual compile it and try it out.
An example of an input program with nested while loop can be:
w condition1 statement1 w condition2 statement2 statement3 e e
Give it various combinations of nested loops and ifs to make sure it works. Also give it invalid input, with missing ENDIF or ENDWHILE.
At this point I will pause, to give you time to experiment with these concepts and will come back soon with the 2nd part of Control Structures.