Based on the original series “Let’s Build a Compiler!” by Jack Crenshaw.
Introduction
In the previous chapter we completed the handling of Control Structures and developed a rich set of parsing functions that can translate them and produce object code. However, that left a big gap in our compiler’s capabilities: we did not parse the conditions. To temporarily close that gap, we used a dummy parseCondition
function that is just a placeholder for the real thing.
The General Plan
So far, Jack has lead us straightaway into writing code that deals with the issue at hand, without spending too much time in planning. And this worked well so far, as the topics covered were well defined: we know exactly what a '+'
means and what we are supposed to do with it. The same is true for a while
or a repeat
structure: we know exactly how to handle it.
The way programming languages implement Boolean logic however, can vary quite a bit from language to language, so Jack’s advice at this point is, before we start any serious coding, let’s first make decisions as to what we want. And the way to do this, is at the level of the BNF syntax rules (the Grammar), which, as we’ve seen already, translates almost directly to code.
The Grammar
In chapters 2 and 3 we have been mentioning the BNF syntax for the various numerical expressions. It would be a good idea to summarize them here again:
<expression> ::= [ <unary op> ] <term> [ <addop> <term> ] * <term> ::= <factor> [ <mulop> <factor> ] * <factor> ::= <integer> | <identifier> | ( <expression> )
The good thing about this grammar is that it enforces the operator precedence hierarchy as we would expect. At this point Jack wants us to amend this grammar and suggests that it’s better written like this:
<expression> ::= <term> [ <addop> <term> ] * <term> ::= <signed factor> [ <mulop> <factor> ] * <signed factor> ::= [ <addop> ] <factor> <factor> ::= <integer> | <identifier> | ( <expression> )
This way the task to handle the unary minus falls on parseFactor
, which is where it really belongs. This will be the syntax that we will be using from now on and we will adapt the parsing routines as necessary.
Now, if we try to define the corresponding grammar for Boolean algebra, the result will be very similar to the above:
<b-expresssion> ::= <b-term> [ <orop> <b-term> ] * <b-term> ::= <not-factor> [ <andop> <not-factor> ] * <not-factor> ::= [ NOT ] <b-factor> <b-factor> ::= <b-literal> | <b-identifier> | ( <b-expression> )
In this grammar the <orop>
(e.g. OR or XOR) is analogous to '+'
, the <andop>
(e.g. AND) is analogous to '*'
and the NOT
is analogous to unary minus. There is a little difference though between the way the NOT and the unary minus are handled. In a numerical expression, the unary minus goes with the whole term, so it can only appear once in a give term. This means that an expression like
a * -b
or even worse
a - -b
are not allowed. However a Boolean expression like the below
a AND NOT b
makes perfect sense. And this is reflected in the difference between the <term>
and <b-term>
definitions above.
Relops
So far we have defined the syntax of numeric and Boolean expressions. The question is what happens when we combine the two. In case you are wondering, the reason why we need to combine them is the conditions in our control structures. In some scenarios the condition can consist of a pure Boolean expression like:
IF a AND NOT b THEN ...
However, more often we will see something like:
IF (x >=0) AND (x <=100) THEN ...
In this case the two terms in parenthesis are Boolean expressions but the individual terms compared (x, 0, 100) are numeric. The relational operators (>=
and <=
) are the catalysts
(as Jack called them) that help Boolean and Numeric ingredients merge together.
In the above example the two terms compared are just terms, but generally they could be expressions. So a relation in NBNF could be defined as:
<relation> ::= <expression> <relop> <expression>
A relop is any of the well know relational operators: ==, !=, <, >, <=, >=
Jack points out here that this relation has a single Boolean value TRUE or FALSE, so it’s effectively a Boolean factor. With this in mind we can expand the definition of b-factor:
<b-factor> ::= <b-literal> | <b-identifier> | ( <b-expression> ) | <relation>
And this is exactly where the connection between Numerical and Boolean expressions is! As the BNF notation implies a hierarchy, where the arithmetic expressions have higher precedence than the Boolean expressions, let’s list clearly the precedence levels:
Level Syntax Element Operator 0 factor literal, variable 1 signed factor unary minus 2 term * , / 3 expression + , - 4 b-factor b-literal, b-variable, relop 5 not-factor NOT 6 b-term AND 7 b-expression OR, XOR
At this point I was really excited by this crystal-clear definition and was getting ready to start writing my parsing routines, when Jack brought me back to the ground by pointing out that this will not work! Even though this grammar would be good in theory, it would not be any good for a top-down parser like ours (actually Jack’s). The problem comes from cases like this:
IF ((((A + B) < 0) AND ...
When the parser sees the IF it expects a Boolean expression (please go and check how our parseIf
calls parseCondition
). So parseCondition
would start parsing that Boolean expression but as it would start going down the list of '('
expecting to find a b-expression as per the above grammar, it would come across a numerical expression which of course would not be able to process. The only way to get out of this situation without changing the grammar would be to accept an amount of backtracking so that our parser could work its way out of bad guesses like this. This however, would be a nightmare.
To deal with this we will have to make a compromise so that our simple parser can handle the grammar without backtracking.
Fixing the Grammar
The problem we’ve just encountered is caused by the fact that both the numerical <factor>
and the Boolean <b-factor>
allow for parenthesised expressions. And this becomes even more challenging as, through recursion, we can go dawn multiple levels of parentheses, at which point our parser has no idea whether it’s dealing with a numerical or a Boolean expression.
And the obvious solution is to allow parenthesis only in one kind of factor – and this will be the numerical factor. There will be no parenthesis allowed in Boolean factors. With this in mind the combined numerical and Boolean grammar can be written like this:
<b-expression> ::= <b-term> [ <orop> <b-term> ] * <b-term> ::= <not-factor> [ <andop> <not-factor> ] * <not-factor> ::= [ NOT ] <b-factor> <b-factor> ::= <b-literal> | <b-identifier> | <relation> <relation> ::= <expression> [ <relop> <expression> ] <expression> ::= <term> [ <addop> <term> ] * <term> ::= <signed factor> [ <mulop> <factor> ] * <signed factor> ::= [ <addop> ] <factor> <factor> ::= <integer> | <identifier> | ( <expression> )
With this grammar we have the same 7 levels of precedence as mentioned above. The main difference is that there is no parenthesised b-expression as an option for the b-factor. There is also one subtle but crucial difference, which makes the whole thing work. Notice the square brackets in the definition of relation. This is telling us that a relation can also be a plain numerical expression. The result of this is that everything can potentially be treated as a Boolean expression. Our parser will start looking for a Boolean expression and as it goes down the grammar, it can settle for a numeric one. And this is how C works: any integer can be treated as Boolean and vice-versa. By the way, as Jack reminded us in the original, C has actually no less than 17 levels of precedence as it has many more operators (++, --, <<, >>,
etc).
So, with this plan now agreed, let’s go and build the parsing functions.
The Parser
Please start with a copy of the code as it evolved at the end of Chapter 5. Also copy the numeric parser functions that you have from the end of Chapter 3. I have gathered these numeric parsing functions in a file called ProgParser2NumericExpr.kt
. Please create a new file ProgParser3BooleanExpr.kt
where we will write all the Boolean parsing functions.
Jack guides us through the same steps as he did for the numeric parsing functions. So, first things first, let’s recognise a Boolean literal and let’s fetch one from the input stream (in our InputPtogramScanner.kt
):
/** check for a Boolean value (T,F) */ fun isBoolean(c: Char): Boolean = c.uppercaseChar() == 'T' || c.uppercaseChar() == 'F'
/** get a boolean literal */
fun getBoolean(): Boolean {
var value = false
if (!isBoolean(nextChar))
expected("Boolean Literal")
else
value = nextChar.uppercaseChar() == 'T'
skipNextChar()
skipWhite()
return value
}
To test these two routines please change parseOther
in ProgParser1ControlStruct.kt
as below:
/** dummy parse function for anything else than control statements */
fun parseOther() {
println("\t${inp.getBoolean()}")
}
Run our compiler with a file containing a series of Ts and Fs as input. The compiler should properly recognise and print these Boolean values and will stop with an error message at any other unrecognised name or number or other unrecognised character (please remember to use upper case F for “false” as the lower case “f” will trigger the parser function for the FOR loop – our compiler still recognises the control structures!).
Next step is to set the Accumulator (register D0 in the case of the 68000) to our Boolean value. Jack chose 0 for False and FFFF hex for True, which is more Assembly-language-friendly (the bitwise NOT also becomes a Boolean NOT). I will use 0 and 1 as it is more appropriate for the decimal arithmetic that the TI-59 is using. The first cut of our Boolean expression parser is:
/** parse a Boolean expression */ fun parseBooleanExpression() { if (inp.getBoolean()) code.setAccumulator("1") else code.clearAccumulator() }
To test it, change again the parseOther
function, this time to:
/** dummy parse function for anything else than control statements */
fun parseOther() {
parseBooleanExpression()
}
Our compiler will now produce the correct assembly instruction for each T and F in the input to set D0 to 1 or 0 respectively.
With the basics out of the way, we can now crack on and start writing the necessary parse functions for each level of our Boolean grammar. We just have to copy the corresponding numerical parsers and make some minor adjustments. For Boolean operators, given that we are still constraint by single-character tokens we will use '|'
for OR, '~'
for XOR, '&'
for AND, '!'
for NOT, '?'
for is equal to, '#'
for is not equal to, '<'
for is less than and '>'
for is greater than (we will expand to “less than or equal to” and “greater than or equal to” in the next chapter where we will be able to process multi-character operators).
We will start with the parsing of Boolean expressions and the “orops” OR and XOR. Please rename the version of parseBooleanExpression
to parseBooleanTerm
:
/** parse a boolean term */
fun parseBooleanTerm() {
if (inp.getBoolean())
code.setAccumulator("1")
else
code.clearAccumulator()
}
This is the new version of ParseBooleanExpression and also the or and xor functions:
/** parse a Boolean expression */
fun parseBooleanExpression() {
parseBooleanTerm()
while (inp.isOrop(inp.nextChar)) {
code.saveAccumulator()
when (inp.nextChar) {
orOp -> boolOr()
xorOp -> boolXor()
}
}
}
/** parse boolean or */
fun boolOr() {
inp.match()
parseBooleanTerm()
code.orAccumulator()
}
/** parse boolean xor */
fun boolXor() {
inp.match()
parseBooleanTerm()
code.xorAccumulator()
}
We need to define the two Boolean operators in our global definitions section and write the function that checks for them:
val orOp: Char = '|' val xorOp: Char = '~' /** check for an "orop" (|,~) */ fun isOrop(c: Char): Boolean = c == orOp || c == xorOp
Don’t forget the two new instructions in our code module:
/** or top of stack with accumulator */
fun orAccumulator() = outputCodeNl("OR (SP)+,D0")
/** exclusive or top of stack with accumulator */
fun xorAccumulator() = outputCodeNl("EOR (SP)+,D0")
Our compiler now will recognise OR and/or XOR operations between Boolean literals.
Next step, let’s add the parsing for a Boolean term and for the AND operation. First, rename the current version of parseBooleanTerm
to parseNotFactor
and write this version of parseBooleanTerm
:
/** parse a boolean term */ fun parseBooleanTerm() { parseNotFactor() while (inp.isAndop(inp.nextChar)) { code.saveAccumulator() when (inp.nextChar) { andOp -> boolAnd() } } } /** parse a not factor */ fun parseNotFactor() { if (inp.getBoolean()) code.setAccumulator("1") else code.clearAccumulator() }
Also, need the function to execute the Boolean AND:
/** parse boolean and */
fun boolAnd() {
inp.match()
parseNotFactor()
code.andAccumulator()
}
You may want to point out that we have just one “andop” (which is actually the AND operator) so why complicate parseBooleanTerm
by checking which andop
we have and why call a separate function (boolAnd
) to perform the AND operation instead of doing it all inside parseBooleanTerm
? And you would be right. I just chose to keep the standard we have followed in parseTerm
. Also, by putting a placeholder that checks which “andop” we have, if we later decide that we have another “andop” at the same precedence as the AND, it will be just one more line in parseBooleanTerm
and one new function like boolAnd
to activate it.
Same as before we also need to define the AND operator and the function that recognises it:
val andOp: Char = '&' /** check for an "andop" (&) */ fun isAndop(c: Char): Boolean = c == andOp
And of course the new instruction:
/** and top of stack with accumulator */
fun andAccumulator() = outputCodeNl("AND (SP)+,D0")
Verify it works, it processes a series of T and F literals and does OR, XOR and AND with the right level of precedence.
Next, let’s add the NOT operator. Please rename parseNotFactor
to parseBooleanFactor
and write the new version of parseNotFactor
:
/** parse a not factor */
fun parseNotFactor() {
if (inp.nextChar == notOp) {
inp.match()
parseBooleanFactor()
code.booleanNotAccumulator()
}
else
parseBooleanFactor()
}
/** parse a boolean factor */
fun parseBooleanFactor() {
if (inp.getBoolean())
code.setAccumulator("1")
else
code.clearAccumulator()
}
We also need to define the NOT operator:
val notOp: Char = '!'
and the corresponding instruction:
/** boolean not accumulator */
fun booleanNotAccumulator() = outputCodeNl("EOR #1,D0")
Now you can add some NOT factors in your test cases and make sure it works.
And now it’s time to add the relations. First we need to modify parseBooleanFactor
:
/** parse a boolean factor */
fun parseBooleanFactor() {
if (inp.isBoolean(inp.nextChar)) {
if (inp.getBoolean())
code.setAccumulator("1")
else
code.clearAccumulator()
}
else
parseRelation()
}
This is how to parse a relation:
/** parse a relation */ fun parseRelation() { parseExpression() if (inp.isRelop(inp.nextChar)) { code.saveAccumulator() when (inp.nextChar) { isEqual -> parseEquals() isNotequal -> parseNotEquals() isLess -> parseLess() isGreater -> parseGreater() } } }
Please note the call to parseExpression
above. This is where the circle closes, where our Boolean parser meets the Numerical one.
The 4 parsers for equals to, not equals to, less than and great than are here:
/** parse is equal to */
fun parseEquals() {
inp.match()
parseExpression()
code.compareEquals()
}
/** parse is not equal to */
fun parseNotEquals() {
inp.match()
parseExpression()
code.compareNotEquals()
}
/** parse is less than */
fun parseLess() {
inp.match()
parseExpression()
code.compareLess()
}
/** parse is greater than */
fun parseGreater() {
inp.match()
parseExpression()
code.compareGreater()
}
We need to define the relational operators in our global definitions section and write the function to recognise them:
val isEqual: Char = '?' val isNotequal: Char = '#' val isGreater: Char = '>' val isLess: Char = '<' /** check for a relop (=, #, <, >) */ fun isRelop(c: Char): Boolean = c == isEqual || c == isNotequal || c == isLess || c == isGreater
At this point Jack explains in detail how these operations should be translated to M68000 Assembly. The key here is that both the CPU flags must be set as they are used to control the flow in a control statement, but also the “Accumulator” must be set to the corresponding Boolean value as it can be used a Boolean variable. I will skip that detail, but feel free to read it in the original. I will just copy the Assembly instructions from Jack’s tutorial:
/** compare and set accumulator and flags - is equal to */
fun compareEquals() {
outputCodeNl("CMP (SP)+,D0")
outputCodeNl("SEQ D0")
outputCodeNl("TST D0")
}
/** compare and set accumulator and flags - is not equal to */
fun compareNotEquals() {
outputCodeNl("CMP (SP)+,D0")
outputCodeNl("SNE D0")
outputCodeNl("TST D0")
}
/** compare and set accumulator and flags - is less than */
fun compareLess() {
outputCodeNl("CMP (SP)+,D0")
outputCodeNl("SGE D0")
outputCodeNl("TST D0")
}
/** compare and set accumulator and flags - is greater than */
fun compareGreater() {
outputCodeNl("CMP (SP)+,D0")
outputCodeNl("SLE D0")
outputCodeNl("TST D0")
}
And with this we have the complete Grammar covered. Our compiler can now parse Boolean expressions like
x > 0 & x < 100
and produce the right Assembly code for them. Remember, no parentheses are needed in the above expression. If you use (x > 0) & (x < 100)
it will stop at the '>'
with an error as it will be processed by the numeric parser which of course does not accept a '>'
.
Adapting the Numeric Parsers to the New Grammar
If you remember, our grammar for the numeric parsing was slightly changed in the beginning of this chapter, by moving the handling of the unary minus from the Term to the Factor. Let’s modify our parsing routines to follow this improved grammar.
First parseExpression
is changed back to what it was initially before we added the “trick” for the unary minus:
/**
* parse a numeric expression
* <expression> ::= <term> [ <addop> <term> ] *
*/
fun parseExpression() {
parseTerm()
while (inp.isAddop(inp.nextChar)) {
code.saveAccumulator()
when (inp.nextChar) {
addOp -> add()
subOp -> subtract()
}
}
}
Next parseTerm
is changed to use parseSignedFactor
as per grammar:
/**
* parse a term
* <term> ::= <signed factor> [ <mulop> <factor> ] *
*/
fun parseTerm() {
parseSignedFactor()
while (inp.isMulop(inp.nextChar)) {
code.saveAccumulator()
when (inp.nextChar) {
mulOp -> multiply()
divOp -> divide()
}
}
}
We need the new function parseSignedFactor
:
/** * parse a signed factor * this can be only the first factor in a term * <signed factor> ::= [ addop ] <factor> */ fun parseSignedFactor() { if (inp.nextChar == addOp) inp.match() if (inp.nextChar == subOp) { inp.match() if (inp.isNumeric(inp.nextChar)) code.setAccumulator("-${inp.getNumber()}") else { parseFactor() code.negateAccumulator() } } else parseFactor() }
This function checks first for a leading '+'
and ignores it. Then it checks for a '-number'
which is processed straight away in one instruction. '-(expression)'
and '-identifier'
are processed by parseFactor
as usual and then negated. This way the output code is much more efficient.
We also need to add the new instruction to negate the accumulator:
/** negate accumulator */
fun negateAccumulator() = outputCodeNl("NEG D0")
And with that our numeric parser is now compliant with the new improved grammar. We will test this after we complete the merging of this code with the rest of the structures further down in this chapter.
Merging with Control Structures
Now it’s time to go back to the ProgPraser1ControlStruct.kt
that’s holding all the control structures parser functions. All we need to do is replace all the calls to parseCondition
with calls to parseBooleanExpression
. ParseCondition
, which was just a dummy placeholder, has served its purpose and can now be deleted.
Adding Assignments
And finally, to merge the numeric functions that we developed in chapters 2 and 3 we need to replace the call to parseOther
in praseBlock
with a call to parseAssignment
. ParseOther
can also be deleted.
You also need to replace the two calls to parseOther
in the parseFor
function with calls to parseExpression
for now.
ParseAssignment
needs a minor change as it now has to call parseBooleanExpression
.
/**
* parse assignment
* <assignment> ::= <identifier> = <b-expression>
*/
fun parseAssignment() {
val identName: String = inp.getName()
inp.match(equalsOp)
parseBooleanExpression()
code.assignment(identName)
}
This change is necessary to properly support our new grammar. If you remember, we realised above that any integer can be treated as Boolean and vice versa. Our numeric expression parser will start looking for a Boolean expression, and if it does not find one, it can settle for a numeric one.
And now we can write a (-n almost) proper program:
x = set_x() w x > 0 & x < 100 x = x + 1 i check_y() ? 0 b e e a = -(x+1) * 5 .
The output will be 46(!) lines of Assembly code but it will do the job. This program is now looking very realistic, the only “funny” bit about it being the single-character tokens.
As the code has significantly grown by now, you can find it in my GitHub repository.
Coming up next: Lexical Scanning
In the next chapter we will build a more advanced lexical scanner that will help us eliminate the single-character token constraint.