Based on the original series “Let’s Build a Compiler!” by Jack Crenshaw.
Introduction
In the previous chapter we produced our first real running program written in TINSEL, our own programming language. We could give it input from the keyboard and and see its output on the screen. Yes, I admit, it’s a bit basic, but we’ve come a long way from the one-character tokens and single-line programs from the beginning of the series that were producing a mixture of M68000 instructions and pseudo-code. This has been an amazing journey that we’ve managed to travel following on Jack’s footsteps. And in this chapter we will add a couple of features that will make TINSEL even better.
Semicolons
Same as Jack, I had to use semicolons since I learnt Algol! And I continued using them with Pascal, C and Java. But, guess what – I don’t have to use them anymore now that I have started programming in Kotlin!
So, does a modern language really need semicolons or not? The answer is “yes” and “no”. As Jack explained in detail, one school says that they are just a nuisance – they are adding nothing to the syntax, and a modern language can function perfectly well without them (and Kotlin does actually). Don’t forget that our compiler can compile the below statements correctly, without the need of any semicolons:
a = b * 2 + (x+y) / z i = 1
It can tell where each statement finishes and where the next one starts, even if they extend across multiple lines, so, as the minimalists would say, why use semicolons? Why add an extra mandatory keystroke in every statement, which will be a source of another potential mistake?
The other side of the coin is that semicolons add clarity and make the code less cryptic and much easier to read. Think of the case where we have 2 or 3 short statements in one line:
i=1; a=10; b=a+1;
Without semicolons this line would look a bit “funny”:
i=1 a=10 b=a+1
But, again, including multiple statements in one line is not good programming practice, and should we make semicolons mandatory for that rare case?
So, I will do exactly what Jack did: make them optional. The compiler will process one when it sees one, but will not ask for one if it’s not there.
Based on this, our BNF definition of statement changes as per below:
<statement> ::= <block> | <if> | <while> | <repeat> | <for> | <break> | <return> | <read> | <print> | <assignment> | null [ ; ]
which says that any statement can be optionally terminated by a semicolon and we can also have an empty statement. And the code to implement this is the following three lines in parseStatement
:
... Kwd.identifier -> parseAssignment() Kwd.semiColon -> inp.match() // semicolons are simply ignored else -> inp.expected("valid keyword, semicolon or identifier") }
The changes here are: (a) we only trigger parseAssignment
when we actually see an identifier (and not in the else
clause as before), (b) when we see a semicolon we recognise it and ignore it and (c) we catch any invalid tokens in the else
clause. And that was it!
Now, if you want to make semicolons mandatory, you will need to add the line
inp.match(Kwd.semiColon)
at the end of the parser functions for break, return, read, print and assignment. In that case, you may also want to add them to the parsers for ProgHeader, VarDecl, and ProgEnd for consistency. I’ll let you decide whether you want to make semicolons mandatory in your own version of TINSEL or not.
Comments
Some people are against comments, some are in favour. Some say the code speaks for itself, some see the need of a comment in certain cases to explain what is happening and why, as no matter how “clean” the code is, every now and then there will be an obscure point that needs a comment to clarify. I have decided to include comments in TINSEL so that you can use them if you want to.
The processing of comments can be as easy or as complicated as we make it.
Here’s my spec: I want both block comments like /* comment here... */
and also inline comments like this // comment inline
.
I will also add a little twist: I will define an extra comment token that will transfer those comments from the source code to the output code. For this I will use the tokens /*--
and //--
<block-comment> ::= /* <comment> */ | /*-- <comment> */ <inline-comment> ::= // <comment> <newline> | //-- <comment> <newline>
With these specs in mind, the next question is: where is the best place to process comments?
Jack has put his skipComment
function as low in the scanner as possible, inside skipWhite
. And this is an approach that served him well, especially in the case where he was working with single-character tokens. I tried this in my code but I had two issues with it: (a) it resulted in a very complicated function to detect the start of a comment (and the kind of the comment) – 27 lines – the reason for this is that by working inside skipWhite
, I had to go back to string comparisons and could not use the power of scan
that translates the character sequences to tokens, and (b) the comments that are transferred to the output code are printed in the wrong place (I will explain this in more detail shortly). So based on this, I abandoned the idea. If you want to see how I implemented it, I have saved it in the file InputProgramScanner_1.kt
in the GitHub repository.
Another idea would be to process comments as high as possible, in the parser. That would mean a parseComment
function that would have to be triggered when a start-of-comment token was returned by scan
(or actually match
). However, that parse function would have to be called in all the other parser functions where a comment could be seen and this is too many places. Also, we would have another issue in case of an in-line comment that would have to be transferred to the output – our parseComment
would not know where to stop in that case, as the newline is not visible at the parser level (it is skipped in skipWhite
). To resolve this we would have to treat newline as another token and have special logic in scan
to decide when to skip it and when to return it as the next token. For these reasons I abandoned this idea too.
So, I have decided to go for a solution in the middle – process comments in match
. This is the place where we match each token and then we advance to the next one (including skipping white spaces in scan
), so naturally this would be the right place to process any comments if the next token is the start of one. Normally we would want to process the comments in the beginning of match so that any comments would be processed (and printed in the right place) and then nextToken
would be advanced to the next actionable token to be matched. And this would be a good approach apart from one issue: match
is not the only place where we check the next token. We also check it via lookahead
, which returns the next token without advancing the cursor. So we have a discrepancy here where match
would point to the next actionable token skipping any comments, while lookahead
would point to the next token even if that was a comment. To fix this we would have to process the comments in lookahead
as well, but that would mean that lookahead would also advance the cursor and this would cause another discrepancy, as lookahead
is supposed to not move the cursor.
It looks like the processing of the comments has to be done at the end of match
. And this makes sense based on the logic of my code, given that this is where the previous token has been matched, we have called scan
, skipped white spaces and advanced to the next token. In my mind, this is the right place to check for any comments and process them, so that nextToken
will point to the next actionable token. We still have one issue though. Imagine this scenario:
program prog1 var a //-- function to calculate square fun square() { ...
Here we are processing the var
declaration, we have matched “var” and are calling match
to get the variable called “a”. This call, after matching “a” would process (and print) the comment that follows and then advance to the fun
token, ready for the next match. Given that the assembly code is transferred to the output after match
returns, while the comment would be printed from inside match
, the assembly code would look like this:
# program prog1 .data .align 8 # # function to calculate square a: .quad 0 .text .align 8 .global _start square: ...
As you can see the comment is transferred to the output code in the wrong place – it is printed just before the declaration of var a. There is an easy way to fix it. If we have a printable comment, we don’t actually print it, but keep it in a variable instead. We then check the contents of that variable in the beginning of match and print its contents if any. This way each comment that has been processed is printed in the output code in the beginning of the next match cycle, which is where it should be.
So, here’s the new function getComment
, which branches into two functions that process block and inline comments respectively.
/** get a comment */ private fun getComment() { while (nextToken.type == TokType.commentStart) when (nextToken.encToken) { Kwd.blockComment -> getCommentBlock() Kwd.blockCommentOut -> getCommentBlock(true) Kwd.inlineComment -> getCommentInline() Kwd.inlineCommentOut -> getCommentInline(true) else -> expected("start of comment") } } /** get a block comment */ private fun getCommentBlock(printToOut: Boolean = false) { var localCommentString = code.COMMENT val endComment: String = decodeToken(Kwd.commentEnd) while (!inputProgram.substring(cursor).startsWith(endComment) && nextChar != endOfInput) { localCommentString += nextChar if (nextChar == '\n') localCommentString += code.COMMENT getNextChar() } localCommentString += '\n' nextToken = scan() // nextToken now points to endComment or endOfInput if (nextToken.encToken == Kwd.endOfInput) expected(endComment) nextToken = scan() // nextToken now points to the next token after the comment if (printToOut) commentString += localCommentString } /** get an in-line comment */ private fun getCommentInline(printToOut: Boolean = false) { var localCommentString = code.COMMENT while (nextChar != '\n' && nextChar != endOfInput) { localCommentString += nextChar getNextChar() } localCommentString += '\n' nextToken = scan() if (printToOut) commentString += localCommentString }
getCommentBlock
reads the input until it finds the end-of-commend token, while getCommentInline
reads the input until it gets a newline. Both transfer the comment to the commentString
variable which is printed in the beginning of match
in the next call. This is the modified match
function:
fun match(keyWord: Kwd = Kwd.any): Token { printComment() // any comments found in the previous call must be printed in the output code now if (keyWord != Kwd.any && nextToken.encToken != keyWord) // check keyword to match expected(decodeToken(keyWord)) val thisToken = nextToken nextToken = scan() // advance to next token getComment() // process any comments return thisToken } private fun printComment() { if (commentString != "") { code.outputCode(commentString) commentString = "" } }
The only thing to add is that any comments before our first token (which is program
if you remember) will not be processed here. To fix this we just need to add one more call to getComment
at the end of the init
section of our scanner. This will take care of any comments before the program...
Also, please note, this version does not support nested comments.
I did not expect the processing of comments to be so complex, but it is there now and it’s working. As I spent quite some time on the scanner in this chapter, I have taken the opportunity and improved the scan
function, which is the brains and engine of our scanner. I have broken it down to smaller functions and made it easier to read (full details in the GitHub repository):
private fun scan(): Token { skipWhite() if (checkEndofInput()) return Token(END_OF_INPUT, Kwd.endOfInput, TokType.none) if (checkNumeric()) return Token(getNumber(), Kwd.number, TokType.none) if (checkAlpha()) return keywordOrIdentifier(getName()) if (checkSpecialToken()) return getSpecialToken() return getInvalidToken() }
And in the spirit of tidying up, I have also removed the field info
from the Token
class. It has not been used and I don’t think it will be needed at all. With these enhancements in place, you can rewrite the TINSEL program from chapter X like this:
/*-- My First TINSEL program demonstrates the basic capabilities of the language Accepts an integer n as input and prints the squares of all the integers from 1 to n */ program tinselExample //-- variables declarations var i = 0, n /*-- calculate the square of i only for positive i */ fun i_square() { if (i <= 0) { return -1 } return i * i; } // main entry point main { read n; //-- main loop repeat { i = i + 1 print i_square(); } until (i >= n) } endprogram
As I said earlier in this chapter, it’s up to you how much you want to use comments in your code and whether you want to use semicolons and when.
And as always, you can find the code in my GitHub repository.
Coming up next: Improving our Error Checking