Chapter XI: Semicolons and Comments

Based on the original series “Let’s Build a Compiler!” by Jack Crenshaw.

Introduction

In the previous chapter we produced our first real running program written in TINSEL, our own programming language. We could give it input from the keyboard and and see its output on the screen. Yes, I admit, it’s a bit basic, but we’ve come a long way from the one-character tokens and single-line programs from the beginning of the series that were producing a mixture of M68000 instructions and pseudo-code. This has been an amazing journey that we’ve managed to travel following on Jack’s footsteps. And in this chapter we will add a couple of features that will make TINSEL even better.

Semicolons

Same as Jack, I had to use semicolons since I learnt Algol! And I continued using them with Pascal, C and Java. But, guess what – I don’t have to use them anymore now that I have started programming in Kotlin!

So, does a modern language really need semicolons or not? The answer is “yes” and “no”. As Jack explained in detail, one school says that they are just a nuisance – they are adding nothing to the syntax, and a modern language can function perfectly well without them (and Kotlin does actually). Don’t forget that our compiler can compile the below statements correctly, without the need of any semicolons:

a = b * 2 
    + (x+y) / z
i = 1

It can tell where each statement finishes and where the next one starts, even if they extend across multiple lines, so, as the minimalists would say, why use semicolons? Why add an extra mandatory keystroke in every statement, which will be a source of another potential mistake?

The other side of the coin is that semicolons add clarity and make the code less cryptic and much easier to read. Think of the case where we have 2 or 3 short statements in one line:

    i=1; a=10; b=a+1;

Without semicolons this line would look a bit “funny”:

    i=1 a=10 b=a+1    

But, again, including multiple statements in one line is not good programming practice, and should we make semicolons mandatory for that rare case?

So, I will do exactly what Jack did: make them optional. The compiler will process one when it sees one, but will not ask for one if it’s not there.

Based on this, our BNF definition of statement changes as per below:

<statement> ::= <block> | <if> | <while> | <repeat> | <for> | <break> |
                <return> | <read> | <print> | <assignment> | null [ ; ]

which says that any statement can be optionally terminated by a semicolon and we can also have an empty statement. And the code to implement this is the following three lines in parseStatement:

    ...
    Kwd.identifier -> parseAssignment()
    Kwd.semiColon -> inp.match()    // semicolons are simply ignored
    else -> inp.expected("valid keyword, semicolon or identifier")
}

The changes here are: (a) we only trigger parseAssignment when we actually see an identifier (and not in the else clause as before), (b) when we see a semicolon we recognise it and ignore it and (c) we catch any invalid tokens in the else clause. And that was it!

Now, if you want to make semicolons mandatory, you will need to add the line

    inp.match(Kwd.semiColon)

at the end of the parser functions for break, return, read, print and assignment. In that case, you may also want to add them to the parsers for ProgHeader, VarDecl, and ProgEnd for consistency. I’ll let you decide whether you want to make semicolons mandatory in your own version of TINSEL or not.

Comments

Some people are against comments, some are in favour. Some say the code speaks for itself, some see the need of a comment in certain cases to explain what is happening and why, as no matter how “clean” the code is, every now and then there will be an obscure point that needs a comment to clarify. I have decided to include comments in TINSEL so that you can use them if you want to.

The processing of comments can be as easy or as complicated as we make it.

Here’s my spec: I want both block comments like /* comment here... */ and also inline comments like this // comment inline.

I will also add a little twist: I will define an extra comment token that will transfer those comments from the source code to the output code. For this I will use the tokens /*-- and //--

<block-comment> ::= /* <comment> */ | /*-- <comment> */
<inline-comment> ::= // <comment> <newline> | //-- <comment> <newline>

With these specs in mind, the next question is: where is the best place to process comments?

Jack has put his skipComment function as low in the scanner as possible, inside skipWhite. And this is an approach that served him well, especially in the case where he was working with single-character tokens. I tried this in my code but I had two issues with it: (a) it resulted in a very complicated function to detect the start of a comment (and the kind of the comment) – 27 lines – the reason for this is that by working inside skipWhite, I had to go back to string comparisons and could not use the power of scan that translates the character sequences to tokens, and (b) the comments that are transferred to the output code are printed in the wrong place (I will explain this in more detail shortly). So based on this, I abandoned the idea. If you want to see how I implemented it, I have saved it in the file InputProgramScanner_1.kt in the GitHub repository.

Another idea would be to process comments as high as possible, in the parser. That would mean a parseComment function that would have to be triggered when a start-of-comment token was returned by scan (or actually match). However, that parse function would have to be called in all the other parser functions where a comment could be seen and this is too many places. Also, we would have another issue in case of an in-line comment that would have to be transferred to the output – our parseComment would not know where to stop in that case, as the newline is not visible at the parser level (it is skipped in skipWhite). To resolve this we would have to treat newline as another token and have special logic in scan to decide when to skip it and when to return it as the next token. For these reasons I abandoned this idea too.

So, I have decided to go for a solution in the middle – process comments in match. This is the place where we match each token and then we advance to the next one (including skipping white spaces in scan), so naturally this would be the right place to process any comments if the next token is the start of one. Normally we would want to process the comments in the beginning of match so that any comments would be processed (and printed in the right place) and then nextToken would be advanced to the next actionable token to be matched. And this would be a good approach apart from one issue: match is not the only place where we check the next token. We also check it via lookahead, which returns the next token without advancing the cursor. So we have a discrepancy here where match would point to the next actionable token skipping any comments, while lookahead would point to the next token even if that was a comment. To fix this we would have to process the comments in lookahead as well, but that would mean that lookahead would also advance the cursor and this would cause another discrepancy, as lookahead is supposed to not move the cursor.

It looks like the processing of the comments has to be done at the end of match. And this makes sense based on the logic of my code, given that this is where the previous token has been matched, we have called scan, skipped white spaces and advanced to the next token. In my mind, this is the right place to check for any comments and process them, so that nextToken will point to the next actionable token. We still have one issue though. Imagine this scenario:

program prog1
var a
//-- function to calculate square
fun square() {
...

Here we are processing the var declaration, we have matched “var” and are calling match to get the variable called “a”. This call, after matching “a” would process (and print) the comment that follows and then advance to the fun token, ready for the next match. Given that the assembly code is transferred to the output after match returns, while the comment would be printed from inside match, the assembly code would look like this:

# program prog1
.data
.align 8
#
# function to calculate square 
	a:    .quad 0
.text
.align 8
.global _start
square: 
...

As you can see the comment is transferred to the output code in the wrong place – it is printed just before the declaration of var a. There is an easy way to fix it. If we have a printable comment, we don’t actually print it, but keep it in a variable instead. We then check the contents of that variable in the beginning of match and print its contents if any. This way each comment that has been processed is printed in the output code in the beginning of the next match cycle, which is where it should be.

So, here’s the new function getComment, which branches into two functions that process block and inline comments respectively.

/** get a comment */
private fun getComment() {
    while (nextToken.type == TokType.commentStart)
        when (nextToken.encToken) {
            Kwd.blockComment -> getCommentBlock()
            Kwd.blockCommentOut -> getCommentBlock(true)
            Kwd.inlineComment -> getCommentInline()
            Kwd.inlineCommentOut -> getCommentInline(true)
            else -> expected("start of comment")
        }
}

/** get a block comment */
private fun getCommentBlock(printToOut: Boolean = false) {
    var localCommentString = code.COMMENT
    val endComment: String = decodeToken(Kwd.commentEnd)
    while (!inputProgram.substring(cursor).startsWith(endComment) && nextChar != endOfInput) {
        localCommentString += nextChar
        if (nextChar == '\n')
            localCommentString += code.COMMENT
        getNextChar()
    }
    localCommentString += '\n'
    nextToken = scan()      // nextToken now points to endComment or endOfInput
    if (nextToken.encToken == Kwd.endOfInput)
        expected(endComment)
    nextToken = scan()      // nextToken now points to the next token after the comment
    if (printToOut)
        commentString += localCommentString
}

/** get an in-line comment */
private fun getCommentInline(printToOut: Boolean = false) {
    var localCommentString = code.COMMENT
    while (nextChar != '\n' && nextChar != endOfInput) {
        localCommentString += nextChar
        getNextChar()
    }
    localCommentString += '\n'
    nextToken = scan()
    if (printToOut)
        commentString += localCommentString
}

getCommentBlock reads the input until it finds the end-of-commend token, while getCommentInline reads the input until it gets a newline. Both transfer the comment to the commentString variable which is printed in the beginning of match in the next call. This is the modified match function:

fun match(keyWord: Kwd = Kwd.any): Token {
    printComment()  // any comments found in the previous call must be printed in the output code now
    if (keyWord != Kwd.any && nextToken.encToken != keyWord)    // check keyword to match
        expected(decodeToken(keyWord))
    val thisToken = nextToken
    nextToken = scan()  // advance to next token
    getComment()    // process any comments
    return thisToken
}

private fun printComment() {
    if (commentString != "") {
        code.outputCode(commentString)
        commentString = ""
    }
}

The only thing to add is that any comments before our first token (which is program if you remember) will not be processed here. To fix this we just need to add one more call to getComment at the end of the init section of our scanner. This will take care of any comments before the program... Also, please note, this version does not support nested comments.

I did not expect the processing of comments to be so complex, but it is there now and it’s working. As I spent quite some time on the scanner in this chapter, I have taken the opportunity and improved the scan function, which is the brains and engine of our scanner. I have broken it down to smaller functions and made it easier to read (full details in the GitHub repository):

private fun scan(): Token {
    skipWhite()
    if (checkEndofInput())
        return Token(END_OF_INPUT, Kwd.endOfInput, TokType.none)
    if (checkNumeric())
        return Token(getNumber(), Kwd.number, TokType.none)
    if (checkAlpha())
        return keywordOrIdentifier(getName())
    if (checkSpecialToken())
        return getSpecialToken()
    return getInvalidToken()
}

And in the spirit of tidying up, I have also removed the field info from the Token class. It has not been used and I don’t think it will be needed at all. With these enhancements in place, you can rewrite the TINSEL program from chapter X like this:

/*--
 My First TINSEL program
 demonstrates the basic capabilities of the language
 Accepts an integer n as input
 and prints the squares of all the integers from 1 to n
 */
program tinselExample

//-- variables declarations
var i = 0, n

/*--
 calculate the square of i
 only for positive i
 */
fun i_square() {
    if (i <= 0) { return -1 }
    return i * i;
}

// main entry point
main {
    read n;
    //-- main loop
    repeat {
        i = i + 1
        print i_square();
    } until (i >= n)
}

endprogram

As I said earlier in this chapter, it’s up to you how much you want to use comments in your code and whether you want to use semicolons and when.

And as always, you can find the code in my GitHub repository.

Coming up next: Improving our Error Checking

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s