Chapter XIV: Strings – Hello, World!

Based on the original series “Let’s Build a Compiler!” by Jack Crenshaw.

Introduction

In the last chapter I included an example TINSEL program that demonstrated the use of nested for loops. Those of you who tried it on your Linux box, I’m sure you are convinced it works. However, it brought up another limitation of the current version of TINSEL. The lack of string handling, which made the printout of that program look like an awkward series of numbers that we had to try hard to follow. Not to mention that we have no way to print a prompt to the user when the program is waiting for input for the variable n. To rectify these limitations I will introduce strings and the associated handling in this chapter. I consider this a major enhancement, given that until now TINSEL supported only integers, hence it will give us TINSEL v2.0. There will be significant changes across many parts of the compiler, so please bear with me.

Defining Different Types

Until now everything in TINSEL was of type integer and this made things very simple in terms of declaring variables, processing expressions, executing numerical operations, calling functions, retrieving their return value, etc. The introduction of Strings means that we need to support Types. So, let’s start with this part. A new file DataTypesAndDecl.kt is created where all the necessary structures to support the handling of various types will go:

The following enumeration is first declared; it defines the different types supported in TINSEL v2.0:

enum class DataType { int, string, void, none }

The class that defines the identifiers (variables/functions) and the map that actually holds the mapping of all these identifiers also go there:

/** the declaration space (variables and functions) */
class IdentifierDecl(var fv: TokType, var type: DataType, var initialised: Boolean = false, var size: Int = 0, var stackVar: Boolean = false, var stackOffset: Int = 0, var canAssign: Boolean = true)

// the identifiers space map
val identifiersMap = mutableMapOf<String,IdentifierDecl>()

The declareVar and declareFun functions that produce the assembly code for a variable or a function also go there, as well as two new functions, getType and getCanAssign that return the type of an identifier and whether it can be assigned a new value (if you remember from the previous chapter, the control variable in a for loop cannot be assigned a value in an assignment statement).

Variable Declarations

The next thing that has to change is how we define our variables. We now have to tell our compiler what type each variable is. The Integer variables must now be defined like this:

var i: int

and may optionally be followed by an initialisation clause:

var i: int = -1

As you may remember form previous chapters, TINSEL uses 64-bit integers, so each integer variable declaration will allocate 8 bytes in the assembly output for that variable.

Let’s get to Strings now. I have decided to deal with strings in the same way as C does, i.e. treat a string as a pointer in memory where the contents of the string is. Also, same as in C, strings in TINSEL will be null-terminated. With this in mind, when we declare a sting variable, we must allocate the necessary space in memory where its value will be stored. This is done in one of two ways:

– initialise the string with a specific value:

var s1: string = "1234567890"

This will allocate 11 bytes in memory containing the above 10 characters plus the null terminator, while s1 will be pointing to the first of those bytes.

– or, allocate a number of bytes in memory

var s2: string(20)

This will allocate 20 bytes in memory with s2 pointing to the first of those bytes.

Please note that these memory blocks allocated for the string variables are fixed size and cannot grow. This is important when we get to string expressions.

Function Declarations

Similarly to strings, our function declarations must now include a type:

fun square(): int
fun alower(): string

This type declaration will be checked later when we parse the return statement for each function, to ensure it returns the correct type (more on this later in this chapter).

Expressions

And now it’s time for the heavy lifting: we need to change the way we process expressions in order to support processing of strings.

First, let me summarise how string expressions will look like:

adding strings s1 + s2 will result in an operation same as the C strcat(s1,s2) i.e. will append the contents of the memory area where s2 is pointing to the end of the memory area where s1 is pointing (the end of the string being the position of the null character)

comparing strings s1 == s2 will result in an operation similar to the C strcmp(s1,s2) that will return 1 if the strings are equal and 0 if not.

comparing strings s1 != s2 will result in the same operation as above but with exactly the opposite result.

assigning a string variable s = "abcd" or s = s1 + s2 will result in an operation same as the C strcpy(s, “abcd”) or strcpy(s, s1) (after s2 has been concatenated at the end of s1).

And these are the only string operations that TINSEL 2.0 will support for now.

Please remember the point about declaring string variables: each string variable is allocated a fixed memory space at declaration time, so it’s important to ensure this space is adequate to hold the result of the strcat or the strcpy, otherwise other variables will be corrupted or the program will crash (exactly the same as in C).

Now it’s time to think how we deal with expressions that can potentially contain different types of variables or constants. We need to be able to validate whether an operation is valid of that specific type of variable or whether the two sides of an operation are valid for that operation. E.g. we can compare a string to a string and assign the result to an integer variable (remember TINSEL treats boolean vars as integers – same as C), we can obviously add or multiply an integer with an integer, we can add a string to a string as mentioned above, but we cannot multiply a string with a string or a string with an integer, we cannot do boolean xor between a string and a string, etc.

To achieve this level of control we need to change how the various parse functions deal with the expressions, the terms and the factors. Starting from the bottom, we have already the identifiersMap that contains the type of each identifier (int or string). In addition, when scan finds a quote (the start of a string), it will call the new function getString, that will return the string literal as the next token:

private function scan(): Token {
    ....    
    if (checkQuote())
        return Token(getString(), Kwd.string, TokType.none)
    ....

/** get a string literal */
private fun getString(): String {
    var value = ""
    if (nextChar != '"')
        return value
    getNextChar()
    while (nextChar != '"') {
        value += nextChar.toString()
        getNextChar()
    }
    getNextChar()
    return value
}

So now, when we get a string literal or a number, we can tell what the type is by looking at the type of the Token.

Working our way up, parseFactor now checks the type of the next token and calls parseNumber or parseStringiteral respectively.

fun parseFactor(): DataType {
    when (inp.lookahead().encToken) {
        Kwd.leftParen -> return parseParenExpression()
        Kwd.identifier -> return parseIdentifier()
        Kwd.number -> return parseNumber()
        Kwd.string -> return parseStringLiteral()
        else -> inp.expected("valid factor (expression, number or string)")
    }
    return DataType.void    // dummy instruction
}

In addition, it now returns the type of the factor as you can see above (int or string). This will be used when it comes to validating whether a specific type or type combination is allowed in a specific operation.

The parseNumber function is what we used to have

fun parseNumber(): DataType {
    code.setAccumulator(inp.match(Kwd.number).value)
    return DataType.int
}

The parseStringLiteral however, is a bit more complicated:

fun parseStringLiteral(): DataType {
    val stringValue = inp.match(Kwd.string).value
    // check if this string exists already in our map of constant strings and add it if not
    var stringAddress = ""
    stringConstants.forEach { (k, v) -> if (v == stringValue) stringAddress = k }
    if (stringAddress == "") {  // if not found
        // save the string in the map of constant strings
        stringAddress = STRING_CONST_PREFIX + (++stringCnstIndx).toString()
        stringConstants[stringAddress] = stringValue
    }
    code.getStringAddress(stringAddress)
    return DataType.string
}

Apart from producing the necessary assembly code, to load the Accumulator with the address of the string literal, it has to find some place to store that string literal. For this purpose it builds a table of all the string literals (stringConstants) which at the end of the compilation is added to the end of the assembly output in a separate .data section. As you can see above, if the same string literal is found multiple times, it is stored in the table only once and is reused.

Similarly, parseTerm has been modified to get the type of the first factor and pass it on as a parameter to multiply or divide. It also returns the type to the caller:

fun parseTerm(): DataType {
    val typeF1 = parseSignedFactor()
    while (inp.lookahead().type == TokType.mulOps) {
        if (typeF1 == DataType.string)
            code.saveString()
        else
            code.saveAccumulator()
        when (inp.lookahead().encToken) {
            Kwd.mulOp -> multiply(typeF1)
            Kwd.divOp -> divide(typeF1)
            else -> inp.expected("multiply or divide operator")
        }
    }
    return typeF1
}

And in the same way, parseExpression has been modified to get the type of the first term and pass it on as a parameter to add or subtract and, of course, to return it to the caller:

fun parseExpression(): DataType {
    val typeT1 = parseTerm()
    while (inp.lookahead().type == TokType.addOps) {
        if (typeT1 == DataType.string)
            code.saveString()
        else
            code.saveAccumulator()
        when (inp.lookahead().encToken) {
            Kwd.addOp -> add(typeT1)
            Kwd.subOp -> subtract(typeT1)
            else -> inp.expected("add or subtract operator")
        }
    }
    return typeT1
}

So, now, parseAssignment knows what it’s dealing with and can make a decision whether a string literal can be assigned to an integer variable or not:

fun parseAssignment() {
    val identName: String = inp.match(Kwd.identifier).value
    checkCanAssign(identName)
    val typeVar = getType(identName)
    inp.match(Kwd.equalsOp)
    val typeExp = parseBooleanExpression()
    checkOperandTypeCompatibility(typeVar, typeExp, ASSIGN)
    when (typeVar) {
        DataType.int -> parseNumAssignment(identName)
        DataType.string -> parseStringAssignment(identName)
        else -> {}
    }
}

You can see that it now calls parseNumAssignment or parseStringAssignment depending on the type of the variable being assigned. The former is simply what we used to have as assignment:

fun parseNumAssignment(varName: String) {
    code.assignment(varName)
}

However, the latter, even though still one line at the parser level, is a bit more complex when it comes to the code we generate as it involves a call to our own strpcy_ (see above).

fun parseStringAssignment(identName: String) {
    code.assignmentString(identName)
}

fun assignmentString(identifier: String) {
    outputCodeTab("movq\t%rax, %rsi\t\t")
    outputCommentNl("assign string - strcpy_(identifier, %rax)")
    outputCodeTabNl("lea\t${identifier}(%rip), %rdi")
    outputCodeTabNl("call\tstrcpy_")
}

The key change in our upgraded parseAssignemnt is the call to checkOperandTypeCompatibility with the two types and the operation as parameters. This is where the decision is made whether these two types are allowed in this operation. And this is how it works:

At the end of the new file (DataTypesAndDecl) you will see a map that defines the allowed combinations for each operation

val typesCompatibility = mapOf(
    // int with int allowed for all operations
    TypesAndOpsCombi(DataType.int, DataType.int, ALL_OPS) to true,
    TypesAndOpsCombi(DataType.int, DataType.none, ALL_OPS) to true,
    // string with string allowed only for assign, add, compare_eq compare_ne
    TypesAndOpsCombi(DataType.string, DataType.string, ASSIGN) to true,
    TypesAndOpsCombi(DataType.string, DataType.string, ADD) to true,
    TypesAndOpsCombi(DataType.string, DataType.string, COMPARE_EQ) to true,
    TypesAndOpsCombi(DataType.string, DataType.string, COMPARE_NE) to true,
    TypesAndOpsCombi(DataType.string, DataType.none, PRINT) to true,
    // all other combinations forbidden unless set here
)

In order to make this easier to manage, we have a “wildcard” operation, ALL_OPS. So, the first two lines say that all binary operations between integers are allowed and all unary operations on an integer are allowed. The following five lines say that we can assign a string to a string variable, we can add two strings, we can compare (equal or not-equal) two strings and we can print a string. No other combination is allowed. With this definition, the logic in checkOperandTypeCompatibility is very simple:

fun checkOperandTypeCompatibility(t1: DataType, t2: DataType, operation: String) {
    var typesAreCompatible = typesCompatibility[TypesAndOpsCombi(t1, t2, operation)] ?: false
    if (!typesAreCompatible)
        typesAreCompatible = typesCompatibility[TypesAndOpsCombi(t1, t2, ALL_OPS)] ?: false
    if (!typesAreCompatible) {
        var message = "line ${inp.currentLineNumber}: $operation $t1 "
        if (t2 != DataType.none)
            message += "with $t2 "
        message += "not supported"
        abort(message)
    }
}

This function is now called in every parse function that performs a binary or unary operation to decide whether the type(s) is/are compatible. If they are not, a compilation error is thrown.

Operations

Given that the execution of an operation is now different, depending on the type(s) involved, we have separate functions for operations on integers and on strings (which are called after checking the type compatibility):

fun add(typeT1: DataType) {
    inp.match()
    val typeT2 = parseTerm()
    checkOperandTypeCompatibility(typeT1, typeT2, ADD)
    when (typeT1) {
        DataType.int -> addNumber()
        DataType.string -> addString()
        else -> {}
    }
}

The same goes for subtract, multiply, and all the other operations.

These functions are now placed in two separate files OpsNumeric.kt and OpsStrings.kt.

Boolean Expressions

Similarly to the Numerical Expressions, our Boolean expressions module has to be changed to support strings. As mentioned early in this chapter, strings can only be used with the relational operators == and != and this is defined in the typesCompatibilityMap.

Our parseBooleanExpression, parseBooleanTerm, parseNotFactor, parseBooleanFactor and parseRelation have been changed to return the type of the entry they processed. Also, all the boolean operations functions are now called with a parameter that’s telling them what the type of the first operand was. This way they can check the type compatibility after retrieving the type of the second operand.

Notably, parseRelation, parseEquals and parseNotEquals have had their logic changed as this is where strings are getting into the picture:

fun parseRelation(): DataType {
    val typeE1 = parseExpression()
    if (inp.lookahead().type == TokType.relOps) {
        if (typeE1 == DataType.string)
            code.saveString()
        else
            code.saveAccumulator()
        when (inp.lookahead().encToken) {
            Kwd.isEqual -> parseEquals(typeE1)
            Kwd.isNotEqual -> parseNotEquals(typeE1)
            Kwd.isLess -> parseLess(typeE1)
            Kwd.isLessOrEq -> parseLessEqual(typeE1)
            Kwd.isGreater -> parseGreater(typeE1)
            Kwd.isGreaterOrEq -> parseGreaterEqual(typeE1)
            else -> inp.expected("relational operator")
        }
        return DataType.int
    }
    else
        return typeE1
}

fun parseEquals(typeE1: DataType) {
    inp.match()
    val typeE2 = parseExpression()
    checkOperandTypeCompatibility(typeE1, typeE2, COMPARE_EQ)
    when (typeE1) {
        DataType.int -> code.compareEquals()
        DataType.string -> code.compareStringEquals()
        else -> {}
    }
}

These changes are consistent with the changes in the Numeric expressions parsing, where different routines have to be called to execute the operation depending on whether it’s integer or string operation.

Return from Function

Given that our functions have to return a specific type, we need to check whether the return statement is returning the correct type. This is done quite easily with the below code:

fun parseReturn() {
    inp.match()
    if (labelPrefix == MAIN_BLOCK)
        abort("line ${inp.currentLineNumber}: return is not allowed in [main]")
    hasReturn = true       // set the return flag for this function
    val funType = getType(funName)
    if (funType != DataType.void) {
        val expType = parseExpression()
        if (expType != funType)
            abort("line ${inp.currentLineNumber}: $funType function cannot return $expType")
    }
    code.returnFromCall()
}

I’ve also taken the opportunity to simplify the logic that checks whether a function has a return statement in its block. If you remember, the initial implementation (Chapter 12) was quite complex: each block parsing function was returning a boolean value to the caller, which indicated whether it found a return or not, and when that value reached the top of the recursion, which is the parseBlock called by the parseFunctionBlock, it was checked and generated an error if that function had no return statement.

Even though that logic worked, it was unnecessarily complicated, because we can only declare a function at the top level only – we cannot declare a function within a function. This means that a global variable hasReturn, which is reset in the beginning of parseFunctionBlock, set in parseReturn (see code above) and checked at the end of parseFunctionBlock, can do the job in a much simpler way.

fun parseFunctionBlock() {
    hasReturn = false
    parseBlock()
    if (!hasReturn)
        abort("line ${inp.currentLineNumber}: function $funName has no ${inp.decodeToken(Kwd.retToken)}")
}

Code Generator

There is one more change that was needed in our code generator this time, in order to support string operations (+, ==, !=).

Let’s go back to how we calculate this expression: A + ( B + C )

We set the accumulator to the value of A and then push it to the stack. Then, we set the accumulator to the value of B and, given that we have parenthesis, push it also to the stack. Then, we set the accumulator to the value of C, pop the top of stack and add it to the accumulator (this gives us B+C). Finally, we pop the top of stack again and add it to the accumulator (and this gives us A + (B+C) ).

The above works well with numbers. In case of strings we will have to do the same – right? But, what is the value of A, B or C if these are strings? It’s the address of each string that points to the memory area where the actual contents of the string is stored. So, we are pushing the address of the string to the stack, but what about its contents? Technically we could push it to the stack too and that would have been fully consistent with the handling of integers. However, this would introduce unnecessary complexity that has to do with the fact that the strings are variable-length plus they are made of bytes, which means that a string is most likely not an exact multiple of a 64 or a 32-bit word. Even though we could use padding to make a string multiple of a 64 or 32-bit word and use the null terminator to know that we have reached the end of the string to stop popping from the stack, this is totally unnecessary for the simple reason, because A + (B + C) is the same as A + B + C.

The answer is that we will not allow parenthesis in strings expressions (try it and see the compiler error generated). This means that we don’t really need a stack to store the value of all the strings involved in the expression. We just need a temporary buffer to store the contents of the last string processed in the expression and retrieve it to add it (strcat_) to the next one and store the new result to the buffer again. Please note, the value of the accumulator (string address) is still stored in the stack to keep things simple. You can see how this logic works in the saveString function which is called when a Term is processed:

fun saveString() {
    outputCodeTab("movq\t%rax, %rsi\t\t")
    outputCommentNl("save string - strcpy_(string_buffer, %rax)")
    outputCodeTabNl("lea\t$STRING_BUFFER(%rip), %rdi")
    outputCodeTabNl("call\tstrcpy_")
    outputCodeTabNl("pushq\t%rax")
    includeStringBuffer = true
}

The buffer is a dedicated memory space called string_buffer_ that has a default length of 1024 bytes. This means that the maximum length of any string in our program can be 1023 characters (+1 for the null terminator). Any string operation that resulted in a string longer than that would most likely crash the program or, best case, would overwrite other variables. If however you need a string of higher length, you can simply call the compiler with this option:

-maxstring nnnnn

This will allocate a string_buffer of nnnnn bytes to prevent overflow and crash of the program.

If you write a program that involves string operations you will see the string_buffer space allocated at the end of the compiler output:

...

.data
   .align 8
# buffer for string operations - max str length limit
   string_buffer_:    .space 1024

Runtime Library

Finally, the TINSEL runtime library has also been enhanced to support Version 2.0. The following functions have been added:

strcpy_
strcat_
strlen_
streq_

These functions are not directly visible to the TINSEL program. They are called from the assembly program that the compiler produces.

You will need to download the files string_utils.s and readwrite.s from the directory libtinel/src in the repository and assemble them using the GNU assembler so that they can be linked to your TINSEL output program (see below and also Chapter 10).

And with that, we have completed the implementation of the first version of string support in TINSEL.

This chapter was quite extensive and this reflects the extent of the changes in our compiler. I have not described every little change in every detail here, because that would double the size of this chapter. I have described the concept though and I hope you can follow the changes by looking at the code.

We have made a significant improvement to our TINSEL language and can now, finally, write the “hello world” program! Here’s a slightly more “complex” version of it:

program hello_world
var s: string(20), s1: string(20), s2: string(20)
main {
    s1 = "Hello"
    print "please enter your name (or Enter to continue): "
    read s2
    if (s2 == "") { s2 = "World" }
    s = s1 + ", " + s2
    println s, "!"
}
endprogram

You can also try a more complex example here: diamond.tnsl

Remember, to run your own TINSEL programs, you need this compiler and a Linux computer. You will need to download the source libraries from the repository, which you will have to assemble and link with your TINSEL program as described at the end of Chapter 10 – Running our Code: Let’s Build a Compiler – Chapter X, TINSEL for x86-64

And as always, you can find the code in my GitHub repository.

Coming up next: Functions with Parameters

Let's Write Some Code

…and Keep it Simple

Chapter XIV: Strings – Hello, World!

Introduction

Defining Different Types

Variable Declarations

Function Declarations

Expressions

Operations

Boolean Expressions

Return from Function

Code Generator

Runtime Library

Leave a comment Cancel reply

Introduction

Defining Different Types

Variable Declarations

Function Declarations

Expressions

Operations

Boolean Expressions

Return from Function

Code Generator

Runtime Library

Share this:

Related

Leave a comment Cancel reply