Chapter 1: Introduction

Based on the original series “Let’s Build a Compiler!” by Jack Crenshaw.

Introduction

In this Kotlin adaptation of Jack’s original series I will mostly follow the structure that he followed, through which he teaches us in a comprehensive step-by-step tutorial how to design a new programming language and build a working compiler.

I will also follow Jack’s approach and will not publish any elements of compiler theory (I’m not qualified for this anyway). Instead, I will focus on the practical aspects that one needs to know to build a simple, working system.

The original series was written in Turbo Pascal 4.0, which at the time was one of the best software development tools. In this adaptation, the code is written in Kotlin. The IDE used is IntelliJ Community Edition by JetBrains, which comes with a Kotlin (as well as Java) compiler.

As Jack said, this is a “learn-by-doing” series. It is also *one* way of doing things, that works, but of course is not the only or the best way. The reader is encouraged to take this code and explore further options.

The big benefit of following Jack’s approach, is that the reader, once he/she has completed the series, will have good understanding of the internal mechanics of the compiler, as opposed to just copying and pasting a large block of code and using it without understanding of how it works. This way the reader can actually learn and then go off on their own and improve this code.

As this is a quite ambitious project, it cannot be completed in a couple of pages. I will again follow Jack’s approach and publish this work in a series of chapters, as I adapt Jack’s chapters one by one, so please be patient.

One point to note is that even if the final goal of this Blog is to build a compiler for the TI-59, in the initial stages our compiler will be producing Assembly language code for the Motorola 68000 microprocessor, exactly as in the original series. Once the compiler is completed, we will then change the output module (plus some specific bits in the expression scanning logic) to make it produce TI-59 code.

On this note, the Assembly code produced is not the world’s tightest code, but it works and this is where the emphasis of this series is. One of the late chapters however, deals with optimisation.

Finally, please note that everything in this Blog follows the key principle of Keep it Simple, exactly as Jack’s series did. In the interest of simplicity, we will use only single-character tokens and single-digit integers in the early stages of the design, without any embedded white spaces. As Jack wrote, if we can get our compiler to recognise and deal with I-T-L then we can easily get it to do the same with IF-THEN-ELSE.

The parsing of the source code is done in a top-down recursive approach which is a decent, simple and easy to implement methodology for a small project like this.

The error handling is effective (the compiler does not crash) but also simple and the compilation aborts at the first error that it encounters.

The Cradle

Every program requires some boiler plate… I/O routines, error handling, etc. As Jack did, I will keep it to a minimum in order to concentrate on the important stuff. The code below is the minimum that we need to get anything done. It consists of some I/O functions, a couple of error handling functions and a skeleton main function. This will be our Cradle. Please make a copy of these two files, save them, compile them and make sure you can run the main program.

Our compiler must be called with one command line argument, which is the text file containing the program to be compiled. So, please check the various cases:

  • run it without an argument
  • run it with a file name that does not exist as an argument
  • run it with an empty file as an argument
  • finally run it with a text file that contains something (just anything) as an argument

and see how it behaves in each case. This will test the error handling with regard to the input file.

CompilerMain.kt

This is our main program. Please remember to change the package name (and the corresponding directory structure) to something that suits you, or for simplicity, remove it altogether if you are having issues.

The inp variable is an object of the InputProgramScanner class which holds the program to be compiled and performs the lexical scanning (see below). Please note the lateinit modifier, as all Kotlin variables must be normally initialised at the time they are declared.

The functions error, abort and exit are self-explanatory.

The expected function prints a message with what was expected vs what was found in the input stream and aborts the compilation.

The function initCompiler first checks if there is at least one command line argument. It then creates a new object of the InputProgramScanner class and calls getNextChar to setup the lookahead character (see below).

package mpdev.compiler.m68kcompiler

import kotlin.system.exitProcess

/**
 * A Simple Compiler
 * Based on the Let's Build a Compiler! series by Jack Crenshaw
 * This version produces Assembly language code for the 68000 microprocessor
 * A future version will produce code for the TI-59 Calculator
 * It will produce output that uses the TI-59 mnemonics that can be keyed directly to the calculator
 * Version 1.0 01.10.2021
 */

// the input program scanner
lateinit var inp: InputProgramScanner

/** report an error */
fun error(errMsg: String) {
    System.err.println("Error: $errMsg")
}

/** abort compilation */
fun abort(errMsg: String) {
    error(errMsg)
    exitProcess(1)
}

/** print message and exit */
fun exit(msg: String) {
    System.err.println(msg)
    exitProcess(0)
}

/** report what was expected and abort */
fun expected(expMsg: String) = abort("Expected $expMsg \n found [${inp.nextChar}]")

/** output code */
fun outputCode(s: String) = print("\t$s")

/** output code with newline */
fun outputCodeNl(s: String) = outputCode("\t$s\n")

/** compiler initialisation */
fun initCompiler(args: Array<String>) {
    if (args.isEmpty())
        abort("no input file specified")
    // get a new input program scanner object - initialise from input file
    inp = InputProgramScanner(args[0])
    inp.getNextChar()
}

/** main function */
fun main(args: Array<String>) {
    initCompiler(args)
}

InputProgramScanner.kt

This class performs the lexical scanning functions. The key to keeping this simple is the use the lookahead character. This is easily achieved by reading the whole input program into a String variable (inputProgram) and using an index to point to the next character (nextChar) in the input. The function getNextChar advances the index and sets our lookahead character to the next character in the input.

The functions isAlpha and isNumeric are self-explanatory.

The function match checks if the the lookahead character matches a specific character that we expect and aborts the program if not.

The getName function returns the next alphabetic token, which at this stage is effectively the next single-character name, while getNumber returns the next integer, which at this stage is a single digit. Both functions have the necessary error checking.

package mpdev.compiler.m68kcompiler

import java.io.File

/**
 * The input program scanner class
 * Performs the lexical scanner functions
 * Processes the char-by-char input and returns the tokens from the input stream
 */
class InputProgramScanner(inputFile: String = "") {

    // the input program as string
    var inputProgram: String = "";
    var indx = 0;

    // the next character from input
    // this is our lookahead character
    var nextChar: Char = ' '

    /** initialisation code */
    init {
        try {
            val f = File(inputFile)
            // read the whole program into a string
            // add a dummy \n at the end
            // this way the lookahead will work properly when we reach the end of input
            inputProgram = f.readText() + "\n"    
        } catch (e: Exception) {
            abort("could not open file [$inputFile]")
        }
    }

    /** get next token from input */
    fun getNextChar() {
        if (indx < inputProgram.length)
            nextChar = inputProgram[indx++]
        else
            exit("end of input")
    }

    /** match a specific input char */
    fun match(x: Char) {
        // return a match when the next char matches x
        if (nextChar == x)
            getNextChar()
        else
            expected("'$x'")
    }

    /** get an identifier */
    fun getName(): String {
        var token: String = ""
        if (!isAlpha(nextChar))
            expected("Identifier")
        else {
            token = nextChar.uppercase()
            getNextChar()
        }
        return token
    }

    /**
     * get a number
     * <number ::= [ <digit> | decimal_dot ] +
     */
    fun getNumber(): String {
        var value: String = ""
        if (!isNumeric(nextChar)) {
            expected("Number")
        } else {
            value = nextChar.toString()
            getNextChar()
        }
        return value
    }

    /** check for an alpha char */
    fun isAlpha(c: Char): Boolean = c.uppercaseChar() in 'A'..'Z'

    /** check for a numeric char */
    fun isNumeric(c: Char): Boolean = c in '0'..'9'

}

Coming up next: Numerical Expression Parsing

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s