Computer Science CPU Project programming

Writing the Compiler for the UniCPU Project

Our compiler is open-source, you can get the code here on GitHub.

Introduction – project background and groundwork

It’s been a little while since I’ve blogged about the UniCPU team project – which could be perceived as meaning that we’ve dropped the project.  Don’t worry, we haven’t!

Since my last post, we’ve (we = Andrei) continued to make progress on the hardware. Most of the recent work is just wiring up the various units to the multiplexer and wiring up the ROM, all of which is very simple and time consuming. Certainly not worthy of a video.

Whilst Andrei was making progress on the hardware, I got started on writing the compiler (don’t worry, if you don’t know what an compiler is, it’s explained below) for the instruction set we’re supporting. This came at a really timely moment, since I’d just finished studying a module on compilers at University (SYAC, at York), so I had the chance to put the theory into practise – but in a simplified form (since, in my view at-least, there’s no point wasting 10’s of hours writing a super flexible compiler when a simple one will do more than we need).

Before we get into a discussion on the actual compiler, there’s an important question: what is a compiler? Put simply, a compiler is a program that transforms code written in one language to code in another language. It’s typically the thing that transforms code in a high-level language (for example, Java or C) to machine code (a language that can be understood by the CPU in your computer).

STAGE 0 – Background: our language

The UniCPU language is one the simplest languages – it’s simply an instruction set for the CPU with only one feature – code labels (for easy jumping between sections). Code in the UniCPU language takes the form of an instruction/opcode, and then if an operand (parameter) is necessary for the instruction, a space followed by an operand:


For example, if I wanted to write a program that added the number 5 to a value input on the input board (think of it as the keyboard), I’d write:

INPUT // Load value from the input panel into the accumulator
ADD 5 // Add 5 to the value in the accumulator.

( // Represents comments.  Anything following “//” is ignored by the compiler until the line end).

Truncated Instruction Set Dictionary
Truncated Instruction Set Dictionary

If you remember from our earlier posts, everything within our CPU is represented as single bytes (i.e. an instruction is 8 bits, all operands/values are 8 bits). This means that our instruction memory is always expecting to receive two bytes per instruction – one for opcode and one for the operand – which will be nulled to 00000000 if the opcode doesn’t require a parameter (for example, INPUT). This means we need to have a binary representation for each opcode – which we have. A truncated representation of this is displayed on the right, but you can view the full logic behind this in a spreadsheet located here.

This means that the desired representation of the above program is:

00000000 00000000
00000100 00000101

So, first we put the opcode for INPUT in, then the null operand (since it doesn’t need a parameter). Then we put the opcode in for ADD, followed by the binary representation of the parameter (5 = 0b101).  Giving us the above block of binary.

Another feature of most programming language is the ability to jump from one section of code to another without having to use line numbers. This is because using line numbers is incredibly difficult to maintain – say your application jumps back to line 17 regularly, and then you add an extra instruction before line 17, you have to change every reference to line 17 to line 18 – a massive waste of programmer time). In the UniCPU language, this is implemented using labels. You insert a label in your code, and then you can jump to wherever that label is defined. In the below program, the path of execution is lines 1,2,3,4, followed by getting stuck in a loop through lines 2,3,4 for ever, constantly adding 8 to the accumulator.

For example:

load 0
add 8
add 8

The important question is: how do you write an application capable of doing this – turning one code in one language into code written in another language, and support things like jumps? First you need to select a language to write your compiler language in – independent of the language you are compiling from/to. Because a large part of compilation is string manipulation (i.e. looking at input text, splitting it into tokens, looking strings up in dictionaries etc) we need a language that makes string manipulation simple. Our compiler is quite simple (it’s simple as compilers go anyway) so I selected Python. It’s not the most efficient language, nor is it the language with the most explicit syntax – but it does make splitting Strings easy.

Whilst I’m explaining these steps, it might be helpful to view the code.  It’s available here on GitHub. The following description is more an over-arching theory of what happens in the compiler: do bear in mind that I haven’t tried to explain some parts that I’ve deemed ‘out of scope’ (for example converting base-10 to binary, some error handling). See the code for that 🙂

Stage 1 – Preparing and CLEANING input

Of course, the first step is to do your initial setup for your compiler.  In the UniCPU compiler, this involves initialising the compilerDictionary with the binary for each instruction, some helper lists for categories of Opcodes and the dictionary which will hold line numbers of code labels. We then call setupAndStart(), where we load in the input code and split it into a list of Strings, where each String is a line of code.

The next preparation step is to remove blank lines, or lines that only contain comments. This is done with a call to removeBlank(lines), as below:

def removeComment(instruction):
	''' str -> str '''
	return instruction.split("//")[0]

def removeBlank(lines):
	numberRemoved = 0
	for x in range(0, len(lines)):
		if (removeComment(lines[x-numberRemoved]) == ''):
			numberRemoved += 1

Here, we check if a line is empty after removing any comments (which is checked by checking if, after splitting the string by the comment symbol “//”, there is any text in the first block returned from the split call). If there is, there is code on that line and it’s not blank. Otherwise, it is blank or only had comments on it so we delete the line from lines. It’s important to remove blank lines before working on labels/jumping/looping as this is dependent on line numbers within the code, and by removing lines we change the line number of some lines.

STAGE 2 – compiling individual lines

During compilation, we keep a list of 2-item lists – each 2-item list represents a line of code with two operands. This is because, in this compiler, there are two main stages to compilation, both of which need access to individual tokens (a token is just a representation of 1 byte, whether an instruction opcode or an operand/value) – these are compiling (converting the human-friendly opcodes to binary and converting the base-10 numbers to binary) and linking (replacing labels in JUMP commands (used for looping and iterating) with the final line numbers). At the end of the process, that dictionary is all concatenated into one long binary string – the program.

The compilation stage is really quite simple – in Stage 0 above, I talked about setting up a dictionary mapping instructions onto the binary we defined for them in the hardware. In the compileLine(inputLine, lineNumber) function, we first check the line for whether it contains one token or two. Ignoring LABELs for a moment, we then look at each token separately.

The first token is passed to the compileInstruction(instruction, lineNumber) function where we simply look up the token from the compilerDictionary defined at the top of the programme to obtain the binary representation of that opcode.

def compileInstruction(instruction, lineNumber):
	return compilerDictionary[instruction]

The second token (if there is one) will be the operand for the opcode that we just compiled. We want to convert the operand (a number) to binary, so we pass it to the function compileValue(value, lineNumber) which returns a String of binary digits representing the base-10 number that was input. In the event that there was no operand (there was only one token, a ‘standalone opcode’, such as ‘INPUT’, ‘SHIFTL’) we set the second token to just be 00000000 (a null byte because the value will not be used).

The compileLine(inputLine, lineNumber) function then returns both the compiled Opcode and Operand in a 2-item list, as described above, to be added to the list of compiledLines:

return [compileInstruction(tokens[0], lineNumber), compileValue(int(tokens[1]), lineNumber)]

Stage 3 – compiling jumps and labels

Compiling JUMPs and LABELs was the most complex part of this compiler. On a high level, it is done in two stages:

  1. Locating and listing all of the LABELs into a dictionary called labels
  2. In a separate pass, finding all of the compiled JUMP instructions and then compiling the operands (the LABELs) using the labels dictionary.

Locating and listing all of the LABELs is done during the initial compileLine step in Stage 2. When a LABEL Opcode is found, a record is added to the ‘labels’ dictionary, using the LABEL name as the key. The value for that key is the line number that will be executed when a jump to that label is called, minus the length of the ‘labels’ dictionary. This subtract is made because lines containing a LABEL will be removed (by not being added to the list of compiled lines ‘compiled’), because they don’t actually cause the CPU to execute anything (they’re just a short-cut for the programmer that the compiler than expands).

After Stage 2 is complete and all lines have been compiled, we then execute point 2 in the above: finding the JUMP instructions and compiling the operands using the labels dictionary we just built up. In the code, this is described as ‘Linking’ (this is in the setupAndStart() function, after the compilation stage):

for lineNumber in range(0, len(compiled)):
	if (compiled[lineNumber][0] in jumpOpcodesCompiled):
			compiled[lineNumber][1] = compileValue(labels[compiled[lineNumber][1]], None)
		except KeyError:
			print "ERROR: LABEL: ", compiled[lineNumber][1], " is never defined."

In the above code example, we are checking each line of code in the list of ‘compiled’ lines to see if it contains a JUMP instruction. If it does, we identify the second item in the 2-item list (the operand) and look it up in the ‘labels’ dictionary (the dictionary containing all the LABELs and the corresponding line number they should point to). We then convert convert the base-10 integer to binary and return the compiled line.

Stage 4 – Outputting finished product

Finally – the ‘compiled’ list now contains the whole programme, compiled to binary. We just have to convert each 8-character string of binary characters to a byte, concatenate them together and then write them out as a binary file and it’s ready to go onto the CPU!

The easiest way to do this in Python 2.7 is using the chr(int) function. The chr(int) function takes an integer as a parameter and returns a character according to the ASCII scheme. Since the ASCII system stores characters in 8 bits (the same as our CPU), outputting the characters into a binary ASCII file is all we have to do – that file can then be written to the memory of our CPU, and voila! The CPU will execute our code.

def outputToBinary(compiledCode):
	outputData = ''
	while (len(compiledCode) > 0):
		temp = int(compiledCode[0:8], 2)
		outputData += chr(temp)
		compiledCode = compiledCode[8:]
	f = open(outputFileName, 'wb') #b means binary mode on Windows
	return True

In the above code, we first initialise an empty string which will contain our output data (the compiled code). While there are still bytes to be compiled (that the length of compiledCode is still greater than 0), we isolate the next byte, convert it to an Integer, and then convert it to a binary ASCII character using the chr function. We then append that to the outputData, and remove the byte (8-bits) we just processed from compiledCode (so that the next byte can be processed). When all the bytes have been compiled, we open a file in write mode and write the data to that file.

Our compiler also has the ability to output ‘human-readable’ code, i.e. code that is separated by line and includes a space in-between the opcodes and operands. This is controlled by toggling a boolean, and makes debugging much easier.


Let’s take the following sample program:

input // load the accumulator from the input panel

load 0 // overwrite the value in the accumulator to 0
LABEL AddEight // label this piece of code 'AddEight'
add 8

JUMP AddEight

// this block never runs
sub 7
subcy -7

Next we perform Stage 1: after comments, blank lines and labels are removed (with labels, after their line numbers have been recorded), the code looks like:

load 0
LABEL AddEight
add 8
JUMP AddEight
sub 7
subcy -7

After completing Stage 2: (compiling all operands and opcodes except for JUMP LABELs), the code looks like:

00000001 00000000
00000000 00000000
00100100 00001000
00010000 AddEight
00100101 00000111
00100111 11111001

Now we just fill in the LABELs in the compiled code (Stage 3)

Finally we just have to output the finished Binary. In debug mode, it’s easy to see which byte corresponds to which token in the original code:

00000001 00000000
00000000 00000000
00100100 00001000
00010000 00000010
00100101 00000111
00100111 11111001

less so when it’s all concatenated together:


When compiled to binary ASCII it’s then not human readable or recognisable: you can try opening it in an editor (as I did with the above), which gives:

Compiled Program, opened in Vim
Compiled Program, opened in Vim

That’s it! A long post, but hopefully it was an interesting read!