Computer Science CPU Project programming

Writing the Compiler for the UniCPU Project

Our compiler is open-source, you can get the code here on GitHub.

Introduction – project background and groundwork

It’s been a little while since I’ve blogged about the UniCPU team project – which could be perceived as meaning that we’ve dropped the project.  Don’t worry, we haven’t!

Since my last post, we’ve (we = Andrei) continued to make progress on the hardware. Most of the recent work is just wiring up the various units to the multiplexer and wiring up the ROM, all of which is very simple and time consuming. Certainly not worthy of a video.

Whilst Andrei was making progress on the hardware, I got started on writing the compiler (don’t worry, if you don’t know what an compiler is, it’s explained below) for the instruction set we’re supporting. This came at a really timely moment, since I’d just finished studying a module on compilers at University (SYAC, at York), so I had the chance to put the theory into practise – but in a simplified form (since, in my view at-least, there’s no point wasting 10’s of hours writing a super flexible compiler when a simple one will do more than we need).

Before we get into a discussion on the actual compiler, there’s an important question: what is a compiler? Put simply, a compiler is a program that transforms code written in one language to code in another language. It’s typically the thing that transforms code in a high-level language (for example, Java or C) to machine code (a language that can be understood by the CPU in your computer).

STAGE 0 – Background: our language

The UniCPU language is one the simplest languages – it’s simply an instruction set for the CPU with only one feature – code labels (for easy jumping between sections). Code in the UniCPU language takes the form of an instruction/opcode, and then if an operand (parameter) is necessary for the instruction, a space followed by an operand:


For example, if I wanted to write a program that added the number 5 to a value input on the input board (think of it as the keyboard), I’d write:

INPUT // Load value from the input panel into the accumulator
ADD 5 // Add 5 to the value in the accumulator.

( // Represents comments.  Anything following “//” is ignored by the compiler until the line end).

Truncated Instruction Set Dictionary
Truncated Instruction Set Dictionary

If you remember from our earlier posts, everything within our CPU is represented as single bytes (i.e. an instruction is 8 bits, all operands/values are 8 bits). This means that our instruction memory is always expecting to receive two bytes per instruction – one for opcode and one for the operand – which will be nulled to 00000000 if the opcode doesn’t require a parameter (for example, INPUT). This means we need to have a binary representation for each opcode – which we have. A truncated representation of this is displayed on the right, but you can view the full logic behind this in a spreadsheet located here.

This means that the desired representation of the above program is:

00000000 00000000
00000100 00000101

So, first we put the opcode for INPUT in, then the null operand (since it doesn’t need a parameter). Then we put the opcode in for ADD, followed by the binary representation of the parameter (5 = 0b101).  Giving us the above block of binary.

Another feature of most programming language is the ability to jump from one section of code to another without having to use line numbers. This is because using line numbers is incredibly difficult to maintain – say your application jumps back to line 17 regularly, and then you add an extra instruction before line 17, you have to change every reference to line 17 to line 18 – a massive waste of programmer time). In the UniCPU language, this is implemented using labels. You insert a label in your code, and then you can jump to wherever that label is defined. In the below program, the path of execution is lines 1,2,3,4, followed by getting stuck in a loop through lines 2,3,4 for ever, constantly adding 8 to the accumulator.

For example:

load 0
add 8
add 8

The important question is: how do you write an application capable of doing this – turning one code in one language into code written in another language, and support things like jumps? First you need to select a language to write your compiler language in – independent of the language you are compiling from/to. Because a large part of compilation is string manipulation (i.e. looking at input text, splitting it into tokens, looking strings up in dictionaries etc) we need a language that makes string manipulation simple. Our compiler is quite simple (it’s simple as compilers go anyway) so I selected Python. It’s not the most efficient language, nor is it the language with the most explicit syntax – but it does make splitting Strings easy.

Whilst I’m explaining these steps, it might be helpful to view the code.  It’s available here on GitHub. The following description is more an over-arching theory of what happens in the compiler: do bear in mind that I haven’t tried to explain some parts that I’ve deemed ‘out of scope’ (for example converting base-10 to binary, some error handling). See the code for that ūüôā

Stage 1 – Preparing and CLEANING input

Of course, the first step is to do your initial setup for your compiler.  In the UniCPU compiler, this involves initialising the compilerDictionary with the binary for each instruction, some helper lists for categories of Opcodes and the dictionary which will hold line numbers of code labels. We then call setupAndStart(), where we load in the input code and split it into a list of Strings, where each String is a line of code.

The next preparation step is to remove blank lines, or lines that only contain comments. This is done with a call to removeBlank(lines), as below:

def removeComment(instruction):
	''' str -> str '''
	return instruction.split("//")[0]

def removeBlank(lines):
	numberRemoved = 0
	for x in range(0, len(lines)):
		if (removeComment(lines[x-numberRemoved]) == ''):
			numberRemoved += 1

Here, we check if a line is empty after removing any comments (which is checked by checking if, after splitting the string by the comment symbol “//”, there is any text in the first block returned from the split call). If there is, there is code on that line and it’s not blank. Otherwise, it is blank or only had comments on it so we delete the line from lines. It’s important to remove blank lines before working on labels/jumping/looping as this is dependent on line numbers within the code, and by removing lines we change the line number of some lines.

STAGE 2 – compiling individual lines

During compilation, we keep a list of 2-item lists – each 2-item list represents a line of code with two operands. This is because, in this compiler, there are two main stages to compilation, both of which need access to individual tokens (a token is just a representation of 1 byte, whether an instruction opcode or an operand/value) – these are compiling (converting the human-friendly opcodes to binary and converting the base-10 numbers to binary) and linking (replacing labels in JUMP commands (used for looping and iterating) with the final line numbers). At the end of the process, that dictionary is all concatenated into one long binary string – the program.

The compilation stage is really quite simple – in Stage 0 above, I talked about setting up a dictionary mapping instructions onto the binary we defined for them in the hardware. In the compileLine(inputLine, lineNumber) function, we first check the line for whether it contains one token or two. Ignoring LABELs for a moment, we then look at each token separately.

The first token is passed to the compileInstruction(instruction, lineNumber) function where we simply look up the token from the compilerDictionary defined at the top of the programme to obtain the binary representation of that opcode.

def compileInstruction(instruction, lineNumber):
	return compilerDictionary[instruction]

The second token (if there is one) will be the operand for the opcode that we just compiled. We want to convert the operand (a number) to binary, so we pass it to the function compileValue(value, lineNumber) which returns a String of binary digits representing the base-10 number that was input. In the event that there was no operand (there was only one token, a ‘standalone opcode’, such as ‘INPUT’, ‘SHIFTL’) we set the second token to just be 00000000 (a null byte because the value will not be used).

The compileLine(inputLine, lineNumber) function then returns both the compiled Opcode and Operand in a 2-item list, as described above, to be added to the list of compiledLines:

return [compileInstruction(tokens[0], lineNumber), compileValue(int(tokens[1]), lineNumber)]

Stage 3 – compiling jumps and labels

Compiling JUMPs and LABELs was the most complex part of this compiler. On a high level, it is done in two stages:

  1. Locating and listing all of the LABELs into a dictionary called labels
  2. In a separate pass, finding all of the compiled JUMP instructions and then compiling the operands (the LABELs) using the labels dictionary.

Locating and listing all of the LABELs is done during the initial compileLine step in Stage 2. When a LABEL Opcode is found, a record is added to the ‘labels’ dictionary, using the LABEL name as the key. The value for that key is the line number that will be executed when a jump to that label is called, minus the length of the ‘labels’ dictionary. This subtract is made because lines containing a LABEL will be removed (by not being added to the list of compiled lines ‘compiled’), because they don’t actually cause the CPU to execute anything (they’re just a short-cut for the programmer that the compiler than expands).

After Stage 2 is complete and all lines have been compiled, we then execute point 2 in the above: finding the JUMP instructions and compiling the operands using the labels dictionary we just built up. In the code, this is described as ‘Linking’ (this is in the setupAndStart() function, after the compilation stage):

for lineNumber in range(0, len(compiled)):
	if (compiled[lineNumber][0] in jumpOpcodesCompiled):
			compiled[lineNumber][1] = compileValue(labels[compiled[lineNumber][1]], None)
		except KeyError:
			print "ERROR: LABEL: ", compiled[lineNumber][1], " is never defined."

In the above code example, we are checking each line of code in the list of ‘compiled’ lines to see if it contains a JUMP instruction. If it does, we identify the second item in the 2-item list (the operand) and look it up in the ‘labels’ dictionary (the dictionary containing all the LABELs and the corresponding line number they should point to). We then convert convert the base-10 integer to binary and return the compiled line.

Stage 4 – Outputting finished product

Finally – the ‘compiled’ list now contains the whole programme, compiled to binary. We just have to convert each 8-character string of binary characters to a byte, concatenate them together and then write them out as a binary file and it’s ready to go onto the CPU!

The easiest way to do this in Python 2.7 is using the chr(int) function. The chr(int) function takes an integer as a parameter and returns a character according to the ASCII scheme. Since the ASCII system stores characters in 8 bits (the same as our CPU), outputting the characters into a binary ASCII file is all we have to do – that file can then be written to the memory of our CPU, and voila! The CPU will execute our code.

def outputToBinary(compiledCode):
	outputData = ''
	while (len(compiledCode) > 0):
		temp = int(compiledCode[0:8], 2)
		outputData += chr(temp)
		compiledCode = compiledCode[8:]
	f = open(outputFileName, 'wb') #b means binary mode on Windows
	return True

In the above code, we first initialise an empty string which will contain our output data (the compiled code). While there are still bytes to be compiled (that the length of compiledCode is still greater than 0), we isolate the next byte, convert it to an Integer, and then convert it to a binary ASCII character using the chr function. We then append that to the outputData, and remove the byte (8-bits) we just processed from compiledCode (so that the next byte can be processed). When all the bytes have been compiled, we open a file in write mode and write the data to that file.

Our compiler also has the ability to output ‘human-readable’ code, i.e. code that is separated by line and includes a space in-between the opcodes and operands. This is controlled by toggling a boolean, and makes debugging much easier.


Let’s take the following sample program:

input // load the accumulator from the input panel

load 0 // overwrite the value in the accumulator to 0
LABEL AddEight // label this piece of code 'AddEight'
add 8

JUMP AddEight

// this block never runs
sub 7
subcy -7

Next we perform Stage 1: after comments, blank lines and labels are removed (with labels, after their line numbers have been recorded), the code looks like:

load 0
LABEL AddEight
add 8
JUMP AddEight
sub 7
subcy -7

After completing Stage 2: (compiling all operands and opcodes except for JUMP LABELs), the code looks like:

00000001 00000000
00000000 00000000
00100100 00001000
00010000 AddEight
00100101 00000111
00100111 11111001

Now we just fill in the LABELs in the compiled code (Stage 3)

Finally we just have to output the finished Binary. In debug mode, it’s easy to see which byte corresponds to which token in the original code:

00000001 00000000
00000000 00000000
00100100 00001000
00010000 00000010
00100101 00000111
00100111 11111001

less so when it’s all concatenated together:


When compiled to binary ASCII it’s then not human readable or recognisable: you can try opening it in an editor (as I did with the above), which gives:

Compiled Program, opened in Vim
Compiled Program, opened in Vim

That’s it! A long post, but hopefully it was an interesting read!

Apps Computer Science Entrepreneurship

TalkTalk Digital Heroes – I won!

I’m delighted to say that I’ve won the TalkTalk Digital Heroes Next Generation award! After having¬†been nominated early in the year and a month long period of public voting, I was invited to the¬†House of Lords yesterday¬†for the awards ceremony. The ceremony was wonderful, and it was fantastic to meet the winners from other categories in such a special location – the¬†view of The Thames was very special!

Picking up the award at the House of Lords. With: Peter Willis, myself, Rachel Neaman, Jessica Lennard (left-right)
Picking up the award at the House of Lords. With: Peter Willis, myself, Rachel Neaman, Jessica Lennard (left-right)

Below is a video of me that TalkTalk produced for all the category winners.

Computer Science

A Quick Trip to The Sudan

A couple of months ago I saw a post for Kairos Fellows about an all-expenses-paid trip to the Sudan to mentor on a Sudanese TV programme that’s a cross between The Apprentice and Dragons Den.

Free trip, I heard? Count me in!¬†I wasn’t entirely sure it was genuine, that it wasn’t an elaborate plan to lure some British idiots out to Sudan, but a quick phone call to the British Council confirmed it was indeed sponsored by them, so I decided to drop the organiser an email who, after some back and forth, offered me place.

Getting the Visa was incredibly straightforward: a couple of weeks before leaving, we were told to go to the Sudanese Embassy and tell them we were part of The Mashrouy Programme, which certainly speeded things along. We met the ambassador, and 24 hours later our visas were ready!

Six weeks later, having just finished the TechStars demo day that morning, I jumped on an almost entirely empty flight to Khartoum, changing in Jordan. Because of the demo day, I was arriving a couple of days after the start.

My¬†first moments in Sudan were really¬†eye-opening. I’d been told to meet the British Council’s driver, Mr. Khalifa, who would take me to the hotel when I arrived at 2am. The journey, although only a few minutes long, was incredible as Sudanese roads are very different to those in the UK: there’s large rocks on the motorways, and it’s not¬†considered abnormal to do a u-turn¬†on them.¬†I even learnt some Arabic!

The next morning (Wednesday) I met the rest of the team who had come out on the project a couple of days before: Peter McFarlane, Sophie Dundovic and Anthony Catt, who filled me in on the different sessions and the presentations and talks we would be giving. The first event was with the Mashrouy top 100, who were presenting their projects (Mashrouy means ‘My Project’) to the mentors (local mentors and us) at large round tables in the Hotel.

It was there that I had my first experience of working through a translator Рan experience I will never forget! Obviously communicating is much slower and more effort, but as a result you think through your words and the points you wish to make much more carefully and I found it led to a more directed and concentrated conversation.

The Mashrouy 100
The Mashrouy 100

In the following days we spoke to classes from the 100 about our story, our companies and lessons we have learnt from our experience. On the Thursday I spoke to two classes about how I got involved in Entrepreneurship, my startup and what business was like in Europe at a local business school. After each session I was swamped with questions, and then when I left entrepreneurs wanting to tell me about their companies Рit was really surreal!

On the Friday we had the Mashrouy Conference and Panel, a public event for a few hundred local business leaders, the Mashrouy 100,¬†and other young people to come and hear us lecture on a topic of our choice and then ask us questions in a panel discussion. This was another¬†translator affair, but a wonderful experience: whilst what you are saying is being¬†translated you have some time to think about the next couple of minutes of your talk until the next translation. There’s also an amusement aspect as well: there were times when a block would¬†change in length¬†(3 minutes to 30 seconds or 2 minutes to 5 minutes), or the audience laughed¬†when I was¬†sure¬†I¬†didn’t say anything funny! One topic of particular interest was what business was like in the UK and Europe vs Sudan/Africa, which turned into a number of really interesting conversations.

On the Saturday we took a day off to go and see some of Sudan, including the Pyramids of Mero√ę, the old Royal City and a rural village on the Nile. The pyramids were incredible: untouched and unspoilt (there was absolutely no tourist industry), and we were able to explore them on camels. The village on the Nile was very thought-provoking as well: the poverty was heartbreaking – it doesn’t come any worse. But the people were so happy and welcoming: we were welcomed into the village and the central buildings, everybody wanted to be in our photos, and they even took us out on a local fishing boat on some Nile rapids.

On Sunday the British Council held Khartoum’s First Entrepreneurship and Responsible Business Conference. Hundreds of business leaders from across Sudan came to the event to hear lectures varying from insights and deep-dives into existing large responsible companies, different ways in which business can be conducted¬†responsibly, the idea of Social Impact business and more. I spoke about the Intersection of Social Impact Business and Technology.

Me Speaking at the Khartoum Responsible Business Conference
Me Speaking at the Khartoum Responsible Business Conference

There were a couple more events that I’ve skimmed over but I’ve¬†tried to cover a range of different events vs covering all of the events. All in all, my time in Sudan was fantastic, enlightening, and even a little bit life changing. I met new people as different to myself as possible, was able to both share my experience and help others trying to start or run a business, and had the most amazing time visiting a wonderful and totally different country!

Computer Science Entrepreneurship

It’s been a long time…

From the week after¬†Yacht Hack in September last year, I’ve been pretty full-on busy for the last academic year, focussing on achieving a First (Summa Cum Laude for any Americans reading) (successfully achieved!) and co-founding a new startup¬†called pingWHEN¬†(made all the more difficult by a 4500 mile gap between my co-founder and myself).

Some of my earlier blogs refer to my undertaking a 4-year Masters degree course. This changed during Yacht Hack. I’ve been interested in Entrepreneurship for a long time, running many small businesses from aged about 13. During my first two years at University I forgot all about this passion and was just focussed on software engineering, academia, internships, and hiking mountains. Yacht Hack reminded me that I should be working in the entrepreneurial space, and after a conversation with my now co-founder and business partner, Julie Markham, I decided to drop my masters year, thus leaving myself with just one year of study (the year I have just completed) and a Bachelors degree.

Now that I’m done I’ll be focussing 100% of my energy into growing pingWHEN into a viable business. More posts to follow!