Designing a trivial CPU

What’s this about?

I’m about 45 years into my computer-fondling journey, and I’ve been wondering how these wonderful things actually work. Like, down to the transistor level sort of thing.

In the past, when I’ve tried reading books on the subject (OK, maybe not the right books, but I run with what I can find in Hay-on-Wye), the book chapters go something like this:

  1. This is boolean logic.
  2. This is how you can glue logic together.
  3. Here’s some combinational logic: a half-adder, a multiplexer, and a demuxer.
  4. (If you’re lucky) Here’s a latch, a couple of types of flip-flop, and a register.
  5. This is a high-level description of some existing CPU on the market, and this is how to program it in assembly language.

You might spot that there’s a bit of a gap between chapters 4 and 5. How do you combine registers and the combinational logic into something that looks like a CPU? How do I make an instruction work? How do I make several instructions work? Can I actually build one of these things?

For the last couple of years, I’ve been playing with this stuff and trying to work it out for myself. I’ve designed, on paper, a few processors. Usually involving lots of nights trying to get to sleep, with a brain that’s intent on building circuits or rearranging bit patterns instead of resting. This is the first post in what I hope will be a series documenting a journey of designing a CPU from scratch.

The design should work at several levels of technology – from individual switches to integrated logic to FPGAs. But I’m going to be getting down into the weeds at some point, and talking about the physical hardware (I’m favouring electromechanical relays at this point, but CMOS transistors or 74-series logic might make an appearance later).

A word of caution, though: I’m just some random software-based dude with poor research motivation on the intertubes. There is absolutely no way that what I’m going to talk about is state of the art, or best practice, or even remotely related to sanity. It’s just what I’ve worked out for myself, while playing around.

What am I building?

I’m a’gonna make a CPU. Let’s say, [rolls D12] an 8-bit CPU. That seems perfectly normal and sensible. (An earlier iteration of this had a 6-bit CPU with 12-bit address registers. Let’s just not go there, OK?) It’s going to have 8 bits’ worth of RAM address space, so that’s 256 bytes. Small, but not impossibly so. You won’t be running Linux on it (or even CP/M), but it’ll do for now.

Let’s also add 256 bytes of instructions, separate from the RAM. That makes it a Harvard architecture: instructions are not the same things as data, and are held in two different stores, with disjoint address spaces.

There are some philosophical choices involved in the Harvard architecture that might seem odd to modern eyes. Primarily, the architecture is designed around writing to the data store, with the instruction store being effectively a second-class citizen. It’s harder (or impossible) to modify the code while the code’s running. It’s as if the code’s stored on a completely separate ROM, and you need special instructions to read that ROM (or modify it, if modifying ROM is even possible).

Registers and instruction set

Let’s start making some choices. I want some registers to work with, and some instructions to operate on them. I might also want some way of interacting with the store (RAM). I’m going to ignore interactions with the instruction store at this point – I don’t care about loading code, or self-modifying code. Let’s keep it as simple as possible.

Instructions

So: instructions. What should this architecture do? We probably want to be able to add two things. That’s about as basic as it gets. You can add – can you subtract? Yeah, throw that on the pile. Moving stuff around in registers? Sure, can’t be hard. We’re going to need to get data in and out of RAM. For Turing completeness, we’re probably going to need some kind of conditional (either conditional instructions, like ARM has, or conditional jumps).

What else could you think of to do? Increment and decrement registers, maybe? Bit shifts, left and right. Boolean and bitwise logic operations. I/O operations. Multiply, divide. Stack operations. Support for function calls. Add them all to the mix, and let’s see what we can fit in.

So, as a basic idea, let’s consider some instruction mnemonics to play with. We won’t be implementing all of these, for various reasons, but we can handle some of them:

Addressing modes

One of the things that shows up in a lot of early designs is addressing modes. This is where the CPU has, say, an addition instruction, but as well as being able to add two registers together, it’s able to take one (or both) operands from non-register sources. For example, you could take a constant value as an operand. Going a bit further, you could use the data in memory pointed at by an address in a register, or even combine the two and use the memory pointed at by a constant address. (Some CPUs go a lot further than this – the PDP-11 has pointer-to-pointer-to-data; the ARM supports accessing RAM via a pointer in one register and an offset in another register; and the 68000 goes even further than that).
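To make those source flavours concrete, here’s a minimal Python sketch of what each mode means, with RAM modelled as a 256-entry list. The variable names, addresses and values are purely illustrative and not part of the design.

ram = [0] * 256
ram[0x20] = 7
r1, r2 = 10, 0x20

r1 = r1 + r2          # register: the operand is another register
r1 = r1 + 5           # immediate: the operand is a constant baked into the instruction
r1 = r1 + ram[r2]     # register indirect: r2 holds the RAM address of the operand
r1 = r1 + ram[0x20]   # direct: the RAM address itself is a constant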

Registers

How many registers do we want? It kind of depends on what we can afford to build, and on what we can fit into an instruction. On the latter point, let’s say we’re going to constrain ourselves to single-byte instructions, so we’ve got 8 bits to play with. RISC-V, for example, has 32 bits per instruction, so it’s got room to fit three register fields (5 bits each, choosing from 32 registers) into each instruction, and binary operations like ADD can have three operands:

ADD x1, x2, x3

That is, add x2 and x3 together, and put the result in x1. If we try to squeeze that into 8 bits, we could manage a choice of four registers for each operand, but that would leave only two bits for saying what we wanted to do. That’s only four instructions to play with. It’s a bit tight.
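As a throwaway check of that arithmetic (not part of the design), here’s the bit budget worked through in Python:

instruction_bits = 8
register_fields = 3                  # destination plus two sources, RISC-V style
bits_per_register = 2                # enough to pick one of 4 registers
opcode_bits = instruction_bits - register_fields * bits_per_register
print(opcode_bits, 2 ** opcode_bits)    # 2 bits left over, so only 4 possible operations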

On the other end of the spectrum, the Z80 used in the ZX Spectrum (pun intended) tended to push all of its operations through one privileged register, A. Numerical operations take one register plus A as parameters, and update A with the result (e.g. ADD E sets A := A+E). This can itself be constraining, because everything has to be pushed through A.

Then, going back to the idea of addressing modes, if we’re going to allow data to be read from addresses in registers, is that from any register? Or can we privilege just some registers for that purpose? (The Z80, for example, usually allows the use of the 7 standard registers, or the data pointed at by HL in its addressing modes).

I’m going to pick some arbitrary values and see where it takes us: two general-purpose registers, plus two more that can also be used as addresses.

Let’s call these x and y for the general registers, and i and j for the address registers. With four registers, 2 bits suffices to identify a register. So…

Operation styles

With 2 bits to identify a register, we can fit, say, two register IDs into a single instruction, with 4 bits left over to specify what the instruction is doing. That feels reasonably comfortable (although obviously we’re not going to squeeze all 27 of the instructions above into that space).

With two registers available in an instruction, we’re not limited to the Z80’s single-reg instructions, but we’re not as free as the ARM with its 32 bits and complex addressing modes. But we can do, for example:

r1 = r1 + r2

That is, one of the operands is also updated with the result of the operation. Let’s write this example in our assembly language as ADD r1, r2.
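To give a feel for that 4 + 2 + 2 split, here’s a hypothetical Python sketch of packing and unpacking such an instruction byte. The field order, the register numbering and the opcode value are placeholders I’ve made up for illustration – the real encoding is the subject of a later post.

def encode(op, r1, r2):
    # top 4 bits: operation; next 2 bits: first register; bottom 2 bits: second register
    return (op & 0xF) << 4 | (r1 & 0x3) << 2 | (r2 & 0x3)

def decode(byte):
    return byte >> 4, (byte >> 2) & 0x3, byte & 0x3

X, Y, I, J = 0, 1, 2, 3      # arbitrary register numbering for this sketch
ADD = 0b0001                 # made-up opcode value

instr = encode(ADD, X, Y)    # "ADD x, y" as a single byte
print(f"{instr:08b}", decode(instr))   # 00010001 (1, 0, 1)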

Back to addressing modes

We’ve decided arbitrarily on having four registers, of which two are “addressish”. So, if we’re writing ADD r1, r2, either of these could be a plain register. But we might also want to add some data in RAM to a register: ADD r1, (r2). That is, read the data pointed to by r2, and add it to r1. Let’s make the “addressish” registers capable of doing that as a data source. That gives us six possible sources for each target register r1: x, y, i, j, (i), and (j).

Powers of two are good, so we still have two more slots to fill. We could accept a constant value, ADD r1, #0. And we could accept a constant address to read from, ADD r1, (#0). For both of these, we need some way of specifying an 8-bit constant, but I’m going to ignore that for now, and come back to it in a later post.

Selecting the opcodes to fit

So, if we’re using 4 registers, and another 4 possible sources for the extended addressing modes ((i), (j), k and (k), where k is a constant), what can we do in terms of operations?

Moving data in registers

Let’s start with MOV. We can move data between any pair of normal registers. That’s 2 bits for each of source and destination, for 16 instructions total.

Basic arithmetic

Next up, ADD and SUB. We’d like to be a bit more capable with these. The destination can be just a register, but let’s extend the source (second operand) to include the further addressing modes, such as

ADD x, (i)

or

SUB i, #2

That’s two bits’ worth for the destination, three bits for the second operand, and one bit to distinguish ADD from SUB. I’m also going to take a punt, and say that being able to include (or exclude) a carry from a previous operation is useful here, too. That’s another bit’s worth of instruction. That makes for 7 bits in total, or 128 opcodes – fully half of the available space. But I think it’s worth it.
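Just to double-check that tally (a scratch calculation in Python, nothing more):

dest_bits = 2       # which of the 4 registers is the destination
src_bits = 3        # which of the 8 sources (4 registers + 4 extended modes)
addsub_bit = 1      # ADD vs SUB
carry_bit = 1       # with or without carry
print(2 ** (dest_bits + src_bits + addsub_bit + carry_bit))   # 128 of the 256 opcodes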

We could also consider some speed/convenience operations to increment or decrement a register by 1: INC and DEC. These should only operate on a register, so there are four opcodes for each of these.

Branching/jumping

What else do we need? We need flow control. That’s the JMP instruction. It’s going to need to know where to jump to. With an 8-bit code space, that’s one byte, so we’re going to be doing JMP k, where k is a one-byte constant (and not part of the main opcode). But we also want to do this conditionally, as well as unconditionally. Let’s grab [rolls D12 again] four options for that:

So jump takes only four opcodes, for the four conditions.

We could consider jumping to an address in a register, in which case there’d be a further 16 opcodes, for 4 registers × 4 conditions. The use case for those instructions is much less compelling, though.

Memory access

We can already do some kinds of memory access through the maths operations and the extended addressing modes, but if you just want to load a constant into a register, say, you’d have to do something like: SUB x, x; ADD x, #5 (to clear x, then set the value). Let’s introduce a couple of instructions just to move data in and out of RAM: LDR and STR. Each of these works with a plain register and one of the extended addressing modes, as sketched below.
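Here’s a rough Python model of what LDR and STR do, again with RAM as a 256-entry list. The addresses and values are arbitrary, and the exact assembly syntax for these two instructions isn’t pinned down here.

ram = [0] * 256
i = 0x10          # an address register holding 0x10
x = 0

x = ram[i]        # LDR from a register address: load x from RAM[i]
x = ram[0x30]     # LDR from a constant address
x = 5             # LDR of a constant: the "load immediate" case above
ram[i] = x        # STR to a register address: store x at RAM[i]
ram[0x30] = x     # STR to a constant address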

For LDR, there are four sources (the four addressing modes) and four destinations (the four registers), for a total of 16 opcodes. For STR, there are four sources (the four registers), and three destinations ((i), (j) and (k)), because it makes no sense to store a value in a constant, for 12 more opcodes. There are thus 28 memory access opcodes.

Shifts

Shifting bits around is, on the face of it, fairly simple, but has a lot of options. Left, right? Arithmetic, logical? Shift or rotate? Interaction with other shifts? How far are we shifting?

Following the principle of keeping it simple, I’m only going to consider shifting one bit at a time. It’s fairly easy to build a barrel shifter capable of shifting any number of bits, but it takes up a lot of extra opcodes to define the shift amount. Even with single-bit shifts, we can do most of the useful left shifts using other operations: adding a register to itself (ADD x, x) doubles it, which is a logical shift left, and ADC x, x gives a rotate left through the carry.

So I’m only going to consider right shifts here. I’m also going to constrain it to only shifting the four main registers – no extended addressing modes here, because that would mean reading and writing memory with one instruction. That leaves us with four plausible operations. For the bits in register r:

        bit  7  6  5  4  3  2  1  0
r            h  g  f  e  d  c  b  a
ASR r        h  h  g  f  e  d  c  b
LSR r        0  h  g  f  e  d  c  b
ROR r        a  h  g  f  e  d  c  b
RRC r        C  h  g  f  e  d  c  b

In the RRC (rotate right with carry) operation, the C in bit 7 is the carry bit from the processor status register. In all four cases, the a from bit 0 is moved to the carry bit.
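Here’s a small Python sketch of those four operations on an 8-bit value, following the table above; each one returns the shifted byte plus the new carry, which is always the old bit 0.

def asr(v):             # arithmetic shift right: bit 7 is copied back into bit 7
    return (v >> 1) | (v & 0x80), v & 1

def lsr(v):             # logical shift right: a 0 comes in at bit 7
    return v >> 1, v & 1

def ror(v):             # rotate right: the old bit 0 comes back in at bit 7
    return (v >> 1) | ((v & 1) << 7), v & 1

def rrc(v, carry):      # rotate right through carry: the old carry enters at bit 7
    return (v >> 1) | (carry << 7), v & 1

print(asr(0b10010011))      # (201, 1), i.e. 0b11001001 with a carry of 1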

With four operations and four registers, that’s 16 opcodes for shifts.

Booleans

There are three commonly-used binary boolean operations (AND, OR, XOR), and one unary operation (NOT). Given the same structure as ADD, that’s 16 (or 32, if supporting the extended addressing modes) opcodes for each binary op, and 4 opcodes for NOT, for a total of 52 (or 100) opcodes.

Input/output

Modern hardware tends to use memory-mapped I/O – a piece of hardware will have its internal registers or state injected into the same address space as the RAM, and we can talk to it by writing to this “RAM-like” space. However, since there are only 256 bytes of RAM address space in this design, that would eat into a scarce resource (RAM addresses). Instead, I’m going to introduce a set of I/O ports with their own distinct address space, and special instructions to talk to them. IN r, (k) and OUT (k), r read and write data on the I/O port numbered k, so there are four opcodes for each (plus the constant k).
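As a sketch, the port space can be modelled as its own array, completely separate from RAM. The port numbers here are arbitrary, and the 256-entry size is my assumption (an 8-bit port number), not something fixed by the design.

ram = [0] * 256      # the data store
ports = [0] * 256    # a separate I/O space, assuming an 8-bit port number
x = 0

x = ports[3]         # IN x, (k) with k = 3: read a byte from port 3 into x
ports[7] = x         # OUT (k), x with k = 7: write x to port 7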

Multiply / Divide

While these operations are massively useful in some circumstances, they’re also massively complicated to implement, given that I’m trying to make a dead simple CPU. I could easily double (or quadruple) the transistor/relay budget just implementing a hardware multiply unit. I’m going to leave these out of the design from the outset. If you want to count opcodes, the total gets large pretty quickly if you offer options: MUL has two inputs and two registers of output; DIV may have three register inputs (a 16-bit value split across two registers, plus an 8-bit divisor) and two outputs (quotient and remainder).

Stack ops

PUSH and POP (operating on registers), and the stack-based branch ops CALL and RET all rely on a special stack register. Registers are expensive – more expensive, each, than any other single component of the CPU. If we can avoid holding a separate stack pointer, that’s probably a good thing for minimising the CPU size and complexity. But if we’re going to implement these, then PUSH and POP are 4 opcodes each, and CALL and RET are another four each (assuming that each one has a conditional field).

Summary

The table below summarises how many opcodes we’re considering for each instruction (I’ve omitted MUL and DIV here).

Instruction             Number of opcodes
ADD, SUB, ADC, SBC      128
MOV                     16
INC, DEC                8
JMP                     4
LDR                     16
STR                     12
ASR, LSR, ROR, RRC      16
AND, ORR, XOR           96
NOT                     4
IN, OUT                 8
PUSH, POP               8
CALL, RET               8

That’s 324 opcodes, if we pick everything. But there are only 256 possible opcodes with 1-byte instructions, so we’re going to have to lose something on the way. I’m going to drop the bitwise boolean operations (AND, ORR, XOR and NOT), because they take up a lot of opcodes. I’m also going to drop the stack operations (PUSH, POP, CALL and RET), because they need an extra register for storing the stack pointer, and registers are expensive to build. That leaves us with 208 opcodes, which will fit into a single-byte instruction.
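As a quick sanity check on those numbers (a scratch tally in Python, not part of the design):

counts = {
    "ADD/SUB/ADC/SBC": 128, "MOV": 16, "INC/DEC": 8, "JMP": 4,
    "LDR": 16, "STR": 12, "ASR/LSR/ROR/RRC": 16,
    "AND/ORR/XOR": 96, "NOT": 4, "IN/OUT": 8,
    "PUSH/POP": 8, "CALL/RET": 8,
}
print(sum(counts.values()))   # 324: too many for a 1-byte opcode
for dropped in ("AND/ORR/XOR", "NOT", "PUSH/POP", "CALL/RET"):
    del counts[dropped]
print(sum(counts.values()))   # 208: fits comfortably in 256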

Conclusion

We’re building an 8-bit CPU with a (mostly) 8-bit instruction encoding. It’s not totally minimal – it’s not a BrainFuck CPU – but it’s going to be a very limited set of operations.

In the next post, I’m going to talk about the instruction encoding. How do all these instructions fit into the 8-bit opcodes? What decisions do we have to make along the way, and why?