Designing a trivial CPU
What’s this about?
I’m about 45 years into my computer-fondling journey, and I’ve been wondering how these wonderful things actually work. Like, down to the transistor level sort of thing.
In the past, when I’ve tried reading books on the subject (OK, maybe not the right books, but I run with what I can find in Hay-on-Wye), the book chapters go something like this:
- This is boolean logic.
- This is how you can glue logic together.
- Here’s some combinatoric logic: a half-adder, a multiplexer, and a demuxer.
- (If you’re lucky) Here’s a latch, a couple of types of flip-flop, and a register.
- This is a high-level description of some existing CPU on the market, and this is how to program it in assembly language.
You might spot that there’s a bit of a gap between chapters 4 and 5. How do you combine registers and the combinatoric logic into something that looks like a CPU? How do I make an instruction work? How do I make several instructions work? Can I actually build one of these things?
For the last couple of years, I’ve been playing with this stuff and trying to work it out for myself. I’ve designed, on paper, a few processors. Usually involving lots of nights trying to get to sleep, with a brain that’s intent on building circuits or rearranging bit patterns instead of resting. This is the first post in what I hope will be a series documenting a journey of designing a CPU from scratch.
The design should work at several levels of technology – from individual switches to integrated logic to FPGAs. But I’m going to be getting down into the weeds at some point, and talking about the physical hardware (I’m favouring electromechanical relays at this point, but CMOS transistors or 74-series logic might make an appearance later).
A word of caution, though: I’m just some random software-based dude with poor research motivation on the intertubes. There is absolutely no way that what I’m going to talk about is state of the art, or best practice, or even remotely related to sanity. It’s just what I’ve worked out for myself, while playing around.
What am I building?
I’m a’gonna make a CPU. Let’s say, [rolls D12] an 8-bit CPU. That seems perfectly normal and sensible. (An earlier iteration of this had a 6-bit CPU with 12-bit address registers. Let’s just not go there, OK?) It’s going to have 8 bits worth of RAM, so that’s 256 bytes. Small, but not impossibly so. You won’t be running Linux on it (or even CP/M), but it’ll do for now.
Let’s also add 256 bytes of instructions, separate from the RAM. That makes it a Harvard architecture: instructions are not the same things as data, and are held in two different stores, with disjoint address spaces.
There’s some philosophical choices involved in the Harvard architecture that might seem odd to modern eyes. Primarily, the architecture’s designed to write to the data store, with the instruction store being effectively a second-class citizen. It’s harder (or impossible) to modify the code while the code’s running. It’s like the code’s stored on a completely disjoint ROM, and you need special instructions to read that ROM (or modify it, if modifying ROM is even possible).
Registers and instruction set
Let’s start making some choices. I want some registers to work with, and some instructions to operate on them. I might also want some way of interacting with the store (RAM). I’m going to ignore interactions with the instruction store at this point – I don’t care about loading code, or self-modifying code. Let’s keep it as simple as possible.
Instructions
So: instructions. What should this architecture do? We probably want to be able to add two things. That’s about as basic as it gets. You can add – can you subtract? Yeah. throw that on the pile. Moving stuff around in registers? Sure, can’t be hard. We’re going to need to get data in and out of RAM. For Turing completeness, we’re probably going to need some kind of conditional (either conditional instructions, like ARM has, or conditional jumps).
What else could you think of to do? Increment and decrement registers, maybe? Bit shifts, left and right. Boolean and bitwise logic operations. I/O operations. Multiply, divide. Stack operations. Support for function calls. Add them all to the mix, and let’s see what we can fit in.
So, as a basic idea, let’s consider some instruction mnemonics to play with. We won’t be implementing all of these, for various reasons, but we can handle some of them:
ADD
,SUB
MOV
(between registers)JMP
(control flow – simple jumps and branches)INC
,DEC
LDR
,STR
(register ⟷ RAM)ASR
,LSR
,ROR
,RRC
(right shifts)LSL
,ROL
,RLC
(left shifts)AND
,ORR
,XOR
,NOT
(bitwise booleans)IN
,OUT
(I/O operations)MUL
,DIV
PUSH
,POP
(data stack ops)CALL
,RET
(function calls)
Addressing modes
One of the things that shows up in a lot of early designs is addressing modes. This is where the CPU has, say, an addition instruction, but as well as being able to add two registers together, it’s able to take one (or both) operands from non-register sources. For example, you could take a constant value as an operand. Going a bit further, you could use the data in memory pointed at by an address in a register, or even combine the two and use the memory pointed at by a constant address. (Some CPUs go a lot further than this – the PDP-11 has pointer-to-pointer-to-data; the ARM supports accessing RAM via a pointer in one register and an offset in another register; and the 68000 goes even further than that).
Registers
How many registers do we want? It kind of depends on what we can
afford to build, and on what we can fit into an instruction. On the
latter point, let’s say we’re going to constrain ourselves to
single-byte instructions, so we’ve got 8 bits to play with. Now, for
example, RISC-V has 32 bits per instruction, so it’s got room to
fit in three registers out of 32 (5 bits each) for each instruction,
so the binary operations like ADD
can have three operands:
ADD x1, x2, x3
That is, add x2
and x3
together, and put the result in x1
. If we
try to squeeze that into 8 bits, we could manage a choice of four
registers for each, but we’d be limited to only two bits left for
saying what we wanted to do. That’s only four instructions to play
with. It’s a bit tight.
On the other end of the spectrum, the Z80 used in the ZX Spectrum (pun
intended) tended to push all of its operations through one privileged
register, A
. Numerical operations take one register, plus A
as
parameters, and update A
with the result. (e.g. ADD E
sets A := A+E
).
This can itself be constraining, because everything has to be
pushed through A
.
Then, going back to the idea of addressing modes, if we’re going to
allow data to be read from addresses in registers, is that from any
register? Or can we privilege just some registers for that purpose?
(The Z80, for example, usually allows the use of the 7 standard
registers, or the data pointed at by HL
in its addressing modes).
I’m going to pick some arbitrary values and see where it takes us:
- Four 8-bit general purpose registers
- Of which two registers can be used for memory access
Let’s call these x
and y
for the general registers, and i
and
j
for the address registers. With four registers, 2 bits suffices to
identify a register. So…
Operation styles
With 2 bits to identify a register, we can fit, say, two register IDs into a single instruction, with 4 bits left over to specify what the instruction is doing. That feels reasonably comfortable (although obviously we’re not going to squeeze all 27 of the instructions above into that space).
With two registers available in an instruction, we’re not limited to the Z80’s single-reg instructions, but we’re not as free as the ARM with its 32 bits and complex addressing modes. But we can do, for example:
r1 = r1 + r2
That is, one of the operands is also updated with the result of the
operation. Let’s write this example in our assembly language as ADD r1, r2
.
Back to addressing modes
We’ve decided arbitrarily on having four registers, of which two are
“addressish”. So, if we’re writing ADD r1, r2
, either of these could
be a plain register. But we might also want to add some data in RAM to
a register: ADD r1, (r2)
. That is, read the data pointed to by r2
,
and add it to r1
. Let’s make the “addressish” registers capable of
doing that as a data source. That gives us 6 possible instructions for
each target register r1
:
ADD r1, x
ADD r1, y
ADD r1, i
ADD r1, j
ADD r1, (i)
ADD r1, (j)
Powers of two are good, so we still have two more slots to fill. We
could accept a constant value, ADD r1, #0
. And we could accept a
constant address to read from, ADD r1, (#0)
. For both of these, we
need some way of specifying an 8-bit constant, but I’m going to ignore
that for now, and come back to it in a later post.
Selecting the opcodes to fit
So, if we’re using 4 registers, and another 4 possible sources for the
extended addressing modes ((i)
, (j)
, k
and (k)
, where k
is a
constant), what can we do in terms of operations?
Moving data in registers
Let’s start with MOV
. We can move data between any pair of normal
registers. That’s 2 bits for each of source and destination, for 16
instructions total.
Basic arithmetic
Next up, ADD
and SUB
. We’d like to be a bit more capable with
these. The destination can be just a register, but let’s extend the
source (second operand) to include the further addressing modes, such as
ADD x, (i)
or
SUB i, #2
That’s two bits worth for the destination, three bits for the second
operand, and one bit to distinguish ADD
from SUB
. I’m also going
to take a punt, and say that being able to include (or exclude) a
carry from a previous operation is useful here, too. That’s another
bit’s worth of instruction. That makes for 7 bits in total, or fully
half of the available opcodes. But I think it’s worth it.
We could also consider some speed/convenience operations to increment
or decrement a register by 1: INC
and DEC
. These should only
operate on a register, so there’s four opcodes for each of these.
Branching/jumping
What else do we need? We need flow control. That’s the JMP
instruction. It’s going to need to know where to jump to. With an
8-bit code space, that’s one byte, so we’re going to be doing JMP k
,
where k
is a one-byte constant (and not part of the main opcode).
But we also want to do this conditionally, as well as unconditionally.
Let’s grab [rolls D12 again] four options for that:
AL
: always jumpEQ
/ZO
: jump if the last result was “equal” (or zero)LT
/MI
: jump if the last result was “less than” (or negative)CS
: jump if the last result had a carry set
So jump takes only four opcodes, for the four conditions.
We could consider jumping to an address in a register, in which case there’d be a further 16 opcodes, for 4 registers × 4 conditions. The use case for those instructions is much less compelling, though.
Memory access
We can already do some kinds of memory access through the maths
operations and the extended addressing modes, but if you just want to
load a constant into a register, say, you’d have to do something like:
SUB x, x; ADD x, #5
(to clear x
, then set the value). Let’s
introduce a couple of instructions just to move data in and out of
RAM: LDR
and STR
. Each of these works with a plain register and
one of the extended addressing modes. e.g.:
LDR x, (i)
: Load the data pointed to byi
intox
LDR j, k
: Load the constantk
intoj
STR y, (k)
: Storey
into the RAM addressk
For LDR
, there are four sources (the four addressing modes) and four
destinations (the four registers), for a total of 16 opcodes. For
STR
, there are four sources (the four registers), and three
destinations ((i)
, (j)
and (k)
), because it makes no sense to
store a value in a constant, for 12 more opcodes. There are thus 28
memory access opcodes.
Shifts
Shifting bits around is, on the face of it, fairly simple, but has a lot of options. Left, right? Arithmetic, logical? Shift or rotate? Interaction with other shifts? How far are we shifting?
Following the principle of keeping it simple, I’m only going to consider shifting one bit at a time. It’s fairly easy to build a barrel shifter capable of shifting any number of bits, but it takes up a lot of extra opcodes to define the shift amount. Even with single-bit shifts, we can do most of the useful left shifts using other operations:
- shift left (
LSL
) is equivalent toADD r, r
- rotate left with carry (
RLC
) isADC r, r
- rotate left (
ROL
) isADD r, r; ADC r, #0
So I’m only going to consider right shifts here. I’m also going to
constrain it to only shifting the four main registers – no extended
addressing modes here, because that would mean reading and writing
memory with one instruction. That leaves us with four plausible
operations. For the bits in register r
:
r | h | g | f | e | d | c | b | a |
---|---|---|---|---|---|---|---|---|
ASR r | h | h | g | f | e | d | c | b |
LSR r | 0 | h | g | f | e | d | c | b |
ROR r | a | h | g | f | e | d | c | b |
RRC r | C | h | g | f | e | d | c | b |
In the RRC
(rotate right with carry) operation, the C
in bit 7 is
the carry bit from the processor status register. In all four cases,
the a
from bit 0 is moved to the carry bit.
With four operations and four registers, that’s 16 opcodes for shifts.
Booleans
There are three commonly-used binary boolean operations (AND
, OR
,
XOR
), and one unary operation (NOT
). Given the same structure as
ADD
, that’s 16 (or 32, if supporting the extended addressing modes)
opcodes for each binary op, and 4 opcodes for NOT
, for a total of 52
(or 100) opcodes.
Input/output
Modern hardware tends to use memory-mapped I/O – a piece of hardware
will have its internal registers or state injected into the same
address space as the RAM, and we can talk to it by writing to this
“RAM-like” space. However, since there’s only 256 bytes of RAM address
space in this design, that could get wasteful of a scarce resource
(RAM addresses). Instead, I’m going to introduce a set of I/O ports
with their own distinct address space, and special instructions to
talk to them. IN r, (k)
and OUT (k), r
read and write data on the
I/O port numbered k
, so there’s four opcodes for each (plus the
constant k
).
Multiply / Divide
While these operations are massively useful in some circumstances,
they’re also massively complicated to implement, given that I’m trying
to make a dead simple CPU. I could easily double (or quadruple) the
transistor/relay budget just implementing a hardware multiply unit.
I’m going to leave these out of the design from the outset. If you
want to count opcodes, it gets large pretty quickly if you offer
options: MUL
has two inputs and two registers of output. DIV
may
have three inputs (a 16-bit value and an 8-bit value) and two outputs
(quotient and remainder).
Stack ops
PUSH
and POP
(operating on registers), and the stack-based branch
ops CALL
and RET
all rely on a special stack register. Registers
are expensive – more expensive, each, than any other single
components of the CPU. If we can avoid holding a separate stack
pointer, that’s probably a good thing for minimising the CPU size and
complexity. But if we’re going to implement these, then PUSH
and
POP
are 4 opcodes each, and CALL
and RET
are another four each
(assuming that each one has a conditional field).
Summary
The table below summarises how many opcodes we’re considering for each
instruction (I’ve omitted MUL
and DIV
here).
Instruction | Number of opcodes |
---|---|
ADD , SUB , ADC , SBC |
128 |
MOV |
16 |
INC , DEC |
8 |
JMP |
4 |
LDR |
16 |
STR |
12 |
ASR , LSR , ROR , RRC |
16 |
AND , ORR , XOR |
96 |
NOT |
4 |
IN , OUT |
8 |
PUSH , POP |
8 |
CALL , RET |
8 |
That’s 324 opcodes, if we pick everything. But there’s only 256 possible opcodes with 1-byte instructions, so we’re going to have to lose something on the way. I’m going to drop the bitwise boolean operations, because they take up a lot of opcodes. I’m also going to drop the stack operations, because they need an extra register for storing the stack pointer, and registers are expensive to build. That leaves us with 208 opcodes, which will fit into a single byte instruction.
Conclusion
We’re building an 8-bit CPU with a (mostly) 8-bit instruction encoding. It’s not totally minimal – it’s not a BrainFuck CPU – but it’s going to be a very limited set of operations.
In the next post, I’m going to talk about the instruction encoding. How do all these instructions fit into the 8 bit opcodes? What decisions could we take about doing it, and why?