
Another Useless Architecture (AUA) is a RISC microprocessor architecture developed by Jakob Wilhelm, Stefan Rottensteiner and Stefan Tauner as a course project for the Computer Architecture Lab at the Vienna University of Technology in winter 2008/2009. It has been implemented in VHDL and tested in a Cyclone II FPGA from Altera on a DE2 development board from Terasic Technology.

Description

AUA is a 16-bit RISC architecture with 32 registers and apparent similarities to MIPS. For example, branches are supported by individual compare and branch instructions (i.e. you have to compare two values/registers and use the result as input for the branch decision). The branch delay is 0 cycles for untaken and 1 cycle for taken branches. The branch decision and the branch address calculation are done in the ID stage. An additional cycle is lost if the branch uses the result of the preceding instruction. It is not possible to circumvent this stall with forwarding, as this would introduce a new longest path (from the ID/EX pipeline registers through the ALU, through the branch decision unit in ID, to the PC register in IF). The EX stage does forward its results to resolve data hazards in all other instruction sequences, though (i.e. if instruction i+1 uses the (EX) result of instruction i as (EX) input).

We used on-chip memory of the FPGA (M4K blocks) for the register file. As the memory requires registered inputs, these registers can be regarded as pipeline registers for the WB stage. To be able to read two registers (the two operands) and write one (the result of the previous instruction), we had to duplicate the register file into two "simple dual-port mode" memories. Another possibility would have been to clock the memory twice as fast, but we preferred a single clock domain. In dual-port mode the on-chip memory has a latency of 1.5 cycles, which made a second forwarding path necessary for an instruction that uses the result of an earlier instruction before that result can be read back from the register file.

Besides the branch stall, where IF and (partially) ID get locked, the EX stage needs to lock all stages when a load/store operation takes more than one cycle.
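The branch timing described above can be summarised in a few lines of Python. This is only a sketch of the stated rules (0 cycles untaken, 1 cycle taken, plus 1 cycle if the branch depends on the preceding instruction); the function names are hypothetical and not part of the AUA sources.

```python
def branch_penalty(taken: bool, uses_previous_result: bool) -> int:
    """Lost cycles for one branch, following the rules stated in the text."""
    cycles = 1 if taken else 0       # branch delay: 0 untaken, 1 taken
    if uses_previous_result:         # stall that forwarding cannot remove
        cycles += 1
    return cycles


def loop_branch_overhead(iterations: int, uses_previous_result: bool) -> int:
    """Lost cycles for a loop whose closing branch is taken on every
    iteration except the last one (hypothetical helper)."""
    taken = (iterations - 1) * branch_penalty(True, uses_previous_result)
    not_taken = branch_penalty(False, uses_previous_result)
    return taken + not_taken
```

For example, a loop of 10 iterations whose branch tests a value computed by the instruction directly before it loses 9 * 2 + 1 = 19 cycles under this model, which is why it can pay off to schedule an independent instruction between the compare and the branch.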

Schematic of the AUA pipeline

Memory and I/O

Instructions, data and memory-mapped I/O devices all share the same address space. The MMU is responsible for handling all memory transactions and transfers the data to the instruction cache and from/to the load/store unit in the EX stage. LD/ST transactions have priority over instruction fetches, so that the pipeline can progress normally. IF will schedule a nop if it is blocked because the current instr. is not available.

The MMU can access on-board SRAM, on-chip ROM (e.g. for a bootloader) and an unlimited number of SimpCon devices. They all share the same address space, which is divided into regions by the MMU. The SimpCon bus is specified as a point-to-point connection, but it is possible to create a shared SimpCon bus by connecting every SimpCon slave to an independent 'RD' and 'WR' line. In the opposite direction (i.e. from the SC slave to the master) it is necessary to multiplex 'RD_DATA' and 'RDY_CNT', because there are no tri-state buses in FPGAs (see Tri-State Buses in Altera Devices). Both tasks are fulfilled by a multiplexer process in the top entity. It matches the 'ADDRESS' against a base and mask, similar to IP network masks, and selects the corresponding signals.
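The netmask-style address matching can be sketched as follows. The two device base addresses are the ones quoted in the comments of the example program (0xff10 and 0xff20); the mask values, the SRAM region and all names are assumptions made for illustration and are not taken from the VHDL sources.

```python
from typing import NamedTuple, Optional


class Region(NamedTuple):
    name: str
    base: int  # value the masked address must equal
    mask: int  # 1-bits select the address bits that are compared


# Hypothetical address map; only the two device bases come from the article.
REGIONS = [
    Region("digits", 0xFF10, 0xFFF0),
    Region("uart", 0xFF20, 0xFFF0),
    Region("sram", 0x0000, 0x8000),
]


def select_slave(address: int) -> Optional[str]:
    """Return the name of the first region whose base/mask matches."""
    for region in REGIONS:
        if (address & region.mask) == region.base:
            return region.name
    return None  # no device decodes this address
```

With this map, `select_slave(0xFF21)` selects the UART, because only the upper twelve address bits are compared, just as a /12 prefix would be in an IP routing table.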

Pipeline stages

AUA has three stages (four if you include write-back): IF, ID, EX/MEM. IF requests the instruction at the current address/PC from the cache/MMU and extracts the various fields (opcode, immediate, operand and destination register addresses). These fields get registered and feed into ID.

The immediate field gets expanded and sign-extended before a logic block controls a mux according to the current opcode to choose the correct form of the immediate, or to drop it altogether in favor of a value read from the register file. ID also decides whether a branch is taken and, if so, which address IF has to fetch next. It is not possible to read the PC directly, but to support returning from functions, ID can store the PC into a register by scheduling an ordinary MOV and overriding the source operand with the current PC value.

EX contains the ALU and a small control circuit to process LD/ST instructions in cooperation with the MMU. The result of either the ALU or the MMU is selected and transferred to the register file (through the ID unit). If the destination is the zero register, EX enforces a value of 0 so that the register is not altered.

Instruction Cache

It is possible to plug in different instruction cache implementations between the IF stage and the MMU. Currently a dummy cache (that caches nothing and just connects the signals between IF and the MMU) and a generic direct-mapped cache are provided.

The number of lines cached by the direct-mapped cache can be configured. Currently only one word per cache line is working, because there is no prefetching done by the MMU. It is planned to change this soon. It is also planned to use on-chip memory instead of dedicated logic registers for the cache.
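As an illustration of the behaviour described here, the following Python sketch models a direct-mapped cache with one word per line. It assumes word addressing and a power-of-two number of lines; the class and the `fetch` callback are hypothetical and do not mirror the VHDL entity.

```python
class DirectMappedCache:
    """Behavioural model: one word per line, index = low address bits."""

    def __init__(self, num_lines, fetch):
        # Assumption: the line count is a power of two.
        assert num_lines > 0 and (num_lines & (num_lines - 1)) == 0
        self.num_lines = num_lines
        self.fetch = fetch                 # called on a miss (stands in for the MMU)
        self.tags = [None] * num_lines     # None marks an invalid line
        self.words = [0] * num_lines
        self.hits = 0
        self.misses = 0

    def read(self, address):
        index = address % self.num_lines   # which line the address maps to
        tag = address // self.num_lines    # remaining upper address bits
        if self.tags[index] == tag:
            self.hits += 1
        else:
            self.misses += 1               # miss: refill the line
            self.tags[index] = tag
            self.words[index] = self.fetch(address)
        return self.words[index]
```

Two addresses that differ by a multiple of the line count evict each other, which is the usual trade-off of a direct-mapped organisation against its very simple lookup.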


Instruction Set Architecture

The ISA of AUA is typical for a RISC. It was designed with the decoding in mind: All types of fields (opcode, immediates etc.) have a fixed location in the fixed-sized instruction word of 16b. One attribute of our architecture particularly constrained the design of the ISA: We decided that we want 32 registers. This means that out of the 16 bits of one instruction word, 5 bits are needed to address a register.

Most commands need two input registers and one destination register. We use two-address encoding, where one source register is also used as the destination. This leaves us with at most 6 bits for the opcode, or a maximum of 64 instructions. This number gets reduced by the ldi instruction, because 8b are needed for the immediate itself plus 5b for the destination register, leaving 3b for the ldi opcode. In other words, the remaining 3 bits of the 6-bit opcode field are used up by a part of the immediate value, so ldi occupies 2^3 = 8 of the 64 opcodes, which can't be used for other instructions. Since we could not think of 56 useful instructions anyway, we introduced another type of instruction that uses immediates (e.g. addi, muli). These use 7 bits to encode the immediate, wasting 2b of opcode space each, i.e. each occupies 2^2 = 4 opcodes. Going by the encoding table below, where four instructions use this format (brezi, brnezi, addi, muli), the total number of instructions is limited to 1 + 4 + (64 - 8 - 4*4) = 45.
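The opcode budget can be checked with a short calculation. The sketch below takes the list of short-opcode instructions from the encoding table further down (ldi with a 3-bit opcode; brezi, brnezi, addi and muli with 4-bit opcodes); the variable names are purely illustrative.

```python
OPCODE_BITS = 6
total_slots = 2 ** OPCODE_BITS  # 64 distinct 6-bit opcodes

# Number of opcode bits each immediate-heavy instruction really has.
short_opcode_instrs = {"ldi": 3, "brezi": 4, "brnezi": 4, "addi": 4, "muli": 4}


def slots_used(opcode_bits):
    """6-bit opcode slots consumed by an instruction with a shorter opcode."""
    return 2 ** (OPCODE_BITS - opcode_bits)


used = sum(slots_used(bits) for bits in short_opcode_instrs.values())
remaining = total_slots - used                       # slots left for 6-bit opcodes
max_instructions = len(short_opcode_instrs) + remaining
```

Here `used` comes out as 8 + 4 * 4 = 24, leaving 40 full-width opcodes and therefore at most 45 distinct instructions.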


Instructions

The following table lists all instructions currently implemented in hardware (see the pseudo instruction section below for the rest). There are still some gaps, which are reserved for later use. Bits marked s encode a source register; d and i likewise mark destination registers and immediates. a denotes that the memory address (e.g. for a branch) is taken from the register encoded by these bits. The Imm. column indicates how immediates are interpreted (e.g. the values used with branches get sign-extended and can represent forward as well as backward (=negative) jumps).

Table of AUA instructions and their encoding
The encoding is listed from bit 15 down to bit 0: opcode first, then OP B (bits 9..5) and OP A (bits 4..0). "-" marks an unused bit.

Mnemo.    Imm.      Encoding (15..0)      Hex op     Notes
ldi       unsigned  000 iiiiiiii ddddd    0x00-0x07
                    001000 ----- -----    0x08       reserved
                    001001 ----- -----    0x09       reserved
                    001010 ----- -----    0x0A       reserved
                    001011 ----- -----    0x0B       reserved
                    001100 ----- -----    0x0C       reserved
jmpl                001101 aaaaa -----    0x0D
brez                001110 aaaaa sssss    0x0E
brnez               001111 aaaaa sssss    0x0F
brezi     signed    0100 iiiiiii sssss    0x10-0x13
brnezi    signed    0101 iiiiiii sssss    0x14-0x17
addi      signed    0110 iiiiiii ddddd    0x18-0x1B  sets carry
muli      signed    0111 iiiiiii ddddd    0x1C-0x1F
add                 100000 sssss ddddd    0x20       sets carry
addc                100001 sssss ddddd    0x21       sets carry
sub                 100010 sssss ddddd    0x22       sets carry
subc                100011 sssss ddddd    0x23       sets carry
mul                 100100 sssss ddddd    0x24
mulu                100101 sssss ddddd    0x25
mulh                100110 sssss ddddd    0x26
mulhu               100111 sssss ddddd    0x27
or                  101000 sssss ddddd    0x28
and                 101001 sssss ddddd    0x29
xor                 101010 sssss ddddd    0x2A
not                 101011 sssss ddddd    0x2B
neg                 101100 sssss ddddd    0x2C
asr                 101101 sssss ddddd    0x2D
lsl                 101110 sssss ddddd    0x2E
lsr                 101111 sssss ddddd    0x2F
lsli      unsigned  110000 - iiii ddddd   0x30
lsri      unsigned  110001 - iiii ddddd   0x31
scb       unsigned  110010 k iiii ddddd   0x32       k=set/!clear
roti      unsigned  110011 k iiii ddddd   0x33       k=left/!left
cmplt               110100 sssss sssss    0x34
cmpltu              110101 sssss sssss    0x35
cmplte              110110 sssss sssss    0x36
cmplteu             110111 sssss sssss    0x37
cmpe                111000 sssss sssss    0x38
cmpei     unsigned  111001 iiiii ddddd    0x39
                    111010 ----- -----    0x3A       reserved
mov                 111011 sssss ddddd    0x3B
ld                  111100 aaaaa ddddd    0x3C
ldb                 111101 aaaaa ddddd    0x3D
st                  111110 aaaaa sssss    0x3E
stb                 111111 aaaaa sssss    0x3F
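To make the bit layout concrete, here is a small Python sketch that packs the three main instruction formats into 16-bit words. The field positions are read off the table (opcode in the top bits, OP B in bits 9..5, OP A in bits 4..0); the function names are hypothetical and this is not the code of the AUA assembler.

```python
def encode_rr(opcode, rs, rd):
    """Two-register form: 6-bit opcode, rs in OP B, rd in OP A."""
    assert 0 <= opcode < 64 and 0 <= rs < 32 and 0 <= rd < 32
    return (opcode << 10) | (rs << 5) | rd


def encode_ldi(imm8, rd):
    """ldi: 3-bit opcode 000, 8-bit immediate, 5-bit destination."""
    assert 0 <= imm8 < 256 and 0 <= rd < 32
    return (0b000 << 13) | (imm8 << 5) | rd


def encode_imm7(opcode4, imm, reg):
    """brezi/brnezi/addi/muli: 4-bit opcode, 7-bit signed immediate, register."""
    assert 0 <= opcode4 < 16 and -64 <= imm < 64 and 0 <= reg < 32
    return (opcode4 << 12) | ((imm & 0x7F) << 5) | reg
```

For instance, `encode_rr(0x20, 2, 1)` (an add with source register 2 and destination register 1) yields 0x8041, and `encode_imm7(0b0110, -1, 1)` (addi with immediate -1) yields 0x6FE1.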


ldi – Load Immediate (signed, byte)

Operation: Rd <-- Imm (byte)
Syntax: LDI Rd, Imm
Operands: Imm ... 8-bit signed immediate; Rd ... destination register

Load an 8-bit signed immediate into the low byte of register Rd.


jmpl – Jump and Link

Operation: PC <-- Addr, R31 <-- PC
Syntax: JMPL Rs
Operands: Rs ... register containing the address to jump to

Jump to Addr and store the PC in R31.


brez – Branch if Equal Zero

Operation: if Rs = 0 then PC <-- Addr, else PC <-- PC + 16
Syntax: BREZ Addr, Rs

Branch to Addr if the content of Rs is equal to zero.


brnez – Branch if Not Equal Zero

Operation: if Rs != 0 then PC <-- Addr, else PC <-- PC + 16
Syntax: BRNEZ Addr, Rs

Branch to Addr if the content of Rs is not equal to zero.


brezi – Branch if Equal Zero Immediate

Operation: if Rs = 0 then PC <-- Addr, else PC <-- PC + 16
Syntax: BREZI Rs, Imm
Operands: Imm ... 7-bit signed immediate encoding the branch target Addr; Rs ... register to test

Branch to Addr if the content of Rs is equal to zero.


brnezi – Branch if Not Equal Zero Immediate

Operation: if Rs != 0 then PC <-- Addr, else PC <-- PC + 16
Syntax: BRNEZI Rs, Imm
Operands: Imm ... 7-bit signed immediate encoding the branch target Addr; Rs ... register to test

Branch to Addr if the content of Rs is not equal to zero.


Addi – Add Immediate

Operation: Rd <-- Rd + Imm
Syntax: ADDI Rd, Imm
Operands: Imm ... 7-bit signed immediate

Add a 7-bit signed immediate to Rd and write the result back to Rd.


Muli – Multiply Immediate

Operation: Rd <-- Rd * Imm
Syntax: MULI Rd, Imm
Operands: Imm ... 7-bit signed immediate

Multiply Rd by a 7-bit signed immediate and write the result back to Rd. After execution, Rd holds the low word of the multiplication.


Add – Add without carry

Operation: Rd <-- Rd + Rs
Syntax: ADD Rd, Rs
Operands: Rd, Rs interpreted as signed

Add the content of register Rs to register Rd and write the result back to Rd.
The carry bit is set if an overflow occurred.


Addc – Add with Carry

Operation: Rd <-- Rd + Rs + C
Syntax: ADDC Rd, Rs
Operands: Rd, Rs interpreted as signed

Add the carry bit and the content of Rs to Rd and write the result back to Rd.
The carry bit is set if the result is greater than the max. value, otherwise it is cleared.


Sub – Subtract without carry

Operation: Rd <-- Rd - Rs
Syntax: SUB Rd, Rs
Operands: Rd, Rs interpreted as signed

Subtract the content of register Rs from register Rd and write the result back to Rd.
The carry bit is set if an overflow occurred.


Subc – Subtract with Carry

Operation: Rd <-- Rd - Rs - C
Syntax: SUBC Rd, Rs
Operands: Rd, Rs interpreted as signed

Subtract the carry bit and the content of Rs from Rd and write the result back to Rd.
The carry bit is set if the result is less than the min. value, otherwise it is cleared.
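The usual purpose of an add/addc pair is arithmetic on values wider than one register. The Python sketch below models a 32-bit addition built from two 16-bit steps. It assumes that the carry bit behaves like a plain carry-out of bit 15, which is one possible reading of the descriptions above and has not been checked against the hardware; the function names are hypothetical.

```python
MASK16 = 0xFFFF


def add16(rd, rs, carry_in=0):
    """Model of add (carry_in=0) or addc: returns (16-bit result, carry)."""
    total = (rd & MASK16) + (rs & MASK16) + carry_in
    return total & MASK16, total >> 16   # assumption: carry = carry-out of bit 15


def add32(a, b):
    """32-bit addition composed of an add on the low words and an addc on the high words."""
    low, carry = add16(a & MASK16, b & MASK16)          # add
    high, carry = add16(a >> 16, b >> 16, carry)        # addc
    return (high << 16) | low, carry
```

Adding 1 to 0x0001FFFF in this model carries from the low word into the high word and gives 0x00020000.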


Mul – Multiply low word signed

Operation: Rd <-- Rd x Rs
Syntax: MUL Rd, Rs
Operands: Rd, Rs interpreted as signed

Multiply the contents of Rs and Rd, interpreted as signed, and write the result back to Rd.
After execution, Rd holds the low word of the multiplication.


Mulu – Multiply low word Unsigned

Operation: Rd <-- Rd x Rs
Syntax: MULU Rd, Rs
Operands: Rd, Rs interpreted as unsigned

Multiply the contents of Rs and Rd, interpreted as unsigned, and write the result back to Rd.
After execution, Rd holds the low word of the multiplication.


Mulh – Multiply High word signed

Operation: Rd <-- Rd x Rs
Syntax: MULH Rd, Rs
Operands: Rd, Rs interpreted as signed

Multiply the contents of Rs and Rd, interpreted as signed, and write the result back to Rd.
After execution, Rd holds the high word of the multiplication.


Mulhu – Multiply High word Unsigned

Operation: Rd <-- Rd x Rs
Syntax: MULHU Rd, Rs
Operands: Rd, Rs interpreted as unsigned

Multiply the contents of Rs and Rd, interpreted as unsigned, and write the result back to Rd.
After execution, Rd holds the high word of the multiplication.
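The relation between the four multiply instructions can be modelled in Python: the full product of two 16-bit registers is 32 bits wide, and each instruction returns one half of it. This is a behavioural sketch with hypothetical helper names, not a description of the multiplier in the ALU.

```python
def to_signed16(value):
    """Interpret the low 16 bits of value as a two's complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def mul_words(rd, rs, signed):
    """Return (high word, low word) of the 32-bit product."""
    if signed:
        product = to_signed16(rd) * to_signed16(rs)   # mul / mulh
    else:
        product = (rd & 0xFFFF) * (rs & 0xFFFF)       # mulu / mulhu
    product &= 0xFFFFFFFF                             # wrap to 32 bits
    return product >> 16, product & 0xFFFF
```

With rd = 0xFFFF and rs = 2, the signed product is -2 (high word 0xFFFF, low word 0xFFFE), while the unsigned product is 0x1FFFE (high word 0x0001, low word 0xFFFE). The low word is the same in both interpretations; only the high word differs.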


Or – Logical Or

Operation: Rd <-- Rd or Rs
Syntax: OR Rd, Rs

Bitwise OR of the contents of the two registers; the result is written back to Rd.


And – Logical And

Operation: Rd <-- Rd and Rs
Syntax: AND Rd, Rs

Bitwise AND of the contents of the two registers; the result is written back to Rd.


Xor – Logical Xor

Operation: Rd <-- Rd xor Rs
Syntax: XOR Rd, Rs

Bitwise XOR of the contents of the two registers; the result is written back to Rd.


Not – Invert register

Operation: Rd <-- not Rs
Syntax: NOT Rd, Rs

Invert the bits of Rs and write the result to Rd.


Neg – Negate (two's complement)

Operation: Rd <-- -Rs
Syntax: NEG Rd, Rs

Calculate the two's complement of Rs and write the result to Rd.
For the most negative value, the result is the most negative value again.


Asr – Arithmetic Shift Right

Operation: Rd <-- (b15), (Rd >> 1)
Syntax: ASR Rd

Keep the MSB and shift all bits in Rd one bit to the right.
The LSB is shifted out.


Lsl – Logical Shift Left

Operation: Rd <-- (Rd << 1)
Syntax: LSL Rd

Shift all bits in Rd to the left. The MSB is shifted out, the LSB is cleared.


Lsr – Logical Shift Right

Operation: Rd <-- (Rd >> 1)
Syntax: LSR Rd

Shift all bits in Rd to the right. The MSB is cleared, the LSB is shifted out.


Lsli – Logical Shift Left Immediate

Operation: Rd <-- (Rd << Imm)
Syntax: LSLI Rd, Imm
Operands: Imm is interpreted as unsigned

Shift all bits to the left by the given value. The MSB is shifted out, the LSB is cleared.
The immediate is 4 bits long and interpreted as unsigned.


Lsri – Logical Shift Right Immediate

Operation: Rd <-- (Rd >> Imm)
Syntax: LSRI Rd, Imm
Operands: Imm is interpreted as unsigned

Shift all bits to the right by the given value. The MSB is cleared, the LSB is shifted out.
The immediate is 4 bits long and interpreted as unsigned.


Scb – Set/Clear Bit

Operation: Rd(bit x) <-- 1/0
Syntax: SCB Rd, Imm
Operands: Imm is interpreted as unsigned; Bit(4) = 1: set, Bit(4) = 0: clear; Bit(3 to 0) = position

Bits 3 to 0 of the immediate indicate which bit in Rd should be set or cleared.
If bit 4 is 1, the bit is set, otherwise it is cleared.


Roti – Rotate right/left by an Immediate

Operation: Rd(b14..b0, b15) <-- Rd(b15..b0) if dir = 0; Rd(b0, b15..b1) <-- Rd(b15..b0) if dir = 1
Syntax: ROTI Rd, Imm
Operands: Imm is interpreted as unsigned; Bit(4) = 1: right, Bit(4) = 0: left; Bit(3 to 0) = number of bit positions to rotate

Bits 3 to 0 of the immediate indicate how far the bits in Rd are rotated.
If bit 4 is 1, rotate right, otherwise rotate left.
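Both scb and roti pack a control bit and a 4-bit count into one 5-bit immediate. The following Python sketch decodes that immediate as described in the two sections above (bit 4 = 1 means "set" for scb and "right" for roti). Note that the note column of the encoding table words the roti direction differently, so the direction used here is an assumption; the function names are hypothetical.

```python
def scb(rd, imm5):
    """Set (bit 4 = 1) or clear (bit 4 = 0) the bit selected by bits 3..0."""
    position = imm5 & 0xF
    if imm5 & 0x10:
        return (rd | (1 << position)) & 0xFFFF
    return rd & ~(1 << position) & 0xFFFF


def roti(rd, imm5):
    """Rotate a 16-bit value by bits 3..0; bit 4 = 1 right, bit 4 = 0 left."""
    rd &= 0xFFFF
    count = imm5 & 0xF
    if imm5 & 0x10:
        return ((rd >> count) | (rd << (16 - count))) & 0xFFFF
    return ((rd << count) | (rd >> (16 - count))) & 0xFFFF
```

A rotation by 8 swaps the two bytes of the register regardless of direction, e.g. `roti(0x1234, 8)` and `roti(0x1234, 0x18)` both give 0x3412.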


Cmplt – Compare Less Than

Operation: Rd <-- 1 if Rd < Rs, else 0
Syntax: CMPLT Rd, Rs
Operands: Rd, Rs interpreted as signed

Test if Rd < Rs.
If true, write 1 to Rd, otherwise 0.


Cmpltu – Compare Less Than Unsigned

Operation: Rd <-- 1 if Rd < Rs, else 0
Syntax: CMPLTU Rd, Rs
Operands: Rd, Rs interpreted as unsigned

Test if Rd < Rs.
If true, write 1 to Rd, otherwise 0.


Cmplte – Compare Less Than Equal

Operation: Rd <-- 1 if Rd <= Rs, else 0
Syntax: CMPLTE Rd, Rs
Operands: Rd, Rs interpreted as signed

Test if Rd <= Rs.
If true, write 1 to Rd, otherwise 0.


Cmplteu – Compare Less Than Equal Unsigned

Operation: Rd <-- 1 if Rd <= Rs, else 0
Syntax: CMPLTEU Rd, Rs
Operands: Rd, Rs interpreted as unsigned

Test if Rd <= Rs.
If true, write 1 to Rd, otherwise 0.


Cmpe – Compare Equal

Operation: Rd <-- 1 if Rd = Rs, else 0
Syntax: CMPE Rd, Rs

Test if Rd = Rs.
If true, write 1 to Rd, otherwise 0.


Cmpei – Compare Equal Immediate

Operation: Rd <-- 1 if Rd = Imm, else 0
Syntax: CMPEI Rd, Imm

Test if Rd = Imm.
If true, write 1 to Rd, otherwise 0.


Mov – Copy Registercontent

Operation: Rd <-- Rs
Syntax: MOV Rd, Rs

Copy the content of Rs to Rd.


Ld – Load word

Operation: Rd <-- Addr
Syntax: LD Rd, Addr

Load a word from Addr into Rd.


Ldb – Load Byte

Operation: Rd <-- Addr (byte)
Syntax: LDB Rd, Addr

Load a byte from Addr into the lower byte of Rd.


St – Store word

Operation: Addr <-- Rs
Syntax: ST Addr, Rs

Store the content of Rs to Addr.


Stb – Store Byte

Operation: Addr <-- Rs (byte)
Syntax: STB Addr, Rs

Store the lower byte of Rs to Addr.

Assembler

Before we wrote our own assembler, we tried to use an assembler generator that takes an architecture definition (how many registers, definition and encoding of the instructions etc.) and outputs an assembler, a disassembler, a high-level simulator etc. We tried ArchC, a Brazilian project that patches the GNU binutils according to your architecture. Sounds great? Theoretically, yes, but ArchC needs SystemC and both take quite some time to set up if you don't know exactly what you need to do (e.g. ArchC requires a closely matched version of the binutils sources, or patching fails). When we finally got it working, we found some bugs in the generated assembler (e.g. labels were not handled correctly all the time). We had spent a few hours digging through binutils code when we decided that it was not worth the trouble and started work on our own tools.

The assembler is written in C++ with the help of the Boost library. It supports various pseudo instructions (see below), can link/include multiple files and is able to write constants (words, strings, arrays of words) into a "data section".

Pseudo Instructions

Nop – No Operation

Operation: no operation
Syntax: NOP

Do nothing.
Implemented as an ldi of 0 to the zero register.

Example: ldi $0, 0


ret – Return

Operation: PC <-- Addr
Syntax: RET
Operands: R31 contains the address to jump to

Implemented as a branch to the address in R31 if R0 equals zero (always true).

Example: brez $0, $31


Jmp – Jump

Operation: PC <-- Addr
Syntax: JMP Rs
Operands: Rs contains the address to jump to

Implemented as a branch to the address in Rs if R0 equals zero (always true).

Example: brez $0, Rs


Rjmpi – Relative Jump Immediate

Operation: PC <-- Imm
Syntax: RJMPI Imm
Operands: Imm represents the address to jump to

Implemented as a branch to the address represented by Imm if R0 equals zero (always true).

Example: brezi $0, Imm


Cmpgt – Compare Greater Than

Operation: Rd <-- Rd > Rs
Syntax: CMPGT Rd, Rs
Operands: Rd, Rs interpreted as signed

Test if Rd > Rs.
If true, write 1 to Rd, otherwise 0.
Implemented as a cmplte between Rs and Rd.

Example: CMPGT R1, R2 maps to CMPLTE R2, R1


Cmpgtu – Compare Greater Than Unsigned

Operation: Rd <-- Rd > Rs
Syntax: CMPGTU Rd, Rs
Operands: Rd, Rs interpreted as unsigned

Test if Rd > Rs.
If true, write 1 to Rd, otherwise 0.
Implemented as a cmplteu between Rs and Rd.

Example: CMPGTU R1, R2 maps to CMPLTEU R2, R1


Cmpgte – Compare Greater Than Equal

Operation: Rd <-- Rd >= Rs
Syntax: CMPGTE Rd, Rs
Operands: Rd, Rs interpreted as signed

Test if Rd >= Rs.
If true, write 1 to Rd, otherwise 0.
Implemented as a cmplt between Rs and Rd.

Example: CMPGTE R1, R2 maps to CMPLT R2, R1


Cmpgteu – Compare Greater Than Equal Unsigned

Operation: Rd <-- Rd >= Rs
Syntax: CMPGTEU Rd, Rs
Operands: Rd, Rs interpreted as unsigned

Test if Rd >= Rs.
If true, write 1 to Rd, otherwise 0.
Implemented as a cmpltu between Rs and Rd.

Example: CMPGTEU R1, R2 maps to CMPLTU R2, R1


Swpb – Swap Bytes

Operation: Rd(byte0, byte1) <-- Rd(byte1, byte0)
Syntax: SWPB Rd

Swap the two bytes of register Rd.
Implemented as a roti on Rd by 8 bits.

Example: SWPB R1 maps to ROTI R1, 8


Set – Set all bits to 1

Operation: Rd <-- 0xFFFF
Syntax: SET Rd

Set all bits in Rd to 1.
Implemented as a not with Rd and the zero register.

Example: SET R1 maps to NOT R1, R0


Clr – Clear all bits to 0

Operation: Rd <-- 0x0000
Syntax: CLR Rd

Set all bits in Rd to 0.
Implemented as a mov with Rd and the zero register.

Example: CLR R1 maps to MOV R1, R0


Inc – Increment without carry

Operation: Rd <-- Rd + 1
Syntax: INC Rd
Operands: Rd interpreted as signed

Increment the content of Rd by 1.
Implemented as an addi with immediate 1.

Example: INC R1 maps to ADDI R1, 1


Dec – Decrement without carry

Operation: Rd <-- Rd - 1
Syntax: DEC Rd
Operands: Rd interpreted as signed

Decrement the content of Rd by 1.
Implemented as an addi with immediate -1.

Example: DEC R1 maps to ADDI R1, -1


ldiw – Load Immediate Word

Operation: Rd <-- Imm (word)
Syntax: LDIW Rd, Imm

Load a word into Rd.
Implemented as:
load the high byte of the immediate into the register
shift the register left by 8 bits
load the low byte of the immediate into the register

Example: LDIW R1, 0x3F1A maps to
LDI R1, 0x3F
LSLI R1, 8
LDI R1, 0x1A
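The three-step sequence only works if ldi leaves the upper byte of the register untouched, as the ldi description ("to the low byte of register Rd") suggests. The Python sketch below expands an ldiw and runs it on a tiny register model under exactly that assumption; `expand_ldiw` and `run` are hypothetical helpers, not part of the AUA tools.

```python
def expand_ldiw(rd, imm16):
    """Expand ldiw into the ldi / lsli / ldi sequence shown above."""
    assert 0 <= imm16 <= 0xFFFF
    return [
        ("ldi", rd, imm16 >> 8),     # high byte first
        ("lsli", rd, 8),             # move it into the upper half
        ("ldi", rd, imm16 & 0xFF),   # then the low byte
    ]


def run(program, regs):
    """Execute ldi and lsli on a dict of 16-bit registers."""
    for op, rd, imm in program:
        if op == "ldi":
            # Assumption: ldi replaces only the low byte of the register.
            regs[rd] = (regs[rd] & 0xFF00) | (imm & 0xFF)
        elif op == "lsli":
            regs[rd] = (regs[rd] << imm) & 0xFFFF
        else:
            raise ValueError("unsupported instruction: " + op)
    return regs
```

Running `run(expand_ldiw(1, 0x3F1A), {1: 0xBEEF})` leaves 0x3F1A in register 1, independent of the previous register content, because the shift pushes the old upper byte out.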


setb - Set Bit

Operation: Rd(x) <-- 1
Syntax: SETB Rd, Imm
Operands: Imm is 4 bits long and indicates the bit position to set

Set the corresponding bit in Rd.
Implemented as scb where bit 4 of the immediate is 1.

Example: SETB R10, 0xF maps to SCB R10, 0x1F


clrb - Clear Bit

Operation: Rd(x) <-- 0
Syntax: CLRB Rd, Imm
Operands: Imm is 4 bits long and indicates the bit position to clear

Clear the corresponding bit in Rd.
Implemented as scb where bit 4 of the immediate is 0.

Example: CLRB R10, 0xF maps to SCB R10, 0x0F
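Most of the pseudo instructions above are one-to-one rewrites, which an assembler can handle with a simple lookup before encoding. The Python sketch below shows the idea for a few of them, using the mappings given in the examples; it is an illustration with a hypothetical `expand` function and does not reproduce the actual C++ assembler.

```python
def expand(mnemonic, *operands):
    """Rewrite a pseudo instruction into the real instruction it maps to.
    Real instructions are passed through unchanged."""
    name = mnemonic.lower()
    if name == "nop":
        return ("ldi", "$0", 0)
    if name == "inc":
        return ("addi", operands[0], 1)
    if name == "dec":
        return ("addi", operands[0], -1)
    if name == "clr":
        return ("mov", operands[0], "$0")
    if name == "set":
        return ("not", operands[0], "$0")
    if name == "setb":
        return ("scb", operands[0], 0x10 | (operands[1] & 0xF))  # bit 4 = set
    if name == "clrb":
        return ("scb", operands[0], operands[1] & 0xF)           # bit 4 = clear
    return (name, *operands)
```

For example, `expand("setb", "$10", 0xF)` returns `("scb", "$10", 0x1F)`, matching the SETB example above.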

Build Instructions

Get the source at https://github.com/stefanct/aua/.

VHDL (hw directory)

The project folder contains an aua.qsf file with all necessary settings for Altera's Quartus II. The most important .vhd file is hw/src/aua_types_de2.vhd, which holds a number of important constants (e.g. CPU and UART frequencies) and defines various data types used throughout the source files. Another important file, especially if you want to port AUA to another FPGA, is src/aua_top.vhd, where the top entity and the configuration reside. In the configuration section one can choose between different caches (see above) and two ALUs.

Included are also two Modelsim scripts for RTL and gate level simulation. They can be found in hw/sim. They use the testbenches in the same directory.

Assembler (as)

A makefile is included in the as directory. The assembler links with boost_regex, so be sure to have Boost installed. To assemble a program and embed the binary into ../hw/mmu/rom.vhd, execute "as -r <progname>.asm"; to get the binary alone (usable with the simulator or for bootstrapping), call "as <progname>.asm <binaryname>".

High Level Simulator (sim)

The simulator is written in the scripting language Python, so no compilation is necessary. When called with "python sim.py" (in sim/src), it reads the binary file from ../../as/boot. See the online help (command "h") for further instructions.

Example program

Hello World

#include libaua/io/digit.h
#include libaua/io/uart.h

ldiw $20, SC_DIGITS -- (0xff10) address of memory mapped simcon slave
ldiw $21, 0xaa
st $21, $20


ldiw $1, SC_UART -- (0xff20) address of memory mapped simcon slave
mov $2, $1
addi $2, 1

#define MSG "Hello World"

ldiw $3, MSG

loop0:

ldiw $5, 0 -- Counter

loop:
	ld $10, $1
	ldi $11, 1
	and $10, $11
	brezi $10, loop

	mov $6, $3
	mov $7, $5
	ldi $8, 0xfe
	and $7, $8
	add $6, $7

	mov $7, $5
	ldi $8, 1
	and $7, $8
	ld $4, $6

	brezi $7, low_b

high_b:
	rjmpi foo

low_b:
	lsri $4, 8

foo:
	st $4, $2
	addi $5, 1

	ldi $8, 11 -- top value
	mov $6, $5
	sub $6, $8
	brezi $6, loop0

	rjmpi loop
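The core of the loop is the selection of one character out of a 16-bit word: the counter with its lowest bit cleared (and 0xfe) addresses the word, and the lowest bit decides whether the word is shifted right by 8 first. The Python sketch below mirrors that logic. It assumes byte addresses with words at even offsets, the first character of each pair stored in the high byte, and a UART that transmits only the low 8 bits of the stored word; none of this is confirmed by the article, and the helper names are hypothetical.

```python
def pack_words(message):
    """Pack an ASCII string into 16-bit words keyed by even byte offset.
    Assumption: the first character of each pair sits in the high byte."""
    data = message.encode("ascii")
    if len(data) % 2:
        data += b"\x00"                      # pad to a whole word
    return {off: (data[off] << 8) | data[off + 1] for off in range(0, len(data), 2)}


def byte_for_index(memory, index):
    """Mirror of the loop body: load the word, shift for even indices."""
    word = memory[index & 0xFE]              # and $7, 0xfe / add / ld
    if (index & 1) == 0:                     # brezi $7, low_b
        word >>= 8                           # lsri $4, 8
    return word & 0xFF                       # assumption: UART sends the low byte
```

Under these assumptions, collecting `byte_for_index` for the indices 0 to 10 reproduces the eleven characters of "Hello World", after which the assembly program resets its counter and starts over.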
This article is issued from Wikiversity. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.