Review of Computer Architetcure
A Sahu
Deptt. of Comp. Sc. & Engg.
IIT Guwahati
Outline
Computer organization Vs Architecture
Comp Organization => Digital Logic Module
Logic and Low level
============================
Comp Architecture = > ISA Design, MicroArch Design
Algorithm for
Hardware abstraction
Main
memory
bridge
Bus interface
ALU
Register file
CPU
System bus
Memory bus
Disk
controller
Graphics
adapter
USB
controller
Mouse
Keyboard
Display
Disk
I/O bus
Expansion slots for
other devices such
as network adapters
PC
Hardware/software interface
software
hardware
C++
m/c instr
reg, adder
transistors
Arch. focus
Instruction Set Architecture
ISA
Compiler
OS
CPU
Design
Circuit
Design
Chip
Layout
Application
Program
The Abstract Machine
PC
Registers
CPU
Memory
Code + Data
Addresses
Data
Instructions
Stack
Condition
Codes
ALU
Instructions
Instructions
Instructions LD/ST & Control
What constitutes ISA?
RISC vs. CISC
Motorola 680x0, Intel 80x86
MIPS subset for implementation
Incremental changes in the design to include other instructions will be discussed later
Design overview
R
e
g
i
s
t
e
r
s
R
e
g
i
s
t
e
r
#
D
a
t
a
R
e
g
i
s
t
e
r
#
D
a
t
a
m
e
m
o
r
y
A
d
d
r
e
s
s
D
a
t
a
R
e
g
i
s
t
e
r
#
P
C
I
n
s
t
r
u
c
t
i
o
n
A
L
U
I
n
s
t
r
u
c
t
i
o
n
m
e
m
o
r
y
A
d
d
r
e
s
s
Division into data path and control
DATA PATH
CONTROLLER
control
signals
status
signals
Building block types
Two types of functional units:
Components for MIPS subset
Components - register
PC
clock
32
32
Components - adder
32
32
32
PC+4
offset
+
32
32
32
PC
4
+
Components - ALU
32
32
32
operation
result
a
b
ALU
a=b
overflow
Components - multiplexers
0
mux
1
32
32
PC+4
PC+4+offset
32
select
Components - register file
R
e
g
W
r
i
t
e
R
e
g
i
s
t
e
r
s
W
r
i
t
e
r
e
g
i
s
t
e
r
R
e
a
d
d
a
t
a
1
R
e
a
d
d
a
t
a
2
R
e
a
d
r
e
g
i
s
t
e
r
1
R
e
a
d
r
e
g
i
s
t
e
r
2
W
r
i
t
e
d
a
t
a
D
a
t
a
D
a
t
a
R
e
g
i
s
t
e
r
n
u
m
b
e
r
s
5
5
5
Components - program memory
I
n
s
t
r
u
c
t
i
o
n
m
e
m
o
r
y
I
n
s
t
r
u
c
t
i
o
n
a
d
d
r
e
s
s
I
n
s
t
r
u
c
t
i
o
n
MIPS components - data memory
M
e
m
R
e
a
d
M
e
m
W
r
i
t
e
D
a
t
a
m
e
m
o
r
y
W
r
i
t
e
d
a
t
a
R
e
a
d
d
a
t
a
A
d
d
r
e
s
s
Components - bit manipulation circuits
Sign
xtend
16
32
shift
32
32
0
LSB
MSB
LSB
MSB
MIPS subset for implementation
Datapath for add,sub,and,or,slt
Format: add $t0, $s1, $s2
000000 10001 10010 01000 00000 100000� op rs rt rd shamt funct
Fetching instruction
PC
IM
ad
ins
Addressing RF
PC
IM
ad
ins
RF
rad1
rad2
wad
wd
rd1
rd2
ins[25-21]
ins[20-16]
Passing operands to ALU
PC
IM
ad
ins
RF
rad1
rad2
wad
wd
rd1
rd2
ALU
ins[25-21]
ins[20-16]
Passing the result to RF
PC
IM
ad
ins
RF
rad1
rad2
wad
wd
rd1
rd2
ALU
ins[25-21]
ins[20-16]
ins[15-11]
Incrementing PC
PC
IM
ad
ins
RF
rad1
rad2
wad
wd
rd1
rd2
ALU
+
4
ins[25-21]
ins[20-16]
ins[15-11]
Load and Store instructions
Adding “sw” instruction
PC
IM
ad
ins
RF
rad1
rad2
wad
wd
rd1
rd2
DM
ad
rd
wd
ALU
+
4
ins[25-21]
ins[20-16]
ins[15-11]
0
1
sx
ins[15-0]
16
Adding “lw” instruction
PC
IM
ad
ins
RF
rad1
rad2
wad
wd
rd1
rd2
DM
ad
rd
wd
ALU
+
sx
0
1
0
1
4
ins[25-21]
ins[20-16]
ins[15-11]
ins[15-0]
16
1
0
Adding “beq” instruction
PC
IM
ad
ins
RF
rad1
rad2
wad
wd
rd1
rd2
DM
ad
rd
wd
ALU
+
sx
0
1
1
0
0
1
4
ins[25-21]
ins[20-16]
ins[15-11]
ins[15-0]
16
+
s2
0
1
Adding “j” instruction
PC
IM
ad
ins
RF
rad1
rad2
wad
wd
rd1
rd2
DM
ad
rd
wd
ALU
+
+
s2
sx
0
1
0
1
1
0
0
1
4
ins[25-21]
ins[20-16]
ins[15-11]
ins[15-0]
0
1
s2
ins[25-0]
PC+4[31-28]
ja[31-0]
28
16
Control signals
PC
IM
ad
ins
RF
rad1
rad2
wad
wd
rd1
rd2
DM
ad
rd
wd
ALU
+
+
s2
sx
0
1
0
1
1
0
0
1
0
1
s2
4
Rdst
jmp
Z
MR
RW
ins[25-0]
ins[25-21]
ins[20-16]
ins[15-11]
ins[15-0]
PC+4[31-28]
ja[31-0]
28
16
MW
op
3
Asrc
Psrc
M2R
Datapath + Control
PC
IM
ad
ins
RF
rad1
rad2
wad
wd
rd1
rd2
DM
ad
rd
wd
ALU
+
+
s2
sx
0
1
0
1
1
0
0
1
0
1
s2
4
Rdst
jmp
Psrc
Z
brn
MR
MW
M2R
op
RW
Asrc
ins[25-0]
control
ins[31-26]
ins[25-21]
ins[20-16]
ins[15-11]
ins[15-0]
Actrl
ins[5-0]
PC+4[31-28]
ja[31-0]
28
16
opc
2
3
Analyzing performance
Component delays
Delay for {add, sub, and, or, slt}
PC
IM
ad
ins
RF
rad1
rad2
wad
wd
rd1
rd2
ALU
+
4
ins[25-21]
ins[20-16]
ins[15-11]
Delay for {sw}
PC
IM
ad
ins
RF
rad1
rad2
wad
wd
rd1
rd2
DM
ad
rd
wd
ALU
+
4
ins[25-21]
ins[20-16]
sx
ins[15-0]
16
Clock period in single cycle design
tR
tR
tM
tM
tR
tR
tR
tR
tA
tA
tA
tA
t+
t+
tI
tI
tI
tI
t+
tI
t+
tI
R-class
lw
sw
beq
j
clock
period
Clock period in multi-cycle design
tR
tR
tM
tM
tR
tR
tR
tR
tA
tA
tA
tA
t+
t+
tI
tI
tI
tI
t+
tI
t+
tI
R-class
lw
sw
beq
j
clock
period
Cycle time and CPI
cycle time
short
long
low
high
CPI
single cycle
design
pipelined
design
multi-cycle
design
PIpelined datapath (abstract)
PC
IM
ad
ins
RF
rad
wad
wd
rd1
rd2
DM
ad
rd
wd
ALU
+
+
4
IF
ID
EX
Mem
WB
IF/ID
ID/EX
EX/Mem
Mem/WB
Fetch new instruction every cycle
PC
IM
ad
ins
RF
rad
wad
wd
rd1
rd2
DM
ad
rd
wd
ALU
+
+
4
IF
ID
EX
Mem
WB
IF/ID
ID/EX
EX/Mem
Mem/WB
Pipelined processor design
PC
IM
ad
ins
RF
rad1
wad
wd
rd1
rd2
DM
ad
rd
wd
ALU
+
+
4
0
1
1
0
0
1
s2
sx
1
0
rad2
control
Actrl
0
1
0
bubble
IF/IDw
PCw
PCw=0
IF/IDw=0
bubble=1
flush
Graphical representation
IM
RF
DM
ALU
RF
5 stage pipeline
IF
ID
EX
Mem
WB
stages
actions
Usage of stages by instructions
IM
RF
DM
ALU
RF
IM
RF
DM
ALU
RF
IM
RF
DM
ALU
RF
IM
RF
DM
ALU
RF
lw
sw
add
beq
Pipelining
IF D RF EX/AG M WB
Simple multicycle design :
Degree of overlap Depth
Serial
Overlapped
Pipelined
Shallow
Deep
Hazards in Pipelining
Data Hazards
delay = 3
previous
instr
current
instr
read/write
read/write
Structural Hazards
A
B
A
C
A
B
A
C
A
B
A
C
A
B
C
D
A
C
B
D
F
D
X
X
F
D
X
X
Caused by Resource Conflicts
Control Hazards
delay = 5
branch
instr
next inline
instr
target
instr
cond eval
delay = 2
target addr gen
Pipeline Performance
CPI = 1 + (S - 1) * b
Time = CPI * T / S
T
S stages
Frequency of interruptions - b
Improving Branch Performance
Branch Elimination
C
S
Use conditional instructions
(predicated execution)
T
F
C : S
OP1
BC CC = Z, * + 2
ADD R3, R2, R1
OP2
OP1
ADD R3, R2, R1, NZ
OP2
Branch Speed Up : �Early target address generation
IF IF IF D TIF TIF TIF
AG
BC
Branch Prediction
Strategies:
Static Branch Prediction
Total 68.2%
Branch Target Capture
instr addr pred stats target
target addr
target instr
prob of target change < 5%
BTB Performance
BTB miss
go inline
inline
BTB hit
go to target
decision
result
target inline
target
delay
0
5
4
0
.4 .6
.8 .2
.2 .8
.4*.8*0 + .4*.2*5 + .6*.2*4 + .6*.8*0 = 0.88 (Eff.Delay)
Compute/fetch scheme
I - cache
I
F
A�R
+
Instruction
Fetch address
Compute
BTA
BTA
IIFA
Next sequential
address
A I I + 1 I + 2 I + 3
BTI BTI+1 BTI+2 BTI+3
(no dynamic branch prediction)
BTAC scheme
I - cache
I
F
A�R
+
Instruction
Fetch address
BTA
IIFA
Next sequential
address
A I I + 1 I + 2 I + 3
BTI BTI+1 BTI+2 BTI+3
BTAC
BA BTA
ILP in VLIW processors
Cache/
memory
Fetch
Unit
Single multi-operation instruction
multi-operation instruction
FU
FU
FU
Register file
ILP in Superscalar processors
Cache/
memory
Fetch
Unit
Multiple instruction
Sequential stream of instructions
FU
FU
FU
Register file
Decode
and issue
unit
Instruction/control
Data
FU
Funtional Unit
Why Superscalars are popular ?
slide 69
Hierarchical structure
M
e
m
o
r
y
C
P
U
M
e
m
o
r
y
S
i
z
e
C
o
s
t
/
b
i
t
S
p
e
e
d
S
m
a
l
l
e
s
t
B
i
g
g
e
s
t
H
i
g
h
e
s
t
L
o
w
e
s
t
F
a
s
t
e
s
t
S
l
o
w
e
s
t
M
e
m
o
r
y
Data transfer between levels
unit of transfer = block
access
hit
miss
Principle of locality & Cache Policies
============================
Load policies
4 AU Block
Cache miss on AU 1
Block Load
Load Forward
Fetch Bypass
(wrap around
load)
0
1
2
3
Fetch Policies
questions:
Write Policies
Cache Types
Instruction | Data | Unified | Split
Split vs. Unified:
On-chip | Off-chip
Single level | Multi level
References
Thanks