1 of 78

Review of Computer Architecture

A Sahu

Deptt. of Comp. Sc. & Engg.

IIT Guwahati

2 of 78

Outline

  • Computer organization Vs Architecture
  • Processor architecture
  • Pipeline architecture
    • Data, resource and branch hazards
  • Superscalar & VLIW architecture
  • Memory hierarchy
  • Reference

3 of 78

Computer organization Vs Architecture

Comp Organization => Digital Logic Module

Logic and Low level

============================

Comp Architecture => ISA Design, MicroArch Design

Algorithms for:

  • Designing the best micro-architecture
  • Pipeline model
  • Branch prediction strategy, memory management
  • Etc.

4 of 78

Hardware abstraction

[Figure: PC hardware organization — CPU (ALU, register file, PC, bus interface) connects over the system bus and bridge to main memory (memory bus), and over the I/O bus to a disk controller (disk), graphics adapter (display), USB controller (mouse, keyboard), and expansion slots for other devices such as network adapters]

5 of 78

Hardware/software interface

[Figure: abstraction stack — software (C++ → m/c instr) above the hardware/software interface; hardware (reg, adder → transistors) below]

Arch. focus

  • Instruction set architecture
    • Lowest level visible to a programmer
  • Micro architecture
    • Fills the gap between instructions and logic modules

6 of 78

Instruction Set Architecture

  • Assembly Language View
    • Processor state (RF, mem)
    • Instruction set and encoding
  • Layer of Abstraction
    • Above: how to program machine - HLL, OS
    • Below: what needs to be built - tricks to make it run fast

[Figure: ISA as the interface — Application Program, Compiler, OS above the ISA; CPU Design, Circuit Design, Chip Layout below it]

7 of 78

The Abstract Machine

  • Programmer-Visible State
    • PC Program Counter
    • Register File
      • heavily used data
    • Condition Codes

[Figure: CPU (PC, registers, condition codes, ALU) exchanges addresses, data, and instructions with Memory (code + data, stack)]

    • Memory
      • Byte array
      • Code + data
      • stack


8 of 78

Instructions

  • Language of Machine
  • Easily interpreted
  • primitive compared to HLLs

  • Instruction set design goals
    • maximize performance,
    • minimize cost,
    • reduce design time

9 of 78

Instructions

  • All MIPS instructions: 32 bits long, with 3 operands
    • Operand order is fixed (destination first)
    • Example: C code: A = B + C → MIPS code: add $s0, $s1, $s2 (registers associated with variables by the compiler)

  • Registers are numbered 0..31, e.g., $t0=8, $t1=9, $s0=16, $s1=17, etc.
  • Encoding: 000000 10001 10010 01000 00000 100000 (op rs rt rd shamt funct)
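The field packing above can be checked with a short sketch (the helper name and the Python rendering are illustrative; the 6/5/5/5/5/6-bit field widths are the R-format shown on the slide):

```python
def encode_r(op, rs, rt, rd, shamt, funct):
    """Pack the six MIPS R-format fields (6/5/5/5/5/6 bits) into a 32-bit word."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

# add $t0, $s1, $s2 -> fields 0, 17, 18, 8, 0, 32 (funct 0x20 = add)
word = encode_r(0, 17, 18, 8, 0, 32)
print(f"{word:032b}")  # 00000010001100100100000000100000
```

Grouping the printed bits as 6/5/5/5/5/6 reproduces exactly the encoding on the slide.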

10 of 78

Instructions LD/ST & Control

  • Load and store instructions
  • Example:
    C code: A[8] = h + A[8];
    MIPS code: lw $t0, 32($s3)
               add $t0, $s2, $t0
               sw $t0, 32($s3)
  • Example: lw $t0, 32($s2)
    35 18 8 32 (op rs rt 16-bit number)
  • Example:
    if (i != j)        beq $s4, $s5, Lab1
      h = i + j;       add $s3, $s4, $s5
    else               j Lab2
      h = i - j;       Lab1: sub $s3, $s4, $s5
                       Lab2: ...
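The I-format packing works the same way (sketch; the helper name is illustrative, and the immediate is masked to 16 bits as the format requires):

```python
def encode_i(op, rs, rt, imm):
    """Pack MIPS I-format fields: 6-bit op, 5-bit rs, 5-bit rt, 16-bit immediate."""
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)

# lw $t0, 32($s2): op=35, rs=18 ($s2), rt=8 ($t0), immediate offset 32
print(hex(encode_i(35, 18, 8, 32)))
```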

11 of 78

What constitutes ISA?

  • Set of basic/primitive operations
    • Arithmetic, Logical, Relational, Branch/jump, Data movement
  • Storage structure – registers/memory
    • Register-less machine, ACC based machine, A few special purpose registers, Several Gen purpose registers, Large number of registers
  • How addresses are specified
    • Direct, Indirect, Base vs. Index, Auto incr and auto decr, Pre (post) incr/decr, Stack
  • How operand are specified
    • 3 address machine r1 = r2 + r3, 2 address machine r1 = r1 + r2
    • 1 address machine Acc = Acc + x (Acc is implicit)
    • 0 address machine add values on (top of stack)
  • How instructions are encoded
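The 0-address style can be made concrete with a tiny stack-machine sketch (a hypothetical interpreter, not any real ISA):

```python
# Hypothetical 0-address (stack) machine: operands are implicit on the stack.
def run_stack(program):
    stack = []
    for op, *arg in program:
        if op == "push":
            stack.append(arg[0])
        elif op == "add":            # pops two values, pushes their sum
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
    return stack

# Equivalent of the 3-address r1 = r2 + r3 with r2=2, r3=3:
print(run_stack([("push", 2), ("push", 3), ("add",)]))  # [5]
```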

12 of 78

RISC vs. CISC

  • RISC
    • Uniformity of instructions,
    • Simple set of operations and addressing modes,
    • Register based architecture with 3 address instructions
  • RISC: virtually all new ISAs since 1982
    • ARM, MIPS, SPARC, HP’s PA-RISC, PowerPC, Alpha, CDC 6600
  • CISC: minimize code size, make assembly language easy
    • VAX: instructions from 1 to 54 bytes long!
    • Motorola 680x0, Intel 80x86

13 of 78

MIPS subset for implementation

  • Arithmetic - logic instructions
    • add, sub, and, or, slt
  • Memory reference instructions
    • lw, sw
  • Control flow instructions
    • beq, j

Incremental changes in the design to include other instructions will be discussed later

14 of 78

Design overview

[Figure: design overview — PC supplies the Address to Instruction memory, which outputs the Instruction; instruction fields drive the Register # inputs of the Registers (Data in/out); the ALU operates on register data; Data memory takes an Address and exchanges Data]

  • Use the program counter (PC) to supply instruction address
  • Get the instruction from memory
  • Read registers
  • Use the instruction to decide exactly what to do

15 of 78

Division into data path and control

[Figure: DATA PATH and CONTROLLER — the controller sends control signals to the datapath and receives status signals back]

16 of 78

Building block types

Two types of functional units:

  • elements that operate on data values (combinational)
    • output is function of current input, no memory
    • Examples
      • gates (and, or, nand, nor, xor, inverter), multiplexer, decoder, adder, subtractor, comparator, ALU, array multipliers
  • elements that contain state (sequential)
    • output is function of current and previous inputs
    • state = memory
    • Examples:
      • flip-flops, counters, registers, register files, memories

17 of 78

Components for MIPS subset

  • Register,
  • Adder
  • ALU
  • Multiplexer
  • Register file
  • Program memory
  • Data memory
  • Bit manipulation components

18 of 78

Components - register

[Figure: 32-bit register (PC) — 32-bit input, 32-bit output, clock input]

19 of 78

Components - adder

[Figure: 32-bit adders — one computes PC + 4, another computes (PC+4) + offset]

20 of 78

Components - ALU

[Figure: ALU — 32-bit inputs a and b, operation select, 32-bit result, flags a=b and overflow]

21 of 78

Components - multiplexers

[Figure: 2-to-1 multiplexer — selects between PC+4 and PC+4+offset (32 bits) under a select signal]

22 of 78

Components - register file

[Figure: register file — Read register 1, Read register 2, and Write register inputs (5-bit register numbers each); Write data in; Read data 1 and Read data 2 out; RegWrite control]

23 of 78

Components - program memory

[Figure: program (instruction) memory — Instruction address in, Instruction out]

24 of 78

MIPS components - data memory

[Figure: data memory — Address and Write data in, Read data out, MemRead and MemWrite controls]

25 of 78

Components - bit manipulation circuits

[Figure: bit-manipulation components — sign extend (16 → 32 bits, MSB replicated) and shift (32 → 32 bits, 0 into the LSB)]

26 of 78

MIPS subset for implementation

  • Arithmetic - logic instructions
    • add, sub, and, or, slt
  • Memory reference instructions
    • lw, sw
  • Control flow instructions
    • beq, j

27 of 78

Datapath for add,sub,and,or,slt

  • Fetch instruction
  • Address the register file
  • Pass operands to ALU
  • Pass result to register file
  • Increment PC

Format: add $t0, $s1, $s2

000000 10001 10010 01000 00000 100000 (op rs rt rd shamt funct)

28 of 78

Fetching instruction

[Figure: PC drives the IM address (ad); IM outputs the instruction (ins)]

29 of 78

Addressing RF

[Figure: as before, plus ins[25-21] → rad1 and ins[20-16] → rad2 of the RF]

30 of 78

Passing operands to ALU

[Figure: RF outputs rd1 and rd2 feed the two ALU inputs]

31 of 78

Passing the result to RF

[Figure: the ALU result returns to the RF write-data input (wd); ins[15-11] → wad]

32 of 78

Incrementing PC

[Figure: an adder computes PC + 4, which is written back into the PC]

33 of 78

Load and Store instructions

  • Format: I

  • Example: lw $t0, 32($s2)
    35 18 8 32 (op rs rt 16-bit number)

34 of 78

Adding “sw” instruction

[Figure: sx sign-extends ins[15-0] (16 → 32 bits); a mux selects rd2 or the extended immediate as the second ALU input; the ALU output addresses the DM, and rd2 supplies the DM write data (wd)]

35 of 78

Adding “lw” instruction

[Figure: as for sw, plus the DM read data (rd) and the ALU result feed a mux back to the RF write data; another mux selects ins[20-16] or ins[15-11] as the write address]

36 of 78

Adding “beq” instruction

[Figure: a second adder computes the branch target, PC+4 plus the sign-extended offset shifted left 2 (s2); a mux selects between PC+4 and the target as the next PC]

37 of 78

Adding “j” instruction

[Figure: the jump address ja[31-0] is formed from PC+4[31-28] concatenated with ins[25-0] shifted left 2 (28 bits); a further mux selects it as the next PC]

38 of 78

Control signals

[Figure: the datapath annotated with its control signals — Rdst, jmp, Asrc, Psrc, M2R, MR (MemRead), MW (MemWrite), RW (RegWrite), the ALU zero flag Z, and the 3-bit ALU op]

39 of 78

Datapath + Control

[Figure: datapath + control — a control unit decodes ins[31-26] to produce Rdst, jmp, Psrc, brn, MR, MW, M2R, RW, Asrc and a 2-bit opc; an ALU control (Actrl) combines opc with ins[5-0] to form the 3-bit ALU op; Z from the ALU feeds the branch logic]

40 of 78

Analyzing performance

Component delays

  • Register 0
  • Adder t+
  • ALU tA
  • Multiplexer 0
  • Register file tR
  • Program memory tI
  • Data memory tM
  • Bit manipulation components 0
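How these delays combine in the single-cycle design can be sketched as follows (the numeric values are assumed purely for illustration; only the symbolic sums follow the component list above):

```python
# Illustrative component delays in ns (assumed values, not from the slides).
t_plus, tA, tR, tI, tM = 0.5, 2.0, 1.0, 2.0, 2.0

# Critical paths per instruction class in the single-cycle design
# (register, multiplexer, and bit-manipulation delays are taken as 0, as above).
paths = {
    "R-class": tI + tR + tA + tR,       # fetch, read RF, ALU, write RF
    "lw":      tI + tR + tA + tM + tR,  # ...plus data memory read
    "sw":      tI + tR + tA + tM,       # no register write-back
    "beq":     tI + tR + tA,            # compare; target adder runs in parallel
    "j":       tI,                      # fetch only
}
clock_period = max(paths.values())      # the clock must fit the slowest class
print(clock_period)
```

With these assumed numbers, lw sets the clock period, which is exactly why the multi-cycle and pipelined designs on the following slides pay off.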

41 of 78

Delay for {add, sub, and, or, slt}

[Figure: active path for the R-class — IM fetch, RF read, ALU, RF write]

42 of 78

Delay for {sw}

[Figure: active path for sw — IM fetch, RF read, sign extend, ALU, DM write]

43 of 78

Clock period in single cycle design

[Chart: per-instruction critical paths — R-class: tI + tR + tA + tR; lw: tI + tR + tA + tM + tR; sw: tI + tR + tA + tM; beq: tI + tR + tA (t+ in parallel); j: tI, t+. The single-cycle clock period must cover the longest path (lw).]

44 of 78

Clock period in multi-cycle design

[Chart: the same delays split across cycles — the clock period now covers only the slowest single component, and different instruction classes take different numbers of cycles.]

45 of 78

Cycle time and CPI

               cycle time   CPI
single-cycle   long         low
multi-cycle    short        high
pipelined      short        low

46 of 78

Pipelined datapath (abstract)

[Figure: pipelined datapath with IF, ID, EX, Mem, WB stages separated by the IF/ID, ID/EX, EX/Mem, and Mem/WB pipeline registers]

47 of 78

Fetch new instruction every cycle

[Figure: the same pipelined datapath — the PC and IF/ID register are written every cycle, so a new instruction is fetched each cycle]

48 of 78

Pipelined processor design

[Figure: full pipelined design with the control and Actrl units and hazard-handling signals — bubble, IF/IDw, PCw (stall: PCw=0, IF/IDw=0, bubble=1) and flush]

49 of 78

Graphical representation

[Figure: 5-stage pipeline — stages IF, ID, EX, Mem, WB with actions IM fetch, RF read, ALU, DM access, RF write]

50 of 78

Usage of stages by instructions

[Figure: stage usage per instruction — lw uses all five actions (IM, RF, ALU, DM, RF); sw skips the RF write-back; add skips the DM access; beq skips both DM and write-back]

51 of 78

Pipelining

Stages: IF – D – RF – EX/AG – M – WB

  • Faster throughput with pipelining

Simple multi-cycle design:

  • Resource sharing across cycles
  • All instructions may not take the same number of cycles

52 of 78

Degree of overlap and depth

  • Degree of overlap: serial → overlapped → pipelined
  • Depth: shallow → deep

53 of 78

Hazards in Pipelining

  • Procedural dependencies => Control hazards
    • cond and uncond branches, calls/returns
  • Data dependencies => Data hazards
    • RAW (read after write)
    • WAR (write after read)
    • WAW (write after write)
  • Resource conflicts => Structural hazards
    • use of same resource in different stages

54 of 78

Data Hazards

[Figure: data hazard — the current instr reads a value the previous instr writes; worst-case read-after-write delay = 3 cycles]

55 of 78

Structural Hazards

  • Use of a hardware resource in more than one cycle

  • Different sequences of resource usage by different instructions

  • Non-pipelined multi-cycle resources

[Figure: instructions with different resource-usage sequences (resources A, B, C, D across cycles); when overlapping instructions (F D X X patterns) need the same resource in the same cycle, the pipeline stalls]

Caused by Resource Conflicts

56 of 78

Control Hazards

[Figure: after a branch instr, fetching the next inline instr vs. the target instr — delay = 5 when the condition is evaluated late; delay = 2 with early cond eval and target addr gen]

  • the order of cond eval and target addr gen may be different
  • cond eval may be done in a previous instruction

57 of 78

Pipeline Performance

CPI = 1 + (S − 1) · b

Time per instruction = CPI · T / S

where T = unpipelined instruction time, S = number of pipeline stages, b = frequency of interruptions
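The two formulas can be checked numerically (the parameter values here are assumed for illustration):

```python
def pipeline_cpi(S, b):
    """CPI = 1 + (S - 1) * b for an S-stage pipeline interrupted with frequency b."""
    return 1 + (S - 1) * b

def time_per_instruction(S, b, T):
    """Time = CPI * T / S, where T is the unpipelined instruction time."""
    return pipeline_cpi(S, b) * T / S

# Assumed: 5 stages, an interruption every 10th instruction (b = 0.1), T = 10 ns:
print(pipeline_cpi(5, 0.1))              # 1.4
print(time_per_instruction(5, 0.1, 10))  # ≈ 2.8 ns per instruction
```

With b = 0 the pipeline reaches the ideal CPI of 1 and a speedup of S over the unpipelined machine.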

58 of 78

Improving Branch Performance

  • Branch Elimination
    • Replace branch with other instructions
  • Branch Speed Up
    • Reduce time for computing CC and TIF
  • Branch Prediction
    • Guess the outcome and proceed, undo if necessary
  • Branch Target Capture
    • Make use of history
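The dynamic ("recent history") strategy is commonly realized with a 2-bit saturating counter per branch; a minimal sketch (the class and branch address are illustrative):

```python
class TwoBitPredictor:
    """2-bit saturating counter per branch: states 0-1 predict not-taken,
    states 2-3 predict taken. A common dynamic branch-prediction scheme."""
    def __init__(self):
        self.counters = {}            # branch address -> counter state

    def predict(self, addr):
        return self.counters.get(addr, 0) >= 2   # True = predict taken

    def update(self, addr, taken):
        c = self.counters.get(addr, 0)
        self.counters[addr] = min(3, c + 1) if taken else max(0, c - 1)

p = TwoBitPredictor()
outcomes = [True, True, True, True]   # hypothetical history for one branch
hits = 0
for taken in outcomes:
    hits += (p.predict(0x400) == taken)
    p.update(0x400, taken)
print(hits)  # 2 — correct once the counter warms up past state 2
```

The 2-bit hysteresis means a single anomalous outcome (e.g. a loop exit) does not immediately flip the prediction.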

59 of 78

Branch Elimination

Use conditional instructions (predicated execution): C : S

Without predication:       With predication:
  OP1                        OP1
  BC CC = Z, * + 2           ADD R3, R2, R1, NZ
  ADD R3, R2, R1             OP2
  OP2
60 of 78

Branch Speed Up : Early target address generation

  • Assume each instruction is Branch
  • Generate target address while decoding
  • If target in same page omit translation
  • After decoding discard target address if not Branch

[Figure: IF IF IF | D | TIF TIF TIF — target address generation (AG) overlaps decode; the branch condition (BC) is resolved later]

61 of 78

Branch Prediction

  • Treat conditional branches as unconditional branches / NOP
  • Undo if necessary

Strategies:

    • Fixed (always guess inline)
    • Static (guess on the basis of instruction type)
    • Dynamic (guess based on recent history)

62 of 78

Static Branch Prediction

[Chart: static prediction accuracy by instruction type — overall 68.2%]

63 of 78

Branch Target Capture

  • Branch Target Buffer (BTB)
  • Target Instruction Buffer (TIB)

BTB entry: instr addr | prediction stats | target addr
TIB entry: instr addr | target instr

Probability of target change < 5%

64 of 78

BTB Performance

decision                 actual result   prob        delay
BTB miss (go inline)     inline          .4 × .8     0
                         target          .4 × .2     5
BTB hit (go to target)   inline          .6 × .2     4
                         target          .6 × .8     0

Effective delay = .4·.8·0 + .4·.2·5 + .6·.2·4 + .6·.8·0 = 0.88
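The expected-delay sum can be reproduced directly (probabilities and delays taken from the decision tree above):

```python
# Effective delay of the BTB scheme: sum over (decision, actual result) cases.
p_hit = 0.6                      # probability the branch is found in the BTB
p_taken_given_hit = 0.8          # on a hit, the predicted target is taken
p_taken_given_miss = 0.2         # on a miss, the branch turns out taken
delay = {("miss", "inline"): 0,  # guessed inline, went inline
         ("miss", "target"): 5,  # guessed inline, went to target
         ("hit", "inline"): 4,   # guessed target, went inline
         ("hit", "target"): 0}   # guessed target, went to target

eff = ((1 - p_hit) * (1 - p_taken_given_miss) * delay[("miss", "inline")]
       + (1 - p_hit) * p_taken_given_miss * delay[("miss", "target")]
       + p_hit * (1 - p_taken_given_hit) * delay[("hit", "inline")]
       + p_hit * p_taken_given_hit * delay[("hit", "target")])
print(eff)  # ≈ 0.88
```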

65 of 78

Compute/fetch scheme

[Figure: compute/fetch scheme (no dynamic branch prediction) — an instruction fetch address register (IFAR) feeds the I-cache; an adder produces the next sequential address while the branch target address (BTA) is computed from the fetched instruction; sequential instructions I, I+1, I+2, I+3 then target instructions BTI, BTI+1, BTI+2, BTI+3]

66 of 78

BTAC scheme

[Figure: BTAC scheme — the fetch address also probes the branch target address cache (BA → BTA); on a hit the BTA replaces the next sequential address, so BTI, BTI+1, ... follow I, I+1, I+2, I+3 without a gap]

67 of 78

ILP in VLIW processors

[Figure: VLIW — the fetch unit reads a single multi-operation instruction from cache/memory; its operations issue directly to multiple FUs sharing one register file]

68 of 78

ILP in Superscalar processors

[Figure: superscalar — the fetch unit reads a sequential stream of instructions from cache/memory; a decode-and-issue unit dispatches multiple instructions per cycle to the FUs (functional units), which share one register file]

69 of 78

Why are superscalars popular?

  • Binary code compatibility among scalar & superscalar processors of same family
  • Same compiler works for all processors (scalars and superscalars) of same family
  • Assembly programming of VLIWs is tedious
  • Code density in VLIWs is very poor - Instruction encoding schemes


70 of 78

Hierarchical structure

[Figure: memory hierarchy — CPU at the top; levels near the CPU are smallest, highest cost/bit, and fastest; memory levels further away are biggest, lowest cost/bit, and slowest]

71 of 78

Data transfer between levels

[Figure: an access to a level either hits or misses; the unit of transfer between levels = block]

72 of 78

Principle of locality & Cache Policies

  • Temporal Locality
    • references repeated in time
  • Spatial Locality
    • references repeated in space
    • Special case: Sequential Locality

============================

  • Read
    • Sequential / Concurrent
    • Simple / Forward
  • Load
    • Block load / Load forward / Wrap around
  • Replacement
    • LRU / LFU / FIFO / Random
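The LRU replacement policy can be sketched in a few lines (a minimal model counting capacity in blocks, not a full cache with sets and tags):

```python
from collections import OrderedDict

class LRUCache:
    """Minimal sketch of LRU replacement: on a miss with a full cache,
    the least-recently-used block is evicted."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()    # block address -> data, oldest first

    def access(self, addr):
        if addr in self.blocks:        # hit: mark block most-recently-used
            self.blocks.move_to_end(addr)
            return True
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)   # evict the LRU block
        self.blocks[addr] = None       # load the missing block
        return False

c = LRUCache(2)
hits = [c.access(a) for a in [1, 2, 1, 3, 2]]
print(hits)  # [False, False, True, False, False]
```

The third access hits because block 1 is still resident; accessing block 3 then evicts block 2, which is the least recently used, so the final access to 2 misses again.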

73 of 78

Load policies

[Figure: 4-AU block, cache miss on AU 1 — Block Load fetches AUs 0,1,2,3 in order; Load Forward starts at AU 1 and continues forward; Fetch Bypass (wrap-around load) fetches 1,2,3 and wraps to 0]

74 of 78

Fetch Policies

  • Demand fetching
    • fetch only when required (miss)
  • Hardware prefetching
    • automatically prefetch next block
  • Software prefetching
    • programmer decides to prefetch

questions:

    • how much ahead (prefetch distance)
    • how often

75 of 78

Write Policies

  • Write Hit
    • Write Back
    • Write Through
  • Write Miss
    • Write Back
    • Write Through
      • With Write Allocate
      • With No Write Allocate

76 of 78

Cache Types

Instruction | Data | Unified | Split

Split vs. Unified:

      • Split allows specializing each part
      • Unified allows best use of the capacity

On-chip | Off-chip

      • on-chip : fast but small
      • off-chip : large but slow

Single level | Multi level

77 of 78

References

  1. Patterson D. A. and Hennessy J. L., Computer Organization and Design: The Hardware/Software Interface, Morgan Kaufmann, 2000
  2. Sima D., Fountain T. and Kacsuk P., Advanced Computer Architectures: A Design Space Approach, Pearson Education, 1998
  3. Flynn M. J., Computer Architecture: Pipelined and Parallel Processor Design, Narosa Publishing House, India, 1999
  4. Hennessy J. L. and Patterson D. A., Computer Architecture: A Quantitative Approach, 2nd Ed., Morgan Kaufmann, 2001

78 of 78

Thanks