1 of 78

Review of Computer Architecture

A Sahu

Deptt. of Comp. Sc. & Engg.

IIT Guwahati

2 of 78

Outline

  • Computer organization Vs Architecture
  • Processor architecture
  • Pipeline architecture
    • Data, resource and branch hazards
  • Superscalar & VLIW architecture
  • Memory hierarchy
  • Reference

3 of 78

Computer organization Vs Architecture

Comp Organization => Digital Logic Module

Logic and Low level

============================

Comp Architecture => ISA Design, MicroArch Design

Algorithms for:

  • Designing the best micro-architecture
  • Pipeline model
  • Branch prediction strategy, memory management
  • Etc.

4 of 78

Hardware abstraction

[Figure: PC hardware organization — CPU (ALU, register file, PC, bus interface) connects over the system bus and bridge to main memory (memory bus), and over the I/O bus to a disk controller (disk), graphics adapter (display), USB controller (mouse, keyboard), and expansion slots for other devices such as network adapters]

5 of 78

Hardware/software interface

[Figure: abstraction stack — software (C++ → m/c instr) above the hardware/software interface; hardware (reg, adder → transistors) below]

Arch. focus

  • Instruction set architecture
    • Lowest level visible to a programmer
  • Micro architecture
    • Fills the gap between instructions and logic modules

6 of 78

Instruction Set Architecture

  • Assembly Language View
    • Processor state (RF, mem)
    • Instruction set and encoding
  • Layer of Abstraction
    • Above: how to program machine - HLL, OS
    • Below: what needs to be built - tricks to make it run fast

[Figure: ISA as the interface — Application Program, Compiler, OS above the ISA; CPU Design, Circuit Design, Chip Layout below it]

7 of 78

The Abstract Machine

  • Programmer-Visible State
    • PC Program Counter
    • Register File
      • heavily used data
    • Condition Codes

[Figure: CPU (PC, registers, condition codes, ALU) exchanges addresses, data, and instructions with Memory (code + data, stack)]

    • Memory
      • Byte array
      • Code + data
      • stack


8 of 78

Instructions

  • Language of Machine
  • Easily interpreted
  • primitive compared to HLLs

  • Instruction set design goals
    • maximize performance,
    • minimize cost,
    • reduce design time

9 of 78

Instructions

  • All MIPS instructions: 32 bits long, with 3 operands
    • Operand order is fixed (destination first)
    • Example: C code: A = B + C → MIPS code: add $s0, $s1, $s2 (registers associated with variables by the compiler)

  • Registers are numbered 0..31, e.g., $t0=8, $t1=9, $s0=16, $s1=17, etc.
  • Encoding: 000000 10001 10010 01000 00000 100000 (op rs rt rd shamt funct)
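The field packing above can be checked with a short sketch (the helper name and the Python rendering are illustrative; the 6/5/5/5/5/6-bit field widths are the R-format shown on the slide):

```python
def encode_r(op, rs, rt, rd, shamt, funct):
    """Pack the six MIPS R-format fields (6/5/5/5/5/6 bits) into a 32-bit word."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

# add $t0, $s1, $s2 -> fields 0, 17, 18, 8, 0, 32 (funct 0x20 = add)
word = encode_r(0, 17, 18, 8, 0, 32)
print(f"{word:032b}")  # 00000010001100100100000000100000
```

Grouping the printed bits as 6/5/5/5/5/6 reproduces exactly the encoding on the slide.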

10 of 78

Instructions LD/ST & Control

  • Load and store instructions
  • Example:
    C code: A[8] = h + A[8];
    MIPS code: lw $t0, 32($s3)
               add $t0, $s2, $t0
               sw $t0, 32($s3)
  • Example: lw $t0, 32($s2)
    35 18 8 32 (op rs rt 16-bit number)
  • Example:
    if (i != j)        beq $s4, $s5, Lab1
      h = i + j;       add $s3, $s4, $s5
    else               j Lab2
      h = i - j;       Lab1: sub $s3, $s4, $s5
                       Lab2: ...
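The I-format packing works the same way (sketch; the helper name is illustrative, and the immediate is masked to 16 bits as the format requires):

```python
def encode_i(op, rs, rt, imm):
    """Pack MIPS I-format fields: 6-bit op, 5-bit rs, 5-bit rt, 16-bit immediate."""
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)

# lw $t0, 32($s2): op=35, rs=18 ($s2), rt=8 ($t0), immediate offset 32
print(hex(encode_i(35, 18, 8, 32)))
```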

11 of 78

What constitutes ISA?

  • Set of basic/primitive operations
    • Arithmetic, Logical, Relational, Branch/jump, Data movement
  • Storage structure – registers/memory
    • Register-less machine, ACC based machine, A few special purpose registers, Several Gen purpose registers, Large number of registers
  • How addresses are specified
    • Direct, Indirect, Base vs. Index, Auto incr and auto decr, Pre (post) incr/decr, Stack
  • How operand are specified
    • 3 address machine r1 = r2 + r3, 2 address machine r1 = r1 + r2
    • 1 address machine Acc = Acc + x (Acc is implicit)
    • 0 address machine add values on (top of stack)
  • How instructions are encoded
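The 0-address style can be made concrete with a tiny stack-machine sketch (a hypothetical interpreter, not any real ISA):

```python
# Hypothetical 0-address (stack) machine: operands are implicit on the stack.
def run_stack(program):
    stack = []
    for op, *arg in program:
        if op == "push":
            stack.append(arg[0])
        elif op == "add":            # pops two values, pushes their sum
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
    return stack

# Equivalent of the 3-address r1 = r2 + r3 with r2=2, r3=3:
print(run_stack([("push", 2), ("push", 3), ("add",)]))  # [5]
```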

12 of 78

RISC vs. CISC

  • RISC
    • Uniformity of instructions,
    • Simple set of operations and addressing modes,
    • Register based architecture with 3 address instructions
  • RISC: virtually all new ISAs since 1982
    • ARM, MIPS, SPARC, HP’s PA-RISC, PowerPC, Alpha, CDC 6600
  • CISC: minimize code size, make assembly language easy
    • VAX: instructions from 1 to 54 bytes long!
    • Motorola 680x0, Intel 80x86

13 of 78

MIPS subset for implementation

  • Arithmetic - logic instructions
    • add, sub, and, or, slt
  • Memory reference instructions
    • lw, sw
  • Control flow instructions
    • beq, j

Incremental changes in the design to include other instructions will be discussed later

14 of 78

Design overview

[Figure: design overview — PC supplies the Address to Instruction memory, which outputs the Instruction; instruction fields drive the Register # inputs of the Registers (Data in/out); the ALU operates on register data; Data memory takes an Address and exchanges Data]

  • Use the program counter (PC) to supply instruction address
  • Get the instruction from memory
  • Read registers
  • Use the instruction to decide exactly what to do

15 of 78

Division into data path and control

[Figure: DATA PATH and CONTROLLER — the controller sends control signals to the datapath and receives status signals back]

16 of 78

Building block types

Two types of functional units:

  • elements that operate on data values (combinational)
    • output is function of current input, no memory
    • Examples
      • gates (and, or, nand, nor, xor, inverter), multiplexer, decoder, adder, subtractor, comparator, ALU, array multipliers
  • elements that contain state (sequential)
    • output is function of current and previous inputs
    • state = memory
    • Examples:
      • flip-flops, counters, registers, register files, memories

17 of 78

Components for MIPS subset

  • Register,
  • Adder
  • ALU
  • Multiplexer
  • Register file
  • Program memory
  • Data memory
  • Bit manipulation components

18 of 78

Components - register

[Figure: 32-bit register (PC) — 32-bit input, 32-bit output, clock input]

19 of 78

Components - adder

[Figure: 32-bit adders — one computes PC + 4, another computes (PC+4) + offset]

20 of 78

Components - ALU

[Figure: ALU — 32-bit inputs a and b, operation select, 32-bit result, flags a=b and overflow]

21 of 78

Components - multiplexers

[Figure: 2-to-1 multiplexer — selects between PC+4 and PC+4+offset (32 bits) under a select signal]

22 of 78

Components - register file

[Figure: register file — Read register 1, Read register 2, and Write register inputs (5-bit register numbers each); Write data in; Read data 1 and Read data 2 out; RegWrite control]

23 of 78

Components - program memory

[Figure: program (instruction) memory — Instruction address in, Instruction out]

24 of 78

MIPS components - data memory

[Figure: data memory — Address and Write data in, Read data out, MemRead and MemWrite controls]

25 of 78

Components - bit manipulation circuits

[Figure: bit-manipulation components — sign extend (16 → 32 bits, MSB replicated) and shift (32 → 32 bits, 0 into the LSB)]

26 of 78

MIPS subset for implementation

  • Arithmetic - logic instructions
    • add, sub, and, or, slt
  • Memory reference instructions
    • lw, sw
  • Control flow instructions
    • beq, j

27 of 78

Datapath for add,sub,and,or,slt

  • Fetch instruction
  • Address the register file
  • Pass operands to ALU
  • Pass result to register file
  • Increment PC

Format: add $t0, $s1, $s2

000000 10001 10010 01000 00000 100000 (op rs rt rd shamt funct)

28 of 78

Fetching instruction

[Figure: PC drives the IM address (ad); IM outputs the instruction (ins)]

29 of 78

Addressing RF

[Figure: as before, plus ins[25-21] → rad1 and ins[20-16] → rad2 of the RF]

30 of 78

Passing operands to ALU

[Figure: RF outputs rd1 and rd2 feed the two ALU inputs]

31 of 78

Passing the result to RF

[Figure: the ALU result returns to the RF write-data input (wd); ins[15-11] → wad]

32 of 78

Incrementing PC

[Figure: an adder computes PC + 4, which is written back into the PC]

33 of 78

Load and Store instructions

  • Format: I

  • Example: lw $t0, 32($s2)
    35 18 8 32 (op rs rt 16-bit number)

34 of 78

Adding “sw” instruction

[Figure: sx sign-extends ins[15-0] (16 → 32 bits); a mux selects rd2 or the extended immediate as the second ALU input; the ALU output addresses the DM, and rd2 supplies the DM write data (wd)]

35 of 78

Adding “lw” instruction

[Figure: as for sw, plus the DM read data (rd) and the ALU result feed a mux back to the RF write data; another mux selects ins[20-16] or ins[15-11] as the write address]

36 of 78

Adding “beq” instruction

[Figure: a second adder computes the branch target, PC+4 plus the sign-extended offset shifted left 2 (s2); a mux selects between PC+4 and the target as the next PC]

37 of 78

Adding “j” instruction

[Figure: the jump address ja[31-0] is formed from PC+4[31-28] concatenated with ins[25-0] shifted left 2 (28 bits); a further mux selects it as the next PC]

38 of 78

Control signals

[Figure: the datapath annotated with its control signals — Rdst, jmp, Asrc, Psrc, M2R, MR (MemRead), MW (MemWrite), RW (RegWrite), the ALU zero flag Z, and the 3-bit ALU op]

39 of 78

Datapath + Control

[Figure: datapath + control — a control unit decodes ins[31-26] to produce Rdst, jmp, Psrc, brn, MR, MW, M2R, RW, Asrc and a 2-bit opc; an ALU control (Actrl) combines opc with ins[5-0] to form the 3-bit ALU op; Z from the ALU feeds the branch logic]

40 of 78

Analyzing performance

Component delays

  • Register 0
  • Adder t+
  • ALU tA
  • Multiplexer 0
  • Register file tR
  • Program memory tI
  • Data memory tM
  • Bit manipulation components 0
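How these delays combine in the single-cycle design can be sketched as follows (the numeric values are assumed purely for illustration; only the symbolic sums follow the component list above):

```python
# Illustrative component delays in ns (assumed values, not from the slides).
t_plus, tA, tR, tI, tM = 0.5, 2.0, 1.0, 2.0, 2.0

# Critical paths per instruction class in the single-cycle design
# (register, multiplexer, and bit-manipulation delays are taken as 0, as above).
paths = {
    "R-class": tI + tR + tA + tR,       # fetch, read RF, ALU, write RF
    "lw":      tI + tR + tA + tM + tR,  # ...plus data memory read
    "sw":      tI + tR + tA + tM,       # no register write-back
    "beq":     tI + tR + tA,            # compare; target adder runs in parallel
    "j":       tI,                      # fetch only
}
clock_period = max(paths.values())      # the clock must fit the slowest class
print(clock_period)
```

With these assumed numbers, lw sets the clock period, which is exactly why the multi-cycle and pipelined designs on the following slides pay off.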

41 of 78

Delay for {add, sub, and, or, slt}

[Figure: active path for the R-class — IM fetch, RF read, ALU, RF write]

42 of 78

Delay for {sw}

[Figure: active path for sw — IM fetch, RF read, sign extend, ALU, DM write]

43 of 78

Clock period in single cycle design

[Chart: per-instruction critical paths — R-class: tI + tR + tA + tR; lw: tI + tR + tA + tM + tR; sw: tI + tR + tA + tM; beq: tI + tR + tA (t+ in parallel); j: tI, t+. The single-cycle clock period must cover the longest path (lw).]

44 of 78

Clock period in multi-cycle design

[Chart: the same delays split across cycles — the clock period now covers only the slowest single component, and different instruction classes take different numbers of cycles.]

45 of 78

Cycle time and CPI

               cycle time   CPI
single-cycle   long         low
multi-cycle    short        high
pipelined      short        low

46 of 78

Pipelined datapath (abstract)

[Figure: pipelined datapath with IF, ID, EX, Mem, WB stages separated by the IF/ID, ID/EX, EX/Mem, and Mem/WB pipeline registers]

47 of 78

Fetch new instruction every cycle

[Figure: the same pipelined datapath — the PC and IF/ID register are written every cycle, so a new instruction is fetched each cycle]

48 of 78

Pipelined processor design

[Figure: full pipelined design with the control and Actrl units and hazard-handling signals — bubble, IF/IDw, PCw (stall: PCw=0, IF/IDw=0, bubble=1) and flush]

49 of 78

Graphical representation

[Figure: 5-stage pipeline — stages IF, ID, EX, Mem, WB with actions IM fetch, RF read, ALU, DM access, RF write]

50 of 78

Usage of stages by instructions

[Figure: stage usage per instruction — lw uses all five actions (IM, RF, ALU, DM, RF); sw skips the RF write-back; add skips the DM access; beq skips both DM and write-back]

51 of 78

Pipelining

Stages: IF – D – RF – EX/AG – M – WB

  • Faster throughput with pipelining

Simple multi-cycle design:

  • Resource sharing across cycles
  • All instructions may not take the same number of cycles

52 of 78

Degree of overlap and depth

  • Degree of overlap: serial → overlapped → pipelined
  • Depth: shallow → deep

53 of 78

Hazards in Pipelining

  • Procedural dependencies => Control hazards
    • cond and uncond branches, calls/returns
  • Data dependencies => Data hazards
    • RAW (read after write)
    • WAR (write after read)
    • WAW (write after write)
  • Resource conflicts => Structural hazards
    • use of same resource in different stages

54 of 78

Data Hazards

[Figure: data hazard — the current instr reads a value the previous instr writes; worst-case read-after-write delay = 3 cycles]

55 of 78

Structural Hazards

  • Use of a hardware resource in more than one cycle

  • Different sequences of resource usage by different instructions

  • Non-pipelined multi-cycle resources

[Figure: instructions with different resource-usage sequences (resources A, B, C, D across cycles); when overlapping instructions (F D X X patterns) need the same resource in the same cycle, the pipeline stalls]

Caused by Resource Conflicts

56 of 78

Control Hazards

[Figure: after a branch instr, fetching the next inline instr vs. the target instr — delay = 5 when the condition is evaluated late; delay = 2 with early cond eval and target addr gen]

  • the order of cond eval and target addr gen may be different
  • cond eval may be done in a previous instruction

57 of 78

Pipeline Performance

CPI = 1 + (S − 1) · b

Time per instruction = CPI · T / S

where T = unpipelined instruction time, S = number of pipeline stages, b = frequency of interruptions
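The two formulas can be checked numerically (the parameter values here are assumed for illustration):

```python
def pipeline_cpi(S, b):
    """CPI = 1 + (S - 1) * b for an S-stage pipeline interrupted with frequency b."""
    return 1 + (S - 1) * b

def time_per_instruction(S, b, T):
    """Time = CPI * T / S, where T is the unpipelined instruction time."""
    return pipeline_cpi(S, b) * T / S

# Assumed: 5 stages, an interruption every 10th instruction (b = 0.1), T = 10 ns:
print(pipeline_cpi(5, 0.1))              # 1.4
print(time_per_instruction(5, 0.1, 10))  # ≈ 2.8 ns per instruction
```

With b = 0 the pipeline reaches the ideal CPI of 1 and a speedup of S over the unpipelined machine.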

58 of 78

Improving Branch Performance

  • Branch Elimination
    • Replace branch with other instructions
  • Branch Speed Up
    • Reduce time for computing CC and TIF
  • Branch Prediction
    • Guess the outcome and proceed, undo if necessary
  • Branch Target Capture
    • Make use of history
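The dynamic ("recent history") strategy is commonly realized with a 2-bit saturating counter per branch; a minimal sketch (the class and branch address are illustrative):

```python
class TwoBitPredictor:
    """2-bit saturating counter per branch: states 0-1 predict not-taken,
    states 2-3 predict taken. A common dynamic branch-prediction scheme."""
    def __init__(self):
        self.counters = {}            # branch address -> counter state

    def predict(self, addr):
        return self.counters.get(addr, 0) >= 2   # True = predict taken

    def update(self, addr, taken):
        c = self.counters.get(addr, 0)
        self.counters[addr] = min(3, c + 1) if taken else max(0, c - 1)

p = TwoBitPredictor()
outcomes = [True, True, True, True]   # hypothetical history for one branch
hits = 0
for taken in outcomes:
    hits += (p.predict(0x400) == taken)
    p.update(0x400, taken)
print(hits)  # 2 — correct once the counter warms up past state 2
```

The 2-bit hysteresis means a single anomalous outcome (e.g. a loop exit) does not immediately flip the prediction.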

59 of 78

Branch Elimination

Use conditional instructions (predicated execution): C : S

Without predication:       With predication:
  OP1                        OP1
  BC CC = Z, * + 2           ADD R3, R2, R1, NZ
  ADD R3, R2, R1             OP2
  OP2
60 of 78

Branch Speed Up : Early target address generation

  • Assume each instruction is Branch
  • Generate target address while decoding
  • If target in same page omit translation
  • After decoding discard target address if not Branch

[Figure: IF IF IF | D | TIF TIF TIF — target address generation (AG) overlaps decode; the branch condition (BC) is resolved later]

61 of 78

Branch Prediction

  • Treat conditional branches as unconditional branches / NOP
  • Undo if necessary

Strategies:

    • Fixed (always guess inline)
    • Static (guess on the basis of instruction type)
    • Dynamic (guess based on recent history)

62 of 78

Static Branch Prediction

[Chart: static prediction accuracy by instruction type — overall 68.2%]

63 of 78

Branch Target Capture

  • Branch Target Buffer (BTB)
  • Target Instruction Buffer (TIB)

BTB entry: instr addr | prediction stats | target addr
TIB entry: instr addr | target instr

Probability of target change < 5%

64 of 78

BTB Performance

decision                 actual result   prob        delay
BTB miss (go inline)     inline          .4 × .8     0
                         target          .4 × .2     5
BTB hit (go to target)   inline          .6 × .2     4
                         target          .6 × .8     0

Effective delay = .4·.8·0 + .4·.2·5 + .6·.2·4 + .6·.8·0 = 0.88
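The expected-delay sum can be reproduced directly (probabilities and delays taken from the decision tree above):

```python
# Effective delay of the BTB scheme: sum over (decision, actual result) cases.
p_hit = 0.6                      # probability the branch is found in the BTB
p_taken_given_hit = 0.8          # on a hit, the predicted target is taken
p_taken_given_miss = 0.2         # on a miss, the branch turns out taken
delay = {("miss", "inline"): 0,  # guessed inline, went inline
         ("miss", "target"): 5,  # guessed inline, went to target
         ("hit", "inline"): 4,   # guessed target, went inline
         ("hit", "target"): 0}   # guessed target, went to target

eff = ((1 - p_hit) * (1 - p_taken_given_miss) * delay[("miss", "inline")]
       + (1 - p_hit) * p_taken_given_miss * delay[("miss", "target")]
       + p_hit * (1 - p_taken_given_hit) * delay[("hit", "inline")]
       + p_hit * p_taken_given_hit * delay[("hit", "target")])
print(eff)  # ≈ 0.88
```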

65 of 78

Compute/fetch scheme

[Figure: compute/fetch scheme (no dynamic branch prediction) — an instruction fetch address register (IFAR) feeds the I-cache; an adder produces the next sequential address while the branch target address (BTA) is computed from the fetched instruction; sequential instructions I, I+1, I+2, I+3 then target instructions BTI, BTI+1, BTI+2, BTI+3]

66 of 78

BTAC scheme

[Figure: BTAC scheme — the fetch address also probes the branch target address cache (BA → BTA); on a hit the BTA replaces the next sequential address, so BTI, BTI+1, ... follow I, I+1, I+2, I+3 without a gap]

67 of 78

ILP in VLIW processors

[Figure: VLIW — the fetch unit reads a single multi-operation instruction from cache/memory; its operations issue directly to multiple FUs sharing one register file]

68 of 78

ILP in Superscalar processors

[Figure: superscalar — the fetch unit reads a sequential stream of instructions from cache/memory; a decode-and-issue unit dispatches multiple instructions per cycle to the FUs (functional units), which share one register file]

69 of 78

Why are superscalars popular?

  • Binary code compatibility among scalar & superscalar processors of same family
  • Same compiler works for all processors (scalars and superscalars) of same family
  • Assembly programming of VLIWs is tedious
  • Code density in VLIWs is very poor - Instruction encoding schemes


70 of 78

Hierarchical structure

[Figure: memory hierarchy — CPU at the top; levels near the CPU are smallest, highest cost/bit, and fastest; memory levels further away are biggest, lowest cost/bit, and slowest]

71 of 78

Data transfer between levels

[Figure: an access to a level either hits or misses; the unit of transfer between levels = block]

72 of 78

Principle of locality & Cache Policies

  • Temporal Locality
    • references repeated in time
  • Spatial Locality
    • references repeated in space
    • Special case: Sequential Locality

============================

  • Read
    • Sequential / Concurrent
    • Simple / Forward
  • Load
    • Block load / Load forward / Wrap around
  • Replacement
    • LRU / LFU / FIFO / Random
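The LRU replacement policy can be sketched in a few lines (a minimal model counting capacity in blocks, not a full cache with sets and tags):

```python
from collections import OrderedDict

class LRUCache:
    """Minimal sketch of LRU replacement: on a miss with a full cache,
    the least-recently-used block is evicted."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()    # block address -> data, oldest first

    def access(self, addr):
        if addr in self.blocks:        # hit: mark block most-recently-used
            self.blocks.move_to_end(addr)
            return True
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)   # evict the LRU block
        self.blocks[addr] = None       # load the missing block
        return False

c = LRUCache(2)
hits = [c.access(a) for a in [1, 2, 1, 3, 2]]
print(hits)  # [False, False, True, False, False]
```

The third access hits because block 1 is still resident; accessing block 3 then evicts block 2, which is the least recently used, so the final access to 2 misses again.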

73 of 78

Load policies

[Figure: 4-AU block, cache miss on AU 1 — Block Load fetches AUs 0,1,2,3 in order; Load Forward starts at AU 1 and continues forward; Fetch Bypass (wrap-around load) fetches 1,2,3 and wraps to 0]

74 of 78

Fetch Policies

  • Demand fetching
    • fetch only when required (miss)
  • Hardware prefetching
    • automatically prefetch next block
  • Software prefetching
    • programmer decides to prefetch

questions:

    • how much ahead (prefetch distance)
    • how often

75 of 78

Write Policies

  • Write Hit
    • Write Back
    • Write Through
  • Write Miss
    • Write Back
    • Write Through
      • With Write Allocate
      • With No Write Allocate

76 of 78

Cache Types

Instruction | Data | Unified | Split

Split vs. Unified:

      • Split allows specializing each part
      • Unified allows best use of the capacity

On-chip | Off-chip

      • on-chip : fast but small
      • off-chip : large but slow

Single level | Multi level

77 of 78

References

  1. Patterson D. A. and Hennessy J. L., Computer Organization and Design: The Hardware/Software Interface, Morgan Kaufmann, 2000
  2. Sima D., Fountain T. and Kacsuk P., Advanced Computer Architectures: A Design Space Approach, Pearson Education, 1998
  3. Flynn M. J., Computer Architecture: Pipelined and Parallel Processor Design, Narosa Publishing House, India, 1999
  4. Hennessy J. L. and Patterson D. A., Computer Architecture: A Quantitative Approach, 2nd Ed., Morgan Kaufmann, 2001

78 of 78

Thanks