1 of 52

The Processor�Single-Cycle Implementation

Chapter 4

Sections: 4.1-4.4, Appendix A, Appendix C.1 and C.2

Prof. Iyad Jafar

https://sites.google.com/view/iyadjafar

iyad.jafar@ju.edu.jo

2 of 52

Outline

  • Introduction
  • Logic Design Conventions
  • Building Single-Cycle Datapath
  • Building Single-Cycle Control Unit
  • Single-Cycle Performance

2

Prof. Iyad Jafar

3 of 52

Introduction

3

Prof. Iyad Jafar

4 of 52

Introduction

  • So far, we have built a small ALU for some instructions:
    • Add, sub, and, or, slt …
  • What about:
    • Memory and registers?
    • Controlling the ALU based on the instruction?
    • Decoding the instruction?
  • The big picture for a processor:
    • A datapath through which data is moved around.
    • A control unit to manage data.
  • Different implementation schemes:
    • Single-cycle, Multi-cycle, Pipelining

4

Prof. Iyad Jafar

5 of 52

Logic Design Conventions

5

Prof. Iyad Jafar

6 of 52

Logic Design Conventions

  • The datapath contains two types of elements:
    • Combinational
      • Elements that operate on data values such as gates, adders, ALU and multiplexors.
      • Output is function of input only.

    • Sequential
      • Elements that contain state such as flip-flops, memory, file register, PC, IR.
      • Output depends on stored value and input.
      • Stored value is allowed to change at the clock edge.

6

A

B

Y

A

B

Y

+

I0

I1

Y

M�u�x

S

A

B

Y

ALU

F

D

Clk

Q

Prof. Iyad Jafar

7 of 52

Sequential Elements

  • A state element may require updating stored value on every clock edge.

  • Or, it may update on clock edge when write control is active.

7

Clk

D

Q

D

Clk

Q

Write

Write

D

Q

Clk

D

Clk

Q

Prof. Iyad Jafar

8 of 52

Registers

  • A register is basically a group of binary storage elements.

8

4-bit Register

4-bit Register with load control

(clock gating)

4-bit Register with load control

Prof. Iyad Jafar

9 of 52

Clocking Methodology

  • The processors’ clock defines when signals can be read from or written to state elements.
    • Edge-triggered methodology.
    • Rising or falling.

  • Typical execution:
    • read contents of state elements.
    • send values through combinational logic.
    • write results to one or more state elements.

  • Cycle time is determined by the longest delay in the path.

9

Combinational

logic

State

Element

State

Element

clock

one clock cycle

Prof. Iyad Jafar

10 of 52

Building Single-Cycle Datapath

10

Prof. Iyad Jafar

11 of 52

Single-Cycle Implementation

  • Simplest implementation.
  • All instructions start and finish execution in one clock cycle.
  • This includes the time required to fetch, decode, and execute the instruction.
  • We consider a simple implementation that supports:
    • The arithmetic-logical instructions add, sub, and, or
    • The memory-reference instructions load word (lw) and store word (sw)
    • The conditional branch instruction branch if equal (beq)
  • Implementation of remaining instructions is similar.

11

Prof. Iyad Jafar

12 of 52

Remember

  • Programs are loaded into memory at some address prior to execution.
  • Once loaded, all instructions require these steps:
    1. Fetch the instruction: Read the instruction from memory.
    2. Decode the instruction: Understand what should be done (Control Unit).
    3. Execute the instruction: Perform the required task (ALU).
    4. Update PC by 4 (or branch address/jump address).
    5. Repeat until the end of program or something interrupts the CPU.
  • Let’s build the datapath required for each of these steps.

12

Fetch

Decode

Execute

Prof. Iyad Jafar

13 of 52

Fetch Datapath

  • Fetching the instruction from memory requires:
    • Sending the PC to memory to read the instruction.
    • Calculate the updated PC to point to the next instruction (PC+4).

  • Do we need an explicit write signal for writing the PC?
  • Do we need an explicit read signal for reading the memory?
  • We will assume separate memories for data and instructions. Why?

13

PC

Read

Address

Data

Instruction

Memory

+

4

Instruction

Prof. Iyad Jafar

14 of 52

Decode Datapath

  • Regardless of the instruction, do the following:
    • Send the opcode and the function fields in the instruction to the control unit.
    • Read two registers at the rs1 and rs2 fields in the instruction.
      • Not all instructions need two registers.
      • Reading is not harmful.

14

Instruction

Write Data

Read Addr 1

Read Addr 2

Write Addr

Register File

Read

Data 1

Read

Data 2

Control

Unit

R[rs1]

R[rs2]

Prof. Iyad Jafar

15 of 52

Inside the Register File

  • For reading, select a register based on its address (number)
    • Use two 32-to-1 multiplexors for each read register.

  • We don’t need an explicit read signal
    • We read on every cycle.

15

Register 0

Register 1

Register 2

….

Register 31

32-to-1 MUX

32-to-1 MUX

rs1

rs2

R[rs1]

R[rs2]

0

1

31

0

1

31

.

.

.

.

Prof. Iyad Jafar

16 of 52

Inside the Register File

  • For writing to a register, enable it based on its address:
    • We need a RegWrite control signal since the register file is not written on every cycle.
    • Use 5-to-32 decoder to generate enabling signals based on rd.
    • Gate the clock with write signal and decoders’ outputs.

16

rd

Write Data

Register 0

D

C

Register 1

D

C

Register 2

D

C

…..

D

C

Register 31

D

C

RegWrite

5-to-32 Decoder

31

1

0

Clock

Prof. Iyad Jafar

17 of 52

Inside the Register File

17

rd

5-to-32 Decoder

Write Data

Register 0

D

C

Register 1

D

C

Register 2

D

C

…..

D

C

Register 31

D

C

RegWrite

31

1

0

Clock

32-to-1 MUX

0

1

31

.

.

32-to-1 MUX

0

1

31

.

.

R[rs1]

R[rs2]

rs1

rs2

.

.

.

.

.

.

.

Prof. Iyad Jafar

18 of 52

Execute Datapath for R-Type

  • At this stage, the execution of instructions takes different paths depending on their type.
  • For R-type instructions (add, sub, and, or, …):
    • The two registers are read already.
    • Perform operation based on opcode and function fields.
    • Store the result back into the register file.

18

Write Data

Read Addr 1

Read Addr 2

Write Addr

Register File

Read

Data 1

Read

Data 2

R[rs1]

R[rs2]

Instruction

Write

ALU

RegWrite

ALU Operation

rd

Not all instructions write to register file. Need RegWrite signal

Instructions use the ALU differently. Need ALU Operation signal

R[rd] 🡨 R[rs1] op R[rs2]

Prof. Iyad Jafar

19 of 52

Execute Datapath for LW

  • Data is stored in a separate memory.
  • For load instruction:
    • Base register rs1 is already read.
    • Sign-extend the immediate and use ALU to calculate the memory address.
    • Store the loaded value to register rd in the register file.

19

Write Data

Read Addr 1

Read Addr 2

Write Addr

Register File

Read

Data 1

Read

Data 2

R[rs1]

R[rs2]

Instruction

Write

ALU

RegWrite

ALU Operation

Address

Data

Data Memory

Imm Gen

Write

Data

MemRead

imm

rd

Not all instructions read from memory

Need MemRead signal

R[rd] 🡨 M[R[rs1] + sign_ext(offset)]

Prof. Iyad Jafar

20 of 52

Execution Datapath for SW

  • For store instruction:
    • Base register rs1 is already read.
    • Sign-extend the immediate and use ALU to calculate the memory address.
    • Store R[rs2] to memory.

20

Write Data

Read Addr 1

Read Addr 2

Write Addr

Register File

Read

Data 1

Read

Data 2

R[rs1]

R[rs2]

Instruction

Write

ALU

RegWrite

ALU Operation

Address

Data

Data Memory

Imm Gen

Write

Data

MemRead

MemWrite

imm

Not all instructions write to memory. Need MemWrite signal

M[R[rs1]+sign_ext(offset)] 🡨 R[rs2]

Prof. Iyad Jafar

21 of 52

Execution Datapath for BEQ

  • For branch if equal:
    • Use ALU to subtract two registers.
    • Use an adder to calculate branch address.
    • Check Zero flag to update PC accordingly.

21

if (R[rs1] == R[rs2])

PC 🡨 PC + sign_ext(imm)x2

else

PC 🡨 PC +4

Write Data

Read Addr 1

Read Addr 2

Write Addr

Register File

Read

Data 1

Read

Data 2

Instruction

Write

RegWrite

PC

+

4

ALU

ALU

Operation

+

Imm Gen

Zero

Branch Address

1

0

Only beq updates PC if Zero signal is 1. Need Branch signal.

Zero

Branch

R[rs1]

R[rs2]

Prof. Iyad Jafar

22 of 52

Combining Datapaths for Single-Cycle

  • Assemble the datapath segments by considering the following issues:
    • Since fetch, decode and execute steps of instructions is in one clock cycle, no datapath resource can be used more than once per instruction.
      • Some elements must be duplicated.
      • Separate Instruction and Data memories, two adders in addition to ALU.

    • Multiplexors are needed at the input(s) shared between elements:
      • PC register input is either PC+4 or branch address.
        • Use Branch and Zero signals to select.
      • ALU second input is either R[rs2] or sign-extended immediate.
        • Add ALUSrc control signal to select.
      • Write port of register file receives data from ALU or Memory.
        • Add MemToReg control signal to select.

22

PC

Prof. Iyad Jafar

23 of 52

Combining Datapaths for Single-Cycle

23

Prof. Iyad Jafar

24 of 52

Immediate Generation Unit

  • Five classes of instructions that work with immediate.

  • The location and operation of the immediate depends on instruction:
    • For immediate arithmetic, jalr, loads and stores 🡪 sign-extend immediate.
    • For branch and jal 🡪 sign_extend immediate x 2.
    • For lui 🡪 lower 12 bits are set to zero.
  • The generated immediate is selected based on the opcode of the instruction, i.e. the unit is basically a set of multiplexors.

24

Prof. Iyad Jafar

25 of 52

Immediate Generation Unit

  • Consider an immediate generation unit from 2 bits to 4 bits with two operations only: (1) Sign_extend (2) Sign_extend x2.

25

B1

B0

0

1

0

1

0

Selection Logic

Opcode

Out0

Out1

Out2

Out3

Oper.

Out3

Out2

Out1

Out0

(1)

(2)

B1

B1

B1

B0

B1

B1

B0

0

How about generating an immediate for lui?

Prof. Iyad Jafar

26 of 52

Building Single-Cycle Control Unit

26

Prof. Iyad Jafar

27 of 52

Building Control Unit

  • Need to design the control that generates the appropriate control signals based on the opcode and function fields to:
    • Specify the operation of the ALU for different instructions.
    • Control the data flow by selecting the appropriate input of the multiplexors.
    • Enable reading/writing from/to state elements.

  • With the following observations across different instructions:
    • Opcode is always in bits I[6:0].
    • The source registers rs1 and rs2 are always in I[19:15] and I[24:20], respectively.
    • The destination register is always the rd field which is in I[11:7].
    • The Immediate Generation unit:
      • Sign-extend the immediate for lw and sw.
      • Sign-extend the immediate and multiplies it by 2 for branch.

27

Prof. Iyad Jafar

28 of 52

Multi-level Decoding and Control

  • We will design control at two levels.
  • Main control unit:
    • Generates control signals to control the ALU control unit, state elements, multiplexors.
    • Signals’ values are determined based on opcode only.
  • ALU control unit:
    • Generates the signals to specify the ALU operation (addition, subtraction, and, or).
    • Signals’ values are determined based on Funct3 and Funct7 fields in the instruction as well as the control signals from the main control (ALUOp).
  • Why?
    • Reduce size of control unit and thus latency and cost.

28

ALU Control

func3

func7

3

7

ALU

Prof. Iyad Jafar

29 of 52

Main Control Unit

  • Generate main control signals based on opcode
    • Truth table!

29

Opcode

7

ALUOp

2

MemWrite

MemToReg

Branch

MemRead

ALUSrc

RegWrite

Main Control

Prof. Iyad Jafar

30 of 52

Datapath with Control

30

32

Prof. Iyad Jafar

31 of 52

Control Signals for R-Type

31

32

0

1

X

Values from main control

ALU Operation

ALUOp1

ALUOp0

R-Type

1

0

LW

0

0

SW

0

0

BEQ

0

1

Prof. Iyad Jafar

32 of 52

Control Signals for LW

32

32

0

1

X

Values from main control

ALU Operation

ALUOp1

ALUOp0

R-Type

1

0

LW

0

0

SW

0

0

BEQ

0

1

Prof. Iyad Jafar

33 of 52

Control Signals for SW

33

32

0

1

X

Values from main control

ALU Operation

ALUOp1

ALUOp0

R-Type

1

0

LW

0

0

SW

0

0

BEQ

0

1

Prof. Iyad Jafar

34 of 52

Control Signals for BEQ

34

32

0

1

X

Values from main control

ALU Operation

ALUOp1

ALUOp0

R-Type

1

0

LW

0

0

SW

0

0

BEQ

0

1

Prof. Iyad Jafar

35 of 52

Summary of Main Control Signals

35

Signal Name

Effect when Deassereted (0)

Effect when Asserted (1)

Branch

Instruction is not branch

Instruction is branch

MemRead

None

Contents of memory address are put on memory data output

MemtoReg

Data written to the register file comes from ALU

Data written to the register file comes from memory

ALUOp

Used with function fields to generate the ALUOp signal that specify the ALU operation

MemWrite

None

Data on memory data input is stored in the specified address

ALUSrc

The second ALU operand comes from R[rs2]

The second ALU operand is the sign extended offset

RegWrite

None

Enable writing to the register file

Prof. Iyad Jafar

36 of 52

Values of Control Signals

36

  • Truth Table

  • Find logic expressions by writing expressions as sum-of-minterms. For example:

Inputs

Outputs

Op6

Op5

Op4

Op3

Op2

Op1

Op0

Branch

MemRead

MemWrite

RegWrite

MemToReg

AlUSrc

ALUOp1

ALUOp0

R-type

0

1

1

0

0

1

1

LW

0

0

0

0

0

1

1

SW

0

1

0

0

0

1

1

BEQ

1

1

0

0

0

1

1

0

0

0

1

0

1

0

0

0

0

1

0

1

1

0

0

0

1

X

X

0

1

1

0

1

0

0

0

0

0

0

1

 

 

Prof. Iyad Jafar

37 of 52

Main Control Unit

37

I[6]

I[5]

I[4]

I[3]

I[2]

I[1]

I[0]

ALUSrc

MemtoReg

RegWrite

MemRead

MemWrite

Branch

ALUOp1

ALUOp0

lw

sw

beq

R-format

Opcode

R-type

0

1

1

0

0

1

1

LW

0

0

0

0

0

1

1

SW

0

1

0

0

0

1

1

BEQ

1

1

0

0

0

1

1

Prof. Iyad Jafar

38 of 52

ALU Control Unit

  • Assume that the ALU supports the following operations:

  • For R-Type
    • ALU performs add, sub, and, or operations based on funct3 and func7 fields
  • For LW and SW
    • ALU performs addition to calculate memory address
  • For BEQ
    • ALU performs subtraction between two registers

38

Function

Ainv

Bnegate

Op [1:0]

and

0

0

00

or

0

0

01

add

0

0

10

sub

0

1

10

ALU Control

funct3

func7

ALUOp

3

7

2

2

1

1

Op

 

 

Prof. Iyad Jafar

39 of 52

ALU Control Unit

39

ALUOp1

ALUOp0

Funct7

Func3

Op[1]

Op[0]

I[31]

I[30]

I[29]

I[28]

I[27]

I[26]

I[25]

I[14]

I[13]

I[12]

0

0

x

x

x

x

x

x

x

x

x

x

0

0

1

0

0

0

x

x

x

x

x

x

x

x

x

x

0

0

1

0

0

1

x

x

x

x

x

x

x

x

x

x

0

1

1

0

1

0

0

0

0

0

0

0

0

0

0

0

0

0

1

0

1

0

0

1

0

0

0

0

0

0

0

0

0

1

1

0

1

0

0

0

0

0

0

0

0

1

1

1

0

0

0

0

1

0

0

0

0

0

0

0

0

1

1

0

0

0

0

1

LW

SW

BEQ

ADD

SUB

AND

OR

Prof. Iyad Jafar

40 of 52

ALU Control Unit

40

Prof. Iyad Jafar

41 of 52

Exercise 1

  • Modify the single-cycle implementation to support addi instruction.

41

addi rd, rs1 , immed

Prof. Iyad Jafar

42 of 52

Exercise

  • Modify the single-cycle implementation to support addi instruction.

42

Inputs

Outputs

Op6

Op5

Op4

Op3

Op2

Op1

Op0

Branch

MemRead

MemWrite

RegWrite

MemToReg

AlUSrc

ALUOp1

ALUOp0

R-type

0

1

1

0

0

1

1

LW

0

0

0

0

0

1

1

SW

0

1

0

0

0

1

1

BEQ

1

1

0

0

0

1

1

ADDI

0

0

1

0

0

1

1

0

0

0

1

0

1

0

0

0

0

1

0

1

1

0

0

0

1

X

X

0

1

1

0

1

0

0

0

0

0

0

1

Prof. Iyad Jafar

43 of 52

Exercise 2

  • Modify the single-cycle implementation to support jal instruction.

43

jal rd, immed

Prof. Iyad Jafar

44 of 52

Exercise 2

  • Modify the single-cycle implementation to support jal instruction.

44

Inputs

Outputs

Op6

Op5

Op4

Op3

Op2

Op1

Op0

Branch

MemRead

MemWrite

RegWrite

MemToReg

AlUSrc

ALUOp1

ALUOp0

R-type

0

1

1

0

0

1

1

LW

0

0

0

0

0

1

1

SW

0

1

0

0

0

1

1

BEQ

1

1

0

0

0

1

1

ADDI

0

0

1

0

0

1

1

JAL

1

1

0

1

1

1

1

0

0

0

1

0

0

1

0

0

0

0

0

1

0

0

1

1

0

0

1

0

1

X

X

0

0

1

1

0

1

1

0

0

0

0

0

0

0

1

0

Prof. Iyad Jafar

45 of 52

Single-Cycle Performance

45

Prof. Iyad Jafar

46 of 52

Performance Analysis

  • All instructions have to finish in one cycle.
  • What is the cycle time?
    • Different units are used in different instructions with different delays.
    • Cycle time should be long enough to accommodate for longest delay.

46

Fetch

Fetch

Unit

Delay

ALU

2 ns

Memory

2 ns

Register File

1 ns

Fetch

Fetch

Register

Register

Register

Register

ALU

ALU

ALU

ALU

Register

Register

Read Mem.

Write Mem.

R-Type

Load

Store

Branch

6 ns

8 ns

7 ns

5 ns

Prof. Iyad Jafar

47 of 52

Performance Analysis

  • The cycle time is fixed.
  • However, not all instructions require the same time.
    • There is a wasted time for some instructions.

  • Possible solution
    • Modify the cycle time based on instruction.
    • Example!

47

LW

SW

Clock

Cycle 1

Cycle 2

waste

Prof. Iyad Jafar

48 of 52

Example

  • Consider the following two single-cycle implementations:
    • Processor A: all instructions execute in one cycle of fixed length.
    • Processor B: all instructions execute in one cycle , however, the cycle time adapts to instruction types.

Use the information given in the tables to compare the performance of the two processors using a program the have the given instruction mix.

48

Unit

Time (ps)

Memory

200

ALU and adders

100

Register File

50

Instruction

%

R-Type

45

Load

25

Store

10

Branch

15

Prof. Iyad Jafar

49 of 52

Example

49

Unit

Time (ps)

Memory

200

ALU and adders

100

Register File

50

Instruction

%

R-Type

45

Load

25

Store

10

Branch

15

 

 

 

 

 

 

Processor B is 1.37 faster

So, adaptive clock cycle is faster; however it is hard to implement

IM

Reg

ALU

DM

Reg

Total Time

R-type

Load

Store

Branch

Prof. Iyad Jafar

50 of 52

Single-Cycle Summary

  • Single-cycle implementation assumes that all instructions can execute in one cycle.

  • Advantages:
    • Simple
    • Easy to understand

  • Disadvantages:
    • Hardware duplication 🡪 cost and power.
    • Uses the clock cycle inefficiently as the clock cycle must accommodate the slowest instruction
      • Problematic for more complex instructions like floating point multiply

50

Prof. Iyad Jafar

51 of 52

Suggested Problems

51

Prof. Iyad Jafar

52 of 52

Suggested Problems

  • 4.1
  • 4.3
  • 4.4
  • 4.5
  • 4.6
  • 4.8
  • 4.9
  • 4.11
  • 4.12
  • 4.13

52

Prof. Iyad Jafar