1 of 31

Instruction-Level Parallelism and its Exploitation (Part 1)

Chapter 3

Appendix C & H

Prof. Iyad Jafar

https://sites.google.com/view/iyadjafar

iyad.jafar@ju.edu.jo


Outline

  • ILP: Concepts and challenges (3.1)
  • Basic Compiler Techniques for Exposing ILP (3.2)
  • Advanced Branch Prediction Techniques (3.3)


Reading Assignment

  • Read Appendix C
  • Read Appendix H


Instruction-Level Parallelism: Concepts and Challenges


ILP Concepts and Challenges

  • Instruction-level parallelism (ILP) overlaps the execution of instructions to improve performance
  • Pipelining is a basic form of ILP that has been adopted in most processors since 1985, with the aim of achieving an ideal CPI of 1
  • Dependences among instructions executed in parallel limit the improvement gained from ILP!

CPI_pipeline = CPI_ideal + Stalls due to dependences

Stalls due to dependences = Structural + Data + Control

  • Extensions to reduce the effect of dependences:
    • Static, software-based (typical in personal mobile devices, PMDs)
    • Dynamic, hardware-based (typical in desktops and servers)
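As a numeric illustration of the formula above (the stall values here are hypothetical, chosen only to show the arithmetic):

```python
# Hypothetical per-instruction stall components, in cycles per instruction.
ideal_cpi = 1.0
structural_stalls = 0.10
data_stalls = 0.25
control_stalls = 0.15

# CPI_pipeline = CPI_ideal + stalls due to dependences
pipeline_cpi = ideal_cpi + structural_stalls + data_stalls + control_stalls
print(pipeline_cpi)  # 1.5
```

Reducing any stall component moves the pipeline CPI back toward the ideal of 1.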


Techniques for Improving ILP


Data Dependences and Hazards



Name Dependence



Antidependence (WAR on x2):
  add x3, x2, x4
  ld  x2, 10(x5)

Output dependence (WAW on x2):
  add x2, x3, x4
  ld  x2, 10(x5)


Data Hazards



RAW (true dependence on x2):
  ld  x2, 10(x5)
  add x3, x2, x4

WAR (antidependence on x2):
  add x3, x2, x4
  ld  x2, 10(x5)

WAW (output dependence on x2):
  add x2, x3, x4
  ld  x2, 10(x5)

No hazard (both instructions only read x2):
  add x3, x2, x4
  ld  x5, 10(x2)

WAW and WAR hazards do not occur in a simple in-order pipeline; they arise with out-of-order execution.


Control Hazards



Loop whose conditional branch (bne) creates a control hazard:

For: add x3, x2, x4
     ld  x2, 10(x5)
     bne x2, x0, For
     sub x1, x10, x18


Basic Compiler Techniques for Exposing ILP


Basic Compiler Techniques for ILP

  • These techniques are crucial for processors that use static issue or static scheduling.
  • Basic Idea
    • Find sequences of unrelated instructions that can be overlapped in the pipeline to keep it full!
  • To avoid a pipeline stall
    • Dependent instruction must be separated from the source instruction by a distance in clock cycles equal to the pipeline latency of that source instruction.
  • Two basic techniques
    • Scheduling instructions (reordering)
    • Loop unrolling
  • Advanced techniques in Appendix H


Scheduling

  • Consider the following loop, to be executed on a standard five-stage pipeline with no structural hazards; the latencies of the different operations are given in the accompanying table. Assume that
    • x1 holds the address of the last element of x
    • x2 holds the address of x[0] - 8
    • f2 holds the scalar s
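In a high-level language, the loop being analyzed is simply "add the scalar s to every element of x" (shown here in Python; the function name is ours):

```python
def scale_add(x, s):
    """x[i] = x[i] + s for every element -- the loop analyzed on these slides.
    The assembly version walks from the last element down to x[0]."""
    for i in range(len(x) - 1, -1, -1):
        x[i] = x[i] + s
    return x

print(scale_add([1.0, 2.0, 3.0], 10.0))  # [11.0, 12.0, 13.0]
```

Every iteration is independent of the others, which is what makes scheduling and unrolling effective here.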


Let’s analyze the performance


Scheduling


Without scheduling: 8 issue cycles per element

With scheduling: 7 issue cycles per element
  • 3 cycles actually needed to process an element
  • 2 stall cycles
  • 2 cycles of loop overhead per element

The loop overhead can be reduced by unrolling the loop


Loop Unrolling



Loop Unrolling


Unrolled loop without scheduling
  • 4 elements are processed per iteration
  • 27 cycles per iteration = 27/4 cycles per element
    • 12 cycles actually needed
    • 13 stall cycles!
    • 2 cycles of loop overhead per 4 elements
  • Better performance than the original loop; scheduling the unrolled loop improves it further
  • Note the number of live registers vs. the original loop: register pressure!
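The unroll-by-four transformation can be sketched as follows (Python stand-in, assuming the trip count is a multiple of 4; the function name is ours). With scheduling, all four temporaries are live at once, which is the register pressure noted above:

```python
def scale_add_unrolled(x, s):
    """Process four elements per iteration; assumes len(x) % 4 == 0."""
    for i in range(0, len(x), 4):
        # After scheduling, t0..t3 are all live simultaneously:
        # the unrolled loop needs four registers where the original needed one.
        t0 = x[i] + s
        t1 = x[i + 1] + s
        t2 = x[i + 2] + s
        t3 = x[i + 3] + s
        x[i], x[i + 1], x[i + 2], x[i + 3] = t0, t1, t2, t3
    return x

print(scale_add_unrolled([1, 2, 3, 4], 1))  # [2, 3, 4, 5]
```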


Loop Unrolling


Unrolled loop with scheduling
  • 4 elements are processed per iteration
  • 14 cycles per 4 elements
    • 12 cycles actually needed
    • 2 cycles of loop overhead per 4 elements
    • 0 stall cycles
  • Limitations
    1. Code size
    2. Register pressure


Strip Mining

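Strip mining handles a trip count n that need not be a multiple of the unroll factor k: a scalar prologue loop executes n mod k iterations, then the unrolled body handles the rest. A minimal sketch (function name and parameters are ours):

```python
def scale_add_strip_mined(x, s, k=4):
    """x[i] += s over all elements, unrolled by k with a prologue for n mod k."""
    n = len(x)
    # Prologue: the n mod k leftover iterations, executed one at a time.
    for i in range(n % k):
        x[i] += s
    # Main loop: the remaining iterations, a multiple of k, unrolled by k.
    for i in range(n % k, n, k):
        for j in range(k):  # stands in for k literal copies of the loop body
            x[i + j] += s
    return x

print(scale_add_strip_mined([1, 2, 3, 4, 5, 6], 10))  # [11, 12, 13, 14, 15, 16]
```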


Advanced Branch Prediction Techniques


Advanced Branch Prediction

  • Branches hurt pipeline performance! Their cost can be reduced by
    • Loop unrolling
    • Prediction techniques
  • Static Prediction
    • Predict as taken
    • Predict as not taken
  • Dynamic Prediction
    • 1-bit prediction/2-bit prediction
    • Correlating predictors (two-level)
    • G-share predictor
    • Tournament predictors
    • Tagged Hybrid Predictors


2-bit Predictor

  • Uses a Branch History Table (BHT) with n = 2 bits per entry and 2^k entries
    • Table size = n × 2^k bits
  • For each branch, predicts taken or not taken
  • Only changes the prediction after two successive mispredictions
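A minimal simulation of one 2-bit saturating counter (a single BHT entry); the prediction only flips after two consecutive mispredictions:

```python
class TwoBitCounter:
    """One BHT entry: states 0-1 predict not taken, states 2-3 predict taken."""

    def __init__(self):
        self.state = 0  # start at strongly not taken

    def predict(self):
        return self.state >= 2  # True = predict taken

    def update(self, taken):
        # Saturating increment/decrement toward the actual outcome.
        if taken:
            self.state = min(self.state + 1, 3)
        else:
            self.state = max(self.state - 1, 0)

c = TwoBitCounter()
preds = []
for taken in [True, True, True]:  # branch is taken three times in a row
    preds.append(c.predict())
    c.update(taken)
print(preds)  # [False, False, True] -- two mispredictions before the flip
```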


Correlating predictors (two-level)



Correlating predictors (two-level)



(Figure: the global branch history is kept in a shift register.)


Correlating predictors (two-level)

  • A correlating predictor not only outperforms a simple 2-bit predictor (a (0,2) predictor) with the same total number of state bits, but often also outperforms a 2-bit predictor with an unlimited number of entries
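A sketch of how an (m, n) correlating predictor forms its index: m bits of global history (the shift register) are concatenated with low-order branch-address bits to select an n-bit counter. The class and parameter names are ours, and the sizes are illustrative:

```python
class CorrelatingPredictor:
    """(m, n) = (2, 2): 2 bits of global history select among four
    2-bit counters for each branch-address index."""

    def __init__(self, m=2, addr_bits=4):
        self.m = m
        self.addr_bits = addr_bits
        self.history = 0                            # global history shift register
        self.table = [0] * (2 ** (m + addr_bits))   # 2-bit saturating counters

    def _index(self, pc):
        # Concatenate history bits with low-order address bits.
        low_pc = pc & ((1 << self.addr_bits) - 1)
        return (self.history << self.addr_bits) | low_pc

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2  # True = predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        self.table[i] = min(self.table[i] + 1, 3) if taken else max(self.table[i] - 1, 0)
        # Shift the outcome into the global history register.
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)
```

Because the history is part of the index, the same branch trains a different counter for each recent-outcome pattern.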


G-share Predictor

  • The index into the BHT is formed by XORing the branch address with the most recent conditional branch outcomes

  • Acts as a hash of the branch address and the branch history

  • The hashed result is used to index a prediction array of 2-bit counters
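The gshare indexing scheme can be sketched as follows (class name and table size are ours; the XOR of address and history is the hash described above):

```python
class GsharePredictor:
    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.history = 0                          # global branch history register
        self.counters = [0] * (1 << index_bits)   # 2-bit saturating counters

    def _index(self, pc):
        # XOR of branch-address bits and global history: a cheap hash of both.
        return (pc ^ self.history) & self.mask

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2  # True = predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        self.counters[i] = min(self.counters[i] + 1, 3) if taken else max(self.counters[i] - 1, 0)
        self.history = ((self.history << 1) | int(taken)) & self.mask
```

XORing (rather than concatenating) lets the full index width carry both address and history information in the same number of bits.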


Tournament Predictors

  • Global predictor
    • A predictor indexed by the outcomes of the most recent branches (global branch history)
  • Local predictor
    • A predictor indexed by the branch address
  • A selector chooses between the local and global predictions; the preferred predictor for a branch is changed when two mispredictions occur in a row
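The selection mechanism can be sketched as a 2-bit chooser that moves toward a predictor whenever it is right and the other is wrong (simplified; real designs keep many choosers, indexed by branch address or history, and the names here are ours):

```python
class TournamentChooser:
    """2-bit selector: states 0-1 prefer the local predictor, 2-3 the global one."""

    def __init__(self):
        self.state = 2  # start mildly preferring the global predictor

    def choose(self, local_pred, global_pred):
        return global_pred if self.state >= 2 else local_pred

    def update(self, local_correct, global_correct):
        # Move only when exactly one predictor was right.
        if global_correct and not local_correct:
            self.state = min(self.state + 1, 3)
        elif local_correct and not global_correct:
            self.state = max(self.state - 1, 0)

ch = TournamentChooser()
# Local is right twice in a row while global is wrong: preference flips to local.
ch.update(local_correct=True, global_correct=False)
ch.update(local_correct=True, global_correct=False)
print(ch.choose("L", "G"))  # L
```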


Tournament Predictors


Tagged Hybrid Predictors (Best as of 2017)

  • A series of global predictors indexed with different history lengths
  • A five-component tagged hybrid predictor
    • 2-bit (or 3-bit) prediction tables indexed by a hash of the branch address and a segment of recent branch history h, with history lengths 0–4
    • Entries in P(1) … P(4) also store a tag (typically 4–8 bits)
    • A prediction from P(1) … P(4) is used only if its tag matches the hash of the branch address and history
    • The prediction used for a given branch comes from the matching predictor with the longest branch history
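The selection rule can be sketched as follows. Each tagged table is probed with a hash of the PC and its own history slice; among tables whose stored tag matches, the one using the longest history supplies the prediction, falling back to the untagged base predictor P(0). The hash, table size, and tag width here are toy values of our own choosing:

```python
def tagged_hybrid_predict(pc, history_bits, tables, base_prediction):
    """tables: list of dicts {index: (tag, taken?)} for P(1)..P(n),
    each P(n) using a longer slice of the branch history than the last."""
    prediction = base_prediction  # P(0): plain 2-bit predictor, no tag
    for n, table in enumerate(tables, start=1):
        hist = history_bits & ((1 << n) - 1)  # P(n) sees n bits of history
        index = (pc ^ hist) % 64              # toy hash into a 64-entry table
        tag = (pc ^ (hist << 3)) & 0xF        # toy 4-bit tag
        entry = table.get(index)
        if entry is not None and entry[0] == tag:
            prediction = entry[1]  # longest matching history seen so far wins
    return prediction
```

Usage: with an entry only in P(2), that table's prediction overrides the base predictor when its tag matches; with no matching entries, P(0) decides.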


Tagged Hybrid Predictor


The Evolution of Intel Core i7 Branch Predictor

  • Six generations of Intel Core i7 processors appeared between 2008 (Core i7 920, Nehalem microarchitecture) and 2016 (Core i7 6700, Skylake microarchitecture)
    • Deep pipelines and multiple issues per clock
    • The i7 keeps many instructions in flight at once (30 to 256), so mispredictions are expensive
  • Core i7 920
    • Two-level predictor
    • Each level combines a 2-bit predictor, a global history predictor, and a loop exit predictor
  • Newer models
    • Some form of tagged hybrid predictor


The Evolution of Intel Core i7 Branch Predictor
