1 of 31

Instruction-Level Parallelism and its Exploitation (Part 1)

Chapter 3

Appendix C & H

Prof. Iyad Jafar

https://sites.google.com/view/iyadjafar

iyad.jafar@ju.edu.jo


Outline

  • ILP: Concepts and challenges (3.1)
  • Basic Compiler Techniques for Exposing ILP (3.2)
  • Advanced Branch Prediction Techniques (3.3)


Reading Assignment

  • Read Appendix C
  • Read Appendix H


Instruction-Level Parallelism: Concepts and Challenges


ILP Concepts and Challenges

  • Instruction-level parallelism (ILP) overlaps the execution of instructions to improve performance
  • Pipelining is a basic form of ILP that has been adopted in most processors since 1985, with the aim of achieving an ideal CPI of 1
  • Dependences among instructions executed in parallel limit the improvement gained from ILP!

CPI_pipeline = CPI_ideal + Stalls due to dependences

Stalls due to dependences = Structural + Data + Control

  • Extensions to reduce the effect of dependences:
    • Static, software-based (typical in personal mobile devices, PMDs)
    • Dynamic, hardware-based (typical in desktops and servers)
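As a numeric illustration of the formula above (the stall values here are hypothetical, chosen only to show the arithmetic):

```python
# Hypothetical per-instruction stall components, in cycles per instruction.
ideal_cpi = 1.0
structural_stalls = 0.10
data_stalls = 0.25
control_stalls = 0.15

# CPI_pipeline = CPI_ideal + stalls due to dependences
pipeline_cpi = ideal_cpi + structural_stalls + data_stalls + control_stalls
print(pipeline_cpi)  # 1.5
```

Reducing any stall component moves the pipeline CPI back toward the ideal of 1.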


Techniques for Improving ILP


Data Dependences and Hazards



Name Dependence



Antidependence (WAR on x2):
  add x3, x2, x4
  ld  x2, 10(x5)

Output dependence (WAW on x2):
  add x2, x3, x4
  ld  x2, 10(x5)


Data Hazards



RAW (true dependence on x2):
  ld  x2, 10(x5)
  add x3, x2, x4

WAR (antidependence on x2):
  add x3, x2, x4
  ld  x2, 10(x5)

WAW (output dependence on x2):
  add x2, x3, x4
  ld  x2, 10(x5)

No hazard (both instructions only read x2):
  add x3, x2, x4
  ld  x5, 10(x2)

WAW and WAR hazards do not occur in a simple in-order pipeline; they arise with out-of-order execution.


Control Hazards



Loop whose conditional branch (bne) creates a control hazard:

For: add x3, x2, x4
     ld  x2, 10(x5)
     bne x2, x0, For
     sub x1, x10, x18


Basic Compiler Techniques for Exposing ILP


Basic Compiler Techniques for ILP

  • These techniques are crucial for processors that use static issue or static scheduling.
  • Basic Idea
    • Find sequences of unrelated instructions that can be overlapped in the pipeline to keep it full!
  • To avoid a pipeline stall
    • Dependent instruction must be separated from the source instruction by a distance in clock cycles equal to the pipeline latency of that source instruction.
  • Two basic techniques
    • Scheduling instructions (reordering)
    • Loop unrolling
  • Advanced techniques in Appendix H


Scheduling

  • Consider the following loop, to be executed on a standard five-stage pipeline with no structural hazards; the latencies of the different operations are given in the accompanying table. Assume that
    • x1 holds the address of the last element of x
    • x2 holds the address of x[0] - 8
    • f2 holds the scalar s
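In a high-level language, the loop being analyzed is simply "add the scalar s to every element of x" (shown here in Python; the function name is ours):

```python
def scale_add(x, s):
    """x[i] = x[i] + s for every element -- the loop analyzed on these slides.
    The assembly version walks from the last element down to x[0]."""
    for i in range(len(x) - 1, -1, -1):
        x[i] = x[i] + s
    return x

print(scale_add([1.0, 2.0, 3.0], 10.0))  # [11.0, 12.0, 13.0]
```

Every iteration is independent of the others, which is what makes scheduling and unrolling effective here.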


Let’s analyze the performance


Scheduling


Without scheduling: 8 issue cycles per element

With scheduling: 7 issue cycles per element
  • 3 cycles actually needed to process an element
  • 2 stall cycles
  • 2 cycles of loop overhead per element

The loop overhead can be reduced by unrolling the loop


Loop Unrolling



Loop Unrolling


Unrolled loop without scheduling
  • 4 elements are processed per iteration
  • 27 cycles per iteration = 27/4 cycles per element
    • 12 cycles actually needed
    • 13 stall cycles!
    • 2 cycles of loop overhead per 4 elements
  • Better performance than the original loop; scheduling the unrolled loop improves it further
  • Note the number of live registers vs. the original loop: register pressure!
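The unroll-by-four transformation can be sketched as follows (Python stand-in, assuming the trip count is a multiple of 4; the function name is ours). With scheduling, all four temporaries are live at once, which is the register pressure noted above:

```python
def scale_add_unrolled(x, s):
    """Process four elements per iteration; assumes len(x) % 4 == 0."""
    for i in range(0, len(x), 4):
        # After scheduling, t0..t3 are all live simultaneously:
        # the unrolled loop needs four registers where the original needed one.
        t0 = x[i] + s
        t1 = x[i + 1] + s
        t2 = x[i + 2] + s
        t3 = x[i + 3] + s
        x[i], x[i + 1], x[i + 2], x[i + 3] = t0, t1, t2, t3
    return x

print(scale_add_unrolled([1, 2, 3, 4], 1))  # [2, 3, 4, 5]
```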


Loop Unrolling


Unrolled loop with scheduling
  • 4 elements are processed per iteration
  • 14 cycles per 4 elements
    • 12 cycles actually needed
    • 2 cycles of loop overhead per 4 elements
    • 0 stall cycles
  • Limitations
    1. Code size
    2. Register pressure


Strip Mining

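Strip mining handles a trip count n that need not be a multiple of the unroll factor k: a scalar prologue loop executes n mod k iterations, then the unrolled body handles the rest. A minimal sketch (function name and parameters are ours):

```python
def scale_add_strip_mined(x, s, k=4):
    """x[i] += s over all elements, unrolled by k with a prologue for n mod k."""
    n = len(x)
    # Prologue: the n mod k leftover iterations, executed one at a time.
    for i in range(n % k):
        x[i] += s
    # Main loop: the remaining iterations, a multiple of k, unrolled by k.
    for i in range(n % k, n, k):
        for j in range(k):  # stands in for k literal copies of the loop body
            x[i + j] += s
    return x

print(scale_add_strip_mined([1, 2, 3, 4, 5, 6], 10))  # [11, 12, 13, 14, 15, 16]
```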


Advanced Branch Prediction Techniques


Advanced Branch Prediction

  • Branches hurt pipeline performance! Their cost can be reduced by
    • Loop unrolling
    • Prediction techniques
  • Static Prediction
    • Predict as taken
    • Predict as not taken
  • Dynamic Prediction
    • 1-bit prediction/2-bit prediction
    • Correlating predictors (two-level)
    • G-share predictor
    • Tournament predictors
    • Tagged Hybrid Predictors


2-bit Predictor

  • Uses a Branch History Table (BHT) with n = 2 bits per entry and 2^k entries
    • Table size = n × 2^k bits
  • For each branch, predicts taken or not taken
  • Only changes the prediction after two successive mispredictions
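A minimal simulation of one 2-bit saturating counter (a single BHT entry); the prediction only flips after two consecutive mispredictions:

```python
class TwoBitCounter:
    """One BHT entry: states 0-1 predict not taken, states 2-3 predict taken."""

    def __init__(self):
        self.state = 0  # start at strongly not taken

    def predict(self):
        return self.state >= 2  # True = predict taken

    def update(self, taken):
        # Saturating increment/decrement toward the actual outcome.
        if taken:
            self.state = min(self.state + 1, 3)
        else:
            self.state = max(self.state - 1, 0)

c = TwoBitCounter()
preds = []
for taken in [True, True, True]:  # branch is taken three times in a row
    preds.append(c.predict())
    c.update(taken)
print(preds)  # [False, False, True] -- two mispredictions before the flip
```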


Correlating predictors (two-level)



Correlating predictors (two-level)



(Figure: the global branch history is kept in a shift register.)


Correlating predictors (two-level)

  • A correlating predictor not only outperforms a simple 2-bit predictor (a (0,2) predictor) with the same total number of state bits, but often also outperforms a 2-bit predictor with an unlimited number of entries
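A sketch of how an (m, n) correlating predictor forms its index: m bits of global history (the shift register) are concatenated with low-order branch-address bits to select an n-bit counter. The class and parameter names are ours, and the sizes are illustrative:

```python
class CorrelatingPredictor:
    """(m, n) = (2, 2): 2 bits of global history select among four
    2-bit counters for each branch-address index."""

    def __init__(self, m=2, addr_bits=4):
        self.m = m
        self.addr_bits = addr_bits
        self.history = 0                            # global history shift register
        self.table = [0] * (2 ** (m + addr_bits))   # 2-bit saturating counters

    def _index(self, pc):
        # Concatenate history bits with low-order address bits.
        low_pc = pc & ((1 << self.addr_bits) - 1)
        return (self.history << self.addr_bits) | low_pc

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2  # True = predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        self.table[i] = min(self.table[i] + 1, 3) if taken else max(self.table[i] - 1, 0)
        # Shift the outcome into the global history register.
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)
```

Because the history is part of the index, the same branch trains a different counter for each recent-outcome pattern.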


G-share Predictor

  • The index into the BHT is formed by XORing the branch address with the most recent conditional branch outcomes

  • Acts as a hash of the branch address and the branch history

  • The hashed result is used to index a prediction array of 2-bit counters
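The gshare indexing scheme can be sketched as follows (class name and table size are ours; the XOR of address and history is the hash described above):

```python
class GsharePredictor:
    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.history = 0                          # global branch history register
        self.counters = [0] * (1 << index_bits)   # 2-bit saturating counters

    def _index(self, pc):
        # XOR of branch-address bits and global history: a cheap hash of both.
        return (pc ^ self.history) & self.mask

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2  # True = predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        self.counters[i] = min(self.counters[i] + 1, 3) if taken else max(self.counters[i] - 1, 0)
        self.history = ((self.history << 1) | int(taken)) & self.mask
```

XORing (rather than concatenating) lets the full index width carry both address and history information in the same number of bits.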


Tournament Predictors

  • Global predictor
    • A predictor indexed by the outcomes of the most recent branches (global branch history)
  • Local predictor
    • A predictor indexed by the branch address
  • A selector chooses between the local and global predictions; the preferred predictor for a branch is changed when two mispredictions occur in a row
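The selection mechanism can be sketched as a 2-bit chooser that moves toward a predictor whenever it is right and the other is wrong (simplified; real designs keep many choosers, indexed by branch address or history, and the names here are ours):

```python
class TournamentChooser:
    """2-bit selector: states 0-1 prefer the local predictor, 2-3 the global one."""

    def __init__(self):
        self.state = 2  # start mildly preferring the global predictor

    def choose(self, local_pred, global_pred):
        return global_pred if self.state >= 2 else local_pred

    def update(self, local_correct, global_correct):
        # Move only when exactly one predictor was right.
        if global_correct and not local_correct:
            self.state = min(self.state + 1, 3)
        elif local_correct and not global_correct:
            self.state = max(self.state - 1, 0)

ch = TournamentChooser()
# Local is right twice in a row while global is wrong: preference flips to local.
ch.update(local_correct=True, global_correct=False)
ch.update(local_correct=True, global_correct=False)
print(ch.choose("L", "G"))  # L
```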


Tournament Predictors


Tagged Hybrid Predictors (Best as of 2017)

  • A series of global predictors indexed with different history lengths
  • A five-component tagged hybrid predictor
    • 2-bit (or 3-bit) prediction tables indexed by a hash of the branch address and a segment of recent branch history h, with history lengths 0–4
    • Entries in P(1) … P(4) also store a tag (typically 4–8 bits)
    • A prediction from P(1) … P(4) is used only if its tag matches the hash of the branch address and history
    • The prediction used for a given branch comes from the matching predictor with the longest branch history
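The selection rule can be sketched as follows. Each tagged table is probed with a hash of the PC and its own history slice; among tables whose stored tag matches, the one using the longest history supplies the prediction, falling back to the untagged base predictor P(0). The hash, table size, and tag width here are toy values of our own choosing:

```python
def tagged_hybrid_predict(pc, history_bits, tables, base_prediction):
    """tables: list of dicts {index: (tag, taken?)} for P(1)..P(n),
    each P(n) using a longer slice of the branch history than the last."""
    prediction = base_prediction  # P(0): plain 2-bit predictor, no tag
    for n, table in enumerate(tables, start=1):
        hist = history_bits & ((1 << n) - 1)  # P(n) sees n bits of history
        index = (pc ^ hist) % 64              # toy hash into a 64-entry table
        tag = (pc ^ (hist << 3)) & 0xF        # toy 4-bit tag
        entry = table.get(index)
        if entry is not None and entry[0] == tag:
            prediction = entry[1]  # longest matching history seen so far wins
    return prediction
```

Usage: with an entry only in P(2), that table's prediction overrides the base predictor when its tag matches; with no matching entries, P(0) decides.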


Tagged Hybrid Predictor


The Evolution of Intel Core i7 Branch Predictor

  • Six generations of Intel Core i7 processors appeared between 2008 (Core i7 920, Nehalem microarchitecture) and 2016 (Core i7 6700, Skylake microarchitecture)
    • Deep pipelines and multiple issues per clock
    • The i7 keeps many instructions in flight at once (30 to 256), so mispredictions are expensive
  • Core i7 920
    • Two-level predictor
    • Each level combines a 2-bit predictor, a global history predictor, and a loop exit predictor
  • Newer models
    • Some form of tagged hybrid predictor


The Evolution of Intel Core i7 Branch Predictor
