
Data-Level Parallelism in Vector, SIMD and GPU Architectures (Part 2)

Chapter 4

Appendix A (Computer Organization and Design Book)


Prof. Iyad Jafar

https://sites.google.com/view/iyadjafar

iyad.jafar@ju.edu.jo


Outline

  • SIMD Instruction Set Extensions for Multimedia (4.3)

  • Graphical Processing Units (4.4)


SIMD Instruction Set Extensions for Multimedia


SIMD Extensions

  • SIMD multimedia extensions started with the observation that many media applications operate on data types narrower than 32 bits!
    • Pixels (8 bits) and audio samples (8 or 16 bits).
    • Partition wide hardware to handle smaller operands at little additional cost.
    • For example, a 256-bit adder can be partitioned to perform thirty-two 8-bit adds, sixteen 16-bit adds, eight 32-bit adds, or four 64-bit adds simultaneously, as in the sketch below.
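To make the partitioning concrete, here is a minimal C sketch of a "SIMD within a register" byte add; the function name paddb and the choice of a 64-bit word are illustrative assumptions, not an actual ISA instruction:

    #include <stdint.h>

    /* Treat one 64-bit ALU word as eight 8-bit lanes. Masking off each
       lane's top bit keeps carries from spilling into the neighboring
       lane; the top bits are then patched back in with XOR. */
    static uint64_t paddb(uint64_t a, uint64_t b) {
        uint64_t low = (a & 0x7F7F7F7F7F7F7F7FULL)
                     + (b & 0x7F7F7F7F7F7F7F7FULL);     /* low 7 bits per lane */
        return low ^ ((a ^ b) & 0x8080808080808080ULL); /* fix each lane's top bit */
    }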


SIMD Extensions

  • Similar to vector ISAs, SIMD instructions operate on vectors of data.
  • However, SIMD extensions:
    • Fix the number of data operands in the opcode, while vector ISAs have a vector length register (VLR);
      • Increased number of instructions in SIMD.
    • Do not support strided access or gather-scatter addressing modes;
      • Fewer loops can be vectorized.
    • Do not offer mask registers.

  • Hence, it is harder for the compiler to generate SIMD code and more difficult to program in SIMD assembly language! These points show up in the sketch below.
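As an illustration of the fixed operand count, a short C sketch using the SSE2 intrinsics _mm_loadu_pd, _mm_add_pd, and _mm_storeu_pd (two doubles per 128-bit instruction); because the width is baked into the opcode, leftover elements need a separate scalar loop, since there is no VLR to shorten the last vector operation:

    #include <emmintrin.h>   /* SSE2 intrinsics */

    /* c[i] = a[i] + b[i]. The SIMD width (2 doubles) is fixed by the
       opcode, so the n % 2 leftover elements fall back to scalar code. */
    void add_arrays(const double *a, const double *b, double *c, int n) {
        int i = 0;
        for (; i + 2 <= n; i += 2) {                   /* 2 operands per opcode */
            __m128d va = _mm_loadu_pd(&a[i]);
            __m128d vb = _mm_loadu_pd(&b[i]);
            _mm_storeu_pd(&c[i], _mm_add_pd(va, vb));
        }
        for (; i < n; i++)                             /* scalar remainder */
            c[i] = a[i] + b[i];
    }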


SIMD Extensions

  • Intel Multimedia Extensions (MMX) (1996)
    • Repurposed the 64-bit floating-point registers.
    • Eight 8-bit integer ops or four 16-bit integer ops simultaneously.
  • Streaming SIMD Extensions (SSE) (1999)
    • Added separate 128-bit registers.
    • Allow sixteen 8-bit, eight 16-bit, or four 32-bit operations simultaneously.
    • SSE2, SSE3, and SSE4 → additional multimedia instructions.
  • Advanced Vector Extensions (AVX) (2010)
    • 256-bit registers → four 64-bit integer/FP ops simultaneously.
  • AVX-512 (2017)
    • 512-bit registers → eight 64-bit integer/FP ops simultaneously.
  • Operands must be in consecutive and aligned memory locations.


These extensions are intended to accelerate carefully written libraries rather than requiring the compiler to generate them.


SIMD Extensions

  • Why are multimedia SIMD extensions so popular?
    • Low cost and easy to add to the standard arithmetic unit.
    • Little extra processor state.
    • Need less memory bandwidth than a vector architecture.
    • Fewer virtual-memory problems: a short, fixed-length vector of aligned operands cannot span a page boundary.
    • Vector architectures had issues with caches!

  • More recent vector architectures have addressed all of these problems, but the legacy of past flaws shaped the skeptical attitude toward vectors among architects.


RISC-V with SIMD Extensions (RV64P)

  • Assume we add a 256-bit SIMD multimedia instruction extension to RISC-V, tentatively called RV64P (P for packed):
    • 256-bit SIMD instructions.
    • RV64P expands the F registers to the full 256-bit width.
    • The suffix '4D' indicates FP SIMD instructions that operate on four 64-bit operands at once.
    • The datapath has four lanes.
    • The floating-point registers are reused as operands for 4D instructions.


Example 1. DAXPY with RV64P

  • Processing is for 32 double-precision elements in X and Y.
  • Every 4 elements are processed simultaneously by one 4D instruction.
  • 3 + 8×8 = 67 instructions are fetched (3 setup instructions plus 8 loop iterations of 8 instructions each), while RV64G needs 258.

  • Still not as good as RV64V, which needs only 8 instructions, since the entire 32-element loop fits within the maximum vector length. The loop being vectorized is sketched below.
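For reference, this is the scalar loop in question, a plain C sketch with the slide's instruction-count arithmetic in the comments:

    /* DAXPY over 32 doubles: Y = a*X + Y.
       RV64G: 2 setup + 32 iterations * 8 instructions = 258 fetched.
       RV64P: one 4D instruction covers 4 elements, so only 8 iterations
              of the same 8-instruction body: 3 + 8*8 = 67 fetched. */
    void daxpy(double a, double *X, double *Y) {
        for (int i = 0; i < 32; i++)
            Y[i] = a * X[i] + Y[i];
    }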


Roofline Performance Model [Williams, 2009]

  • Attainable GFLOP/s = min(Peak memory bandwidth × Arithmetic intensity, Peak floating-point performance)
  • Arithmetic intensity = floating-point operations per byte of memory traffic (FLOPs/byte).
  • Plotted against arithmetic intensity, this bound forms a "roofline": a rising memory-bound segment that flattens at the compute-bound peak.
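A minimal C sketch of this bound (function and parameter names are mine):

    /* Roofline: performance is capped by the memory system or by the
       FP units, whichever is lower at this arithmetic intensity. */
    double roofline_gflops(double peak_gflops, double peak_bw_gbs,
                           double flops_per_byte) {
        double memory_bound = peak_bw_gbs * flops_per_byte;
        return memory_bound < peak_gflops ? memory_bound : peak_gflops;
    }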


Roofline Performance Example

  • The NEC SX-9 is a vector supercomputer announced in 2008 that cost millions of dollars. It has a peak DP FP performance of 102.4 GFLOP/s and a peak memory bandwidth of 162 GB/s.
    • 3.2 GHz, eight vector pipes, each having two multiply and two add units.
    • Peak FP = 3.2 GHz × 8 pipes × (2 + 2) FLOPs/cycle = 102.4 GFLOP/s

  • The Core i7 920 has a peak DP FP performance of 42.66 GFLOP/s and a peak memory bandwidth of 16.4 GB/s.
    • Peak FP = 2.66 GHz × 4 cores × 2 ops per SIMD instruction × 2 (FP add and multiply) ≈ 42.66 GFLOP/s


Roofline Performance Example

  • @4 FLOPs/byte:
    • Both processors operate at peak FP performance, with the SX-9 2.4× faster (102.4 vs. 42.66 GFLOP/s).
  • @0.25 FLOPs/byte:
    • Both are memory-bound, and the SX-9 is 10× faster (40.5 vs. 4.1 GFLOP/s).


  • Given the arithmetic intensity, we can determine the attainable FLOP/s performance.

  • Memory bandwidth sets the slope of the slanted part of the graph.

  • The graph uses a log–log scale and is drawn just once for a given computer (see the sketch below).
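A self-contained C sketch that reproduces the two comparisons above, using the peak numbers from the previous slide:

    #include <stdio.h>

    static double roofline(double peak, double bw, double ai) {
        double mem = bw * ai;               /* memory-bound GFLOP/s */
        return mem < peak ? mem : peak;     /* compute-bound cap    */
    }

    int main(void) {
        double ai[] = {4.0, 0.25};
        for (int i = 0; i < 2; i++) {
            double sx9 = roofline(102.4, 162.0, ai[i]);  /* NEC SX-9    */
            double i7  = roofline(42.66, 16.4,  ai[i]);  /* Core i7 920 */
            printf("AI=%.2f: SX-9 %.1f vs. i7 %.1f GFLOP/s (%.1fx)\n",
                   ai[i], sx9, i7, sx9 / i7);
        }
        return 0;
    }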


Graphical Processing Units (GPUs)


GPUs – Introduction

  • By the end of last century, graphics on a PC were performed using video graphics array (VGA), i.e. memory controller and display generator.
  • VGAs evolved to include more advanced graphics functions (shading, texture mapping, ….)!
  • By 2000, the term GPU was coined to reflect that the graphics device has become a processor.
    • Programmable processors replaced graphics fixed logic.
    • More precise: Integer 🡪 double precision.
  • GPUs have become massively programmable parallel processors (100s cores and 1000s threads)
    • GPUs implement all forms of parallelism; ILP, SIMD, Multithreading, MIMD.


GPUs – Introduction

  • Given the hardware invested to do graphics well, how can it be supplemented to improve the performance of a wider range of applications?
  • GPU computing:
    • Using the GPU for computing via a parallel programming language and APIs, without using the traditional graphics APIs and graphics pipeline.
  • Basic idea:
    • Heterogeneous execution model:
      • The CPU is the host, the GPU is the device.
    • Develop a C-like programming language for the GPU:
      • Compute Unified Device Architecture (CUDA), and
      • OpenCL as a vendor-independent language.
    • Unify all forms of GPU parallelism as the CUDA thread.
    • The programming model is "Single Instruction, Multiple Thread" (SIMT).


GPUs - CUDA

  • Compute Unified Device Architecture (CUDA) is a scalable parallel programming model for GPUs and parallel processors.

  • CUDA addresses the challenges of a heterogeneous system and the various forms of parallelism.

  • CUDA programs combine C/C++ for the system processor (host) with a C/C++ dialect for the GPU (device).

  • CUDA allows the programmer to bypass the graphics APIs and interfaces and program directly in C/C++.


GPUs - Threads and Blocks

  • A GPU is simply a multiprocessor system composed of multithreaded SIMD processors.
  • A thread (CUDA thread) is associated with each data element/iteration.
  • Threads are organized into thread blocks:
    • Up to 512 elements or threads per block.
    • Each block executes on a multithreaded SIMD processor.
    • 32 elements are executed per SIMD instruction.
  • Blocks are organized into a grid:
    • Blocks are executed independently and in any order.
    • Different blocks cannot communicate directly but can coordinate using atomic memory operations on Global Memory.
  • Thread management is done by the GPU hardware, not by applications or the OS.


GPUs - Threads and Blocks


The slide's figure shows a kernel launched with n threads, organized as 256 threads per block; the kernel and its launch are sketched below.

The sequential part of the program is executed on the host while the parallel part is executed on the device.
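A sketch along the lines of the textbook's CUDA DAXPY example (error handling and host/device memory transfers omitted):

    // Device code: one CUDA thread per element (the parallel part).
    __global__ void daxpy(int n, double a, double *x, double *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
        if (i < n) y[i] = a * x[i] + y[i];              // mask off extra threads
    }

    // Host code (the sequential part): launch n threads, 256 per block.
    void invoke_daxpy(int n, double a, double *x, double *y) {
        int nblocks = (n + 255) / 256;   // round up so all n elements are covered
        daxpy<<<nblocks, 256>>>(n, a, x, y);
    }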


GPUs – NVIDIA Architecture

  • Example: multiplying two 8192-element vectors.
    • The code (a for loop in this case) that works on all 8192 elements is the grid (vectorized loop).
    • The grid is decomposed into thread blocks (the body of the vectorized loop):
      • Each has up to 512 elements → need 8192/512 = 16 blocks.
    • Assuming SIMD instructions process 32 elements at a time:
      • Each thread block has 512/32 = 16 SIMD threads (warps) of 32 CUDA threads each.
      • The maximum number of SIMD threads that can execute simultaneously per thread block is 16 for Tesla-generation and 32 for Fermi-generation GPUs.
    • A thread block is assigned to a multithreaded SIMD processor by the thread block scheduler.
    • Current-generation GPUs (Fermi) have 7–15 multithreaded SIMD processors. The decomposition is spelled out below.
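The same arithmetic as C constants (sizes as assumed on this slide):

    enum {
        N       = 8192,          /* elements in the grid (vectorized loop) */
        BLOCK   = 512,           /* elements per thread block              */
        WARP    = 32,            /* elements per SIMD instruction          */
        NBLOCKS = N / BLOCK,     /* = 16 thread blocks in the grid         */
        NWARPS  = BLOCK / WARP   /* = 16 SIMD threads (warps) per block    */
    };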


GPUs – NVIDIA Architecture


Simplified block diagram of a Multithreaded SIMD Processor.

It has 16 SIMD lanes. The SIMD Thread Scheduler has, say, 48 independent threads of SIMD instructions (Warps) that it schedules with a table of 48 PCs.

SIMD processors are full processors with separate PCs and are programmed using threads.


GPUs – NVIDIA Architecture

  • The machine object that the hardware creates, manages, schedules, and executes is a thread of SIMD instructions (warp).
  • Each SIMD thread is independent of other SIMD threads:
    • Contains exclusively SIMD instructions.
    • Has its own PC.
    • Runs on a multithreaded SIMD processor.
  • Threads in a processor are scheduled by the SIMD thread scheduler:
    • It uses a scoreboard to know which threads of SIMD instructions are ready to run.
    • It schedules threads of SIMD instructions.
  • Hence, there are two levels of scheduling:
    • Block level.
    • SIMD-thread level.


GPUs – NVIDIA Architecture

  • The scheduler selects a ready thread of SIMD instructions and issues an instruction synchronously to all the SIMD Lanes executing the SIMD thread.

  • Because threads of SIMD instructions are independent, the scheduler may select a different SIMD thread each time


GPUs – NVIDIA Architecture

  • An NVIDIA multithreaded SIMD processor has 32,768 32-bit registers:
    • Divided across the SIMD lanes.
    • Each SIMD thread is limited to 64 vector registers:
      • 64 vector registers of 32 32-bit elements, or
      • 32 vector registers of 32 64-bit elements.
    • Fermi has 16 physical SIMD lanes, each containing 2048 registers (32,768/16).

  • Registers are dynamically allocated when SIMD threads are created and freed when they exit.

  • Note that a CUDA thread is just a vertical cut of a thread of SIMD instructions, corresponding to one element executed by one SIMD lane.


GPUs – NVIDIA Architecture

  • Terminology Summary
    • Thread: concurrent code and associated state executed on the CUDA device (in parallel with other threads).
    • Warp: a group of threads executed physically in parallel in G80/GT200.
    • Block: a group of threads that are executed together and form the unit of resource assignment.
    • Grid: a group of thread blocks that must all complete before the next kernel call of the program can take effect.


GPUs – NVIDIA Architecture

  • Mapping Summary:
    • The grid is broken into thread blocks. Blocks are independent and can execute in any order.
    • A thread block consists of CUDA threads, each 32 of which form a warp (SIMD thread).
    • Threads in a block execute the same program and are assumed to be independent.
    • Blocks are identified by blockIdx.
    • Threads are identified by threadIdx (sequential within a block).


GPUs – NVIDIA Architecture

[Figure: one SIMD thread (warp), with threads 0, 1, 2, 3, …, m all executing the same thread program. Courtesy: John Nickolls, NVIDIA]


NVIDIA GPUs

  • NVIDIA GTX 280 specifications:
    • 933 GFLOPS peak performance
    • 10 thread processing clusters (TPCs)
    • 3 multiprocessors per TPC
    • 8 cores (SIMD lanes) per multiprocessor (multithreaded SIMD processor)
    • 16,384 registers per multiprocessor
    • 16 KB shared memory per multiprocessor
    • 64 KB constant cache per multiprocessor
    • 6–8 KB texture cache per multiprocessor
    • 1.3 GHz clock rate
    • Single- and double-precision floating-point support
    • 1 GB GDDR3 dedicated memory


The Fermi GPU Architecture

  • Each SIMD processor has:
    • Two SIMD thread schedulers and two instruction dispatch units.
    • Thus, two threads of SIMD instructions are scheduled every two clock cycles.
    • 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, and 4 special function units.
  • Fast double precision: from 78 GFLOPs in the previous generation to 515 GFLOPs for DAXPY.
  • Caches for GPU memory: instruction/data L1 caches per SIMD processor and a shared L2 cache.
  • 64-bit addressing and a unified address space for C/C++ pointers.
  • Error-correcting codes: dependability for long-running applications.
  • Faster context switching: hardware support, 10× faster.
  • Faster atomic instructions: 5–20× faster than the previous generation.


GPUs – NVIDIA ISA

  • The instruction set targeted by NVIDIA compilers is an abstraction of the hardware instruction set.
  • Parallel Thread Execution (PTX) provides a stable ISA for compilers; the hardware ISA is hidden!
  • PTX uses virtual registers; the compiler assigns the required physical registers.
  • General format of a PTX instruction:
    opcode.type d, a, b, c
    • a, b, and c are source operands, while d is the destination.
    • Source operands are 32-bit or 64-bit registers or constant values.
    • The destination d is a register (memory for store instructions).
  • Check p. 299 for PTX instructions!


GPUs – NVIDIA ISA

  • PTX code for one CUDA thread in DAXPY
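A sketch following the textbook's PTX DAXPY figure (R8 holds the thread's element offset, and RD4 is assumed to already hold the scalar a):

    shl.s32        R8, blockIdx, 9     ; thread block ID * block size (512)
    add.s32        R8, R8, threadIdx   ; R8 = i = my CUDA thread ID
    ld.global.f64  RD0, [X+R8]         ; RD0 = X[i]
    ld.global.f64  RD2, [Y+R8]         ; RD2 = Y[i]
    mul.f64        RD0, RD0, RD4       ; product in RD0 = RD0 * RD4 (scalar a)
    add.f64        RD0, RD0, RD2       ; sum in RD0 = RD0 + RD2 (Y[i])
    st.global.f64  [Y+R8], RD0         ; Y[i] = sum (X[i]*a + Y[i])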


GPUs – NVIDIA ISA

  • Conditional Branching
  • GPU hardware executes an instruction for all threads in the same warp before moving to the next instruction (SIMT or SIMD).
    • This works well when all threads in a warp follow the same control-flow path!
    • However, conditional branching within a loop is not uncommon, and CUDA threads may take different paths: branch divergence!
    • Solution → serialize the execution paths.
  • Example: an if-then-else is executed in two passes:
    • One pass for threads taking the THEN path.
    • A second pass for threads taking the ELSE path.
    • The threads in the warp merge once both passes complete.


GPUs – NVIDIA ISA

  • Illustration: a warp of CUDA threads reaches a branch, and paths A and B diverge.
    • Pass 1 (THEN part): the lanes whose threads take path A execute, while the others are masked off.
    • Pass 2 (ELSE part): the lanes whose threads take path B execute.
    • Merge: the warp's threads rejoin and continue together.


GPUs – NVIDIA ISA

  • Implementation in hardware:
    • Internal masks (just like vector processors).
    • Predicate registers (1 bit per SIMD lane).
    • A branch synchronization stack (to handle nested IFs).
    • Instruction markers to manage the masks (*comp, *push, *pop).

  • In each pass, lanes are enabled or disabled based on their 1-bit predicate register values.


GPUs – NVIDIA ISA

  • Example

    if (X[i] != 0)
        X[i] = X[i] - Y[i];
    else
        X[i] = Z[i];

    ld.global.f64   RD0, [X+R8]     ; RD0 = X[i]
    setp.neq.s32    P1, RD0, #0     ; P1 is predicate register 1
    @!P1, bra ELSE1, *Push          ; push old mask, set new mask bits
                                    ; if P1 false, go to ELSE1
    ld.global.f64   RD2, [Y+R8]     ; RD2 = Y[i]
    sub.f64         RD0, RD0, RD2   ; difference in RD0
    st.global.f64   [X+R8], RD0     ; X[i] = RD0
    @P1, bra ENDIF1, *Comp          ; complement mask bits
                                    ; if P1 true, go to ENDIF1
ELSE1:
    ld.global.f64   RD0, [Z+R8]     ; RD0 = Z[i]
    st.global.f64   [X+R8], RD0     ; X[i] = RD0
ENDIF1:
    <next instruction>, *Pop        ; pop to restore old mask


GPUs – NVIDIA ISA

  • It is as if each element has its own program counter! This creates the illusion that each CUDA thread acts independently.

  • Vector compilers could play the same tricks!
    • They would need scalar instructions to manipulate the mask registers.
    • GPUs do it in hardware at run time!

  • What if all threads take the same path?
    • Optimization!
    • When all mask bits are 0, the THEN part is skipped.
    • Similarly, when all mask bits are 1, the ELSE part is skipped.


GPUs – NVIDIA ISA

  • Conditional branching performance:
    • How frequently does divergence occur?
    • In the best case, all mask bits are the same, and only the THEN or the ELSE part is executed!
    • If at least one CUDA thread diverges, we need two passes:
      • 50% efficiency when the THEN and ELSE parts are of equal length.
    • With nested IF-THEN-ELSE, the cost is higher:
      • Doubly nested → 25%
      • Triply nested → 12.5%
    • An active research area for optimization!
  • Optimization: avoid divergence when possible.
    • if (threadIdx.x > 2) splits threads of the same warp across both paths → divergence.
    • if (threadIdx.x / WARP_SIZE > 2) makes the branch uniform within each warp → no warp diverges (see the sketch below).
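A minimal CUDA sketch contrasting the two conditions (kernel and buffer names are mine):

    #define WARP_SIZE 32

    __global__ void divergent(float *d) {
        // Threads 0-2 and 3-31 of the first warp take different paths,
        // so that warp executes both paths serially (two passes).
        if (threadIdx.x > 2) d[threadIdx.x] = 1.0f;
        else                 d[threadIdx.x] = 2.0f;
    }

    __global__ void warp_aligned(float *d) {
        // All 32 threads of any given warp evaluate this condition the
        // same way, so every warp takes a single pass through one path.
        if (threadIdx.x / WARP_SIZE > 2) d[threadIdx.x] = 1.0f;
        else                             d[threadIdx.x] = 2.0f;
    }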


GPUs – NVIDIA Memory Structure


Private
  • Off-chip; recently also cached in the L1 and L2 caches.
  • Per CUDA thread; holds the stack and spilled registers.

Local
  • On-chip.
  • One per multithreaded SIMD processor.
  • Shared between the threads in a block.
  • Dynamically allocated to blocks.

Global
  • Off-chip.
  • Shared by all SIMD processors.
  • Accessed by the host.


GPUs vs. Vector Processors

  • Both architectures
    • Designed to execute DLP programs
    • Have multiple processors
  • However, architecturally:
    • GPUs rely on multithreading to hide memory latency (vector processors rely on deep pipelines).
    • GPUs have more registers!
    • GPUs have more lanes (8–16 vs. 2–8).
  • Memory:
    • Vector processors have explicit unit-stride loads/stores.
    • In GPUs, this is implicit (address coalescing combines per-thread accesses).
  • Branches:
    • Vector processors manage masks explicitly in software; GPUs do it in hardware at run time!
    • Strip-mining in vector processors requires the VLR; GPUs iterate the loop and mask off the unused lanes in the last iteration.


GPUs vs. Vector Processors

  • Control unit
    • In vector processors, it handles both vector and scalar operations.
    • GPUs have no comparable control processor; the closest analog is the thread block scheduler (less power efficient).

  • Scalar processor
    • Vector processors include a separate, simple scalar processor.
    • GPUs have none; they use a single SIMD lane and disable the others, rather than calling on the system processor across the I/O bus (less power efficient and slower).


GPUs vs. Multimedia SIMD Processors

  • The scalar processor and multimedia SIMD instructions are tightly integrated in traditional computers; in GPUs, the two are separated by an I/O bus and even have separate main memories.
  • The multiple SIMD processors in a GPU use a single address space, but their caches are not coherent as they are in traditional multicore computers.
  • Unlike GPUs, multimedia SIMD instructions do not support gather-scatter memory accesses.


Reading Assignment

  • Section 4.7 – Putting it All Together

  • Appendix A from the Computer Organization and Design Textbook
