Data-Level Parallelism in Vector, SIMD and GPU Architectures (Part 2)
Chapter 4
Appendix A (Computer Organization and Design Book)
Outline
Prof. Iyad Jafar
SIMD Instruction Set Extensions for Multimedia
SIMD Extensions
SIMD Extensions
SIMD Extensions
These extensions are intended to accelerate carefully written libraries rather than to serve as a target for compiler-generated code.
SIMD Extensions
SIMD Extensions
RISC-V with SIMD Extensions (RV64P)
Example 1. DAXPY with RV64P
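The RV64P listing for this example did not survive on the slide, so the following is only a C sketch of the computation it targets, DAXPY (Y = a × X + Y). The packed width of four doubles is an assumption made for illustration, and the function names `daxpy_scalar` and `daxpy_packed4` are hypothetical.

```c
#include <stddef.h>

/* Scalar reference: Y[i] = a * X[i] + Y[i], one element per iteration. */
void daxpy_scalar(size_t n, double a, const double *x, double *y) {
    for (size_t i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}

/* Hypothetical packed version: each outer iteration stands in for one group
 * of SIMD instructions that operate on four doubles at once (assumed width). */
void daxpy_packed4(size_t n, double a, const double *x, double *y) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        for (size_t lane = 0; lane < 4; lane++) {   /* the four SIMD lanes */
            y[i + lane] = a * x[i + lane] + y[i + lane];
        }
    }
    for (; i < n; i++) {                            /* scalar cleanup for leftovers */
        y[i] = a * x[i] + y[i];
    }
}
```

With a packed width of four, the loop overhead (index update, compare, branch) is paid once per four elements instead of once per element, which is where most of the instruction-count saving comes from.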
Roofline Performance Model [Williams, 2009]
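As a minimal sketch of the model, attainable performance is the smaller of the peak floating-point rate and the product of peak memory bandwidth and arithmetic intensity (FLOPs per byte). The function names and the numbers in the usage note are hypothetical and are not taken from the slides.

```c
/* Attainable GFLOP/s = min(peak GFLOP/s, peak GB/s * FLOPs per byte). */
double roofline_gflops(double peak_gflops, double peak_bw_gbs,
                       double arithmetic_intensity) {
    double memory_bound = peak_bw_gbs * arithmetic_intensity;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}

/* Ridge point: the arithmetic intensity at which the memory-bound and
 * compute-bound roofs meet. */
double ridge_point(double peak_gflops, double peak_bw_gbs) {
    return peak_gflops / peak_bw_gbs;
}
```

For an assumed machine with a 100 GFLOP/s peak and 20 GB/s of memory bandwidth, a kernel with intensity 2 FLOPs/byte is memory bound at 40 GFLOP/s, while any kernel at or above the ridge point of 5 FLOPs/byte can reach the full 100 GFLOP/s.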
Roofline Performance Example
Roofline Performance Example
Graphics Processing Units (GPUs)
GPUs – Introduction
GPUs – Introduction
GPUs – CUDA
GPUs – Threads and Blocks
GPUs – Threads and Blocks
Launch n threads
256 threads per block
The sequential part of the program runs on the host (CPU), while the parallel part runs on the device (GPU).
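The mapping from n threads to blocks of 256 can be sketched in plain C, with two host loops standing in for the block and thread indices that the GPU hardware would supply. This is an emulation under stated assumptions rather than CUDA code: the DAXPY body and the names `daxpy_thread`, `blocks_for`, and `daxpy_launch` are hypothetical.

```c
#include <stddef.h>

#define THREADS_PER_BLOCK 256   /* block size given on the slide */

/* Body of one CUDA-style thread: it handles at most one element. */
static void daxpy_thread(size_t block_idx, size_t thread_idx,
                         size_t n, double a, const double *x, double *y) {
    size_t i = block_idx * THREADS_PER_BLOCK + thread_idx;
    if (i < n) {                /* guard: the last block may be partly idle */
        y[i] = a * x[i] + y[i];
    }
}

/* Smallest number of blocks that provides at least n threads. */
size_t blocks_for(size_t n) {
    return (n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
}

/* Host-side stand-in for a kernel launch: on a GPU these two loops would
 * not exist, since the hardware creates and schedules the threads. */
void daxpy_launch(size_t n, double a, const double *x, double *y) {
    size_t nblocks = blocks_for(n);
    for (size_t b = 0; b < nblocks; b++) {
        for (size_t t = 0; t < THREADS_PER_BLOCK; t++) {
            daxpy_thread(b, t, n, a, x, y);
        }
    }
}
```

Rounding the block count up means some threads in the last block fall past the end of the arrays, which is why the per-thread bounds check is needed.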
GPUs – NVIDIA Architecture
GPUs
GPUs – NVIDIA Architecture
Simplified block diagram of a Multithreaded SIMD Processor.
It has 16 SIMD lanes. The SIMD Thread Scheduler has, say, 48 independent threads of SIMD instructions (Warps) that it schedules with a table of 48 PCs.
SIMD Processors are full processors with separate PCs and are programmed using threads.
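The role of the PC table can be sketched as follows. The sizes (48 entries, 16 lanes) come from the slide; the round-robin selection policy, the `ready` flags, and all names are assumptions made for illustration, since the slide does not say how the scheduler chooses among ready SIMD threads.

```c
#include <stdbool.h>

#define NUM_WARPS 48   /* entries in the PC table, as on the slide */
#define NUM_LANES 16   /* SIMD lanes, as on the slide */

typedef struct {
    unsigned pc[NUM_WARPS];     /* one program counter per SIMD thread */
    bool     ready[NUM_WARPS];  /* hypothetical flag: operands available? */
    int      last;              /* index of the most recently issued SIMD thread */
} WarpScheduler;

/* Pick the next ready SIMD thread after `last`, round-robin (assumed policy).
 * Returns its index, or -1 if none is ready. The caller would then fetch
 * the instruction at pc[w] and advance that PC. */
int pick_next_warp(WarpScheduler *s) {
    for (int k = 1; k <= NUM_WARPS; k++) {
        int w = (s->last + k) % NUM_WARPS;
        if (s->ready[w]) {
            s->last = w;
            return w;
        }
    }
    return -1;
}

/* Cycles needed to push one SIMD instruction of `width` elements
 * through the lanes, rounding up. */
int issue_cycles(int width) {
    return (width + NUM_LANES - 1) / NUM_LANES;
}
```

Under these assumptions, a SIMD instruction that is 32 elements wide would occupy the 16 lanes for 2 cycles, and the scheduler can hide memory latency by switching to another ready SIMD thread whenever the current one stalls.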
GPUs – NVIDIA Architecture
GPUs – NVIDIA Architecture
GPUs – NVIDIA Architecture
GPUs – NVIDIA Architecture
GPUs – NVIDIA Architecture
GPUs – NVIDIA Architecture
Figure: SIMD Thread or Warp. Thread Ids 0, 1, 2, 3, …, m each execute the thread program. Courtesy: John Nickolls, NVIDIA.
NVIDIA GPUs
NVIDIA GPUs Performance
NVIDIA GPUs
NVIDIA GPUs
The Fermi GPU Architecture
The Fermi GPU Architecture
The Fermi GPU Architecture
GPUs – NVIDIA ISA
GPUs – NVIDIA ISA
GPUs – NVIDIA ISA
GPUs – NVIDIA ISA
Figure: branch divergence in a warp of CUDA threads. At a branch, the threads split between Path A and Path B. Pass 1 executes the then part, Pass 2 executes the else part, and the paths then merge.
GPUs – NVIDIA ISA
GPUs – NVIDIA ISA
if (X[i] != 0)
    X[i] = X[i] - Y[i];
else
    X[i] = Z[i];
        ld.global.f64  RD0, [X+R8]      ; RD0 = X[i]
        setp.neq.s32   P1, RD0, #0      ; P1 is predicate register 1
        @!P1, bra      ELSE1, *Push     ; Push old mask, set new mask bits
                                        ; if P1 false, go to ELSE1
        ld.global.f64  RD2, [Y+R8]      ; RD2 = Y[i]
        sub.f64        RD0, RD0, RD2    ; Difference in RD0
        st.global.f64  [X+R8], RD0      ; X[i] = RD0
        @P1, bra       ENDIF1, *Comp    ; complement mask bits
                                        ; if P1 true, go to ENDIF1
ELSE1:  ld.global.f64  RD0, [Z+R8]      ; RD0 = Z[i]
        st.global.f64  [X+R8], RD0      ; X[i] = RD0
ENDIF1: <next instruction>, *Pop        ; pop to restore old mask
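The effect of the mask operations in the listing above can be sketched in C for a single warp. This is a simplified model under stated assumptions: the warp size of 8 is a hypothetical value chosen to keep the example small, the function name `warp_if_else` is illustrative, and the mask stack is reduced to one local array because there is only one level of nesting.

```c
#include <stdbool.h>

#define WARP_SIZE 8   /* hypothetical small warp, for illustration only */

/* Execute the IF/ELSE for one warp the way a masked SIMD machine would:
 * the warp steps through both paths, and the mask decides which lanes
 * commit results in each pass. */
void warp_if_else(double *x, const double *y, const double *z) {
    bool mask[WARP_SIZE];

    /* setp + *Push: mask bit is set for lanes where X[i] != 0. */
    for (int lane = 0; lane < WARP_SIZE; lane++) {
        mask[lane] = (x[lane] != 0.0);
    }

    /* Pass 1, then part: only lanes with the mask bit set write back. */
    for (int lane = 0; lane < WARP_SIZE; lane++) {
        if (mask[lane]) {
            x[lane] = x[lane] - y[lane];
        }
    }

    /* *Comp: complement the mask so the other lanes become active. */
    for (int lane = 0; lane < WARP_SIZE; lane++) {
        mask[lane] = !mask[lane];
    }

    /* Pass 2, else part: the remaining lanes write back. */
    for (int lane = 0; lane < WARP_SIZE; lane++) {
        if (mask[lane]) {
            x[lane] = z[lane];
        }
    }

    /* *Pop: real hardware would restore the mask saved before the branch;
     * here the local mask simply goes out of scope. */
}
```

Because the mask is computed once before either pass, a lane whose then-part result happens to be zero is not picked up again by the else part. When lanes in a warp disagree, both passes take time even though each lane does useful work in only one of them.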
GPUs – NVIDIA ISA
GPUs – NVIDIA ISA
GPUs – NVIDIA Memory Structure
Figure: GPU memory structure with three levels: Private, Local, and Global.
GPUs vs. Vector Processors
GPUs vs. Vector Processors
GPUs vs. Vector Processors
GPUs vs. Multimedia SIMD Processors
Reading Assignment