
Data-Level Parallelism in Vector, SIMD and GPU Architectures (Part 2)

Chapter 4

Appendix A (Computer Organization and Design Book)


Prof. Iyad Jafar

https://sites.google.com/view/iyadjafar

iyad.jafar@ju.edu.jo


Outline

  • SIMD Instruction Set Extensions for Multimedia (4.3)

  • Graphical Processing Units (4.4)


SIMD Instruction Set Extensions for Multimedia


SIMD Extensions

  • SIMD multimedia extensions started with the observation that many media applications operate on data types narrower than 32 bits!
    • Pixels (8 bits) and audio samples (8 or 16 bits).
    • Partition wide hardware to handle smaller operands at little additional cost.
    • For example, a 256-bit adder can be partitioned to perform thirty-two 8-bit adds, sixteen 16-bit adds, eight 32-bit adds, or four 64-bit adds simultaneously, as in the sketch below.
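To make the partitioning concrete, here is a minimal C sketch of a "SIMD within a register" byte add; the function name paddb and the choice of a 64-bit word are illustrative assumptions, not an actual ISA instruction:

    #include <stdint.h>

    /* Treat one 64-bit ALU word as eight 8-bit lanes. Masking off each
       lane's top bit keeps carries from spilling into the neighboring
       lane; the top bits are then patched back in with XOR. */
    static uint64_t paddb(uint64_t a, uint64_t b) {
        uint64_t low = (a & 0x7F7F7F7F7F7F7F7FULL)
                     + (b & 0x7F7F7F7F7F7F7F7FULL);     /* low 7 bits per lane */
        return low ^ ((a ^ b) & 0x8080808080808080ULL); /* fix each lane's top bit */
    }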


SIMD Extensions

  • Similar to vector ISAs, SIMD instructions operate on vectors of data.
  • However, SIMD extensions:
    • Fix the number of data operands in the opcode, while vector ISAs have a vector length register (VLR);
      • Increased number of instructions in SIMD.
    • Do not support strided access or gather-scatter addressing modes;
      • Fewer loops can be vectorized.
    • Do not offer mask registers.

  • Hence, it is harder for the compiler to generate SIMD code and more difficult to program in SIMD assembly language! These points show up in the sketch below.
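As an illustration of the fixed operand count, a short C sketch using the SSE2 intrinsics _mm_loadu_pd, _mm_add_pd, and _mm_storeu_pd (two doubles per 128-bit instruction); because the width is baked into the opcode, leftover elements need a separate scalar loop, since there is no VLR to shorten the last vector operation:

    #include <emmintrin.h>   /* SSE2 intrinsics */

    /* c[i] = a[i] + b[i]. The SIMD width (2 doubles) is fixed by the
       opcode, so the n % 2 leftover elements fall back to scalar code. */
    void add_arrays(const double *a, const double *b, double *c, int n) {
        int i = 0;
        for (; i + 2 <= n; i += 2) {                   /* 2 operands per opcode */
            __m128d va = _mm_loadu_pd(&a[i]);
            __m128d vb = _mm_loadu_pd(&b[i]);
            _mm_storeu_pd(&c[i], _mm_add_pd(va, vb));
        }
        for (; i < n; i++)                             /* scalar remainder */
            c[i] = a[i] + b[i];
    }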


SIMD Extensions

  • Intel Multimedia Extensions (MMX) (1996)
    • Repurposed the 64-bit floating-point registers.
    • Eight 8-bit integer ops or four 16-bit integer ops simultaneously.
  • Streaming SIMD Extensions (SSE) (1999)
    • Added separate 128-bit registers.
    • Allow sixteen 8-bit, eight 16-bit, or four 32-bit operations simultaneously.
    • SSE2, SSE3, and SSE4 → additional multimedia instructions.
  • Advanced Vector Extensions (AVX) (2010)
    • 256-bit registers → four 64-bit integer/FP ops simultaneously.
  • AVX-512 (2017)
    • 512-bit registers → eight 64-bit integer/FP ops simultaneously.
  • Operands must be in consecutive and aligned memory locations.


These extensions are intended to accelerate carefully written libraries rather than requiring the compiler to generate them.


SIMD Extensions

  • Why are multimedia SIMD extensions so popular?
    • Low cost and easy to add to the standard arithmetic unit.
    • Little extra processor state.
    • Need less memory bandwidth than a vector architecture.
    • Fewer virtual-memory problems: a short, fixed-length vector of aligned operands cannot span a page boundary.
    • Vector architectures had issues with caches!

  • More recent vector architectures have addressed all of these problems, but the legacy of past flaws shaped the skeptical attitude toward vectors among architects.


RISC-V with SIMD Extensions (RV64P)

  • Assume we add a 256-bit SIMD multimedia instruction extension to RISC-V, tentatively called RV64P (P for packed):
    • 256-bit SIMD instructions.
    • RV64P expands the F registers to the full 256-bit width.
    • The suffix '4D' indicates FP SIMD instructions that operate on four 64-bit operands at once.
    • The datapath has four lanes.
    • The floating-point registers are reused as operands for 4D instructions.


Example 1. DAXPY with RV64P

  • Processing is for 32 double-precision elements in X and Y.
  • Every 4 elements are processed simultaneously by one 4D instruction.
  • 3 + 8×8 = 67 instructions are fetched (3 setup instructions plus 8 loop iterations of 8 instructions each), while RV64G needs 258.

  • Still not as good as RV64V, which needs only 8 instructions, since the entire 32-element loop fits within the maximum vector length. The loop being vectorized is sketched below.
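For reference, this is the scalar loop in question, a plain C sketch with the slide's instruction-count arithmetic in the comments:

    /* DAXPY over 32 doubles: Y = a*X + Y.
       RV64G: 2 setup + 32 iterations * 8 instructions = 258 fetched.
       RV64P: one 4D instruction covers 4 elements, so only 8 iterations
              of the same 8-instruction body: 3 + 8*8 = 67 fetched. */
    void daxpy(double a, double *X, double *Y) {
        for (int i = 0; i < 32; i++)
            Y[i] = a * X[i] + Y[i];
    }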


Roofline Performance Model [Williams, 2009]

  • Attainable GFLOP/s = min(Peak memory bandwidth × Arithmetic intensity, Peak floating-point performance)
  • Arithmetic intensity = floating-point operations per byte of memory traffic (FLOPs/byte).
  • Plotted against arithmetic intensity, this bound forms a "roofline": a rising memory-bound segment that flattens at the compute-bound peak.
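A minimal C sketch of this bound (function and parameter names are mine):

    /* Roofline: performance is capped by the memory system or by the
       FP units, whichever is lower at this arithmetic intensity. */
    double roofline_gflops(double peak_gflops, double peak_bw_gbs,
                           double flops_per_byte) {
        double memory_bound = peak_bw_gbs * flops_per_byte;
        return memory_bound < peak_gflops ? memory_bound : peak_gflops;
    }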


Roofline Performance Example

  • The NEC SX-9 is a vector supercomputer announced in 2008 that cost millions of dollars. It has a peak DP FP performance of 102.4 GFLOP/s and a peak memory bandwidth of 162 GB/s.
    • 3.2 GHz, eight vector pipes, each having two multiply and two add units.
    • Peak FP = 3.2 GHz × 8 pipes × (2 + 2) FLOPs/cycle = 102.4 GFLOP/s

  • The Core i7 920 has a peak DP FP performance of 42.66 GFLOP/s and a peak memory bandwidth of 16.4 GB/s.
    • Peak FP = 2.66 GHz × 4 cores × 2 ops per SIMD instruction × 2 (FP add and multiply) ≈ 42.66 GFLOP/s


Roofline Performance Example

  • @4 FLOPs/byte:
    • Both processors operate at peak FP performance, with the SX-9 2.4× faster (102.4 vs. 42.66 GFLOP/s).
  • @0.25 FLOPs/byte:
    • Both are memory-bound, and the SX-9 is 10× faster (40.5 vs. 4.1 GFLOP/s).


  • Given the arithmetic intensity, we can determine the attainable FLOP/s performance.

  • Memory bandwidth sets the slope of the slanted part of the graph.

  • The graph uses a log–log scale and is drawn just once for a given computer (see the sketch below).
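A self-contained C sketch that reproduces the two comparisons above, using the peak numbers from the previous slide:

    #include <stdio.h>

    static double roofline(double peak, double bw, double ai) {
        double mem = bw * ai;               /* memory-bound GFLOP/s */
        return mem < peak ? mem : peak;     /* compute-bound cap    */
    }

    int main(void) {
        double ai[] = {4.0, 0.25};
        for (int i = 0; i < 2; i++) {
            double sx9 = roofline(102.4, 162.0, ai[i]);  /* NEC SX-9    */
            double i7  = roofline(42.66, 16.4,  ai[i]);  /* Core i7 920 */
            printf("AI=%.2f: SX-9 %.1f vs. i7 %.1f GFLOP/s (%.1fx)\n",
                   ai[i], sx9, i7, sx9 / i7);
        }
        return 0;
    }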


Graphical Processing Units (GPUs)


GPUs – Introduction

  • By the end of last century, graphics on a PC were performed using video graphics array (VGA), i.e. memory controller and display generator.
  • VGAs evolved to include more advanced graphics functions (shading, texture mapping, ….)!
  • By 2000, the term GPU was coined to reflect that the graphics device has become a processor.
    • Programmable processors replaced graphics fixed logic.
    • More precise: Integer 🡪 double precision.
  • GPUs have become massively programmable parallel processors (100s cores and 1000s threads)
    • GPUs implement all forms of parallelism; ILP, SIMD, Multithreading, MIMD.


GPUs – Introduction

  • Given the hardware invested to do graphics well, how can it be supplemented to improve the performance of a wider range of applications?
  • GPU computing:
    • Using the GPU for computing via a parallel programming language and APIs, without using the traditional graphics APIs and graphics pipeline.
  • Basic idea:
    • Heterogeneous execution model:
      • The CPU is the host, the GPU is the device.
    • Develop a C-like programming language for the GPU:
      • Compute Unified Device Architecture (CUDA), and
      • OpenCL as a vendor-independent language.
    • Unify all forms of GPU parallelism as the CUDA thread.
    • The programming model is "Single Instruction, Multiple Thread" (SIMT).


GPUs - CUDA

  • Compute Unified Device Architecture (CUDA) is a scalable parallel programming model for GPUs and parallel processors.

  • CUDA addresses the challenges of a heterogeneous system and the various forms of parallelism.

  • CUDA programs combine C/C++ for the system processor (host) with a C/C++ dialect for the GPU (device).

  • CUDA allows the programmer to bypass the graphics APIs and interfaces and program directly in C/C++.


GPUs - Threads and Blocks

  • A GPU is simply a multiprocessor system composed of multithreaded SIMD processors.
  • A thread (CUDA thread) is associated with each data element/iteration.
  • Threads are organized into thread blocks:
    • Up to 512 elements or threads per block.
    • Each block executes on a multithreaded SIMD processor.
    • 32 elements are executed per SIMD instruction.
  • Blocks are organized into a grid:
    • Blocks are executed independently and in any order.
    • Different blocks cannot communicate directly but can coordinate using atomic memory operations on Global Memory.
  • Thread management is done by the GPU hardware, not by applications or the OS.


GPUs - Threads and Blocks


The slide's figure shows a kernel launched with n threads, organized as 256 threads per block; the kernel and its launch are sketched below.

The sequential part of the program is executed on the host while the parallel part is executed on the device.
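A sketch along the lines of the textbook's CUDA DAXPY example (error handling and host/device memory transfers omitted):

    // Device code: one CUDA thread per element (the parallel part).
    __global__ void daxpy(int n, double a, double *x, double *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
        if (i < n) y[i] = a * x[i] + y[i];              // mask off extra threads
    }

    // Host code (the sequential part): launch n threads, 256 per block.
    void invoke_daxpy(int n, double a, double *x, double *y) {
        int nblocks = (n + 255) / 256;   // round up so all n elements are covered
        daxpy<<<nblocks, 256>>>(n, a, x, y);
    }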


GPUs – NVIDIA Architecture

  • Example: multiplying two 8192-element vectors.
    • The code (a for loop in this case) that works on all 8192 elements is the grid (vectorized loop).
    • The grid is decomposed into thread blocks (the body of the vectorized loop):
      • Each has up to 512 elements → need 8192/512 = 16 blocks.
    • Assuming SIMD instructions process 32 elements at a time:
      • Each thread block has 512/32 = 16 SIMD threads (warps) of 32 CUDA threads each.
      • The maximum number of SIMD threads that can execute simultaneously per thread block is 16 for Tesla-generation and 32 for Fermi-generation GPUs.
    • A thread block is assigned to a multithreaded SIMD processor by the thread block scheduler.
    • Current-generation GPUs (Fermi) have 7–15 multithreaded SIMD processors. The decomposition is spelled out below.
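The same arithmetic as C constants (sizes as assumed on this slide):

    enum {
        N       = 8192,          /* elements in the grid (vectorized loop) */
        BLOCK   = 512,           /* elements per thread block              */
        WARP    = 32,            /* elements per SIMD instruction          */
        NBLOCKS = N / BLOCK,     /* = 16 thread blocks in the grid         */
        NWARPS  = BLOCK / WARP   /* = 16 SIMD threads (warps) per block    */
    };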


GPUs – NVIDIA Architecture


Simplified block diagram of a Multithreaded SIMD Processor.

It has 16 SIMD lanes. The SIMD Thread Scheduler has, say, 48 independent threads of SIMD instructions (Warps) that it schedules with a table of 48 PCs.

SIMD processors are full processors with separate PCs and are programmed using threads.


GPUs – NVIDIA Architecture

  • The machine object that the hardware creates, manages, schedules, and executes is a thread of SIMD instructions (warp).
  • Each SIMD thread is independent of other SIMD threads:
    • Contains exclusively SIMD instructions.
    • Has its own PC.
    • Runs on a multithreaded SIMD processor.
  • Threads in a processor are scheduled by the SIMD thread scheduler:
    • It uses a scoreboard to know which threads of SIMD instructions are ready to run.
    • It schedules threads of SIMD instructions.
  • Hence, there are two levels of scheduling:
    • Block level.
    • SIMD-thread level.


GPUs – NVIDIA Architecture

  • The scheduler selects a ready thread of SIMD instructions and issues an instruction synchronously to all the SIMD Lanes executing the SIMD thread.

  • Because threads of SIMD instructions are independent, the scheduler may select a different SIMD thread each time


GPUs – NVIDIA Architecture

  • An NVIDIA multithreaded SIMD processor has 32,768 32-bit registers:
    • Divided across the SIMD lanes.
    • Each SIMD thread is limited to 64 vector registers:
      • 64 vector registers of 32 32-bit elements, or
      • 32 vector registers of 32 64-bit elements.
    • Fermi has 16 physical SIMD lanes, each containing 2048 registers (32,768/16).

  • Registers are dynamically allocated when SIMD threads are created and freed when they exit.

  • Note that a CUDA thread is just a vertical cut of a thread of SIMD instructions, corresponding to one element executed by one SIMD lane.


GPUs – NVIDIA Architecture

  • Terminology Summary
    • Thread: concurrent code and associated state executed on the CUDA device (in parallel with other threads).
    • Warp: a group of threads executed physically in parallel in G80/GT200.
    • Block: a group of threads that are executed together and form the unit of resource assignment.
    • Grid: a group of thread blocks that must all complete before the next kernel call of the program can take effect.


GPUs – NVIDIA Architecture

  • Mapping Summary:
    • The grid is broken into thread blocks. Blocks are independent and can execute in any order.
    • A thread block consists of CUDA threads, each 32 of which form a warp (SIMD thread).
    • Threads in a block execute the same program and are assumed to be independent.
    • Blocks are identified by blockIdx.
    • Threads are identified by threadIdx (sequential within a block).


GPUs – NVIDIA Architecture

[Figure: one SIMD thread (warp), with threads 0, 1, 2, 3, …, m all executing the same thread program. Courtesy: John Nickolls, NVIDIA]


NVIDIA GPUs

  • NVIDIA GTX 280 specifications:
    • 933 GFLOPS peak performance
    • 10 thread processing clusters (TPCs)
    • 3 multiprocessors per TPC
    • 8 cores (SIMD lanes) per multiprocessor (multithreaded SIMD processor)
    • 16,384 registers per multiprocessor
    • 16 KB shared memory per multiprocessor
    • 64 KB constant cache per multiprocessor
    • 6–8 KB texture cache per multiprocessor
    • 1.3 GHz clock rate
    • Single- and double-precision floating-point support
    • 1 GB GDDR3 dedicated memory


The Fermi GPU Architecture

  • Each SIMD processor has:
    • Two SIMD thread schedulers and two instruction dispatch units.
    • Thus, two threads of SIMD instructions are scheduled every two clock cycles.
    • 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, and 4 special function units.
  • Fast double precision: from 78 GFLOPs in the previous generation to 515 GFLOPs for DAXPY.
  • Caches for GPU memory: instruction/data L1 caches per SIMD processor and a shared L2 cache.
  • 64-bit addressing and a unified address space for C/C++ pointers.
  • Error-correcting codes: dependability for long-running applications.
  • Faster context switching: hardware support, 10× faster.
  • Faster atomic instructions: 5–20× faster than the previous generation.


GPUs – NVIDIA ISA

  • The instruction set targeted by NVIDIA compilers is an abstraction of the hardware instruction set.
  • Parallel Thread Execution (PTX) provides a stable ISA for compilers; the hardware ISA is hidden!
  • PTX uses virtual registers; the compiler assigns the required physical registers.
  • General format of a PTX instruction:
    opcode.type d, a, b, c
    • a, b, and c are source operands, while d is the destination.
    • Source operands are 32-bit or 64-bit registers or constant values.
    • The destination d is a register (memory for store instructions).
  • Check p. 299 for PTX instructions!


GPUs – NVIDIA ISA

  • PTX code for one CUDA thread in DAXPY
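A sketch following the textbook's PTX DAXPY figure (R8 holds the thread's element offset, and RD4 is assumed to already hold the scalar a):

    shl.s32        R8, blockIdx, 9     ; thread block ID * block size (512)
    add.s32        R8, R8, threadIdx   ; R8 = i = my CUDA thread ID
    ld.global.f64  RD0, [X+R8]         ; RD0 = X[i]
    ld.global.f64  RD2, [Y+R8]         ; RD2 = Y[i]
    mul.f64        RD0, RD0, RD4       ; product in RD0 = RD0 * RD4 (scalar a)
    add.f64        RD0, RD0, RD2       ; sum in RD0 = RD0 + RD2 (Y[i])
    st.global.f64  [Y+R8], RD0         ; Y[i] = sum (X[i]*a + Y[i])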


GPUs – NVIDIA ISA

  • Conditional Branching
  • GPU hardware executes an instruction for all threads in the same warp before moving to the next instruction (SIMT or SIMD).
    • This works well when all threads in a warp follow the same control-flow path!
    • However, conditional branching within a loop is not uncommon, and CUDA threads may take different paths: branch divergence!
    • Solution → serialize the execution paths.
  • Example: an if-then-else is executed in two passes:
    • One pass for threads taking the THEN path.
    • A second pass for threads taking the ELSE path.
    • The threads in the warp merge once both passes complete.


GPUs – NVIDIA ISA

  • Illustration: a warp of CUDA threads reaches a branch, and paths A and B diverge.
    • Pass 1 (THEN part): the lanes whose threads take path A execute, while the others are masked off.
    • Pass 2 (ELSE part): the lanes whose threads take path B execute.
    • Merge: the warp's threads rejoin and continue together.


GPUs – NVIDIA ISA

  • Implementation in hardware:
    • Internal masks (just like vector processors).
    • Predicate registers (1 bit per SIMD lane).
    • A branch synchronization stack (to handle nested IFs).
    • Instruction markers to manage the masks (*comp, *push, *pop).

  • In each pass, lanes are enabled or disabled based on their 1-bit predicate register values.


GPUs – NVIDIA ISA

  • Example

    if (X[i] != 0)
        X[i] = X[i] - Y[i];
    else
        X[i] = Z[i];

    ld.global.f64   RD0, [X+R8]     ; RD0 = X[i]
    setp.neq.s32    P1, RD0, #0     ; P1 is predicate register 1
    @!P1, bra ELSE1, *Push          ; push old mask, set new mask bits
                                    ; if P1 false, go to ELSE1
    ld.global.f64   RD2, [Y+R8]     ; RD2 = Y[i]
    sub.f64         RD0, RD0, RD2   ; difference in RD0
    st.global.f64   [X+R8], RD0     ; X[i] = RD0
    @P1, bra ENDIF1, *Comp          ; complement mask bits
                                    ; if P1 true, go to ENDIF1
ELSE1:
    ld.global.f64   RD0, [Z+R8]     ; RD0 = Z[i]
    st.global.f64   [X+R8], RD0     ; X[i] = RD0
ENDIF1:
    <next instruction>, *Pop        ; pop to restore old mask


GPUs – NVIDIA ISA

  • It is as if each element has its own program counter! This creates the illusion that each CUDA thread acts independently.

  • Vector compilers could play the same tricks!
    • They would need scalar instructions to manipulate the mask registers.
    • GPUs do it in hardware at run time!

  • What if all threads take the same path?
    • Optimization!
    • When all mask bits are 0, the THEN part is skipped.
    • Similarly, when all mask bits are 1, the ELSE part is skipped.


GPUs – NVIDIA ISA

  • Conditional branching performance:
    • How frequently does divergence occur?
    • In the best case, all mask bits are the same, and only the THEN or the ELSE part is executed!
    • If at least one CUDA thread diverges, we need two passes:
      • 50% efficiency when the THEN and ELSE parts are of equal length.
    • With nested IF-THEN-ELSE, the cost is higher:
      • Doubly nested → 25%
      • Triply nested → 12.5%
    • An active research area for optimization!
  • Optimization: avoid divergence when possible.
    • if (threadIdx.x > 2) splits threads of the same warp across both paths → divergence.
    • if (threadIdx.x / WARP_SIZE > 2) makes the branch uniform within each warp → no warp diverges (see the sketch below).
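A minimal CUDA sketch contrasting the two conditions (kernel and buffer names are mine):

    #define WARP_SIZE 32

    __global__ void divergent(float *d) {
        // Threads 0-2 and 3-31 of the first warp take different paths,
        // so that warp executes both paths serially (two passes).
        if (threadIdx.x > 2) d[threadIdx.x] = 1.0f;
        else                 d[threadIdx.x] = 2.0f;
    }

    __global__ void warp_aligned(float *d) {
        // All 32 threads of any given warp evaluate this condition the
        // same way, so every warp takes a single pass through one path.
        if (threadIdx.x / WARP_SIZE > 2) d[threadIdx.x] = 1.0f;
        else                             d[threadIdx.x] = 2.0f;
    }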


GPUs – NVIDIA Memory Structure


Private
  • Off-chip; recently also cached in the L1 and L2 caches.
  • Per CUDA thread; holds the stack and spilled registers.

Local
  • On-chip.
  • One per multithreaded SIMD processor.
  • Shared between the threads in a block.
  • Dynamically allocated to blocks.

Global
  • Off-chip.
  • Shared by all SIMD processors.
  • Accessed by the host.


GPUs vs. Vector Processors

  • Both architectures
    • Designed to execute DLP programs
    • Have multiple processors
  • However, architecturally:
    • GPUs rely on multithreading to hide memory latency (vector processors rely on deep pipelines).
    • GPUs have more registers!
    • GPUs have more lanes (8–16 vs. 2–8).
  • Memory:
    • Vector processors have explicit unit-stride loads/stores.
    • In GPUs, this is implicit (address coalescing combines per-thread accesses).
  • Branches:
    • Vector processors manage masks explicitly in software; GPUs do it in hardware at run time!
    • Strip-mining in vector processors requires the VLR; GPUs iterate the loop and mask off the unused lanes in the last iteration.


GPUs vs. Vector Processors

  • Control unit
    • In vector processors, it handles both vector and scalar operations.
    • GPUs have no comparable control processor; the closest analog is the thread block scheduler (less power efficient).

  • Scalar processor
    • Vector processors include a separate, simple scalar processor.
    • GPUs have none; they use a single SIMD lane and disable the others, rather than calling on the system processor across the I/O bus (less power efficient and slower).


GPUs vs. Multimedia SIMD Processors

  • The scalar processor and multimedia SIMD instructions are tightly integrated in traditional computers; in GPUs, the two are separated by an I/O bus and even have separate main memories.
  • The multiple SIMD processors in a GPU use a single address space, but their caches are not coherent as they are in traditional multicore computers.
  • Unlike GPUs, multimedia SIMD instructions do not support gather-scatter memory accesses.


Reading Assignment

  • Section 4.7 – Putting it All Together

  • Appendix A from the Computer Organization and Design Textbook
