1 of 63

GPU Programming Model

Dr A Sahu

Dept of Comp Sc & Engg.

IIT Guwahati


2 of 63

Outline

  • Graphics System
  • GPU Architecture
  • Memory Model
    • Vertex Buffer, Texture buffer
  • GPU Programming Model
    • DirectX, OpenGL, OpenCL
  • GPGPU Programming
    • Introduction to NVIDIA CUDA Programming


3 of 63

Graphics System

(Figure: graphics system dataflow. The 3D application issues 3D API commands via OpenGL or DirectX/Direct3D; across the CPU-GPU boundary these become the GPU command & data stream. On the GPU, a vertex index stream feeds the programmable vertex processor (pretransformed vertices → transformed vertices); primitive assembly produces assembled polygons, lines & points; rasterisation and interpolation emit rasterised pretransformed fragments to the programmable fragment processors (→ transformed fragments); and raster operations apply pixel updates from the pixel location stream to the frame buffer.)

4 of 63

Graphics System

(Figure: Vertices (x, y, z) → Vertex Processing (vertex shader) → Pixel Processing (pixel shader) → Pixels (R, G, B); both stages access the memory system, which holds texture memory and the frame buffer.)

5 of 63

The Graphics Pipeline

  • Primitives are processed in a series of stages
  • Each stage forwards its result on to the next stage
  • The pipeline can be drawn and implemented in different ways
  • Some stages may be in hardware, others in software
  • Optimizations & additional programmability are available at some stages

Modeling Transformations → Illumination (Shading) → Viewing Transformation (Perspective / Orthographic) → Clipping → Projection (to Screen Space) → Scan Conversion (Rasterization) → Visibility / Display


7 of 63

Programmable Graphics Hardware

  • Graphics pipeline (simplified)

(Figure: IN → Vertex Shader (object space) → Pixel Shader (window space) → OUT to the framebuffer; textures are read by the shaders.)

8 of 63

GPU vs CPU

  • The computing capacity of graphics processing units (GPUs) has improved exponentially over the past decade.

  • NVIDIA released the CUDA programming model for its GPUs.

  • The CUDA programming environment applies the parallel processing capabilities of GPUs to medical image processing research.

9 of 63

NVIDIA GeForce GTX 480

  • 480 CUDA cores
    • (CUDA: Compute Unified Device Architecture)
  • Microsoft® DirectX® 11 Support
  • 3D Vision™ Surround Ready
  • Interactive Ray Tracing
  • 3-way SLI® Technology
  • PhysX® Technology
  • CUDA™ Technology
  • 32x Anti-aliasing Technology
  • PureVideo® HD Technology
  • PCI Express 2.0 Support.
  • Dual-link DVI Support, HDMI 1.4

10 of 63

Generation IV: Radeon 9700/GeForce FX (2002)

  • This generation is the first generation of fully programmable graphics cards
  • Different versions have different resource limits on fragment/vertex programs

(Figure: AGP → programmable vertex shader (vertex transforms) → primitive assembly → rasterization and interpolation → programmable fragment processor → raster operations → frame buffer.)

11 of 63

High-level shading language

  • Writing assembly is
    • Painful
    • Not portable
    • Not easily optimized
  • High-level shading languages solve these
    • Cg, HLSL

12 of 63

Memory Hierarchy

  • CPU and GPU Memory Hierarchy

(Figure: CPU hierarchy: Disk → CPU Main Memory → CPU Caches → CPU Registers. GPU hierarchy: GPU Video Memory → GPU Caches → GPU Temporary Registers and GPU Constant Registers.)

13 of 63

GPU Memory Model

  • Much more restricted memory access
    • Allocate/free memory only before computation
    • Limited memory access during computation (kernel)
      • Registers
        • Read/write
      • Local memory
        • Does not exist
      • Global memory
        • Read-only during computation
        • Write-only at end of computation (pre-computed address)
      • Disk access
        • Does not exist

14 of 63

CPU Memory Model

  • At any program point
    • Allocate/free local or global memory
    • Random memory access
      • Registers
        • Read/write
      • Local memory
        • Read/write to stack
      • Global memory
        • Read/write to heap
      • Disk
        • Read/write to disk

15 of 63

GPU Memory Model

  • Where is GPU Data Stored?
    • Vertex buffer
    • Frame buffer
    • Texture

(Figure: Vertex Buffer → Vertex Processor → Rasterizer → Fragment Processor → Frame Buffer(s); Texture is readable by the fragment processor, and by the vertex processor on VS 3.0 GPUs.)

16 of 63

GPU Memory API

  • Each GPU memory type supports a subset of the following operations
    • CPU interface
    • GPU interface

17 of 63

GPU Memory API

  • CPU interface
    • Allocate
    • Free
    • Copy CPU → GPU
    • Copy GPU → CPU
    • Copy GPU → GPU
    • Bind for read-only vertex stream access
    • Bind for read-only random access
    • Bind for write-only framebuffer access


18 of 63

GPU Memory API

  • GPU (shader/kernel) interface
    • Random-access read
    • Stream read

19 of 63

Vertex Buffers

  • GPU memory for vertex data
  • Vertex data required to initiate render pass

(Figure: Vertex Buffer → Vertex Processor → Rasterizer → Fragment Processor → Frame Buffer(s); Texture is readable by the fragment processor, and by the vertex processor on VS 3.0 GPUs.)

20 of 63

Vertex Buffers

  • Supported Operations
    • CPU interface
      • Allocate
      • Free
      • Copy CPU → GPU
      • Copy GPU → GPU (Render-to-vertex-array)
      • Bind for read-only vertex stream access

    • GPU interface
      • Stream read (vertex program only)

21 of 63

Vertex Buffers

  • Limitations
    • CPU
      • No copy GPU → CPU
      • No bind for read-only random access
      • No bind for write-only framebuffer access
    • GPU
      • No random-access reads
      • No access from fragment programs

22 of 63

Textures

  • Random-access GPU memory

(Figure: Vertex Buffer → Vertex Processor → Rasterizer → Fragment Processor → Frame Buffer(s); Texture is readable by the fragment processor, and by the vertex processor on VS 3.0 GPUs.)

23 of 63

Textures

  • Supported Operations
    • CPU interface
      • Allocate
      • Free
      • Copy CPU → GPU
      • Copy GPU → CPU
      • Copy GPU → GPU (Render-to-texture)
      • Bind for read-only random access (vertex or fragment)
      • Bind for write-only framebuffer access
    • GPU interface
      • Random read

24 of 63

Framebuffer

  • Memory written by fragment processor
  • Write-only GPU memory

(Figure: Vertex Buffer → Vertex Processor → Rasterizer → Fragment Processor → Frame Buffer(s); Texture is readable by the fragment processor, and by the vertex processor on VS 3.0 GPUs.)

25 of 63

Programming Model: Early GPUs

  • Fixed function pipeline
    • Made early games look fairly similar
    • Little freedom in rendering
    • “One way to do things”
      • glShadeModel(GL_SMOOTH);
  • Different render methods
    • Triangle rasterization proved to be very efficient to implement in hardware
    • Ray tracing and voxels produce nice results but are very slow and require large amounts of memory

26 of 63

DirectX and OpenGL

  • DirectX before version 8 was entirely fixed function
  • OpenGL before version 2.0 was entirely fixed function
    • Extensions were often added for different effects, but there was no real programmability on the GPU
  • OpenGL is just a specification
    • Vendors must implement the specification, but on whatever platform they wish
  • DirectX is a library, Windows only
    • Direct3D is the graphics component

27 of 63

Programmability in GPUs

  • Direct3D 8.0 (2000) and OpenGL 2.0 (2004) added support for assembly-language programming of vertex and fragment shaders.
    • NVIDIA GeForce 3, ATI Radeon 8000
  • Direct3D 9.0 (2002) added HLSL (High Level Shader Language) for much easier programming of GPUs.
    • NVIDIA GeForce FX 5000, ATI Radeon 9000
  • Minor increments on this for a long time, with more capabilities being added to shaders.

28 of 63

GPU Pipeline

  • Vertex data sent in by graphics API
    • Mostly OpenGL or DirectX
  • Processed in vertex program – “vertex shader”
  • Rasterized into pixels
  • Processed in “fragment shader”

(Figure: Vertex Data → Vertex Shader → Rasterize To Pixels → Fragment Shader → Output.)

29 of 63

Shader Languages

  • No longer need to write shaders in assembly
  • GLSL, HLSL, and Cg offer C-style programming languages
  • Write two main() functions, one executed per vertex and one per pixel
  • Declare auxiliary functions, local variables
  • Output by setting position and color

30 of 63

Shader Unification

  • Prior to Direct3D 10 / GeForce 8000 / Radeon 2000, vertex and fragment shaders executed on separate hardware units.
  • Direct3D 10 (with Vista) brought shader unification and added geometry shaders.
    • GPUs now use the same ‘cores’ to run geometry/vertex/fragment shader code.
  • CUDA came out alongside the GeForce 8000 line, allowing these ‘cores’ to run general C code rather than being restricted to graphics APIs.

31 of 63

Unified Shader Pipeline (DX10, OpenGL 2, OpenGL 3)

(Figure: 3D geometric primitives → programmable unified processors running vertex, geometry, pixel, and compute programs → rasterization → hidden surface removal → final image; all stages share GPU memory (DRAM).)

32 of 63

Generalized GPU programming

  • CUDA was the first to drop the graphics API, allowing the GPU to be treated as a coprocessor to the CPU.
    • Linear memory accesses (no more buffer objects)
    • Run thousands of threads on separate scalar cores (with limitations)
    • High theoretical/achieved performance for data-parallel applications
  • ATI has the Stream SDK
    • Closer to assembly-language programming

33 of 63

OpenCL, DirectCompute

  • Apple announces OpenCL initiative in 2008
    • Officially owned by Khronos Group, the same that controls OpenGL
    • Released in 2009, with support from NVIDIA/ATI.
    • Another specification for parallel programming, not entirely specific to GPUs (support for CPU SSE instructions, etc.).
  • DirectX 11 (and a Direct3D 10 extension) adds DirectCompute shaders
    • Similar idea to OpenCL, just tied in with Direct3D


34 of 63

DirectX11, OpenGL4

  • DirectX 11 also adds multithreaded rendering and tessellation stages to the pipeline
    • Two new shader stages in the unified pipeline: Hull and Domain shaders
    • These allow high-detail geometry to be created on the GPU, rather than flooding the PCI-E bus with geometry data
    • More programmable geometry
  • OpenGL 4 (specification just released) is close to feature parity with Direct3D 11
    • Namely, it also adds tessellation

35 of 63

Modern GPU computing

  • The newest GPUs have incredible compute power
    • 1–3 TFLOPS, 100+ GB/s memory access bandwidth
  • More parallel constructs
    • High-speed atomic operations, more control over thread interaction/synchronization
  • Becoming easier to program
    • NVIDIA’s ‘Fermi’ architecture has support for C++ code, 64-bit pointers, etc.
  • GPU computing is starting to go mainstream
    • Photoshop CS5, video encode/decode, physics/fluid simulation, etc.

36 of 63

Motivation: Computational Power

  • GPUs are fast…
    • 3.0 GHz dual-core Pentium4: 24.6 GFLOPS
    • NVIDIA GeForceFX 7800: 165 GFLOPs
    • 1066 MHz FSB Pentium Extreme Edition : 8.5 GB/s
    • ATI Radeon X850 XT Platinum Edition: 37.8 GB/s
  • GPUs are getting faster, faster
    • CPUs: 1.4× annual growth
    • GPUs: 1.7×(pixels) to 2.3× (vertices) annual growth

37 of 63

Motivation: Flexible and Precise

  • Modern GPUs are deeply programmable
    • Programmable pixel, vertex, video engines
    • Solidifying high-level language support
  • Modern GPUs support high precision
    • 32 bit floating point throughout the pipeline
    • High enough for many (not all) applications

38 of 63

Problems: Difficult To Use

  • GPUs designed for & driven by video games
    • Programming model unusual
    • Programming idioms tied to computer graphics
    • Programming environment tightly constrained
  • Underlying architectures are:
    • Inherently parallel
    • Rapidly evolving (even in basic feature set!)
    • Largely secret
  • Can’t simply “port” CPU code!

39 of 63

Programming a GPU for Graphics

  • Application specifies geometry → rasterized
  • Each fragment is shaded with a SIMD program
  • Shading can use values from texture memory
  • Image can be used as texture on future passes

40 of 63

Programming a GPU for GP Programs

  • Draw a screen-sized quad → stream
  • Run a SIMD kernel over each fragment
  • “Gather” is permitted from texture memory
  • Resulting buffer can be treated as texture on next pass

41 of 63

Nvidia CUDA

  • Introduced in November 2006
  • Turns the GPU into a general-purpose processor
  • Required hardware changes
    • Only available on G80 or later GPUs
      • GeForce 8000 series or newer
  • Implemented as an extension to C/C++
    • Results in a lower learning curve

42 of 63

GeForce 8800 Specs

  • 16 Streaming Multiprocessors (SMs)
    • Each one has 8 Streaming Processors (SPs)
    • Each SM can execute 32 threads simultaneously
    • 512 threads execute per cycle
    • SPs hide instruction latencies
  • 768 MB DRAM
    • 86.4 GB/s memory bandwidth to GPU cores
    • 4 GB/s memory bandwidth with system memory
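
These figures can be queried at runtime through the CUDA runtime API; a minimal sketch (device 0 assumed):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);                  // query device 0
        printf("SMs: %d\n", prop.multiProcessorCount);      // 16 on a GeForce 8800 GTX
        printf("Global memory: %zu MB\n", prop.totalGlobalMem >> 20);
        printf("Shared memory per block: %zu KB\n", prop.sharedMemPerBlock >> 10);
        printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
        return 0;
    }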

43 of 63

Typical NVIDIA GPU Device Layout

(Figure: typical NVIDIA GPU device layout. The host feeds an input assembler and a thread execution manager; an array of streaming-processor clusters, each with its own parallel data cache and texture unit, connects through load/store units to global memory.)

44 of 63

CUDA Execution Model

  • Starts with a kernel
  • A kernel is a function called from the host that executes on the GPU
  • Thread resources are abstracted into 3 levels
    • Grid – highest level
    • Block – collection of threads
    • Thread – execution unit (see the indexing sketch below)
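
A minimal sketch of how the three levels appear in a kernel (the kernel name and data here are illustrative, not from the slides):

    // Each thread handles one element; the grid/block shape is chosen at launch.
    __global__ void scale(float *data, float alpha, int n)
    {
        // blockIdx picks the block within the grid, threadIdx the thread within the block
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] = alpha * data[i];
    }

    // Launch: a grid of 4 blocks, each a block of 256 threads (1024 threads total)
    // scale<<<4, 256>>>(d_data, 2.0f, 1024);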

45 of 63

CUDA Execution Model

46 of 63

CUDA Memory Model

  • 768 MB global memory
    • Accessible to all threads globally
    • 86.4 GB/s throughput
  • 16 KB shared memory per SM
    • Accessible to all threads within a block
    • 384 GB/s throughput
  • 32 KB register file per SM
    • Allocated to threads at runtime (local variables)
    • 384 GB/s throughput
    • Threads can only see their own registers (see the sketch below)
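
A sketch of how these spaces are named inside a kernel (illustrative; assumes a single block of 256 threads):

    __global__ void reverseBlock(float *g_data)    // g_data points into global memory
    {
        __shared__ float tile[256];                // shared memory: visible to the whole block
        float x = g_data[threadIdx.x];             // x lives in a register, private to this thread
        tile[255 - threadIdx.x] = x;
        __syncthreads();                           // wait for all threads' writes to shared memory
        g_data[threadIdx.x] = tile[threadIdx.x];   // write the reversed block back to global memory
    }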

47 of 63

CUDA Memory Model

(Figure: CUDA memory model. The host accesses the device's global memory. A grid contains blocks (e.g. Block (0, 0) and Block (1, 0)), each with its own shared memory; within a block, each thread (e.g. Thread (0, 0), Thread (1, 0)) has its own registers.)

48 of 63

How Do You Execute CUDA Kernel?

(From C/C++ function)

  • Allocate memory on CUDA device
  • Copy data to CUDA device
  • Configure thread resources
    • Grid layout (max 65535 × 65535 blocks)
    • Block layout (3-dimensional, max of 512 threads)
  • Execute kernel with thread resources
  • Copy data out of CUDA device
  • Free memory on CUDA device
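
A host-side sketch of these six steps (the kernel myKernel and buffer names are hypothetical):

    #include <cuda_runtime.h>

    __global__ void myKernel(float *d_buf, int n)   // hypothetical kernel: doubles each element
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d_buf[i] *= 2.0f;
    }

    void runOnDevice(const float *h_in, float *h_out, int n)
    {
        float *d_buf;
        size_t bytes = n * sizeof(float);
        cudaMalloc(&d_buf, bytes);                               // 1. allocate device memory
        cudaMemcpy(d_buf, h_in, bytes, cudaMemcpyHostToDevice);  // 2. copy data in
        dim3 grid((n + 255) / 256), block(256);                  // 3. configure threads (max 512/block)
        myKernel<<<grid, block>>>(d_buf, n);                     // 4. execute kernel
        cudaMemcpy(h_out, d_buf, bytes, cudaMemcpyDeviceToHost); // 5. copy data out
        cudaFree(d_buf);                                         // 6. free device memory
    }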

49 of 63

CUDA In Action: Matrix Multiplication

  • Multiply matrices M and N to form result R
  • General algorithm
    • For each row i in matrix R
      • For each column j in matrix R
        • Cell (i, j) = dot product of row i of M and column j of N
  • Algorithm runs in O(length³) (see the serial sketch below)
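
For reference, a serial C sketch of this algorithm (row-major width × width matrices assumed; the triple loop is what the O(length³) bound counts):

    // R = M x N, all matrices width x width, row-major
    void matmul_serial(const float *M, const float *N, float *R, int width)
    {
        for (int i = 0; i < width; ++i)            // each row of R
            for (int j = 0; j < width; ++j) {      // each column of R
                float sum = 0.0f;
                for (int k = 0; k < width; ++k)    // dot product of row i and column j
                    sum += M[i * width + k] * N[k * width + j];
                R[i * width + j] = sum;
            }
    }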

50 of 63

Matrix Multiplication On CUDA

  • Each thread represents cell (i, j)
  • Calculate value for cell (i, j)
  • Use single block
  • Should run in O(length)
    • Much better than O(length³)

51 of 63

Matrix Multiplication On CUDA

(Figure: matrices M and N, each WIDTH × WIDTH, multiply to give P, also WIDTH × WIDTH.)

52 of 63

Matrix Multiplication On CUDA Code
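
The code on this slide was an image; below is a minimal sketch of the single-block kernel it describes (one thread per result cell; the names Md, Nd, Pd follow the course's convention):

    __global__ void MatrixMulKernel(float *Md, float *Nd, float *Pd, int width)
    {
        // Single block, so threadIdx alone addresses result cell (row, col)
        int row = threadIdx.y;
        int col = threadIdx.x;
        float sum = 0.0f;
        for (int k = 0; k < width; ++k)
            sum += Md[row * width + k] * Nd[k * width + col];
        Pd[row * width + col] = sum;
    }

    // Launch with one block of width x width threads (width <= 22, since 22 x 22 = 484 <= 512):
    // dim3 block(width, width);
    // MatrixMulKernel<<<1, block>>>(Md, Nd, Pd, width);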

53 of 63

Limitations of This Approach

  • Max threads allowed per block is 512
  • Only supports a max matrix size of 22×22
    • 484 threads needed (22 × 22 = 484 ≤ 512)

54 of 63

CUDA Blocks

  • Split the result matrix into smaller blocks
  • Utilizes more SMs than the single-block approach
  • Better speed-up

55 of 63

Blocks Diagram

(Figure: tiled decomposition for blocked matrix multiply. The result matrix Pd is divided into TILE_WIDTH × TILE_WIDTH sub-matrices Pdsub; block indices (bx, by) select a tile of Pd and thread indices (tx, ty) an element within it, and each tile is computed from the corresponding strip of rows of Md and columns of Nd, each of size WIDTH.)

56 of 63

Matrix Multiplication Using Blocks
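
Again the slide's code was an image; here is a sketch of a tiled kernel in the spirit of the blocks diagram (the TILE_WIDTH value and the assumption that width divides evenly by it are mine):

    #define TILE_WIDTH 16   // assumed tile size: 16 x 16 = 256 threads per block

    __global__ void MatrixMulTiled(float *Md, float *Nd, float *Pd, int width)
    {
        __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];   // per-block staging in fast shared memory
        __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

        int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
        int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
        float sum = 0.0f;

        for (int t = 0; t < width / TILE_WIDTH; ++t) {  // walk tiles along the shared dimension
            Mds[threadIdx.y][threadIdx.x] = Md[row * width + t * TILE_WIDTH + threadIdx.x];
            Nds[threadIdx.y][threadIdx.x] = Nd[(t * TILE_WIDTH + threadIdx.y) * width + col];
            __syncthreads();                            // tile fully loaded
            for (int k = 0; k < TILE_WIDTH; ++k)
                sum += Mds[threadIdx.y][k] * Nds[k][threadIdx.x];
            __syncthreads();                            // done reading this tile
        }
        Pd[row * width + col] = sum;
    }

    // dim3 grid(width / TILE_WIDTH, width / TILE_WIDTH), block(TILE_WIDTH, TILE_WIDTH);
    // MatrixMulTiled<<<grid, block>>>(Md, Nd, Pd, width);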

57 of 63

Matrix Multiplication Speed Analysis

  • Runs 10 times as fast as the serial approach
  • The solution runs at 21.4 GFLOPS
    • The GPU is capable of 384 GFLOPS
    • What gives?

58 of 63

How GPU Executes Code

  • Each block is assigned to an SM
    • 8 SPs per SM
  • The SM executes one warp at a time
  • The SM switches warps when a long-latency operation is found
    • Works similarly to Intel’s Hyper-Threading
  • The SM executes a batch of 32 threads at a time
    • A batch of 32 threads is called a warp

59 of 63

GPU Constraints – Memory Speed

  • Global memory bandwidth is 86.4 GB/s
  • Shared memory bandwidth is 384 GB/s
  • Register file bandwidth is 384 GB/s
  • The key is to use shared memory and registers when possible

60 of 63

GPU Constraints – Memory Size

  • Each SM has 16 KB of shared memory
  • Each SM has a 32 KB register file
  • Local variables in a function take up registers
  • The register file must support all threads in the SM
    • With 768 resident threads, 32 KB leaves roughly 10 registers per thread (8192 registers ÷ 768 threads)
    • If there are not enough registers, fewer blocks are scheduled
    • The program still executes, but with less parallelism

61 of 63

GPU Constraints – Thread Count

  • An SM can only handle 768 threads
  • An SM can handle up to 8 blocks
  • With 8 resident blocks, each block can have up to 96 threads (8 × 96 = 768)
    • This maxes out SM resources

62 of 63

Larrabee

  • Intel’s new approach to a GPU
  • A hybrid that combines the functions of a multi-core CPU with those of a GPU

63 of 63

Thanks
