1 of 63

GPU Programming Model

Dr A Sahu

Dept of Comp Sc & Engg.

IIT Guwahati


2 of 63

Outline

  • Graphics System
  • GPU Architecture
  • Memory Model
    • Vertex Buffer, Texture buffer
  • GPU Programming Model
    • DirectX, OpenGL, OpenCL
  • GPGPU Programming
    • Introduction to NVIDIA CUDA Programming


3 of 63

Graphics System

(Figure: graphics system dataflow. The 3D application issues 3D API commands via OpenGL or DirectX/Direct3D; across the CPU-GPU boundary these become the GPU command & data stream. On the GPU, a vertex index stream feeds the programmable vertex processor (pretransformed vertices → transformed vertices); primitive assembly produces assembled polygons, lines & points; rasterisation and interpolation emit rasterised pretransformed fragments to the programmable fragment processors (→ transformed fragments); and raster operations apply pixel updates from the pixel location stream to the frame buffer.)

4 of 63

Graphics System

(Figure: Vertices (x, y, z) → Vertex Processing (vertex shader) → Pixel Processing (pixel shader) → Pixels (R, G, B); both stages access the memory system, which holds texture memory and the frame buffer.)

5 of 63

The Graphics Pipeline

  • Primitives are processed in a series of stages
  • Each stage forwards its result on to the next stage
  • The pipeline can be drawn and implemented in different ways
  • Some stages may be in hardware, others in software
  • Optimizations & additional programmability are available at some stages

Modeling Transformations → Illumination (Shading) → Viewing Transformation (Perspective / Orthographic) → Clipping → Projection (to Screen Space) → Scan Conversion (Rasterization) → Visibility / Display


7 of 63

Programmable Graphics Hardware

  • Graphics pipeline (simplified)

(Figure: IN → Vertex Shader (object space) → Pixel Shader (window space) → OUT to the framebuffer; textures are read by the shaders.)

8 of 63

GPU vs CPU

  • The computing capacity of graphics processing units (GPUs) has improved exponentially over the past decade.

  • NVIDIA released the CUDA programming model for its GPUs.

  • The CUDA programming environment applies the parallel processing capabilities of GPUs to medical image processing research.

9 of 63

NVIDIA GeForce GTX 480

  • 480 CUDA cores
    • (CUDA: Compute Unified Device Architecture)
  • Microsoft® DirectX® 11 Support
  • 3D Vision™ Surround Ready
  • Interactive Ray Tracing
  • 3-way SLI® Technology
  • PhysX® Technology
  • CUDA™ Technology
  • 32x Anti-aliasing Technology
  • PureVideo® HD Technology
  • PCI Express 2.0 Support.
  • Dual-link DVI Support, HDMI 1.4

10 of 63

Generation IV: Radeon 9700/GeForce FX (2002)

  • This generation is the first generation of fully programmable graphics cards
  • Different versions have different resource limits on fragment/vertex programs

(Figure: AGP → programmable vertex shader (vertex transforms) → primitive assembly → rasterization and interpolation → programmable fragment processor → raster operations → frame buffer.)

11 of 63

High-level shading language

  • Writing assembly is
    • Painful
    • Not portable
    • Not easily optimized
  • High-level shading languages solve these
    • Cg, HLSL

12 of 63

Memory Hierarchy

  • CPU and GPU Memory Hierarchy

(Figure: CPU hierarchy: Disk → CPU Main Memory → CPU Caches → CPU Registers. GPU hierarchy: GPU Video Memory → GPU Caches → GPU Temporary Registers and GPU Constant Registers.)

13 of 63

GPU Memory Model

  • Much more restricted memory access
    • Allocate/free memory only before computation
    • Limited memory access during computation (kernel)
      • Registers
        • Read/write
      • Local memory
        • Does not exist
      • Global memory
        • Read-only during computation
        • Write-only at end of computation (pre-computed address)
      • Disk access
        • Does not exist

14 of 63

CPU Memory Model

  • At any program point
    • Allocate/free local or global memory
    • Random memory access
      • Registers
        • Read/write
      • Local memory
        • Read/write to stack
      • Global memory
        • Read/write to heap
      • Disk
        • Read/write to disk

15 of 63

GPU Memory Model

  • Where is GPU Data Stored?
    • Vertex buffer
    • Frame buffer
    • Texture

(Figure: Vertex Buffer → Vertex Processor → Rasterizer → Fragment Processor → Frame Buffer(s); Texture is readable by the fragment processor, and by the vertex processor on VS 3.0 GPUs.)

16 of 63

GPU Memory API

  • Each GPU memory type supports a subset of the following operations
    • CPU interface
    • GPU interface

17 of 63

GPU Memory API

  • CPU interface
    • Allocate
    • Free
    • Copy CPU → GPU
    • Copy GPU → CPU
    • Copy GPU → GPU
    • Bind for read-only vertex stream access
    • Bind for read-only random access
    • Bind for write-only framebuffer access


18 of 63

GPU Memory API

  • GPU (shader/kernel) interface
    • Random-access read
    • Stream read

19 of 63

Vertex Buffers

  • GPU memory for vertex data
  • Vertex data required to initiate render pass

(Figure: Vertex Buffer → Vertex Processor → Rasterizer → Fragment Processor → Frame Buffer(s); Texture is readable by the fragment processor, and by the vertex processor on VS 3.0 GPUs.)

20 of 63

Vertex Buffers

  • Supported Operations
    • CPU interface
      • Allocate
      • Free
      • Copy CPU → GPU
      • Copy GPU → GPU (Render-to-vertex-array)
      • Bind for read-only vertex stream access

    • GPU interface
      • Stream read (vertex program only)

21 of 63

Vertex Buffers

  • Limitations
    • CPU
      • No copy GPU → CPU
      • No bind for read-only random access
      • No bind for write-only framebuffer access
    • GPU
      • No random-access reads
      • No access from fragment programs

22 of 63

Textures

  • Random-access GPU memory

(Figure: Vertex Buffer → Vertex Processor → Rasterizer → Fragment Processor → Frame Buffer(s); Texture is readable by the fragment processor, and by the vertex processor on VS 3.0 GPUs.)

23 of 63

Textures

  • Supported Operations
    • CPU interface
      • Allocate
      • Free
      • Copy CPU → GPU
      • Copy GPU → CPU
      • Copy GPU → GPU (Render-to-texture)
      • Bind for read-only random access (vertex or fragment)
      • Bind for write-only framebuffer access
    • GPU interface
      • Random read

24 of 63

Framebuffer

  • Memory written by fragment processor
  • Write-only GPU memory

(Figure: Vertex Buffer → Vertex Processor → Rasterizer → Fragment Processor → Frame Buffer(s); Texture is readable by the fragment processor, and by the vertex processor on VS 3.0 GPUs.)

25 of 63

Programming Model: Early GPUs

  • Fixed function pipeline
    • Made early games look fairly similar
    • Little freedom in rendering
    • “One way to do things”
      • glShadeModel(GL_SMOOTH);
  • Different render methods
    • Triangle rasterization proved to be very efficient to implement in hardware
    • Ray tracing and voxels produce nice results but are very slow and require large amounts of memory

26 of 63

DirectX and OpenGL

  • DirectX before version 8 was entirely fixed function
  • OpenGL before version 2.0 was entirely fixed function
    • Extensions were often added for different effects, but there was no real programmability on the GPU
  • OpenGL is just a specification
    • Vendors must implement the specification, but on whatever platform they wish
  • DirectX is a library, Windows only
    • Direct3D is the graphics component

27 of 63

Programmability in GPUs

  • Direct3D 8.0 (2000) and OpenGL 2.0 (2004) added support for assembly-language programming of vertex and fragment shaders.
    • NVIDIA GeForce 3, ATI Radeon 8000
  • Direct3D 9.0 (2002) added HLSL (High Level Shader Language) for much easier programming of GPUs.
    • NVIDIA GeForce FX 5000, ATI Radeon 9000
  • Minor increments on this for a long time, with more capabilities being added to shaders.

28 of 63

GPU Pipeline

  • Vertex data sent in by graphics API
    • Mostly OpenGL or DirectX
  • Processed in vertex program – “vertex shader”
  • Rasterized into pixels
  • Processed in “fragment shader”

(Figure: Vertex Data → Vertex Shader → Rasterize To Pixels → Fragment Shader → Output.)

29 of 63

Shader Languages

  • No longer need to write shaders in assembly
  • GLSL, HLSL, and Cg offer C-style programming languages
  • Write two main() functions, one executed per vertex and one per pixel
  • Declare auxiliary functions, local variables
  • Output by setting position and color

30 of 63

Shader Unification

  • Prior to Direct3D 10 / GeForce 8000 / Radeon 2000, vertex and fragment shaders executed on separate hardware units.
  • Direct3D 10 (with Vista) brought shader unification and added geometry shaders.
    • GPUs now use the same ‘cores’ to run geometry/vertex/fragment shader code.
  • CUDA came out alongside the GeForce 8000 line, allowing these ‘cores’ to run general C code rather than being restricted to graphics APIs.

31 of 63

Unified Shader Pipeline (DX10, OpenGL 2, OpenGL 3)

(Figure: 3D geometric primitives → programmable unified processors running vertex, geometry, pixel, and compute programs → rasterization → hidden surface removal → final image; all stages share GPU memory (DRAM).)

32 of 63

Generalized GPU programming

  • CUDA was the first to drop the graphics API, allowing the GPU to be treated as a coprocessor to the CPU.
    • Linear memory accesses (no more buffer objects)
    • Run thousands of threads on separate scalar cores (with limitations)
    • High theoretical/achieved performance for data-parallel applications
  • ATI has the Stream SDK
    • Closer to assembly-language programming

33 of 63

OpenCL, DirectCompute

  • Apple announces OpenCL initiative in 2008
    • Officially owned by Khronos Group, the same that controls OpenGL
    • Released in 2009, with support from NVIDIA/ATI.
    • Another specification for parallel programming, not entirely specific to GPUs (support for CPU SSE instructions, etc.).
  • DirectX 11 (and a Direct3D 10 extension) adds DirectCompute shaders
    • Similar idea to OpenCL, just tied in with Direct3D


34 of 63

DirectX11, OpenGL4

  • DirectX 11 also adds multithreaded rendering and tessellation stages to the pipeline
    • Two new shader stages in the unified pipeline: Hull and Domain shaders
    • These allow high-detail geometry to be created on the GPU, rather than flooding the PCI-E bus with geometry data
    • More programmable geometry
  • OpenGL 4 (specification just released) is close to feature parity with Direct3D 11
    • Namely, it also adds tessellation

35 of 63

Modern GPU computing

  • The newest GPUs have incredible compute power
    • 1–3 TFLOPS, 100+ GB/s memory access bandwidth
  • More parallel constructs
    • High-speed atomic operations, more control over thread interaction/synchronization
  • Becoming easier to program
    • NVIDIA’s ‘Fermi’ architecture has support for C++ code, 64-bit pointers, etc.
  • GPU computing is starting to go mainstream
    • Photoshop CS5, video encode/decode, physics/fluid simulation, etc.

36 of 63

Motivation: Computational Power

  • GPUs are fast…
    • 3.0 GHz dual-core Pentium4: 24.6 GFLOPS
    • NVIDIA GeForceFX 7800: 165 GFLOPs
    • 1066 MHz FSB Pentium Extreme Edition : 8.5 GB/s
    • ATI Radeon X850 XT Platinum Edition: 37.8 GB/s
  • GPUs are getting faster, faster
    • CPUs: 1.4× annual growth
    • GPUs: 1.7×(pixels) to 2.3× (vertices) annual growth

37 of 63

Motivation: Flexible and Precise

  • Modern GPUs are deeply programmable
    • Programmable pixel, vertex, video engines
    • Solidifying high-level language support
  • Modern GPUs support high precision
    • 32 bit floating point throughout the pipeline
    • High enough for many (not all) applications

38 of 63

Problems: Difficult To Use

  • GPUs designed for & driven by video games
    • Programming model unusual
    • Programming idioms tied to computer graphics
    • Programming environment tightly constrained
  • Underlying architectures are:
    • Inherently parallel
    • Rapidly evolving (even in basic feature set!)
    • Largely secret
  • Can’t simply “port” CPU code!

39 of 63

Programming a GPU for Graphics

  • Application specifies geometry → rasterized
  • Each fragment is shaded with a SIMD program
  • Shading can use values from texture memory
  • Image can be used as texture on future passes

40 of 63

Programming a GPU for GP Programs

  • Draw a screen-sized quad → stream
  • Run a SIMD kernel over each fragment
  • “Gather” is permitted from texture memory
  • Resulting buffer can be treated as texture on next pass

41 of 63

Nvidia CUDA

  • Introduced in November 2006
  • Turns the GPU into a general-purpose processor
  • Required hardware changes
    • Only available on G80 or later GPUs
      • GeForce 8000 series or newer
  • Implemented as an extension to C/C++
    • Results in a lower learning curve

42 of 63

GeForce 8800 Specs

  • 16 Streaming Multiprocessors (SMs)
    • Each one has 8 Streaming Processors (SPs)
    • Each SM can execute 32 threads simultaneously
    • 512 threads execute per cycle
    • SPs hide instruction latencies
  • 768 MB DRAM
    • 86.4 GB/s memory bandwidth to GPU cores
    • 4 GB/s memory bandwidth with system memory
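
These figures can be queried at runtime through the CUDA runtime API; a minimal sketch (device 0 assumed):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);                  // query device 0
        printf("SMs: %d\n", prop.multiProcessorCount);      // 16 on a GeForce 8800 GTX
        printf("Global memory: %zu MB\n", prop.totalGlobalMem >> 20);
        printf("Shared memory per block: %zu KB\n", prop.sharedMemPerBlock >> 10);
        printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
        return 0;
    }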

43 of 63

Typical NVIDIA GPU Device Layout

(Figure: typical NVIDIA GPU device layout. The host feeds an input assembler and a thread execution manager; an array of streaming-processor clusters, each with its own parallel data cache and texture unit, connects through load/store units to global memory.)

44 of 63

CUDA Execution Model

  • Starts with a kernel
  • A kernel is a function called from the host that executes on the GPU
  • Thread resources are abstracted into 3 levels
    • Grid – highest level
    • Block – collection of threads
    • Thread – execution unit (see the indexing sketch below)
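
A minimal sketch of how the three levels appear in a kernel (the kernel name and data here are illustrative, not from the slides):

    // Each thread handles one element; the grid/block shape is chosen at launch.
    __global__ void scale(float *data, float alpha, int n)
    {
        // blockIdx picks the block within the grid, threadIdx the thread within the block
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] = alpha * data[i];
    }

    // Launch: a grid of 4 blocks, each a block of 256 threads (1024 threads total)
    // scale<<<4, 256>>>(d_data, 2.0f, 1024);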

45 of 63

CUDA Execution Model

46 of 63

CUDA Memory Model

  • 768 MB global memory
    • Accessible to all threads globally
    • 86.4 GB/s throughput
  • 16 KB shared memory per SM
    • Accessible to all threads within a block
    • 384 GB/s throughput
  • 32 KB register file per SM
    • Allocated to threads at runtime (local variables)
    • 384 GB/s throughput
    • Threads can only see their own registers (see the sketch below)
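
A sketch of how these spaces are named inside a kernel (illustrative; assumes a single block of 256 threads):

    __global__ void reverseBlock(float *g_data)    // g_data points into global memory
    {
        __shared__ float tile[256];                // shared memory: visible to the whole block
        float x = g_data[threadIdx.x];             // x lives in a register, private to this thread
        tile[255 - threadIdx.x] = x;
        __syncthreads();                           // wait for all threads' writes to shared memory
        g_data[threadIdx.x] = tile[threadIdx.x];   // write the reversed block back to global memory
    }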

47 of 63

CUDA Memory Model

(Figure: CUDA memory model. The host accesses the device's global memory. A grid contains blocks (e.g. Block (0, 0) and Block (1, 0)), each with its own shared memory; within a block, each thread (e.g. Thread (0, 0), Thread (1, 0)) has its own registers.)

48 of 63

How Do You Execute CUDA Kernel?

(From C/C++ function)

  • Allocate memory on CUDA device
  • Copy data to CUDA device
  • Configure thread resources
    • Grid layout (max 65535 × 65535 blocks)
    • Block layout (3-dimensional, max of 512 threads)
  • Execute kernel with thread resources
  • Copy data out of CUDA device
  • Free memory on CUDA device
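
A host-side sketch of these six steps (the kernel myKernel and buffer names are hypothetical):

    #include <cuda_runtime.h>

    __global__ void myKernel(float *d_buf, int n)   // hypothetical kernel: doubles each element
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d_buf[i] *= 2.0f;
    }

    void runOnDevice(const float *h_in, float *h_out, int n)
    {
        float *d_buf;
        size_t bytes = n * sizeof(float);
        cudaMalloc(&d_buf, bytes);                               // 1. allocate device memory
        cudaMemcpy(d_buf, h_in, bytes, cudaMemcpyHostToDevice);  // 2. copy data in
        dim3 grid((n + 255) / 256), block(256);                  // 3. configure threads (max 512/block)
        myKernel<<<grid, block>>>(d_buf, n);                     // 4. execute kernel
        cudaMemcpy(h_out, d_buf, bytes, cudaMemcpyDeviceToHost); // 5. copy data out
        cudaFree(d_buf);                                         // 6. free device memory
    }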

49 of 63

CUDA In Action: Matrix Multiplication

  • Multiply matrices M and N to form result R
  • General algorithm
    • For each row i in matrix R
      • For each column j in matrix R
        • Cell (i, j) = dot product of row i of M and column j of N
  • Algorithm runs in O(length³) (see the serial sketch below)
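
For reference, a serial C sketch of this algorithm (row-major width × width matrices assumed; the triple loop is what the O(length³) bound counts):

    // R = M x N, all matrices width x width, row-major
    void matmul_serial(const float *M, const float *N, float *R, int width)
    {
        for (int i = 0; i < width; ++i)            // each row of R
            for (int j = 0; j < width; ++j) {      // each column of R
                float sum = 0.0f;
                for (int k = 0; k < width; ++k)    // dot product of row i and column j
                    sum += M[i * width + k] * N[k * width + j];
                R[i * width + j] = sum;
            }
    }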

50 of 63

Matrix Multiplication On CUDA

  • Each thread represents cell (i, j)
  • Calculate value for cell (i, j)
  • Use single block
  • Should run in O(length)
    • Much better than O(length³)

51 of 63

Matrix Multiplication On CUDA

(Figure: matrices M and N, each WIDTH × WIDTH, multiply to give P, also WIDTH × WIDTH.)

52 of 63

Matrix Multiplication On CUDA Code
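
The code on this slide was an image; below is a minimal sketch of the single-block kernel it describes (one thread per result cell; the names Md, Nd, Pd follow the course's convention):

    __global__ void MatrixMulKernel(float *Md, float *Nd, float *Pd, int width)
    {
        // Single block, so threadIdx alone addresses result cell (row, col)
        int row = threadIdx.y;
        int col = threadIdx.x;
        float sum = 0.0f;
        for (int k = 0; k < width; ++k)
            sum += Md[row * width + k] * Nd[k * width + col];
        Pd[row * width + col] = sum;
    }

    // Launch with one block of width x width threads (width <= 22, since 22 x 22 = 484 <= 512):
    // dim3 block(width, width);
    // MatrixMulKernel<<<1, block>>>(Md, Nd, Pd, width);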

53 of 63

Limitations of This Approach

  • Max threads allowed per block is 512
  • Only supports a max matrix size of 22×22
    • 484 threads needed (22 × 22 = 484 ≤ 512)

54 of 63

CUDA Blocks

  • Split the result matrix into smaller blocks
  • Utilizes more SMs than the single-block approach
  • Better speed-up

55 of 63

Blocks Diagram

(Figure: tiled decomposition for blocked matrix multiply. The result matrix Pd is divided into TILE_WIDTH × TILE_WIDTH sub-matrices Pdsub; block indices (bx, by) select a tile of Pd and thread indices (tx, ty) an element within it, and each tile is computed from the corresponding strip of rows of Md and columns of Nd, each of size WIDTH.)

56 of 63

Matrix Multiplication Using Blocks
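
Again the slide's code was an image; here is a sketch of a tiled kernel in the spirit of the blocks diagram (the TILE_WIDTH value and the assumption that width divides evenly by it are mine):

    #define TILE_WIDTH 16   // assumed tile size: 16 x 16 = 256 threads per block

    __global__ void MatrixMulTiled(float *Md, float *Nd, float *Pd, int width)
    {
        __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];   // per-block staging in fast shared memory
        __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

        int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
        int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
        float sum = 0.0f;

        for (int t = 0; t < width / TILE_WIDTH; ++t) {  // walk tiles along the shared dimension
            Mds[threadIdx.y][threadIdx.x] = Md[row * width + t * TILE_WIDTH + threadIdx.x];
            Nds[threadIdx.y][threadIdx.x] = Nd[(t * TILE_WIDTH + threadIdx.y) * width + col];
            __syncthreads();                            // tile fully loaded
            for (int k = 0; k < TILE_WIDTH; ++k)
                sum += Mds[threadIdx.y][k] * Nds[k][threadIdx.x];
            __syncthreads();                            // done reading this tile
        }
        Pd[row * width + col] = sum;
    }

    // dim3 grid(width / TILE_WIDTH, width / TILE_WIDTH), block(TILE_WIDTH, TILE_WIDTH);
    // MatrixMulTiled<<<grid, block>>>(Md, Nd, Pd, width);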

57 of 63

Matrix Multiplication Speed Analysis

  • Runs 10 times as fast as the serial approach
  • The solution runs at 21.4 GFLOPS
    • The GPU is capable of 384 GFLOPS
    • What gives?

58 of 63

How GPU Executes Code

  • Each block is assigned to an SM
    • 8 SPs per SM
  • The SM executes one warp at a time
  • The SM switches warps when a long-latency operation is found
    • Works similarly to Intel’s Hyper-Threading
  • The SM executes a batch of 32 threads at a time
    • A batch of 32 threads is called a warp

59 of 63

GPU Constraints – Memory Speed

  • Global memory bandwidth is 86.4 GB/s
  • Shared memory bandwidth is 384 GB/s
  • Register file bandwidth is 384 GB/s
  • The key is to use shared memory and registers when possible

60 of 63

GPU Constraints – Memory Size

  • Each SM has 16 KB of shared memory
  • Each SM has a 32 KB register file
  • Local variables in a function take up registers
  • The register file must support all threads in the SM
    • With 768 resident threads, 32 KB leaves roughly 10 registers per thread (8192 registers ÷ 768 threads)
    • If there are not enough registers, fewer blocks are scheduled
    • The program still executes, but with less parallelism

61 of 63

GPU Constraints – Thread Count

  • An SM can only handle 768 threads
  • An SM can handle up to 8 blocks
  • With 8 resident blocks, each block can have up to 96 threads (8 × 96 = 768)
    • This maxes out SM resources

62 of 63

Larrabee

  • Intel’s new approach to a GPU
  • A hybrid that combines the functions of a multi-core CPU with those of a GPU

63 of 63

Thanks
