An Introduction to GPU Computing and CUDA Architecture
Sarah Tariq, NVIDIA Corporation
© NVIDIA Corporation 2011
GPU Computing
▪ GPU: Graphics Processing Unit
▪ Traditionally used for real-time rendering
▪ High computational density (100s of ALUs) and memory bandwidth (100+ GB/s)
▪ Throughput processor: uses 1000s of concurrent threads to hide memory latency, rather than the large, fast caches a CPU relies on
What is CUDA?
▪ CUDA Architecture
▪ Expose the GPU for general-purpose computing
▪ Retain performance
▪ CUDA C/C++
▪ Based on industry-standard C/C++
▪ Small set of extensions to enable heterogeneous programming
▪ Straightforward APIs to manage devices, memory, etc.
▪ This session introduces CUDA C/C++
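For a flavor of those extensions and APIs, here is a minimal sketch (the add kernel and variable names are illustrative, not from this deck): __global__ marks device code, the triple-angle-bracket syntax launches it, and cudaMalloc/cudaMemcpy/cudaFree manage device memory.

#include <stdio.h>

// __global__ declares a function that runs on the device, launched from the host
__global__ void add(int *a, int *b, int *c) {
    *c = *a + *b;   // a single thread adds two integers
}

int main(void) {
    int a = 2, b = 7, c = 0;
    int *d_a, *d_b, *d_c;   // pointers into device memory

    // Memory-management API: allocate and populate device memory
    cudaMalloc((void **)&d_a, sizeof(int));
    cudaMalloc((void **)&d_b, sizeof(int));
    cudaMalloc((void **)&d_c, sizeof(int));
    cudaMemcpy(d_a, &a, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, sizeof(int), cudaMemcpyHostToDevice);

    // Kernel-launch extension: <<<blocks, threads>>>
    add<<<1, 1>>>(d_a, d_b, d_c);

    cudaMemcpy(&c, d_c, sizeof(int), cudaMemcpyDeviceToHost);
    printf("c = %d\n", c);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}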
Introduction to CUDA C/C++
▪ What will you learn in this session?
▪ Start from “Hello World!”
▪ Write and launch CUDA C/C++ kernels
▪ Manage GPU memory
▪ Manage communication and synchronization
Prerequisites
▪ You (probably) need experience with C or C++
▪ You don’t need GPU experience
▪ You don’t need parallel programming experience
▪ You don’t need graphics experience
CONCEPTS
Heterogeneous Computing
Blocks
Threads
Indexing
Shared memory
__syncthreads()
Asynchronous operation
Handling errors
Managing devices
HELLO WORLD!
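As a hedged preview of where the walkthrough begins, a minimal "Hello World!" with device code might look like the sketch below (the kernel name mykernel is illustrative): an empty kernel launched on the GPU, followed by serial host code.

#include <stdio.h>

// __global__ marks a (here empty) function that runs on the device
__global__ void mykernel(void) {
}

int main(void) {
    mykernel<<<1, 1>>>();       // launch 1 block of 1 thread on the GPU
    printf("Hello World!\n");   // serial code running on the host
    return 0;
}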
Heterogeneous Computing
▪ Terminology:
▪ Host: the CPU and its memory (host memory)
▪ Device: the GPU and its memory (device memory)
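A minimal sketch of what that separation means in code (the buffer names h_buf and d_buf are illustrative): pointers into host memory and device memory are not interchangeable, and data moves between the two only by explicit copies.

#include <stdlib.h>
#include <string.h>

int main(void) {
    int size = 64 * sizeof(int);

    int *h_buf = (int *)malloc(size);    // host memory: valid on the CPU
    int *d_buf;
    cudaMalloc((void **)&d_buf, size);   // device memory: valid on the GPU

    memset(h_buf, 0, size);              // fine: h_buf lives in host memory
    // d_buf[0] = 1;                     // not fine: d_buf points into device memory

    // Explicit transfer between the two memories
    cudaMemcpy(d_buf, h_buf, size, cudaMemcpyHostToDevice);

    cudaFree(d_buf);
    free(h_buf);
    return 0;
}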
Heterogeneous Computing
#include <iostream>
#include <algorithm>
using namespace std;

#define N 1024
#define RADIUS 3
#define BLOCK_SIZE 16

// Parallel function: runs on the GPU, one thread per output element
__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory (including the halo elements)
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    // Synchronize (ensure all the data is available)
    __syncthreads();

    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];

    // Store the result
    out[gindex] = result;
}

void fill_ints(int *x, int n) {
    fill_n(x, n, 1);
}

// Serial code: runs on the host, with one parallel kernel launch in the middle
int main(void) {
    int *in, *out;      // host copies of in, out
    int *d_in, *d_out;  // device copies of in, out
    int size = (N + 2 * RADIUS) * sizeof(int);

    // Alloc space for host copies and set up values
    in  = (int *)malloc(size); fill_ints(in,  N + 2 * RADIUS);
    out = (int *)malloc(size); fill_ints(out, N + 2 * RADIUS);

    // Alloc space for device copies
    cudaMalloc((void **)&d_in,  size);
    cudaMalloc((void **)&d_out, size);

    // Copy to device
    cudaMemcpy(d_in,  in,  size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_out, out, size, cudaMemcpyHostToDevice);

    // Launch stencil_1d() kernel on GPU (the parallel portion of the program)
    stencil_1d<<<N / BLOCK_SIZE, BLOCK_SIZE>>>(d_in + RADIUS, d_out + RADIUS);

    // Copy result back to host
    cudaMemcpy(out, d_out, size, cudaMemcpyDeviceToHost);

    // Cleanup
    free(in); free(out);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
Simple Processing Flow
1. Copy input data from CPU memory to GPU memory (over the PCI bus)
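A minimal sketch of step 1 (buffer names illustrative): input is prepared in host memory, then moved across the PCI bus into device memory with cudaMemcpy.

#include <stdlib.h>

#define N 1024

int main(void) {
    int bytes = N * sizeof(float);

    float *h_in = (float *)malloc(bytes);   // input prepared in CPU memory
    for (int i = 0; i < N; i++) h_in[i] = 1.0f;

    float *d_in;
    cudaMalloc((void **)&d_in, bytes);      // destination in GPU memory

    // Step 1: copy input data from CPU memory to GPU memory
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    cudaFree(d_in);
    free(h_in);
    return 0;
}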