pmpp book ch. 1-3
CUDA-MODE
Lecture 2
Agenda for Lecture 2
Ch 1: Introduction
The Power Wall
Source: M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, C. Batten (1970-2010). K. Rupp (2010-2017).
[Figure: microprocessor trend data on a log scale (10^0 to 10^8). Transistor counts (thousands) keep climbing, while clock frequency (MHz) plateaus: increasing frequency further would make the chip too hot to cool feasibly.]
The rise of CUDA
Amdahl's Law
Challenges
Main Goals of the Book
1. Parallel programming & computational thinking
2. Correct & reliable: debugging function & performance
3. Scalability: regularize and localize memory access
Ch 2: Heterogeneous data parallel computing
RGB->Grayscale, data independence
CUDA C
Example: Vector Addition
[Figure: three equal-length arrays. Each element of output vector z is the sum of the corresponding elements of input vectors x and y: z[i] = x[i] + y[i].]
CUDA Essentials: Memory allocation
float *A_d;
size_t size = n * sizeof(float); // size in bytes
cudaMalloc((void**)&A_d, size); // pointer to pointer!
...
cudaFree(A_d);
cudaMemcpy: Host <-> Device Transfer
// copy input vectors to device (host -> device)
cudaMemcpy(A_d, A_h, size, cudaMemcpyHostToDevice);
cudaMemcpy(B_d, B_h, size, cudaMemcpyHostToDevice);
...
// transfer result back to CPU memory (device -> host)
cudaMemcpy(C_h, C_d, size, cudaMemcpyDeviceToHost);
CUDA Error handling
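Most CUDA runtime calls return a cudaError_t, and a common pattern is to wrap every call in a checking macro rather than ignore the return value. A minimal sketch (the macro name CUDA_CHECK is my own choice):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap each CUDA runtime call; on failure, print a readable error
// message with file/line and abort.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMalloc((void**)&A_d, size));
//   CUDA_CHECK(cudaMemcpy(A_d, A_h, size, cudaMemcpyHostToDevice));
```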
Kernel functions fn<<<>>>
Kernel Coordinates
Threads execute the same kernel code
__global__ & __host__
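A sketch of how the function-space qualifiers combine (the helper names here are illustrative):

```cuda
// __global__ : runs on the device, callable from the host (a kernel).
// __device__ : runs on the device, callable only from device code.
// __host__   : runs on the host; this is the default, and it can be
//              combined with __device__ to compile a function for both.

__device__ float square(float x) { return x * x; }          // device-only helper

__host__ __device__ float twice(float x) { return 2.0f * x; } // compiled for both

__global__ void squareKernel(float* out, const float* in, int n) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) out[i] = square(in[i]);  // kernel calls the device helper
}
```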
Vector Addition Example
// compute vector sum C = A + B
// each thread performs one pair-wise addition
__global__
void vecAddKernel(float* A, float* B, float* C, int n) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) { // check bounds
        C[i] = A[i] + B[i];
    }
}
Calling Kernels
dim3 numThreads(256);
dim3 numBlocks((n + numThreads.x - 1) / numThreads.x); // ceiling division; .x needed, dim3 itself is not an integer
vecAddKernel<<<numBlocks, numThreads>>>(A_d, B_d, C_d, n);
Compiler
Ch 3: Multidimensional grids and data
Grid continued
dim3 grid(32, 1, 1);
dim3 block(128, 1, 1);
kernelFunction<<<grid, block>>>(..);
// Number of threads: 128 * 32 = 4096
Built-in Variables
blockIdx // dim3 block coordinate
threadIdx // dim3 thread coordinate
blockDim // number of threads in a block
gridDim // number of blocks in a grid
nd-Arrays in Memory
- row-major:
A B C
D E F
G H I
Logical view of data (4x4):
0,0 | 0,1 | 0,2 | 0,3
1,0 | 1,1 | 1,2 | 1,3
2,0 | 2,1 | 2,2 | 2,3
3,0 | 3,1 | 3,2 | 3,3
Actual layout in memory (row-major, one flat array):
0,0 | 0,1 | 0,2 | 0,3 | 1,0 | 1,1 | 1,2 | 1,3 | 2,0 | 2,1 | 2,2 | 2,3 | 3,0 | 3,1 | 3,2 | 3,3
- column-major:
A D G
B E H
C F I
Image blur example (3.3, p. 60)
Handling Boundary Conditions
Matrix Multiplication
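A sketch of the PMPP-style matrix-multiplication kernel for square width x width matrices in row-major layout: each thread computes one output element as a dot product of a row of M and a column of N (kernel and parameter names follow the book's example but are not verbatim):

```cuda
// P = M * N for square width x width matrices, one thread per output element.
__global__ void matrixMulKernel(const float* M, const float* N,
                                float* P, int width) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < width && col < width) { // guard the ragged edges of the grid
        float sum = 0.0f;
        for (int k = 0; k < width; ++k)
            sum += M[row * width + k] * N[k * width + col]; // row-major indexing
        P[row * width + col] = sum;
    }
}

// Launch with a 2D grid covering the output matrix, e.g.:
//   dim3 block(16, 16);
//   dim3 grid((width + block.x - 1) / block.x,
//             (width + block.y - 1) / block.y);
//   matrixMulKernel<<<grid, block>>>(M_d, N_d, P_d, width);
```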