1 of 55

GPU programming model

Jumanazarov Mardonbek

2 of 55

Plan:

  1. GPU Programming Abstractions: A Widely Used Framework
  2. Code Structure for the GPU Programming Model
  3. Optimizing GPU Resource Utilization
  4. The Reduction Pattern Requires Synchronization Between Workgroups
  5. Asynchronous Computing via Queues (Streams)
  6. Developing a Parallelization Plan for GPU Processors
  7. Exercises

3 of 55

In this chapter, we'll gradually develop an abstract model of how work is performed on GPU processors. This programming model is applicable to a wide range of GPU devices from different vendors and different types of GPUs within each vendor. It's also simpler than what happens on real hardware, covering only the essential aspects needed for application development. Fortunately, different GPUs share a lot of similarities in their structure. This model is a natural result of the needs of high-performance graphics applications.

The choice of data structures and algorithms has a long-term impact on the performance and ease of programming of GPUs. Having a good mental model of the GPU will help you plan how to map data structures and algorithms onto GPU parallelism. As application developers, our primary goal, particularly with GPUs, is to expose as much parallelism as possible. To keep thousands of threads busy, we need to fundamentally redesign the work so that it is broken into many small tasks distributed among the threads. A GPU language, like any other parallel programming language, must provide several capabilities. It needs a way to:

  • express computational loops in a GPU-parallel form (see Section 10.2);
  • move data between the host CPU and the GPU computing device (see Section 10.2.4);
  • coordinate work between the threads needed for reduction (see Section 10.4).

4 of 55

Consider how these three components are implemented in each GPU programming language. In some languages, you directly control certain aspects to implement the necessary operations, while in others, you rely on the compiler or template programming. While GPU operations may seem arcane at first glance, they are not that different from what is required for parallel code on a CPU. We must write fine-grained parallelism-safe loops, sometimes called do concurrent in Fortran, or forall or foreach in C/C++. We must think about moving data between nodes, processes, and the CPU. We must also have special mechanisms for reductions.

In native GPU computing languages, such as CUDA and OpenCL, the programming model is an integral part of the language. These GPU languages are discussed in Chapter 12. With the native languages, you explicitly control many aspects of GPU parallelism in your program. Either way, our programming model will better prepare you to make important programming decisions for higher performance and scalability across a wide range of GPU hardware.

If you're using a higher-level programming language, such as the pragma-oriented GPU languages described in Chapter 11, do you really need to understand all the details of the GPU programming model? Even with pragmas, it's still useful to understand how work is distributed. When you use a pragma, you're trying to guide the compiler and library down the right path. In some ways, this is more difficult than writing the program directly.

5 of 55

The goal of this chapter is to help you design an application for the GPU. It is largely independent of any GPU programming language. There are some questions that should be answered upfront. How will you organize your work, and what performance can you expect? Or, more simply, should you port your application to the GPU at all or stick with the CPU? GPU-based platforms, promising lower power consumption and orders of magnitude performance improvements, are attractive. But they are not a panacea for every application and use case. Let's dive into the details of the GPU programming model and see what it can do for you.

NOTE: We recommend consulting the examples in this chapter at https://github.com/EssentialsofParallelComputing/Chapter10.

6 of 55

1. GPU Programming Abstractions: A Widely Used Framework

GPU programming abstractions are possible for a number of reasons. The basic characteristics, which we'll explore in more detail shortly, are listed below. Then, we'll briefly discuss several basic terms of GPU parallelism.

  • Graphics operations are massively parallel.
  • Operations cannot be coordinated across operational tasks.

Massively Parallel

These abstractions are based on the components required for high-performance graphics on GPUs. GPU workloads have certain special characteristics that foster commonality in GPU processing techniques. High-frame-rate, high-quality graphics require processing and displaying large numbers of pixels, triangles, and polygons.

Due to the large volume of data, GPU processors are massively parallel. The operations on the data are generally identical, so GPUs apply a single instruction to multiple data elements to achieve greater efficiency. Figure 10.1 shows common programming abstractions widely used across various GPU vendors and models. These can be summarized into three or four basic techniques.

7 of 55

1. GPU Programming Abstractions: A Widely Used Framework

We start with the computational domain and iteratively decompose the work into the following components. We discuss each of these subdivisions of work in Sections 10.1.4–10.1.8:

  • data decomposition;
  • chunking work into groups that operate on some shared local memory;
  • operating on multiple data elements with a single instruction;
  • vectorization (on some GPUs).

Figure 10.1 Our mental model of GPU parallelization contains common programming abstractions that are widespread across most GPU hardware.

8 of 55

1. GPU Programming Abstractions: A Widely Used Framework

One important conclusion follows from these GPU parallel abstractions: in principle, there are three, or perhaps four, different levels of parallelization that you can apply to a computational loop. For the primary graphics use case, there's little need to go beyond two or three dimensions and the corresponding number of parallelization levels. If your algorithm has more dimensions or levels, you must merge some loops to fully parallelize your problem.

Inability to Maintain Coordination Among Operational Tasks

Graphics workloads don't require much coordination within operations. However, as we'll see in subsequent sections, there are algorithms, such as reductions, that require coordination. We'll need to develop complex schemes to handle these situations.

GPU Parallelism Terminology

Terminology for the components of GPU parallelism varies from vendor to vendor, which can be confusing when reading programming documentation or articles. To help match up the terms, we have summarized each vendor's official terminology in Table 10.1.

9 of 55

1. GPU Programming Abstractions: A Widely Used Framework

OpenCL is an open standard for GPU programming, so we use its terminology as a baseline. OpenCL runs on all GPU hardware and many other devices, such as CPUs, and even on more exotic hardware such as field-programmable gate arrays (FPGAs) and other embedded devices. CUDA, NVIDIA's proprietary language for its GPUs, is the most widely used language for GPU computing and, therefore, appears in much of the GPU programming documentation. HIP (Heterogeneous Computing Interface for Portability) is a portable CUDA-derived language developed by AMD for its GPUs; it uses the same terminology as CUDA. AMD's native Heterogeneous Computing (HC) compiler and Microsoft's C++ AMP language share many of the same terms. (At the time of this writing, C++ AMP is in maintenance mode and no longer under active development.) When trying to achieve portable performance, it is also important to consider the corresponding features and terms for the CPU, as shown in the last column of Table 10.1.

10 of 55

1. GPU Programming Abstractions: A Widely Used Framework

Table 10.1 Programming abstractions and associated terminology for GPU processors

11 of 55

1. GPU Programming Abstractions: A Widely Used Framework

Decomposing Data into Independent Units of Work: NDRange or Grid

Data decomposition is at the heart of how GPU processors achieve their performance. GPUs break a task into many smaller data blocks. They then break these blocks down again and again.

GPUs must draw many triangles and polygons to generate high frame rates. These operations are completely independent of each other. For this reason, high-level data decomposition for GPU computation also creates independent and asynchronous work.

Due to the large volume of work being performed, GPUs hide latency (stalls while waiting on memory loads) by switching to another workgroup that is ready to compute. Figure 10.2 shows a case where, due to resource constraints, only four subgroups (warps or wavefronts) can be scheduled. When subgroups reach memory reads and begin to stall, execution switches to other subgroups. This execution switch, also known as a context switch, hides latency not through a deep cache hierarchy, but through concurrency. If you only have a single thread with a single instruction on a single data element, the GPU will be slow because it has no way to hide latency. But if you have a lot of data being operated on, it will be incredibly fast.

12 of 55

1. GPU Programming Abstractions: A Widely Used Framework

Fig. 10.2 The GPU warp scheduler switches among warp groups to cover memory loads and instruction stalls. Having multiple warp groups allows work to continue even when one warp group is stalled.

13 of 55

1. GPU Programming Abstractions: A Widely Used Framework

Table 10.2 shows the hardware limits for current NVIDIA and AMD schedulers. For these devices, we need a large number of candidate workgroups and subgroups to keep the processing elements busy.

Table 10.2 GPU Subgroup (Warp or Wavefront) Scheduler Limits

Data movement, and in particular, moving data up and down the cache hierarchy, is a significant portion of the processor's power consumption. Therefore, reducing the need for a deep cache hierarchy has significant benefits. Power consumption is significantly reduced. Furthermore, valuable silicon real estate is freed up on the processor. This space can then be filled with even more arithmetic logic units (ALUs).

14 of 55

1. GPU Programming Abstractions: A Widely Used Framework

We illustrate the data decomposition operation in Figure 10.3, where a two-dimensional computational domain is broken down into smaller two-dimensional data blocks. In OpenCL, this is called an NDRange, short for N-dimensional range (the CUDA term, grid, is somewhat more convenient). The NDRange in this case is a 3x3 set of 8x8 tiles. The data decomposition process breaks the global computational domain, Gy x Gx, into smaller blocks or tiles of size Ty x Tx.

Fig. 10.3 Partitioning the computational domain into small independent units of work

15 of 55

1. GPU Programming Abstractions: A Widely Used Framework

Let's take a closer look at the example to see what this step entails.

Example: Data Decomposition of a 1024x1024 Two-Dimensional Computational Domain

If we set the tile size to 16x8, the data decomposition will be:

NTx = Gx/Tx = 1024/16 = 64;

NTy = Gy/Ty = 1024/8 = 128;

NT = 64 × 128 = 8192 tiles.
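A minimal host-side sketch of this arithmetic in CUDA C++ is shown below (the kernel name tile_kernel in the commented-out launch is hypothetical):

#include <cstdio>

int main(void) {
   const int Gx = 1024, Gy = 1024;   // global computational domain
   const int Tx = 16,   Ty = 8;      // tile (workgroup/block) size

   // Integer ceiling division gives the number of tiles in each dimension
   int NTx = (Gx + Tx - 1) / Tx;     // 64
   int NTy = (Gy + Ty - 1) / Ty;     // 128
   printf("Tiles: %d x %d = %d\n", NTx, NTy, NTx * NTy);   // 8192 tiles

   // In CUDA these counts become the grid and block dimensions of a launch:
   // dim3 block(Tx, Ty);
   // dim3 grid(NTx, NTy);
   // tile_kernel<<<grid, block>>>(...);   // hypothetical kernel
   return 0;
}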

At this first level of decomposition, the dataset begins to be broken down into a large number of smaller blocks, or tiles. GPU processors make several assumptions about the characteristics of the workgroups created as a result of this data decomposition. These assumptions are that the workgroups:

  • are completely independent and asynchronous;
  • have access to global and persistent memory.

Each created workgroup is an independent unit of work. This means that each of these workgroups can be operated on in any order, providing a level of parallelism that can be distributed across multiple compute modules on the GPU. These are the same properties we attributed to fine-grained parallelism in Section 7.3.6. The same changes to loops regarding independent iterations for OpenMP and vectorization also enable GPU parallelism.

16 of 55

1. GPU Programming Abstractions: A Widely Used Framework

Table 10.3 shows examples of how this data decomposition can occur for one-dimensional, two-dimensional, and three-dimensional compute domains. For best performance, the fastest-changing tile size, Tx, should be a multiple of the cache line length, the memory bus width, or the subgroup size (wavefront or warp). The number of tiles, NT, in total and in each dimension, results in a large number of workgroups (tiles) being distributed across compute engines and GPU processing elements.

Table 10.3: Decomposition of the compute domain into tiles or blocks

17 of 55

1. GPU Programming Abstractions: A Widely Used Framework

In algorithms that require adjacent information, the optimal tile size for memory accesses must be balanced with minimizing the tile surface area (Figure 10.4). Adjacent data must be loaded more than once for adjacent tiles, making this consideration important.

Figure 10.4: Each workgroup must load adjacent data from the dotted rectangle, resulting in duplicate loads in the shaded areas, where more duplicate loads are required on the left. This must be balanced against optimal loads of continuous data in the x-direction.

18 of 55

1. GPU Programming Abstractions: A Widely Used Framework

Work groups provide an optimally sized chunk of work.

A work group distributes work across threads on a computing device. Each GPU model has a maximum size, which is specified for the hardware. OpenCL reports this as CL_DEVICE_MAX_WORK_GROUP_SIZE in its device query. PGI reports this as Maximum Threads per Block in the output of its pgaccelinfo command (see Figure 11.3). The maximum work group size is typically between 256 and 1024. This is simply a maximum. For computing, the specified work group sizes are typically much smaller, resulting in more memory resources per work item or thread.

A work group is further subdivided into subgroups (Figure 10.5). A subgroup is a set of threads that execute in unison. On NVIDIA GPUs, the subgroup is called a warp and its size is 32 threads. On AMD GPUs, it is called a wavefront, and the size is typically 64 work items. The workgroup size should be a multiple of the subgroup size.

19 of 55

1. GPU Programming Abstractions: A Widely Used Framework

Fig. 10.5 A multidimensional workgroup is linearized into a one-dimensional strip and then divided into subgroups of 32 or 64 work items. For performance reasons, the workgroup size should be a multiple of the subgroup size.

Typical characteristics of workgroups on GPU processors are that they:

  • have compute resources dedicated to processing each subgroup;
  • have local memory (shared memory) and other resources shared within the group;
  • can synchronize within a workgroup or subgroup.

Local memory provides fast access and can be used as a kind of programmable cache or scratchpad memory. If the same data is needed by multiple threads within a workgroup, performance can typically be improved by loading it into local memory when the compute kernel starts.
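As an illustration only (not code from the book), a CUDA kernel might stage data into local (shared) memory like this; the three-point averaging and array names are assumptions, and the kernel assumes n is a multiple of 128 and a launch with a block size of 128:

// Simple 1D three-point stencil using local (shared) memory.
__global__ void stencil1d(const double *x, double *xnew, int n)
{
   __shared__ double tile[128 + 2];          // block size 128 plus one halo cell per side
   int gid = blockIdx.x * blockDim.x + threadIdx.x;
   int lid = threadIdx.x + 1;                // local index offset by the left halo

   tile[lid] = x[gid];                       // cooperative load: one element per thread
   if (threadIdx.x == 0 && gid > 0)          // first thread also loads the left halo
      tile[0] = x[gid - 1];
   if (threadIdx.x == blockDim.x - 1 && gid < n - 1)   // last thread loads the right halo
      tile[lid + 1] = x[gid + 1];
   __syncthreads();                          // synchronize within the workgroup

   if (gid > 0 && gid < n - 1)               // skip the domain boundaries
      xnew[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0;
}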

20 of 55

1. GPU Programming Abstractions: A Widely Used Framework

Subgroups, warps, or wavefronts execute in unison.

To further optimize graphics operations, GPUs exploit the fact that the same operations are performed on many data elements. Therefore, GPUs operate on sets of data with a single instruction, rather than issuing separate instructions for each element. This reduces the number of instructions that must be processed. On CPUs, this technique is called single instruction, multiple data (SIMD). GPUs emulate the technique with groups of threads, where it is called single instruction, multiple threads (SIMT). See Section 1.4 for an initial discussion of SIMD and SIMT.

Because a SIMT operation emulates SIMD, it is not necessarily limited by the width of the vector hardware in the way that SIMD operations are. SIMT operations execute in unison, with every thread in the subgroup executing all paths through a branch if any one thread must take that branch (Figure 10.6). This is similar to how a SIMD operation is executed using a mask. However, because the SIMT operation is emulated, this requirement can be relaxed, allowing greater flexibility in an instruction pipeline that can support more than one instruction.

21 of 55

1. GPU Programming Abstractions: A Widely Used Framework

Figure 10.6. The shaded rectangles show the instructions executed by each thread or lane. SIMD and SIMT operations execute all instructions in unison, masking off those whose condition is false. Large conditional blocks can cause problems on GPUs due to branch divergence.

Small conditional blocks don't have a significant impact on overall GPU performance. However, if some threads take thousands of cycles longer than others, a serious problem arises. If threads can be grouped so that all the long branches fall in the same subgroups (warps or wavefronts), then the thread divergence penalty will be small or even absent.
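To make divergence concrete, here is an illustrative CUDA fragment (not from the book); the cheap and expensive paths are stand-ins:

__device__ double cheap_path(double v)     { return -v; }
__device__ double expensive_path(double v) {
   double s = v;
   for (int i = 0; i < 1000; i++) s = s + sin(s);   // stand-in for a long-running branch
   return s;
}

__global__ void divergence_example(const double *x, double *y, int n)
{
   int gid = blockIdx.x * blockDim.x + threadIdx.x;
   if (gid >= n) return;

   // If both branches occur within a single warp (subgroup), the warp executes
   // both paths one after the other with inactive lanes masked off.
   if (x[gid] < 0.0)
      y[gid] = cheap_path(x[gid]);
   else
      y[gid] = expensive_path(x[gid]);
}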

22 of 55

1. GPU Programming Abstractions: A Widely Used Framework

Work Item: Basic Unit of Operation

The basic unit of operation in OpenCL is called a work item. Depending on the hardware implementation, this work item can be associated with a thread or a processing core. In CUDA, it is simply called a thread because that is how it is associated in NVIDIA GPUs. Calling a work item a thread confuses the programming model with how it is implemented in the hardware, but it is a bit clearer for the programmer.

A work item can enable another level of parallelism on GPUs with vector hardware units, as shown in Figure 10.7. This work item model also corresponds to a CPU, where a thread can perform a vector operation.

SIMD or Vector Hardware

Some GPUs also have vector hardware units and can perform SIMD (vector) operations in addition to SIMT operations. In the graphics world, vector hardware units handle spatial or color models. Uses in scientific computing are more varied and not necessarily portable across GPU hardware types. A vector operation is performed by each work item, increasing the resources required by the compute kernel, but the additional vector registers generally compensate for the additional work. When used correctly, effective utilization of vector units can significantly improve performance.

23 of 55

1. GPU Programming Abstractions: A Widely Used Framework

Vector operations are available in OpenCL and in AMD's languages. Since NVIDIA hardware does not have vector units, CUDA lacks the same level of support. However, OpenCL code with vector operations will still run on NVIDIA hardware, where the vector operations are emulated.

Fig. 10.7. Each work item on an AMD or Intel GPU can perform SIMD or vector operations. This also maps well to the vector unit on a CPU.

24 of 55

2. Code structure for the GPU programming model

Now we can begin to examine the structure of GPU code, which includes a programming model. For convenience and generality, we will refer to the CPU as the host and use the term "device" to refer to the GPU.

The GPU programming model separates the loop body from the array range or index set applied to the function. The loop body creates a GPU compute kernel. The index set and arguments will be used on the host to invoke the kernel. Figure 10.8 shows the transformation from a standard loop to the body of a GPU compute kernel. This example uses OpenCL syntax. However, the CUDA kernel is similar and replaces the call to get_global_id with

gid = blockIdx.x * blockDim.x + threadIdx.x;

Fig. 10.8 Correspondence between a standard loop and the code structure of a GPU compute kernel
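In CUDA syntax, the same transformation might be sketched as follows (the stream-triad body matches the example used later in this chapter; names are illustrative):

// Host loop version
// for (int i = 0; i < n; i++) { c[i] = a[i] + scalar * b[i]; }

// GPU compute kernel: the loop body with the index supplied by the runtime
__global__ void stream_triad(double *c, const double *a, const double *b,
                             double scalar, int n)
{
   int gid = blockIdx.x * blockDim.x + threadIdx.x;
   if (gid < n)                       // guard against the padded index range
      c[gid] = a[gid] + scalar * b[gid];
}

// Host-side invocation replaces the loop bounds with a launch configuration:
// int block = 128;
// int grid  = (n + block - 1) / block;
// stream_triad<<<grid, block>>>(d_c, d_a, d_b, scalar, n);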

25 of 55

2. Code structure for the GPU programming model

In the next four sections, we'll look separately at how a loop body becomes a parallel compute kernel and how to associate it with an index defined on the host. Let's break this down into four steps.

1. Extract the parallel compute kernel.

2. Map local data to global data.

3. Compute the decomposition on the host into data blocks.

4. Allocate any necessary memory.

Ego Programming: The Parallel Computing Kernel Concept

GPU programming is the ideal language for the Ego generation. In the compute kernel, everything happens relative to you. Take, for example,

c[i] = a[i] + scalar*b[i];

26 of 55

2. Code structure for the GPU programming model

This expression contains no information about the extent of the loop. The index i could be a global index spanning the range from 0 to 1000, or it could take just the single value 22. Each data element knows what to do with itself and only itself. In effect, we have an "Ego" programming model, where I care only about myself. The power of this model is that the operations on each data element become completely independent. Let's look at a more complex example, the stencil operator. Even though we have two indices, i and j, and some references are to adjacent data values, the line of code below is still fully defined once we specify the values of i and j.

xnew[j][i] = (x[j][i] + x[j][i-1] + x[j][i+1] + x[j-1][i] + x[j+1][i])/5.0;
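As a sketch only, using the CUDA-style indexing introduced in Section 10.2.2 below and assuming x and xnew are stored as flat arrays of width nx, this stencil could become:

__global__ void blur(const double *x, double *xnew, int nx, int ny)
{
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   int j = blockIdx.y * blockDim.y + threadIdx.y;

   // Skip the boundary and any padded threads beyond the domain
   if (i < 1 || i >= nx - 1 || j < 1 || j >= ny - 1) return;

   xnew[j * nx + i] = (x[j * nx + i]       + x[j * nx + (i - 1)] +
                       x[j * nx + (i + 1)] + x[(j - 1) * nx + i] +
                       x[(j + 1) * nx + i]) / 5.0;
}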

Separating the loop body and index set in C++ is accomplished using functors or lambda expressions. Lambda expressions have been available in C++ since the C++11 standard. Lambdas are used by compilers to ensure portability of a single source code across CPUs and GPUs. Listing 10.1 shows a C++ lambda.

DEFINITION: Lambda expressions are unnamed local functions that can be assigned to a variable and used locally or passed to a subroutine.
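Listing 10.1 itself is not reproduced on these slides; the following is only a sketch of the same idea, a stream-triad lambda with a [&] capture and a calling loop (variable names are illustrative):

#include <vector>

int main(void)
{
   const int n = 1000;
   double scalar = 0.5;
   std::vector<double> a(n, 1.0), b(n, 2.0), c(n, 0.0);

   // Lambda: the loop body, capturing c, a, b, and scalar by reference
   auto triad = [&](int i) {
      c[i] = a[i] + scalar * b[i];
   };

   // The call: the index set stays on the host and invokes the lambda
   for (int i = 0; i < n; i++) {
      triad(i);
   }
   return 0;
}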

27 of 55

2. Code structure for the GPU programming model

28 of 55

2. Code structure for the GPU programming model

A lambda expression consists of four main components:

  • the lambda body – the function executed by the call. In this case, the body is

c[i] = a[i] + scalar * b[i];

  • arguments – the argument (int i) used in the subsequent call of the lambda expression;
  • the capture clause – the list of variables used in the lambda body that are defined outside it, together with how they are passed in; this is the [&] in Listing 10.1. The & sign indicates that a variable is captured by reference, and the = sign indicates that it is copied by value. A lone & means all such variables are captured by reference by default. We can also list the variables explicitly, writing the capture clause as [&c, &a, &b, &scalar];
  • call – The for loop on lines 13 through 16 in Listing 10.1 calls the lambda on the given array values.

29 of 55

2. Code structure for the GPU programming model

Lambda expressions form the basis for more natural code generation for GPU processors in new C++-like languages such as SYCL, Kokkos, and Raja. We'll briefly discuss SYCL in Chapter 12 as a higher-level C++ language (originally built on top of OpenCL). Two languages, Kokkos from Sandia National Laboratories (SNL) and Raja from Lawrence Livermore National Laboratory (LLNL), are higher-level languages designed to simplify writing portable scientific applications for a wide range of modern computing hardware. We'll also introduce Kokkos and Raja in Chapter 12.

Thread Indexes: Mapping a Local Tile to the Global World

The key to understanding how a compute kernel constructs its local operation is that, as a product of the data decomposition, each workgroup is given some information about where it resides in the local and global domains. In OpenCL, you can obtain the information listed below.

30 of 55

2. Code structure for the GPU programming model

  • Dimension – the number of dimensions (one, two, or three) of the kernel invocation.
  • Global information – the global index of the work item in each dimension, or the global size in each dimension (the size of the global computational domain in that dimension).
  • Local (tile) information – the local size in each dimension (the tile size in that dimension), or the local index in each dimension (the index within the tile).
  • Group information – the number of groups in each dimension, or the group index in each dimension.

Similar information is available in CUDA, but the global index must be calculated from the local thread index plus block (tile) information:

gid = blockIdx.x * blockDim.x + threadIdx.x;

Figure 10.9 shows workgroup (block or tile) indexing for OpenCL and CUDA. The OpenCL function call is listed first, followed by the corresponding CUDA variable. All of this indexing support is handled for you automatically by the GPU data decomposition, significantly simplifying the global-to-tile mapping.

31 of 55

2. Code structure for the GPU programming model

Fig. 10.9 Mapping the index of an individual work item into the global index space. The OpenCL call is given first, followed by the corresponding CUDA variable.

32 of 55

2. Code structure for the GPU programming model

Index Sets

The index-set size must be the same for every work group. This is accomplished by expanding the global compute domain to a multiple of the local work group size. The expansion requires only some simple integer arithmetic and results in at most one additional work group and a slightly larger global work size. The formula below shows an approach using basic integer operations; a second approach, using the C ceil function, appears in the example that follows.

global_work_sizex = ((global_sizex + local_work_sizex - 1) /
                     local_work_sizex) * local_work_sizex;

NOTE: In GPU compute kernels, it is important to avoid out-of-bounds reads and writes, as these lead to random compute kernel crashes without an error message or output.

33 of 55

2. Code structure for the GPU programming model

Example: Calculating Work Group Sizes

The code snippet below computes the arguments needed to invoke the compute kernel with uniform work group sizes.

int global_sizex = 1000;
int local_work_sizex = 128;

int global_work_sizex = ((global_sizex + local_work_sizex - 1) /
                         local_work_sizex) * local_work_sizex;   // = 1024
int number_work_groupsx = (global_sizex + local_work_sizex - 1) /
                          local_work_sizex;                      // = 8

// or, alternatively, using ceil (cast to double to avoid integer division)
int global_work_sizex = (int)ceil((double)global_sizex / local_work_sizex) *
                        local_work_sizex;                        // = 1024
int number_work_groupsx = (int)ceil((double)global_sizex / local_work_sizex); // = 8

To avoid out-of-bounds reads, we should check the global index in each kernel and skip the operation if the index is past the end of the array, using something like this:

if (gidx >= global_sizex) return;

34 of 55

2. Code structure for the GPU programming model

How to Access Memory Resources in Your GPU Programming Model

Memory management remains the most important consideration, as it shapes your application programming plan. Fortunately, modern GPUs have abundant memory: the NVIDIA V100 and AMD Radeon Instinct MI50 GPUs each carry 32 GB of RAM. Compared to a well-provisioned CPU HPC node with 128 GB of memory, a GPU compute node with four to six GPUs has a comparable amount of memory. Therefore, we can use much the same memory allocation strategy as on the CPU and avoid having to shuttle data back and forth because of limited GPU memory.

Memory allocation for the GPU is performed from the CPU. It is common to allocate memory on both the CPU and the GPU and then transfer data between them. However, if possible, you should allocate memory exclusively on the GPU. This avoids costly memory transfers to and from the CPU and frees up memory on the CPU. Algorithms that use dynamic memory allocation pose a challenge for GPUs and must be converted to a static-memory algorithm whose size is known in advance. Newer GPUs are good at coalescing irregular or scattered memory accesses into a single, coherent cache line load whenever possible.

DEFINITION: A coalesced memory load is the combining of individual memory loads from a group of threads into a single cache-line load.
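A minimal CUDA sketch of the allocate-on-the-device-and-keep-it-there strategy (error checking omitted; the commented-out kernel name compute is hypothetical):

#include <cuda_runtime.h>

void run(int n)
{
   double *d_x;                                      // device-only array
   cudaMalloc((void **)&d_x, n * sizeof(double));

   // ... launch kernels that initialize and update d_x directly on the GPU ...
   // compute<<<(n + 127)/128, 128>>>(d_x, n);       // hypothetical kernel

   // Copy back only the results that are actually needed on the host
   double result;
   cudaMemcpy(&result, d_x, sizeof(double), cudaMemcpyDeviceToHost);

   cudaFree(d_x);
}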

35 of 55

2. Code structure for the GPU programming model

Memory coalescing (i.e., combining loads) on the GPU is performed at the hardware level in the memory controller. The performance gain from these coalesced loads is significant. But just as importantly, many optimizations found in earlier GPU programming tutorials are no longer needed, significantly reducing the GPU programming effort.

You can achieve some additional speedup by using local (shared) memory for data that is accessed more than once. This optimization used to be important for performance, but higher-quality caches on GPUs make the speedup less significant. There are two strategies for using local memory, depending on whether the required amount of local memory can be predicted. Figure 10.10 shows a regular-grid approach on the left and an irregular-grid approach on the right, for unstructured and adaptively refined grid data. The regular grid has four adjacent tiles with overlapping halo regions. The adaptively refined grid example shows only four cells; a typical GPU application loads 128 or 256 cells and then brings in the necessary neighboring cells located on the periphery.

36 of 55

2. Code structure for the GPU programming model

The processes in these two cases are as follows:

  • threads require the same memory load as their neighboring threads. A good example of this is the stencil operation, which we use throughout the book. Thread i needs the values i-1 and i+1, meaning multiple threads will need the same values. In this situation, the best approach is to perform cooperative memory loads. Copying memory values from global memory to local (shared) memory leads to significant speedup;
  • an irregular grid has an unpredictable number of neighbors, making loading into local memory difficult. This situation can be resolved by copying part of the grid being computed to local memory and then loading the adjacent data into registers for each thread.

37 of 55

2. Code structure for the GPU programming model

Fig. 10.10 For stencils on regular grids, load all data into local memory and then use local memory for computation. The inner solid rectangle is the computational tile. The outer dashed rectangle encloses adjacent data needed for the computation. Cooperative loads can be used to load the data in the outer rectangle into local memory for each workgroup. Since irregular grids have unpredictable size, only the region being computed should be loaded into local memory and the rest should be stored in per-thread registers.

38 of 55

3. Optimizing GPU resource usage

The key to good GPU programming is managing the limited resources available to execute compute kernels. Let's look at some of the most important resource limits in Table 10.4. Exceeding the available resources can lead to a significant performance penalty. NVIDIA compute capability 7.0 corresponds to the V100 chip; the newer Ampere A100 chip has compute capability 8.0, with nearly identical resource limits.

Table 10.4: Several resource limits for current GPUs

39 of 55

3. Optimizing GPU resource usage

The most important control available to a GPU programmer is the workgroup size. At first glance, it might seem desirable to maximize the number of threads per workgroup. However, compute kernels are more complex than graphics kernels, so they place greater demands on resources. Colloquially, this is referred to as memory pressure or register pressure. Reducing the workgroup size gives each workgroup more resources to operate with. It also provides more workgroups for context switching, which we discussed in Section 10.1.1. The key to achieving good GPU performance is finding the right balance between workgroup size and resources.

DEFINITION: Memory pressure is the impact of a compute kernel's memory requirements on the performance of the GPU's compute units. Register pressure is a similar term, referring to the register demands of a compute kernel.

A complete analysis of the resource requirements of a specific compute kernel against the resources available on the GPU requires careful consideration. We will give examples of a couple of such in-depth analyses. In the next two sections, we consider:

  • the number of registers used by the compute kernel;
  • the degree of multiprocessor load, known as occupancy.

40 of 55

3. Optimizing GPU resource usage

How many registers does my compute kernel use?

You can find out the number of registers used in your code by adding the -Xptxas="-v" flag to the nvcc compilation command. To achieve a similar result in OpenCL for NVIDIA GPUs, use the -cl-nv-verbose flag in the OpenCL compilation line.

Example: Getting register usage data for your compute kernel on an NVIDIA GPU

First, we build BabelStream with additional compiler flags:

git clone git@github.com:UoB-HPC/BabelStream.git
cd BabelStream

export EXTRA_FLAGS='-Xptxas="-v"'
make -f CUDA.make

The output from the NVIDIA compiler shows the register usage for the stream triad:

ptxas info: Compiling entry function '_Z12triad_kernelIfEvPT_PKS0_S3_' for 'sm_70'

ptxas info: Function properties for _Z12triad_kernelIfEvPT_PKS0_S3_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

ptxas info: Used 14 registers, 4096 bytes smem, 380 bytes cmem[0]

In this simple compute kernel, we use 14 registers out of the 255 available on NVIDIA GPUs.

41 of 55

3. Optimizing GPU resource usage

Occupancy: Allowing More Work to Be Scheduled per Workgroup

We've discussed the importance of latency and context switching for good GPU performance. The benefit of "right-sized" workgroups is that more workgroups can execute simultaneously. This is important for the GPU because when a workgroup's progress is stalled due to memory latency, it needs other workgroups to execute to cover the latency. To properly size a workgroup, we need a metric. On GPU processors, a metric called "occupancy" is used to analyze workgroups. Occupancy is a measure of how busy the compute units are during a computation. This metric is complex because it depends on a large number of factors, such as required memory and registers used. The precise definition of occupancy is:

Occupancy = Number of active threads / Maximum number of threads per compute unit.

Since the number of threads in a subgroup is fixed, an equivalent definition is based on subgroups, also called wavefronts or warps:

Occupancy = Number of active subgroups / Maximum number of subgroups per compute unit.
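The CUDA runtime can also report how many workgroups (blocks) of a given size fit on a multiprocessor at once, from which occupancy follows. A sketch, reusing a stream-triad kernel as the example:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void stream_triad(double *c, const double *a, const double *b,
                             double scalar, int n)
{
   int gid = blockIdx.x * blockDim.x + threadIdx.x;
   if (gid < n) c[gid] = a[gid] + scalar * b[gid];
}

void report_occupancy(int block_size)
{
   cudaDeviceProp prop;
   cudaGetDeviceProperties(&prop, 0);

   int max_blocks = 0;
   cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_blocks, stream_triad,
                                                 block_size, 0 /* dynamic smem */);

   double occupancy = (double)(max_blocks * block_size) /
                      prop.maxThreadsPerMultiProcessor;
   printf("Block size %d: %d resident blocks, occupancy %.2f\n",
          block_size, max_blocks, occupancy);
}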

42 of 55

3. Optimizing GPU resource usage

The number of active subgroups or threads is determined by whichever workgroup or thread resource is exhausted first, often the registers or local memory required by a workgroup, which prevent another workgroup from being scheduled. To analyze this, we need a tool such as the CUDA Occupancy Calculator (shown in the example below). NVIDIA programming guides place a lot of emphasis on maximizing occupancy. While occupancy is an important measure, full occupancy is not required; there simply need to be enough workgroups to switch between so that latencies and stalls can be hidden.

Example: CUDA Occupancy Calculator

1. Download the CUDA Occupancy Calculator spreadsheet from https://docs.nvidia.com/cuda/cuda-occupancy-calculator/index.html.

2. Enter the number of registers obtained from the NVCC compiler output (Section 10.3.1) and the workgroup size (1024).

The following figure shows the results for the Occupancy Calculator. The spreadsheet also contains graphs for varying block sizes, number of registers, and local memory usage (not shown).

43 of 55

3. Optimizing GPU resource usage

Output of the CUDA Occupancy Calculator for the stream triad, showing the resource usage for the compute kernel. The third block from the top shows the occupancy measures.

44 of 55

4. The reduction pattern requires synchronization between workgroups

So far, the computational loops we've considered over cells, particles, points, and other computational elements have been handled using the approach shown in Figure 10.8, in which we extracted the loop body to create a GPU compute kernel. This transformation is quick and easy and can be applied to a huge number of loops in a scientific application. However, there are other situations where converting code to GPUs is considerably harder. Let's consider algorithms that require a more sophisticated approach. Take, for example, a single line of Fortran code that uses array syntax:

xsum = sum(x(:))

In Fortran, this looks fairly straightforward, but on GPUs, things get much more complicated. The difficulty stems from the inability to perform cooperative work or comparisons between workgroups. This can only be achieved by breaking out of the kernel. Figure 10.11 illustrates a general strategy that addresses this situation.

45 of 55

4. The reduction pattern requires synchronization between workgroups

Fig. 10.11 The reduction pattern on the GPU requires two compute kernels in order to synchronize across the many workgroups. We exit the first kernel, represented by the rectangle, and then launch a second kernel the size of a single workgroup so that the threads can cooperate on the final pass.

46 of 55

4. The reduction pattern requires synchronization between workgroups

For ease of illustration, Figure 10.11 shows an array of only 32 elements. A typical array for this method would have hundreds of thousands or even millions of elements, far exceeding the workgroup size. In the first step, we compute the sum within each workgroup and store it in a scratch array whose length equals the number of workgroups or blocks. This first pass reduces the array size by a factor of the workgroup size, which could be 512 or 1024. At this point, we can't share data between workgroups, so we exit the compute kernel and start a new kernel with only a single workgroup. The remaining data may still exceed the workgroup size of 512 or 1024, so within this single workgroup we loop over the scratch array, with each thread accumulating values as it goes. Since data can be shared between threads within a workgroup, we can then reduce the partial sums to a single global value.

Complicated! The source code for performing this operation on a GPU takes dozens of lines and requires two compute kernels to do what can be accomplished in one line of code on a CPU. We'll see more of the actual source code for the reduction in Chapter 12, when we turn to programming in CUDA and OpenCL. The performance achieved on a GPU is higher than on a CPU, but it requires a significant amount of programming effort. We also begin to see that one of the characteristics of GPUs is the difficulty of synchronization and comparisons.
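A simplified CUDA sketch of this two-kernel sum pattern is shown below (it assumes both kernels are launched with a block size of BLOCK and that the second kernel is launched with a single block; production versions handle the general case):

#define BLOCK 128

// Pass 1: each workgroup reduces its tile and writes one partial sum
__global__ void reduce_pass1(const double *x, double *partial, int n)
{
   __shared__ double tile[BLOCK];
   int gid = blockIdx.x * blockDim.x + threadIdx.x;
   tile[threadIdx.x] = (gid < n) ? x[gid] : 0.0;
   __syncthreads();

   for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
      if (threadIdx.x < stride)
         tile[threadIdx.x] += tile[threadIdx.x + stride];
      __syncthreads();
   }
   if (threadIdx.x == 0) partial[blockIdx.x] = tile[0];
}

// Pass 2: a single workgroup reduces the array of partial sums
__global__ void reduce_pass2(const double *partial, double *result, int nblocks)
{
   __shared__ double tile[BLOCK];
   double sum = 0.0;
   for (int i = threadIdx.x; i < nblocks; i += blockDim.x)
      sum += partial[i];                 // each thread strides over the scratch array
   tile[threadIdx.x] = sum;
   __syncthreads();

   for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
      if (threadIdx.x < stride)
         tile[threadIdx.x] += tile[threadIdx.x + stride];
      __syncthreads();
   }
   if (threadIdx.x == 0) *result = tile[0];
}

// Launch sketch: reduce_pass1<<<nblocks, BLOCK>>>(d_x, d_partial, n);
//                reduce_pass2<<<1, BLOCK>>>(d_partial, d_result, nblocks);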

47 of 55

5. Asynchronous computing via queues (streams)

We'll see how GPUs can be utilized more fully by overlapping data transfers and computation. Data transfers in both directions can occur simultaneously with computation on the GPU.

The fundamental nature of GPU processing is asynchronous. Work is queued on the GPU and typically only executed when a result or synchronization request is required. Figure 10.12 shows a typical set of commands sent to the GPU for computation.

Figure 10.12. Work scheduled on the GPU in the default queue completes only when a wait is requested.

We've scheduled a copy of a graphic image (picture) to the GPU. We've then scheduled a mathematical operation on the data to modify it. We've also scheduled a third operation to return it. None of these operations needs to begin until we request a wait.

48 of 55

5. Asynchronous computing via queues (streams)

We can also schedule work across multiple independent and asynchronous queues. Using multiple queues, as shown in Figure 10.13, opens up opportunities for overlapping data transfers and computation. Most GPU languages support some form of asynchronous work queues. In OpenCL, commands are placed in command queues, while in CUDA, operations are placed in streams. While this creates the potential for parallelism, whether it actually occurs depends on the hardware capabilities and the details of the coding.

Fig. 10.13 Breakdown of work into stages for three images in parallel queues

49 of 55

5. Asynchronous computing via queues (streams)

If we have a GPU that can simultaneously perform the following operations:

  • copying data from host to device,
  • computing the kernel(s),
  • copying data from device to host,

then the work performed in the three separate queues in Figure 10.13 can overlap the computations and data exchanges, as shown in Figure 10.14.

50 of 55

5. Asynchronous computing via queues (streams)

Figure 10.14. Overlapping computation and data transfer reduces the time for three images from 75 to 45 ms. This is possible because the GPU can simultaneously perform computation, a host-to-device data transfer, and another device-to-host transfer.
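A CUDA sketch of the three-queue pattern in Figures 10.13 and 10.14 (the kernel process_image, the doubling operation, and the pinned host buffers are illustrative assumptions):

#include <cuda_runtime.h>

__global__ void process_image(float *img, int npixels)
{
   int gid = blockIdx.x * blockDim.x + threadIdx.x;
   if (gid < npixels) img[gid] = 2.0f * img[gid];   // stand-in for real work
}

void process_three_images(float *h_img[3], float *d_img[3], int npixels)
{
   cudaStream_t stream[3];
   size_t bytes = npixels * sizeof(float);

   for (int i = 0; i < 3; i++) cudaStreamCreate(&stream[i]);

   // Each image gets its own queue; copies and kernels in different streams
   // may overlap if the hardware supports it (h_img should be pinned memory).
   for (int i = 0; i < 3; i++) {
      cudaMemcpyAsync(d_img[i], h_img[i], bytes, cudaMemcpyHostToDevice, stream[i]);
      process_image<<<(npixels + 127) / 128, 128, 0, stream[i]>>>(d_img[i], npixels);
      cudaMemcpyAsync(h_img[i], d_img[i], bytes, cudaMemcpyDeviceToHost, stream[i]);
   }

   for (int i = 0; i < 3; i++) {
      cudaStreamSynchronize(stream[i]);   // the wait that forces completion
      cudaStreamDestroy(stream[i]);
   }
}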

51 of 55

6. Developing a plan for parallelizing an application for GPU processors

Now let's move on to using our understanding of the GPU programming model to develop a parallelization strategy for our application. We'll use a couple of example applications to demonstrate this process.

Case 1: 3D Atmospheric Simulation

Your application is an atmospheric simulation ranging in size from 1024x1024x1024 to 8192x8192x8192, with x serving as the vertical dimension, y as the horizontal dimension, and z as the depth. Let's look at the options you might consider.

  • Option 1: Distribute the data in a one-dimensional format along the z (depth) dimension. For GPU processors, we need tens of thousands of workgroups for effective parallelism. The GPU specification (Table 9.3) suggests 60–80 compute units with 32 double-precision arithmetic units for approximately 2000 simultaneous arithmetic paths. Additionally, we need more workgroups to hide latency through context switching. Distributing the data across the z dimension allows for 1024 to 8192 workgroups, which is a low level of GPU parallelism.

52 of 55

6. Developing a plan for parallelizing an application for GPU processors

Let's consider the resources required for each workgroup. The minimum data would be a 1024x1024 plane plus any necessary adjacent data in ghost cells. We'll allow one ghost cell in each direction. Therefore, we'll need 1024x1024x3x8 bytes, or 24 MiB, of local data. Looking at Table 10.4, GPUs contain 64–96 KiB of local memory, so we won't be able to prefetch the data into local memory for faster processing.

  • Option 2: Distribute the data in two-dimensional vertical columns across the y and z dimensions. Distributing across two dimensions would give us over a million potential workgroups, providing plenty of independent work for the GPU. Each workgroup would have between 1024 and 8192 cells. Each cell needs its own value plus four neighbors, so 1024 × 5 × 8 bytes = 40 KiB is the minimum local memory requirement. For larger problems, or with more than one variable per cell, we wouldn't have enough local memory.
  • Option 3: Distribute data in 3D cubes across the x, y, and z dimensions. Using the template from Table 10.3, let's try using a 4×4×8-cell tile for each workgroup. With neighbors, this is 6 × 6 × 10 × 8 bytes for a minimum local memory requirement of 2.8 KiB. We could have more variables per cell, and we could experiment with increasing the tile size.
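The local-memory arithmetic behind these three options can be checked with a few lines of host code (a sketch; 8-byte double-precision values assumed):

#include <cstdio>

int main(void)
{
   const double KiB = 1024.0, MiB = 1024.0 * 1024.0;

   // Option 1: one 1024 x 1024 plane plus a ghost plane on each side
   double opt1 = 1024.0 * 1024.0 * 3 * 8;           // ~24 MiB
   // Option 2: a column of 1024 cells, each needing itself plus 4 neighbors
   double opt2 = 1024.0 * 5 * 8;                    // 40 KiB
   // Option 3: a 4 x 4 x 8 tile padded to 6 x 6 x 10 with neighbors
   double opt3 = 6.0 * 6 * 10 * 8;                  // ~2.8 KiB

   printf("Option 1: %.1f MiB per workgroup\n", opt1 / MiB);
   printf("Option 2: %.1f KiB per workgroup\n", opt2 / KiB);
   printf("Option 3: %.1f KiB per workgroup\n", opt3 / KiB);
   return 0;
}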

53 of 55

6. Developing a plan for parallelizing an application for GPU processors

The total memory requirement for a 1024x1024x1024-cell domain at 8 bytes per cell is 8 GiB. We have a large problem. GPUs have up to 32 GiB of RAM, so this problem could probably fit on a single GPU. Larger problems would potentially require up to 512 GPUs. Therefore, we must also plan for distributed memory parallelism using MPI.

Let's compare this to a CPU, where the above design decisions will have different results. We might have to distribute the work across 44 processes, each with fewer resource constraints. While a 3D approach might work, 1D and 2D approaches are also feasible. Now let's compare this to an unstructured computing grid, where all data is contained in 1D arrays.

Case 2: Using an Unstructured Computational Grid

In this case, your application is a 3D unstructured computational grid using tetrahedral or polyhedral cells, ranging from 1 to 10 million cells. The data is stored in one-dimensional lists of cells, with x, y, and z values giving each cell's spatial location. In this case, there is really only one option: a one-dimensional data distribution.

54 of 55

6. Developing a plan for parallelizing an application for GPU processors

Since the data is unstructured and contained in one-dimensional arrays, the choices are simpler. We distribute the data in a single dimension with a tile size of 128. This gives us between 8,000 and 80,000 workgroups, providing the GPU with enough work for context switching to hide latency. The memory requirement is 128 x 8 bytes of double precision = 1 KiB, leaving room for multiple data values per cell.

We also need space for some integer mapping and neighbor arrays to ensure connectivity between cells. Neighbor data is loaded into registers for each thread, so we don't need to worry about impacting local memory and possibly exceeding the memory limit. For the largest grid size of 10 million cells, 80 MB is required, plus space for face, neighbor, and mapping arrays. These connectivity arrays can significantly increase memory usage, but a single GPU should have enough memory to perform the computations even on the largest grid sizes.

To achieve the best results, we will need to ensure some locality for the unstructured data by using a data-partitioning library or a space-filling curve that keeps spatially adjacent cells close to each other in the array.

55 of 55

7. Exercises

1. You have a photo classification application that takes 5 ms to transfer each file to the GPU, 5 ms to process, and 5 ms to return. On the CPU, processing takes 100 ms per photo. You need to process a million photos. Your CPU has 16 processing cores. Will a GPU-based system run faster?

2. The GPU transfer time in Problem 1 is based on a third-generation PCI bus. If you can get a Gen4 PCI bus, how would that change the design? What about a Gen5 PCI bus? For photo classification, you don't need to return a modified photo. How would that change the calculations?

3. What size 3D application could you run on your discrete GPU (or NVIDIA GeForce GTX 1060 if you don't have one)? Assume there are four double-precision variables per cell and that there's a limit of half the GPU memory to allow for temporary arrays. How does this change the application if you use single precision?