2 of 98

Plan:

Algorithm Analysis for Parallel Computing Applications
Performance Models vs. Algorithmic Complexity
Parallel Algorithms: What Are They?
What is a Hash Function?
Spatial Hashing: A Highly Parallel Algorithm
The Prefix Summation (Scan) Pattern and Its Importance in Parallel Computing
Parallel Global Sum: A Solution to the Associativity Problem
Future Research in Parallel Algorithms
Exercises

3 of 98

Algorithms are at the heart of computing. Along with the data structures discussed in the previous chapter, algorithms form the foundation of all computing applications. For this reason, it's important to carefully consider the key algorithms in your code. First, let's define what we mean by parallel algorithms and parallel patterns.

A parallel algorithm is a well-defined, step-by-step computational procedure that emphasizes concurrency in solving a problem. Examples of algorithms include sorting, searching, optimization, and matrix operations.
A parallel pattern is a concurrent, shared piece of code that appears with some frequency in a wide variety of applications. These code fragments, by themselves, typically do not solve the complete problems of interest to developers. Examples of patterns include reductions, prefix scans, and ghost cell updates.

4 of 98

Analysis of algorithms for parallel computing applications

The development of parallel algorithms is a young field. Even the terminology and methods for analyzing parallel algorithms are still stuck in the sequential world. One of the most traditional approaches to evaluating algorithms is analyzing their algorithmic complexity. Our definition of algorithmic complexity follows.

DEFINITION: Algorithmic complexity is a measure of the number of operations required to complete an algorithm. Algorithmic complexity is a property of an algorithm and is a measure of the amount of work or operations in a procedure.

5 of 98

Analysis of algorithms for parallel computing applications

Complexity is typically expressed in asymptotic notation. Asymptotic notation is a type of expression that defines performance bounds. Essentially, it determines whether the running time grows linearly or accelerates with increasing problem size. Asymptotic notation uses various forms of the letter O, such as O(N), O(N log N), or O(N 2). N is the size of a long array, such as the number of cells, particles, or elements. The combination of O() and N indicates how the cost of the algorithm scales as the array size N increases. O can be thought of as an "order of magnitude," as in the phrase "order scale." Typically, a simple loop over N elements will have O(N) complexity, a double nested loop will have O(N 2) complexity, and a tree-based algorithm will have O(N log N) complexity. By convention, leading constants are discarded. The most commonly used asymptotic notations are:

big O is the worst-case performance bound of an algorithm. Examples include a doubly nested for loop for a large array of size N, which will have complexity O(N 2);
big Ω (big omega) is the best-case performance limit of the algorithm;
big Θ (big theta) is the average-case performance of the algorithm.

6 of 98

Analysis of algorithms for parallel computing applications

In traditional algorithm analysis, the terms "algorithmic complexity," "computational complexity," and "time complexity" are used interchangeably. We will define these terms slightly differently to facilitate the evaluation of algorithms on modern parallel computing hardware. Time is independent of the amount of work, nor is computational effort or cost. Therefore, we make the following adjustments to the definitions of computational complexity and time complexity:

Computational complexity (also known as step complexity) is the number of steps required to complete an algorithm. This complexity measure is an attribute of the implementation and the type of hardware used for the computation. It includes the possible amount of parallelism. If you are using a vector or multi-core computer, a step (cycle) may consist of four or more floating-point operations. Can you use these additional operations to reduce the number of steps?
Time complexity takes into account the actual cost of an operation on a typical modern computing system. The biggest adjustment in time is to account for the costs of memory loads and data caching.

7 of 98

Performance models versus algorithmic complexity

We first introduced performance models in Chapter 4 to analyze the relative performance of different data structures. In a performance model, we construct a much more complete description of an algorithm's performance than in algorithmic complexity analysis. The biggest difference is that we don't hide a constant factor before the algorithm. But for scaling, there are also differences in terms, such as log N. The actual operation count is derived from a binary tree and should be written in log2 N.

In traditional algorithmic complexity analysis, the difference between two logarithmic terms is a constant, and it is absorbed by the constant factor. In real-world problems, these constants are often significant and do not cancel out; therefore, we need to use a performance model to distinguish between different approaches to a similar algorithm. To further understand the benefits of using performance models, let's start with an everyday example.

Example

You are co-organizing a conference with 100 attendees. You want to distribute registration packets to each attendee. Here are some algorithms you could use.

1. Ask all 100 participants to line up while you search through 100 folders to find the package to hand to each participant in turn. In the worst case, you'll need to search through all 100 folders. On average, you'll search through 50. If there's no package, you'll need to search through all 100.

2. Pre-sort the packages alphabetically before registration. Now you can use bisection search to find each folder.

8 of 98

Performance models versus algorithmic complexity

Returning to the first algorithm in the example, let's assume for simplicity that the folders remain after the document packets are delivered to the participant, so the number of folders remains unchanged at the initial value. There are N participants and N folders, thus creating a double nested loop. The computation has the order of operations of N2, or O(N2) in asymptotic big O notation for the worst case. If the folders were to be reduced each time, the computation would be (N + N – 1 + N – 2 ...), or an O(N2) algorithm. The second algorithm can use a sorted order of folders with bisection search to run the algorithm in O(N log N) operations in the worst case.

Asymptotic complexity tells us how well the algorithm performs when we reach large sizes, such as a million participants. But we will never have a million participants. We will have a finite size of 100 participants. For finite sizes, a more complete picture of the algorithm's performance is needed. For illustration purposes, we'll apply a performance model to one of the simplest computer algorithms to see how it can yield deeper insights. We'll use a time-based model, which takes into account real-world hardware costs rather than a per-operation calculation. In this example, we'll consider bisection search, also known as binary search. It's one of the most common computer algorithms and numerical optimization methods. Standard asymptotic analysis suggests that binary search is much faster than linear search. We'll show that, given the operating characteristics of real computers, the speedup is not as large as one might expect. This analysis will also help explain the search results in the table in Section 5.5.1.

9 of 98

Performance models versus algorithmic complexity

Example: Bisection Search vs. Linear Search

The bisection search algorithm can be used to find the correct entry in a sorted array of 256 integer elements. This algorithm takes the midpoint, recursively dividing the remaining possible range in half. In the performance model, bisection search would have log2 256 steps, or 8, while linear search would have a worst-case of 256 steps and an average of 128. If we calculate cache line hits for a four-byte integer array, linear search would require only 16 cache line hits in the worst case and 8 on average. Binary search would require four cache line hits and about four in the average case. It's important to emphasize that this linear search would only be twice as slow as binary search, not 16 times slower than we might expect. In this analysis, we assume that any operation on data in the cache is essentially free (effectively requiring a couple of cycles), while loading a cache line takes about 100 cycles. We simply count cache line loads and ignore comparison operations. For our time-based performance model, we would estimate the cost of a linear search to be (n/16)/2 = 8, and a bisectional search to be log2(n/16) = 4.

The nature of cache behavior significantly changes the result for these short arrays. Bisectional search is still faster, but not as much as might be expected from a simpler analysis.

10 of 98

Performance models versus algorithmic complexity

Although asymptotic complexity is used to understand performance as an algorithm scales, it does not provide an equation for absolute performance. For a given problem, linear search, which scales linearly, may outperform bisection search, which scales logarithmically. This is especially true when parallelizing the algorithm, as scaling a linearly scalable algorithm is much easier than scaling a logarithmically scalable algorithm. Furthermore, by design, a computer performs a linear traversal of the array and fetches data preemptively, which may slightly improve performance. Finally, in a given problem, an element may end up at the beginning of the array, where bisection search performance is significantly worse than linear search.

As an example of other parallel considerations, consider implementing search on 32 processor threads on a multi-core CPU or GPU. The set of threads must wait for the slowest one to complete during each operation. Bisection search always requires four cache loads. Linear search depends on the number of cache lines required by each thread. The worst-case estimate determines the amount of time the operation takes, making the cost closer to 16 cache lines than the average of 8 cache lines.

A natural question arises: how does this work in practice? Let's consider the two table search code examples described in the example. You can test the following algorithms on your system; the PerfectHash code is included in the source code for this chapter at https://github.com/EssentialsofParallelComputing/Chapter5. First, the following listing shows the version of the code using a linear table search algorithm.

11 of 98

Performance models versus algorithmic complexity

12 of 98

Performance models versus algorithmic complexity

Linear search on two axes is performed in lines 278 and 279. The coding is simple and straightforward and results in a cache-friendly implementation. Now let's look at bisection search in the listing below.

13 of 98

Performance models versus algorithmic complexity

The bisectional code is slightly longer than the linear search (Listing 5.1), but it should have lower operational complexity. In Section 5.5.1, we'll consider other table search algorithms and show their relative performance in Figure 5.8.

Spoiler: bisectional search isn't significantly faster than linear search, as you might expect, even accounting for the cost of interpolation. While this is true, this analysis shows that linear search isn't as slow as you might expect.

14 of 98

Parallel algorithms: what are they?

Now let's take another example from everyday life to introduce some ideas about parallel algorithms. This first example demonstrates how a comparison-free and less synchronous algorithmic approach is simpler to implement and shows better performance on highly parallel hardware. We'll discuss additional examples in subsequent sections that highlight spatial locality, reproducibility, and other important attributes of parallelism, and then summarize all the ideas in Section 5.8 at the end of the chapter.

Example: Comparison Sort vs. Hash Sort

You want to sort the 100 participants in the auditorium from the previous example. First, you try comparison sort.

1. Sort the room by asking each person to compare their last name with their neighbor in their row and move left if their last name is earlier in alphabetical order and right if it is later.

2. Continue for up to N steps, where N is the number of people in the room.

15 of 98

Parallel algorithms: what are they?

For GPU processors, a workgroup cannot communicate with another workgroup. Let's assume that each row in the classroom is a workgroup. When you reach the end of a row, it's as if you've reached the limit of the GPU workgroup, and you need to exit the kernel to perform a comparison with the next row. Having to exit the kernel means multiple kernel calls are required, which increases coding complexity and execution time. The details of GPU operation will be discussed in Chapters 9 and 10.

Now let's look at another sorting algorithm: hash sort. Hash sort is best understood by looking at the following example. We'll look at the elements of the hash function in more detail in Section 5.4.

For each letter, place its sign at the front of the room.
Each person approaches the table with the first letter of their last name.
For all letters with higher numbers of participants, repeat for the second letter of the last name. For all small numbers of participants, perform any simple sorting, including the one described earlier.

16 of 98

Parallel algorithms: what are they?

The first sort is a comparison sort using the bubble sort algorithm, which typically exhibits poor performance. Bubble sort traverses a list, comparing adjacent elements and swapping them if they are out of order. This algorithm iterates through the list until the list is sorted. In the best case, comparison sort has an algorithmic complexity bound of O(N log N). Hash sort overcomes this barrier because it does not use comparisons. On average, hash sort is a Θ(1) operation for each participant and Θ(N ) for all participants. The improved performance is significant; more importantly, the operations for each participant are completely independent. Having completely independent operations facilitates parallelization of the algorithm even on less synchronous GPU architectures. Putting all this together, we can use hash sort to alphabetize participants and folders and add parallelism by dividing multiple rows by alphabet on the registration table. Two hash sorts will have an algorithmic complexity of Θ(N) and, in parallel, Θ(N/P), where P is the number of processors. For 16 processors and 100 participants, the parallel speedup, compared to the sequential brute-force method, is 1002/(100/16) = 1600x. We recognize that the hash sort design is similar to what we see at well-organized conferences or in school registration lines.

17 of 98

What is a hash function?

IN THIS SECTION, WE WILL DISCUSS THE IMPORTANCE OF THE HASH FUNCTION. Hashing techniques emerged in the 1950s and 1960s, but their adaptation to many applications was slow. Specifically, we will cover the components that make up a perfect hash, spatial hashing, perfect spatial hashing, and all the promising use cases.

A hash function associates a key with a value, just as when looking up a word in a dictionary, it is used as the key to look up its definition. In Figure 5.1, the word Romero is the key that is hashed to look up the value, which in this case is a nickname or username. Unlike a physical dictionary, a computer requires at least 26 possible storage locations, multiplied by the maximum length of the dictionary key. Therefore, it is imperative for a computer to encode the key into a shorter form, called a hash. The terms "hash" or "hashing" refer to "splitting" a key into a shorter form for use as an index pointing to the storage location of a value. A storage location for a collection of values for a specific key is called a bucket. There are many different ways to generate a hash from a key; the best approaches typically depend on the specific task.

18 of 98

What is a hash function?

Fig. 5.1 Hash table for searching for a computer nickname by last name.

In ASCII, R is 82 and O is 79. We can then calculate the first hash key using

82 – 64 + 26 + 79 – 4 = 59. The value stored in the hash table is the user name, sometimes called a nickname.

A perfect hash is a hash in which each bucket contains at most one entry. Perfect hashes are easy to maintain but can require more memory. A minimal perfect hash consists of only one entry in each bucket and has no empty buckets. Computing minimal perfect hashes takes longer, but for fixed sets of programming language keywords, for example, the extra time is worth it. Most of the hashes we'll discuss here will be generated on the fly, queried, and discarded, so faster generation time is more important than memory footprint. Where a perfect hash is impossible or requires too much memory, a compact hash can be used. A compact hash compresses the hash so that it requires less memory for storage. As always, there are tradeoffs between different hashing methods in terms of programming difficulty, execution time, and memory requirements.

19 of 98

What is a hash function?

The load factor is the fraction of the hash that is full. It is calculated as n/k, where n is the number of data elements in the hash table and k is the number of buckets. Compact hashes still perform well with load factors of .8 to .9, but after that, their effectiveness degrades due to collisions. Collisions occur when multiple keys attempt to store their value in the same bucket. It is important to have a good hash function that distributes keys more evenly, avoiding clustering of entries, which helps increase load factors. With a compact hash, the key and value are stored in such a way that upon retrieval, the key can be checked to ensure that the correct data element is present.

In the previous examples, we used the first letter of the last name as a simple hash key. While effective, using the first letter certainly has its drawbacks. One is that the number of surnames beginning with each letter of the alphabet is unevenly distributed, resulting in unequal numbers of data elements in each bucket. Instead, we could use an integer representation of the string value. This representation creates a hash for the first four letters of the name. However, the character set yields only 52 possible values for 256 storage locations for each byte, resulting in only a small fraction of the possible integer keys. A dedicated hash function that expects only characters would require far fewer storage locations.

20 of 98

Spatial Hashing: A Highly Parallel Algorithm

Our discussion in Chapter 1 used the uniformly spaced regular grid from the Krakatoa example in Figure 1.9. In this discussion of parallel algorithms and spatial hashing, we need to use more complex computational grids. In scientific simulations, more complex grids are typically defined with greater detail in the regions of interest. In big data, particularly in image analysis and classification, these more complex grids have not been widely adopted. However, this technique would be of great value there; when a cell in an image has mixed characteristics, it is simply necessary to split the cell.

The biggest obstacle to using more complex grids is that their encoding becomes more complex, requiring the adoption of new computational techniques. For complex grids, it is more difficult to find methods that perform well and scale well on parallel architectures. In this section, we demonstrate ways to implement several widely used spatial operations using highly parallel algorithms.

21 of 98

Spatial Hashing: A Highly Parallel Algorithm

Example: Simulating Waves from Krakatoa

Your team is working on their wave application and decides that for their simulation, they require finer resolution in some areas of the wave front and shoreline, as shown in Figure 1.9. However, they do not require finer resolution in other parts of the grid. The team decides to turn to adaptive mesh refinement, which allows them to set finer mesh resolution in the areas that require it.

Cell-based adaptive mesh refinement (AMR) refers to a class of unstructured grid processing techniques that no longer offer the simplicity of a structured grid in data localization. In cell-based AMR (Figure 5.2), the cell data arrays are one-dimensional, and the data can be arranged in any order. Grid locations are transferred to additional arrays that contain information about the size and location of each cell. Thus, the grid has some structure, but the data is completely unstructured. Going deeper into unstructured territory, a completely unstructured grid may contain cells of triangles, polyhedrons, or other complex shapes. This allows the cells to "fit" within the boundaries between land and ocean, but at the cost of more complex numerical operations. Since many of the same parallel algorithms for unstructured data are applicable to both, we will work primarily with the example of cellular AMR.

22 of 98

Spatial Hashing: A Highly Parallel Algorithm

Figure 5.2. A cell-based AMR mesh for a wave simulation from the CLAMR widget. Black squares are cells, and squares with different shades represent the height of the wave radiating outward from the upper right corner.

23 of 98

Spatial Hashing: A Highly Parallel Algorithm

AMR techniques can be categorized into patch-based, block-based, and cell-based approaches. Patch-based and block-based approaches use patches of varying sizes or fixed-size blocks, which can—at least partially—exploit the regular structure of these cell groups. Cellular AMR contains truly unstructured data, which can be arranged in any order. The CLAMR mini-app, shallow cell-based AMR (https://github.com/lanl/CLAMR.git), was developed by Davis, Nikolaeff, and Trujillo in 2011 while they were summer interns at Los Alamos National Laboratory. They wanted to see if cell-based AMR applications could run on GPUs. Along the way, they discovered breakthrough parallel algorithms that also accelerated CPU implementations. The most important of these was spatial hashing.

Spatial hashing is a technique in which the key is based on spatial information. The hashing algorithm maintains the same average algorithmic complexity of Θ(1) operations for each search. All spatial queries can be performed using a spatial hash; many are significantly faster than alternative techniques. Its basic principle is to map objects onto a grid of buckets arranged in a regular pattern.

A spatial hash is shown in the center of Figure 5.3. The size of the buckets is chosen based on the characteristic size of the objects being mapped. For a cell-based AMR grid, a minimum cell size is used. For particles or objects, as shown on the right in the figure, the cell size depends on the interaction distance. This choice means that for interaction or collision calculations, only directly adjacent cells need to be queried. Collision calculations are one of the most important applications of spatial hashes, not only in scientific computations such as particle hydrodynamics, molecular dynamics, and astrophysics, but also in game engines and computer graphics. In many situations, spatial locality can be exploited to reduce computational costs.

24 of 98

Spatial Hashing: A Highly Parallel Algorithm

Fig. 5.3 Computational meshes, particles, and objects mapped to a spatial hash. Unstructured mesh polyhedra and rectangular cells of a cellular adaptive refinement mesh can be mapped to a spatial hash for spatial operations. Particles and geometric objects can also benefit from being mapped to a spatial hash, providing information about their spatial localization so that only nearby elements need to be considered.

25 of 98

Spatial Hashing: A Highly Parallel Algorithm

Both the AMR and the unstructured grid on the left in the figure are called differentially discretized data, because the cells are smaller where gradients are steeper to better account for physical phenomena. However, they have a limit on how small the cells can get. This limit prevents the bucket sizes from becoming too small. Both grids store their cell indices in all underlying buckets of the spatial hash. For particles and geometric objects, the particle indices and object IDs are stored within buckets. This provides a form of locality that prevents the computational cost from increasing as the problem size increases. For example, if the problem domain is zoomed in on the left and top, the interaction computation in the lower-right corner of the spatial hash remains the same. Thus, the algorithmic complexity for the particle calculations remains Θ(N ) instead of growing to Θ(N 2 ). The following listing shows pseudocode for the interaction loop, which, instead of searching through all particles, loops through nearby locations in the inner loop.

26 of 98

Spatial Hashing: A Highly Parallel Algorithm

Using Perfect Hashing for Spatial Grid Operations

We'll first address perfect hashing to focus on the application of hashing rather than its internal mechanics. All of these methods rely on the ability to guarantee that each cell contains only one data element, avoiding collision handling issues when a cell may contain more than one data element. For perfect hashing, we'll explore four of the most important spatial operations.

Neighbor Finding – Locating one or two neighbors on each side of a cell.

Reflow – Mapping another AMR grid onto the current AMR grid.

Table Lookup – Locating intervals in a two-dimensional table to perform interpolation.

Sorting – Sorting cell data in one or two dimensions.

The complete source code for the examples for the four operations in one and two dimensions is available at https://github.com/lanl/PerfectHash.git under an open source license. This source code is also linked to the examples in this chapter. The perfect hash code uses CMake and checks for OpenCL availability. If OpenCL capabilities are not available, the code will detect this and will not compile the OpenCL implementations. The remaining cases will still run on the CPU.

27 of 98

Spatial Hashing: A Highly Parallel Algorithm

Neighbor Finding Using Spatial Perfect Hash

Neighbor finding is one of the most important spatial operations. In scientific computing, material moved from one cell must move to an adjacent cell. We need to know which cell to move it to in order to calculate the amount of material and move it. In image analysis, the characteristics of an adjacent cell can provide important information about the composition of the current cell.

The rules for the AMR mesh in the CLAMR widget are that the granularity of a cell's edge can only have a single-level jump. Furthermore, for each cell on each side, the neighbor list is represented by only one of its neighboring cells, and the cell selected must be the bottom cell or the cell to the left of each pair, as shown in Figure 5.4. The second cell of a pair is found using the neighbor list of the first cell; for example, ntop[nleft[ic]]. The task then involves setting up neighbor arrays for each cell.

28 of 98

Spatial Hashing: A Highly Parallel Algorithm

Fig. 5.4 A left neighbor is the lower cell of the two to the left, and a bottom neighbor is the cell to the left of the two below. Similarly, a right neighbor is the lower cell of the two to the right,

and a top neighbor is the cell to the left of the two cells above.

29 of 98

Spatial Hashing: A Highly Parallel Algorithm

What are some possible algorithms for finding neighbors? The naive way is to search all other cells for a cell that is adjacent to it. This can be done by accessing the variables i, j, and level in each cell. The naive algorithm has O(N 2 ) algorithmic complexity. It shows good performance with a small number of cells, but the running time complexity quickly becomes large. Some common alternative algorithms are tree-based, such as k-D-tree and quadtree (octree in three dimensions) algorithms. These algorithms are comparison-based, scale as O(N log N ), and will be defined later. Source code for computing neighbors in two dimensions, including k-D-tree, brute force, and a hash implementation for CPU and GPU, can be found at https://github.com/lanl/PerfectHash.git, along with other applications of spatial perfect hash discussed later in this chapter.

A k-D-tree splits the grid into two equal halves in the x-dimension, then into two equal halves in the y-dimension, repeating until the object is found. The k-D-tree construction algorithm has O(N log N ) complexity, and each individual search also has O(N log N ) complexity.

A quadtree has four children for each parent, one for each quadrant. This exactly corresponds to the subdivision of the AMR cell grid. A full quadtree starts at the top, or root, with a single cell and subdivides down to the finest level of the AMR grid. A "prune" quadtree starts at the coarsest level of the grid and has a quadtree for each coarse cell, mapping down to the finest level. The quadtree algorithm is comparison-based and has O(N log N ) complexity.

30 of 98

Spatial Hashing: A Highly Parallel Algorithm

Limiting the search to a single-level edge jump is called a graded grid. In cellular AMR, graded grids are common, but other applications of quadtrees, such as n-body applications in astrophysics, result in much larger jumps in the quadtree data structure. A single-level jump in granularity allows us to improve the design of the neighbor-finding algorithm. We can start the search at the leaf representing our cell, and at most, we only need to climb two tree levels to find our neighbor. To find a close neighbor of similar size, the search should start at the leaves and use a quadtree. To find large irregular objects, a k-D tree should be used, and the search should start at the root of the tree. Proper use of tree-based search algorithms can provide a viable implementation on CPUs, but comparison and tree construction present difficulties for GPU processors, where cross-workgroup comparisons cannot be easily performed. This paves the way for developing a spatial hash to perform neighbor-finding operations. Collision-free operation in our spatial hash can be guaranteed by making the hash buckets the size of the most detailed cells in the AMR grid. The algorithm then takes the following form:

Allocate a spatial hash the size of the most detailed level of the AMR grid;
For each cell in the AMR grid, write the cell number to the hash buckets corresponding to the cell;
Calculate the index for the more detailed cell one cell outward from the current cell on each side;
Read the value stored in the hash bucket at that location.

For the grid shown in Figure 5.5, the write phase is followed by a read phase to find the index of the cell's right neighbor.

31 of 98

Spatial Hashing: A Highly Parallel Algorithm

This algorithm is well suited for GPU processors and is shown in Listing 5.5. In the first implementation, porting from CPU to GPU took less than a day. Implementing the original k-D tree on GPUs would require weeks or months. The algorithmic complexity also violates the O(log N) threshold and averages Θ(N) for N cells.

Fig. 5.5 Finding the right neighbor of cell 21 using spatial perfect hash

32 of 98

Spatial Hashing: A Highly Parallel Algorithm

This first implementation of perfect hash-based neighbor computation was an order of magnitude faster on a CPU than the k-D-tree-based method, and an additional order of magnitude faster on a GPU than on a single CPU core, for a total of 3157 times (Figure 5.6). The algorithm's performance was tested on an NVIDIA V100 GPU and a Skylake Gold 5118 CPU with a nominal clock frequency of 2.30 GHz. This architecture was also used in all the results presented in this chapter. The CPU core and GPU architecture are the best available in 2018, yielding the best parallel speedup comparison in 2018 (speedup notation is described in Section 1.6). However, this comparison is not an architectural comparison between CPUs and GPUs. If 24 virtual cores were used on this CPU, the CPU would also experience significant parallel speedup.

How difficult is it to code for this kind of performance? Let's look at the hash table code in Listing 5.4 for the CPU. The procedure's input is one-dimensional arrays, i, j, and level, where level is the level of detail, and i and j are the row and column of the cell in the grid at that cell's level of detail. The entire listing consists of about a dozen lines of code.

33 of 98

Spatial Hashing: A Highly Parallel Algorithm

Fig. 5.6 Algorithm and parallel acceleration of a total of 3157 times. The new algorithm provides parallel acceleration on the GPU

34 of 98

Spatial Hashing: A Highly Parallel Algorithm

35 of 98

Spatial Hashing: A Highly Parallel Algorithm

The loops in lines 459, 461, and 462 reference the one-dimensional arrays i, j, and level; level is the level of detail, where 0 is the coarse level and 1 to levmax are the fine-grained levels. The i and j arrays represent the row and column of a cell in the grid at that cell's level of detail.

Listing 5.5 shows the code for writing a spatial hash in OpenCL for the GPU, which is similar to Listing 5.4. Although we haven't yet covered OpenCL, the simplicity of the GPU code is obvious even without understanding all the details. Let's make a quick comparison to understand how the original GPU code would change. We define a macro to handle two-dimensional indexing and make the code more similar to the CPU version. The biggest difference then is the absence of a loop over cells. This is typical for GPU code; in it, outer loops are removed and instead handled by running the compute kernel. The cell index is provided internally for each processor thread by calling the get_global_id function. This example and OpenCL code writing in general will be discussed in more detail in Chapter 12.

36 of 98

Spatial Hashing: A Highly Parallel Algorithm

37 of 98

Spatial Hashing: A Highly Parallel Algorithm

The code for retrieving neighbor indices is also simple, as shown in Listing 5.6. It contains a simple loop through cells and a hash table read at the location where the neighbors' locations are at the most detailed grid level. Neighbor locations are found by incrementing a row or column by one cell in the desired direction. For a left or bottom neighbor, the increment is 1, while for a right or top neighbor, the increment is the full grid width in the x-direction or imaxsize.

38 of 98

Spatial Hashing: A Highly Parallel Algorithm

For GPU, we again eliminate the cell loop and replace it with a call to get_global_id, as shown in the following listing.

39 of 98

Spatial Hashing: A Highly Parallel Algorithm

Compare the simplicity of this code to the k-D tree code for CPU, which is a thousand lines long!

40 of 98

Spatial Hashing: A Highly Parallel Algorithm

Remapping Calculations Using a Spatial Ideal Hash

Another important numerical mesh operation is remapping from one mesh to another. Fast remappings allow for various physics operations to be performed on meshes, optimized for their individual needs.

In this case, we will consider remapping values from one AMR cell mesh to another AMR cell mesh. Mesh remapping can also be used with unstructured meshes or particle-based simulations, but these techniques are more complex. The setup phase is identical to neighbor-finding, where the cell index for each cell is written to a spatial hash. In this case, the spatial hash is created for the source mesh. The read phase, shown in Listing 5.8, then queries the spatial hash for the cell numbers corresponding to each cell in the target grid and sums the values from the source grid to the target grid after adjusting for cell size differences. For this demonstration, we simplified the source code from the example at https://github.com/EssentialsofParallelComputing/Chapter5.git.

41 of 98

Spatial Hashing: A Highly Parallel Algorithm

Figure 5.7 shows the performance improvement achieved by recombining using a spatial perfect hash. The algorithm achieves a speedup and then an additional parallel acceleration when executed on a GPU, for a total of over 1000 times faster. The parallel acceleration on the GPU is made possible by the ease of implementing the algorithm on a GPU. Good parallel acceleration should also be possible on a multi-core CPU.

42 of 98

Spatial Hashing: A Highly Parallel Algorithm

Table Lookups Using Spatial Perfect Hash

Looking up values in tabular data represents another type of locality that can be exploited by spatial hashing. Hashing can be used to search intervals on both axes for interpolation. In this example, we used a 51x23 lookup table of equation of state values. The two axes are density and temperature, with equal spacing between values on each axis. We will use n for the axis length and N for the number of table lookups to perform. In this study, we used three algorithms.

The first is a linear search (brute-force), starting from the first column and row. The brute-force algorithm (using exhaustive search) should have an algorithmic complexity of O(n) for each data query or O(N×n) for all N, where n is the number of columns or rows for each axis, respectively.

The second is bisection search, which accesses the midpoint of a possible range and recursively narrows the location for the interval. The bisection search algorithm should have O(log n) complexity for each data query.

Finally, we used a hash for O(1) interval search for each axis. We measured the hash's performance on both a single CPU core and a GPU. To obtain the result, the test code performs interval searches on both axes and simple interpolation of data values from the table.

43 of 98

Spatial Hashing: A Highly Parallel Algorithm

Fig. 5.7 Speeding up the recombinant algorithm by switching the algorithm from a k-D tree to a hash on a single CPU core, and then transferring it to a GPU for parallel acceleration.

44 of 98

Spatial Hashing: A Highly Parallel Algorithm

Figure 5.8 shows the performance results of the different algorithms. The results may contain some surprises. Bisection search is no faster than brute-force search (linear search), despite the fact that this algorithm has O(N log n) complexity, as opposed to O(N × n). This seems to contradict the simple performance model, which suggests a 4-5x speedup for searching along each axis. With interpolation, we still expect a roughly 2x improvement. However, there's a simple explanation, which you can guess from our discussion in Section 5.2.

In linear search, searching for an interval on each axis requires—at most—only two cache hits on one axis and four on the other! Bisection requires the same number of cache hits. Given the cache hits, we expect no difference in performance. The hashing algorithm can navigate directly to the correct interval, but doing so will still require a cache hit. The reduction in cache loads is approximately threefold. In the hashing algorithm, additional improvement is likely due to the reduction in conditional jumps. The observed performance matches expectations when including the cache hierarchy effect.

45 of 98

Spatial Hashing: A Highly Parallel Algorithm

Fig. 5.8 Algorithms used for table lookup show significant speedup of hashing algorithm on GPU

46 of 98

Spatial Hashing: A Highly Parallel Algorithm

Porting the algorithm to a GPU is a bit more challenging, and it demonstrates the performance improvements possible in the process. To understand what was done, let's first look at the CPU implementation of the hash in Listing 5.9. The code iterates through all 16 million values, finding intervals on each axis, and then interpolates the data in the table to obtain the resulting value. Using hashing, we can find the location of intervals using a simple arithmetic expression without conditional jumps.

47 of 98

Spatial Hashing: A Highly Parallel Algorithm

We could simply move this to the GPU, as we did in the previous cases, removing the for loop and replacing it with a call to the get_global_id function. However, the GPU has a small local memory cache shared by each workgroup, which can hold about 4000 double-precision values. We have 1173 table values and 51+23 axis values. These can fit in the local memory cache, which can be accessed quickly and shared by all CPU threads in the workgroup. The code in Listing 5.10 shows how this is done. The first part of the code cooperatively loads the data values into local memory using all threads. Then, synchronization is required to ensure that all the data is loaded before moving on to the interpolation kernel. The remaining code looks almost identical to the CPU code in Listing 5.9.

48 of 98

Spatial Hashing: A Highly Parallel Algorithm

The GPU hash code performance result shows the impact of this optimization, with a larger speedup than the single CPU core performance for other kernels.

49 of 98

Spatial Hashing: A Highly Parallel Algorithm

Sorting Grid Data Using Spatial Perfect Hash

Sorting is one of the most studied algorithms and underlies many other operations. In this section, we consider a special case of sorting spatial data. Spatial sorting can be used to find nearest neighbors, eliminate duplicates, simplify range detection, display graphics, and a host of other operations.

For simplicity, we will work with one-dimensional data with a minimum bin size of 2.0. All bins must be twice the minimum bin size. The test case allows up to four coarseness levels in addition to the minimum bin size for the following possibilities: 2.0, 4.0, 8.0, 16.0, and 32.0. Bucket sizes are generated randomly, and the bins are ordered randomly. Sorting is performed using quicksort and then hash sort on CPU and GPU. The spatial hash sort calculation uses information about the one-dimensional data. We know the minimum and maximum values of X and the minimum bucket size. With this information, we can calculate a bucket index guaranteed by a perfect hash using the formula

where bk is the bin in which to place the data item, Xi is the x-coordinate of the bin, Xmin is the minimum X value, and Δmin is the minimum distance between any two adjacent X values.

50 of 98

Spatial Hashing: A Highly Parallel Algorithm

We can demonstrate the hash sort operation (Figure 5.9). The minimum difference between values is 2.0, so a bucket size of 2 ensures no collisions. The minimum value is 0, so the bucket location can be calculated using Bi = Xi /Δmin = Xi /2.0. The hash table could store either a value or an index. For example, 8, the first key, could be stored in bucket 4, or the initial index position of zero could also be stored. If the value is stored, we retrieve 8 using hash[4]. If the index is stored, we retrieve its value using keys[hash[4]]. Storing the index position in this case is a little slower, but it is more general. It can also be used to reorder all arrays in a grid. In the test example, we use the index-storing method to study performance.

The spatial hash sort algorithm has an algorithmic complexity of Θ(N), while quicksort has a complexity of Θ(N log N). However, spatial hash sort is more specialized for the problem at hand and may temporarily consume more memory. Other questions concern how difficult it is to write this algorithm and what its performance is. The following listing shows the program code for the write phase of the spatial hash sort implementation.

51 of 98

Spatial Hashing: A Highly Parallel Algorithm

Fig. 5.9 Spatial Perfect Hash Sorting. This method stores the value in the hash bucket, but it can also store the value's index position in the original array. Note that the bin size 2, with a range from 0 to 24, is indicated by small numbers to the left of the hash table.

52 of 98

Spatial Hashing: A Highly Parallel Algorithm

53 of 98

Spatial Hashing: A Highly Parallel Algorithm

Note that the code in the listing is barely a dozen lines long. Compare this to the quicksort code, which is five times longer and significantly more complex.

Figure 5.10 shows the performance of spatial sorting on both a single CPU core and a GPU. As we will see, parallel implementations on CPUs and GPUs require some effort to ensure good performance. The read phase of the algorithm requires a well-implemented prefix sum to enable parallel retrieval of sorted values. Prefix sums are an important pattern in many algorithms; we will discuss them further in Section 5.6.

In this example, the GPU implementation uses a well-implemented prefix sum, and as a result, the performance of spatial hash sorting is excellent. In earlier tests with an array size of two million, this sort on GPUs was 3x faster than the fastest general sort on GPUs, and the sequential version on CPUs was 4x faster than standard quicksort. With the current CPU architecture and a larger array size of 16 million, our spatial hash sort is shown to be almost 6x faster (Figure 5.10). Remarkably, our sort, written in two to three months, is significantly faster than the fastest current benchmark sorts on CPUs and GPUs, especially since benchmark sorts are the result of decades of research and the efforts of many researchers!

54 of 98

Spatial Hashing: A Highly Parallel Algorithm

Figure 5.10 Our spatial hash sort shows speedup on a single CPU core and further parallel acceleration on a GPU. Our sort is 6x faster than the current fastest sort.

55 of 98

Spatial Hashing: A Highly Parallel Algorithm

Using Compact Hashing for Spatial Operations on a Grid

We haven't yet completed our exploration of hashing methods. The algorithms in the perfect hashing section can be significantly improved. In the previous section, we explored the use of compact hashes for neighbor discovery and recombinant operations. Key observations include the fact that we don't need to write to each bucket of a spatial hash, and we can improve the algorithms by manipulating collisions. This enables spatial hashes to be compressed and uses less memory. This will give us more options for algorithms with different memory requirements and execution times.

Neighbor Discovery with Compact Hashing and Write Optimizations

The previous simple perfect hash algorithm for neighbor discovery shows good performance for a small number of granularities on an AMR grid. But when there are six or more granularity levels, a coarse cell writes to 64 hash buckets, while a fine cell only needs to write to one, leading to load imbalance and thread divergence issues in parallel implementations.

56 of 98

Spatial Hashing: A Highly Parallel Algorithm

Stream divergence is a situation where the amount of work for each thread varies, and threads end up waiting for the slowest one. We can further improve the perfect hash algorithm using the optimizations shown in Figure 5.11. The first optimization is to realize that neighbor searches only the outer hash buckets of a cell, so there is no need to write to inner buckets. Further analysis reveals that only the corners or midpoints of the cell's hash representation will be queried, further reducing the number of required writes. In the figure, the example shown at the far right of the sequence further optimizes writes to just one per cell and performs multiple reads if a record exists for a finer-grained, equal-sized, or coarser neighboring cell. This last technique requires initializing the hash table with a sentinel value, such as -1, to indicate the absence of a data element. But now that less data is written to the hash, we have a lot of empty space, or sparsity, and we can compress the hash table down to 1.25 times the size of the number of entries, significantly reducing the algorithm's memory requirements. The reciprocal of the size factor is called the hash load factor and is defined as the number of filled positions in the hash table divided by the size of the hash table. For a size factor of 1.25, the hash load factor is 0.8. We typically use a much smaller load factor, around 0.333, or a size factor of 3. This is because in parallel processing, we want to avoid having one processor be slower than the others. Hash sparsity represents the empty space in the hash. Sparsity indicates the potential for compression.

57 of 98

Spatial Hashing: A Highly Parallel Algorithm

Fig. 5.11 Optimization of neighbor-finding computation using ideal spatial hash by reducing the number of write and read operations

Figure 5.12 shows the process of creating a compact hash. Due to compression to a compact hash, two data elements attempt to store their value in slot 1. The second element sees that a value already exists there, so it searches for the next open slot using a method called open addressing. In open addressing, we search for the next open slot in the hash table and store the value in that slot. There are other hashing methods besides open addressing, but they often require the ability to allocate memory during the operation. Allocating memory on a GPU is more difficult, so we stick with open addressing, where collisions are resolved by finding alternative storage locations in the already allocated hash table.

58 of 98

Spatial Hashing: A Highly Parallel Algorithm

Figure 5.12 The sequence shown, from left to right, shows storing spatial data in an ideal spatial hash, compressing it into a smaller hash, and then – in case of a collision – finding the next available empty slot to store the data item in

59 of 98

Spatial Hashing: A Highly Parallel Algorithm

In open addressing, there are several options that can be used for trying the next open slot. These are:

linear probing – where the next location of a data element is simply the next bucket in the sequence until an open bucket is found;
quadratic probing – where the increment is squared, resulting in the chosen intervals being +1, +4, +9, etc., from the initial location;
double hashing – where a second hash function is used to move to a deterministic but pseudo-random distance from the first trial location.

The reason for the more complex options for the next try is to prevent clustering of values in part of the hash table, which leads to longer store and query sequences. We use the quadratic probing method because the first two tries are in the cache, resulting in improved performance. Once a slot is found, the key and value are stored. When reading a hash table, the stored key is compared with the read key, and if they don't match, the read tries the next slot in the table.

We could evaluate the effectiveness of these optimizations by counting the number of writes and reads. However, we need to adjust these write and read counts to account for the number of cache lines, not just the raw number of values. Furthermore, the code with optimizations contains more conditional branches. Therefore, the runtime improvement is modest and only improves for higher grid granularities. Parallel code on a GPU provides greater benefits, as thread divergence is reduced.

60 of 98

Spatial Hashing: A Highly Parallel Algorithm

Figure 5.13 shows measured performance results for various hash table optimizations for an AMR grid sample with a relatively modest sparsity factor of 30. The source code is available at https://github.com/lanl/CompactHash.git. The final performance results shown in Figure 5.13 for both CPU and GPU are for compact hash runs. The cost of compact hash is offset by the lack of sufficient memory for initialization with a sentinel value of -1. As a result, compact hash has competitive performance with perfect hashing methods. For hash table sparsities greater than 30x, the compact hash in this test case is even faster than perfect hashing methods. Cellular AMR methods in general should have at least 10x compression and can often exceed 100x. These hashing methods were implemented in the CLAMR mini-app. The code switches between a perfect hashing algorithm for low sparsity levels and a compact hash when the hash contains a lot of empty space.

61 of 98

Spatial Hashing: A Highly Parallel Algorithm

Figure 5.13. The optimized versions shown for CPU and GPU correspond to the methods shown in Figure 5.11. Compact refers to CPU compactness, and G Comp refers to GPU compactness for the last method in each set. The compact method is faster than the original perfect hash and requires significantly less memory. At higher granularity levels, methods that reduce the number of writes also show some performance advantage.

62 of 98

Spatial Hashing: A Highly Parallel Algorithm

Finding Edge Neighbors for Unstructured Meshes

So far, we haven't discussed algorithms for unstructured meshes because it's difficult to guarantee that a perfect hash can be easily generated for them. Most practical methods require a way to handle collisions and, therefore, compact hashing methods. Let's consider one case where using a hash is quite straightforward. Finding an edge neighbor for a polygonal mesh can be expensive. Many unstructured codes store a dictionary of neighbors because it's very expensive. The technique we demonstrate below is so fast that the dictionary of neighbors can be computed on the fly.

Example: Finding Edge Neighbors for an Unstructured Mesh

The following figure shows a small portion of an unstructured mesh with polygonal cells. One of the computational challenges for this type of mesh is finding a connectivity map for each edge of the polygons. A brute-force search of all remaining elements seems reasonable for a small number of polygons, but for a larger number of cells, this can take minutes or hours. A k-D tree search reduces the time, but is there an even faster way? Let's try a hash-based method instead. We overlay the figure on top of the hash table to the right of the figure. The algorithm is as follows:

we place the center point of each edge in the hash bucket where it fits;
each cell writes its cell number in the bucket at the center of each edge. If the edge is to the left and above the center, it writes its index in the first of two bucket locations; otherwise, it writes in the second location;
each cell checks each edge to see if the number is in the other bucket. If so, it's a neighboring cell. If not, we have an outer edge without a neighbor.

We've found our neighbors in one write and one read!

63 of 98

Spatial Hashing: A Highly Parallel Algorithm

Finding a neighbor for each edge of each cell using a spatial hash. Each edge writes data to one of two buckets in the spatial hash. If the edge is located to the left and above the center, it is written to the first bucket. If it is located to the right and below, it is written to the second. During the read pass, it checks whether the other bucket is full, and if it is, the cell number is its neighbor (graph and algorithm courtesy of Rachel Roby).

The proper size of a hash table is difficult to specify. The best solution is to choose a reasonable size based on the number of edges or the minimum edge length, and then handle collisions if they occur.

64 of 98

Spatial Hashing: A Highly Parallel Algorithm

Remapping with Write Optimization and Compact Hashing

Another operation, remapping, is slightly more difficult to optimize and tune for compact hashing because the ideal hash approach reads all relevant bins. First, we must come up with a method that doesn't require filling every hash bucket.

We write the bin indices for each bin in the lower-left corner of the corresponding hash. Then, during reading, if the value isn't found or the bin level in the input grid is incorrect, we look up the location where the bin would be written in the input grid if it were at the next coarser level. Figure 1 shows the remapping operation. Figure 5.14 shows this approach, where cell 1 in the output grid queries the hash location (0,2) and finds the value -1, so it then looks up where the next coarser cell (0,0) would be and finds the cell's index, which is 1. The density of cell 1 in the output grid is then set equal to the density of cell 1 in the input grid. For cell 9 in the output grid, the approach looks up the hash at (4,4) and finds the index of the input cell 3. It then looks up the level of cell 3 in the input grid, and since the level of the input grid cell is higher, it must also query the hash locations (6,4) to get the index of cell 9, the location (4,6), which returns the index of cell 4, and the location (6,6) to get the index of cell 7. The first two cell indices are at the same level, so they don't need to go any further. The index of cell 7 is at a more granular level, so we must recursively descend to that location to find the indices of cells 8, 5, and 6. This source code is shown in Listing 5.12.

65 of 98

Spatial Hashing: A Highly Parallel Algorithm

Figure 5.14. Implementation of a write-once, read-many layout using a spatial hash. The first query is for the location where a cell of the same size will be written from the input grid, and then, if the value is not found, the next query looks for the location where a cell at the next coarser level will be written.

66 of 98

Spatial Hashing: A Highly Parallel Algorithm

67 of 98

Spatial Hashing: A Highly Parallel Algorithm

Before writing, a perfect hash table is allocated and initialized with a sentinel value of -1 (Figure 5.10). Then, the cell indices from the input grid are written to the hash (Listing 5.13). The source code is available at https://github.com/lanl/CompactHashRemap.git in the file AMR_remap/singlewrite_remap.cc, along with examples of using a compact hash table and OpenMP. The OpenCL version for GPUs is in the file AMR_remap/h_remap_kern.cl.

68 of 98

Spatial Hashing: A Highly Parallel Algorithm

The code for the reading phase (Listing 5.14) has an interesting structure. The first part is essentially divided into two cases: a cell at the same location in the input grid has the same level or is coarser, or it is a set of more fine-grained cells. In the first case, we loop through the levels until we find the desired level and set the value in the output grid equal to the value in the input grid. If the input grid is finer, we recursively descend through the levels, summing the values along the way.

69 of 98

Spatial Hashing: A Highly Parallel Algorithm

70 of 98

Spatial Hashing: A Highly Parallel Algorithm

71 of 98

Spatial Hashing: A Highly Parallel Algorithm

Okay, it seems to work perfectly on a CPU, but how will it work on a GPU? Apparently, recursion isn't supported on the GPU.

It seems there's no easy way to write this phase without recursion. But we tested it on the GPU and found that the code works. It works perfectly on all the GPUs we tried for the limited number of levels of detail that would be used in any practical grid. Clearly, a limited amount of recursion works on the GPU! We then implemented versions of this approach with compact hashes, and they show good performance.

Hierarchical Hashing Technique for Reflow Operation

Another innovative approach to using hashing for the reflow operation involves a hierarchical set of hashes and a breadcrumb trail (i.e., a breadcrumb trail). A chain of multiple sentinel values has the advantage that we do not need to initialize the hash tables with the sentinel value at the very beginning (Figure 5.15).

72 of 98

Spatial Hashing: A Highly Parallel Algorithm

Fig. 5.15 A hierarchical hash table with a separate hash for each level. When a write is made at one of the more granular levels, a sentinel value is placed at each level above, forming a "breadcrumb" trail informing queries about the presence of data at more granular levels.

73 of 98

Spatial Hashing: A Highly Parallel Algorithm

The first step is to allocate a hash table for each level of the grid. Then, the cell indices are written to the hash of the corresponding level and recursively traversed up through the coarser hashes, leaving a sentinel value so that queries know that higher-level hash tables contain values. Looking at cell 9 of the input grid in Figure 5.15, we see that:

the cell index is written to the mid-level hash table, then the sentinel value is written to the hash buckets in the coarser hash table;
the read operation for cell 9 first goes to the coarsest level of the hash table, where it finds the value -1. Now it knows it must proceed to finer levels;
it finds three cells in the mid-level hash table and another sentinel value, telling the read operation to recursively descend to the highest level, where it finds four more values to add to the sum;
all other queries are stored in the coarsest hash table and in the values assigned to the output grid.

Each hash table can be either a perfect hash or a compact hash. This method has a recursive structure similar to the previous technique. It also works well on GPUs.

74 of 98

Prefix summation (scan) pattern and its importance in parallel computing

The prefix sum was an important element of the parallel operation of hash sort in Section 5.5.1. The prefix sum operation, also called a scan, is widely used in computations with irregular dimensions. Many computations with irregular dimensions require knowledge of where to start writing to enable parallelism. A simple example is the situation where each processor has a different number of particles. To be able to write to the output array or access data on other processors or in processor threads, each processor must know the relationship between local indices and global indices. In a prefix sum, the output array y is the running sum of all the numbers preceding it in the input array:

A prefix sum can be either an inclusive scan, which includes the current value, or an exclusive scan, which does not. The equation above is for an exclusive scan. Figure 5.16 shows both exclusive and inclusive scans. An exclusive scan is the starting index for the global array, while an inclusive scan is the ending index for each process or thread.

75 of 98

Prefix summation (scan) pattern and its importance in parallel computing

Listing 5.15 shows the standard sequential source code for the scan operation.

After the scan operation completes, each process can perform its work in parallel because each process knows where to place its result. However, the scan operation itself appears to be inherently sequential. Each iteration depends on the previous one. However, there are efficient ways to parallelize it. In this section, we will consider algorithms with step efficiency, job efficiency, and large-array efficiency.

Fig. 5.16. The array x gives the number of particles in each cell. An exclusive and inclusive scan of the array gives the start and end addresses in the global data set.

76 of 98

Prefix summation (scan) pattern and its importance in parallel computing

Parallel Scan Operation with Step Efficiency

A step-efficient algorithm uses the fewest number of steps. However, this may not be the fewest number of operations, because each step can have a different number of operations. This issue was discussed earlier in defining computational complexity in Section 5.1.

The prefix sum operation can be performed in parallel using a tree-based reduction pattern, as shown in Figure 5.17. Instead of waiting for the previous element to sum its values, each element sums its value and the previous one. Then it performs the same operation on the value two elements above, four elements above, and so on. The end result is an inclusive scan; during this operation, all processes were busy.

We now have a parallel prefix that runs in only log2n steps, but the workload is increased by the serial algorithm. Is it possible to design a parallel algorithm with the same workload?

77 of 98

Prefix summation (scan) pattern and its importance in parallel computing

Fig. 5.17 Inclusive scan with step efficiency, parallel prefix sum computation uses O(log2n) steps

78 of 98

Prefix summation (scan) pattern and its importance in parallel computing

Work-efficient Parallel Scan Operation

A work-efficient algorithm uses the fewest number of operations. This may not be the fewest number of steps, because each step can have a different number of operations. The decision to use a work-efficient or step-efficient algorithm depends on the number of possible parallel processes.

A work-efficient parallel scan operation uses two iterations over arrays. The first iteration is called the ascending iteration, although it more closely resembles the right iteration. This is shown in Figure 5.18 from top to bottom, rather than the traditional bottom-up approach, to simplify comparisons with a step-efficient algorithm.

79 of 98

Prefix summation (scan) pattern and its importance in parallel computing

Figure 5.18. The phase of the ascending turn of a scan with work efficiency, shown from top to bottom, which performs significantly fewer operations than a scan with step efficiency. Essentially, all other values remain unchanged.

80 of 98

Prefix summation (scan) pattern and its importance in parallel computing

The second phase, called the descending loop phase, is more similar to the left loop. It begins by setting the last value to zero and then performs another tree loop (Figure 5.19), producing the final result. The amount of work is significantly reduced, but it requires more steps.

Fig. 5.19 In the descending phase of an exceptional scan with work efficiency, far fewer operations are performed than with a scan with step efficiency

81 of 98

Prefix summation (scan) pattern and its importance in parallel computing

When viewed this way, the performance scan exhibits an interesting pattern: the right loop starts with half the processor threads and decreases until only one is running. It then loops back to the left with one thread at the beginning and ends with all threads busy. These additional steps allow earlier computations to be reused, resulting in a total computation time of only O(N).

These two parallel prefix summation algorithms offer several different ways to leverage parallelism in this important operation. However, both are limited by the number of threads available in the GPU workgroup or the number of processors on the CPU.

82 of 98

Prefix summation (scan) pattern and its importance in parallel computing

Parallel Scan Operations for Large Arrays

For large arrays, we also need a parallel algorithm. Figure 5.20 shows such an algorithm using three GPU compute cores. The first core starts with a reduction sum for each workgroup and stores the result in a temporary array that is smaller than the original large array by the number of threads in the workgroup. On GPUs, the number of threads in a workgroup is typically as large as 1024. The second core then loops through the temporary array, scanning each workgroup-sized block. This results in the temporary array now containing the offsets of each workgroup. The third core is then called to perform the scan operation on workgroup-sized chunks of the original array and the offset calculated for each thread at that level.

Because the parallel prefix sum is so important for operations like sorting, it is heavily optimized for GPU architectures. We will not go into detail at this level in this book. Instead, we encourage application developers to use libraries or open-source implementations for their own work.

For parallel prefix scans available for CUDA, you can find implementations such as the CUDA Parallel Primitives Library (CUDPP), available at https://github.com/cudpp/cudpp. For OpenCL, we offer an implementation from the CLPP Parallel Primitives Library or a scan implementation from our hash sort code, available in the file sort_kern.cl at https://github.com/LANL/PerfectHash.git. We will present a prefix scan implementation for OpenMP in Chapter 7.

83 of 98

Prefix summation (scan) pattern and its importance in parallel computing

Fig. 5.20. A large array scan is performed in three phases using three GPU cores. In the first phase, a reduction sum is performed, reducing the array to an intermediate array. In the second phase, a scan is performed to create workgroup shifts. Then, in the third phase, the original array is scanned and the workgroup shifts are applied, producing scan results for each array element.

84 of 98

Parallel global sum: solving the associativity problem

Not all parallel algorithms are aimed at speeding up computations. The global sum is a prime example of such a case. Parallel computing has suffered from non-reproducibility of sums across processors since its earliest days. In this section, we present one example of an algorithm that improves the reproducibility of a parallel computation, bringing it closer to the results of the original serial computation.

Changing the order of addition operations changes the answer in finite-precision arithmetic (or finite-precision). This is problematic because parallel computation changes the order of addition operations. This problem stems from the fact that finite-precision arithmetic is not associative. It worsens as the problem size increases, because adding the last value becomes a smaller and smaller part of the total sum. Ultimately, adding the last value may not change the sum at all. An even worse case for finite-precision addition operations occurs when adding two values that are nearly identical but have different signs. Subtracting one value from another when they are nearly identical leads to catastrophic cancellation. The result is only a few significant digits, and the rest is filled with noise.

85 of 98

Parallel global sum: solving the associativity problem

Example: Catastrophic Cancellation

Subtracting two nearly identical values will produce a result with a small number of significant digits. Place the following code snippet in a file called catastrophic.py and execute the Python program catastrophic.py.

The result in this example has only a few significant digits! Where do the remaining digits in the printed value come from? The problem with parallel computing is that instead of a sum representing a linear addition of the values in the array, the sum on two processors is a linear sum of half the array, then two partial sums added together at the end. Reversing the order results in a different global sum. The difference may be small, but now the question arises as to how well the code was parallelized, if at all. To compound the problem, new parallelization techniques and hardware, such as vectorization and threads, also introduce this problem. The pattern for this global sum operation is called reduction.

86 of 98

Parallel global sum: solving the associativity problem

DEFINITION: Reduction is an operation in which an array of one or more dimensions is reduced by at least one dimension, often resulting in a scalar value.

In parallel computing, this operation is one of the most common and is often associated with performance, and in this case, with correctness. An example is calculating the total mass or energy in a problem. It takes the global mass array in each cell and reduces it to a single scalar value.

As with all computer calculations, the results of a global sum reduction are imprecise. In serial calculations, this is not a serious problem because we always get the same imprecise result. In parallel, we will likely get a more accurate result with more correct significant digits, but it differs from the serial result. This problem is called the global sum problem. Any time the results between the serial and parallel versions differed slightly, the cause was related to this problem. But often, when taking the time to delve deeper into the code, the problem turned out to be a subtle parallel programming error, such as the failure to update ghost cells between processors. Ghost cells are cells that store values from neighboring processors needed by the local processor, and if they are not updated, slightly older values cause a slight error compared to a sequential run.

87 of 98

Parallel global sum: solving the associativity problem

Let's look at the size of a 134,217,728 problem on a single processor, with half the values in the high-energy state and the other half in the low-energy state. These two regions are separated by a diaphragm at the beginning of the problem. The problem size is large for a single processor, but relatively small for parallel computing. If the high-energy values are summed first, the next low-energy value added will have few significant digits. Reversing the order of the sum so that the low-energy values are summed first results in the small values in the sum being nearly the same size, and by the time the high-energy value is added, there will be more significant digits, resulting in a more accurate sum. This fact gives us a possible solution based on sorting. Simply sort the values in order from smallest to largest, and you will get a more accurate sum. There are several techniques for global summation that are much more convenient than the sorting method. The list of possible techniques presented here is as follows:

long double data type;
pairwise summation;
Kahan summation;
Knuth summation;
quadruple-precision summation.

You can try out these various methods in the exercises accompanying this chapter at https://github.com/EssentialsOfParallelComputing/Chapter5.git. The original study examined parallel OpenMP implementations and truncation techniques, which we will not discuss here.

88 of 98

Parallel global sum: solving the associativity problem

The simplest solution is to use the long double data type on the x86 architecture. On this architecture, long double is implemented as an 80-bit floating-point number in hardware, providing an additional 16-bit precision. Unfortunately, this technique is not portable. The long double type is only 64 bits on some architectures and compilers, while on others it is 128 bits and implemented in software. Some compilers also require rounding between operations to ensure consistency with other architectures. When using this technique, carefully consult your compiler's documentation for how it implements the long double type. The source code shown in the following listing represents a regular sum with the accumulator data type set to long double.

89 of 98

Parallel global sum: solving the associativity problem

Line 8 of the listing returns a double value to accommodate the idea of a higher-precision accumulator returning the same data type as the array. We'll see how this works later, but first, let's look at other methods for solving the global sum problem.

Pairwise summation is a surprisingly simple solution to the global sum problem, especially on a single processor. The source code is relatively simple, as shown in the listing below, but it requires an additional array half the size of the original.

90 of 98

Parallel global sum: solving the associativity problem

The simplicity of pairwise summation becomes slightly more complex when working with different processors. If the algorithm remains true to its basic structure, data exchange may be required at each step of the recursive sum.

Next comes Kahan summation. Kahan summation is the most practical of the possible global summation methods. It uses an additional double variable to perform the remainder of the operation, effectively doubling the effective precision. This technique was developed by William Kahan in 1965 (Kahan later became one of the key authors of the early IEEE floating-point standards). Kahan summation is most suitable for sliding summation when the accumulator is the larger of the two values. This technique is shown in the listing below.

91 of 98

Parallel global sum: solving the associativity problem

92 of 98

Parallel global sum: solving the associativity problem

Kahan summation takes about four floating-point operations instead of one. However, the data can be stored in registers or the L1 cache, making the operation less expensive than we might initially expect. Vectorized implementations can make the cost of the operation the same as standard summation. This is an example of an approach to exploiting the excess floating-point capabilities of a processor to obtain a higher-quality answer.

We will consider a vector implementation of Kahan summation in Section 6.3.4. Some new numerical methods attempt a similar approach, exploiting the excess floating-point capabilities of modern processors. They view the current machine budget of 50 flops per data load as an opportunity and implement higher-order methods that require more floating-point operations to utilize the unused floating-point resource, since it is essentially free.

The Knuth summation method manipulates additions in which any term can be larger. This technique was developed by Donald Knuth in 1969. It collects error for both members at the cost of seven floating point operations, as shown in the following listing.

93 of 98

Parallel global sum: solving the associativity problem

94 of 98

Parallel global sum: solving the associativity problem

The last technique, quadruple-precision sums, has the advantage of being easier to code, but since quadruple-precision types are almost always implemented in software, it's expensive. Furthermore, portability must always be considered, as not all compilers implement the quadruple-precision type. The source code for this is shown in the listing below.

95 of 98

Parallel global sum: solving the associativity problem

Now let's move on to evaluating the performance of these different approaches. Since half of the values are 1.0 e-1 and the other half are 1.0 e-10, we can obtain an exact answer against which comparisons can be made by multiplying rather than adding:

accurate_answer = ncells/2 * 1.0e-1 + ncells/2 * 1.0e-10

Table 5.1 shows the results of comparing the actual global sum values with the exact answer and measuring the execution time. Essentially, we achieve nine-digit precision with regular summation of double values. The long double type on a system with 80-bit floating-point representation improves the precision somewhat, but does not completely eliminate the error. Pairwise Kahane and Knuth summation reduces the error to zero with a modest increase in execution time. A vectorized implementation of Kahane and Knuth summation (shown in Section 6.3.4) eliminates the increase in execution time. However, considering the interprocessor communication and the cost of MPI calls, the increase in execution time is negligible.

Now that we understand the behavior of global sum techniques on a processor, we can consider a problem where arrays are distributed across multiple processors. To solve this problem, we need some understanding of MPI, so we'll show how to do this in Section 8.3.3, after covering the basics of MPI.

96 of 98

Parallel global sum: solving the associativity problem

Table 5.1 Precision and execution time results for different global summation techniques

97 of 98

Parallel global sum: solving the associativity problem

We've seen several characteristics of parallel algorithms, including those suitable for highly parallel architectures. Let's generalize them so we can look for them in other situations.

Locality is a term often used in describing good algorithms, but without any definition. It can have several meanings. Here are a couple:

- Cache locality – keeps values that will be used together close together to improve cache utilization;

- Operation locality – avoids working with all the data when not all of it is needed. A spatial hash for particle interactions is a classic example, keeping the algorithm complexity at O(N) instead of O(N 2 ).

Asynchrony – avoids interthread coordination, which can lead to synchronization.
Fewer conditional branches – in addition to the additional performance gained from conditional logic, thread divergence can be a problem in some architectures.
Reproducibility – Highly parallel techniques often violate the lack of associativity of finite-precision arithmetic. High-precision techniques help address this issue.
Higher arithmetic intensity – Modern architectures have added floating-point capabilities faster than memory bandwidth. Algorithms that increase arithmetic intensity can effectively exploit parallelism, such as vector operations.

98 of 98

Exercises

1. A cloud collision model in an ash plume is invoked for particles within 1 mm of each other. Write pseudocode to implement a spatial hash. In what order of complexity is this operation performed?

2. How are spatial hashes used by the postal service?

3. Big data uses a map-reduce algorithm to efficiently process large datasets. How does it differ from the hashing concepts presented here?

4. The wave simulation source code uses an AMR grid to further refine the coastline. The simulation requirement is to record wave heights over time for given locations where buoys and coastal structures are located. Since the cells are continuously refined, how could you implement it?