Affinity: A Truce with the Computing Core
Jumanazarov Mardonbek
Plan:
We first encountered affinity in Section 8.6.2 on MPI (Message Passing Interface), where we defined it and briefly demonstrated how to work with it. Below, we repeat our definition and also define the concept of process placement.
In this chapter, we will examine affinity (syn. closeness, kinship, similarity), placement, and ordering of threads or ranks in more detail. Concern about affinity is a recent phenomenon.
In the past, with only a few processor cores on a CPU, there wasn't much to gain. As core counts grow and compute node architectures become more complex, affinity becomes increasingly important. Still, the benefits are relatively modest; perhaps the biggest are reduced run-to-run performance variability and improved per-node scaling. Occasionally, affinity control helps avoid truly catastrophic scheduling decisions by the operating system kernel that hurt your application's performance.
The decision about where to place a process or thread is made by the operating system's kernel scheduler. Kernel scheduling has a rich history and was key to the development of multitasking, multi-user operating systems. It's thanks to this capability that you can launch a spreadsheet, switch briefly to a word processor, and then work on an important email. However, scheduling algorithms designed for the average user aren't always suitable for parallel computing. We can run four processes on a system with four processor cores, but the operating system schedules those four processes as it sees fit. It can place all four on a single processor or distribute them across four processors. Typically, the kernel does something reasonable, but it may interrupt one of the parallel processes to perform a system function, leaving all the other processes idle and waiting.
In Chapter 1, in Figures 1.20 and 1.21, we drew question marks where processes are placed because we don't control the placement of processes or threads on processors. At least, we haven't until now. Recent releases of MPI, OpenMP, and batch schedulers have begun to offer functionality for managing placement and affinity. While some interfaces have seen significant changes in their options, things seem to have settled down in recent releases. Still, we recommend checking the documentation for the releases you're using for any differences.
1. Why is affinity important?
Unlike most common desktop applications, parallel processes must be scheduled together. This scheduling is called gang scheduling.
DEFINITION: Gang scheduling is a kernel scheduling algorithm that activates a group of processes simultaneously.
Because parallel processes typically synchronize periodically during execution, there is no benefit in scheduling one thread only to have it wait on another, idle process. The kernel scheduling algorithm is unaware that one process depends on the execution of another. This is true for MPI, OpenMP, and GPU kernels alike. The best approximation to gang scheduling is to launch only as many processes as there are processors and bind those processes to processors. Remember, though, that kernel and system processes also need somewhere to run; some advanced techniques reserve a processor just for system processes.
Keeping every parallel process active and scheduled isn't enough. We also need processes to stay in the same Non-Uniform Memory Access (NUMA) domain to minimize memory access costs. In OpenMP, we typically go to a lot of trouble over the "first touch" of data arrays so that memory is allocated on the processor where the data will be accessed (see Section 7.1.1). If the operating system kernel then moves your process to a different NUMA domain, all that effort is wasted. In Section 7.3.1, we saw that the penalty for accessing memory in the wrong NUMA domain is typically a factor of two or more. The primary goal for our processes is to stay in the same memory domain.
In a typical situation, a NUMA domain corresponds to a socket on the node. If we can bind a process to a socket, we will always get the same optimal main memory access time. However, the need for NUMA-region affinity depends on the architecture of your CPU. Personal computing systems often have only a single NUMA region, while large HPC systems often have many more processor cores per node, with two CPU sockets and two or more NUMA regions.
Although pinning a process to a NUMA domain optimizes main memory access times, we may still see suboptimal performance due to poor cache utilization. A process fills the L1 and L2 caches with the memory it needs. But if it is then moved to another processor core in the same NUMA domain, with different L1 and L2 caches, cache performance suffers: the caches must be refilled. With frequent data reuse, this causes a performance loss. With MPI, we therefore want to pin processes or ranks to a processor core. With OpenMP this is trickier: because affinity is inherited by spawned threads, naive pinning results in all threads running on a single processor. With OpenMP, we want each thread bound to its own processor.
Many processors also have a feature called hyperthreading, which adds another layer of complexity to process placement. First, let's define it.
DEFINITION: Hyperthreading is an Intel technology that makes a single processor act like two virtual processors by sharing hardware resources between two threads in the operating system.
Hyper-threads share a single physical core and its caches. Because the cache is shared, moving between hyper-threads carries less of a penalty. However, it also means that each virtual core has half the cache of a real physical core if the processes don't share data. For memory-bound applications, halving the cache size can be a significant hit, so the effectiveness of these virtual cores is mixed. Many HPC sites disable them because some programs are slowed by hyper-threading. Not all hyper-thread implementations are equal, either in hardware or at the operating system level, so don't assume that because you saw no benefit on a previous system you won't see one on your current system. If we do use hyper-threading, we want placement to be close so that the shared cache benefits both virtual processors.
2. Getting to know your architecture
To effectively leverage affinity for performance gains, we need to understand the details of our hardware architecture. This task is complicated by the diversity of hardware architectures; Intel alone has over a thousand CPU models. In this section, we'll discuss how to understand your architecture. This understanding is a prerequisite for leveraging affinity.
The best way to get a better understanding of your architecture is with the lstopo utility. We first saw lstopo in Section 3.2.1 in Figure 3.2, with a printout for a Mac laptop. This laptop has a simple architecture with four physical processing cores, which, with hyperthreading enabled, appear as eight virtual cores to the operating system. In Figure 3.2, we also see that the L1 and L2 caches are private to each physical core, while the L3 cache is shared across all cores. We also note that there is only one NUMA domain. Now let's look at a more complex processor: Figure 14.1 shows the Intel Skylake Gold CPU architecture.
Figure 14.1 Intel Skylake Gold architecture with two NUMA domains and 88 processing cores reveals the complexity of higher-end compute nodes
The gray rectangles in Figure 14.1, each labeled as a core, represent physical cores. The two light-colored rectangles labeled PU (processing unit) inside each core are the virtual processors created by hyperthreading. The L1 and L2 caches are private to each physical core, while the L3 cache is shared across the NUMA domain. We also see that the network and other peripherals, on the right of the figure, are attached closer to the first NUMA domain. On most Linux or Unix systems we can obtain similar information with the lscpu command (Figure 14.2).
Fig. 14.2 Output from the lscpu command for an Intel Skylake Gold CPU
The lscpu output confirms that there are two threads per core and two NUMA domains. The processor numbering seems a bit odd: the first 22 processors are on the first NUMA node, the next 22 on the second node, and the hyperthreads are numbered last. Remember that "node" in the NUMA utilities differs from our usage, where a node is a separate distributed-memory system.
So, what's the affinity and process placement strategy for this architecture? Well, it all depends on the application. Each application has different scaling and threading performance needs that must be taken into account. We want processes to remain in their NUMA domains for optimal throughput to main memory.
3. Thread Affinity with OpenMP
Thread affinity is vital when optimizing applications with OpenMP; it is crucial for achieving good memory latency and throughput. We put a lot of effort into first touch, with the goal of placing memory close to the thread, as discussed in Section 7.1.1. If threads migrate to different processors, we lose all the benefit of that extra effort.
With OpenMP 4.0, the affinity controls were expanded to include the close, spread, and primary keywords in addition to the existing true and false options for OMP_PROC_BIND. Three values were also added for the OMP_PLACES environment variable: sockets, cores, and threads. Thus, we now have the following affinity and placement controls:
OMP_PLACES sets limits on where threads can be scheduled. There is actually one value not listed above: node. It is the default and allows each thread to be scheduled anywhere within the node "place." If there is more than one thread per place, there is a chance that the scheduler will move threads around, or that two or more threads will be scheduled on the same virtual processor. A reasonable approach is to run no more threads than there are slots at the specified place; an even better rule is to pick a place type that offers more slots than the number of threads you want. We'll demonstrate how this works with an example later in this section.
The OMP_PROC_BIND environment variable has five possible values, but they have some overlap in meaning. The values close, spread, and primary are special versions of true.
NOTE: We also note that the primary keyword replaces the deprecated master keyword in the OpenMP v5.1 standard. As compilers implement the new standard, you may continue to see the old usage.
When set to false, the kernel scheduler can move threads freely. When set to true, the kernel does not move a thread once it is scheduled; however, the thread can initially be scheduled anywhere within the place constraint, and that can vary from run to run. The primary value is a special case that schedules all threads on the same processor as the primary thread. The close value schedules threads near one another, while spread distributes them. The choice between these two has some subtle consequences, which you'll see in the example in this section.
NOTE: You can also specify placement using an itemized list. This use case is more advanced and will not be discussed here. An itemized list can provide finer control, but it is less portable to other CPU types.
OpenMP environment variables specify affinity and placement locations for the entire program. You can also set affinity for individual loops by adding an expression to the parallel directive. The specified expression has the following syntax:
proc_bind([primary|close|spread])
These affinity controls are shown in action in the following example, in our simple vector addition program from Section 7.3.1. You can also add affinity reporting routines to your source code to see their impact.
Example: Vector Addition with All Possible Settings of the OMP_PLACES and OMP_PROC_BIND Environment Variables
In this example, we try each combination of the OpenMP affinity and placement environment variables. First, we modify the vector addition from Section 7.3.1 to add a call to a procedure that reports thread placement, as shown in the following listing.
The main work is done in the place_report_omp subroutine. We use an ifdef around the call to easily enable and disable reporting. So, now let's take a look at the reporting routine in the listing below.
The CPU affinity bitmask must be converted to a more readable format for printing. This procedure is shown in the listing below.
In the placement reporting routine, we query OpenMP settings, report them, and then display the placement and affinity for each thread. To try it out, compile the source code with the verbose option and run it with 44 threads, or any number of threads that makes sense for your system, without special environment variable settings. The sample source code is located at https://github.com/EssentialsofParallelComputing/Chapter14.git in the OpenMP subdirectory.
Example: Querying OpenMP settings from the placement reporting routine
To query OpenMP settings, report them, and then display the placement and affinity for each thread, follow these steps:
mkdir build && cd build
cmake -DCMAKE_VERBOSE=on ..
make
export OMP_NUM_THREADS=44
./vecadd_opt3
Running this sequence of commands on Intel Skylake-Gold using GCC 9.3 produces the output shown below.
The printout shows the affinity and placement report without setting any environment variables. Threads can run on any processor from 0 to 87.
Kernel affinity allows a thread to run on any of the 88 virtual cores.
Let's see what happens when we place threads on hardware cores and set the affinity binding to close.
export OMP_PLACES=cores
export OMP_PROC_BIND=close
./vecadd_opt3
The result with these affinity and placement settings is shown in Figure 14.3.
Wow! We really do have control over the kernel scheduler! Threads are now bound to the two virtual cores belonging to a single hardware core. The execution time of 0.0166 ms is the last number in the printout. This is a significant improvement over the 0.0221 ms of the previous run, reducing the compute time by 25%. You can experiment with different values of the environment variables and observe how threads are placed on the node.
Figure 14.3 Affinity and placement report for OMP_PLACES=cores and OMP_PROC_BIND=close. Each thread can execute on two possible virtual cores; these belong to the same hardware core due to hyperthreading.
Next, we'll automate an exploratory analysis of all the settings and how they scale with different thread counts. We'll disable the verbose option to reduce the amount of output; only the execution time will be printed. Remove the previous build directory and rebuild the source code as follows:
mkdir build && cd build
cmake ..
make
We then execute the script in the following listing to get the performance for all cases.
Due to space limitations, Figure 14.4 shows only a few results. All values are speedups relative to a single thread with no affinity or placement settings.
The first thing to note from Figure 14.4 is that the program generally runs fastest, for all settings, with just 44 threads; hyperthreading doesn't help. The exception is the close binding with threads places: with 44 or fewer threads, this setting leaves no threads on the second socket, and having threads only on the first socket limits the total memory bandwidth obtained. With the full 88 threads, close with threads gives the best performance, albeit only slightly. The close value in general shows this effect of limited memory bandwidth from filling the first socket before the second. Furthermore, you can see that at higher thread counts, performance is better with affinity than without it.
Figure 14.4 OpenMP affinity and placement values for OMP_PROC_BIND=spread increase parallel scaling by 50%. The lines show the different thread counts for each value and are ordered roughly from high to low in the legend.
Let's note a few key points to take away from this analysis.
We don't show results for setting the OMP_PROC_BIND environment variable to primary, because it forces all threads onto the same processor and slows execution by a factor of two. We also don't show OMP_PLACES=sockets, because it yields lower performance than the results shown.
4. Affinity of processes with MPI
Using affinity with MPI applications also has benefits, as described in Section 14.2. It helps achieve full memory bandwidth and cache performance while preventing the operating system kernel from migrating processes to other processor cores. We focus this discussion of affinity on OpenMPI because it has the most widely available tools for affinity analysis and process placement. Other MPI implementations, such as MPICH, must be compiled with Slurm support enabled, which isn't applicable on personal computers. In Section 14.6, we'll cover command-line tools that can be used in more general situations. For now, let's continue our exploratory analysis of affinity in OpenMPI!
Default Process Placement with OpenMPI
Instead of leaving process placement to the kernel scheduler, OpenMPI specifies default placement and affinity. The defaults vary with the number of processes, as follows:
Some HPC centers may set other defaults, such as always binding to cores. That policy may make sense for most MPI jobs, but it can cause problems for applications that combine MPI with OpenMP threading: all the threads end up bound to a single processor, serializing execution.
Recent versions of OpenMPI have extensive support for process placement and affinity. Using these tools typically yields a performance boost. The gain depends on how well the operating system's process scheduler does without guidance. Most schedulers are tuned for general-purpose computing, such as word processing and spreadsheets, not for parallel applications. By coaxing the scheduler to "do the right thing," you can typically gain 5-10%, and sometimes much more.
Taking Control: Basic Techniques for Specifying Process Placement in OpenMPI
In most use cases, simple controls are sufficient for process placement and binding to hardware components. These controls are passed as options to the mpirun command. Let's start by distributing processes equally across the nodes of a multi-node job. This is easiest to demonstrate with an example.
Example: Distributing Processes Equally Across Multiple Nodes
We have an application that we want to run on 32 MPI ranks, but this application is memory-hungry, requiring half a terabyte of memory. A single node doesn't have enough memory. So how do we resolve this?
Looking at the system details, each node has two sockets populated with Intel Broadwell (E5-2695) processors. Each CPU has 18 hardware cores, which, with hyperthreading, gives us 36 virtual processors per socket. Each node has 128 gigabytes of memory.
NUMA node0 CPU(s):   0-17,36-53
NUMA node1 CPU(s):   18-35,54-71
MemTotal:            131728700 kB
In this example, we use our placement reporting tool for MPI applications. The two parts of the source code are shown in the following listings.
We need to insert a placement reporting call into our routine after MPI initialization. This can easily be added to your MPI application as well. Now let's look at the reporting routine in the following listing.
In the first execution of our application, we ask the mpirun command to simply start 32 processes:
mpirun -n 32 ./MPIAffinity | sort -n -k 4
We sort the result by the data in the fourth column (with the sort -n -k 4 command) because the order in which the processes print is effectively random. The output from this command with our placement reporting routine is shown in Figure 14.5.
Figure 14.5: For mpirun -n 32, all our processes are on node cn328. Affinity is assigned to the NUMA region (socket).
From the printout in Figure 14.5, we see that all ranks were running on node cn328. Referring to the default affinity values for OpenMPI at the beginning of this section, for more than two ranks, affinity is assigned to a socket. The output from the lscpu command shows that our first NUMA region contains virtual processor cores 0-17, 36-53. NUMA regions are typically aligned with each socket. In our printout, we see that the kernel affinity is 0-17, 36-53, confirming that affinity is assigned to socket.
Since the memory requirements in our real application exceed 128 GiB per node, the application fails when memory is allocated. Therefore, we need to find a way to distribute these processes. To do this, we add another option,
--npernode <#> (or -N <#>), which tells MPI how many ranks to place on each node. We need four nodes to get enough memory for our job, so we want eight processes per node.
mpirun -n 32 --npernode 8 ./MPIAffinity | sort -n -k 4
Figure 14.6 shows our allocation report.
Figure 14.6 MPI processes are distributed across four nodes, 328 through 331. Affinity is still tied to the NUMA region.
The result in Figure 14.6 shows that we are running on four nodes. We should now have enough memory to run the application. Alternatively, we could specify the number of ranks per socket using the --npersocket option. We have two sockets per node, so we need four ranks per socket, hence:
mpirun -n 32 --npersocket 4 ./MPIAffinity | sort -n -k 4
Figure 14.7 shows the placement result per socket. The placement report in Figure 14.7 shows that this rank order places adjacent ranks in the same NUMA domain instead of interleaving ranks across NUMA domains. This placement is better when ranks share data mostly with their nearest neighbors.
Fig. 14.7 When the allocation is set to four processes per socket, the rank order is reversed. Now four adjacent ranks are in the same NUMA region.
So far, we've only worked on process placement. Now let's try to see what we can do with the affinity and binding of MPI processes. To do this, we add the --bind-to [socket | numa | core | hwthread] option to the mpirun command:
mpirun -n 32 --npersocket 4 --bind-to core ./MPIAffinity | sort -n -k 4
In the placement report in Figure 14.8, we see how this changes the affinity of processes.
Fig. 14.8 Binding to core ties each process to a hardware core. Due to hyperthreading, each hardware core comprises two virtual cores, so each process gets two locations.
The placement results in Figure 14.8 show that process affinity is now more limited than before. Each process can run on two virtual cores, and those two virtual cores belong to a single hardware core, demonstrating that the core binding option is specific to the hardware core. Only four of the 18 processor cores on each socket are used. This is exactly what we want: more memory for each MPI rank. Let's try binding processes not to a core but to hyperthreads, using the hwthread option. This should force the scheduler to place each process on one and only one virtual core.
mpirun -n 32 --npersocket 4 --bind-to hwthread ./MPIAffinity | sort -n -k 4
Again, we use our placement reporting program to visualize the placement, and its results are shown in Figure 14.9.
Figure 14.9 Process placement via the hwthread option limits the locations in which processes can run, restricting them to only one location.
Our final layout at last limits each process to a single location, as shown in Figure 14.9. At first glance, this looks good. On closer inspection, though, we see that the first two ranks land on a pair of hyperthreads (0 and 36) of a single hardware core. That is not a good idea: the two ranks share the cache and hardware resources of that core instead of each having a full set of resources.
The OpenMPI mpirun command also has a built-in option for reporting binding information. This is convenient for small problems, but the resulting output volume for nodes with a large number of processors and MPI ranks is so large that it is difficult to manage. Adding the --report-bindings option to the mpirun command used for Figure 14.9 produces the result shown in Figure 14.10.
The visual diagram is a bit easier to understand, and the output contains a lot of information. Each line indicates a rank in MPI_COMM_WORLD (MCW). The symbols between the forward slashes on the right side indicate the affinity location for that process. A set of two dots between the forward slashes indicates two hyperthreads per core. Two sets of parentheses outline two sockets on the node.
The examples we've covered in this section should give you a good idea of how to monitor placement and affinity. You should also have some tools to verify that you're getting the expected placement and affinity of processes.
Fig. 14.10 The placement report from the --report-bindings option in the mpirun command shows the places where ranks are bound with the letter B
Affinity is More Than Just Process Binding: The Big Picture
Now we'll explore the full picture of affinity for parallel computing. We'll use this as a way to introduce advanced options offered in OpenMPI for even greater control.
The concept of affinity stems from the operating system's view of the hardware. At the operating-system level, you can specify where each process is allowed to run. In Linux, this is done with either the taskset command or the numactl command. These commands, and similar utilities in other operating systems, emerged as CPUs grew sophisticated enough to warrant giving the scheduler more guidance. The settings can be interpreted by the scheduler as hints or as requirements. Using these commands, you can bind a server process to a specific processor, either to be closer to a particular hardware component or to get faster response times. This process-centric view of affinity is sufficient when dealing with a single process.
Parallel programming requires additional considerations. We must consider the set of processes. Let's say we have 16 processors and are running a four-rank MPI job.
Where do we place the ranks? Do we pack them close together on one socket or spread them evenly across all sockets? Do we place certain ranks near each other (ranks 1 and 2 together, or ranks 1 and 4 together)? To answer these questions, we need to address the following aspects:
We'll look at each of these in turn, and how OpenMPI lets you control them.
Mapping Processes to Processors or Other Locations
When we think about a parallel application, we have a set of processes and a set of processors. How do we map processes to processors? In the example given in Section 14.4.2, we wanted to distribute processes across four nodes so that each process would have more memory than if all were on a single node. A more general form of process mapping in OpenMPI is the --map-by <hwresource> option, where <hwresource> is any of a large number of hardware components. The most common include the following:
--map-by [slot | hwthread | core | socket | numa | node]
The --map-by option to the mpirun command maps processes to this hardware resource in a round-robin fashion. The default for this option is socket. Most of these hardware locations are self-explanatory, with the exception of slot. Slots are a list of possible process locations taken from the environment, the scheduler, or a host file. This form of the --map-by option is still limited in expressiveness and therefore in effect.
A more general form uses the ppr option, processes per resource, where n is the number of processes per resource. Instead of mapping round-robin per resource, you can assign a block of n processes to each hardware resource:
--map-by ppr:n:hwresource
Or more explicitly:
--map-by ppr:n:[slot | hwthread | core | socket | numa | node]
In our previous examples, we used the simpler --npernode 8 option. In this more general form, it would be shorthand for:
--map-by ppr:8:node
If the level of control provided by the previous options for the mpirun command is insufficient, you can specify a list of processor numbers to map with the --cpu-list <logical processor numbers> option, where processor numbers is a list corresponding to the list from lstopo or lscpu. This option also simultaneously maps processes to a logical (virtual) processor.
MPI Rank Ordering
Another thing you might want to control is the order of your MPI ranks. You might want adjacent MPI ranks to be close to each other in physical processor space if they communicate a lot with each other. This reduces the cost of communicating between such ranks. Usually, controlling this using the allocation block size during mapping is sufficient, but additional control can be achieved with the --rank-by option:
--rank-by ppr:n:[slot | hwthread | core | socket | numa | node]
An even more general option is to use a rank file:
--rankfile <filename>
While you can fine-tune the placement of your MPI ranks with these commands and perhaps improve performance by a couple of percent, finding the optimal formula is quite difficult.
Binding Processes to Hardware
The last thing to control is the binding itself: affinity ties a process to a hardware resource. The option is similar to the previous ones:
--bind-to [slot | hwthread | core | socket | numa | node]
The default value of core is sufficient for most MPI applications (without the --bind-to option, socket is used by default for more than two processes, as described in Section 14.4.1). However, there are cases where this affinity value can cause problems.
As we saw in the example in Figure 14.8, core binding assigns a process to the two hyperthreads of a hardware core. We might try --map-by core --bind-to hwthread to distribute processes across cores but bind each process more tightly to a single hyperthread. The performance difference from such fine-grained control is likely small. A larger issue arises with hybrid MPI plus OpenMP applications. It's important to understand that child processes and threads inherit affinity settings from their parents. If we use the --npersocket 4 --bind-to core options and then launch two threads, we have two locations for thread execution (two hyperthreads per core), so everything is fine. If we launch four threads, they will share just two logical processors, and performance will be limited. As this section has shown, there are many options for controlling process placement and affinity; indeed, there are too many combinations to explore them fully, as we did in Section 14.3 for OpenMP. In most cases, we should be satisfied with reasonable settings that reflect the needs of our application.
5. Affinity for MPI plus OpenMP
In this section, our goal is to understand how to specify affinity for hybrid MPI and OpenMP applications. Obtaining the correct affinity for these hybrid situations can be challenging. For this exploratory analysis, we created a hybrid stream triad example with MPI and OpenMP. We also modified the placement report used in this chapter to output information for hybrid MPI and OpenMP applications. The following listing shows the modified place_report_mpi_omp.c routine.
We begin this example by compiling the stream triad application. The stream triad source code is located at https://github.com/EssentialsofParallelComputing/Chapter14 in the StreamTriad directory. Compile this source code with:
mkdir build && cd build
cmake -DCMAKE_VERBOSE=1 ..
make
We ran this code on our Skylake Gold CPU with 44 hardware cores and two hyperthreads each. We placed two OpenMP threads on the hyperthreads, and then an MPI rank on each hardware core. The following commands achieve this scheme:
export OMP_NUM_THREADS=2
mpirun -n 44 --map-by socket ./StreamTriad
The stream triad source code contains a call to our placement report from Listing 14.2. Figure 14.11 shows the printout.
Figure 14.11. MPI ranks are placed in a round-robin pattern across the two sockets, leaving room for two OpenMP threads per rank. Placement is constrained to a NUMA domain to keep memory close to the threads. Processes are not tightly bound to any specific virtual core, and the scheduler can move them freely within the NUMA domain.
As can be seen from the printout in Figure 14.11, we were able to distribute ranks across NUMA domains in a round-robin pattern, keeping the two threads of each rank together. This should ensure good main memory bandwidth. The affinity constraints are only tight enough to keep processes within the NUMA domain, allowing the scheduler to move processes at its discretion. The scheduler can place thread 0 on any of 44 virtual processors, namely 0-21 and 44-65. The numbering can be confusing: 0 and 44 are the two hyperthreads of a single physical core.
Now let's tighten the affinity constraints. To do this, we use the form --map-by ppr:N:socket:PE=N. This option lets us specify how many MPI ranks to place on each socket and how far apart to space them. The option is complex, so let's work through it piece by piece.
Let's start with the ppr:N:socket part. We want half of our MPI ranks on each socket, which means 22 MPI ranks per socket, or ppr:22:socket. The PE=N part specifies the number of processing elements to reserve between process placements. We need two threads for each MPI rank, so we need two virtual processors in each block. The PE specification is in units of hardware cores, and it's important to know that each hardware core contains two virtual processors. Therefore, we only need one hardware core (PE=1). Finally, we bind the threads to hardware threads. For rank 0, we should get the first hardware core, with virtual processors 0 and 44. This gives us the following commands:
export OMP_NUM_THREADS=2
export OMP_PROC_BIND=true
mpirun -n 44 --map-by ppr:22:socket:PE=1 ./StreamTriad
Phew! That was tricky. Did we do everything right? Okay, let's check the command output, as shown in Figure 14.12.
Figure 14.12. Process and thread affinities are now bound at the level of individual logical cores: the two OpenMP threads of each rank land on hyperthread pairs (0 and 44 in the figure). The ranks are packed tightly to reduce communication costs in more complex programs. MPI ranks are bound to hardware cores, and each thread is bound to a hyperthread.
From the printout in Figure 14.12, we can see that the threads are bound where we want them. We also have each MPI rank bound to a hardware core. This can be verified by unsetting the OMP_PROC_BIND environment variable; the result (Figure 14.13) confirms that each rank is bound to the two logical processors that make up a single hardware core.
Figure 14.13. The printout without OMP_PROC_BIND=true shows that the MPI ranks are bound to hardware cores.
We've worked through one case and obtained the affinity values we wanted. But now you may want to know whether we can run more than two OpenMP threads, and what the program's performance is. Let's look at a set of commands that tests every thread count that evenly divides the number of processors. The following listing shows the key script commands.
To make this script portable, we obtain the hardware characteristics using the lscpu command. Then, we set the necessary OpenMP environment parameters. We could have set OMP_PROC_BIND to true, close, or spread, with the same result for this case where all slots are full. We then calculate the variables needed by the mpirun command and run the job.
In the full stream triad example in Listing 14.2, we tested every combination of thread count and MPI rank count that evenly divides the 88 processors. We followed this with runs on 44 processors, skipping the hyperthreads, because we didn't actually get any improvement from them (Section 14.3). The performance results are fairly consistent across the entire test suite. This is because only main memory bandwidth is being measured: there is little computation and no communication over MPI. The benefits of hybrid MPI plus OpenMP are limited in this situation. We would expect to see benefits in much larger simulations, where replacing MPI ranks with OpenMP threads reduces memory and communication overhead.
6. Affinity control from the command line
Additionally, there are general approaches to controlling affinity from the command line. Command-line tools help in situations where your MPI or specialized parallel application lacks built-in affinity control options. They also assist general-purpose applications by binding them to important hardware components, such as graphics cards, network interfaces, and storage devices. In this section, we'll look at two command-line tool suites: hwloc and likwid. Both are designed with high-performance computing in mind.
Using the hwloc-bind Tool for Affinity Assignment
The hwloc project was developed by INRIA, the French National Institute for Research in Computer Science and Automation. hwloc is a subproject of the Open MPI project and implements the affinity placement and binding capabilities of Open MPI that we discussed in Sections 14.4 and 14.5. It is also distributed as a standalone package with command-line tools. Since there are numerous hwloc tools, we'll just cover a couple of them as an introduction. We'll use hwloc-calc to list the hardware cores and hwloc-bind to bind to them.
Using the hwloc-bind tool is very simple: prefix your application with hwloc-bind, followed by the hardware location where you want it bound. As our test application, we'll use the lstopo command, which is also part of the hwloc tools. Here's a one-liner that runs a task on every hardware core and binds each process to its core:
for core in `hwloc-calc --intersect core --sep " " all`; do
    hwloc-bind core:${core} lstopo --no-io --pid 0 &
done
The --intersect core option restricts the list to hardware cores. The --sep " " option tells hwloc-calc to separate the numbers with spaces instead of commas. Running this command on our Skylake Gold processor launches 44 graphical lstopo windows, each looking similar to the one shown in Figure 14.14. In each window, the bound locations are highlighted in green.
Figure 14.14. In the lstopo image, the green bound location (shaded core) is shown in the lower left corner. This shows that process 22 is bound to the 22nd and 66th virtual cores, which are hyperthreads for the same physical core.
We could use a similar command to start two processes on the first core of each socket. For example:
for socket in `hwloc-calc --intersect socket --sep " " all`; do
    hwloc-bind socket:${socket}.core:0 lstopo --no-io --pid 0 &
done
The following listing shows how the general-purpose mpirun command is constructed with binding.
Now we can run our MPI affinity application from Section 14.4 on the first core of each socket using the following command:
./mpirun_distrib.sh "1 22" ./MPIAffinity
This mpirun_distrib script builds and executes the following command:
mpirun -np 1 hwloc-bind core:1 ./MPIAffinity : \
       -np 1 hwloc-bind core:22 ./MPIAffinity
Using likwid-pin: The Affinity Tool in the likwid Tool Suite
The likwid-pin tool is one of many great tools from the likwid ("like I knew what I'm doing") team at the University of Erlangen. We encountered our first likwid tool, likwid-perfctr, in Section 3.3.1. The likwid tools covered in this section are command-line tools for setting affinity. We'll look at the tool options for OpenMP, MPI, and hybrid MPI plus OpenMP applications. The basic syntax for selecting processor sets in likwid uses the numbering domains N (node), S (socket), C (last-level cache group), and M (NUMA memory domain).
Affinity is then set with -c <domain>:[n1,n2,n3-n4], for example -c S0:0-21 for the first 22 processors on socket 0. To list the numbering schemes on your system, run likwid-pin -p. Understanding how likwid-pin works is best gained through examples and experimentation.
Pinning OpenMP Threads with likwid-pin
This example shows how to pin threads in an OpenMP application. For comparison, first recall the pure OpenMP approach from Section 14.3 using environment variables:
export OMP_NUM_THREADS=44
export OMP_PROC_BIND=spread
export OMP_PLACES=threads
./vecadd_opt3
To achieve the same pinning result with likwid-pin, we use the socket (S) domain. We allocate 22 threads on each socket, with the two pin sets concatenated using the @ symbol:
likwid-pin -c S0:0-21@S1:0-21 ./vecadd_opt3
The OMP environment variables are not necessary when using likwid-pin and are generally ignored. The number of threads is determined from the pin set lists; for this command, it is 44. We ran the vecadd example from Section 14.3, configured with the -DCMAKE_VERBOSE=1 option, to obtain the placement report shown in Figure 14.15.
Figure 14.15. The likwid-pin output is at the top of the figure, followed by a printout of our allocation report. The output shows that threads are mapped to 44 physical cores.
Our placement report shows that the OMP environment variables are not set and that OpenMP itself did not place or pin any threads. Nevertheless, we obtain the same placement and pinning through the likwid-pin tool, with identical results. This confirms that the OMP environment variables are not necessary for likwid-pin, as stated in the previous paragraph. Note that if you set the OMP_NUM_THREADS environment variable to a value different from the number of threads in the pin sets, likwid distributes OMP_NUM_THREADS threads across the processors specified in the pin sets. When there are more threads than processors, the tool wraps the thread placement around the available processors.
MPI Rank Pinning with likwid-mpirun
likwid's pinning functionality for MPI applications is included in the likwid-mpirun tool. This tool can be used as a replacement for mpirun in most MPI implementations. Let's look at the MPIAffinity example from Section 14.4.
Example: Pinning MPI Ranks with the likwid-mpirun Tool
Run the MPIAffinity example on 44 ranks and use the likwid-mpirun command to assign the ranks to hardware cores. By default, likwid-mpirun assigns ranks to cores, so we get what we typically want without any additional options:
likwid-mpirun -n 44 ./MPIAffinity |sort -n -k 4
Figure 14.16 shows the results of our placement report for this example.
Figure 14.16. The placement report for likwid-mpirun shows that each rank is assigned to cores in numerical order.
That was easy! As shown in Figure 14.16, the likwid-mpirun tool assigns ranks to hardware cores. Let's move on to an example where we need to supply the command with some options.
Example: Options for MPI Rank Pinning with the likwid-mpirun Tool
We start with the basic command:
likwid-mpirun -n 22 ./MPIAffinity |sort -n -k 4
The ranks are placed on the first 22 hardware cores, all on socket 0 and none on socket 1. We previously demonstrated that you need to distribute processes across both sockets to get the full bandwidth from main memory. Adding the -nperdomain option lets us specify how many ranks to place in each domain; the pin set S:11 places 11 ranks on each socket. The command now looks like this:
likwid-mpirun -n 22 -nperdomain S:11 ./MPIAffinity |sort -n -k 4
7. Future: Setting and changing affinity at runtime
What if the user didn't need to worry about affinity at all? It's difficult to expect users to master the complex commands needed for proper process placement and pinning. In many cases, it makes more sense to embed the placement logic in the executable itself. One approach is to query the hardware information at startup and set the affinity accordingly. Not many applications use this approach yet, but we expect more to do so in the future.
Some applications not only set their affinity at runtime but also modify it to adapt to changing runtime characteristics! This innovative technique was developed by Sam Gutiérrez of Los Alamos National Laboratory in his QUO library. Perhaps you have an application that uses all MPI ranks on a node, but it calls a library that uses a combination of MPI ranks and OpenMP threads. The QUO library provides a simple interface, built on top of hwloc, for setting the appropriate affinities: it can push new settings onto a stack, quiesce some of the processes, apply a new affinity policy, and later pop the stack to restore the previous settings. We'll look at examples of setting process affinity within your application and changing it at runtime in the following sections.
Setting Affinity in the Executable
Setting process placement and affinity in your application means you no longer have to deal with confusing mpirun commands or portability between MPI implementations. Here, we use the QUO library to implement this affinity for all Skylake Gold processor cores. The open-source QUO library is available at https://github.com/LANL/libquo.git. First, we create the executable in the Quo directory and run the application with the number of hardware cores on your system:
make autobind
mpirun -n 44 ./autobind
The autobind source code is shown in Listing 14.5. The program consists of the following steps, with our placement reporting routine called before and after to show the process affinities:
1. Initialize QUO.
2. Set affinities to hardware cores.
3. Distribute processes and bind them to cores.
4. Restore initial affinities.
Don't forget to synchronize processes when changing bindings. To ensure synchronization, in the following listing, we use an MPI barrier and a microsleep call in the SyncIT subroutine.
The output from the autobind application (Figure 14.17) clearly shows the bindings changing from sockets to hardware cores.
Changing Process Affinities at Runtime
Suppose we have an application where one part wants to use all MPI ranks, while another works best with OpenMP threads. To resolve this situation, we need to switch affinities at runtime. This is exactly the scenario for which the QUO library is designed! The steps are as follows:
1. Initialize QUO.
2. Assign process affinities to cores for the MPI region.
3. Expand affinities across the entire node for the OpenMP region.
4. Revert to the MPI settings.
Figure 14.17. The output of the autobind demo program shows that the processes are initially bound to sockets and subsequently bound to hardware cores.
In the following listing, let's see how this is done using QUO.
The dynaffinity application can be run with the number of hardware cores in our system using the following commands:
make dynaffinity
mpirun -n 44 ./dynaffinity
We again use our reporting routines to check the process affinities for the MPI and OpenMP regions. Figure 14.18 shows the output.
The output in Figure 14.18 shows that the process affinities changed between the MPI and OpenMP regions, resulting in dynamic affinity changes during execution.
Figure 14.18. For the MPI region, processes are bound to hardware cores. When we enter the OpenMP region, the affinities are extended to the entire node.
8. Exercises
1. Create a visual representation of a pair of different hardware architectures and label the hardware characteristics of each device.
2. Run the benchmark suite on your hardware using the script in Listing 14.1. What did you learn about how to best utilize your system?
3. Modify the program used in the vector addition example (vecadd_opt3.c) in Section 14.3 to include more floating-point operations. Take the compute kernel and change the operations in the loop to the Pythagorean formula:
c[i] = sqrt(a[i]*a[i] + b[i]*b[i]);
How have your results and conclusions about the best placement and bindings changed? Do you now see any benefit from hyperthreading (if you have it)?
4. In the MPI example from Section 14.4, include the vector addition compute kernel and generate a scaling plot for the kernel. Then modify the kernel with the Pythagorean formula used in Exercise 3.
5. Combine vector addition and the Pythagorean formula into a single procedure (either in one loop or in two separate loops) to allow for greater data reuse:
c[i] = a[i] + b[i];
d[i] = sqrt(a[i]*a[i] + b[i]*b[i]);
How does this change the results of the placement and affinity exploration?
6. Add source code to assign placement and affinity within the application from one of the previous exercises.