1 of 69

Vectorization: flops for free

Jumanazarov Mardonbek

2 of 69

Plan:

  1. Vectorization and SIMD (Single Instruction, Multiple Data) Overview
  2. Hardware Vectorization Trends
  3. Vectorization Techniques
  4. Programming Style for Better Vectorization
  5. Vectorization-Related Compiler Flags for Various Compilers
  6. OpenMP SIMD Directives for Better Portability
  7. Exercises

3 of 69

Processors have dedicated vector processing units that can load and process more than one data element at a time. For applications dominated by floating-point operations, vectorization is essential to get the most out of the hardware. Vectorization is the process of grouping operations so that multiple operations are performed at once. However, adding more flops to the hardware's capabilities is of limited benefit when the application is memory-bound, and it's important to remember that most applications are memory-bound in some way. Compilers can be powerful, but as you'll see, achieving real performance gains from vectorization is often not as straightforward as the compiler documentation suggests. Still, performance gains from vectorization can be had with minimal effort and shouldn't be ignored.

In this chapter, we'll show how programmers, with a little effort and knowledge, can improve performance through vectorization. Some of these techniques simply require the right compiler flags and programming styles, while others require much more work. Real-world examples demonstrate various ways to achieve vectorization.

NOTE: We recommend referencing the examples in this chapter at https://github.com/EssentialsofParallelComputing/Chapter6.

4 of 69

Vectorization and SIMD (Single Instruction, Multiple Data) Overview

In Section 1.4, we introduced the single-instruction, multiple-data (SIMD) architecture as one component of Flynn's taxonomy, which classifies parallel architectures by their instruction and data streams. In SIMD, as the name suggests, a single instruction is executed across multiple data streams. A single vector add instruction replaces eight separate scalar add instructions in the instruction queue, reducing pressure on the instruction queue and cache. The biggest benefit is that performing eight additions in the vector processing unit requires approximately the same processing power as a single scalar addition. Figure 6.1 shows a vector unit with a 512-bit vector width, giving a vector length of eight double-precision values.

Fig. 6.1 A scalar operation performs one double-precision addition in one cycle. Processing a 64-byte cache line requires eight cycles. By comparison, a vector operation on a 512-bit vector unit can process all eight double-precision values in one cycle.
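To make Figure 6.1 concrete, a loop of this shape is what the compiler turns into vector adds; with a 512-bit vector unit, each group of eight iterations becomes a single vector instruction (a minimal sketch):

double a[1024], b[1024], c[1024];
for (int i = 0; i < 1024; i++){
   c[i] = a[i] + b[i];   // eight of these per AVX512 vector add
}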

5 of 69

Vectorization and SIMD (Single Instruction, Multiple Data) Overview

Let's briefly summarize the terminology of vectorization:

vector (SIMD) lane – the route through a vector operation on vector registers for a single data element, much like a traffic lane on a multi-lane highway;

vector width – the width of the vector unit, usually expressed in bits;

vector length – the number of data elements that can be processed by the vector unit in a single operation;

vector (SIMD) instruction sets – sets of instructions that extend a scalar processor's regular instructions to operate on the vector unit.

Vectorization is performed through both software and hardware. The requirements are as follows:

instruction generation – vector instructions must be generated by the compiler or specified manually through compiler intrinsics[1] or assembly coding;

matching instructions to the processor's vector unit – if there is a mismatch between instructions and hardware, newer hardware can usually still process the instructions, but older hardware will simply fail. (AVX instructions don't run on ten-year-old chips. Sorry!)

6 of 69

Vectorization and SIMD (Single Instruction, Multiple Data Elements) Overview

There's no magic process that converts regular scalar instructions into vector instructions on the fly. If you're using an older compiler version, as many programmers do, it won't be able to generate instructions for the latest hardware. Unfortunately, it takes time for compiler authors to incorporate new hardware features and instruction sets, and more time still to optimize for them.

Conclusion: when using the latest processors, use the latest compiler versions.

You should also specify the appropriate vector instruction set to generate. By default, most compilers take the safe route and generate SSE2 (Streaming SIMD Extensions 2) instructions so that the code runs on any hardware. SSE2 instructions perform only two double-precision operations at a time, instead of the four or eight that more modern processors can perform. For high-performance applications, there are better options.

You can compile for the architecture you're running on, or for any architecture manufactured in the last 5 or 10 years. Specifying AVX (Advanced Vector Extensions) instructions produces 256-bit vector instructions that work on most hardware made since 2011.

You can ask the compiler to generate more than one vector instruction set. It will then use the best one for the hardware being used.

Conclusion: specify the most advanced vector instruction set that your target hardware can sensibly support in your compiler flags.

7 of 69

Vectorization and SIMD (Single Instruction, Multiple Data Elements) Overview

[1] Compiler intrinsics are similar to traditional library functions, except that they are built into the compiler. They are either faster than traditional library functions (the compiler knows more about them and optimizes them better) or handle a smaller range of input data. Intrinsics also provide processor-specific functionality and can therefore serve as an intermediate step between standard C and machine-level assembly language.

8 of 69

Hardware trends in vectorization

Fig. 6.2 Vector units in commodity processors first appeared around 1997 and over the past 20 years have slowly grown in both vector width (size) and the types of operations supported.

9 of 69

Hardware trends in vectorization

Table 6.1 Vector hardware releases over the past decade have significantly improved vector functionality

MMX (a trademark with no official meaning) – targeted the graphics market, but graphics processors (GPUs) soon took over that function, and vector units shifted from graphics to computation. AMD released its own version, called 3DNow!, with single-precision support.

SSE (Streaming SIMD Extensions) – Intel's first vector unit to offer single-precision floating-point operations.

SSE2 – added double-precision support.

AVX (Advanced Vector Extensions) – doubled the vector length. AMD added the FMA (fused multiply-add) vector instruction to its competing hardware, effectively doubling the performance of some loops.

AVX2 – Intel added FMA to its vector processor.

AVX512 – first introduced on the Knights Landing processor, it appeared in the mainline multi-core processor lineup in 2017. From 2018 onward, Intel and AMD (Advanced Micro Devices, Inc.) created several variants of AVX512 as incremental enhancements to their vector hardware architectures.

10 of 69

Vectorization methods

There are several ways to achieve vectorization in your program. In order of increasing programmer effort, they include:

optimized libraries;

automatic vectorization;

compiler hints;

compiler intrinsic vector functions;

machine-dependent assembly instructions.

OPTIMIZED LIBRARIES PROVIDE PERFORMANCE WITH LOW EFFORT

To minimize vectorization effort, programmers should explore the available libraries that can be used for their application. Many low-level libraries provide highly optimized routines for performance-minded programmers. Some of the most commonly used libraries are:

BLAS (Basic Linear Algebra Subprograms) – the core component of high-performance linear algebra software;

LAPACK – Linear Algebra PACKage;

SCALAPACK – Scalable Linear Algebra PACKage;

FFT (Fast Fourier Transform) – available in various implementation packages;

Sparse solvers – various implementations are available.

11 of 69

Vectorization methods

The Intel® Math Kernel Library (MKL) implements optimized versions of BLAS, LAPACK, SCALAPACK, FFT, sparse solvers, and math functions for Intel processors. While this library is available in some commercial Intel packages, it is also offered free of charge. Many other library developers produce packages for a variety of purposes. Additionally, hardware vendors provide libraries optimized for their own hardware under various licensing agreements.

AUTOMATIC VECTORIZATION: A SIMPLE WAY TO ACCELERATE VECTORIZATION (IN MOST CASES)

Automatic vectorization[1] is the recommended choice for most programmers because it requires the least programming effort. However, compilers cannot always recognize where vectorization can be safely applied. In this section, we first consider source code that a compiler can vectorize automatically. We'll then show how to confirm that you're actually getting the vectorization you expect. You'll also learn about programming styles that help the compiler vectorize code and perform other optimizations. These include using the restrict keyword in C and the __restrict or __restrict__ attributes in C++.

With the continued improvement of architectures and compilers, automatic vectorization can provide significant performance gains. The right compiler flags and programming style can further improve performance.

DEFINITION: Automatic vectorization is the vectorization of standard C, C++, or Fortran source code by the compiler.

12 of 69

Vectorization methods

[1] It's important to note that while automatic vectorization often yields significant performance improvements, it sometimes slows down the original code, because the overhead of setting up the vector instructions exceeds the performance gain. The compiler typically makes vectorization decisions using a cost function: it vectorizes if the cost function indicates that the code will be faster, but it can only guess the array length and assumes that all data comes from the first level of cache.

EXAMPLE: AUTOMATIC VECTORIZATION

Let's look at automatic vectorization in a simple loop from the stream triad[1] of the STREAM Benchmark application, presented in Section 3.2.4. In the listing below, we extract the triad code from the STREAM Benchmark application into a standalone test program.

[1] For reference, a stream triad is a three-operand compute kernel with a multiplication and an addition, of the form a(i) = b(i) + q × c(i), within the STREAM Benchmark application.

13 of 69

Vectorization methods
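A minimal sketch of the standalone stream triad test (our reconstruction; NTIMES, STREAM_ARRAY_SIZE, and the cpu_timer_* calls match the function version shown later in this section):

#include <stdio.h>
#include <sys/time.h>
#include "timer.h"

#define NTIMES 16
#define STREAM_ARRAY_SIZE 80000000
static double a[STREAM_ARRAY_SIZE], b[STREAM_ARRAY_SIZE], c[STREAM_ARRAY_SIZE];

int main(int argc, char *argv[]){
   struct timeval tstart;
   double scalar = 3.0, time_sum = 0.0;
   for (int i=0; i<STREAM_ARRAY_SIZE; i++){   // initialization loop
      a[i] = 1.0;
      b[i] = 2.0;
   }
   for (int k=0; k<NTIMES; k++){
      cpu_timer_start(&tstart);
      for (int i=0; i<STREAM_ARRAY_SIZE; i++){   // the stream triad loop
         a[i] = b[i] + scalar*c[i];
      }
      time_sum += cpu_timer_stop(tstart);
      c[1] = c[2];   // keeps the compiler from optimizing out the loop
   }
   printf("Average runtime is %lf msecs\n", time_sum/NTIMES);
}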

14 of 69

Vectorization methods

We discuss compiler flags in more detail in Section 6.4, and the timer.c and timer.h files in Section 17.2. Compiling stream_triad.c with GCC version 8 returns the following compiler response:

stream_triad.c:19:7: note: loop vectorized

stream_triad.c:12:4: note: loop vectorized

GCC vectorizes both the initialization loop and the stream triad loop! You can run the stream triad with

./stream_triad

You can confirm that the compiler uses vector instructions using the likwid tool (Section 3.3.1).

likwid-perfctr -C 0 -f -g MEM_DP ./stream_triad

Look for the following lines in the output of this command:

| FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE | PMC0 | 0 |

| FP_ARITH_INST_RETIRED_SCALAR_DOUBLE | PMC1 | 98 |

| FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE | PMC2 | 640000000 |

| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE | PMC3 | 0 |

15 of 69

Vectorization methods

In the printout, you can see that the majority of the operation counts fall into the 256B_PACKED_DOUBLE category on the third line. Why all these 256-bit operations? Some versions of the GCC compiler, including version 8.2 used in this test, generate 256-bit vector instructions for the Skylake processor instead of 512-bit ones. Without a tool like likwid, we would have to carefully check the vectorization reports or inspect the generated assembly instructions only to discover that the compiler didn't generate the proper instructions. For the GCC compiler, we can modify the generated instructions by adding the -mprefer-vector-width=512 compiler flag and then try again. Now we get AVX512 instructions with eight double-precision values evaluated simultaneously:

| FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE | PMC2 | 0 |

| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE | PMC3 | 320000000 |

EXAMPLE: AUTOMATIC VECTORIZATION IN A FUNCTION

In this example, we'll try a slightly more complex version of the code from the previous example, in which the stream triad loop is located in a separate function. The following listing shows how to do this.

16 of 69

Vectorization methods

Stream triad loop in a separate function

autovec_function/stream_triad.c

 1 #include <stdio.h>
 2 #include <sys/time.h>
 3 #include "timer.h"
 4
 5 #define NTIMES 16
 6 #define STREAM_ARRAY_SIZE 80000000
 7 static double a[STREAM_ARRAY_SIZE], b[STREAM_ARRAY_SIZE], c[STREAM_ARRAY_SIZE];
 8
 9 void stream_triad(double* a, double* b, double* c, double scalar){   // stream triad loop in a separate function
10    for (int i=0; i<STREAM_ARRAY_SIZE; i++){
11       a[i] = b[i] + scalar*c[i];
12    }
13 }
14
15 int main(int argc, char *argv[]){
16    struct timeval tstart;
17    double scalar = 3.0, time_sum = 0.0;
18    for (int i=0; i<STREAM_ARRAY_SIZE; i++) {
19       a[i] = 1.0;
20       b[i] = 2.0;
21    }
22    for (int k=0; k<NTIMES; k++){
23       cpu_timer_start(&tstart);
24       stream_triad(a, b, c, scalar);   // calls the stream triad function
25       time_sum += cpu_timer_stop(tstart);
26       // keeps the compiler from optimizing out the loop
27       c[1] = c[2];
28    }
29    printf("Average runtime is %lf msecs\n", time_sum/NTIMES);
30 }

17 of 69

Vectorization methods

Let's look at the GCC compiler output for the code in the stream triad listing above:

stream_triad.c:10:4: note: loop vectorized

stream_triad.c:10:4: note: loop versioned for vectorization because of possible aliasing

stream_triad.c:10:4: note: loop vectorized

stream_triad.c:18:4: note: loop vectorized

The compiler can't tell for sure whether a function's arguments point to the same or overlapping data. This causes it to create more than one version of the function, plus code that tests the arguments to determine which version to use. We can fix this by adding the restrict attribute to the pointer arguments in the function definition. This keyword was added in the C99 standard. Unfortunately, the restrict keyword is not standardized in C++, but the __restrict attribute works for GCC, Clang, and Visual C++. Another common form of this attribute in C++ compilers is __restrict__:

9 void stream_triad(double* restrict a, double* restrict b, double* restrict c, double scalar){

18 of 69

Vectorization methods

We used GCC to compile the source code with the restrict keyword added and got:

stream_triad.c:10:4: note: loop vectorized

stream_triad.c:10:4: note: loop vectorized

stream_triad.c:18:4: note: loop vectorized

The compiler now generates fewer versions of the function. It's also worth noting that the -fstrict-aliasing flag tells the compiler it may generate code aggressively on the assumption that there is no aliasing.

DEFINITION: Aliasing[1] is a situation where pointers point to overlapping memory locations. The compiler is then unable to determine whether the locations are the same memory, and so it cannot safely generate vectorized code or apply other optimizations.

[1] That is, a memory address has multiple aliases as a result of multiple pointers referring to overlapping memory.

19 of 69

Vectorization methods

In recent years, strict aliasing has become the default in GCC and other compilers (the -fstrict-aliasing flag is enabled at the -O2 and -O3 optimization levels). This broke a lot of code in which aliased variables actually existed. As a result, compilers have toned down how aggressively they generate more efficient code. All this is to say that you can get different results with different compilers, and even with different versions of the same compiler.

By using the restrict attribute, you promise the compiler that aliasing will not occur. We recommend using both the restrict attribute and the -fstrict-aliasing compiler flag. The attribute is carried with the source code across all architectures and compilers. You will need to apply the compiler flags for each compiler, but they affect all your source code. These examples might suggest that the best way for a programmer to achieve vectorization is to simply let the compiler handle it automatically. Although compilers are improving, they often fail to recognize the possibility of safely vectorizing a loop in more complex sections of code. Therefore, the programmer must assist the compiler with hints. We'll discuss this technique later.

20 of 69

Vectorization methods

TRAINING THE COMPILER THROUGH HINTS: PRAGMAS AND DIRECTIVES

So what do we do when the compiler can't figure out a loop and won't generate vectorized code? In this section, we'll cover how to give the compiler more precise instructions. This, in turn, gives you more control over the source code vectorization process. Here, you'll learn how to use pragmas and directives to pass information to the compiler for a portable vectorization implementation.

DEFINITION: A pragma is an instruction to the C or C++ compiler that guides how it interprets source code. It takes the form of a preprocessor statement beginning with #pragma. (In Fortran, where it is called a directive, it takes the form of a comment line beginning with !$.)

EXAMPLE: USING MANUAL COMPILER HINTS FOR VECTORIZATION

We need an example in which the compiler doesn't vectorize the code without assistance. For this, we'll use the example source code in the following listings. Together, they calculate the wave speed in each cell to determine the time step to use in the calculation. The time step can be no greater than the minimum time required for a wave to cross any grid cell.

21 of 69

Vectorization methods
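A sketch of the timestep kernel (our reconstruction based on the code fragments quoted later in this section; REAL_CELL is a hypothetical marker value for interior cells):

#include <math.h>

#define REAL_CELL 1   // hypothetical marker for interior cells

double timestep(int ncells, double g, double sigma, int* celltype,
                double* H, double* U, double* V, double* dx, double* dy){
   double wavespeed, xspeed, yspeed, dt;
   double mymindt = 1.0e20;
   for (int ic=0; ic<ncells; ic++){
      if (celltype[ic] == REAL_CELL){   // conditional block: control flow in the loop
         wavespeed = sqrt(g*H[ic]);
         xspeed = (fabs(U[ic])+wavespeed)/dx[ic];
         yspeed = (fabs(V[ic])+wavespeed)/dy[ic];
         dt = sigma/(xspeed+yspeed);
         if (dt < mymindt) mymindt = dt;
      }
   }
   return(mymindt);
}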

22 of 69

Vectorization methods

23 of 69

Vectorization methods

In this example, we added the -fopt-info-vec-missed compiler flag to report unsuccessful loop vectorizations. Compiling this source code yields the following output:

main.c:10:4: note: loop vectorized

timestep.c:9:4: missed: couldn't vectorize loop

timestep.c:9:4: missed: not vectorized: control flow in loop.

This vectorization report indicates that the timestep loop wasn't vectorized due to a conditional block within the loop. Let's see if we can optimize this loop by adding a pragma. Add the following line just before the for loop in timestep.c (on line 9):

#pragma omp simd reduction(min:mymindt)

Now compiling the code shows conflicting messages about the time step loop being vectorized:

main.c:10:4: note: loop vectorized

timestep_opt.c:9:9: note: loop vectorized

timestep_opt.c:11:7: note: not vectorized: control flow in loop.

24 of 69

Vectorization methods

We need to check the executable with a performance tool like likwid to see if the compiler is actually vectorizing it:

likwid-perfctr -g MEM_DP -C 0 ./timestep_opt

The output from likwid confirms that no vector instructions were executed.

For better portability, the pragma should also inform the compiler that some variables are not carried between loop iterations and therefore do not depend, or "anti-depend," on the execution order. These dependencies are defined after the listing in the example below.

#pragma omp simd private(wavespeed, xspeed, yspeed, dt) reduction(min:mymindt)

25 of 69

Vectorization methods

An even better way to indicate that a variable's scope is limited to a single loop iteration is to declare the variable within the loop scope:

double wavespeed = sqrt(g*H[ic]);
double xspeed = (fabs(U[ic])+wavespeed)/dx[ic];
double yspeed = (fabs(V[ic])+wavespeed)/dy[ic];
double dt = sigma/(xspeed+yspeed);

Now we can remove the private clause and the variable declarations before the loop. We can also add the restrict attribute to the function interface to tell the compiler that the pointers do not overlap:

double timestep(int ncells, double g, double sigma, int* restrict celltype, double* restrict H, double* restrict U, double* restrict V, double* restrict dx, double* restrict dy);

Even with all these changes, we couldn't get GCC to vectorize this code. Upon further investigation with GCC version 9, we finally succeeded by adding the -fno-trapping-math flag. If the conditional block contains a division, this flag tells the compiler not to worry about the operation trapping with a floating-point error, and the loop vectorizes. If the conditional block contains sqrt, the -fno-math-errno flag allows the compiler to vectorize it. Meanwhile, the Intel compiler vectorizes all versions.

26 of 69

Vectorization methods

One of the most common operations is the array sum. We encountered this type of operation as a reduction in Section 4.5. We'll add a bit of complexity to this operation by including a conditional block that limits the sum to real cells in the grid. Here, real cells are elements not on the boundary or ghost cells from other processors. We discuss ghost cells in Chapter 8.

27 of 69

Vectorization methods
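A sketch of the masked sum kernel, consistent with the line references in the discussion below (do_sum_wcheck is our placeholder name; REAL_CELL is a hypothetical marker for real cells):

 1 double do_sum_wcheck(double* restrict var, long ncells,
 2                      int* restrict celltype){
 3
 4    // REAL_CELL is a hypothetical marker value for real (interior) cells
 5
 6    double summer = 0.0;
 7    #pragma omp simd reduction(+:summer)
 8    for (long i = 0; i < ncells; i++){
 9       if (celltype[i] == REAL_CELL)
10          summer += var[i];
11    }
12    return(summer);
13 }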

The OpenMP SIMD pragma should automatically set the reduction variable to zero, but initialization on line 6 is still needed in case the pragma is ignored. The OpenMP SIMD pragma on line 7 informs the compiler that the variable summer is used in a reduction sum. The conditional expression on line 9 of the loop can be implemented with masked vector operations. Each vector lane has its own copy of summer, and the copies are combined at the end of the for loop.

The Intel compiler successfully recognizes the reduction to a sum and automatically vectorizes the loop without the SIMD pragma. GCC also vectorizes with versions 9 and later of the compiler.

EXAMPLE: USING THE COMPILER'S VECTORIZATION REPORT AS A GUIDE TO ADDING PRAGMAS

Since the Intel compiler generates more detailed vectorization reports, we will use it in this example. The source code for this example is taken from Listing 4.14. The main loop is shown in the listing below, with line numbers.

28 of 69

Vectorization methods
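The stencil kernel under discussion is a five-point blur; a sketch of its main loops (our reconstruction, with x and xnew allocated by malloc2D as shown later; in stencil.c the inner loop sits near line 57):

for (int j = 1; j < jmax-1; j++){
   for (int i = 1; i < imax-1; i++){
      xnew[j][i] = (x[j][i] + x[j][i-1] + x[j][i+1]
                  + x[j-1][i] + x[j+1][i])/5.0;
   }
}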

29 of 69

Vectorization methods

In the previous example, the flow dependency[1] arises from the possibility of aliasing between x and xnew. The compiler is more conservative here than necessary. The output dependency only comes into play when attempting to vectorize the outer loop: the compiler cannot be sure that subsequent iterations of the inner loop will not write to the same location as a previous iteration. Before we continue, let's define a few terms.

Flow dependency – a variable inside a loop is read after it is written, also called read-after-write (RAW).

Anti-flow dependency – a variable inside a loop is written after it is read, also called write-after-read (WAR).

Output dependency – a variable is written more than once in a loop.

In the case of the GCC v8.2 compiler, the vectorization report is as follows:

stencil.c:57:10: note: loop vectorized

stencil.c:57:10: note: loop versioned for vectorization because of possible aliasing

stencil.c:51:7: note: loop vectorized

stencil.c:37:7: note: loop vectorized

stencil.c:37:7: note: loop versioned for vectorization because of possible aliasing

30 of 69

Vectorization methods

[1] A flow dependence arises from two instructions that access or modify the same resource. An instruction S2 is flow-dependent on S1 if and only if S1 modifies a resource that S2 reads, and S1 precedes S2 in execution. An instruction S2 is anti-dependent on S1 if and only if S2 modifies a resource that S1 reads, and S1 precedes S2 in execution. See https://en.wikipedia.org/wiki/Dependence_analysis.

The GCC compiler decides to create two versions and tests which one to use at runtime. The report is so good that it gives us a clear idea of ​​the cause of the problem. We can solve such problems in two ways. We can help the compiler by adding a pragma before the loop on line 57 like this:

#pragma omp simd

for (int i = 1; i < imax-1; i++){

Another approach to solving this problem is to add the restrict attribute to the definitions of x and xnew:

double** restrict x = malloc2D(jmax, imax);

double** restrict xnew = malloc2D(jmax, imax);

31 of 69

Vectorization methods

The vectorization report for Intel now shows that the inner loop is vectorized using a vectorized peel loop, a vectorized main loop, and a vectorized remainder loop. A few more definitions are needed here.

  • A peel loop is a loop executed on unaligned data so that the main loop can then align it. Often, a peel loop is executed conditionally at runtime if the data is detected to be unaligned.
  • A remainder loop is a loop executed after the main loop to process a partial set of data that is too small to fill the full vector length.

A peel loop is added to process unaligned data at the beginning of the loop, and the remainder loop processes any additional data at the end of the loop. The reports for all three loops look identical. Looking at the main loop report, we see that the estimated speedup is more than six times faster:

32 of 69

Vectorization methods

LOOP BEGIN at stencil.c(55,21)

remark #15388: vec support: reference xnew[j][i] has aligned access [ stencil.c(56,13) ]

remark #15389: vec support: reference x[j][i] has unaligned access [ stencil.c(56,28) ]

remark #15389: vec support: reference x[j][i-1] has unaligned access [ stencil.c(56,38) ]

remark #15389: vec support: reference x[j][i+1] has unaligned access [ stencil.c(56,50) ]

remark #15389: vec support: reference x[j-1][i] has unaligned access [ stencil.c(56,62) ]

remark #15389: vec support: reference x[j+1][i] has unaligned access [ stencil.c(56,74) ]

remark #15381: vec support: unaligned access used inside loop body

remark #15305: vec support: vector length 8

remark #15399: vec support: unroll factor set to 2

remark #15309: vec support: normalized vectorization overhead 0.236

remark #15301: OpenMP SIMD LOOP WAS VECTORIZED

remark #15449: unmasked aligned unit stride stores: 1

remark #15450: unmasked unaligned unit stride loads: 5

remark #15475: --- begin vector cost summary ---

remark #15476: scalar cost: 43

remark #15477: vector cost: 6.620

remark #15478: estimated potential speedup: 6.370

remark #15486: divides: 1

remark #15488: --- end vector cost summary ---

remark #25015: Estimate of max trip count of loop=125

LOOP END

33 of 69

Vectorization methods

Please note that the estimated speedup is carefully labeled as a potential speedup. It is unlikely that you will achieve the full estimated speedup unless:

  • your data is in a high cache level;
  • the actual array length is not short;
  • you are not bandwidth-limited in loading data from main memory.

In this implementation, the actual measured speedup on a Skylake Gold processor with the Intel compiler was 1.39x over the non-vectorized version. The vectorization report estimates the processor speedup, but this kernel is still limited by main-memory bandwidth, and that limit must be dealt with.

For the GCC compiler, the SIMD pragma successfully eliminates the versioning of the two loops. Curiously, adding the restrict attribute has no effect, and both loops are still versioned. Since the vectorized version is used in all these cases, however, performance is unchanged. To measure the speedup, we can compare against a build with vectorization disabled; the speedup from vectorization with GCC is about 1.22x.

34 of 69

Vectorization methods

CRAPPY LOOPS, WE HAVE THEM: USE COMPILER INTRINSICS

For problematic loops that refuse to vectorize even with hints, compiler intrinsics offer another option. In this section, we'll explore approaches to using intrinsics for greater control over vectorization. The downside of intrinsics is that they are less portable. We'll look at several examples that use intrinsics to vectorize the Kahan sum demonstrated in Section 5.7. In that section, we noted that a Kahan sum costs about four floating-point operations per element instead of the one needed by a conventional sum. But if we can vectorize the Kahan summation, its cost becomes much lower.

The implementations in these examples use 256-bit vector intrinsics, which speed up the operation by almost four times compared to the serial version. We demonstrate three different approaches to implementing the Kahan summation kernel in the example listings below. The full implementation is at https://github.com/lanl/GlobalSums.git; an extract from it is included in the GlobalSumsVectorized directory in the source code for this chapter.

35 of 69

Vectorization methods

EXAMPLE: IMPLEMENTING THE KAHAN SUMMATION USING INTRINSIC VECTOR FUNCTIONS

The first example of compiler intrinsics, developed by Andy Dubois and Brett Neumann of Los Alamos National Laboratory, uses Intel x86 intrinsic vector functions. This is the most commonly used set of intrinsics and can be executed on both Intel and AMD processors that support AVX vector instructions.

36 of 69

Vectorization methods
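A minimal 4-wide sketch of the same idea with x86 AVX intrinsics (the function name is ours; assumes ncells is a multiple of four and leaves out the remainder handling of the full GlobalSums version):

#include <immintrin.h>

double kahan_sum_avx(double* var, long ncells){
   __m256d local_sum        = _mm256_set1_pd(0.0);
   __m256d local_correction = _mm256_set1_pd(0.0);
   for (long i = 0; i < ncells; i += 4){
      __m256d var_v = _mm256_loadu_pd(&var[i]);
      __m256d corrected_next_term = _mm256_add_pd(var_v, local_correction);
      __m256d new_sum = _mm256_add_pd(local_sum, corrected_next_term);
      // recover the low-order bits lost in the wide add
      local_correction = _mm256_sub_pd(corrected_next_term,
                                       _mm256_sub_pd(new_sum, local_sum));
      local_sum = new_sum;
   }
   double sum[4], corr[4];
   _mm256_storeu_pd(sum,  local_sum);
   _mm256_storeu_pd(corr, local_correction);
   // combine the four lanes (a scalar Kahan pass here would preserve more accuracy)
   return (sum[0]+corr[0]) + (sum[1]+corr[1]) + (sum[2]+corr[2]) + (sum[3]+corr[3]);
}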

37 of 69

Vectorization methods

38 of 69

Vectorization methods

EXAMPLE: IMPLEMENTING THE KAHAN SUM USING GCC VECTOR EXTENSIONS

The second implementation of the Kahan summation uses GCC vector extensions. These can support other architectures in addition to the AVX vector units on x86. However, the portability of GCC's vector extensions is limited to platforms where the GCC compiler is available. If a longer vector length is specified than the hardware supports, the compiler generates instructions using combinations of shorter vector lengths.

39 of 69

Vectorization methods
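A corresponding 4-wide sketch with GCC vector extensions (same assumptions as the AVX sketch above; the direct dereference of the vec4d type assumes 32-byte-aligned data):

typedef double vec4d __attribute__((vector_size(4*sizeof(double))));

double kahan_sum_gccvec(double* var, long ncells){
   vec4d local_sum        = {0.0, 0.0, 0.0, 0.0};
   vec4d local_correction = {0.0, 0.0, 0.0, 0.0};
   for (long i = 0; i < ncells; i += 4){
      vec4d var_v = *(vec4d *)&var[i];
      vec4d corrected_next_term = var_v + local_correction;
      vec4d new_sum = local_sum + corrected_next_term;
      local_correction = corrected_next_term - (new_sum - local_sum);
      local_sum = new_sum;
   }
   double sum = 0.0;
   for (int lane = 0; lane < 4; lane++){   // combine the four vector lanes
      sum += local_sum[lane] + local_correction[lane];
   }
   return sum;
}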

40 of 69

Vectorization methods

EXAMPLE: IMPLEMENTING THE KAHAN SUMMATION USING A C++ VECTOR CLASS

For C++ code, Agner Fog of the Technical University of Denmark wrote a C++ vector class library released under an open-source license. This class library is portable across hardware architectures and automatically adapts to older hardware with shorter vector lengths. Fog's vector class library is well developed and includes an extensive manual. More information about this library is provided in the further reading section at the end of the chapter.

In this example, we'll write a vector version of the Kahan summation in C++ and then call it from our main C program. We'll also handle the remaining values that don't fill the full vector width using the load_partial function. We won't always have arrays that divide evenly into groups of four. Sometimes arrays are padded with extra zeros to reach the needed size, but a better approach is to handle the remaining values in a separate block of code at the end of the loop. Note that the Vec4d data type is defined in the vector class header.

41 of 69

Vectorization methods
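A sketch of the C++ version using Fog's vector class library, including the remainder handling with load_partial (the function name is ours; Vec4d, load, load_partial, and horizontal_add come from the library):

#include "vectorclass.h"

extern "C" double kahan_sum_vcl(double* var, long ncells){
   Vec4d local_sum(0.0), local_correction(0.0), var_v;
   long ncells_main = (ncells/4)*4;           // portion that fills whole vectors
   for (long i = 0; i < ncells_main; i += 4){
      var_v.load(&var[i]);                    // natural load syntax
      Vec4d corrected_next_term = var_v + local_correction;
      Vec4d new_sum = local_sum + corrected_next_term;
      local_correction = corrected_next_term - (new_sum - local_sum);
      local_sum = new_sum;
   }
   if (ncells_main < ncells){                 // remainder: partial load pads with zeros
      var_v.load_partial((int)(ncells - ncells_main), &var[ncells_main]);
      Vec4d corrected_next_term = var_v + local_correction;
      Vec4d new_sum = local_sum + corrected_next_term;
      local_correction = corrected_next_term - (new_sum - local_sum);
      local_sum = new_sum;
   }
   return horizontal_add(local_sum + local_correction);
}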

42 of 69

Vectorization methods

The load and store methods in this vector class library are more natural than the syntax of some other vector intrinsics.

43 of 69

Vectorization methods

We then tested the Kahan sum implemented with the three kinds of vector intrinsics against the original serial sum and the original Kahan sum. We used GCC version 8.2 and ran the tests on a Skylake Gold processor. GCC is unable to vectorize the original code for either the serial sum or the original Kahan sum. Adding the OpenMP pragma allows vectorization of the serial sum, but for the Kahan sum, the loop-carried dependency prevents the compiler from vectorizing the code.

In the performance results below, it is important to note that the vectorized versions of the serial sum and the Kahan sum with all three kinds of vector intrinsics have nearly identical execution times. We can perform more floating-point operations in the same amount of time while simultaneously reducing the numerical error. This is a great example of how, with some effort, floating-point operations are free.

44 of 69

Vectorization methods

SETTINGS INFO -- ncells 1073741824 log 30
Initializing mesh with Leblanc problem, high values first

relative diff   runtime    Description

   8.423e-09    1.273343   Serial sum
   0            3.519778   Kahan sum with double double accumulator

4 wide vectors serial sum
  -3.356e-09    0.683407   Intel vector intrinsics Serial sum
  -3.356e-09    0.682952   GCC vector intrinsics Serial sum
  -3.356e-09    0.682756   Fog C++ vector class Serial sum

4 wide vectors Kahan sum
   0            1.030471   Intel vector intrinsics Kahan sum
   0            1.031490   GCC vector extensions Kahan sum
   0            1.032354   Fog C++ vector class Kahan sum

8 wide vector serial sum
  -1.986e-09    0.663277   Serial sum (OpenMP SIMD pragma)
  -1.986e-09    0.664413   8 wide Intel vector intrinsic Serial sum
  -1.986e-09    0.664067   8 wide GCC vector intrinsic Serial sum
  -1.986e-09    0.663911   8 wide Fog C++ vector class Serial sum

8 wide vector Kahan sum
  -1.388e-16    0.689495   8 wide Intel vector intrinsics Kahan sum
  -1.388e-16    0.689100   8 wide GCC vector extensions Kahan sum
  -1.388e-16    0.689472   8 wide Fog C++ vector class Kahan sum

45 of 69

Vectorization methods

NOT FOR THE FAINT-HEARTED: USING ASSEMBLY CODE FOR VECTORIZATION

In this section, we'll explore situations where it makes sense to write vector assembly code in your application. We'll also discuss what vector assembly code looks like, how to disassemble compiled code, and how to determine which vector instruction set the compiler generated.

Programming vector units directly with vector assembly instructions offers the best opportunity to achieve maximum performance. However, this requires a deep understanding of the performance behavior of a large number of vector instructions across many different processors. Programmers without that experience will likely achieve higher performance using vector intrinsics, as shown in the previous section, than by writing vector assembly instructions directly. Furthermore, the portability of vector assembly code is limited: it runs only on a small set of processor architectures. For these reasons, it rarely makes sense to write vector assembly instructions.

EXAMPLE: OVERVIEW OF VECTOR ASSEMBLY INSTRUCTIONS

To get a feel for what vector assembly instructions look like, we can print these instructions from an object file using the objdump command. The listing shows the output of the following command:

objdump -d -M intel --no-show-raw-insn <object code file.o>

46 of 69

Vectorization methods
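Purely as an illustration (a made-up fragment in Intel syntax, not actual output from this build), AVX instructions on 256-bit ymm registers look something like this:

vmovupd ymm1, YMMWORD PTR [rax]    ; load four doubles
vaddpd  ymm0, ymm0, ymm1           ; four double-precision adds at once
vmovupd YMMWORD PTR [rbx], ymm0    ; store four doubles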

If you see ymm registers, vector instructions were generated. zmm registers indicate that AVX512 vector instructions are present. xmm registers are generated for both scalar and SSE vector instructions. We can tell that the list was taken from the kahan_gcc_vector.c.o file because there are no zmm instructions in the output. If we look at the kahan_gcc_vector8.c.o file, which generates 512-bit instructions, we'll see zmm instructions.

Since it rarely makes sense to do more than simply view the assembly instructions generated by the compiler, we won't cover an example of writing an assembly routine from scratch.

47 of 69

Programming style for better vectorization

We suggest adopting a programming style that is more compatible with the needs of vectorization and other forms of loop parallelization. Based on the lessons learned from the examples given in this chapter, certain programming styles help the compiler generate vectorized code. Adopting the following programming styles results in improved performance out of the box and reduced optimization efforts.

General suggestions:

  • Use the restrict attribute for pointers in arguments and function declarations (C and C++);
  • Use pragmas or directives where necessary to inform the compiler;
  • Be careful with compiler-specific optimizations such as #pragma unroll and other techniques; they can restrict the compiler's options for transforming the code[1];
  • Place exceptions and error checking with print statements in a separate loop.

48 of 69

Programming style for better vectorization

Regarding data structures:

  • Make the innermost loop dimension of your data structures long;
  • Use the smallest data type necessary (short instead of int);
  • Use contiguous memory accesses. Some newer instruction sets implement gather/scatter memory loading, but they are less efficient;
  • Use a structure of arrays (SoA) instead of an array of structures (AoS), as sketched after this list;
  • Use memory-aligned data structures whenever possible.
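A small sketch of the two layouts from the structure-of-arrays point above:

// Array of structures (AoS): x, y, z interleaved; vector loads are strided
struct point { double x, y, z; };
struct point cell_aos[1000];

// Structure of arrays (SoA): each coordinate contiguous; unit-stride loads
struct cell_coords {
   double x[1000];
   double y[1000];
   double z[1000];
};
struct cell_coords cell_soa;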

Regarding loop structures:

  • Use simple loops without special exit conditions;
  • Copy global loop bounds into local variables and use those in the loop;
  • Use a loop index for array addresses whenever possible;
  • Make loop bounds known to the compiler where possible. If a loop runs only three iterations, the compiler may unroll the loop instead of generating a four-lane-wide vector instruction;
  • Avoid array syntax in performance-critical loops (Fortran).

[1] The #pragma unroll pragma tells the compiler to replace loop iterations with the specified number of copies of the loop body, reducing the iteration count in the loop control statement or removing it entirely.

49 of 69

Programming style for better vectorization

In the loop body:

  • Define local variables inside the loop to make it clear that they don't carry over to subsequent iterations (C and C++);
  • Variables and arrays inside the loop should be write-only or read-only (appearing only on the left-hand side or only on the right-hand side of the equals sign, except for reductions);
  • Don't use local variables in the loop for other purposes—create a new variable. The memory you waste is far less important than the confusion it creates for the compiler;
  • Avoid function calls and instead inline function bodies (manually or with the compiler);
  • Limit the use of conditional blocks inside the loop and, when necessary, use simple forms that can be masked.

Regarding compiler settings and flags:

  • Use the latest version of the compiler and prefer compilers that perform better vectorization;
  • Use the strict aliasing compiler flag;
  • Write code for the most powerful vector instruction set you can work with.

50 of 69

Vectorization-related compiler flags for various compilers

Tables 6.2 and 6.3 show the compiler flags recommended for vectorization for the latest versions of various compilers. Compiler flags for vectorization change frequently, so consult the documentation for the compiler version you are using.

The strict aliasing flag listed in the second column of Table 6.2 should help automatic vectorization for C and C++, but make sure it doesn't break your code. The third column of Table 6.2 lists various options for specifying the vector instruction set to use with certain compilers. Those shown in the table should be a good starting point. Vectorization reports can be generated with the compiler flags in the second column of Table 6.3. Compiler reports are continually being improved for most compilers and are likely to change. For GCC, the opt and missed report flags are recommended. Requesting loop optimization reports together with vectorization reports can be useful, because you can see whether loops have been unrolled or interchanged. If you use OpenMP SIMD directives without the rest of OpenMP, use the flags in the last column of Table 6.3.
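As an illustration of the kind of flags involved (a sketch for a recent GCC; check the tables and your compiler's documentation), a build line might look like:

gcc -O3 -march=native -fstrict-aliasing \
    -fopt-info-vec-optimized -fopt-info-vec-missed stream_triad.c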

51 of 69

Vectorization-related compiler flags for various compilers

Table 6.2 Vectorization flags for various compilers

52 of 69

Vectorization-related compiler flags for various compilers

Table 6.3 OpenMP SIMD vectorization report flags for various compilers

53 of 69

Vectorization-related compiler flags for various compilers

Vector instructions can be specified for a single instruction set, such as AVX2, or for multiple instruction sets. We'll show how to do both. For a single instruction set, the flags shown in the tables above tell the compiler to use the vector instruction set of the processor it is compiling on (-march=native, -xHost, and -qarch=pwr9). Without such a flag, the compiler uses the SSE2 instruction set. If you need to support a wide range of processors, you can specify an older instruction set or simply use the default, at some performance cost relative to the newer instruction sets.

With the Intel compiler, you can add support for more than one vector instruction set. In practice, this is commonly done for the Intel Knights Landing processor, where the instruction set for the host processor may differ. To do this, you must specify both instruction sets:

-axmic-avx512 -xcore-avx2

The -ax flag adds an additional instruction set. Note that the host keyword cannot be used when requesting two instruction sets. We briefly mentioned using a floating-point flag to encourage vectorization of the sum reduction kernel when discussing Listing 6.1. When vectorizing loops with a conditional block, the compiler inserts a mask so that only part of the vector results is used. However, the masked operations can still cause floating-point errors through division by zero or taking the square root of a negative number. The GCC and Clang compilers require the additional floating-point flags shown in the last column of Table 6.2 to vectorize loops with conditional blocks and any potentially problematic floating-point operations.

54 of 69

Vectorization-related compiler flags for various compilers

In some situations, you may want to disable vectorization. Disabling vectorization allows you to see the improvements and speedups achieved with vectorization. It allows you to verify that you get the same answer with and without vectorization. Sometimes automatic vectorization gives you the wrong answer, and therefore you might want to disable it. You might also want to vectorize only those files that are computationally expensive and skip the rest.

The vectorization and performance results in this chapter were obtained with GCC v8 and v9 and the Intel compiler v19. As noted in Table 6.1, support for 512-bit vectors was added to GCC starting with version 8 and to the Intel compiler starting with version 18. Compiler support for the new 512-bit vector hardware is therefore quite recent.

Table 6.4 Compiler Flags for Disabling Vectorization

55 of 69

Vectorization-related compiler flags for various compilers

A CMAKE MODULE FOR SETTING COMPILER FLAGS

Working out all the vectorization flags for each compiler is tedious and difficult. We've therefore created a CMake module you can use, similar to the FindOpenMP.cmake and FindMPI.cmake modules. The main CMakeLists.txt file then simply needs:

find_package(Vector)
if (CMAKE_VECTOR_VERBOSE)
    set(VECTOR_C_FLAGS "${VECTOR_C_FLAGS} ${VECTOR_C_VERBOSE}")
endif()
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} ${VECTOR_C_FLAGS}")

The CMake module is shown in FindVector.cmake in the main examples directory for this chapter at https://github.com/EssentialsofParallelComputing/Chapter6.git. Also, see the GlobalSumsVectorized code example using the FindVector.cmake module. We'll also be porting this module to other examples to keep our CMakeLists.txt file clean. The following listing is an excerpt from the C compiler module. Flags for C++ and Fortran are specified by similar code in the FindVector.cmake module.

56 of 69

Vectorization-related compiler flags for various compilers

57 of 69

Vectorization-related compiler flags for various compilers

58 of 69

Vectorization-related compiler flags for various compilers

59 of 69

Vectorization-related compiler flags for various compilers

60 of 69

Vectorization-related compiler flags for various compilers

61 of 69

Vectorization-related compiler flags for various compilers

62 of 69

OpenMP SIMD directives for improved portability

With the release of OpenMP 4.0, we now have the ability to use a more portable set of SIMD directives. These directives are implemented as commands rather than hints. We already saw the use of these directives in Section 6.3.3. These directives can be used solely to request vectorization, or they can be combined with the for/do directive to request both threading and vectorization. The general syntax for C and C++ pragmas is:

#pragma omp simd        // vectorizes the next loop or code block
#pragma omp for simd    // threads and vectorizes the next loop

The general syntax for Fortran directives is:

!$omp simd      ! vectorizes the next loop or code block
!$omp do simd   ! threads and vectorizes the next loop

63 of 69

OpenMP SIMD directives for improved portability

The basic SIMD directive can be augmented with additional clauses to convey more information. The most common is some variation of the private clause. It eliminates spurious dependencies by creating a separate private variable for each vector lane. Here's an example of this syntax:

#pragma omp simd private(x)
for (int i=0; i<n; i++){
   x = array[i];
   y = sqrt(x)*x;
}

In the case of a simple private clause, the recommended approach for C and C++ programs is to define the variable inside the loop to make your intent clear:

double x = array[i];

64 of 69

OpenMP SIMD directives for improved portability

The firstprivate clause initializes the private variable for each lane with the value entering the loop, while the lastprivate clause sets the variable after the loop to the logical last value it would have had in the sequential form of the loop.

The reduction clause creates a private variable for each lane and then, at the end of the loop, performs the specified operation to combine the values from each lane. Reduction variables are initialized for each vector lane as appropriate for the specified operation.

The aligned clause informs the compiler that the data is aligned on a 64-byte boundary, eliminating the need to create peel loops. Aligned data can be loaded into vector registers more efficiently. However, the memory must first be allocated with alignment. Several functions can be used for aligned memory allocation, though portability issues remain. Here are some of the possibilities:

void *memalign(size_t alignment, size_t size);

int posix_memalign(void **memptr, size_t alignment, size_t size);

void *aligned_alloc(size_t alignment, size_t size);

void *aligned_malloc(size_t alignment, size_t size);
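For example, a minimal usage sketch with posix_memalign, one of the more portable options, allocating a 64-byte-aligned array:

double *x;
posix_memalign((void **)&x, 64, 1000*sizeof(double));   // 64-byte alignment
// ... use x in loops annotated with the aligned clause ...
free(x);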

65 of 69

OpenMP SIMD directives for improved portability

You can also use memory definition attributes to specify alignment:

double x[100] __attribute__((aligned(64)));

Another important modifier is the collapse clause. It tells the compiler to combine nested loops into a single loop for a vectorized implementation. The argument to this clause specifies the number of loops to collapse:

#pragma omp simd collapse(2)
for (int j=0; j<n; j++){
   for (int i=0; i<n; i++){
      x[j][i] = 0.0;
   }
}

The loops must be perfectly nested: statements appear only in the innermost loop, with no extraneous statements before or after each loop block.

66 of 69

OpenMP SIMD directives for improved portability

The following clauses are intended for more specialized cases.

  • The linear clause specifies that a variable changes by a linear function at each iteration.
  • The safelen clause tells the compiler that dependencies are separated by a specified length, allowing it to vectorize with vector lengths less than or equal to the safe length argument.
  • The simdlen clause generates vectors of a specified length instead of the default length.

OPENMP SIMD FUNCTIONS

We can also vectorize an entire function or subroutine, making it callable from within the vectorized code region. The syntax is slightly different for C/C++ and Fortran. For C/C++, we will use an example in which the radial distance of an array of points is calculated using the Pythagorean theorem:

#pragma omp declare simd
double pythagorean(double a, double b){
   return(sqrt(a*a + b*b));
}
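A vectorized caller for this function might then look like this (a sketch; the array names are ours):

#pragma omp simd
for (int i=0; i<n; i++){
   dist[i] = pythagorean(x[i], y[i]);
}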

67 of 69

OpenMP SIMD directives for improved portability

For Fortran, the subroutine or function name must be specified as an argument to the SIMD directive:

subroutine pythagorean(a, b, c)
!$omp declare simd(pythagorean)
   real*8 a, b, c
   c = sqrt(a**2 + b**2)
end subroutine pythagorean

The OpenMP SIMD function directive can also take several of the same clauses, plus some new ones, as shown below.

  • The inbranch or notinbranch clause tells the compiler whether or not the function is called from within a conditional block.
  • The uniform clause indicates that the specified argument remains constant across all calls and does not need to be treated as a vector in the vectorized call.

68 of 69

OpenMP SIMD directives for improved portability

  • The linear(ref, val, uval) clause tells the compiler that the variable in the clause argument is linear in some form. For example, Fortran passes arguments by reference, and successive calls pass successive array locations, which the ref modifier expresses. In the previous Fortran example, the clause would look like this:

!$omp declare simd(pythagorean) linear(ref(a, b, c))

This clause can also indicate that a value is linear with a constant stride greater than one, as might occur with strided access.

  • The aligned and simdlen clauses are similar to their use in the OpenMP SIMD loop directives.

69 of 69

Exercises

  1. Experiment with the automatic vectorization loops from the multi-material source code in Section 4.3 (https://github.com/LANL/MultiMatTest.git). Add the vectorization and loop reporting flags and see what your compiler tells you.
  2. Add OpenMP SIMD pragmas to the loop you selected in the first exercise to help the compiler vectorize loops.
  3. In one of the examples with intrinsic vector functions, change the vector length from four double-precision values to an eight-wide vector. Review the source code in this chapter for working examples of eight-wide implementations.
  4. If you are using an older processor, how well does your program from Exercise 3 run? What is the impact of this choice on performance?