1 of 24

Writing Performant C++ Code

The road to slow code is paved with wrong assumptions

2 of 24

About Me

Larry Bank

I optimize other people's code for a living

Web: https://www.bitbanksoftware.com

Email: larry@bitbanksoftware.com

Twitter: @fast_code_r_us

Github: bitbank2

Blog: https://bitbanksoftware.blogspot.com

3 of 24

Have empathy for your C++ compiler

This might sound silly on the surface, but it just means to try to see your code from the point of view of the compiler.

  • Your code tells a story of how you want to process your data
  • Providing more detail can create a richer story
  • Understanding how your 'verbs' affect the code might cause you to choose different ones
  • Provide too little info and the compiler will fill in the blanks with choices you may not appreciate

4 of 24

Know the cost of your choices

One line of code is not equivalent to another; choose wisely.

A = B + C; is very different from memcpy(pA, pB, iLen);

The first can usually turn into 1 instruction, while the second can take any number of cycles to complete. Modern compilers are smart enough to fix silly mistakes like this, replacing the call with a single move:

memcpy(dest, src, 4);
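To make the comparison concrete, here is a minimal sketch (function names are illustrative, not from the slide) contrasting the two lines:

```cpp
#include <cstring>

// A = B + C: for plain integral types this usually compiles down to a
// register add (plus perhaps a store) - one or two instructions.
int add(int B, int C) {
    int A = B + C;
    return A;
}

// memcpy with a small constant length: the compiler can see the size and
// typically replaces the call with a single 32-bit load/store pair.
void copy4(void *dest, const void *src) {
    memcpy(dest, src, 4);
}
```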

5 of 24

What does the compiler know?

We have to balance a lot of information to accomplish any task. This collection of info (and assumptions) doesn't always translate well into code. Constants versus variables and volatility are some of the most important pieces of info that can get 'lost in translation'. An example:

int iLen;
const int LEN = 4;

memcpy(pDst, pSrc, iLen); vs. memcpy(pDst, pSrc, LEN);

In the first example, even if iLen == 4 at run time, the code will call memcpy() and spend hundreds of cycles copying 4 bytes, while in the second, the compiler can probably turn it into a single instruction.

6 of 24

The slowness of OOP

  • The compiler treats writes to member variables as if they're volatile
  • Methods that modify member variables repeatedly will waste many cycles
  • Worst case scenarios (read/modify/write) will occur frequently
  • Default behavior is inefficient if your fn is the only one changing the data
  • Use the __restrict modifier on pointers to data only your fn will modify
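A minimal sketch of the last bullet (the struct and function names are assumed for illustration; __restrict is a common compiler extension accepted by GCC, Clang and MSVC):

```cpp
struct Accum { int total; };

// Without __restrict the compiler must assume pA might alias pSrc, so it
// stores pA->total back to memory on every iteration.
void sum_plain(Accum *pA, const int *pSrc, int iLen) {
    for (int i = 0; i < iLen; i++)
        pA->total += pSrc[i];
}

// With __restrict we promise that these pointers are the only way the
// function reaches that memory, so the compiler may keep pA->total in a
// register for the whole loop and write it once at the end.
void sum_restrict(Accum *__restrict pA, const int *__restrict pSrc, int iLen) {
    for (int i = 0; i < iLen; i++)
        pA->total += pSrc[i];
}
```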

7 of 24

Accessing member variables through an object pointer will generate slow code because the compiler can't know how volatile the variables actually are. It will write new values to memory as soon as they change.

The auto-vectorizer won't even touch it. The best the compiler can do is unroll the loop because it's compelled to write the new value to the member variable as it changes.

8 of 24

If you're going to make multiple changes to a member variable, either reserve a local variable to do the work and write it into the structure once you've finished or mark the pointer to the structure/array/variable as __restrict to tell the compiler that the data will only be modified by your function.

For this case, the auto-vectorizer can do a decent job and is free to wait until the end of the loop before it has to write the updated value into the structure.
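A sketch of the local-variable pattern described above (class and method names are assumed, not from the slide):

```cpp
class Counter {
public:
    int total = 0;

    // Slow: 'total' is a member, so the compiler writes it back to memory
    // on every iteration of the loop.
    void add_all_slow(const int *p, int n) {
        for (int i = 0; i < n; i++)
            total += p[i];
    }

    // Fast: a local accumulator can live in a register for the whole loop;
    // the member is written exactly once at the end.
    void add_all_fast(const int *p, int n) {
        int t = total;
        for (int i = 0; i < n; i++)
            t += p[i];
        total = t;
    }
};
```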

9 of 24

Know your target machine

Code which is aware of your target machine's capabilities can make a huge difference in performance. Some of the features to think about:

  • Speed disparity between memory and instruction execution
  • Memory cache / special regions
  • ISA: Floating point / SIMD / hardware divide / special instructions
  • Protected mode OS / bare metal

10 of 24

Not all memory is created equal

Is your code running on a PC or embedded microcontroller? This can make a huge difference in how you design your project.

  • Embedded CPUs can usually access on-chip SRAM and FLASH in 1 clock cycle
  • PC CPUs can have > 200x disparity between SDRAM speed and CPU cycles
  • malloc/free evicts data from the cache too - very expensive on a PC, no big deal on an MCU
  • Place your working data set in the fastest memory/cache you can

11 of 24

Portable and Target aware?

  • Code portability and utilizing unique features of your target machine don't have to be mutually exclusive.
  • Only a small portion of your code contains time-critical functions
  • Keep a generic C++ reference version of all functions and #ifdef or use templates to coexist with optimized ones for your target platform(s)
  • Use CPU intrinsics + SIMD if you really need the speed

12 of 24

Let's talk math...

  • Division/Modulus (int and float) is expensive - always
      For circular buffers, avoid modulus with compare+subtract
      For hash tables, make the size a power of 2 and use AND
  • Carelessly converting between float and int is expensive
  • Math functions are even more expensive if you use the default (double-precision) versions instead of the 'f' 32-bit float versions (e.g. sqrt vs sqrtf)
  • Yes, lookup tables are still a useful tool to avoid repeated calculations
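Sketches of the two modulus-avoidance tricks from the first bullet (the names and sizes are assumed for illustration):

```cpp
// Circular buffer: compare+subtract instead of %.
// A compare and a subtract are far cheaper than an integer divide.
int advance(int idx, int bufSize) {
    idx++;
    if (idx >= bufSize)
        idx -= bufSize;
    return idx;
}

// Hash table whose size is a power of 2: AND instead of %.
unsigned slot(unsigned hash) {
    const unsigned TABLE_SIZE = 1024;   // must be a power of 2
    return hash & (TABLE_SIZE - 1);     // same result as hash % TABLE_SIZE
}
```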

13 of 24

System calls

  • Calls into a protected operating system (e.g. fwrite) have a high overhead
  • Do as much work per call as possible (e.g. work with files in large chunks)

Slow:
    int my_array[MY_SIZE];
    for (i = 0; i < MY_SIZE; i++) {
        fread(&my_array[i], sizeof(int), 1, h);
    }

Fast:
    int my_array[MY_SIZE];
    fread(my_array, sizeof(int), MY_SIZE, h);

  • Manage your own buffers and threads (aka memory / thread pools)

14 of 24

Inlined Functions

Inlining functions tells the compiler to replicate the code for that function at the point it’s used instead of using a call+return to a single copy of the code. The code size increases, but the overhead of pushing+popping parameters and scratch registers from the stack can be worth the extra size if it’s used very frequently. Declaring a function static will usually let the compiler decide if it should be inlined too.
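A small sketch of the idea (the function names are illustrative):

```cpp
// 'static inline' lets the compiler substitute the body at each call site
// instead of emitting a call/return to a single copy of the code.
static inline int clamp255(int v) {
    return v < 0 ? 0 : (v > 255 ? 255 : v);
}

// Used in a hot loop, clamp255 is typically folded straight into the loop
// body, so no parameters or scratch registers touch the stack.
void saturate(int *p, int n) {
    for (int i = 0; i < n; i++)
        p[i] = clamp255(p[i]);
}
```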

15 of 24

Non-Temporal Writes

When working with a multi-core CPU, there is the issue of cache coherency - whether each core's cached copy of data stays consistent with the contents of DRAM. Intel and Arm created an optional type of data write called NT (non-temporal). NT writes skip cache coherence, but in exchange, they can complete faster. You can make use of such writes when you know that the data won't be read again soon, so that future reads of that data can come from DRAM (not the cache).

Example use: converting a large image buffer to a different pixel format (color to grayscale).
Intel intrinsic: _mm_stream_si128
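A sketch of how _mm_stream_si128 might be used; the function name nt_copy is assumed, the destination is assumed 16-byte aligned (NT stores require it), and a plain memcpy fallback is included for non-SSE2 builds:

```cpp
#include <cstddef>
#include <cstring>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Copy a buffer using non-temporal stores where available. Worthwhile only
// when the destination won't be read again soon, so bypassing the cache
// avoids evicting useful data.
void nt_copy(void *dst, const void *src, size_t len) {
#if defined(__SSE2__)
    auto *d = static_cast<__m128i *>(dst);           // must be 16-byte aligned
    auto *s = static_cast<const __m128i *>(src);
    size_t blocks = len / 16;
    for (size_t i = 0; i < blocks; i++) {
        __m128i v = _mm_loadu_si128(s + i);
        _mm_stream_si128(d + i, v);   // NT store: bypasses the cache hierarchy
    }
    _mm_sfence();                     // make the NT stores globally visible
    size_t done = blocks * 16;
    if (len > done)                   // scalar tail for the leftover bytes
        memcpy(static_cast<char *>(dst) + done,
               static_cast<const char *>(src) + done, len - done);
#else
    memcpy(dst, src, len);            // portable fallback
#endif
}
```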

16 of 24

Registers - part 1

  • The CPU has a finite number of registers
  • Load/Store machines can only do ‘work’ in registers (e.g. Arm)
  • x86 has few usable registers, but can work from memory too
  • Memory is much slower than working with data in a register
  • Working with many vars/ptrs = increased register ‘pressure’
  • Each unique C pointer must be loaded into a register to use

17 of 24

Registers - part 2

How to improve performance and reduce register pressure?

  • Reduce inner loop variable count
  • Use pointers to local variables - they’re all offsets from a single pointer (the stack)
  • Allocated memory or fn parameters will each occupy a register when used
  • Use a memory pool or C structure to let the compiler access each sub-unit as an offset from a single pointer
  • Fewer occupied registers = fewer memory accesses = faster code
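A sketch of the structure trick from the list above (the struct and field names are assumed for illustration):

```cpp
// Grouping related state in one structure means one base pointer occupies
// one register; every field becomes a [base + offset] access instead of
// consuming a separate register per pointer/variable.
struct DecoderState {
    const unsigned char *pSrc;
    unsigned char *pDst;
    int width, height, pitch;
};

// The compiler addresses width and height as fixed offsets from the single
// pState register.
int total_pixels(const DecoderState *pState) {
    return pState->width * pState->height;
}
```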

18 of 24

A Brief word on SIMD / Vector code

  • SIMD = Single Instruction Multiple Data
  • Special CPU instructions that work with wide registers (e.g. 128-512 bits)

Why should I care?

  • SIMD can accomplish much more work per clock than scalar instructions (e.g. Intel AVX can do 8x as many floating point operations per instruction)
  • If you don't use them, you're wasting your computer's time and energy
  • You normally have to opt-in (compiler flags) to enable them

19 of 24

Auto-Vectorization? Really?

It tries to turn your C++ code into SIMD instructions

This is one of those topics that really irks me, because people attribute magical powers to auto-vectorizers without understanding their limitations; they're usually very limited in what they can accomplish. Here are 2 examples:

  • It can work with 'plain' C++ code in a limited way, but works considerably better when you give it some hints
  • It usually gives up when there is anything interfering with its starting point of an ideal data loop

20 of 24

Given a simple loop with no info about the value of iLen, Clang generates scalar code plus a run-time test that switches to doing the work with SIMD on groups of 8 values at a time. This example was created with Compiler Explorer. You can try it with the following URL:

https://godbolt.org/z/YFq6Mm

This conservative SIMD code improves the performance a bit over scalar code. It works with 2 x 32-bit words at a time and interleaves enough for dual-issue execution.

21 of 24

In this version, we give Clang enough info to know that we're working with a large data set and it can unleash a wider vector loop on the data. Various values of VECTOR_GROUPING work; I used 32 for this example. It now uses the 128-bit registers efficiently and reduces stalls by enabling quad-issue operations.
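The original slide showed the loop in Compiler Explorer; the code below is an assumed reconstruction of that kind of example, keeping the VECTOR_GROUPING name from the slide (everything else is illustrative):

```cpp
#define VECTOR_GROUPING 32

// Masking off the low bits tells the compiler that iLen is provably a
// multiple of VECTOR_GROUPING, so it can emit a wide vector loop without
// worrying about a scalar tail. __restrict tells it the arrays don't alias.
void add_arrays(int *__restrict pDst, const int *__restrict pSrc, int iLen)
{
    iLen &= ~(VECTOR_GROUPING - 1);   // now a known multiple of 32
    for (int i = 0; i < iLen; i++)
        pDst[i] += pSrc[i];
}
```

Note that any elements beyond the last multiple of VECTOR_GROUPING are deliberately left untouched in this sketch; real code would handle the remainder separately.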

22 of 24

Auto-Vectorizer Limitations

The compiler can deal with certain conditional statements within the main loop, but usually gives up when individual elements of a SIMD register are treated differently. Hand-written SIMD code can do an efficient job of this by using compare+mask operations, but Clang is only able to optimize it by unrolling a scalar loop.
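A sketch of the compare+mask idiom described above, applied to a per-element conditional (the function name is assumed; a scalar fallback is included for non-SSE2 builds):

```cpp
#include <cstddef>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Zero out every negative element. The SIMD path handles the per-element
// 'if' branchlessly: a compare produces an all-ones/all-zeros mask, and an
// AND keeps the positive lanes while zeroing the rest.
void clamp_negatives(int *p, size_t n) {
    size_t i = 0;
#if defined(__SSE2__)
    __m128i zero = _mm_setzero_si128();
    for (; i + 4 <= n; i += 4) {
        __m128i v    = _mm_loadu_si128(reinterpret_cast<__m128i *>(p + i));
        __m128i mask = _mm_cmpgt_epi32(v, zero);  // 0xFFFFFFFF where v > 0
        v = _mm_and_si128(v, mask);               // negatives (and zeros) -> 0
        _mm_storeu_si128(reinterpret_cast<__m128i *>(p + i), v);
    }
#endif
    for (; i < n; i++)                            // scalar tail / fallback
        if (p[i] < 0)
            p[i] = 0;
}
```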

23 of 24

SIMD / Vector Challenges

  • Too much software never makes use of these powerful instructions because it takes extra effort to enable or add them to your code
  • Your algorithm needs to allow for operations to occur in parallel. Imaging/pixels are usually good candidates
  • Compilers are not very good at creating SIMD automatically. This means you may need to add intrinsics to your code to explicitly tell the compiler your intent.
  • Each platform has its own unique SIMD instructions. The implementations are usually similar, but custom code must be written to take advantage of special features in each platform.
  • Your data doesn’t always fit neatly into the SIMD register size, but with some creative thinking you can still take advantage of SIMD instructions

24 of 24

Rules of the Road

Programmers who write efficient code…

  • Do their work in local/register variables or use __restrict pointers
  • Avoid dynamic structures in inner loops
  • Avoid using too many nesting levels
  • Choose variable types and mathematical precision carefully
  • Minimize the number of integer and float conversions
  • Minimize the number of OS calls
  • Minimize the use of divides and trigonometric functions
  • Make use of SIMD and target-specific features when possible
  • Know the nature and statistics of their data
  • Test boundary conditions only on the boundaries
  • *** Test every assumption ***