Writing Performant C++ Code
The road to slow code is paved with wrong assumptions
About Me
Larry Bank
I optimize other people's code for a living
Web: https://www.bitbanksoftware.com
Email: larry@bitbanksoftware.com
Twitter: @fast_code_r_us
Github: bitbank2
Blog: https://bitbanksoftware.blogspot.com
Have empathy for your C++ compiler
This might sound silly on the surface, but it just means to try to see your code from the point of view of the compiler.
Know the cost of your choices
One line of code is not equivalent to another; choose wisely.
A = B + C; is very different from memcpy(pA, pB, iLen);
The first can usually compile down to a single instruction, while the second can take any number of cycles to complete. Modern compilers are smart enough to fix silly mistakes like this, turning a constant-length copy into a single load and store:
memcpy(dest, src, 4);
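A minimal sketch of the point above (the struct and function names are mine, not from the slides): a plain assignment of a small struct gives the compiler full knowledge of the size, and a memcpy() with a constant length of 4 is usually lowered to the same single load/store rather than a library call.

```cpp
#include <cstring>
#include <cstdint>

struct Pixel { uint32_t rgba; };  // 4 bytes

// Plain assignment: typically a single 32-bit load + store
void copy_assign(Pixel *pA, const Pixel *pB) {
    *pA = *pB;
}

// Constant-length memcpy: compilers usually collapse this to the
// same single load/store, not a call into the C library
void copy_memcpy(Pixel *pA, const Pixel *pB) {
    memcpy(pA, pB, 4);
}
```

Comparing the generated assembly for both functions on Compiler Explorer makes the equivalence easy to confirm.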
What does the compiler know?
We have to balance a lot of information to accomplish any task. This collection of info (and assumptions) doesn't always translate well into code. Constants versus variables and volatility are some of the most important info that can get 'lost in translation'. An example:

    int iLen;
    const int LEN = 4;

    memcpy(pDst, pSrc, iLen);    vs    memcpy(pDst, pSrc, LEN);

In the first example, if iLen == 4, the code will call memcpy() and spend hundreds of cycles copying 4 bytes, while in the second, the compiler can probably turn it into a single instruction.
The slowness of OOP
Accessing member variables through an object pointer will generate slow code because the compiler can't know whether anything else aliases those variables (how 'volatile' they actually are), so it must write new values to memory as soon as they change.
The auto-vectorizer won't even touch it. The best the compiler can do is unroll the loop because it's compelled to write the new value to the member variable as it changes.
If you're going to make multiple changes to a member variable, either reserve a local variable to do the work and write it into the structure once you've finished or mark the pointer to the structure/array/variable as __restrict to tell the compiler that the data will only be modified by your function.
For this case, the auto-vectorizer can do a decent job and is free to wait until the end of the loop before it has to write the updated value into the structure.
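A minimal sketch of the two workarounds above (the struct and function names are mine): accumulate into a local variable and store once at the end, or mark the pointers __restrict so the compiler knows nothing else modifies the data. Note that __restrict is a widely supported compiler extension (GCC, Clang, MSVC), not standard C++.

```cpp
#include <cstddef>

struct Counter { int total; };

// Slow pattern: every iteration must store to c->total, because the
// compiler can't prove *c isn't aliased by pData
void sum_member(Counter *c, const int *pData, size_t len) {
    for (size_t i = 0; i < len; i++)
        c->total += pData[i];
}

// Workaround 1: do the work in a local (i.e. a register),
// then write the result into the structure once at the end
void sum_local(Counter *c, const int *pData, size_t len) {
    int total = c->total;
    for (size_t i = 0; i < len; i++)
        total += pData[i];
    c->total = total;  // single store
}

// Workaround 2: __restrict promises the pointers don't alias, so the
// compiler is free to keep the total in a register and vectorize
void sum_restrict(Counter *__restrict c, const int *__restrict pData, size_t len) {
    for (size_t i = 0; i < len; i++)
        c->total += pData[i];
}
```

All three produce the same result; the difference shows up in the generated loop body.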
Know your target machine
Code which is aware of your target machine's capabilities can make a huge difference in performance. Some of the features to think about:
Not all memory is created equal
Is your code running on a PC or embedded microcontroller? This can make a huge difference in how you design your project.
Portable and Target aware?
Let's talk math...
System calls
Slow:

    int my_array[MY_SIZE];
    for (i = 0; i < MY_SIZE; i++) {
        fread(&my_array[i], sizeof(int), 1, h);
    }

Fast:

    int my_array[MY_SIZE];
    fread(my_array, sizeof(int), MY_SIZE, h);
Inlined Functions
Inlining functions tells the compiler to replicate the code for that function at the point it’s used instead of using a call+return to a single copy of the code. The code size increases, but the overhead of pushing+popping parameters and scratch registers from the stack can be worth the extra size if it’s used very frequently. Declaring a function static will usually let the compiler decide if it should be inlined too.
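A quick sketch of the idea (my own toy helper, not from the slides): a small, hot function declared static has internal linkage, so the compiler can inline every call and often drop the standalone copy entirely, leaving the loop body free of call/return overhead.

```cpp
// Small helper: static lets the compiler inline each call site
// and discard the out-of-line copy if nothing else needs it
static int clamp255(int v) {
    return v < 0 ? 0 : (v > 255 ? 255 : v);
}

// Hot loop: after inlining, each iteration is just a couple of
// compare/select operations, with no stack traffic
void saturate(int *p, int n) {
    for (int i = 0; i < n; i++)
        p[i] = clamp255(p[i]);
}
```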
Non-Temporal Writes
When working with a multi-core CPU, there is the issue of cache coherency - whether each core's cached copy of data and the DRAM contents remain consistent across multiple CPUs. Intel and Arm created an optional type of data write called NT (non-temporal). NT writes skip cache coherence, but in exchange, they can complete faster. You can make use of such writes when you know that the data won't be read immediately and future reads of that data will come from DRAM (not cache).
Example use: converting a large image buffer to a different pixel format (color to grayscale).
Intel intrinsic: _mm_stream_si128
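A minimal sketch of an NT write (x86/SSE2 only; the function and buffer names are mine): copy 16-byte blocks with _mm_stream_si128, which bypasses the cache, then issue _mm_sfence so the streamed stores are globally visible before anything reads the destination. The destination addresses must be 16-byte aligned.

```cpp
#include <emmintrin.h>  // SSE2: _mm_stream_si128, _mm_sfence
#include <cstddef>
#include <cstdint>

// Copy 'count' 16-byte blocks using non-temporal stores.
// src and dst must be 16-byte aligned; dst won't pollute the cache.
void nt_copy(const uint8_t *src, uint8_t *dst, size_t count) {
    const __m128i *s = reinterpret_cast<const __m128i *>(src);
    __m128i *d = reinterpret_cast<__m128i *>(dst);
    for (size_t i = 0; i < count; i++) {
        __m128i v = _mm_load_si128(&s[i]);  // normal (cached) read
        _mm_stream_si128(&d[i], v);         // non-temporal write
    }
    _mm_sfence();  // order the NT stores before later reads/writes
}
```

For a real grayscale conversion the load side would also do the pixel math; the streaming-store pattern stays the same.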
Registers - part 1
Registers - part 2
How to improve performance and reduce register pressure?
A Brief word on SIMD / Vector code
Why should I care?
Auto-Vectorization? Really?
It tries to turn your C++ code into SIMD instructions
This is one of those topics that really irks me: people attribute magical powers to auto-vectorizers without understanding their limitations, when in practice they're quite limited in what they can accomplish. Here are 2 examples:
Given a simple loop with no info about the value of iLen, Clang generates scalar code plus a runtime test that decides whether to do the work with SIMD on groups of 8 values at a time. This example was created with Compiler Explorer. You can try it with the following URL:
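The slide's exact code isn't reproduced here; a loop of the shape described would look something like this (hypothetical reconstruction, names mine), with iLen unknown at compile time:

```cpp
// iLen is a runtime value, so Clang must emit a scalar loop plus a
// runtime length check that branches into a SIMD path for groups of
// elements when iLen is large enough
void add_arrays(int *pDst, const int *pSrc, int iLen) {
    for (int i = 0; i < iLen; i++)
        pDst[i] += pSrc[i];
}
```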
This conservative SIMD code improves the performance a bit over scalar code. It works with 2 x 32-bit words at a time and interleaves enough for dual-issue execution.
In this version, we give Clang enough info to know that we're working with a large data set and it can unleash a wider vector loop on the data. Various values of VECTOR_GROUPING work; I used 32 for this example. It now uses the 128-bit registers efficiently and reduces stalls by enabling quad-issue operations.
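One way to give the compiler that extra info (a hypothetical reconstruction, not the slide's exact code; VECTOR_GROUPING is the constant mentioned above) is to round the length down to a multiple of the grouping, so the vector loop needs no scalar tail:

```cpp
#define VECTOR_GROUPING 32

// Rounding iLen down to a multiple of 32 tells Clang the trip count
// is always a whole number of vector groups, so it can emit a wide,
// unrolled SIMD loop with no scalar remainder handling
void add_arrays_grouped(int *pDst, const int *pSrc, int iLen) {
    iLen &= ~(VECTOR_GROUPING - 1);
    for (int i = 0; i < iLen; i++)
        pDst[i] += pSrc[i];
}
```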
Auto-Vectorizer Limitations
The compiler can deal with certain conditional statements within the main loop, but usually gives up when individual elements of a SIMD register are treated differently. Hand-written SIMD code can do an efficient job of this by using compare+mask operations, but Clang is only able to optimize it by unrolling a scalar loop.
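A sketch of the compare+mask idea (SSE2; my own example, not from the slides): clamp negative values to zero across 4 lanes at once, with no per-element branch. The compare builds a per-lane mask, and the AND applies the choice branchlessly.

```cpp
#include <emmintrin.h>  // SSE2

// Replace negative elements with zero, 4 ints per iteration
void clamp_negative(int *p, int n) {
    __m128i zero = _mm_setzero_si128();
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i v = _mm_loadu_si128(reinterpret_cast<__m128i *>(&p[i]));
        __m128i mask = _mm_cmpgt_epi32(v, zero);  // 0xFFFFFFFF where v > 0
        v = _mm_and_si128(v, mask);               // keep positives, zero the rest
        _mm_storeu_si128(reinterpret_cast<__m128i *>(&p[i]), v);
    }
    for (; i < n; i++)  // scalar tail for leftover elements
        if (p[i] < 0) p[i] = 0;
}
```

Every lane takes the same instructions regardless of its value, which is exactly what the branchy scalar version prevents the auto-vectorizer from producing.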
SIMD / Vector Challenges
Rules of the Road
Programmers who write efficient code…