Efficient implementation and code optimization

Yevhen Zadorozhnii

The summer programming school, Uzhhorod, 2020

League 1. Day 4.

Introduction

Why is this topic important?

  • It's dangerous to write a solution that is much slower than it could be.
  • In some problems you may face tight time limits, because the author tried to cut off solutions with a worse asymptotic complexity.
  • In rare cases, an efficient implementation allows you to pass a solution with worse asymptotics than the author expected.
  • It's useful to evaluate not only the asymptotic time complexity, but also the constant factor and the expected execution time.
  • Execution times of different (or sometimes even the same) algorithms with the same time complexity can differ by tens or even hundreds of times.

But...!

  • “Premature optimization is the root of all evil” - Donald Knuth, 1974
  • Your main goal is to solve the task
  • In most cases a correct idea alone gives you a huge time margin
  • No need to optimize code that will never be a bottleneck
  • Effective > Efficient

Being effective is about doing the right things, while being efficient is about doing things right.

The main principles of efficient implementation

  1. The fewer operations, the faster
  2. Efficiency of working with memory
  3. Avoiding slow operations, slow data structures and slow approaches

C++ STL features

std::vector

  • Works like an array of dynamic size
  • Uses dynamic allocations (memory is allocated on the heap)
  • Its inner buffer has some capacity; if the size reaches the capacity on push_back(), the buffer is reallocated with roughly twice the capacity
  • A vector doesn’t deallocate its buffer after pop_back() or clear()
  • You can reserve buffer capacity in advance with the .reserve() method (see the sketch below)
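
A minimal sketch of how capacity growth and reserve() interact; the exact growth factor is implementation-defined (libstdc++ doubles, other standard libraries may use a smaller factor):

#include <cstdio>
#include <vector>

int main() {
    std::vector<int> v;
    for (int i = 0; i < 17; ++i) {
        v.push_back(i);
        // Capacity grows geometrically: reallocation happens only when size hits capacity.
        std::printf("size=%zu capacity=%zu\n", v.size(), v.capacity());
    }

    std::vector<int> w;
    w.reserve(1000000);      // a single allocation up front
    for (int i = 0; i < 1000000; ++i)
        w.push_back(i);      // no reallocations happen in this loop
}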

std::vector

  • Using a vector as a large array is usually just as fast as using a global static array. Accessing its data through a pointer is as fast as for any other data pointer.
  • Using a vector for small data of known size, like 2, 5 or 16 elements, is much slower than a plain array.
  • The worst idea is to use std::vector<std::vector<int>> for something like 2x2 matrices (a better alternative is sketched below).
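
For small fixed-size data, std::array keeps everything on the stack. A sketch of a 2x2 matrix type (the Mat2 name and long long element type are illustrative):

#include <array>

// A 2x2 matrix as nested std::array: no heap allocations at all, unlike
// std::vector<std::vector<int>>, which makes one allocation per row plus one for the outer vector.
using Mat2 = std::array<std::array<long long, 2>, 2>;

Mat2 multiply(const Mat2& a, const Mat2& b) {
    Mat2 c{};  // value-initialized to zeros
    for (int i = 0; i < 2; ++i)
        for (int k = 0; k < 2; ++k)
            for (int j = 0; j < 2; ++j)
                c[i][j] += a[i][k] * b[k][j];
    return c;
}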

std::map / std::set

  • Based on a self-balancing binary search tree (a red-black tree in common implementations)
  • Uses dynamic allocations (one per node)
  • Has a large constant factor
  • Has a significant memory overhead

std::map / std::set

  • The erase(val) method returns the number of erased elements
  • The insert(val) method returns pair<iterator, bool>, where the bool value is true if the insertion took place
  • That allows optimizing constructions of the following kind (an optimized version is sketched after the snippet):
  • There is an insert-with-hint method insert(iterator, val), which helps optimize insertion of several sorted elements

int value = 0;
set<int> s;
if (!s.count(value)) {  // first tree descent
    s.insert(value);    // second tree descent over the same path
    // do_something_else
}
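
A sketch of the same logic with a single lookup using the bool returned by insert(), plus the hint form for pre-sorted input (the sorted_values data is illustrative):

#include <set>
#include <vector>
using namespace std;

int main() {
    set<int> s;
    int value = 0;
    // One tree descent instead of count() + insert():
    if (s.insert(value).second) {
        // do_something_else: the element was actually inserted
    }

    // Insert with hint: when values arrive in sorted order, passing s.end()
    // as the hint makes each insertion amortized O(1) instead of O(log N).
    vector<int> sorted_values = {1, 2, 3, 5, 8};
    for (int x : sorted_values)
        s.insert(s.end(), x);
}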

std::unordered_map / std::unordered_set

  • Based on a hash table implementation
  • Even with a better time complexity, it gives a speedup only in rare cases
  • An anti-test may be constructed that forces O(N^2) time complexity for N insertions
  • Its usage may be optimized with the max_load_factor(value), rehash(count) and reserve(count) methods (see the sketch below)
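
A minimal sketch of pre-sizing the table so that insertions never trigger rehashing (the element count and load factor are illustrative):

#include <unordered_set>

int main() {
    const int n = 1000000;
    std::unordered_set<int> s;
    s.max_load_factor(0.25);  // fewer collisions at the cost of more memory
    s.reserve(n);             // allocates buckets for n elements up front: no rehashing below
    for (int i = 0; i < n; ++i)
        s.insert(i);
}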

std::string

  • Old versions of gcc (before C++11) had a tricky implementation using the copy-on-write technique; modern implementations instead store short strings inline on the stack (small string optimization).
  • Apart from that, the current implementation and behavior are very similar to std::vector.
  • The operation s = s + ‘c’ has O(size(s)) time complexity, while s += ‘c’ is amortized O(1) (see the sketch below).
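
A sketch of why this matters when building a string in a loop:

#include <string>

std::string build(int n) {
    std::string s;
    for (int i = 0; i < n; ++i) {
        // s = s + 'a';  // copies the whole string every iteration: O(n^2) in total
        s += 'a';        // appends in place: amortized O(1), so O(n) in total
    }
    return s;
}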

std::vector<bool>

  • A very strange specialization of std::vector. Kind of a bad design example :)
  • Uses memory compression (i.e. 8 elements per byte). Because of that, its interface and usage differ a bit from other vectors.
  • Compared to std::bitset it has a dynamic size, but doesn’t implement binary operations.

std::bitset

  • Doesn’t use dynamic memory allocations
  • Implements the binary arithmetic operations |, &, ^, <<, >>
  • Is quite efficient and usually not worse than a custom implementation
  • You can cast a bitset pointer to int* and implement some custom specific logic (correctness isn’t guaranteed by the standard, but usually this works)
  • In competitive programming it combines dynamic programming logic, an efficient boolean array and set functionality (see the sketch below).
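
A sketch of the classic bitset DP: subset-sum reachability, where each shift-and-or updates 64 states per machine word (the MAXSUM bound is illustrative):

#include <bitset>
#include <vector>

const int MAXSUM = 100000;  // upper bound on the total sum, chosen per problem

// reachable[s] == true iff some subset of items sums to exactly s
std::bitset<MAXSUM + 1> subsetSums(const std::vector<int>& items) {
    std::bitset<MAXSUM + 1> reachable;
    reachable[0] = true;
    for (int x : items)
        reachable |= reachable << x;  // 64 DP transitions per machine word
    return reachable;
}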

std::array

  • A wrapper of C-style arrays T[size] with the interface typical for STL containers
  • Doesn’t use dynamic memory allocations
  • Implements comparison operations, so it may be used as an element of a set or a key of a map, where it is much more efficient than a vector (see the sketch below).
  • array<int, 4> looks much better than pair<int, pair<int, pair<int, int>>>
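
A minimal sketch of std::array as a map key (the State alias is illustrative):

#include <array>
#include <cstdio>
#include <map>

using State = std::array<int, 4>;  // e.g. a DP state packed into one comparable value

int main() {
    // Keys are compared lexicographically; unlike vector keys, no heap allocation per key.
    std::map<State, long long> cnt;
    cnt[{1, 2, 3, 4}] += 1;
    cnt[{1, 2, 3, 5}] += 1;
    std::printf("%zu\n", cnt.size());  // prints 2
}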

Fast Input/Output

Fast Input/Output

  • Standard input and output in C/C++ use buffering as an optimization.
  • std::endl flushes the buffer, which leads to TL on output with a lot of lines; prefer '\n'.
  • Alternating cin/cout usage forces a buffer flush; this behavior may be disabled by cin.tie(0). Don’t use it in interactive tasks!
  • printf/scanf and cin/cout are synchronized and may be used in the same program. This synchronization may be disabled with ios_base::sync_with_stdio(0); that gives a significant speedup to cin/cout (the usual setup is sketched below).
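
The usual competitive programming setup; after sync_with_stdio(false), don't mix scanf/printf with cin/cout:

#include <iostream>

int main() {
    std::ios_base::sync_with_stdio(false);  // cin/cout no longer synced with C stdio
    std::cin.tie(nullptr);                  // cout is not flushed before every cin read
    int n;
    std::cin >> n;
    std::cout << n << '\n';                 // '\n' instead of std::endl: no flush per line
}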

Fast Input/Output

  • A custom implementation of input and output may be much faster in the competitive programming setting.
  • A huge speedup may be achieved just by using getchar().
  • The fastest implementations should use their own buffering instead.

Example of custom input implementation

template<class T = int>
T readInt() {
    const int BSIZE = 4096;
    static char buffer[BSIZE];
    static char* bptr = buffer + BSIZE;
    auto getChar = []() {
        if (bptr == buffer + BSIZE) {       // buffer exhausted: refill it
            std::memset(buffer, 0, BSIZE);  // leftover zeros act as end-of-input markers
            cin.read(buffer, BSIZE);
            bptr = buffer;
        }
        return *bptr++;
    };
    // skip everything before the first digit or minus sign
    char c = getChar();
    while (c && (c < '0' || c > '9') && c != '-') c = getChar();
    bool minus = false;
    if (c == '-') minus = true, c = getChar();
    T res = 0;
    while (c >= '0' && c <= '9') { res = res * 10 + c - '0'; c = getChar(); }
    return minus ? -res : res;
}

Efficient memory

Memory allocations

  • Why are they slow? First, let’s imagine how the memory management task looks from a competitive programmer’s point of view.
  • We have a huge buffer of memory, i.e. a big array of bytes.
  • An allocation is a query of the kind “find and reserve some empty segment of size N”.
  • A deallocation is a query of the kind “given the first index of some segment, find the length of the segment that was allocated there and free it”.
  • It’s already not such a simple task, right? And it’s obviously not solvable without some additional memory overhead.
  • Actually, everything is even harder, because your program doesn’t own a “huge buffer of memory” but requests memory pages from the operating system, which has to guarantee concurrency and safety as well.

Memory access

  • We assume random memory access works in O(1) time.
  • It’s kind of true, but not completely :)
  • Because of the different levels of caches, random access behaves more like O(sqrt(N)), where N is the size of the memory region we read from.
  • That means that reading a value at a random position of a 100MB array can be up to 10 times slower than in a 1MB array.
  • That’s why random access may be very slow.

Hints on working with memory

  • Because of the previous points, vector<bool> and std::bitset may be more efficient, for example for a huge prime sieve, even though they use additional computations to get or set a value.
  • That’s also why it is sometimes better to use smaller data types for big arrays of data.
  • Sequential memory access is the most efficient and the best for caching (see the sketch below).
  • Dynamic memory allocations lead to much more random access.
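
A classic illustration of the sequential access point: both functions touch the same bytes, but the row-wise loop walks memory in order while the column-wise loop jumps by a whole row per step, so the first is typically several times faster (the size N is illustrative):

#include <vector>

const int N = 4096;

long long sumRowMajor(const std::vector<std::vector<int>>& a) {
    long long s = 0;
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            s += a[i][j];  // sequential access: cache lines are fully used
    return s;
}

long long sumColMajor(const std::vector<std::vector<int>>& a) {
    long long s = 0;
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
            s += a[i][j];  // strided access: a cache miss on almost every read
    return s;
}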

Hints on working with memory

  • Instead of using dynamic memory allocations we can use an object pool approach or bump memory allocation, as in the sketch below.
  • One example is dynamic tree structures, like the cartesian tree, the implicit segment tree, other binary trees or automata. Instead of allocating new nodes with the new operator, we can reserve a huge static array of elements and take pointers to its elements. Be sure to foresee the correct size for such an array!
  • With this approach we can even ignore deallocations and memory leaks to achieve the best performance.
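
A minimal sketch of such a node pool for a dynamic tree (the Node fields and the MAXNODES bound are illustrative):

const int MAXNODES = 1 << 22;   // must be an upper bound on the nodes ever created

struct Node {
    long long sum = 0;
    Node* left = nullptr;
    Node* right = nullptr;
};

Node pool[MAXNODES];            // one big static allocation, made at program start
int poolUsed = 0;

Node* newNode() {
    return &pool[poolUsed++];   // bump allocation: no heap calls, and nothing is ever freed
}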

Code optimization

Introduction

  • Remember that a TL verdict may mean that you have a bug in your code. Check for bugs before trying to optimize it!
  • Use low-level optimization only if you are sure you don’t know how to find a better solution or improve the time complexity of the current one.

Operations cost

  • Not all O(1) operations are of the same efficiency level
  • Plus, minus and assignment are the most efficient
  • Multiplication is much more expensive
  • Division and modulo operations are extremely expensive, but they can be optimized by the compiler if you divide by a constant.
  • Floating point operations are more expensive than integer ones
  • Branching (i.e. complex `if` constructions) is quite expensive as well; try to replace conditions with simple comparison-assignment combinations or with some arithmetic.
    • For example: a[i] = a[i] >= x ? a[i] - x : a[i]; This code has no branching and may even be optimized with vectorized instructions (see the sketch below).
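
A sketch of that branchless pattern applied to a whole array; the compiler can turn it into compare-and-subtract vector instructions:

// Reduce every element modulo x, assuming 0 <= a[i] < 2 * x beforehand.
void reduceAll(int* a, int n, int x) {
    for (int i = 0; i < n; ++i)
        a[i] = a[i] >= x ? a[i] - x : a[i];  // conditional move, no branch to mispredict
}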

Vectorization

  • We can allow the compiler to use some extreme optimizations like loop unrolling and vectorized instructions:

#pragma GCC optimize("Ofast")
#pragma GCC target("sse,sse2,sse3,ssse3,sse4,avx")

  • The processor may process several operations with one such instruction, because it uses 128- or 256-bit registers for them.
  • For example, the following operations may be computed with a single arithmetic instruction:

c[0] = a[0] + b[0]
c[1] = a[1] + b[1]
c[2] = a[2] + b[2]
c[3] = a[3] + b[3]

Vectorization

  • Sometimes they may slow your program down!
  • With clear enough code the compiler does auto-vectorization for you.
  • Code with branching will almost never be vectorized.
  • Integer division and remainder operations are not vectorizable!
  • Use std::memset, std::memcpy and std::memcmp for the corresponding operations on large data; these functions are already well optimized.

Helping the compiler to vectorize your code

  • Remove branching
  • Make the code as simple as you can. Try to make your loops have a clearly defined number of iterations: for (int i = 0; i < n; ++i)
  • Try to remove redundant indexing in the bottleneck. For example, instead of writing a[r - l][s + i] in a loop over i, create a new pointer variable: const auto* aptr = a[r - l] + s; (see the sketch below)
  • Use constants, constexpr and compile-time evaluations.
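
A sketch of the pointer-hoisting hint; the names follow the bullet above and are otherwise illustrative:

// Before: the row pointer a[r - l] and the offset s are re-derived on every iteration.
void mulBefore(int** a, const int* b, int* out, int n, int r, int l, int s) {
    for (int i = 0; i < n; ++i)
        out[i] = a[r - l][s + i] * b[i];
}

// After: the invariant addressing is hoisted out of the loop; the body becomes
// a plain linear scan that is much easier for the compiler to auto-vectorize.
void mulAfter(int** a, const int* b, int* out, int n, int r, int l, int s) {
    const auto* aptr = a[r - l] + s;
    for (int i = 0; i < n; ++i)
        out[i] = aptr[i] * b[i];
}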

Helping the compiler

  • Extracting the bottleneck code into a function may help the compiler optimize it better.
  • It tells the compiler that it may use all the registers for optimization, and makes the code easier to analyze and optimize (see the sketch below).
  • Just make sure that this function is not too cheap (like a small O(1) piece), so that calling it doesn’t become a new bottleneck.
  • In that case we need to be sure of the opposite: that the function is inlined.
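
A sketch of both points (the function names are illustrative):

// A hot loop extracted into a self-contained function: simple parameters, no outside
// state, so the compiler can allocate registers freely and auto-vectorize the body.
long long dotProduct(const int* a, const int* b, int n) {
    long long s = 0;
    for (int i = 0; i < n; ++i)
        s += (long long)a[i] * b[i];
    return s;
}

// A tiny O(1) helper is the opposite case: here we care that it gets inlined,
// so that the call overhead itself doesn't become the bottleneck.
inline long long sq(long long x) { return x * x; }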