Efficient implementation and code optimization

Yevhen Zadorozhnii

The summer programming school, Uzhhorod, 2020

League 1. Day 4.

Introduction

Why is this topic important?

  • It's dangerous to write a solution that is much slower than it could be.
  • In some problems you may face tight time limits, because the author tried to cut off solutions with a worse asymptotic complexity.
  • In rare cases, an efficient implementation allows you to pass a solution with worse asymptotics than the author expected.
  • It's useful to evaluate not only the asymptotic time complexity, but also the constant factor and the expected execution time.
  • Execution times of different (or sometimes even the same) algorithms with the same time complexity can differ by tens or even hundreds of times.

But...!

  • “Premature optimization is the root of all evil” - Donald Knuth, 1974
  • Your main goal is to solve the task
  • In most cases a correct idea alone gives you a huge time margin
  • No need to optimize code that will never be a bottleneck
  • Effective > Efficient

Being effective is about doing the right things, while being efficient is about doing things right.

The main principles of efficient implementation

  1. The fewer operations, the faster
  2. Efficiency of working with memory
  3. Avoiding slow operations, slow data structures and slow approaches

C++ STL features

std::vector

  • Works like an array of dynamic size
  • Uses dynamic allocations (memory is allocated on the heap)
  • Its inner buffer has some capacity; if the size reaches the capacity on push_back(), the buffer is reallocated with roughly twice the capacity
  • A vector doesn’t deallocate its buffer after pop_back() or clear()
  • You can reserve buffer capacity in advance with the .reserve() method (see the sketch below)
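
A minimal sketch of how capacity growth and reserve() interact; the exact growth factor is implementation-defined (libstdc++ doubles, other standard libraries may use a smaller factor):

#include <cstdio>
#include <vector>

int main() {
    std::vector<int> v;
    for (int i = 0; i < 17; ++i) {
        v.push_back(i);
        // Capacity grows geometrically: reallocation happens only when size hits capacity.
        std::printf("size=%zu capacity=%zu\n", v.size(), v.capacity());
    }

    std::vector<int> w;
    w.reserve(1000000);      // a single allocation up front
    for (int i = 0; i < 1000000; ++i)
        w.push_back(i);      // no reallocations happen in this loop
}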

std::vector

  • Using a vector as a large array is usually just as fast as using a global static array. Accessing its data through a pointer is as fast as for any other data pointer.
  • Using a vector for small data of known size, like 2, 5 or 16 elements, is much slower than a plain array.
  • The worst idea is to use std::vector<std::vector<int>> for something like 2x2 matrices (a better alternative is sketched below).
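
For small fixed-size data, std::array keeps everything on the stack. A sketch of a 2x2 matrix type (the Mat2 name and long long element type are illustrative):

#include <array>

// A 2x2 matrix as nested std::array: no heap allocations at all, unlike
// std::vector<std::vector<int>>, which makes one allocation per row plus one for the outer vector.
using Mat2 = std::array<std::array<long long, 2>, 2>;

Mat2 multiply(const Mat2& a, const Mat2& b) {
    Mat2 c{};  // value-initialized to zeros
    for (int i = 0; i < 2; ++i)
        for (int k = 0; k < 2; ++k)
            for (int j = 0; j < 2; ++j)
                c[i][j] += a[i][k] * b[k][j];
    return c;
}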

std::map / std::set

  • Based on a self-balancing binary search tree (a red-black tree in common implementations)
  • Uses dynamic allocations (one per node)
  • Has a large constant factor
  • Has a significant memory overhead

std::map / std::set

  • The erase(val) method returns the number of erased elements
  • The insert(val) method returns pair<iterator, bool>, where the bool value is true if the insertion took place
  • That allows optimizing constructions of the following kind (an optimized version is sketched after the snippet):
  • There is an insert-with-hint method insert(iterator, val), which helps optimize insertion of several sorted elements

int value = 0;
set<int> s;
if (!s.count(value)) {  // first tree descent
    s.insert(value);    // second tree descent over the same path
    // do_something_else
}
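
A sketch of the same logic with a single lookup using the bool returned by insert(), plus the hint form for pre-sorted input (the sorted_values data is illustrative):

#include <set>
#include <vector>
using namespace std;

int main() {
    set<int> s;
    int value = 0;
    // One tree descent instead of count() + insert():
    if (s.insert(value).second) {
        // do_something_else: the element was actually inserted
    }

    // Insert with hint: when values arrive in sorted order, passing s.end()
    // as the hint makes each insertion amortized O(1) instead of O(log N).
    vector<int> sorted_values = {1, 2, 3, 5, 8};
    for (int x : sorted_values)
        s.insert(s.end(), x);
}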

std::unordered_map / std::unordered_set

  • Based on a hash table implementation
  • Even with a better time complexity, it gives a speedup only in rare cases
  • An anti-test may be constructed that forces O(N^2) time complexity for N insertions
  • Its usage may be optimized with the max_load_factor(value), rehash(count) and reserve(count) methods (see the sketch below)
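
A minimal sketch of pre-sizing the table so that insertions never trigger rehashing (the element count and load factor are illustrative):

#include <unordered_set>

int main() {
    const int n = 1000000;
    std::unordered_set<int> s;
    s.max_load_factor(0.25);  // fewer collisions at the cost of more memory
    s.reserve(n);             // allocates buckets for n elements up front: no rehashing below
    for (int i = 0; i < n; ++i)
        s.insert(i);
}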

std::string

  • Old versions of gcc (before C++11) had a tricky implementation using the copy-on-write technique; modern implementations instead store short strings inline on the stack (small string optimization).
  • Apart from that, the current implementation and behavior are very similar to std::vector.
  • The operation s = s + ‘c’ has O(size(s)) time complexity, while s += ‘c’ is amortized O(1) (see the sketch below).
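
A sketch of why this matters when building a string in a loop:

#include <string>

std::string build(int n) {
    std::string s;
    for (int i = 0; i < n; ++i) {
        // s = s + 'a';  // copies the whole string every iteration: O(n^2) in total
        s += 'a';        // appends in place: amortized O(1), so O(n) in total
    }
    return s;
}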

std::vector<bool>

  • A very strange specialization of std::vector. Kind of a bad design example :)
  • Uses memory compression (i.e. 8 elements per byte). Because of that, its interface and usage differ a bit from other vectors.
  • Compared to std::bitset it has a dynamic size, but doesn’t implement binary operations.

std::bitset

  • Doesn’t use dynamic memory allocations
  • Implements the binary arithmetic operations |, &, ^, <<, >>
  • Is quite efficient and usually not worse than a custom implementation
  • You can cast a bitset pointer to int* and implement some custom specific logic (correctness isn’t guaranteed by the standard, but usually this works)
  • In competitive programming it combines dynamic programming logic, an efficient boolean array and set functionality (see the sketch below).
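
A sketch of the classic bitset DP: subset-sum reachability, where each shift-and-or updates 64 states per machine word (the MAXSUM bound is illustrative):

#include <bitset>
#include <vector>

const int MAXSUM = 100000;  // upper bound on the total sum, chosen per problem

// reachable[s] == true iff some subset of items sums to exactly s
std::bitset<MAXSUM + 1> subsetSums(const std::vector<int>& items) {
    std::bitset<MAXSUM + 1> reachable;
    reachable[0] = true;
    for (int x : items)
        reachable |= reachable << x;  // 64 DP transitions per machine word
    return reachable;
}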

std::array

  • A wrapper of C-style arrays T[size] with the interface typical for STL containers
  • Doesn’t use dynamic memory allocations
  • Implements comparison operations, so it may be used as an element of a set or a key of a map, where it is much more efficient than a vector (see the sketch below).
  • array<int, 4> looks much better than pair<int, pair<int, pair<int, int>>>
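
A minimal sketch of std::array as a map key (the State alias is illustrative):

#include <array>
#include <cstdio>
#include <map>

using State = std::array<int, 4>;  // e.g. a DP state packed into one comparable value

int main() {
    // Keys are compared lexicographically; unlike vector keys, no heap allocation per key.
    std::map<State, long long> cnt;
    cnt[{1, 2, 3, 4}] += 1;
    cnt[{1, 2, 3, 5}] += 1;
    std::printf("%zu\n", cnt.size());  // prints 2
}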

Fast Input/Output

Fast Input/Output

  • Standard input and output in C/C++ use buffering as an optimization.
  • std::endl flushes the buffer, which leads to TL on output with a lot of lines; prefer '\n'.
  • Alternating cin/cout usage forces a buffer flush; this behavior may be disabled by cin.tie(0). Don’t use it in interactive tasks!
  • printf/scanf and cin/cout are synchronized and may be used in the same program. This synchronization may be disabled with ios_base::sync_with_stdio(0); that gives a significant speedup to cin/cout (the usual setup is sketched below).
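
The usual competitive programming setup; after sync_with_stdio(false), don't mix scanf/printf with cin/cout:

#include <iostream>

int main() {
    std::ios_base::sync_with_stdio(false);  // cin/cout no longer synced with C stdio
    std::cin.tie(nullptr);                  // cout is not flushed before every cin read
    int n;
    std::cin >> n;
    std::cout << n << '\n';                 // '\n' instead of std::endl: no flush per line
}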

Fast Input/Output

  • A custom implementation of input and output may be much faster in the competitive programming setting.
  • A huge speedup may be achieved just by using getchar().
  • The fastest implementations should use their own buffering instead.

Example of custom input implementation

template<class T = int>
T readInt() {
    const int BSIZE = 4096;
    static char buffer[BSIZE];
    static char* bptr = buffer + BSIZE;
    auto getChar = []() {
        if (bptr == buffer + BSIZE) {       // buffer exhausted: refill it
            std::memset(buffer, 0, BSIZE);  // leftover zeros act as end-of-input markers
            cin.read(buffer, BSIZE);
            bptr = buffer;
        }
        return *bptr++;
    };
    // skip everything before the first digit or minus sign
    char c = getChar();
    while (c && (c < '0' || c > '9') && c != '-') c = getChar();
    bool minus = false;
    if (c == '-') minus = true, c = getChar();
    T res = 0;
    while (c >= '0' && c <= '9') { res = res * 10 + c - '0'; c = getChar(); }
    return minus ? -res : res;
}

Efficient memory

Memory allocations

  • Why are they slow? First, let’s imagine how the memory management task looks from a competitive programmer’s point of view.
  • We have a huge buffer of memory, i.e. a big array of bytes.
  • An allocation is a query of the kind “find and reserve some empty segment of size N”.
  • A deallocation is a query of the kind “given the first index of some segment, find the length of the segment that was allocated there and free it”.
  • It’s already not such a simple task, right? And it’s obviously not solvable without some additional memory overhead.
  • Actually, everything is even harder, because your program doesn’t own a “huge buffer of memory” but requests memory pages from the operating system, which has to guarantee concurrency and safety as well.

Memory access

  • We assume random memory access works in O(1) time.
  • It’s kind of true, but not completely :)
  • Because of the different levels of caches, random access behaves more like O(sqrt(N)), where N is the size of the memory region we read from.
  • That means that reading a value at a random position of a 100MB array can be up to 10 times slower than in a 1MB array.
  • That’s why random access may be very slow.

Hints on working with memory

  • Because of the previous points, vector<bool> and std::bitset may be more efficient, for example for a huge prime sieve, even though they use additional computations to get or set a value.
  • That’s also why it is sometimes better to use smaller data types for big arrays of data.
  • Sequential memory access is the most efficient and the best for caching (see the sketch below).
  • Dynamic memory allocations lead to much more random access.
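
A classic illustration of the sequential access point: both functions touch the same bytes, but the row-wise loop walks memory in order while the column-wise loop jumps by a whole row per step, so the first is typically several times faster (the size N is illustrative):

#include <vector>

const int N = 4096;

long long sumRowMajor(const std::vector<std::vector<int>>& a) {
    long long s = 0;
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            s += a[i][j];  // sequential access: cache lines are fully used
    return s;
}

long long sumColMajor(const std::vector<std::vector<int>>& a) {
    long long s = 0;
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
            s += a[i][j];  // strided access: a cache miss on almost every read
    return s;
}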

Hints on working with memory

  • Instead of using dynamic memory allocations we can use an object pool approach or bump memory allocation, as in the sketch below.
  • One example is dynamic tree structures, like the cartesian tree, the implicit segment tree, other binary trees or automata. Instead of allocating new nodes with the new operator, we can reserve a huge static array of elements and take pointers to its elements. Be sure to foresee the correct size for such an array!
  • With this approach we can even ignore deallocations and memory leaks to achieve the best performance.
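
A minimal sketch of such a node pool for a dynamic tree (the Node fields and the MAXNODES bound are illustrative):

const int MAXNODES = 1 << 22;   // must be an upper bound on the nodes ever created

struct Node {
    long long sum = 0;
    Node* left = nullptr;
    Node* right = nullptr;
};

Node pool[MAXNODES];            // one big static allocation, made at program start
int poolUsed = 0;

Node* newNode() {
    return &pool[poolUsed++];   // bump allocation: no heap calls, and nothing is ever freed
}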

Code optimization

Introduction

  • Remember that a TL verdict may mean that you have a bug in your code. Check for bugs before trying to optimize it!
  • Use low-level optimization only if you are sure you don’t know how to find a better solution or improve the time complexity of the current one.

Operations cost

  • Not all O(1) operations are of the same efficiency level
  • Plus, minus and assignment are the most efficient
  • Multiplication is much more expensive
  • Division and modulo operations are extremely expensive, but they can be optimized by the compiler if you divide by a constant.
  • Floating point operations are more expensive than integer ones
  • Branching (i.e. complex `if` constructions) is quite expensive as well; try to replace conditions with simple comparison-assignment combinations or with some arithmetic.
    • For example: a[i] = a[i] >= x ? a[i] - x : a[i]; This code has no branching and may even be optimized with vectorized instructions (see the sketch below).
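
A sketch of that branchless pattern applied to a whole array; the compiler can turn it into compare-and-subtract vector instructions:

// Reduce every element modulo x, assuming 0 <= a[i] < 2 * x beforehand.
void reduceAll(int* a, int n, int x) {
    for (int i = 0; i < n; ++i)
        a[i] = a[i] >= x ? a[i] - x : a[i];  // conditional move, no branch to mispredict
}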

Vectorization

  • We can allow the compiler to use some extreme optimizations like loop unrolling and vectorized instructions:

#pragma GCC optimize("Ofast")
#pragma GCC target("sse,sse2,sse3,ssse3,sse4,avx")

  • The processor may process several operations with one such instruction, because it uses 128- or 256-bit registers for them.
  • For example, the following operations may be computed with a single arithmetic instruction:

c[0] = a[0] + b[0]
c[1] = a[1] + b[1]
c[2] = a[2] + b[2]
c[3] = a[3] + b[3]

Vectorization

  • Sometimes they may slow your program down!
  • With clear enough code the compiler does auto-vectorization for you.
  • Code with branching will almost never be vectorized.
  • Integer division and remainder operations are not vectorizable!
  • Use std::memset, std::memcpy and std::memcmp for the corresponding operations on large data; these functions are already well optimized.

Helping the compiler to vectorize your code

  • Remove branching
  • Make the code as simple as you can. Try to make your loops have a clearly defined number of iterations: for (int i = 0; i < n; ++i)
  • Try to remove redundant indexing in the bottleneck. For example, instead of writing a[r - l][s + i] in a loop over i, create a new pointer variable: const auto* aptr = a[r - l] + s; (see the sketch below)
  • Use constants, constexpr and compile-time evaluations.
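
A sketch of the pointer-hoisting hint; the names follow the bullet above and are otherwise illustrative:

// Before: the row pointer a[r - l] and the offset s are re-derived on every iteration.
void mulBefore(int** a, const int* b, int* out, int n, int r, int l, int s) {
    for (int i = 0; i < n; ++i)
        out[i] = a[r - l][s + i] * b[i];
}

// After: the invariant addressing is hoisted out of the loop; the body becomes
// a plain linear scan that is much easier for the compiler to auto-vectorize.
void mulAfter(int** a, const int* b, int* out, int n, int r, int l, int s) {
    const auto* aptr = a[r - l] + s;
    for (int i = 0; i < n; ++i)
        out[i] = aptr[i] * b[i];
}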

Helping the compiler

  • Extracting the bottleneck code into a function may help the compiler optimize it better.
  • It tells the compiler that it may use all the registers for optimization, and makes the code easier to analyze and optimize (see the sketch below).
  • Just make sure that this function is not too cheap (like a small O(1) piece), so that calling it doesn’t become a new bottleneck.
  • In that case we need to be sure of the opposite: that the function is inlined.
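
A sketch of both points (the function names are illustrative):

// A hot loop extracted into a self-contained function: simple parameters, no outside
// state, so the compiler can allocate registers freely and auto-vectorize the body.
long long dotProduct(const int* a, const int* b, int n) {
    long long s = 0;
    for (int i = 0; i < n; ++i)
        s += (long long)a[i] * b[i];
    return s;
}

// A tiny O(1) helper is the opposite case: here we care that it gets inlined,
// so that the call overhead itself doesn't become the bottleneck.
inline long long sq(long long x) { return x * x; }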