1 of 63

Pitfalls of Object Oriented

Programming - Revisited

Tony Albrecht

Riot Games

@TonyAlbrecht

2 of 63

Pitfalls of Object Oriented Programming - 2009

  • Investigated the performance of a simple OO scenetree.
  • Ran on PlayStation 3.
  • Used Sony tools for profiling.
  • PS3 had a limited CPU.

3 of 63

Pitfalls 2009

Start: 19.2ms

Data reorg: 12.9ms

Linear traversal: 4.8ms

Prefetching: 3.3ms

4 of 63

8 years later...

  • Do we still need to care about data as much?
  • What about branching?
  • Prefetching?
  • Virtuals?
  • Can’t the compiler optimise it?

5 of 63

The compiler will not tidy your room!


6 of 63

“The most amazing achievement of the computer software industry is its continuing cancellation of the steady and staggering gains made by the computer hardware industry.”

-Henry Petroski

7 of 63

Random Memory Accesses are slow

Computer architecture: a quantitative approach

By John L. Hennessy, David A. Patterson, Andrea C. Arpaci-Dusseau

8 of 63

Caches

  • Number of levels, types and speeds depend on your platform:
    • L1 ~ a few cycles
    • L2 ~ 10s to 100s of cycles
    • L3 ~ 100s to thousands of cycles
  • Fairly dumb - they store data for access until evicted.
  • CPUs will try to predict where the next access will be.

9 of 63

How does the CPU prefetch?

  • Linearly.
    • Uniform stride/direction.
  • Multiple streams can be active at once.
    • But only a limited number of them.

A smart programmer will take advantage of this.
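A minimal sketch of working with the prefetcher rather than against it (hypothetical code, not from the talk): the linear walk below has a uniform stride the hardware can predict, while the index-driven walk does not.

```cpp
#include <cstddef>
#include <vector>

// A forward walk with a uniform stride: the hardware prefetcher can
// recognise the pattern and stream cache lines in ahead of use.
long long sumLinear(const std::vector<int>& data) {
    long long sum = 0;
    for (std::size_t i = 0; i < data.size(); ++i)  // stride of one element
        sum += data[i];
    return sum;
}

// The same data visited through scattered indices: no uniform stride,
// so the prefetcher cannot help and most loads risk a cache miss.
long long sumScattered(const std::vector<int>& data,
                       const std::vector<std::size_t>& order) {
    long long sum = 0;
    for (std::size_t idx : order)
        sum += data[idx];
    return sum;
}
```

Both compute the same result; only the access pattern differs, which is exactly what the measurements later in the talk exploit.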

10 of 63

So, memory access is slow?

If what I’m saying is true, we should be able to observe and measure it.

Then, as we change code and data, we can measure the changes in performance.

This is not an ideological argument. This is science.

11 of 63

DO

12 of 63

Performance measurement?

  • Profilers
    • Instrumented
    • Sampling
    • Special

13 of 63

A quick note on units:

  • Never use Frames Per Second to measure performance.
  • FPS is a relative measurement.
  • For example: How much faster is “20fps faster”?
  • That depends...
    • 60fps -> 80fps = 4.17ms improvement per frame
    • 20fps -> 40fps = 25ms improvement per frame

14 of 63

Instrumented profiling

  • Manually mark up sections to profile
    • Record unique ID
    • Start time
    • End time
  • Visualise it

15 of 63

Visualisation

16 of 63

Instrumented Profilers

Pros

  • Fantastic for detecting spikes
  • Provides a visual sense of performance characteristics
  • Top-down view

Cons

  • Intrusive
  • Won’t tell you which lines are slow

Examples:

  • RAD Game Tools’ Telemetry
  • Write your own - visualise with chrome://tracing
  • Use mine (when I release it)
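A hedged sketch of the roll-your-own approach: an RAII marker that emits events in the Trace Event JSON format that chrome://tracing loads. The names and the hard-coded pid/tid are illustrative assumptions, not the talk's profiler.

```cpp
#include <chrono>
#include <cstdio>
#include <string>

// One event in Trace Event JSON: "ph":"X" is a complete event,
// "ts" and "dur" are in microseconds.
std::string traceEvent(const std::string& name, long long ts, long long dur) {
    return "{\"name\":\"" + name + "\",\"ph\":\"X\",\"ts\":" +
           std::to_string(ts) + ",\"dur\":" + std::to_string(dur) +
           ",\"pid\":1,\"tid\":1}";
}

// RAII marker: records the start time on entry, emits the event on exit.
class ScopedTrace {
public:
    explicit ScopedTrace(const char* name) : name_(name), start_(nowUs()) {}
    ~ScopedTrace() {
        std::puts(traceEvent(name_, start_, nowUs() - start_).c_str());
    }
private:
    static long long nowUs() {
        using namespace std::chrono;
        return duration_cast<microseconds>(
            steady_clock::now().time_since_epoch()).count();
    }
    const char* name_;
    long long start_;
};

void update() { ScopedTrace t("update"); /* ... work to profile ... */ }
```

Collect the emitted lines into a JSON array, load the file in chrome://tracing, and you get the timeline view for free.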

17 of 63

Sampling profilers

  • Rapidly samples the Program Counter and stores the call stack.
  • Then reassembles the samples by call stack.
  • Slow functions will get hit more often - basic probability.
    • Slow lines will be hit more often.
  • Bottom-up profiling

Sampling profilers:

  • Intel’s VTune
  • AMD’s CodeXL
  • Very Sleepy

18 of 63

Specialised Profilers

Extract particular information from a process

  • CPU specific perf counters
    • AMD/Intel profilers
  • CacheSim

19 of 63

When optimising

  • You want a deterministic test case (if possible)
    • Otherwise, be aware of run-to-run variation
    • Run test case multiple times and compare

  • USE THE COMPILER OPTIONS!!
    • Learn what the different compiler options do.
    • Experiment and Profile!

20 of 63

Measuring performance is not enough

You need to know *why* something is slow.

When you know why, then you can address it.

For that, you must understand your hardware.

(left as an exercise for the reader)

http://www.agner.org/optimize/microarchitecture.pdf

21 of 63

The Test Case

Basically the same code as the 2009 Pitfalls talk, but with more: 55,000 objects instead of 11,000.

Animates, culls and renders a scenetree.

22 of 63

23 of 63

Hardware Used

24 of 63

Here’s a single instrumented frame

25 of 63

Sampling profiler

26 of 63

27 of 63

inline const Matrix4 Matrix4::operator*()

28 of 63

29 of 63

30 of 63

Cache miss!

  • An L3 cache miss costs on the order of a few hundred cycles (200-300?).
  • A hit is around 40 cycles.

  • An average instruction takes 1 to 14 cycles (atomics can be 30+ cycles).
  • And they can pipeline…
  • An L3 cache miss is equivalent to potentially hundreds of instructions.

31 of 63

Let’s take a step back...

  • Be careful not to get caught up in micro-optimisation.
  • Take the time to understand the big picture.
  • Algorithmic optimisations can provide dramatic performance boosts.
  • For this example, let’s assume that it’s algorithmically perfect
    • It’s not.

32 of 63

What code are we dealing with?

33 of 63

Object Class

34 of 63

Modifiers

  • Hold a vector of Objects
  • And a Matrix4
  • Call Update() to multiply all its Objects by its transform.
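A hypothetical reconstruction of the Modifier described above (member names and layout are assumptions from the slide, not the talk's source):

```cpp
#include <vector>

struct Matrix4 { float m[16]; };

// Plain 4x4 multiply, row-major: r[row][col] = sum over k of a[row][k] * b[k][col].
Matrix4 operator*(const Matrix4& a, const Matrix4& b) {
    Matrix4 r{};
    for (int row = 0; row < 4; ++row)
        for (int col = 0; col < 4; ++col)
            for (int k = 0; k < 4; ++k)
                r.m[row * 4 + col] += a.m[row * 4 + k] * b.m[k * 4 + col];
    return r;
}

struct Object {
    Matrix4 world{};
    // ... bounding volume, render state, etc.
};

class Modifier {
public:
    explicit Modifier(const Matrix4& t) : transform_(t) {}
    void addObject(Object* o) { objects_.push_back(o); }
    void Update() {
        // One transform applied to every owned Object. Each Object is
        // reached through a pointer, so if they are scattered across
        // memory, every iteration risks a cache miss - the bottleneck
        // the profiles point at.
        for (Object* o : objects_)
            o->world = transform_ * o->world;
    }
private:
    std::vector<Object*> objects_;
    Matrix4 transform_;
};
```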

35 of 63

Nodes

36 of 63

Back to the Cache miss

  • Why is Matrix4::operator*() the bottleneck?

Where Object is

37 of 63

Memory layout for Nodes

Node size = 200 bytes
Object size = 188 bytes

38 of 63

Modifier::Update()

Iterates through all its objects.

Which are scattered throughout memory.

39 of 63

How do we remove this bottleneck?

  • Do less.
  • Use less memory.
  • Minimise load stalls by making memory access contiguous.
  • Or, use prefetching to tell the CPU where the upcoming data will be.
    • Tricky. Pointer chasing, pre-emptive loads, messy code...

  • Better off working with the HW.

40 of 63

How do we fix it?

Force homogeneous, temporally coherent data to be contiguous

  • Memory Pool Managers
  • Overload new

“Don’t be clever, be clear”

41 of 63

A simple allocator

sizeof(Node) = 44, sizeof(Object) = 32
(was 200 and 188)
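One way to sketch the "overload new" idea from the previous slide, assuming a simple bump-pointer pool (the names and the 44-byte Node payload are illustrative, not the talk's allocator):

```cpp
#include <cstddef>
#include <new>

// A minimal bump-pointer pool: hands out consecutive slots from one
// block, so Nodes allocated together end up contiguous in memory.
// Real pool managers also handle freeing, alignment and exhaustion.
class Pool {
public:
    Pool(void* memory, std::size_t bytes)
        : base_(static_cast<char*>(memory)), end_(base_ + bytes), cur_(base_) {}
    void* allocate(std::size_t bytes) {
        if (cur_ + bytes > end_) return nullptr;  // out of pool space
        void* p = cur_;
        cur_ += bytes;
        return p;
    }
private:
    char* base_;
    char* end_;
    char* cur_;
};

struct Node {
    float payload[11];  // 44 bytes, matching the slimmed-down Node

    // Class-level operator new routes every Node through the pool,
    // with no change to the code that says "new Node".
    static Pool* pool;
    static void* operator new(std::size_t size) {
        void* p = pool->allocate(size);
        if (!p) throw std::bad_alloc();
        return p;
    }
    static void operator delete(void*) {}  // pool is freed wholesale
};
Pool* Node::pool = nullptr;
```

Because allocation order matches traversal order, consecutive `new Node` calls land back to back, which is exactly the contiguity the traversal needs.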

42 of 63

Let’s look at the memory layout now

43 of 63

Now, measure performance...

Now…

Previously…

44 of 63

17.5ms -> 9.5ms

No functional code changes.

45 of 63

Now, measure performance...

Now…

Previously…

46 of 63

Where are the bottlenecks now?

New

Previous

47 of 63

A closer look at Matrix4 multiply

Where is my SIMD?
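As a rough sketch of what SIMD code generation buys here: a column-major 4x4 multiply with SSE intrinsics, where each result column is the four columns of A scaled by one column of B. Layout and names are assumptions, not the talk's actual Matrix4.

```cpp
#include <immintrin.h>  // SSE intrinsics

// out = a * b, all three matrices stored column-major as 16 floats.
void mul4x4_sse(const float* a, const float* b, float* out) {
    __m128 col0 = _mm_loadu_ps(a + 0);   // columns of A
    __m128 col1 = _mm_loadu_ps(a + 4);
    __m128 col2 = _mm_loadu_ps(a + 8);
    __m128 col3 = _mm_loadu_ps(a + 12);
    for (int c = 0; c < 4; ++c) {
        // Broadcast each element of B's column c across a register...
        __m128 x = _mm_set1_ps(b[c * 4 + 0]);
        __m128 y = _mm_set1_ps(b[c * 4 + 1]);
        __m128 z = _mm_set1_ps(b[c * 4 + 2]);
        __m128 w = _mm_set1_ps(b[c * 4 + 3]);
        // ...and accumulate the scaled columns of A, four lanes at a time.
        __m128 r = _mm_add_ps(
            _mm_add_ps(_mm_mul_ps(col0, x), _mm_mul_ps(col1, y)),
            _mm_add_ps(_mm_mul_ps(col2, z), _mm_mul_ps(col3, w)));
        _mm_storeu_ps(out + c * 4, r);
    }
}
```

With the right compiler options the optimiser can emit essentially this on its own, which is what the next slide's recompile shows.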

48 of 63

Recompile and profile with SIMD

9.5ms -> 6.2ms

49 of 63

Sampling profile

  • Matrix multiply has disappeared!
    • It’s now small enough to be inlined.

50 of 63

Modifier::Update()

51 of 63

Virtual function overhead

  • This was a big issue on PS3.
  • Let’s look at SetVisibilityRecursively()

52 of 63

53 of 63

De-inheriting everything

  • Decoupled Node from Object
  • Changed code to addNode and addObject, etc.
  • Nodes looped over Objects and Nodes separately
  • No virtuals!
  • How fast?

6.2ms -> 7.6ms

54 of 63

Ah, wat?

  • Suspect the branch predictor was already handling the virtual calls well.
  • From the asm - not much worse than plain function-call overhead.
  • The extra code for looping over nodes and objects separately hurt cache locality.
  • Worthy of further inspection.

Assume nothing, test everything

55 of 63

56 of 63

Prefetching?

  • Prefetching is complicated
  • Hard to get significant perf improvements
  • The HW does a pretty good job if you keep your access patterns simple
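For completeness, a hedged sketch of manual prefetching in a pointer-chasing loop, using the GCC/Clang `__builtin_prefetch` hint (MSVC has `_mm_prefetch`). As the bullets above say, real wins from this are hard to get: hint too late and the load stalls anyway, too early and the line is evicted before use.

```cpp
struct ListNode {
    int value;
    ListNode* next;
};

// Walks a linked list, hinting the next node while working on the
// current one so the load can overlap with the work.
long long sumList(const ListNode* n) {
    long long sum = 0;
    while (n) {
        if (n->next)
            __builtin_prefetch(n->next);  // hint: we'll need this soon
        sum += n->value;
        n = n->next;
    }
    return sum;
}
```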

57 of 63

Summary

  • 17.5ms -> 6.2ms
  • No functional changes
    • Memory layout (9.5ms)
    • SIMD use
  • Could go even faster
    • 2009 talk reduced everything to flat arrays
    • But, at the cost of flexibility and readability.

58 of 63

Optimisation Process

  1. Understand your problem.
  2. Is there a better algorithm?
  3. Can you call it less (or in a different thread)?
  4. Understand your data access patterns.
    • Optimise for temporal coherence.
    • Side effect: Easier to parallelise!
  5. Then, instruction-level optimisation.

59 of 63

Obfuscation by Optimisation

When optimising, aim for simplicity.

Simple code is easy to understand, easy to maintain.

Weigh up the costs of complex, highly optimised code - it can be brittle and costly to maintain. It will often be thrown away, but it can be necessary.

60 of 63

So, is OO bad?

  • Encapsulation by
    • logic/function, vs
    • data

  • OO used with foresight
    • Fast
    • Simple
    • Maintainable
  • OO used without care
    • Slow
    • Complex
    • Unmaintainable
    • Unoptimisable.

61 of 63

The Language is not your platform

You are not building something to run in C++

You are building something to run on some hardware.

Your language is an abstraction of the HW.

If you need it to run fast, build with the HW in mind.

62 of 63

END

63 of 63