1 of 63

Pitfalls of Object Oriented

Programming - Revisited

Tony Albrecht

Riot Games

@TonyAlbrecht

2 of 63

Pitfalls of Object Oriented Programming - 2009

  • Investigated the performance of a simple OO scenetree.
  • Ran on PlayStation 3.
  • Used Sony tools for profiling.
  • PS3 had a limited CPU.

3 of 63

Pitfalls 2009

Start: 19.2ms

Data reorg: 12.9ms

Linear traversal: 4.8ms

Prefetching: 3.3ms

4 of 63

8 years later...

  • Do we still need to care about data as much?
  • What about branching?
  • Prefetching?
  • Virtuals?
  • Can’t the compiler optimise it?

5 of 63

The compiler will not tidy your room!


6 of 63

“The most amazing achievement of the computer software industry is its continuing cancellation of the steady and staggering gains made by the computer hardware industry.”

-Henry Petroski

7 of 63

Random Memory Accesses are slow

Computer architecture: a quantitative approach

By John L. Hennessy, David A. Patterson, Andrea C. Arpaci-Dusseau

8 of 63

Caches

  • Number of levels, types and speeds depend on your platform:
    • L1 ~ a few cycles
    • L2 ~ 10s to 100s of cycles
    • L3 ~ 100s to thousands of cycles
  • Fairly dumb - they store data for access until evicted.
  • CPUs will try to predict where the next access will be.

9 of 63

How does the CPU prefetch?

  • Linearly.
    • Uniform stride/direction.
  • Multiple streams can be active at once.
    • But only a limited number of them.

A smart programmer will take advantage of this.
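A minimal sketch of working with the prefetcher rather than against it (hypothetical code, not from the talk): the linear walk below has a uniform stride the hardware can predict, while the index-driven walk does not.

```cpp
#include <cstddef>
#include <vector>

// A forward walk with a uniform stride: the hardware prefetcher can
// recognise the pattern and stream cache lines in ahead of use.
long long sumLinear(const std::vector<int>& data) {
    long long sum = 0;
    for (std::size_t i = 0; i < data.size(); ++i)  // stride of one element
        sum += data[i];
    return sum;
}

// The same data visited through scattered indices: no uniform stride,
// so the prefetcher cannot help and most loads risk a cache miss.
long long sumScattered(const std::vector<int>& data,
                       const std::vector<std::size_t>& order) {
    long long sum = 0;
    for (std::size_t idx : order)
        sum += data[idx];
    return sum;
}
```

Both compute the same result; only the access pattern differs, which is exactly what the measurements later in the talk exploit.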

10 of 63

So, memory access is slow?

If what I’m saying is true, we should be able to observe and measure it.

Then, as we change code and data, we can measure the changes in performance.

This is not an ideological argument. This is science.

11 of 63

DO

12 of 63

Performance measurement?

  • Profilers
    • Instrumented
    • Sampling
    • Special

13 of 63

A quick note on units:

  • Never use Frames Per Second to measure performance.
  • FPS is a relative measurement.
  • For example: How much faster is “20fps faster”?
  • That depends...
    • 60fps -> 80fps = 4.17ms improvement per frame
    • 20fps -> 40fps = 25ms improvement per frame

14 of 63

Instrumented profiling

  • Manually mark up sections to profile
    • Record unique ID
    • Start time
    • End time
  • Visualise it

15 of 63

Visualisation

16 of 63

Instrumented Profilers

Pros

  • Fantastic for detecting spikes
  • Provides a visual sense of performance characteristics
  • Top-down view

Cons

  • Intrusive
  • Won’t tell you which lines are slow

Examples:

  • RAD Game Tools’ Telemetry
  • Write your own - visualise with chrome://tracing
  • Use mine (when I release it)
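A hedged sketch of the roll-your-own approach: an RAII marker that emits events in the Trace Event JSON format that chrome://tracing loads. The names and the hard-coded pid/tid are illustrative assumptions, not the talk's profiler.

```cpp
#include <chrono>
#include <cstdio>
#include <string>

// One event in Trace Event JSON: "ph":"X" is a complete event,
// "ts" and "dur" are in microseconds.
std::string traceEvent(const std::string& name, long long ts, long long dur) {
    return "{\"name\":\"" + name + "\",\"ph\":\"X\",\"ts\":" +
           std::to_string(ts) + ",\"dur\":" + std::to_string(dur) +
           ",\"pid\":1,\"tid\":1}";
}

// RAII marker: records the start time on entry, emits the event on exit.
class ScopedTrace {
public:
    explicit ScopedTrace(const char* name) : name_(name), start_(nowUs()) {}
    ~ScopedTrace() {
        std::puts(traceEvent(name_, start_, nowUs() - start_).c_str());
    }
private:
    static long long nowUs() {
        using namespace std::chrono;
        return duration_cast<microseconds>(
            steady_clock::now().time_since_epoch()).count();
    }
    const char* name_;
    long long start_;
};

void update() { ScopedTrace t("update"); /* ... work to profile ... */ }
```

Collect the emitted lines into a JSON array, load the file in chrome://tracing, and you get the timeline view for free.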

17 of 63

Sampling profilers

  • Rapidly samples the Program Counter and stores the call stack.
  • Then reassembles the samples by call stack.
  • Slow functions will get hit more often - basic probability.
    • Slow lines will be hit more often.
  • Bottom-up profiling

Sampling profilers:

  • Intel’s VTune
  • AMD’s CodeXL
  • Very Sleepy

18 of 63

Specialised Profilers

Extract particular information from a process

  • CPU specific perf counters
    • AMD/Intel profilers
  • CacheSim

19 of 63

When optimising

  • You want a deterministic test case (if possible)
    • Otherwise, be aware of run-to-run variation
    • Run test case multiple times and compare

  • USE THE COMPILER OPTIONS!!
    • Learn what the different compiler options do.
    • Experiment and Profile!

20 of 63

Measuring performance is not enough

You need to know *why* something is slow.

When you know why, then you can address it.

For that, you must understand your hardware.

(left as an exercise for the reader)

http://www.agner.org/optimize/microarchitecture.pdf

21 of 63

The Test Case

Basically the same code as the 2009 Pitfalls talk, but with more: 55,000 objects instead of 11,000.

Animates, culls and renders a scenetree.

22 of 63

23 of 63

Hardware Used

24 of 63

Here’s a single instrumented frame

25 of 63

Sampling profiler

26 of 63

27 of 63

inline const Matrix4 Matrix4::operator*()

28 of 63

29 of 63

30 of 63

Cache miss!

  • An L3 cache miss costs on the order of a few hundred cycles (200-300?).
  • A hit is around 40 cycles.

  • An average instruction takes 1 to 14 cycles (atomics can be 30+ cycles).
  • And they can pipeline…
  • An L3 cache miss is equivalent to potentially hundreds of instructions.

31 of 63

Let’s take a step back...

  • Be careful not to get caught up in micro-optimisation.
  • Take the time to understand the big picture.
  • Algorithmic optimisations can provide dramatic performance boosts.
  • For this example, let’s assume that it’s algorithmically perfect
    • It’s not.

32 of 63

What code are we dealing with?

33 of 63

Object Class

34 of 63

Modifiers

  • Hold a vector of Objects
  • And a Matrix4
  • Call Update() to multiply all its Objects by its transform.
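A hypothetical reconstruction of the Modifier described above (member names and layout are assumptions from the slide, not the talk's source):

```cpp
#include <vector>

struct Matrix4 { float m[16]; };

// Plain 4x4 multiply, row-major: r[row][col] = sum over k of a[row][k] * b[k][col].
Matrix4 operator*(const Matrix4& a, const Matrix4& b) {
    Matrix4 r{};
    for (int row = 0; row < 4; ++row)
        for (int col = 0; col < 4; ++col)
            for (int k = 0; k < 4; ++k)
                r.m[row * 4 + col] += a.m[row * 4 + k] * b.m[k * 4 + col];
    return r;
}

struct Object {
    Matrix4 world{};
    // ... bounding volume, render state, etc.
};

class Modifier {
public:
    explicit Modifier(const Matrix4& t) : transform_(t) {}
    void addObject(Object* o) { objects_.push_back(o); }
    void Update() {
        // One transform applied to every owned Object. Each Object is
        // reached through a pointer, so if they are scattered across
        // memory, every iteration risks a cache miss - the bottleneck
        // the profiles point at.
        for (Object* o : objects_)
            o->world = transform_ * o->world;
    }
private:
    std::vector<Object*> objects_;
    Matrix4 transform_;
};
```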

35 of 63

Nodes

36 of 63

Back to the Cache miss

  • Why is Matrix4::operator*() the bottleneck?

Where Object is

37 of 63

Memory layout for Nodes

Node size = 200 bytes
Object size = 188 bytes

38 of 63

Modifier::Update()

Iterates through all its objects.

Which are scattered throughout memory.

39 of 63

How do we remove this bottleneck?

  • Do less.
  • Use less memory.
  • Minimise load stalls by making memory access contiguous.
  • Or, use prefetching to tell the CPU where the upcoming data will be.
    • Tricky. Pointer chasing, pre-emptive loads, messy code...

  • Better off working with the HW.

40 of 63

How do we fix it?

Force homogeneous, temporally coherent data to be contiguous

  • Memory Pool Managers
  • Overload new

“Don’t be clever, be clear”

41 of 63

A simple allocator

sizeof(Node) = 44, sizeof(Object) = 32
(was 200 and 188)
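One way to sketch the "overload new" idea from the previous slide, assuming a simple bump-pointer pool (the names and the 44-byte Node payload are illustrative, not the talk's allocator):

```cpp
#include <cstddef>
#include <new>

// A minimal bump-pointer pool: hands out consecutive slots from one
// block, so Nodes allocated together end up contiguous in memory.
// Real pool managers also handle freeing, alignment and exhaustion.
class Pool {
public:
    Pool(void* memory, std::size_t bytes)
        : base_(static_cast<char*>(memory)), end_(base_ + bytes), cur_(base_) {}
    void* allocate(std::size_t bytes) {
        if (cur_ + bytes > end_) return nullptr;  // out of pool space
        void* p = cur_;
        cur_ += bytes;
        return p;
    }
private:
    char* base_;
    char* end_;
    char* cur_;
};

struct Node {
    float payload[11];  // 44 bytes, matching the slimmed-down Node

    // Class-level operator new routes every Node through the pool,
    // with no change to the code that says "new Node".
    static Pool* pool;
    static void* operator new(std::size_t size) {
        void* p = pool->allocate(size);
        if (!p) throw std::bad_alloc();
        return p;
    }
    static void operator delete(void*) {}  // pool is freed wholesale
};
Pool* Node::pool = nullptr;
```

Because allocation order matches traversal order, consecutive `new Node` calls land back to back, which is exactly the contiguity the traversal needs.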

42 of 63

Let’s look at the memory layout now

43 of 63

Now, measure performance...

Now…

Previously…

44 of 63

17.5ms -> 9.5ms

No functional code changes.

45 of 63

Now, measure performance...

Now…

Previously…

46 of 63

Where are the bottlenecks now?

New

Previous

47 of 63

A closer look at Matrix4 multiply

Where is my SIMD?
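As a rough sketch of what SIMD code generation buys here: a column-major 4x4 multiply with SSE intrinsics, where each result column is the four columns of A scaled by one column of B. Layout and names are assumptions, not the talk's actual Matrix4.

```cpp
#include <immintrin.h>  // SSE intrinsics

// out = a * b, all three matrices stored column-major as 16 floats.
void mul4x4_sse(const float* a, const float* b, float* out) {
    __m128 col0 = _mm_loadu_ps(a + 0);   // columns of A
    __m128 col1 = _mm_loadu_ps(a + 4);
    __m128 col2 = _mm_loadu_ps(a + 8);
    __m128 col3 = _mm_loadu_ps(a + 12);
    for (int c = 0; c < 4; ++c) {
        // Broadcast each element of B's column c across a register...
        __m128 x = _mm_set1_ps(b[c * 4 + 0]);
        __m128 y = _mm_set1_ps(b[c * 4 + 1]);
        __m128 z = _mm_set1_ps(b[c * 4 + 2]);
        __m128 w = _mm_set1_ps(b[c * 4 + 3]);
        // ...and accumulate the scaled columns of A, four lanes at a time.
        __m128 r = _mm_add_ps(
            _mm_add_ps(_mm_mul_ps(col0, x), _mm_mul_ps(col1, y)),
            _mm_add_ps(_mm_mul_ps(col2, z), _mm_mul_ps(col3, w)));
        _mm_storeu_ps(out + c * 4, r);
    }
}
```

With the right compiler options the optimiser can emit essentially this on its own, which is what the next slide's recompile shows.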

48 of 63

Recompile and profile with SIMD

9.5ms -> 6.2ms

49 of 63

Sampling profile

  • Matrix multiply has disappeared!
    • It’s now small enough to be inlined.

50 of 63

Modifier::Update()

51 of 63

Virtual function overhead

  • This was a big issue on PS3.
  • Let’s look at SetVisibilityRecursively()

52 of 63

53 of 63

De-inheriting everything

  • Decoupled Node from Object
  • Changed code to addNode and addObject, etc.
  • Nodes looped over Objects and Nodes separately
  • No virtuals!
  • How fast?

6.2ms -> 7.6ms

54 of 63

Ah, wat?

  • Suspect the branch predictor was already handling the virtual calls well.
  • From the asm - not much worse than plain function-call overhead.
  • The extra code for looping over nodes and objects separately hurt cache locality.
  • Worthy of further inspection.

Assume nothing, test everything

55 of 63

56 of 63

Prefetching?

  • Prefetching is complicated
  • Hard to get significant perf improvements
  • The HW does a pretty good job if you keep your access patterns simple
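For completeness, a hedged sketch of manual prefetching in a pointer-chasing loop, using the GCC/Clang `__builtin_prefetch` hint (MSVC has `_mm_prefetch`). As the bullets above say, real wins from this are hard to get: hint too late and the load stalls anyway, too early and the line is evicted before use.

```cpp
struct ListNode {
    int value;
    ListNode* next;
};

// Walks a linked list, hinting the next node while working on the
// current one so the load can overlap with the work.
long long sumList(const ListNode* n) {
    long long sum = 0;
    while (n) {
        if (n->next)
            __builtin_prefetch(n->next);  // hint: we'll need this soon
        sum += n->value;
        n = n->next;
    }
    return sum;
}
```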

57 of 63

Summary

  • 17.5ms -> 6.2ms
  • No functional changes
    • Memory layout (9.5ms)
    • SIMD use
  • Could go even faster
    • 2009 talk reduced everything to flat arrays
    • But, at the cost of flexibility and readability.

58 of 63

Optimisation Process

  1. Understand your problem.
  2. Is there a better algorithm?
  3. Can you call it less (or in a different thread)?
  4. Understand your data access patterns.
    • Optimise for temporal coherence.
    • Side effect: Easier to parallelise!
  5. Then, instruction-level optimisation.

59 of 63

Obfuscation by Optimisation

When optimising, aim for simplicity.

Simple code is easy to understand, easy to maintain.

Weigh up the costs of complex, highly optimised code - it can be brittle and costly to maintain. It will often be thrown away, but it can be necessary.

60 of 63

So, is OO bad?

  • Encapsulation by
    • logic/function, vs
    • data

  • OO used with foresight
    • Fast
    • Simple
    • Maintainable
  • OO used without care
    • Slow
    • Complex
    • Unmaintainable
    • Unoptimisable.

61 of 63

The Language is not your platform

You are not building something to run in C++

You are building something to run on some hardware.

Your language is an abstraction of the HW.

If you need it to run fast, build with the HW in mind.

62 of 63

END

63 of 63