Pitfalls of Object Oriented Programming - Revisited
Tony Albrecht
Riot Games
@TonyAlbrecht
Pitfalls of Object Oriented Programming - 2009
Pitfalls 2009
Start: 19.2ms
Data reorg: 12.9ms
Linear traversal: 4.8ms
Prefetching: 3.3ms
8 years later...
The compiler will not tidy your room!
“The most amazing achievement of the computer software industry is its continuing cancellation of the steady and staggering gains made by the computer hardware industry.”
-Henry Petroski
Random Memory Accesses are slow
Computer Architecture: A Quantitative Approach
by John L. Hennessy, David A. Patterson, Andrea C. Arpaci-Dusseau
Caches
How does the CPU prefetch?
A smart programmer will take advantage of this.
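A sketch of why this matters (my illustration, not from the slides): hardware prefetchers recognise sequential address streams, so a linear walk over a contiguous array keeps data arriving ahead of use, while chasing pointers to scattered allocations defeats it.

    #include <vector>

    struct Particle { float x, y, z, pad; };

    // Prefetcher-friendly: addresses advance by sizeof(Particle) each iteration,
    // so the next cache lines are already in flight when we reach them.
    float SumLinear(const std::vector<Particle>& particles)
    {
        float sum = 0.0f;
        for (const Particle& p : particles)
            sum += p.x;
        return sum;
    }

    // Prefetcher-hostile: every iteration jumps to an unpredictable address,
    // so each access is a potential cache miss.
    float SumScattered(const std::vector<const Particle*>& particles)
    {
        float sum = 0.0f;
        for (const Particle* p : particles)
            sum += p->x;
        return sum;
    }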
So, memory access is slow?
If what I’m saying is true, we should be able to observe and measure it.
Then, as we change code and data, we can measure the changes in performance.
This is not an ideological argument. This is science.
Performance measurement?
A quick note on units:
Instrumented profiling
Visualisation
Instrumented Profilers
Pros
Cons
Examples:
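A minimal sketch of how instrumentation works (hypothetical names, not any particular profiler's API): explicit markers are compiled into the code, so every marked scope is timed on every call, whether or not it turns out to be interesting.

    #include <chrono>
    #include <cstdio>

    // Records the wall-clock time spent inside a scope and prints it on exit.
    struct ScopedTimer
    {
        const char* label;
        std::chrono::steady_clock::time_point start;

        explicit ScopedTimer(const char* l)
            : label(l), start(std::chrono::steady_clock::now()) {}

        ~ScopedTimer()
        {
            const auto end = std::chrono::steady_clock::now();
            const long long us =
                std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
            std::printf("%s: %lld us\n", label, us);
        }
    };

    void AnimateObjects()
    {
        ScopedTimer timer("AnimateObjects");  // instrumentation cost is paid on every call
        // ... work being measured ...
    }

The precision comes at the price of runtime overhead, and you only see the scopes you thought to mark.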
Sampling profilers
Sampling profilers:
Specialised Profilers
Extract particular information from a process
When optimising
Measuring performance is not enough
You need to know *why* something is slow.
When you know why, then you can address it.
For that, you must understand your hardware.
(left as an exercise for the reader)
http://www.agner.org/optimize/microarchitecture.pdf
The Test Case
Basically the same code as the 2009 Pitfalls talk, but with more: 55,000 objects instead of 11,000.
Animates, culls and renders a scene tree.
Hardware Used
Here’s a single instrumented frame
Sampling profiler
inline const Matrix4 Matrix4::operator *()
Cache miss!
Let’s take a step back...
What code are we dealing with?
Object Class
Modifiers
Nodes
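A rough reconstruction of the kind of classes involved (names and members are my assumptions, not the talk's exact code): a conventional OO scene tree, where every node owns its transforms and bounding volume and is updated through virtual calls.

    #include <vector>

    struct Matrix4 { float m[16]; };
    struct Vector4 { float v[4]; };
    struct Sphere  { Vector4 centre; float radius; };

    class Object
    {
    public:
        virtual ~Object() {}
        virtual void Update() {}        // virtual dispatch, once per object per frame

    protected:
        Matrix4 m_localTransform;
        Matrix4 m_worldTransform;
        Sphere  m_boundingSphere;
        // ... flags, dirty bits, etc. -> something like the 188 bytes on the slide
    };

    class Node : public Object
    {
    public:
        void Update() override
        {
            for (Object* child : m_children)   // each child was new'd separately,
                child->Update();               // so this walk hops around the heap
        }

    private:
        std::vector<Object*> m_children;       // -> something like the 200 bytes on the slide
    };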
Back to the Cache miss
Where Object is
Memory layout for Nodes
Node size = 200 bytes, Object size = 188 bytes
Modifier::Update()
Iterates through all its objects.
Which are scattered throughout memory.
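The hot loop presumably looks something like this (an assumed shape, not the talk's exact source): each iteration dereferences a pointer to an Object that was allocated wherever the heap happened to put it, so the useful work starts with a cache miss.

    #include <vector>

    class Object { public: virtual void Update() {} };

    class Modifier
    {
    public:
        void Update()
        {
            // The prefetcher can't predict where the next Object lives, so each
            // dereference stalls on a cold cache line before Update() even runs.
            for (Object* object : m_objects)
                object->Update();
        }

    private:
        std::vector<Object*> m_objects;  // pointers to objects scattered throughout memory
    };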
How do we remove this bottleneck?
How do we fix it?
Force homogeneous, temporally coherent data to be contiguous
“Don’t be clever, be clear”
A simple allocator
sizeof(Node) = 44, sizeof(Object) = 32 (was 200 and 188)
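One way to get that layout (a minimal bump-allocator sketch under my own assumptions, not the talk's actual allocator): carve homogeneous objects out of one contiguous block, in the order they will be traversed, so an update becomes a straight walk through memory.

    #include <cassert>
    #include <cstddef>
    #include <new>
    #include <utility>

    template <typename T>
    class PoolAllocator
    {
    public:
        // The caller supplies a suitably aligned buffer big enough for 'capacity' Ts.
        PoolAllocator(void* buffer, std::size_t capacity)
            : m_buffer(static_cast<unsigned char*>(buffer)), m_capacity(capacity), m_count(0) {}

        // Constructs each new T immediately after the previous one.
        template <typename... Args>
        T* Allocate(Args&&... args)
        {
            assert(m_count < m_capacity);
            void* where = m_buffer + m_count * sizeof(T);
            ++m_count;
            return new (where) T(std::forward<Args>(args)...);
        }

        T*          Begin()       { return reinterpret_cast<T*>(m_buffer); }
        std::size_t Count() const { return m_count; }

    private:
        unsigned char* m_buffer;
        std::size_t    m_capacity;
        std::size_t    m_count;
    };

Update loops then walk Begin() .. Begin() + Count() in allocation order instead of chasing child pointers.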
Let’s look at the memory layout now
Now, measure performance...
Now…
Previously…
17.5ms -> 9.5ms
No functional code changes.
Where are the bottlenecks now?
New vs. Previous
A closer look at Matrix4 multiply
Where is my SIMD?
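Roughly the shape the profiler is pointing at (my illustration of a scalar 4x4 multiply, not the talk's exact source): sixteen dot products computed one float at a time, which the compiler had not vectorised on its own.

    struct Matrix4 { float m[4][4]; };

    Matrix4 Multiply(const Matrix4& a, const Matrix4& b)
    {
        Matrix4 r;
        for (int row = 0; row < 4; ++row)
            for (int col = 0; col < 4; ++col)
            {
                float sum = 0.0f;
                for (int k = 0; k < 4; ++k)
                    sum += a.m[row][k] * b.m[k][col];  // one scalar multiply-add at a time
                r.m[row][col] = sum;
            }
        return r;
    }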
Recompile and profile with SIMD
9.5ms -> 6.2ms
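For comparison, a hand-written SSE version of the same multiply (a sketch assuming row-major, 16-byte-aligned matrices; the talk's gain came simply from recompiling with SIMD enabled, not from this exact code):

    #include <xmmintrin.h>

    struct alignas(16) Matrix4 { float m[4][4]; };

    Matrix4 Multiply(const Matrix4& a, const Matrix4& b)
    {
        const __m128 b0 = _mm_load_ps(b.m[0]);
        const __m128 b1 = _mm_load_ps(b.m[1]);
        const __m128 b2 = _mm_load_ps(b.m[2]);
        const __m128 b3 = _mm_load_ps(b.m[3]);

        Matrix4 r;
        for (int row = 0; row < 4; ++row)
        {
            // result row = a[row][0]*b.row0 + a[row][1]*b.row1 + a[row][2]*b.row2 + a[row][3]*b.row3
            __m128 acc = _mm_mul_ps(_mm_set1_ps(a.m[row][0]), b0);
            acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(a.m[row][1]), b1));
            acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(a.m[row][2]), b2));
            acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(a.m[row][3]), b3));
            _mm_store_ps(r.m[row], acc);
        }
        return r;
    }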
Sampling profile
Modifier::Update()
Virtual function overhead
De-inheriting everything
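A sketch of what "de-inheriting" generally means (my illustration of the technique, not the talk's exact change): keep each concrete type in its own contiguous array and call a non-virtual Update directly, so there is no vtable lookup and the call can be inlined.

    #include <vector>

    struct Matrix4 { float m[16]; };

    struct AnimatedObject                 // no base class, no virtual functions
    {
        Matrix4 transform;
        void Update() { /* ... */ }       // resolved at compile time, inlinable
    };

    void UpdateAll(std::vector<AnimatedObject>& objects)
    {
        for (AnimatedObject& object : objects)
            object.Update();
    }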
6.2ms -> 7.6ms
Ah, wat?
“Assume nothing, test everything”
Prefetching?
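For reference, the kind of explicit prefetch the 2009 talk benefited from (an illustrative sketch, not measured code from this talk; the lookahead distance is a guess you would have to tune and verify on your own hardware):

    #include <cstddef>
    #include <vector>
    #include <xmmintrin.h>

    struct AnimatedObject { float data[16]; void Update() { /* ... */ } };

    void UpdateAllWithPrefetch(std::vector<AnimatedObject>& objects)
    {
        const std::size_t lookahead = 4;          // tuning guess; profile to pick a real value
        const std::size_t count = objects.size();
        for (std::size_t i = 0; i < count; ++i)
        {
            // Ask for a cache line a few elements ahead so it is (hopefully)
            // resident by the time the loop reaches it.
            if (i + lookahead < count)
                _mm_prefetch(reinterpret_cast<const char*>(&objects[i + lookahead]), _MM_HINT_T0);
            objects[i].Update();
        }
    }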
Summary
Optimisation Process
Obfuscation by Optimisation
When optimising, aim for simplicity.
Simple code is easy to understand, easy to maintain.
Weigh up the costs of complex, highly optimised code - it can be brittle and costly to maintain. It will often be throwaway work, but it can be necessary.
So, is OO bad?
The Language is not your platform
You are not building something to run in C++.
You are building something to run on some hardware.
Your language is an abstraction of the HW.
If you need it to run fast, build with the HW in mind.
END