Performance Tuning for CPU
Part 3: IPC and friends
Marat Dukhan
Last week material
Scary pictures are over...
This week material
...scary movies come into play
A Victorian era scientist and his assistant take a test run in their Iron Mole drilling machine and end up in a strange underground labyrinth ruled by a species of giant telepathic bird and full of prehistoric monsters and cavemen.
Image from Wikipedia Commons
Assembly 101: Intro
Assembly 101: Instructions
Few examples:
Assembly 101: Registers
You may think of registers as special variables inside the processor core
Assembly 101: GP Registers
Some examples
Assembly 101: GP Registers
Some more examples
Any instruction which overwrites the low part of GP register implicitly zeros the high part
Assembly 101: GP Registers
Assembly 101: GP Registers
Which code is faster?
double f(const double* x, int n) {
double s = 0.0;
for (int i = 0; i < n; i++) {
s += x[i * (i + 1) / 2];
}
return s;
}
double f(const double* x, unsigned int n) {
double s = 0.0;
for (unsigned int i = 0; i < n; i++) {
s += x[i * (i + 1) / 2];
}
return s;
}
Assembly 101: GP Registers
Which code is faster?
int f(const int* x, int n) {
int s = 0;
for (int i = 0; i < n; i++) {
s += x[i * (i + 1) / 2];
}
return s;
}
int f(const int*, int):
xor eax, eax
test esi, esi
jle .L2
xor edx, edx
.L3:
lea ecx, [rdx+1]
imul edx, ecx
sar edx
movsx rdx, edx
add eax, DWORD PTR [rdi+rdx*4]
cmp ecx, esi
mov edx, ecx
jne .L3
.L2:
ret
int g(const int* x, unsigned int n) {
int s = 0;
for (unsigned int i = 0; i < n; i++) {
s += x[i * (i + 1) / 2];
}
return s;
}
int g(const int* x, unsigned int n):
xor eax, eax
test esi, esi
je .L8
xor edx, edx
.L3:
lea ecx, [rdx+1]
imul edx, ecx
shr edx
add eax, DWORD PTR [rdi+rdx*4]
cmp ecx, esi
mov edx, ecx
jne .L9
.L2:
ret
Assembly 101: GP Registers
Which code is faster?
int f(const int* x, int n) {
int s = 0;
for (int i = 0; i < n; i++) {
s += x[i * (i + 1) / 2];
}
return s;
}
int f(const int*, int):
xor eax, eax
test esi, esi
jle .L2
xor edx, edx
.L3:
lea ecx, [rdx+1]
imul edx, ecx
sar edx
movsx rdx, edx
add eax, DWORD PTR [rdi+rdx*4]
cmp ecx, esi
mov edx, ecx
jne .L3
.L2:
ret
int g(const int* x, unsigned int n) {
int s = 0;
for (unsigned int i = 0; i < n; i++) {
s += x[i * (i + 1) / 2];
}
return s;
}
int g(const int* x, unsigned int n):
xor eax, eax
test esi, esi
je .L8
xor edx, edx
.L3:
lea ecx, [rdx+1]
imul edx, ecx
shr edx
add eax, DWORD PTR [rdi+rdx*4]
cmp ecx, esi
mov edx, ecx
jne .L9
.L2:
ret
Sign-extend 32-bit register EDX into 64-bit register RDX
Unsigned extending of 32-bit register EDX into 64-bit register RDX should be here. But the high 32 bits of RDX are already zeroes
Assembly 101: GP Registers
Which code is faster?
int f(const int* x, int n) {
int s = 0;
for (int i = 0; i < n; i++) {
s += x[i * (i + 1) / 2];
}
return s;
}
int g(const int* x, unsigned int n) {
int s = 0;
for (unsigned int i = 0; i < n; i++) {
s += x[i * (i + 1) / 2];
}
return s;
}
Further thoughts:
This one!
Assembly 101: XMM Registers
Some examples
Assembly 101: XMM Registers
SIMD floating-point instructions operate on vectors of floating-point numbers in an XMM register
Assembly 101: XMM Registers
Scalar floating-point instructions operate only on the low floating-point number in an XMM register
Assembly 101: XMM Instructions
Note the common pattern:
Assembly 101: XMM Instructions
void vector_max(const double*, double*, size_t):
cmp rdx, 1
push rbx
movapd xmm0, XMMWORD PTR .LC0[rip]
mov rbx, rsi
jbe .L2
mov rcx, rdx
mov rax, rdi
.L3:
movupd xmm1, XMMWORD PTR [rax]
sub rcx, 2
add rax, 16
cmp rcx, 1
maxpd xmm0, xmm1
ja .L3
lea rax, [rdx-2]
and edx, 1
shr rax
add rax, 1
sal rax, 4
add rdi, rax
.L2:
movapd xmm1, xmm0
test rdx, rdx
unpckhpd xmm1, xmm0
maxsd xmm0, xmm1
je .L4
movsd xmm1, QWORD PTR [rdi]
call fmax
.L4:
movsd QWORD PTR [rbx], xmm0
pop rbx
ret
Is this code vectorized?
Assembly 101: XMM Instructions
void vector_max(const double*, double*, size_t):
cmp rdx, 1
push rbx
movapd xmm0, XMMWORD PTR .LC0[rip]
mov rbx, rsi
jbe .L2
mov rcx, rdx
mov rax, rdi
.L3:
movupd xmm1, XMMWORD PTR [rax]
sub rcx, 2
add rax, 16
cmp rcx, 1
maxpd xmm0, xmm1
ja .L3
lea rax, [rdx-2]
and edx, 1
shr rax
add rax, 1
sal rax, 4
add rdi, rax
.L2:
movapd xmm1, xmm0
test rdx, rdx
unpckhpd xmm1, xmm0
maxsd xmm0, xmm1
je .L4
movsd xmm1, QWORD PTR [rdi]
call fmax
.L4:
movsd QWORD PTR [rbx], xmm0
pop rbx
ret
Is this code vectorized?
Aha!
MAXPD
Assembly 101: RFLAGS Register
Special 64-bit register
Assembly 101: RFLAGS Register
Some more examples
Assembly 101: Conditional clauses
x86 provides 3 types of conditional instructions
Assembly 101: Conditional clauses
Asm code:
.Label:
Some asm instructions
...
TEST RAX, RAX
JNZ .Label
Equivalent C code:
do {
Some C statements
...
} while (RAX != 0)
Assembly 101: Stack
Stack as abstract data structure
Assembly 101: Machine Stack
Machine stack components
Assembly 101: Functions
the machine stack
instructions at the address
in the popped element
Assembly 101: Functions
Function call convention (Linux, Mac OS)
Assembly 101: Example
double vector_sum(
const double* data,
size_t length)
{
double sum = 0.0;
for (; length != 0; length -= 1) {
// Load element and add to sum
sum += *data;
// Advance array pointer
data += 1;
}
return sum;
}
Assembly 101: Example
double vector_sum(
const double* data,
size_t length)
{
double sum = 0.0;
for (; length != 0; length -= 1) {
// Load element and add to sum
sum += *data;
// Advance array pointer
data += 1;
}
return sum;
}
double vector_sum(
const double* data,
size_t length):
TEST RSI, RSI
XORPD XMM0, XMM0
JE .L2
.L3:
ADDSD XMM0, [RDI]
ADD RDI, 8
SUB RSI, 1
JNE .L3
.L2:
REP RET
Assembly 101
Questions?
Microarchitecture 101
What is inside the "Intel inside"?
Images from intel.com and Wikipedia Commons
Image from adisney.go.com/disneypictures/aliceinwonderland/
Images from adisney.go.com/disneypictures/aliceinwonderland/ and pressroom-se.intel.com
Microarchitecture 101: Intro
Most modern processors are built on pipelined superscalar speculative out-of-order designs
???
Microarchitecture 101: Intro
Most modern processors are built on pipelined superscalar speculative out-of-order designs
Microarchitecture 101: Intro
Most modern processors are built on pipelined superscalar speculative out-of-order designs
Image from arm.com
Microarchitecture 101: Intro
Most modern processors are built on pipelined superscalar speculative out-of-order designs
Microarchitecture 101: Intro
Most modern processors are built on pipelined superscalar speculative out-of-order designs
Microarchitecture 101: Intro
Most modern processors are built on pipelined superscalar speculative out-of-order designs
Microarchitecture 101: Intro
Most modern processors are built on pipelined superscalar speculative out-of-order designs
Microarchitecture 101: OoO
How to execute instructions out of order?
The general approach includes three phases:
Microarchitecture 101: OoO
But the reality is not that simple...
Images from the article www.realworldtech.com/nehalem/ which I advice you to read
Microarchitecture 101: OoO
But the reality is not that simple...
What if...
Microarchitecture 101: OoO
But the reality is not that simple...
What if...
Microarchitecture 101: Summary
Image from adisney.go.com/disneypictures/aliceinwonderland/