1 of 45

Performance Tuning for CPU

Part 3: IPC and friends

Marat Dukhan

2 of 45

Last week material

Scary pictures are over...

3 of 45

This week material

...scary movies come into play

A Victorian era scientist and his assistant take a test run in their Iron Mole drilling machine and end up in a strange underground labyrinth ruled by a species of giant telepathic bird and full of prehistoric monsters and cavemen.

Image from Wikipedia Commons

4 of 45

Assembly 101: Intro

  • The processor can execute only binary code
  • The first programs back in the 1950s were written directly in machine code
  • We can still read, modify, and write machine code in a hex editor, but this soon gets too complicated
  • Assembly is a slightly humanized version of the machine code
  • Assembly code does not map one-to-one to machine code, but it is close

5 of 45

Assembly 101: Instructions

  • Assembly instructions are architecture-dependent
    • In this lecture we will only consider x86-64 assembly
  • But still, there are two dialects of x86-64 assembly: Intel syntax, and AT&T syntax
    • In this lecture we will use Intel syntax

A few examples:

  • ADD x, y
    • x = x + y
  • DEC x
    • x = x - 1
  • Note the common features: instructions write to the first argument, and overwrite its input value
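As a rough illustration (my own hypothetical example, not from the slides), here is how a compiler might map simple C updates onto these two-operand instructions; the register assignment assumes the Linux/Mac OS calling convention introduced later in the lecture:

/* Hypothetical sketch: x arrives in RDI, y in RSI (assumed calling convention) */
long update(long x, long y) {
    x += y;     /* ADD RDI, RSI   -- the first operand is both input and output */
    x -= 1;     /* DEC RDI        -- or equivalently SUB RDI, 1 */
    return x;   /* MOV RAX, RDI, then RET */
}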

6 of 45

Assembly 101: Registers

You may think of registers as special variables inside the processor core

  • Registers are the fastest kind of memory inside the processor
  • But even more important is that computational instructions work with registers
    • On x86 at most one of the instruction arguments can be in memory
  • There is only a fixed number of registers
    • Each register has a name in assembly
    • Register numbers are hard-coded in machine code

7 of 45

Assembly 101: GP Registers

  • 16 general-purpose registers
  • Each 64-bit wide
    • But the low 32 bits also have names
  • Used for integer computations

Some examples

  • ADD RAX, RBX
    • RAX = RAX + RBX
  • ADD RAX, -1
  • DEC RAX
    • RAX = RAX - 1

8 of 45

Assembly 101: GP Registers

Some more examples

  • ADD EAX, EBX
    • EAX = EAX + EBX

Any instruction which writes the low 32-bit part of a GP register implicitly zeros the high 32 bits

  • ADD RAX, EBX
    • Error: arguments must have the same size
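A small C sketch (my own illustration) of why this implicit zeroing is useful: converting an unsigned 32-bit value to 64 bits costs no extra instruction, because writing the 32-bit register already cleared the high half.

#include <stdint.h>

/* Illustrative: the 32-bit argument already sits in a 32-bit register with the
   high half of the corresponding 64-bit register zeroed, so the compiler can
   return it with a plain 32-bit MOV and no explicit zero-extension. */
uint64_t widen(uint32_t x) {
    return (uint64_t) x;
}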

9 of 45

Assembly 101: GP Registers

  • 64-bit general-purpose registers can be used to indirectly access memory
  • MOV EAX, [RBX]
    • Load the 32-bit integer from the address in register RBX into register EAX
    • EAX = *RBX
    • [RBX] means the memory at the address held in register RBX
  • MOV [R12], RDI
    • Store the 64-bit integer in register RDI to the address in register R12
    • *R12 = RDI
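For reference, a hedged C equivalent of these two memory accesses (the pointer types are mine, chosen to match the operand sizes; the registers a compiler actually picks follow the calling convention, not necessarily RBX and R12):

#include <stdint.h>

/* load a 32-bit integer through a pointer: a MOV EAX, [reg]-style load */
int32_t load32(const int32_t* p) {
    return *p;
}

/* store a 64-bit integer through a pointer: a MOV [reg], reg-style store */
void store64(int64_t* p, int64_t value) {
    *p = value;
}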

10 of 45

Assembly 101: GP Registers

Which code is faster?

double f(const double* x, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        s += x[i * (i + 1) / 2];
    }
    return s;
}

double f(const double* x, unsigned int n) {
    double s = 0.0;
    for (unsigned int i = 0; i < n; i++) {
        s += x[i * (i + 1) / 2];
    }
    return s;
}

11 of 45

Assembly 101: GP Registers

Which code is faster?

int f(const int* x, int n) {
    int s = 0;
    for (int i = 0; i < n; i++) {
        s += x[i * (i + 1) / 2];
    }
    return s;
}

int f(const int*, int):
    xor     eax, eax
    test    esi, esi
    jle     .L2
    xor     edx, edx
.L3:
    lea     ecx, [rdx+1]
    imul    edx, ecx
    sar     edx
    movsx   rdx, edx
    add     eax, DWORD PTR [rdi+rdx*4]
    cmp     ecx, esi
    mov     edx, ecx
    jne     .L3
.L2:
    ret

int g(const int* x, unsigned int n) {
    int s = 0;
    for (unsigned int i = 0; i < n; i++) {
        s += x[i * (i + 1) / 2];
    }
    return s;
}

int g(const int* x, unsigned int n):
    xor     eax, eax
    test    esi, esi
    je      .L8
    xor     edx, edx
.L9:
    lea     ecx, [rdx+1]
    imul    edx, ecx
    shr     edx
    add     eax, DWORD PTR [rdi+rdx*4]
    cmp     ecx, esi
    mov     edx, ecx
    jne     .L9
.L8:
    ret

12 of 45

Assembly 101: GP Registers

Which code is faster?

int f(const int* x, int n) {
    int s = 0;
    for (int i = 0; i < n; i++) {
        s += x[i * (i + 1) / 2];
    }
    return s;
}

int f(const int*, int):
    xor     eax, eax
    test    esi, esi
    jle     .L2
    xor     edx, edx
.L3:
    lea     ecx, [rdx+1]
    imul    edx, ecx
    sar     edx
    movsx   rdx, edx
    add     eax, DWORD PTR [rdi+rdx*4]
    cmp     ecx, esi
    mov     edx, ecx
    jne     .L3
.L2:
    ret

int g(const int* x, unsigned int n) {
    int s = 0;
    for (unsigned int i = 0; i < n; i++) {
        s += x[i * (i + 1) / 2];
    }
    return s;
}

int g(const int* x, unsigned int n):
    xor     eax, eax
    test    esi, esi
    je      .L8
    xor     edx, edx
.L9:
    lea     ecx, [rdx+1]
    imul    edx, ecx
    shr     edx
    add     eax, DWORD PTR [rdi+rdx*4]
    cmp     ecx, esi
    mov     edx, ecx
    jne     .L9
.L8:
    ret

MOVSX here sign-extends the 32-bit register EDX into the 64-bit register RDX

A zero-extension of the 32-bit register EDX into the 64-bit register RDX would be needed at the same spot, but the high 32 bits of RDX are already zero (writing EDX cleared them), so no instruction is emitted

13 of 45

Assembly 101: GP Registers

Which code is faster?

int f(const int* x, int n) {
    int s = 0;
    for (int i = 0; i < n; i++) {
        s += x[i * (i + 1) / 2];
    }
    return s;
}

int g(const int* x, unsigned int n) {
    int s = 0;
    for (unsigned int i = 0; i < n; i++) {
        s += x[i * (i + 1) / 2];
    }
    return s;
}

Further thoughts:

  • Unsigned 32-bit indices are faster than signed
    • Other options to consider: size_t, ptrdiff_t (see the sketch below)
  • In Java only 32-bit signed ints can be used for indexing

This one (the unsigned version g) is faster!
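A quick sketch of the other option mentioned above (my own example, not from the slides): with a size_t index the expression i * (i + 1) / 2 is computed directly in a 64-bit register, so no widening instruction is needed before the address calculation.

#include <stddef.h>

int h(const int* x, size_t n) {
    int s = 0;
    for (size_t i = 0; i < n; i++) {
        /* the index expression is already 64-bit, so it can be used in the
           addressing mode without any sign- or zero-extension */
        s += x[i * (i + 1) / 2];
    }
    return s;
}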

14 of 45

Assembly 101: XMM Registers

  • 16 SSE registers
  • Each 128-bit wide
  • Used for floating-point and SIMD computations
    • __m128i, __m128d, __m128 types map to these registers

Some examples

  • XORPD XMM0, XMM0
    • XMM0 = XMM0 ^ XMM0 = 0
  • ADDPD XMM4, XMM9
    • XMM4 = _mm_add_pd(XMM4, XMM9)
  • MOVAPD XMM2, [RCX]
    • XMM2 = _mm_load_pd(RCX)
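To connect these instructions back to C intrinsics, a minimal sketch (my own example); each intrinsic below corresponds to one of the instructions above:

#include <emmintrin.h>

/* typically compiled to XORPD xmm, xmm */
__m128d make_zero(void) {
    return _mm_setzero_pd();
}

/* load two doubles (p is assumed 16-byte aligned, as MOVAPD requires)
   and add them element-wise to an accumulator */
__m128d accumulate(const double* p, __m128d acc) {
    __m128d v = _mm_load_pd(p);   /* MOVAPD xmm, [reg] */
    return _mm_add_pd(acc, v);    /* ADDPD  xmm, xmm   */
}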

15 of 45

Assembly 101: XMM Registers

SIMD floating-point instructions operate on vectors of floating-point numbers in an XMM register

  • ADDPD XMM3, XMM5
    • XMM3 = _mm_add_pd(XMM3, XMM5)

16 of 45

Assembly 101: XMM Registers

Scalar floating-point instructions operate only on the low floating-point number in an XMM register

  • ADDSD XMM3, XMM5
    • Low element of XMM3 = low element of XMM3 + low element of XMM5; the high element is unchanged
    • XMM3 = _mm_add_sd(XMM3, XMM5)

17 of 45

Assembly 101: XMM Instructions

Note the common pattern:

  • <OP>PD
    • Packed (SIMD) double-precision
    • E.g. MULPD, SUBPD, ANDPD, DIVPD
  • <OP>SD
    • Scalar double-precision
    • E.g. MULSD, SUBSD, DIVSD, SQRTSD
  • <OP>PS
    • Packed (SIMD) single-precision
    • E.g. MULPS, SUBPS, ANDPS, DIVPS
  • <OP>SS
    • Scalar single-precision
    • E.g. MULSS, SUBSS, DIVSS, SQRTSS
  • How about intrinsics? _mm_mul_pd (!)
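The same naming pattern carries over to the intrinsics; a small illustrative snippet (my own) with the four multiply variants:

#include <emmintrin.h>   /* SSE2; also pulls in the SSE single-precision intrinsics */

void multiply_variants(__m128d a, __m128d b, __m128 c, __m128 d,
                       __m128d out_d[2], __m128 out_s[2]) {
    out_d[0] = _mm_mul_pd(a, b);  /* MULPD: packed double-precision */
    out_d[1] = _mm_mul_sd(a, b);  /* MULSD: scalar double-precision (low element only) */
    out_s[0] = _mm_mul_ps(c, d);  /* MULPS: packed single-precision */
    out_s[1] = _mm_mul_ss(c, d);  /* MULSS: scalar single-precision (low element only) */
}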

18 of 45

Assembly 101: XMM Instructions

void vector_max(const double*, double*, size_t):
    cmp      rdx, 1
    push     rbx
    movapd   xmm0, XMMWORD PTR .LC0[rip]
    mov      rbx, rsi
    jbe      .L2
    mov      rcx, rdx
    mov      rax, rdi
.L3:
    movupd   xmm1, XMMWORD PTR [rax]
    sub      rcx, 2
    add      rax, 16
    cmp      rcx, 1
    maxpd    xmm0, xmm1
    ja       .L3
    lea      rax, [rdx-2]
    and      edx, 1
    shr      rax
    add      rax, 1
    sal      rax, 4
    add      rdi, rax
.L2:
    movapd   xmm1, xmm0
    test     rdx, rdx
    unpckhpd xmm1, xmm0
    maxsd    xmm0, xmm1
    je       .L4
    movsd    xmm1, QWORD PTR [rdi]
    call     fmax
.L4:
    movsd    QWORD PTR [rbx], xmm0
    pop      rbx
    ret

Is this code vectorized?

19 of 45

Assembly 101: XMM Instructions

void vector_max(const double*, double*, size_t):
    cmp      rdx, 1
    push     rbx
    movapd   xmm0, XMMWORD PTR .LC0[rip]
    mov      rbx, rsi
    jbe      .L2
    mov      rcx, rdx
    mov      rax, rdi
.L3:
    movupd   xmm1, XMMWORD PTR [rax]
    sub      rcx, 2
    add      rax, 16
    cmp      rcx, 1
    maxpd    xmm0, xmm1
    ja       .L3
    lea      rax, [rdx-2]
    and      edx, 1
    shr      rax
    add      rax, 1
    sal      rax, 4
    add      rdi, rax
.L2:
    movapd   xmm1, xmm0
    test     rdx, rdx
    unpckhpd xmm1, xmm0
    maxsd    xmm0, xmm1
    je       .L4
    movsd    xmm1, QWORD PTR [rdi]
    call     fmax
.L4:
    movsd    QWORD PTR [rbx], xmm0
    pop      rbx
    ret

Is this code vectorized?

Aha! MAXPD: the main loop is vectorized.
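For orientation, here is a plausible C source that could produce assembly of this shape (my reconstruction, not taken from the slides; it assumes GCC at -O3 with -ffast-math, so that the fmax reduction can be vectorized into MAXPD while a scalar fmax call remains for the odd trailing element):

#include <math.h>
#include <stddef.h>

void vector_max(const double* x, double* result, size_t n) {
    double m = -HUGE_VAL;   /* assumed starting value, loaded as the .LC0 constant */
    for (size_t i = 0; i < n; i++) {
        m = fmax(m, x[i]);
    }
    *result = m;
}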

20 of 45

Assembly 101: RFLAGS Register

Special 64-bit register

  • Keeps the result of compare operations
    • CMP EAX, 10
      • Subtract 10 from EAX
      • Set the flags according to result
        • Set zero flag if result is 0, i.e. EAX == 10
      • Discard the result
    • And lots of other stuff we don't care about
  • Arithmetic operations also change flags
    • SUB RCX, 10
      • Has the same effect on flags as the CMP

21 of 45

Assembly 101: RFLAGS Register

Some more examples

  • AND EAX, 0x80000000
    • 0x80000000 = 1 << 31
    • Bitwise AND EAX with 0x80000000
    • Set the flags according to result
      • Set zero flag if result is 0, i.e. bit 31 of EAX is 0
    • And lots of other stuff we don't care about
  • TEST RDX, RDX
    • Same effect on flags as AND RDX, RDX
      • Sets zero flag if (RDX & RDX) == 0
        • That is, if RDX == 0
    • Unlike AND, the result is discarded

22 of 45

Assembly 101: Conditional clauses

x86 provides 3 types of conditional instructions; in each case the result of the operation depends on RFLAGS (e.g. do something if the Zero flag is (not) set)

  • Set instructions
    • Set a (byte) register to 1 if the condition is true, to 0 otherwise
  • Conditional moves
    • CMOVZ RDX, R10
      • If the Zero flag is set, copy the value in R10 to RDX
  • Conditional jumps
    • JNZ Label
      • If the Zero flag is not set, continue execution at Label
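A tiny C sketch (my own) of where these conditional forms show up: a compiler may turn the first function into a compare plus conditional move, and the second into a compare plus conditional jump (the exact choice is up to the compiler).

/* candidate for CMP/TEST + CMOVcc: branch-free selection */
long select_value(long a, long b, long flag) {
    return flag != 0 ? a : b;
}

/* candidate for CMP + Jcc: jump over the assignment when a >= 0 */
long clamp_negative(long a) {
    if (a < 0) {
        a = 0;
    }
    return a;
}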

23 of 45

Assembly 101: Conditional clauses

Asm code:

.Label:
    Some asm instructions
    ...
    TEST RAX, RAX
    JNZ .Label

Equivalent C code:

do {
    Some C statements
    ...
} while (RAX != 0);
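A concrete instance of this pattern (an illustrative example of my own): a countdown loop whose compiled loop tail may look just like the sequence above, with n playing the role of RAX.

void countdown(long n, void (*body)(long)) {
    do {
        body(n);      /* "Some C statements" */
        n -= 1;       /* SUB reg, 1 already sets the Zero flag... */
    } while (n != 0); /* ...so the compiler may emit just a JNZ back to the label */
}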

24 of 45

Assembly 101: Stack

Stack as abstract data structure

  • LIFO: Last In - First Out
  • We can only work with the top of the stack
    • PUSH operation adds new element to the top of the stack
    • POP operation removes the top element
  • Most processor architectures have a hardware-supported stack for every thread

25 of 45

Assembly 101: Machine Stack

Machine stack components

  • Chunk of memory
    • Stack grows downwards
  • 64-bit general-purpose register RSP which holds the address of the stack top
    • SP for stack pointer
  • PUSH and POP instructions
    • PUSH RDI
      • SUB RSP, 8
      • MOV [RSP], RDI
    • POP RAX
      • MOV RAX, [RSP]
      • ADD RSP, 8

26 of 45

Assembly 101: Functions

  • CALL Label instruction
    • Pushes the address of the next instruction on the stack
    • Executes instructions at Label
  • RET instruction
    • Pops the top element from the machine stack
    • Continues execution at the address in the popped element
  • Stack may also contain:
    • Parameters for the function
    • Local variables of the function

27 of 45

Assembly 101: Functions

Function calling convention (Linux, Mac OS)

  • Integer parameters for a function are passed in RDI, RSI, RDX, RCX, R8, R9
    • The first parameter in RDI
    • The second in RSI
    • If there are more than 6 integer parameters, the rest are passed on the machine stack
  • Function returns a value in a register
    • In RAX if the return value is integer
    • In XMM0 if the return value is floating-point
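An illustrative sketch (my own, assuming the Linux/Mac OS convention described above, i.e. the System V AMD64 ABI) of how parameters and return values map to registers:

/* a arrives in RDI, b in RSI, c in RDX; the integer result is returned in RAX */
long sum3(long a, long b, long c) {
    return a + b + c;
}

/* floating-point arguments are passed in XMM registers (x in XMM0, y in XMM1);
   the double result is returned in XMM0 */
double average2(double x, double y) {
    return (x + y) * 0.5;
}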

28 of 45

Assembly 101: Example

double vector_sum(
    const double* data,
    size_t length)
{
    double sum = 0.0;
    for (; length != 0; length -= 1) {
        // Load element and add to sum
        sum += *data;
        // Advance array pointer
        data += 1;
    }
    return sum;
}

29 of 45

Assembly 101: Example

double vector_sum(
    const double* data,
    size_t length)
{
    double sum = 0.0;
    for (; length != 0; length -= 1) {
        // Load element and add to sum
        sum += *data;
        // Advance array pointer
        data += 1;
    }
    return sum;
}

double vector_sum(
    const double* data,
    size_t length):

    TEST  RSI, RSI      ; length == 0?
    XORPD XMM0, XMM0    ; sum = 0.0
    JE    .L2           ; if length == 0, skip the loop
.L3:
    ADDSD XMM0, [RDI]   ; sum += *data
    ADD   RDI, 8        ; data += 1 (a double is 8 bytes)
    SUB   RSI, 1        ; length -= 1
    JNE   .L3           ; repeat while length != 0
.L2:
    REP RET             ; return (sum is in XMM0)

30 of 45

Assembly 101

Questions?

31 of 45

Microarchitecture 101

What is inside the "Intel inside"?

Images from intel.com and Wikipedia Commons

32 of 45

Image from adisney.go.com/disneypictures/aliceinwonderland/

33 of 45

Images from adisney.go.com/disneypictures/aliceinwonderland/ and pressroom-se.intel.com

34 of 45

Microarchitecture 101: Intro

Most modern processors are built on pipelined superscalar speculative out-of-order designs

???

35 of 45

Microarchitecture 101: Intro

Most modern processors are built on pipelined superscalar speculative out-of-order designs

  • Pipelined
    • Instructions are not executed entirely in a single cycle
    • Instead execution of an instruction is split into many small stages
      • Each stage is small enough to run in one cycle
    • Consider 3 GHz CPU with 15-stages pipeline
      • If it did not start executing a new instruction until the previous one had finished, it would effectively be a 200 MHz CPU

36 of 45

Microarchitecture 101: Intro

Most modern processors are built on pipelined superscalar speculative out-of-order designs

  • Pipelined
    • Example: ARM Cortex-A15 pipeline

Image from arm.com

37 of 45

Microarchitecture 101: Intro

Most modern processors are built on pipelined superscalar speculative out-of-order designs

  • Superscalar
    • Can run more than one instruction per cycle (IPC)
      • Intel Pentium, Intel Atom: 2 IPC
      • Intel Pentium 4, Intel Pentium M: 3 IPC
      • Intel Core 2 and later: 5 IPC
      • AMD K8, K10: 3 IPC
      • AMD Bulldozer: 4 IPC
      • ARM Cortex-A7, Cortex-A8, Cortex-A9: 2 IPC
      • ARM Cortex-A15, Qualcomm Krait: 3 IPC
      • IBM Power 7: 6 IPC
      • Intel Itanium (Poulson core): 12 IPC (!)
        • But not out-of-order

38 of 45

Microarchitecture 101: Intro

Most modern processors are built on pipelined superscalar speculative out-of-order designs

  • Out-of-order
    • Can execute the instructions in a different order than they appear in the program
      • But the program must still operate as if they were executed in order
    • All desktop processors and some mobile processors are capable of out-of-order execution
      • Exceptions: Intel Atom, ARM Cortex-A5/A7/A8
      • Intel Nehalem, the CPU inside Jinx, is capable of out-of-order execution

39 of 45

Microarchitecture 101: Intro

Most modern processors are built on pipelined superscalar speculative out-of-order designs

  • Speculative
    • If the CPU is unsure about something, it makes a guess and uses this guess when executing instructions
      • If the guess happens to be wrong, the CPU rolls back all actions which depended on it
    • Branch prediction
      • The processor tries to predict whether the conditional jump will transfer execution to a label
    • Load address prediction
      • If the address used to load data is not yet computed, the processor tries to guess it

40 of 45

Microarchitecture 101: Intro

Most modern processors are built on pipelined superscalar speculative out-of-order designs

  • Speculative
    • If the CPU is unsure about something, it makes a guess and uses this guess when executing instructions
      • If the guess happens to be wrong, the CPU rolls back all actions which depended on it
      • Rollback is expensive
        • In terms of performance: all results which depended on the wrong guess must be recomputed
        • In terms of power: energy was spent computing results which are now simply thrown away
    • Can we help the CPU make the right guesses?
      • Yes! Do what the CPU expects you to do ☺

41 of 45

Microarchitecture 101: OoO

How to execute instructions out of order?

The general approach includes three phases:

  • Decode instructions as they appear in the program
    • Detect dependencies between instructions
      • Attempt to remove false dependencies
  • Once all the inputs of an instruction are ready, we can compute its result
    • And store it in a special buffer
    • And we can compute results in any order (!)
  • In the order the instructions appear in the program, take their computed results from the special buffers and update the permanent state of the processor

42 of 45

Microarchitecture 101: OoO

But the reality is not that simple...

Images from the article www.realworldtech.com/nehalem/ which I advise you to read

43 of 45

Microarchitecture 101: OoO

But the reality is not that simple...

What if...

  • ...an instruction needs a recently computed value which is not yet written to the permanent state?
    • No problem, it can get it!
  • ...a memory load wants a value which was recently stored, but the store is not yet written to the permanent state and still resides in a store buffer?
    • No problem, it can get it!
  • ...instructions from different loop iterations have their operands ready?
    • They can issue in parallel in the same cycle (see the sketch below)
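One way to hand the out-of-order core independent work from different iterations is to split a reduction across several accumulators, so the additions do not form one long dependency chain. A minimal sketch (my own example, not from the slides; note that reassociating floating-point additions can change rounding slightly):

#include <stddef.h>

double sum_two_accumulators(const double* x, size_t n) {
    double s0 = 0.0, s1 = 0.0;
    size_t i = 0;
    /* the two += chains are independent, so additions from
       consecutive iterations can execute in parallel */
    for (; i + 1 < n; i += 2) {
        s0 += x[i];
        s1 += x[i + 1];
    }
    if (i < n) {        /* leftover element when n is odd */
        s0 += x[i];
    }
    return s0 + s1;
}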

44 of 45

Microarchitecture 101: OoO

But the reality is not that simple...

What if...

  • ...we do a function call?
    • No problem, the CPU will predict in advance that execution will continue at the function location!
  • ...we do a function call via a pointer?
    • No problem, if the pointer does not change it will have the same performance as an ordinary function call!
  • ...a memory load results in a cache miss?
    • No problem, subsequent independent instructions can continue execution while the cache line is being loaded!

45 of 45

Microarchitecture 101: Summary

  • "But I don't want to go among mad people," Alice remarked.
  • "Oh, you can't help that," said the Cat: "We're all mad here. I'm mad. You're mad."
  • "How do you know I'm mad?" said Alice.
  • "You must be," said the Cat, "otherwise you wouldn't have come here."

Image from adisney.go.com/disneypictures/aliceinwonderland/