1 of 45

Performance Tuning for CPU

Part 3: IPC and friends

Marat Dukhan

2 of 45

Last week material

Scary pictures are over...

3 of 45

This week material

...scary movies come into play

A Victorian era scientist and his assistant take a test run in their Iron Mole drilling machine and end up in a strange underground labyrinth ruled by a species of giant telepathic bird and full of prehistoric monsters and cavemen.

Image from Wikipedia Commons

4 of 45

Assembly 101: Intro

  • The processor can execute only binary code
  • The first programs back in the 1950s were written directly in machine code
  • We can still read, modify, and write machine code in a hex editor, but this soon gets too complicated
  • Assembly is a slightly humanized version of the machine code
  • Assembly code does not map one-to-one to machine code, but it is close

5 of 45

Assembly 101: Instructions

  • Assembly instructions are architecture-dependent
    • In this lecture we will only consider x86-64 assembly
  • But still, there are two dialects of x86-64 assembly: Intel syntax, and AT&T syntax
    • In this lecture we will use Intel syntax

A few examples:

  • ADD x, y
    • x = x + y
  • DEC x
    • x = x - 1
  • Note the common features: instructions write to the first argument, and overwrite its input value
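As a rough illustration (my own hypothetical example, not from the slides), here is how a compiler might map simple C updates onto these two-operand instructions; the register assignment assumes the Linux/Mac OS calling convention introduced later in the lecture:

/* Hypothetical sketch: x arrives in RDI, y in RSI (assumed calling convention) */
long update(long x, long y) {
    x += y;     /* ADD RDI, RSI   -- the first operand is both input and output */
    x -= 1;     /* DEC RDI        -- or equivalently SUB RDI, 1 */
    return x;   /* MOV RAX, RDI, then RET */
}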

6 of 45

Assembly 101: Registers

You may think of registers as special variables inside the processor core

  • Registers are the fastest kind of memory inside the processor
  • But even more important is that computational instructions work with registers
    • On x86 at most one of the instruction arguments can be in memory
  • There is only a fixed number of registers
    • Each register has a name in assembly
    • Register numbers are hard-coded in machine code

7 of 45

Assembly 101: GP Registers

  • 16 general-purpose registers
  • Each 64-bit wide
    • But the low 32 bits also have names
  • Used for integer computations

Some examples

  • ADD RAX, RBX
    • RAX = RAX + RBX
  • ADD RAX, -1
  • DEC RAX
    • RAX = RAX - 1

8 of 45

Assembly 101: GP Registers

Some more examples

  • ADD EAX, EBX
    • EAX = EAX + EBX

Any instruction which writes the low 32-bit part of a GP register implicitly zeros the high 32 bits

  • ADD RAX, EBX
    • Error: arguments must have the same size
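A small C sketch (my own illustration) of why this implicit zeroing is useful: converting an unsigned 32-bit value to 64 bits costs no extra instruction, because writing the 32-bit register already cleared the high half.

#include <stdint.h>

/* Illustrative: the 32-bit argument already sits in a 32-bit register with the
   high half of the corresponding 64-bit register zeroed, so the compiler can
   return it with a plain 32-bit MOV and no explicit zero-extension. */
uint64_t widen(uint32_t x) {
    return (uint64_t) x;
}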

9 of 45

Assembly 101: GP Registers

  • 64-bit general-purpose registers can be used to indirectly access memory
  • MOV EAX, [RBX]
    • Load the 32-bit integer from the address in register RBX into register EAX
    • EAX = *RBX
    • [RBX] means the memory at the address held in register RBX
  • MOV [R12], RDI
    • Store the 64-bit integer in register RDI to the address in register R12
    • *R12 = RDI
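For reference, a hedged C equivalent of these two memory accesses (the pointer types are mine, chosen to match the operand sizes; the registers a compiler actually picks follow the calling convention, not necessarily RBX and R12):

#include <stdint.h>

/* load a 32-bit integer through a pointer: a MOV EAX, [reg]-style load */
int32_t load32(const int32_t* p) {
    return *p;
}

/* store a 64-bit integer through a pointer: a MOV [reg], reg-style store */
void store64(int64_t* p, int64_t value) {
    *p = value;
}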

10 of 45

Assembly 101: GP Registers

Which code is faster?

double f(const double* x, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        s += x[i * (i + 1) / 2];
    }
    return s;
}

double f(const double* x, unsigned int n) {
    double s = 0.0;
    for (unsigned int i = 0; i < n; i++) {
        s += x[i * (i + 1) / 2];
    }
    return s;
}

11 of 45

Assembly 101: GP Registers

Which code is faster?

int f(const int* x, int n) {
    int s = 0;
    for (int i = 0; i < n; i++) {
        s += x[i * (i + 1) / 2];
    }
    return s;
}

int f(const int*, int):
    xor     eax, eax
    test    esi, esi
    jle     .L2
    xor     edx, edx
.L3:
    lea     ecx, [rdx+1]
    imul    edx, ecx
    sar     edx
    movsx   rdx, edx
    add     eax, DWORD PTR [rdi+rdx*4]
    cmp     ecx, esi
    mov     edx, ecx
    jne     .L3
.L2:
    ret

int g(const int* x, unsigned int n) {
    int s = 0;
    for (unsigned int i = 0; i < n; i++) {
        s += x[i * (i + 1) / 2];
    }
    return s;
}

int g(const int* x, unsigned int n):
    xor     eax, eax
    test    esi, esi
    je      .L8
    xor     edx, edx
.L9:
    lea     ecx, [rdx+1]
    imul    edx, ecx
    shr     edx
    add     eax, DWORD PTR [rdi+rdx*4]
    cmp     ecx, esi
    mov     edx, ecx
    jne     .L9
.L8:
    ret

12 of 45

Assembly 101: GP Registers

Which code is faster?

int f(const int* x, int n) {
    int s = 0;
    for (int i = 0; i < n; i++) {
        s += x[i * (i + 1) / 2];
    }
    return s;
}

int f(const int*, int):
    xor     eax, eax
    test    esi, esi
    jle     .L2
    xor     edx, edx
.L3:
    lea     ecx, [rdx+1]
    imul    edx, ecx
    sar     edx
    movsx   rdx, edx
    add     eax, DWORD PTR [rdi+rdx*4]
    cmp     ecx, esi
    mov     edx, ecx
    jne     .L3
.L2:
    ret

int g(const int* x, unsigned int n) {
    int s = 0;
    for (unsigned int i = 0; i < n; i++) {
        s += x[i * (i + 1) / 2];
    }
    return s;
}

int g(const int* x, unsigned int n):
    xor     eax, eax
    test    esi, esi
    je      .L8
    xor     edx, edx
.L9:
    lea     ecx, [rdx+1]
    imul    edx, ecx
    shr     edx
    add     eax, DWORD PTR [rdi+rdx*4]
    cmp     ecx, esi
    mov     edx, ecx
    jne     .L9
.L8:
    ret

MOVSX here sign-extends the 32-bit register EDX into the 64-bit register RDX

A zero-extension of the 32-bit register EDX into the 64-bit register RDX would be needed at the same spot, but the high 32 bits of RDX are already zero (writing EDX cleared them), so no instruction is emitted

13 of 45

Assembly 101: GP Registers

Which code is faster?

int f(const int* x, int n) {
    int s = 0;
    for (int i = 0; i < n; i++) {
        s += x[i * (i + 1) / 2];
    }
    return s;
}

int g(const int* x, unsigned int n) {
    int s = 0;
    for (unsigned int i = 0; i < n; i++) {
        s += x[i * (i + 1) / 2];
    }
    return s;
}

Further thoughts:

  • Unsigned 32-bit indices are faster than signed
    • Other options to consider: size_t, ptrdiff_t (see the sketch below)
  • In Java only 32-bit signed ints can be used for indexing

This one (the unsigned version g) is faster!
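A quick sketch of the other option mentioned above (my own example, not from the slides): with a size_t index the expression i * (i + 1) / 2 is computed directly in a 64-bit register, so no widening instruction is needed before the address calculation.

#include <stddef.h>

int h(const int* x, size_t n) {
    int s = 0;
    for (size_t i = 0; i < n; i++) {
        /* the index expression is already 64-bit, so it can be used in the
           addressing mode without any sign- or zero-extension */
        s += x[i * (i + 1) / 2];
    }
    return s;
}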

14 of 45

Assembly 101: XMM Registers

  • 16 SSE registers
  • Each 128-bit wide
  • Used for floating-point and SIMD computations
    • __m128i, __m128d, __m128 types map to these registers

Some examples

  • XORPD XMM0, XMM0
    • XMM0 = XMM0 ^ XMM0 = 0
  • ADDPD XMM4, XMM9
    • XMM4 = _mm_add_pd(XMM4, XMM9)
  • MOVAPD XMM2, [RCX]
    • XMM2 = _mm_load_pd(RCX)
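To connect these instructions back to C intrinsics, a minimal sketch (my own example); each intrinsic below corresponds to one of the instructions above:

#include <emmintrin.h>

/* typically compiled to XORPD xmm, xmm */
__m128d make_zero(void) {
    return _mm_setzero_pd();
}

/* load two doubles (p is assumed 16-byte aligned, as MOVAPD requires)
   and add them element-wise to an accumulator */
__m128d accumulate(const double* p, __m128d acc) {
    __m128d v = _mm_load_pd(p);   /* MOVAPD xmm, [reg] */
    return _mm_add_pd(acc, v);    /* ADDPD  xmm, xmm   */
}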

15 of 45

Assembly 101: XMM Registers

SIMD floating-point instructions operate on vectors of floating-point numbers in an XMM register

  • ADDPD XMM3, XMM5
    • XMM3 = _mm_add_pd(XMM3, XMM5)

16 of 45

Assembly 101: XMM Registers

Scalar floating-point instructions operate only on the low floating-point number in an XMM register

  • ADDSD XMM3, XMM5
    • Low element of XMM3 = low element of XMM3 + low element of XMM5; the high element is unchanged
    • XMM3 = _mm_add_sd(XMM3, XMM5)

17 of 45

Assembly 101: XMM Instructions

Note the common pattern:

  • <OP>PD
    • Packed (SIMD) double-precision
    • E.g. MULPD, SUBPD, ANDPD, DIVPD
  • <OP>SD
    • Scalar double-precision
    • E.g. MULSD, SUBSD, DIVSD, SQRTSD
  • <OP>PS
    • Packed (SIMD) single-precision
    • E.g. MULPS, SUBPS, ANDPS, DIVPS
  • <OP>SS
    • Scalar single-precision
    • E.g. MULSS, SUBSS, DIVSS, SQRTSS
  • How about intrinsics? _mm_mul_pd (!)
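The same naming pattern carries over to the intrinsics; a small illustrative snippet (my own) with the four multiply variants:

#include <emmintrin.h>   /* SSE2; also pulls in the SSE single-precision intrinsics */

void multiply_variants(__m128d a, __m128d b, __m128 c, __m128 d,
                       __m128d out_d[2], __m128 out_s[2]) {
    out_d[0] = _mm_mul_pd(a, b);  /* MULPD: packed double-precision */
    out_d[1] = _mm_mul_sd(a, b);  /* MULSD: scalar double-precision (low element only) */
    out_s[0] = _mm_mul_ps(c, d);  /* MULPS: packed single-precision */
    out_s[1] = _mm_mul_ss(c, d);  /* MULSS: scalar single-precision (low element only) */
}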

18 of 45

Assembly 101: XMM Instructions

void vector_max(const double*, double*, size_t):
    cmp      rdx, 1
    push     rbx
    movapd   xmm0, XMMWORD PTR .LC0[rip]
    mov      rbx, rsi
    jbe      .L2
    mov      rcx, rdx
    mov      rax, rdi
.L3:
    movupd   xmm1, XMMWORD PTR [rax]
    sub      rcx, 2
    add      rax, 16
    cmp      rcx, 1
    maxpd    xmm0, xmm1
    ja       .L3
    lea      rax, [rdx-2]
    and      edx, 1
    shr      rax
    add      rax, 1
    sal      rax, 4
    add      rdi, rax
.L2:
    movapd   xmm1, xmm0
    test     rdx, rdx
    unpckhpd xmm1, xmm0
    maxsd    xmm0, xmm1
    je       .L4
    movsd    xmm1, QWORD PTR [rdi]
    call     fmax
.L4:
    movsd    QWORD PTR [rbx], xmm0
    pop      rbx
    ret

Is this code vectorized?

19 of 45

Assembly 101: XMM Instructions

void vector_max(const double*, double*, size_t):
    cmp      rdx, 1
    push     rbx
    movapd   xmm0, XMMWORD PTR .LC0[rip]
    mov      rbx, rsi
    jbe      .L2
    mov      rcx, rdx
    mov      rax, rdi
.L3:
    movupd   xmm1, XMMWORD PTR [rax]
    sub      rcx, 2
    add      rax, 16
    cmp      rcx, 1
    maxpd    xmm0, xmm1
    ja       .L3
    lea      rax, [rdx-2]
    and      edx, 1
    shr      rax
    add      rax, 1
    sal      rax, 4
    add      rdi, rax
.L2:
    movapd   xmm1, xmm0
    test     rdx, rdx
    unpckhpd xmm1, xmm0
    maxsd    xmm0, xmm1
    je       .L4
    movsd    xmm1, QWORD PTR [rdi]
    call     fmax
.L4:
    movsd    QWORD PTR [rbx], xmm0
    pop      rbx
    ret

Is this code vectorized?

Aha! MAXPD: the main loop is vectorized.
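For orientation, here is a plausible C source that could produce assembly of this shape (my reconstruction, not taken from the slides; it assumes GCC at -O3 with -ffast-math, so that the fmax reduction can be vectorized into MAXPD while a scalar fmax call remains for the odd trailing element):

#include <math.h>
#include <stddef.h>

void vector_max(const double* x, double* result, size_t n) {
    double m = -HUGE_VAL;   /* assumed starting value, loaded as the .LC0 constant */
    for (size_t i = 0; i < n; i++) {
        m = fmax(m, x[i]);
    }
    *result = m;
}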

20 of 45

Assembly 101: RFLAGS Register

Special 64-bit register

  • Keeps the result of compare operations
    • CMP EAX, 10
      • Subtract 10 from EAX
      • Set the flags according to result
        • Set zero flag if result is 0, i.e. EAX == 10
      • Discard the result
    • And lots of other stuff we don't care about
  • Arithmetic operations also change flags
    • SUB RCX, 10
      • Has the same effect on flags as the CMP

21 of 45

Assembly 101: RFLAGS Register

Some more examples

  • AND EAX, 0x80000000
    • 0x80000000 = 1 << 31
    • Bitwise AND EAX with 0x80000000
    • Set the flags according to result
      • Set zero flag if result is 0, i.e. bit 31 of EAX is 0
    • And lots of other stuff we don't care about
  • TEST RDX, RDX
    • Same effect on flags as AND RDX, RDX
      • Sets zero flag if (RDX & RDX) == 0
        • That is, if RDX == 0
    • Unlike AND, the result is discarded

22 of 45

Assembly 101: Conditional clauses

x86 provides 3 types of conditional instructions; in each case the result of the operation depends on RFLAGS (e.g. do something if the Zero flag is (not) set)

  • Set instructions
    • Set a (byte) register to 1 if the condition is true, to 0 otherwise
  • Conditional moves
    • CMOVZ RDX, R10
      • If the Zero flag is set, copy the value in R10 to RDX
  • Conditional jumps
    • JNZ Label
      • If the Zero flag is not set, continue execution at Label
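A tiny C sketch (my own) of where these conditional forms show up: a compiler may turn the first function into a compare plus conditional move, and the second into a compare plus conditional jump (the exact choice is up to the compiler).

/* candidate for CMP/TEST + CMOVcc: branch-free selection */
long select_value(long a, long b, long flag) {
    return flag != 0 ? a : b;
}

/* candidate for CMP + Jcc: jump over the assignment when a >= 0 */
long clamp_negative(long a) {
    if (a < 0) {
        a = 0;
    }
    return a;
}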

23 of 45

Assembly 101: Conditional clauses

Asm code:

.Label:
    Some asm instructions
    ...
    TEST RAX, RAX
    JNZ .Label

Equivalent C code:

do {
    Some C statements
    ...
} while (RAX != 0);
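A concrete instance of this pattern (an illustrative example of my own): a countdown loop whose compiled loop tail may look just like the sequence above, with n playing the role of RAX.

void countdown(long n, void (*body)(long)) {
    do {
        body(n);      /* "Some C statements" */
        n -= 1;       /* SUB reg, 1 already sets the Zero flag... */
    } while (n != 0); /* ...so the compiler may emit just a JNZ back to the label */
}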

24 of 45

Assembly 101: Stack

Stack as abstract data structure

  • LIFO: Last In - First Out
  • We can only work with the top of the stack
    • PUSH operation adds new element to the top of the stack
    • POP operation removes the top element
  • Most processor architectures have a hardware-supported stack for every thread

25 of 45

Assembly 101: Machine Stack

Machine stack components

  • Chunk of memory
    • Stack grows downwards
  • 64-bit general-purpose register RSP which holds the address of the stack top
    • SP for stack pointer
  • PUSH and POP instructions
    • PUSH RDI
      • SUB RSP, 8
      • MOV [RSP], RDI
    • POP RAX
      • MOV RAX, [RSP]
      • ADD RSP, 8

26 of 45

Assembly 101: Functions

  • CALL Label instruction
    • Pushes the address of the next instruction on the stack
    • Executes instructions at Label
  • RET instruction
    • Pops the top element from the machine stack
    • Continues execution at the address in the popped element
  • Stack may also contain:
    • Parameters for the function
    • Local variables of the function

27 of 45

Assembly 101: Functions

Function calling convention (Linux, Mac OS)

  • Integer parameters for a function are passed in RDI, RSI, RDX, RCX, R8, R9
    • The first parameter in RDI
    • The second in RSI
    • If there are more than 6 integer parameters, the rest are passed on the machine stack
  • Function returns a value in a register
    • In RAX if the return value is integer
    • In XMM0 if the return value is floating-point
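An illustrative sketch (my own, assuming the Linux/Mac OS convention described above, i.e. the System V AMD64 ABI) of how parameters and return values map to registers:

/* a arrives in RDI, b in RSI, c in RDX; the integer result is returned in RAX */
long sum3(long a, long b, long c) {
    return a + b + c;
}

/* floating-point arguments are passed in XMM registers (x in XMM0, y in XMM1);
   the double result is returned in XMM0 */
double average2(double x, double y) {
    return (x + y) * 0.5;
}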

28 of 45

Assembly 101: Example

double vector_sum(
    const double* data,
    size_t length)
{
    double sum = 0.0;
    for (; length != 0; length -= 1) {
        // Load element and add to sum
        sum += *data;
        // Advance array pointer
        data += 1;
    }
    return sum;
}

29 of 45

Assembly 101: Example

double vector_sum(
    const double* data,
    size_t length)
{
    double sum = 0.0;
    for (; length != 0; length -= 1) {
        // Load element and add to sum
        sum += *data;
        // Advance array pointer
        data += 1;
    }
    return sum;
}

double vector_sum(
    const double* data,
    size_t length):

    TEST  RSI, RSI      ; length == 0?
    XORPD XMM0, XMM0    ; sum = 0.0
    JE    .L2           ; if length == 0, skip the loop
.L3:
    ADDSD XMM0, [RDI]   ; sum += *data
    ADD   RDI, 8        ; data += 1 (a double is 8 bytes)
    SUB   RSI, 1        ; length -= 1
    JNE   .L3           ; repeat while length != 0
.L2:
    REP RET             ; return (sum is in XMM0)

30 of 45

Assembly 101

Questions?

31 of 45

Microarchitecture 101

What is inside the "Intel inside"?

Images from intel.com and Wikipedia Commons

32 of 45

Image from adisney.go.com/disneypictures/aliceinwonderland/

33 of 45

Images from adisney.go.com/disneypictures/aliceinwonderland/ and pressroom-se.intel.com

34 of 45

Microarchitecture 101: Intro

Most modern processors are built on pipelined superscalar speculative out-of-order designs

???

35 of 45

Microarchitecture 101: Intro

Most modern processors are built on pipelined superscalar speculative out-of-order designs

  • Pipelined
    • Instructions are not executed entirely in a single cycle
    • Instead execution of an instruction is split into many small stages
      • Each stage is small enough to run in one cycle
    • Consider 3 GHz CPU with 15-stages pipeline
      • If it did not start executing a new instruction until the previous one had finished, it would effectively be a 200 MHz CPU

36 of 45

Microarchitecture 101: Intro

Most modern processors are built on pipelined superscalar speculative out-of-order designs

  • Pipelined
    • Example: ARM Cortex-A15 pipeline

Image from arm.com

37 of 45

Microarchitecture 101: Intro

Most modern processors are built on pipelined superscalar speculative out-of-order designs

  • Superscalar
    • Can run more than one instruction per cycle (IPC)
      • Intel Pentium, Intel Atom: 2 IPC
      • Intel Pentium 4, Intel Pentium M: 3 IPC
      • Intel Core 2 and later: 5 IPC
      • AMD K8, K10: 3 IPC
      • AMD Bulldozer: 4 IPC
      • ARM Cortex-A7, Cortex-A8, Cortex-A9: 2 IPC
      • ARM Cortex-A15, Qualcomm Krait: 3 IPC
      • IBM Power 7: 6 IPC
      • Intel Itanium (Poulson core): 12 IPC (!)
        • But not out-of-order

38 of 45

Microarchitecture 101: Intro

Most modern processors are built on pipelined superscalar speculative out-of-order designs

  • Out-of-order
    • Can execute the instructions in a different order than they appear in the program
      • But the program must still operate as if they were executed in order
    • All desktop processors and some mobile processors are capable of out-of-order execution
      • Exceptions: Intel Atom, ARM Cortex-A5/A7/A8
      • Intel Nehalem, the CPU inside Jinx, is capable of out-of-order execution

39 of 45

Microarchitecture 101: Intro

Most modern processors are built on pipelined superscalar speculative out-of-order designs

  • Speculative
    • If the CPU is unsure about something, it makes a guess and uses this guess when executing instructions
      • If the guess happens to be wrong, the CPU rolls back all actions which depended on it
    • Branch prediction
      • The processor tries to predict whether the conditional jump will transfer execution to a label
    • Load address prediction
      • If the address used to load data is not yet computed, the processor tries to guess it

40 of 45

Microarchitecture 101: Intro

Most modern processors are built on pipelined superscalar speculative out-of-order designs

  • Speculative
    • If the CPU is unsure about something, it makes a guess and uses this guess when executing instructions
      • If the guess happens to be wrong, the CPU rolls back all actions which depended on it
      • Rollback is expensive
        • In terms of performance: all results which depended on the wrong guess must be recomputed
        • In terms of power: energy was spent computing results which are now simply thrown away
    • Can we help the CPU make the right guesses?
      • Yes! Do what the CPU expects you to do ☺

41 of 45

Microarchitecture 101: OoO

How to execute instructions out of order?

The general approach includes three phases:

  • Decode instructions as they appear in the program
    • Detect dependencies between instructions
      • Attempt to remove false dependencies
  • Once all the inputs of an instruction are ready, we can compute its result
    • And store it in a special buffer
    • And we can compute results in any order (!)
  • In the order the instructions appear in the program, take their computed results from the special buffers and update the permanent state of the processor

42 of 45

Microarchitecture 101: OoO

But the reality is not that simple...

Images from the article www.realworldtech.com/nehalem/ which I advise you to read

43 of 45

Microarchitecture 101: OoO

But the reality is not that simple...

What if...

  • ...an instruction needs a recently computed value which is not yet written to the permanent state?
    • No problem, it can get it!
  • ...a memory load wants a value which was recently stored, but the store is not yet written to the permanent state and still resides in a store buffer?
    • No problem, it can get it!
  • ...instructions from different loop iterations have their operands ready?
    • They can issue in parallel in the same cycle (see the sketch below)
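One way to hand the out-of-order core independent work from different iterations is to split a reduction across several accumulators, so the additions do not form one long dependency chain. A minimal sketch (my own example, not from the slides; note that reassociating floating-point additions can change rounding slightly):

#include <stddef.h>

double sum_two_accumulators(const double* x, size_t n) {
    double s0 = 0.0, s1 = 0.0;
    size_t i = 0;
    /* the two += chains are independent, so additions from
       consecutive iterations can execute in parallel */
    for (; i + 1 < n; i += 2) {
        s0 += x[i];
        s1 += x[i + 1];
    }
    if (i < n) {        /* leftover element when n is odd */
        s0 += x[i];
    }
    return s0 + s1;
}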

44 of 45

Microarchitecture 101: OoO

But the reality is not that simple...

What if...

  • ...we do a function call?
    • No problem, the CPU will predict in advance that execution will continue at the function location!
  • ...we do a function call via a pointer?
    • No problem, if the pointer does not change it will have the same performance as an ordinary function call!
  • ...a memory load results in a cache miss?
    • No problem, subsequent independent instructions can continue execution while the cache line is being loaded!

45 of 45

Microarchitecture 101: Summary

  • "But I don't want to go among mad people," Alice remarked.
  • "Oh, you can't help that," said the Cat: "We're all mad here. I'm mad. You're mad."
  • "How do you know I'm mad?" said Alice.
  • "You must be," said the Cat, "otherwise you wouldn't have come here."

Image from adisney.go.com/disneypictures/aliceinwonderland/