1 of 66

Journeying through Optimization with Heuristics

2 of 66

Hi - I’m Jack

  • Done some embedded
  • Done some Quant fund / HFT
  • This year I’ve not been competing
    • (Spent a bunch of time at The Recurse Center)

3 of 66

I made a programming game

4 of 66

I made a cool hat

5 of 66

That you can play Frogger on

6 of 66

I also got really into some maths code

  • This code calculates a sequence of numbers (the number of Eulerian circuits in a graph): https://oeis.org/A007082 https://github.com/rafl/oeis-A007082/
  • Got really into optimising it
    • Before we started, people only knew the 10th term -> ~6*10^67
    • Now we have the 19th term -> ~2*10^237

7 of 66

Optimisation Challenges

  • Sometimes, what you really care about is hard to measure
  • Maybe you have some easier-to-measure heuristics
  • Some changes are obvious, many aren’t

8 of 66

Quant Finance

  • What you really care about (future profits) is impossible to measure
  • You have some easier-to-measure heuristics (model fit scores, trading simulations)
  • Most changes aren’t obvious (predicting with a signal to noise ratio of 1% is good!)

9 of 66

The Strategy

  • Runtime on small case = main heuristic
  • We’ll consider other relevant heuristics
  • We’ll consider maintenance costs of the decision
  • We’ll consider robustness of the heuristic / of the optimization

10 of 66

What are we optimising?

  • Here’s some stats from perf on the latest code

+ 29.15%  oeis  [.] mont_mul            -> (a*b)         (mod p)
+ 28.18%  oeis  [.] mont_mul_sub        -> (a*b) - (c*d) (mod p)
+  5.13%  oeis  [.] extended_euclidean  -> (1/x)         (mod p)
+  3.62%  oeis  [.] add_mod_u64         -> (a+b)         (mod p)

65% of self time is in modular maths primitives

57% is modular multiplication

11 of 66

Modular Multiplication

inline uint64_t mod_mul(uint64_t a, uint64_t b,
                        uint64_t p, uint64_t p_dash) {
  return ((uint128_t)a * (uint128_t)b) % (uint128_t)p;
}

Baseline: 80 seconds with 64 bit division

  • % p is division
  • Division is slo(ooo)w

12 of 66

Montgomery Multiplication

// p_dash = -(p^-1) mod 2^64
uint64_t mont_mul(uint64_t a, uint64_t b,
                  uint64_t p, uint64_t p_dash) {
  uint128_t t = (uint128_t)a * b;
  uint64_t m = (uint64_t)t * p_dash;
  uint128_t u = t + (uint128_t)m * p;
  uint64_t res = u >> 64;
  if (res >= p) res -= p;
  return res;
}

51s run time

https://en.wikipedia.org/wiki/Montgomery_modular_multiplication

13 of 66

Heuristic - Run time of small test case

Calculating the 9th number in the sequence is a nicely sized test case

  • Pros:
    • Measuring close to what you want

  • Cons:
    • Runtime noisy without effort
      • CPU heat, background processes
    • Beware unequal scaling
      • Threads scaling differently
      • Step changes in perf with cache size

14 of 66

Heuristic - Perf record

  • Despite my best efforts, had to disable inlining to get reasonable traces
    • Adds noise from call
    • Removes optimizations downstream of inlining

15 of 66

Heuristic - Division is slow

  • Actually - this used to be sloooooow and now is slow

16 of 66

How to Divide

Euler

VS

Euclid

17 of 66

Euler's theorem

  • We can compute a^(p-2) (mod p) to get the modular inverse of a
  • Compute with repeated (modular) squaring - lots of multiplies

18 of 66

Extended Euclidean Algorithm

  • Repeatedly do
    • a <- a % b
    • swap (a, b)
  • Keep track of some coefficients along the way and tada - modular inverse
  • Doing lots of divisions in a loop

19 of 66

Euclid > Euler!

  • Swapping from Euler to Euclid showed a 2x speedup in modular inverse on my laptop

20 of 66

Euler > Euclid!

  • Testing on Heap (the compute cluster at the Recurse Center) showed nearly a 3x slowdown in modular inverse!
  • 2x faster vs 3x slower is a 6x difference in performance

21 of 66

Wait, what?

22 of 66

Digging in…

The main difference between these two methods is that one is doing a bunch of multiplication, and the other is doing a bunch of division

Jack’s laptop (circa 2023) is on the Zen 3 architecture

Heap (circa 2015) is on the Broadwell architecture

23 of 66

Multiplies (thank you Agner Fog!)

Zen 3:

Haswell:

24 of 66

Divides (thank you Agner Fog!)

Zen 3:

Haswell:

25 of 66

The solution

26 of 66

Caching

  • Repeatedly calculating x^n (mod p)
  • x can take about 100 different values
  • n is between 1 and 100
  • 100 * 100 * 8 = 80 KB … a bit iffy on keeping it in L1…

27 of 66

Caching

  • x^23 = x^20 * x^3
  • Cache x^1 x^2 x^3….
  • Cache x^10 x^20 x^30…
  • Cache size is 100 * 2 * 10 * 8 = 16 KB

28 of 66

Heuristic - Keep things small for cache

29 of 66

Lets look at our function!

  • Don’t worry if you can’t read assembly
    • Instructions on the left, registers on the right
  • Excellent reference for exactly what instructions do https://www.felixcloutier.com/x86/
    • Can make a reasonable guess at what they do from their names.
      • add = add, adc = add with carry, sub = subtract, mul = multiply, mov = move, cmov = conditional move, cmp = compare
  • https://godbolt.org/z/Enf6933fY

30 of 66

cmp

Compares the first source operand with the second source operand and sets the status flags in the EFLAGS register according to the results. The comparison is performed by subtracting the second operand from the first operand and then setting the status flags in the same manner as the SUB instruction.

31 of 66

First Optimisation!

Before:

After:

32 of 66

Stop Press - can get it from “C”

33 of 66

One other good idea from the Maths!

uint64_t mont_mul(uint64_t a, uint64_t b,
                  uint64_t p, uint64_t p_dash) {
  uint128_t t = (uint128_t)a * b;
  uint64_t m = (uint64_t)t * p_dash;
  uint128_t u = t + (uint128_t)m * p;
  uint64_t res = u >> 64;

The lower 64 bits of u are always zero
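Why the low 64 bits of u vanish, writing R = 2^64 and using the definition p_dash = -p^{-1} mod R:

```latex
u = t + mp, \qquad m \equiv t \cdot p\_dash \equiv -t p^{-1} \pmod{R}
\implies u \equiv t + (-t p^{-1}) p \equiv t - t \equiv 0 \pmod{R}
```

So the shift `u >> 64` discards only zeros, and the quotient needs at most one conditional subtraction of p.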

34 of 66

Couldn’t get GCC to do it

35 of 66

Not live demo

36 of 66

Stack Corruption

static inline uint64_t mont_mul(uint64_t a, uint64_t b, uint64_t p, uint64_t p_dash) {
  asm("movq %rdi, %rax\n"
      "movq %rdx, %r8\n"
      "mulq %rsi\n"
      "imulq %rax, %rcx\n"
      "movq %rax, %rsi\n"
      "movq %rdx, %rdi\n"
      "movq %rcx, %rax\n"
      "mulq %r8\n"
      "addq %rsi, %rax\n"
      "adcq %rdi, %rdx\n"
      "movq %rdx, %rax\n"
      "subq %r8, %rax\n"
      "cmovs %rdx, %rax\n"
      "ret\n");
}

make: *** [Makefile:132: test] Error 139

37 of 66

Non-obvious syntax error

inline uint64_t mont_mul(uint64_t a, uint64_t b, uint64_t p, uint64_t p_dash) {
  uint64_t ret;
  asm("movq %rdi, %rax\n"
      "movq %rdx, %r8\n"
      "mulq %rsi\n"
      "imulq %rax, %rcx\n"
      "movq %rax, %rsi\n"
      "movq %rdx, %rdi\n"
      "movq %rcx, %rax\n"
      "mulq %r8\n"
      "addq %rsi, %rax\n"
      "adcq %rdi, %rdx\n"
      "movq %rdx, %rax\n"
      "subq %r8, %rax\n"
      "cmovs %rdx, %rax"
      : "=a"(ret));
  return ret;
}

include/maths.h:143:3: error: invalid 'asm': operand number missing after %-letter

38 of 66

Register Corruption

inline uint64_t mont_mul(uint64_t a, uint64_t b, uint64_t p, uint64_t p_dash) {
  uint64_t ret;
  asm("movq %%rdi, %%rax\n"
      "movq %%rdx, %%r8\n"
      "mulq %%rsi\n"
      "imulq %%rax, %%rcx\n"
      "movq %%rax, %%rsi\n"
      "movq %%rdx, %%rdi\n"
      "movq %%rcx, %%rax\n"
      "mulq %%r8\n"
      "addq %%rsi, %%rax\n"
      "adcq %%rdi, %%rdx\n"
      "movq %%rdx, %%rax\n"
      "subq %%r8, %%rax\n"
      "cmovs %%rdx, %%rax\n"
      : "=a"(ret));
  return ret;
}

make: *** [Makefile:132: test] Error 139

39 of 66

Non-obvious syntax error

inline uint64_t mont_mul(uint64_t a, uint64_t b, uint64_t p, uint64_t p_dash) {
  uint64_t ret;
  asm("movq %%rdi, %%rax\n"
      "movq %%rdx, %%r8\n"
      "mulq %%rsi\n"
      "imulq %%rax, %%rcx\n"
      "movq %%rax, %%rsi\n"
      "movq %%rdx, %%rdi\n"
      "movq %%rcx, %%rax\n"
      "mulq %%r8\n"
      "addq %%rsi, %%rax\n"
      "adcq %%rdi, %%rdx\n"
      "movq %%rdx, %%rax\n"
      "subq %%r8, %%rax\n"
      "cmovs %%rdx, %%rax\n"
      : "=a"(ret)
      :
      : "rax", "rdi", "rdx", "r8", "rsi", "rcx");
  return ret;
}

include/maths.h:143:3: error: ‘asm’ operand has impossible constraints

40 of 66

Using junk data as input

inline uint64_t mont_mul(uint64_t a, uint64_t b, uint64_t p, uint64_t p_dash) {
  uint64_t ret;
  asm("movq %%rdi, %%rax\n"
      "movq %%rdx, %%r8\n"
      "mulq %%rsi\n"
      "imulq %%rax, %%rcx\n"
      "movq %%rax, %%rsi\n"
      "movq %%rdx, %%rdi\n"
      "movq %%rcx, %%rax\n"
      "mulq %%r8\n"
      "addq %%rsi, %%rax\n"
      "adcq %%rdi, %%rdx\n"
      "movq %%rdx, %%rax\n"
      "subq %%r8, %%rax\n"
      "cmovs %%rdx, %%rax\n"
      : "=a"(ret)
      :
      : "rdi", "rdx", "r8", "rsi", "rcx");
  return ret;
}

❌ n=1 (k=3): expected 2, got 340566958 [0.007s]

41 of 66

Register Corruption

inline uint64_t mont_mul(uint64_t a, uint64_t b, uint64_t p, uint64_t p_dash) {
  uint64_t ret;
  asm("movq %%rdi, %%rax\n"
      "movq %%rdx, %%r8\n"
      "mulq %%rsi\n"
      "imulq %%rax, %%rcx\n"
      "movq %%rax, %%rsi\n"
      "movq %%rdx, %%rdi\n"
      "movq %%rcx, %%rax\n"
      "mulq %%r8\n"
      "addq %%rsi, %%rax\n"
      "adcq %%rdi, %%rdx\n"
      "movq %%rdx, %%rax\n"
      "subq %%r8, %%rax\n"
      "cmovs %%rdx, %%rax\n"
      : "=a"(ret)
      : "D"(a), "S"(b), "d"(p), "c"(p_dash)
      : "r8");
  return ret;
}

Test hangs forever

42 of 66

Success! But unnecessary instruction

inline uint64_t mont_mul(uint64_t a, uint64_t b, uint64_t p, uint64_t p_dash) {
  uint64_t ret;
  asm("movq %%rdi, %%rax\n"
      "movq %%rdx, %%r8\n"
      "mulq %%rsi\n"
      "imulq %%rax, %%rcx\n"
      "movq %%rax, %%rsi\n"
      "movq %%rdx, %%rdi\n"
      "movq %%rcx, %%rax\n"
      "mulq %%r8\n"
      "addq %%rsi, %%rax\n"
      "adcq %%rdi, %%rdx\n"
      "movq %%rdx, %%rax\n"
      "subq %%r8, %%rax\n"
      "cmovs %%rdx, %%rax\n"
      : "=a"(ret), "+D"(a), "+S"(b), "+d"(p), "+c"(p_dash)
      :
      : "r8");
  return ret;
}

✅ n=1 OK [0.005s]

43 of 66

Register Corruption

inline uint64_t mont_mul(uint64_t a, uint64_t b, uint64_t p, uint64_t p_dash) {
  uint64_t ret;
  asm("movq %%rdx, %%r8\n"
      "mulq %%rsi\n"
      "imulq %%rax, %%rcx\n"
      "movq %%rax, %%rsi\n"
      "movq %%rdx, %%rdi\n"
      "movq %%rcx, %%rax\n"
      "mulq %%r8\n"
      "addq %%rsi, %%rax\n"
      "adcq %%rdi, %%rdx\n"
      "movq %%rdx, %%rax\n"
      "subq %%r8, %%rax\n"
      "cmovs %%rdx, %%rax\n"
      : "=a"(ret), "+S"(b), "+d"(p), "+c"(p_dash)
      : "a"(a)
      : "r8");
  return ret;
}

make: *** [Makefile:132: test] Error 139

44 of 66

The feature you want doesn’t exist

inline uint64_t mont_mul(uint64_t a, uint64_t b, uint64_t p, uint64_t p_dash) {
  uint64_t ret;
  asm("mulq %%rsi\n"
      "imulq %%rax, %%rcx\n"
      "movq %%rax, %%rsi\n"
      "movq %%rdx, %%rdi\n"
      "movq %%rcx, %%rax\n"
      "mulq %%r8\n"
      "addq %%rsi, %%rax\n"
      "adcq %%rdi, %%rdx\n"
      "movq %%rdx, %%rax\n"
      "subq %%r8, %%rax\n"
      "cmovs %%rdx, %%rax\n"
      : "=a"(ret), "+S"(b), "+c"(p_dash)
      : "a"(a), "r8"(p)
      : "r8", "rdi", "rdx");
  return ret;
}

include/maths.h:157:7: error: matching constraint references invalid operand number

45 of 66

Success! - actually Undefined Behaviour

inline uint64_t mont_mul(uint64_t a, uint64_t b, uint64_t p, uint64_t p_dash) {
  uint64_t ret;
  asm("mulq %%rsi\n"
      "imulq %%rax, %%rcx\n"
      "movq %%rax, %%rsi\n"
      "movq %%rdx, %%rdi\n"
      "movq %%rcx, %%rax\n"
      "mulq %[p_reg]\n"
      "addq %%rsi, %%rax\n"
      "adcq %%rdi, %%rdx\n"
      "movq %%rdx, %%rax\n"
      "subq %[p_reg], %%rax\n"
      "cmovs %%rdx, %%rax\n"
      : "=a"(ret), "+S"(b), "+c"(p_dash)
      : "a"(a), [p_reg] "r"(p)
      : "rdi", "rdx");
  return ret;
}

✅ n=1 OK [0.005s]

46 of 66

I _think_ this is correct now

inline uint64_t mont_mul(uint64_t a, uint64_t b, uint64_t p, uint64_t p_dash) {
  uint64_t ret;
  asm("mulq %%rsi\n"
      "imulq %%rax, %%rcx\n"
      "movq %%rax, %%rsi\n"
      "movq %%rdx, %%rdi\n"
      "movq %%rcx, %%rax\n"
      "mulq %[p_reg]\n"
      "addq %%rsi, %%rax\n"
      "adcq %%rdi, %%rdx\n"
      "movq %%rdx, %%rax\n"
      "subq %[p_reg], %%rax\n"
      "cmovs %%rdx, %%rax\n"
      : "=a"(ret), "+S"(b), "+c"(p_dash)
      : "a"(a), [p_reg] "r"(p)
      : "rdi", "rdx", "cc");
  return ret;
}

✅ n=1 OK [0.005s]

47 of 66

Works or not depending on where it’s inlined

inline uint64_t mont_mul(uint64_t a, uint64_t b, uint64_t p, uint64_t p_dash) {
  uint64_t result;
  uint64_t scratch;
  asm("mulq %[b]\n"
      "movq %%rdx, %[scratch]\n"
      "imulq %[p_dash], %%rax\n"
      "mulq %[p]\n"
      "addq $-1, %%rax\n"
      "adcq %[scratch], %%rdx\n"
      "movq %%rdx, %%rax\n"
      "subq %[p], %%rax\n"
      "cmovcq %%rdx, %%rax\n"
      : "=a"(result)
      : [a] "0"(a), [b] "r"(b), [p] "r"(p), [p_dash] "r"(p_dash),
        [scratch] "r"(scratch)
      : "rdx", "cc");
  return result;
}

✅ n=3 OK [0.006s]

❌ n=3 (--jack both) (k=7): expected 1015440, got 4512050826672256204 [0.009s]

48 of 66

I finally think this is good

inline uint64_t mont_mul(uint64_t a, uint64_t b, uint64_t p, uint64_t p_dash) {
  uint64_t result;
  uint64_t scratch;
  asm("mulq %[b]\n"
      "movq %%rdx, %[scratch]\n"
      "imulq %[p_dash], %%rax\n"
      "mulq %[p]\n"
      "addq $-1, %%rax\n"
      "adcq %[scratch], %%rdx\n"
      "movq %%rdx, %%rax\n"
      "subq %[p], %%rax\n"
      "cmovcq %%rdx, %%rax\n"
      : "=&a"(result), [scratch] "=&r"(scratch)
      : [a] "0"(a), [b] "r"(b), [p] "r"(p), [p_dash] "r"(p_dash)
      : "rdx", "cc");
  return result;
}

✅ n=3 OK [0.006s]

✅ n=3 (--jack both) OK [0.010s]

49 of 66

How does the perf compare?

With C:

2564193830794 cycles:u

With ASM:

2514090657968 cycles:u

About 2% faster

50 of 66

Heuristic - Perf stat - cycles

  • Perf using CPU counters
  • Fluctuates much less - more like 0.1%
  • sleep might lead to “wrong” results in multi-threaded applications

51 of 66

Ooops - things I found when writing the slides

  • MULX exists
  • Clang can do some numerical tricks gcc can’t

52 of 66

Actually… clang can generate something just as short

53 of 66

C vs ASM using Clang

(other perf improvements too)

Optimized C:

  • 2473979233378 cycles:u

Inline ASM

  • 2385026908186 cycles:u

Wait… that’s not supposed to happen…

54 of 66

Heuristic - Fewer instructions is better

  • But this isn’t how CPUs work any more!

Matt Godbolt: Advanced Skylake Deep Dive

https://www.youtube.com/watch?v=BVVNtG5dgks

55 of 66

Heuristic - LLVM MCA

  • Machine Code Analyzer
  • Maybe we can find answers here

56 of 66

What’s going on here??

57 of 66

Are the LLVM params correct?

// Integer 64-bit multiplication.
defm : Zn3WriteResIntPair<WriteIMul64, [Zn3Multiplier], 3, [3], 2>;

// Integer 32-bit Unsigned Multiply Without Affecting Flags.
defm : Zn3WriteResIntPair<WriteMULX64, [Zn3Multiplier], 3, [1], 2>;

58 of 66

Results with Modded Schedule

optimised_c.c

Iterations: 100

Instructions: 800

Total Cycles: 1305

Total uOps: 1000

Dispatch Width: 6

uOps Per Cycle: 0.77

IPC: 0.61

Block RThroughput: 3.0

inline_asm.c

Iterations: 100

Instructions: 900

Total Cycles: 1303

Total uOps: 1100

Dispatch Width: 6

uOps Per Cycle: 0.84

IPC: 0.69

Block RThroughput: 3.0

59 of 66

Results with Modded Schedule

optimised_c.c

[0,0] DeeeeER . . . mulx rax, rdx, rdi

[0,1] D====eeeER. . . imul rdx, rcx

[0,7] .D=============eER . cmovs rax, rdx

Inline_asm.c

[0,0] DeeeER . . . mul rsi

[0,1] D----R . . . mov rdi, rdx

[0,2] D===eeeER . . . imul rax, rcx

[0,8] .D===========eER . cmovb rax, rdx

60 of 66

Code layout noise

  • perf stat might be (close to) deterministic, but things can be deterministic and still chaotic

61 of 66

Making decisions in uncertainty

Theory 1: the ASM isn’t faster

  • LLVM project are smart people
  • Code layout noise
  • Meta uncertainty

Theory 2: the ASM is faster

  • Lower total instruction latency
  • Fewer CPU cycles

62 of 66

But…

  • This is machine specific
  • There’s a good chance I did something wrong in the ASM still
  • Miss out on future innovations in machines / compilers
  • This is code location specific

63 of 66

The Heuristics

  • Run time of small test case
  • Perf record
  • Division is slow (but has got faster)
  • Keep things small for cache
  • Perf stat - cycles
  • Lower total instruction latency is better (worked better than I’d have thought)
  • LLVM-MCA metrics

64 of 66

The Moral

  • No heuristic is perfect, but you can still make decisions under uncertainty
  • You can learn a lot by pushing code hard (and pushing your knowledge)

65 of 66

Questions?

66 of 66

Shout outs

  • Florian Ragwitz - started the project and had already pushed it really far before I showed up
  • David Feil - delightful collaborator
  • Professor Brendan McKay - wrote the paper back in 2001 - has been generous and excited to talk to us
  • The Recurse Center - the free programming retreat where I met Florian and David and worked on this