1 of 66

Journeying through Optimization with Heuristics

2 of 66

Hi - I’m Jack

  • Done some embedded
  • Done some Quant fund / HFT
  • This year I’ve not been competing
    • (Spent a bunch of time at The Recurse Center)

3 of 66

I made a programming game

4 of 66

I made a cool hat

5 of 66

That you can play Frogger on

6 of 66

I also got really into some maths code

  • This code calculates a sequence of numbers (the number of Eulerian circuits in a graph): https://oeis.org/A007082 https://github.com/rafl/oeis-A007082/
  • Got really into optimising it
    • Before we started, people only knew the 10th term -> ~6*10^67
    • Now we have the 19th term -> ~2*10^237

7 of 66

Optimisation Challenges

  • Sometimes, what you really care about is hard to measure
  • Maybe you have some easier-to-measure heuristics
  • Some changes are obvious, many aren’t

8 of 66

Quant Finance

  • What you really care about (future profits) is impossible to measure
  • You have some easier-to-measure heuristics (model fit scores, trading simulations)
  • Most changes aren’t obvious (predicting with a signal to noise ratio of 1% is good!)

9 of 66

The Strategy

  • Runtime on small case = main heuristic
  • We’ll consider other relevant heuristics
  • We’ll consider maintenance costs of the decision
  • We’ll consider robustness of the heuristic / of the optimization

10 of 66

What are we optimising?

  • Here’s some stats from perf on the latest code

+ 29.15%  oeis  [.] mont_mul            -> (a*b)         (mod p)
+ 28.18%  oeis  [.] mont_mul_sub        -> (a*b) - (c*d) (mod p)
+  5.13%  oeis  [.] extended_euclidean  -> (1/x)         (mod p)
+  3.62%  oeis  [.] add_mod_u64         -> (a+b)         (mod p)

65% of self time is in modular maths primitives

57% is modular multiplication

11 of 66

Modular Multiplication

inline uint64_t mod_mul(uint64_t a, uint64_t b,
                        uint64_t p, uint64_t p_dash) {
  return ((uint128_t)a * (uint128_t)b) % (uint128_t)p;
}

Baseline: 80 seconds with 64 bit division

  • % p is division
  • Division is slo(ooo)w

12 of 66

Montgomery Multiplication

// p_dash = -(p^-1) mod 2^64
uint64_t mont_mul(uint64_t a, uint64_t b,
                  uint64_t p, uint64_t p_dash) {
  uint128_t t = (uint128_t)a * b;
  uint64_t m = (uint64_t)t * p_dash;
  uint128_t u = t + (uint128_t)m * p;
  uint64_t res = u >> 64;
  if (res >= p) res -= p;
  return res;
}

51s run time

https://en.wikipedia.org/wiki/Montgomery_modular_multiplication

13 of 66

Heuristic - Run time of small test case

Calculating the 9th number in the sequence is a nicely sized test case

  • Pros:
    • Measuring close to what you want

  • Cons:
    • Runtime noisy without effort
      • CPU heat, background processes
    • Beware unequal scaling
      • Threads scaling differently
      • Step changes in perf with cache size

14 of 66

Heuristic - Perf record

  • Despite my best efforts, had to disable inlining to get reasonable traces
    • Adds noise from call
    • Removes optimizations downstream of inlining

15 of 66

Heuristic - Division is slow

  • Actually - this used to be sloooooow and now is slow

16 of 66

How to Divide

Euler

VS

Euclid

17 of 66

Euler's theorem

  • We can compute a^(p-2) (mod p) to get the modular inverse of a
  • Compute with repeated (modular) squaring - lots of multiplies

18 of 66

Extended Euclidean Algorithm

  • Repeatedly do
    • a <- a % b
    • swap (a, b)
  • Keep track of some coefficients along the way and tada - modular inverse
  • Doing lots of divisions in a loop

19 of 66

Euclid > Euler!

  • Swapping from Euler to Euclid showed a 2x speedup in modular inverse on my laptop

20 of 66

Euler > Euclid!

  • Testing on Heap (the compute cluster at the Recurse Center) showed nearly a 3x slowdown in modular inverse!
  • 2x faster vs 3x slower is a 6x difference in performance

21 of 66

Wait, what?

22 of 66

Digging in…

The main difference between these two methods is that one is doing a bunch of multiplication, and the other is doing a bunch of division

Jack’s laptop (circa 2023) is on the Zen 3 architecture

Heap (circa 2015) is on the Broadwell architecture

23 of 66

Multiplies (thank you Agner Fog!)

Zen 3:

Haswell:

24 of 66

Divides (thank you Agner Fog!)

Zen 3:

Haswell:

25 of 66

The solution

26 of 66

Caching

  • Repeatedly calculating x^n (mod p)
  • x can take about 100 different values
  • n is between 1 and 100
  • 100 * 100 * 8 = 80 KB … a bit iffy on keeping it in L1…

27 of 66

Caching

  • x^23 = x^20 * x^3
  • Cache x^1 x^2 x^3….
  • Cache x^10 x^20 x^30…
  • Cache size is 100 * 2 * 10 * 8 = 16 KB

28 of 66

Heuristic - Keep things small for cache

29 of 66

Lets look at our function!

  • Don’t worry if you can’t read assembly
    • Instructions on the left, registers on the right
  • Excellent reference for exactly what instructions do https://www.felixcloutier.com/x86/
    • Can make a reasonable guess at what they do from their names.
      • add = add, adc = add with carry, sub = subtract, mul = multiply, mov = move, cmov = conditional move, cmp = compare
  • https://godbolt.org/z/Enf6933fY

30 of 66

cmp

Compares the first source operand with the second source operand and sets the status flags in the EFLAGS register according to the results. The comparison is performed by subtracting the second operand from the first operand and then setting the status flags in the same manner as the SUB instruction.

31 of 66

First Optimisation!

Before:

After:

32 of 66

Stop Press - can get it from “C”

33 of 66

One other good idea from the Maths!

uint64_t mont_mul(uint64_t a, uint64_t b,
                  uint64_t p, uint64_t p_dash) {
  uint128_t t = (uint128_t)a * b;
  uint64_t m = (uint64_t)t * p_dash;
  uint128_t u = t + (uint128_t)m * p;
  uint64_t res = u >> 64;

The lower 64 bits of u are always zero
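Why the low 64 bits of u vanish, writing R = 2^64 and using the definition p_dash = -p^{-1} mod R:

```latex
u = t + mp, \qquad m \equiv t \cdot p\_dash \equiv -t p^{-1} \pmod{R}
\implies u \equiv t + (-t p^{-1}) p \equiv t - t \equiv 0 \pmod{R}
```

So the shift `u >> 64` discards only zeros, and the quotient needs at most one conditional subtraction of p.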

34 of 66

Couldn’t get GCC to do it

35 of 66

Not live demo

36 of 66

Stack Corruption

static inline uint64_t mont_mul(uint64_t a, uint64_t b, uint64_t p, uint64_t p_dash) {
  asm("movq %rdi, %rax\n"
      "movq %rdx, %r8\n"
      "mulq %rsi\n"
      "imulq %rax, %rcx\n"
      "movq %rax, %rsi\n"
      "movq %rdx, %rdi\n"
      "movq %rcx, %rax\n"
      "mulq %r8\n"
      "addq %rsi, %rax\n"
      "adcq %rdi, %rdx\n"
      "movq %rdx, %rax\n"
      "subq %r8, %rax\n"
      "cmovs %rdx, %rax\n"
      "ret\n");
}

make: *** [Makefile:132: test] Error 139

37 of 66

Non-obvious syntax error

inline uint64_t mont_mul(uint64_t a, uint64_t b, uint64_t p, uint64_t p_dash) {
  uint64_t ret;
  asm("movq %rdi, %rax\n"
      "movq %rdx, %r8\n"
      "mulq %rsi\n"
      "imulq %rax, %rcx\n"
      "movq %rax, %rsi\n"
      "movq %rdx, %rdi\n"
      "movq %rcx, %rax\n"
      "mulq %r8\n"
      "addq %rsi, %rax\n"
      "adcq %rdi, %rdx\n"
      "movq %rdx, %rax\n"
      "subq %r8, %rax\n"
      "cmovs %rdx, %rax"
      : "=a"(ret));
  return ret;
}

include/maths.h:143:3: error: invalid 'asm': operand number missing after %-letter

38 of 66

Register Corruption

inline uint64_t mont_mul(uint64_t a, uint64_t b, uint64_t p, uint64_t p_dash) {
  uint64_t ret;
  asm("movq %%rdi, %%rax\n"
      "movq %%rdx, %%r8\n"
      "mulq %%rsi\n"
      "imulq %%rax, %%rcx\n"
      "movq %%rax, %%rsi\n"
      "movq %%rdx, %%rdi\n"
      "movq %%rcx, %%rax\n"
      "mulq %%r8\n"
      "addq %%rsi, %%rax\n"
      "adcq %%rdi, %%rdx\n"
      "movq %%rdx, %%rax\n"
      "subq %%r8, %%rax\n"
      "cmovs %%rdx, %%rax\n"
      : "=a"(ret));
  return ret;
}

make: *** [Makefile:132: test] Error 139

39 of 66

Non-obvious syntax error

inline uint64_t mont_mul(uint64_t a, uint64_t b, uint64_t p, uint64_t p_dash) {
  uint64_t ret;
  asm("movq %%rdi, %%rax\n"
      "movq %%rdx, %%r8\n"
      "mulq %%rsi\n"
      "imulq %%rax, %%rcx\n"
      "movq %%rax, %%rsi\n"
      "movq %%rdx, %%rdi\n"
      "movq %%rcx, %%rax\n"
      "mulq %%r8\n"
      "addq %%rsi, %%rax\n"
      "adcq %%rdi, %%rdx\n"
      "movq %%rdx, %%rax\n"
      "subq %%r8, %%rax\n"
      "cmovs %%rdx, %%rax\n"
      : "=a"(ret)
      :
      : "rax", "rdi", "rdx", "r8", "rsi", "rcx");
  return ret;
}

include/maths.h:143:3: error: ‘asm’ operand has impossible constraints

40 of 66

Using junk data as input

inline uint64_t mont_mul(uint64_t a, uint64_t b, uint64_t p, uint64_t p_dash) {
  uint64_t ret;
  asm("movq %%rdi, %%rax\n"
      "movq %%rdx, %%r8\n"
      "mulq %%rsi\n"
      "imulq %%rax, %%rcx\n"
      "movq %%rax, %%rsi\n"
      "movq %%rdx, %%rdi\n"
      "movq %%rcx, %%rax\n"
      "mulq %%r8\n"
      "addq %%rsi, %%rax\n"
      "adcq %%rdi, %%rdx\n"
      "movq %%rdx, %%rax\n"
      "subq %%r8, %%rax\n"
      "cmovs %%rdx, %%rax\n"
      : "=a"(ret)
      :
      : "rdi", "rdx", "r8", "rsi", "rcx");
  return ret;
}

❌ n=1 (k=3): expected 2, got 340566958 [0.007s]

41 of 66

Register Corruption

inline uint64_t mont_mul(uint64_t a, uint64_t b, uint64_t p, uint64_t p_dash) {
  uint64_t ret;
  asm("movq %%rdi, %%rax\n"
      "movq %%rdx, %%r8\n"
      "mulq %%rsi\n"
      "imulq %%rax, %%rcx\n"
      "movq %%rax, %%rsi\n"
      "movq %%rdx, %%rdi\n"
      "movq %%rcx, %%rax\n"
      "mulq %%r8\n"
      "addq %%rsi, %%rax\n"
      "adcq %%rdi, %%rdx\n"
      "movq %%rdx, %%rax\n"
      "subq %%r8, %%rax\n"
      "cmovs %%rdx, %%rax\n"
      : "=a"(ret)
      : "D"(a), "S"(b), "d"(p), "c"(p_dash)
      : "r8");
  return ret;
}

Test hangs forever

42 of 66

Success! But unnecessary instruction

inline uint64_t mont_mul(uint64_t a, uint64_t b, uint64_t p, uint64_t p_dash) {
  uint64_t ret;
  asm("movq %%rdi, %%rax\n"
      "movq %%rdx, %%r8\n"
      "mulq %%rsi\n"
      "imulq %%rax, %%rcx\n"
      "movq %%rax, %%rsi\n"
      "movq %%rdx, %%rdi\n"
      "movq %%rcx, %%rax\n"
      "mulq %%r8\n"
      "addq %%rsi, %%rax\n"
      "adcq %%rdi, %%rdx\n"
      "movq %%rdx, %%rax\n"
      "subq %%r8, %%rax\n"
      "cmovs %%rdx, %%rax\n"
      : "=a"(ret), "+D"(a), "+S"(b), "+d"(p), "+c"(p_dash)
      :
      : "r8");
  return ret;
}

✅ n=1 OK [0.005s]

43 of 66

Register Corruption

inline uint64_t mont_mul(uint64_t a, uint64_t b, uint64_t p, uint64_t p_dash) {
  uint64_t ret;
  asm("movq %%rdx, %%r8\n"
      "mulq %%rsi\n"
      "imulq %%rax, %%rcx\n"
      "movq %%rax, %%rsi\n"
      "movq %%rdx, %%rdi\n"
      "movq %%rcx, %%rax\n"
      "mulq %%r8\n"
      "addq %%rsi, %%rax\n"
      "adcq %%rdi, %%rdx\n"
      "movq %%rdx, %%rax\n"
      "subq %%r8, %%rax\n"
      "cmovs %%rdx, %%rax\n"
      : "=a"(ret), "+S"(b), "+d"(p), "+c"(p_dash)
      : "a"(a)
      : "r8");
  return ret;
}

make: *** [Makefile:132: test] Error 139

44 of 66

The feature you want doesn’t exist

inline uint64_t mont_mul(uint64_t a, uint64_t b, uint64_t p, uint64_t p_dash) {
  uint64_t ret;
  asm("mulq %%rsi\n"
      "imulq %%rax, %%rcx\n"
      "movq %%rax, %%rsi\n"
      "movq %%rdx, %%rdi\n"
      "movq %%rcx, %%rax\n"
      "mulq %%r8\n"
      "addq %%rsi, %%rax\n"
      "adcq %%rdi, %%rdx\n"
      "movq %%rdx, %%rax\n"
      "subq %%r8, %%rax\n"
      "cmovs %%rdx, %%rax\n"
      : "=a"(ret), "+S"(b), "+c"(p_dash)
      : "a"(a), "r8"(p)
      : "r8", "rdi", "rdx");
  return ret;
}

include/maths.h:157:7: error: matching constraint references invalid operand number

45 of 66

Success! - actually Undefined Behaviour

inline uint64_t mont_mul(uint64_t a, uint64_t b, uint64_t p, uint64_t p_dash) {
  uint64_t ret;
  asm("mulq %%rsi\n"
      "imulq %%rax, %%rcx\n"
      "movq %%rax, %%rsi\n"
      "movq %%rdx, %%rdi\n"
      "movq %%rcx, %%rax\n"
      "mulq %[p_reg]\n"
      "addq %%rsi, %%rax\n"
      "adcq %%rdi, %%rdx\n"
      "movq %%rdx, %%rax\n"
      "subq %[p_reg], %%rax\n"
      "cmovs %%rdx, %%rax\n"
      : "=a"(ret), "+S"(b), "+c"(p_dash)
      : "a"(a), [p_reg] "r"(p)
      : "rdi", "rdx");
  return ret;
}

✅ n=1 OK [0.005s]

46 of 66

I _think_ this is correct now

inline uint64_t mont_mul(uint64_t a, uint64_t b, uint64_t p, uint64_t p_dash) {
  uint64_t ret;
  asm("mulq %%rsi\n"
      "imulq %%rax, %%rcx\n"
      "movq %%rax, %%rsi\n"
      "movq %%rdx, %%rdi\n"
      "movq %%rcx, %%rax\n"
      "mulq %[p_reg]\n"
      "addq %%rsi, %%rax\n"
      "adcq %%rdi, %%rdx\n"
      "movq %%rdx, %%rax\n"
      "subq %[p_reg], %%rax\n"
      "cmovs %%rdx, %%rax\n"
      : "=a"(ret), "+S"(b), "+c"(p_dash)
      : "a"(a), [p_reg] "r"(p)
      : "rdi", "rdx", "cc");
  return ret;
}

✅ n=1 OK [0.005s]

47 of 66

Works or not depending on where it’s inlined

inline uint64_t mont_mul(uint64_t a, uint64_t b, uint64_t p, uint64_t p_dash) {
  uint64_t result;
  uint64_t scratch;
  asm("mulq %[b]\n"
      "movq %%rdx, %[scratch]\n"
      "imulq %[p_dash], %%rax\n"
      "mulq %[p]\n"
      "addq $-1, %%rax\n"
      "adcq %[scratch], %%rdx\n"
      "movq %%rdx, %%rax\n"
      "subq %[p], %%rax\n"
      "cmovcq %%rdx, %%rax\n"
      : "=a"(result)
      : [a] "0"(a), [b] "r"(b), [p] "r"(p), [p_dash] "r"(p_dash),
        [scratch] "r"(scratch)
      : "rdx", "cc");
  return result;
}

✅ n=3 OK [0.006s]

❌ n=3 (--jack both) (k=7): expected 1015440, got 4512050826672256204 [0.009s]

48 of 66

I finally think this is good

inline uint64_t mont_mul(uint64_t a, uint64_t b, uint64_t p, uint64_t p_dash) {
  uint64_t result;
  uint64_t scratch;
  asm("mulq %[b]\n"
      "movq %%rdx, %[scratch]\n"
      "imulq %[p_dash], %%rax\n"
      "mulq %[p]\n"
      "addq $-1, %%rax\n"
      "adcq %[scratch], %%rdx\n"
      "movq %%rdx, %%rax\n"
      "subq %[p], %%rax\n"
      "cmovcq %%rdx, %%rax\n"
      : "=&a"(result), [scratch] "=&r"(scratch)
      : [a] "0"(a), [b] "r"(b), [p] "r"(p), [p_dash] "r"(p_dash)
      : "rdx", "cc");
  return result;
}

✅ n=3 OK [0.006s]

✅ n=3 (--jack both) OK [0.010s]

49 of 66

How does the perf compare?

With C:

2564193830794 cycles:u

With ASM:

2514090657968 cycles:u

About 2% faster

50 of 66

Heuristic - Perf stat - cycles

  • Perf using CPU counters
  • Fluctuates much less - more like 0.1%
  • sleep might lead to “wrong” results in multi-threaded applications

51 of 66

Ooops - things I found when writing the slides

  • MULX exists
  • Clang can do some numerical tricks gcc can’t

52 of 66

Actually… clang can generate something just as short

53 of 66

C vs ASM using Clang

(other perf improvements too)

Optimized C:

  • 2473979233378 cycles:u

Inline ASM

  • 2385026908186 cycles:u

Wait… that’s not supposed to happen…

54 of 66

Heuristic - Fewer instructions is better

  • But this isn’t how CPUs work any more!

Matt Godbolt: Advanced Skylake Deep Dive

https://www.youtube.com/watch?v=BVVNtG5dgks

55 of 66

Heuristic - LLVM MCA

  • Machine Code Analyzer
  • Maybe we can find answers here

56 of 66

What’s going on here??

57 of 66

Are the LLVM params correct?

// Integer 64-bit multiplication.
defm : Zn3WriteResIntPair<WriteIMul64, [Zn3Multiplier], 3, [3], 2>;

// Integer 32-bit Unsigned Multiply Without Affecting Flags.
defm : Zn3WriteResIntPair<WriteMULX64, [Zn3Multiplier], 3, [1], 2>;

58 of 66

Results with Modded Schedule

optimised_c.c

Iterations: 100

Instructions: 800

Total Cycles: 1305

Total uOps: 1000

Dispatch Width: 6

uOps Per Cycle: 0.77

IPC: 0.61

Block RThroughput: 3.0

inline_asm.c

Iterations: 100

Instructions: 900

Total Cycles: 1303

Total uOps: 1100

Dispatch Width: 6

uOps Per Cycle: 0.84

IPC: 0.69

Block RThroughput: 3.0

59 of 66

Results with Modded Schedule

optimised_c.c

[0,0] DeeeeER . . . mulx rax, rdx, rdi

[0,1] D====eeeER. . . imul rdx, rcx

[0,7] .D=============eER . cmovs rax, rdx

Inline_asm.c

[0,0] DeeeER . . . mul rsi

[0,1] D----R . . . mov rdi, rdx

[0,2] D===eeeER . . . imul rax, rcx

[0,8] .D===========eER . cmovb rax, rdx

60 of 66

Code layout noise

  • perf stat might be (close to) deterministic, but things can be deterministic and still chaotic

61 of 66

Making decisions in uncertainty

Theory 1: the ASM isn’t faster

  • LLVM project are smart people
  • Code layout noise
  • Meta uncertainty

Theory 2: the ASM is faster

  • Lower total instruction latency
  • Fewer CPU cycles

62 of 66

But…

  • This is machine specific
  • There’s a good chance I did something wrong in the ASM still
  • Miss out on future innovations in machines / compilers
  • This is code location specific

63 of 66

The Heuristics

  • Run time of small test case
  • Perf record
  • Division is slow (but has got faster)
  • Keep things small for cache
  • Perf stat - cycles
  • Lower total instruction latency is better (worked better than I’d have thought)
  • LLVM-MCA metrics

64 of 66

The Moral

  • No heuristic is perfect, but you can still make decisions under uncertainty
  • You can learn a lot by pushing code hard (and pushing your knowledge)

65 of 66

Questions?

66 of 66

Shout outs

  • Florian Ragwitz - started the project and had already pushed it really far before I showed up
  • David Feil - delightful collaborator
  • Professor Brendan McKay - wrote the paper back in 2001 - has been generous and excited to talk to us
  • The Recurse Center - the free programming retreat where I met Florian and David and worked on this