1 of 41

“Github Major Service Outage”

Georges Seurat, 1884

Oil on canvas

http://classicprogrammerpaintings.com/post/144953638470

Low level details for high level developers

2 of 41

Low level details for high level developers

Balázs Attila-Mihály

3 of 41

About me

  • Balázs Attila-Mihály
  • Software craftsman
  • https://www.grey-panther.net/pages/presentations.html
  • Occasional Perl user
  • the community
  • Questions at the end and in the hallway

4 of 41

Why?

5 of 41

Why?

  • Useful for performance estimation, improvement
  • General debugging
  • Just knowing

6 of 41

What we’re (not) going to talk about

7 of 41

Those things are important!

“There are three great virtues of a programmer: laziness, impatience and hubris”

-- Larry Wall

Getting the biggest return on investment...

8 of 41

What we’re going to talk about

  • About the common case: running on x64 hardware with a Linux OS
  • There are alternatives: GPUs, FPGAs, ASICs, DSPs, analog computers
    • Perhaps in a different talk - or come talk to me afterwards
  • Don’t forget about algorithmic efficiencies!
    • Do less work, do some work in advance, etc.
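
As a small illustration of "do some work in advance", the sketch below counts the set bits in a buffer using a table built once up front. The function names (`build_table`, `count_set_bits`) and the problem itself are hypothetical, chosen only to show the idea; they are not from the talk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: number of set bits for every possible byte value. */
static uint8_t bits_in_byte[256];
static int table_ready = 0;

/* Work done in advance: fill the 256-entry table once. */
static void build_table(void) {
    for (int i = 1; i < 256; i++) {
        /* bits(i) = bits(i / 2) + lowest bit of i; entry 0 is already 0 */
        bits_in_byte[i] = (uint8_t)(bits_in_byte[i >> 1] + (i & 1));
    }
    table_ready = 1;
}

/* Less work per call: one table lookup per byte instead of eight bit tests. */
static size_t count_set_bits(const unsigned char *buf, size_t len) {
    assert(table_ready);
    size_t total = 0;
    for (size_t i = 0; i < len; i++) {
        total += bits_in_byte[buf[i]];
    }
    return total;
}
```

Call `build_table()` once at startup, then `count_set_bits()` as often as needed. Whether the table actually beats a hardware instruction is exactly the kind of thing to measure rather than assume.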

9 of 41

The Zen of performance

  • Everything is inefficient - by at least two orders of magnitude
  • You are never “optimal” - you are just “optimal enough”
  • First make it correct, easy to maintain, well tested. Then make it fast.
  • Trust the platform
  • Measure, measure, measure
  • The first question of performance optimization is always “what are the constraints?”
    • How much time?
    • How much money?
    • How many people?

10 of 41

Why should I care?

  • Better understanding of what’s going on
  • Having the “big picture”
  • More options for debugging
  • More options for performance tuning

11 of 41

Not your grandmother's von Neumann machine

12 of 41

From source code to hardware

13 of 41

From source code to hardware

14 of 41

Ideas

  • Caching
    “There are two hard things in computer science: cache invalidation, naming things, and off-by-one errors.”
    http://martinfowler.com/bliki/TwoHardThings.html
  • Batching
  • Parallelization
  • Pipelining
  • Speculative execution

15 of 41

From source code to hardware

16 of 41

From source code to hardware

17 of 41

From source code to hardware

18 of 41

perf stat perl -E 'say "Hello World!"'

Performance counter stats for 'perl -E say "Hello World!"':

       1,621871      task-clock (msec)          #    0,890 CPUs utilized
              0      context-switches           #    0,000 K/sec
              0      cpu-migrations             #    0,000 K/sec
            200      page-faults                #    0,123 M/sec
      4.839.322      cycles                     #    2,984 GHz
<not supported>      stalled-cycles-frontend
<not supported>      stalled-cycles-backend
      4.428.820      instructions               #    0,92  insns per cycle
        914.978      branches                   #  564,150 M/sec
         37.288      branch-misses              #    4,08% of all branches

    0,001822859 seconds time elapsed
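
The figures after the `#` are derived from the raw counters by simple division. A minimal sketch of that arithmetic is below; the function names are made up for illustration, and perf's own implementation may differ in detail.

```c
#include <assert.h>

/* Instructions retired per CPU cycle ("insns per cycle"). */
static double insns_per_cycle(double instructions, double cycles) {
    assert(cycles > 0);
    return instructions / cycles;
}

/* Share of all branches that were mispredicted, in percent. */
static double branch_miss_percent(double misses, double branches) {
    assert(branches > 0);
    return 100.0 * misses / branches;
}

/* Average clock rate in GHz: cycles divided by CPU time in nanoseconds. */
static double clock_ghz(double cycles, double task_clock_msec) {
    assert(task_clock_msec > 0);
    return cycles / (task_clock_msec * 1e6);
}
```

With the numbers from the run above, `insns_per_cycle(4428820, 4839322)` is about 0.915, `branch_miss_percent(37288, 914978)` is about 4.075, and `clock_ghz(4839322, 1.621871)` is about 2.984, matching the rounded values perf prints.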

19 of 41

From source code to hardware

run.c:

int
Perl_runops_standard(pTHX)
{
    OP *op = PL_op;
    PERL_DTRACE_PROBE_OP(op);
    while ((PL_op = op = op->op_ppaddr(aTHX))) {
        PERL_DTRACE_PROBE_OP(op);
    }
    PERL_ASYNC_CHECK();

    TAINT_NOT;
    return 0;
}
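
The shape of this loop (call the current op through a function pointer, then continue with whatever op it returns) can be reproduced in a few lines. The sketch below is a toy stand-in, not Perl's real data structures: `toy_op`, `run_add`, `run_mul` and the global `accumulator` are all invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for Perl's OP: each op knows the function that executes it. */
typedef struct toy_op toy_op;
struct toy_op {
    toy_op *(*run)(toy_op *self);  /* plays the role of op_ppaddr */
    toy_op *next;                  /* op to execute afterwards, NULL to stop */
    long operand;
};

static long accumulator = 0;       /* toy interpreter state */

static toy_op *run_add(toy_op *self) {
    accumulator += self->operand;
    return self->next;
}

static toy_op *run_mul(toy_op *self) {
    accumulator *= self->operand;
    return self->next;
}

/* Same shape as the loop above: run the current op, continue with its result. */
static long run_ops(toy_op *op) {
    assert(op != NULL);
    while ((op = op->run(op)) != NULL) {
        /* one indirect call per op; the loop body itself does nothing */
    }
    return accumulator;
}
```

Chaining an add of 5, a multiply by 4 and an add of 1 leaves 21 in the accumulator. Every step costs an indirect call, which is one reason a trivial script still executes millions of instructions.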

20 of 41

Latency Numbers Every Programmer Should Know

21 of 41

Why?

22 of 41

Memory layout matters
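
A minimal sketch of what "layout matters" can mean in practice: the two functions below add up the same cells of a row-major 2D array and return the same total, but the first walks memory in address order while the second jumps `cols` elements on every step. The function names are hypothetical, and the size of the speed difference depends on the array size and the cache hierarchy, so it should be measured rather than assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Visits cells in the order they are stored: consecutive addresses. */
static long sum_row_major(const int *cells, size_t rows, size_t cols) {
    assert(cells != NULL);
    long total = 0;
    for (size_t r = 0; r < rows; r++) {
        for (size_t c = 0; c < cols; c++) {
            total += cells[r * cols + c];
        }
    }
    return total;
}

/* Visits the same cells column by column: each step skips a whole row. */
static long sum_col_major(const int *cells, size_t rows, size_t cols) {
    assert(cells != NULL);
    long total = 0;
    for (size_t c = 0; c < cols; c++) {
        for (size_t r = 0; r < rows; r++) {
            total += cells[r * cols + c];
        }
    }
    return total;
}
```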

23 of 41

Possible solution

  • RPerl (FLOSS Weekly interview)
  • As easy as “sudo cpan RPerl”*
    • sudo apt install libperl-dev
    • sudo cpanm --notest RPerl
    • rperl --noparallel /tmp/foobar.pl

24 of 41

Possible solution

  • RPerl (FLOSS Weekly interview)
    • Perl -> C -> Machine code
      • With all the complexity that goes with that
    • Limited subset of Perl
      • ~ passes Perl::Critic “brutal”
  • As easy as “sudo cpan RPerl”*
    • sudo apt install libperl-dev
    • sudo cpanm --notest RPerl
    • rperl --noparallel /tmp/foobar.pl

25 of 41

RPerl

have $foo = 33
have $bar = 1_932
have $baz = 58.545_454_545_454_5

Performance counter stats for '/tmp/foobar':

       1,101880      task-clock (msec)          #    0,856 CPUs utilized
      3.616.659      instructions               #    1,10  insns per cycle
        637.273      branches                   #  578,351 M/sec
         18.727      branch-misses              #    2,94% of all branches

    0,001286668 seconds time elapsed

have $foo = 33
have $bar = 1_932
have $baz = 58.545_454_545_454_5

Performance counter stats for 'perl /tmp/foobar.pl':

     130,153650      task-clock (msec)          #    0,997 CPUs utilized
    473.391.113      instructions               #    1,20  insns per cycle
    101.997.452      branches                   #  783,670 M/sec
      3.610.345      branch-misses              #    3,54% of all branches

    0,130532666 seconds time elapsed

26 of 41

Possible solution

  • PDL - Perl Data Language aka. NumPy for Perl :-)

    “PDL ("Perl Data Language") gives standard Perl the ability to compactly store and speedily manipulate the large N-dimensional data arrays which are the bread and butter of scientific computing.

    PDL turns Perl into a free, array-oriented, numerical language similar to (but, we believe, better than) such commercial packages as IDL and MatLab. One can write simple perl expressions to manipulate entire numerical arrays all at once.”

27 of 41

Profiling

  • Devel::NYTProf
    • Instrumenting profiler
  • Sampling profilers
    • Less overhead
    • Can do full-system profiling
    • Needs to be independent of the program
    • Can miss small methods

28 of 41

Profiling

  • Flame Graphs for Online Performance Profiling (YAPC::NA 2013)
    • By Yichun Zhang from CloudFlare

http://agentzh.org/misc/flamegraph/perl-vm-test-nginx.svg

29 of 41

Problem

#chr  position   id           ref  alt
1     27259823   rs143970144  C    A
3     134279741  rs570267197  C    T
3     4427096    rs189830239  T    G
4     56396589   rs751646898  A    G
6     103754045  rs188253003  A    G
8     81783139   rs201875105  G    A
9     40999891   rs28602573   G    A
12    55068468   rs3062496    TCA  T
21    37313602   rs145886040  G    T

30 of 41

Solution: pure perl

https://github.com/gpanther/yapc-eu-2016-benchmarks

  • Perl 5.22
  • Load time: ~20s
  • Used memory: ~1.9G
    • Only for a subset: 10m out of 152m
    • ~6x bloat (subset file: ~300M)
    • Can use compressed inputs at almost no performance loss
  • Lookup time for 13.5m lookups: 6.8s

31 of 41

Solution: perl with encoded key

  • Load time: ~215s
  • Used memory: ~900M
    • ~3x bloat (subset file: ~300M)
  • Lookup time for 13.5m lookups: 6s
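
The slide does not spell out the encoding, so the following is only one plausible sketch: pack the chromosome number and the position into a single 64-bit integer, so each record is keyed by one number instead of a string. The function names and the bit layout (chromosome in the upper 32 bits, position in the lower 32) are assumptions made for illustration, and non-numeric chromosomes would need their own numbering.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: chromosome in the upper 32 bits, position in the lower 32. */
static uint64_t encode_key(uint32_t chromosome, uint32_t position) {
    assert(chromosome < 256);  /* sanity bound chosen for this sketch */
    return ((uint64_t)chromosome << 32) | position;
}

static uint32_t key_chromosome(uint64_t key) {
    return (uint32_t)(key >> 32);
}

static uint32_t key_position(uint64_t key) {
    return (uint32_t)(key & 0xFFFFFFFFu);
}
```

A side effect of this layout is that keys sort first by chromosome and then by position, the same order a genome-sorted input file would have.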

32 of 41

Embracing the OS

  • Memory mapped files
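
For reference, this is roughly what mapping a file looks like at the system-call level on Linux. It is a minimal sketch with hypothetical helper names (`map_file`, `unmap_file`) and only basic error handling.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Maps a whole file read-only. Returns NULL on failure (including an empty
 * file); on success the file size is stored in *len_out. */
static const unsigned char *map_file(const char *path, size_t *len_out) {
    assert(path != NULL && len_out != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        return NULL;
    }
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);
        return NULL;
    }
    void *addr = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    if (addr == MAP_FAILED) {
        return NULL;
    }
    *len_out = (size_t)st.st_size;
    return addr;
}

/* Releases a mapping obtained from map_file. */
static void unmap_file(const unsigned char *addr, size_t len) {
    munmap((void *)addr, len);
}
```

No data is read when `mmap` returns. The kernel loads pages on first access and can share them between processes that map the same file, so a large file can be "loaded" instantly while memory use stays small.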

33 of 41

Embracing the OS

34 of 41

Embracing the OS

35 of 41

Embracing the OS

36 of 41

Solution: perl with memory mapped file

  • Sys::Mmap
  • Write an open addressing hash map to a binary file:
    • Use the upper 20 bits as hash / bucket id
    • Store at most 16 elements per bucket
  • Load time: 0s
    • After a one-time generation step of 283 seconds
  • Used memory:
    • ~50M
    • Binary file: ~1G
  • Lookup time for 2.5m lookups: > 30s
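
The bucket scheme described above can be sketched as follows. Only the "upper bits select the bucket" and "16 elements per bucket" parts come from the slide. The slot layout, the hash function, and the rule of spilling into the next bucket when one is full are assumptions made for this sketch, and the real file format in the benchmark repository may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { SLOTS_PER_BUCKET = 16 };  /* from the slide */

/* Assumed slot layout; a zeroed slot means "empty". */
typedef struct { uint64_t key; uint32_t value; uint32_t used; } slot;

typedef struct {
    slot *slots;           /* (1 << bucket_bits) * SLOTS_PER_BUCKET zeroed slots */
    unsigned bucket_bits;  /* 20 on the slide; smaller values work for tests */
} table;

/* Assumed hash: multiplicative mixing, keeping the top 32 bits. */
static uint32_t hash32(uint64_t key) {
    return (uint32_t)((key * 0x9E3779B97F4A7C15ULL) >> 32);
}

/* The upper bucket_bits bits of the hash select the bucket. */
static size_t bucket_of(const table *t, uint64_t key) {
    assert(t->bucket_bits >= 1 && t->bucket_bits <= 31);
    return hash32(key) >> (32 - t->bucket_bits);
}

/* Stores key/value; returns 1 on success, 0 if every slot is taken. */
static int table_put(table *t, uint64_t key, uint32_t value) {
    size_t buckets = (size_t)1 << t->bucket_bits;
    size_t b = bucket_of(t, key);
    for (size_t probe = 0; probe < buckets; probe++, b = (b + 1) % buckets) {
        slot *s = t->slots + b * SLOTS_PER_BUCKET;
        for (int i = 0; i < SLOTS_PER_BUCKET; i++) {
            if (!s[i].used || s[i].key == key) {
                s[i].key = key;
                s[i].value = value;
                s[i].used = 1;
                return 1;
            }
        }
        /* bucket full: spill into the next one (assumed overflow rule) */
    }
    return 0;
}

/* Looks up key; returns 1 and fills *value_out if found, else 0. */
static int table_get(const table *t, uint64_t key, uint32_t *value_out) {
    size_t buckets = (size_t)1 << t->bucket_bits;
    size_t b = bucket_of(t, key);
    for (size_t probe = 0; probe < buckets; probe++, b = (b + 1) % buckets) {
        const slot *s = t->slots + b * SLOTS_PER_BUCKET;
        for (int i = 0; i < SLOTS_PER_BUCKET; i++) {
            if (!s[i].used) {
                return 0;  /* empty slot ends the probe sequence: not present */
            }
            if (s[i].key == key) {
                *value_out = s[i].value;
                return 1;
            }
        }
    }
    return 0;
}
```

Because the table is a flat array of fixed-size slots with no pointers, the same bytes can be written to a file once and later used directly from a memory mapping, which is what makes a load time of 0s possible.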

37 of 41

Solution: perl + mmap + Inline::C

  • Load time: 0s
    • After a one-time generation step of 283 seconds
  • Used memory:
    • ~50M
    • Binary file: ~1G
  • Lookup time for 2.5m lookups: 4s

38 of 41

Resources

39 of 41

Resources

40 of 41

Resources

41 of 41

Thank you!

Questions and (possibly) answers