1 of 59

Optimizing Ruby:

Building an

Always-On Production Profiler

Ivo Anjo / ivo.anjo@datadoghq.com

slides dtdg.co/rubykaigi2024 😉

2 of 59

Who am I


Ivo Anjo

Senior Software Engineer @ Datadog

Ruby since 2014

Exploring language runtimes

Application Performance

Making tools to uncover perf insights 👀

3 of 59

Who am I


Ivo Anjo

Senior Software Engineer @ Datadog

I work on the Datadog Ruby Continuous Profiler! 🎉

4 of 59

We provide Cloud Monitoring as a Service 😃

Including profiling, distributed tracing, security, errors, logs for Ruby


5 of 59

Datadog Ruby Continuous Profiler

➜ Part of ddtrace gem

github.com/datadog/dd-trace-rb + Open-source (Apache2 or BSD-3)

➜ Supports CRuby/MRI Ruby 2.5+

➜ Profiler: Linux + hidden experimental macOS support 😉 (docker on macOS works!)
➜ Gem: Linux + macOS

➜ Ruby + C + Rust

➜ Produces data in the pprof format created by Google

➜ Built to be always-on in production


6 of 59

How to gather hundreds/thousands of profiling data points per second with low overhead

Or,

Having fun pushing the Ruby VM to its limits 😉


7 of 59

There’s already a bunch of great Ruby profilers out there…


…often with >1 caveat for prod:

➜ Add a lot of overhead

➜ Produce a lot of data

(store? upload?)

➜ Gather sensitive data

(heap dumps and similar)

➜ Track only one type of data at a time

Always-on in production:

➜ Low overhead (sampling!)

➜ Efficiently manage data (in-memory + uploaded)

➜ No sensitive data

➜ Track multiple types of data simultaneously

8 of 59

Outline


00

Profiling in practice

01

Types of profiling data

02

Gathering profiles

03

Challenges && workarounds

04

Using Rust in a gem

05

Always-on production profiling: What? and Why?

9 of 59

Profiling in practice

10 of 59

The case of the internal server errors

Setup:

➜ Datadog self-hosted GitLab

➜ Requests failing with rack timeouts after 60 seconds 🤔


11 of 59

The case of the internal server errors

🔍 Almost all time spent for this request was on CPU in match?


12 of 59

The case of the internal server errors


Separate branch + tag names into two groups: names with a slash and names without a slash

Build regular expression with all names without a slash

Check if names with slash match regular expression

13 of 59

The case of the internal server errors


🤔

Wait… What repo is this again?

How many branches and tags are there?

14 of 59

The case of the internal server errors


🤯

Big datadog frontend repo

> 50000 branches
> 130000 tags

15 of 59

The case of the internal server errors

Colleague:
> I did a huge cleanup on (...repo…) to delete 50000 branches, and now the route is working again on GitLab.

🎉


16 of 59

The case of the slow ClassCount check in ddtrace


17 of 59

The case of the slow ClassCount check in ddtrace


18 of 59

Bonus: How to use locally without a Datadog account

  • bundle add ddtrace && bundle install
  • DD_PROFILING_ENABLED=true bundle exec ddprofrb exec ruby hello-profiler.rb
  • lz4 -d hello-profiler.pprof.lz4
  • go tool pprof -http :8987 hello-profiler.pprof


19 of 59

Bonus: How to use locally without a Datadog account


20 of 59

Types of profiling data

21 of 59

CPU Time

➜ How much time was spent by code executing on cpu and where

➜ ≈ Time spent by Ruby VM actually running code

Does not include network, file access, waiting, …

Wall Time

➜ How much time was spent by what code in total

sleep(10): 0 seconds cpu time, 10 seconds wall time

➜ Includes on cpu and off cpu (network, waiting for GVL, etc)

➜ Wall time of a web request: latency your users experience
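
As an aside, both clocks can be read from plain Ruby, which makes the difference easy to see (a minimal sketch, not the profiler itself):

  # cpu time vs wall time around a sleep, using standard Ruby clocks
  wall_before = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  cpu_before  = Process.clock_gettime(Process::CLOCK_PROCESS_CPUTIME_ID)

  sleep(10) # off-cpu: the thread is waiting, not executing code

  wall_spent = Process.clock_gettime(Process::CLOCK_MONOTONIC) - wall_before
  cpu_spent  = Process.clock_gettime(Process::CLOCK_PROCESS_CPUTIME_ID) - cpu_before

  puts "wall: #{wall_spent.round(2)}s, cpu: #{cpu_spent.round(3)}s"
  # => wall: ~10s, cpu: ~0s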


22 of 59

Allocations

➜ How many objects are getting created and where

➜ More objects means more work for Garbage Collector

➜ GC work impacts the whole application, not just the parts that create a lot of objects

Garbage Collection

➜ How much time was spent by the Garbage Collector

➜ Hidden cost of too many objects


23 of 59

Heap Live Objects

➜ What objects are still alive after GC

➜ Useful for tracking object memory leaks

(objects that are kept alive and should not)


24 of 59

Gathering profiles

25 of 59

Controlling overhead: Sampling!

Recording every method call, object created, event, … is costly

Challenge:

➜ Profiling should not make the app slower, use more memory, …

Sampling:

➜ Don’t record every event, instead record periodically*

➜ Create a statistically significant approximation of app workload


26 of 59

Gathering cpu and wall time (*simplified!)

➜ From a background thread, every 10ms record the state of other threads

  • CPU time
  • Wall time
  • Backtrace
  • Timestamp
  • Active web request
  • Thread name, state, extra metadata


27 of 59

Gathering cpu and wall time (*simplified!)

Wall time:

➜ Record time since last sample

CPU time:

➜ Ask OS for time the thread spent on cpu
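
A minimal pure-Ruby sketch of this sampling loop (the real profiler does this in C without the GVL, and asks the OS for per-thread cpu time, which plain Ruby cannot do for other threads; record_sample is a hypothetical helper):

  SAMPLE_EVERY = 0.010 # 10ms

  sampler = Thread.new do
    last_sample_at = Process.clock_gettime(Process::CLOCK_MONOTONIC)

    loop do
      sleep(SAMPLE_EVERY)

      now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
      wall_time_delta = now - last_sample_at # wall time since the last sample
      last_sample_at = now

      Thread.list.each do |thread|
        next if thread == Thread.current # don't sample the sampler itself

        backtrace = thread.backtrace # where is this thread right now?
        record_sample(thread, backtrace, wall_time_delta) # hypothetical helper
      end
    end
  end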


28 of 59

Gathering cpu and wall time (*simplified!)

Timestamp:

➜ Used to build timeline view where events are shown in order


29 of 59

Gathering cpu and wall time (*simplified!)

Active web request:

➜ Read state of distributed tracing from fiber-local storage

➜ Can filter profile for a single request (“code hotspots”) or a single endpoint (“endpoint profiling”)

➜ Especially useful for multi-threaded apps


30 of 59

Gathering cpu and wall time (*simplified!)


31 of 59

Gathering cpu and wall time (*simplified!)

From a background thread, every 10ms record the state of other threads

Challenge:

➜ What about the Global VM Lock (GVL)?

➜ Does the profiler need to wait for its turn to execute?

➜ What if that takes longer than 10ms?


32 of 59

Gathering cpu and wall time (*simplified!)

Solution:

➜ Background thread runs in C without the GVL

➜ Interrupt currently-active thread using a unix SIGPROF signal

➜ Use rb_postponed_job_trigger to ask the VM to record info

From a background thread, every 10ms record the state of other threads


33 of 59

Gathering cpu and wall time (*simplified!)

Challenge:

➜ What if doing this every 10ms is still too much overhead?

Solution:

➜ Profiler measures time spent sampling and sleeps for longer to reduce overhead if needed

From a background thread, every 10ms record the state of other threads
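
A sketch of that adjustment, assuming an illustrative target of spending at most ~1% of wall time sampling (the numbers are not the profiler's actual tuning):

  TARGET_OVERHEAD = 0.01  # spend at most ~1% of wall time sampling
  MINIMUM_SLEEP   = 0.010 # never sample more often than every 10ms

  def next_sleep_time(last_sample_duration)
    [last_sample_duration / TARGET_OVERHEAD, MINIMUM_SLEEP].max
  end

  next_sleep_time(0.0001) # => 0.01 (sampling is cheap, keep the 10ms cadence)
  next_sleep_time(0.001)  # => 0.1  (sampling got expensive, back off to 100ms)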


34 of 59

Gathering Garbage Collection (*simplified!)

Create tracepoint for

RUBY_INTERNAL_EVENT_GC_ENTER

RUBY_INTERNAL_EVENT_GC_EXIT

(These tracepoint events are only available to native extensions)
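
These events can only be subscribed to from C, via rb_tracepoint_new. As a point of comparison, plain Ruby only sees aggregate GC time; a sketch of that, not the profiler's mechanism:

  # Ruby 3.1+ tracks total GC time when measurement is enabled (on by default)
  GC.measure_total_time = true

  gc_time_before  = GC.stat(:time)  # milliseconds spent in GC so far
  gc_count_before = GC.stat(:count)

  100_000.times { Object.new }
  GC.start

  puts "GC time: #{GC.stat(:time) - gc_time_before}ms " \
       "across #{GC.stat(:count) - gc_count_before} GC runs"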


35 of 59

Gathering Garbage Collection (*simplified!)

GC_ENTER and GC_EXIT are special tracepoints: you cannot allocate objects during GC

➜ Need to delay some of the recording work until after GC


36 of 59

Gathering Garbage Collection (*simplified!)

Ruby splits GC work into many small steps

➜ Tracepoints get called for every one of them (up to hundreds/thousands of times per second)

➜ Batch small steps to keep overhead low
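
A sketch of the batching idea (in the real profiler this logic lives in C inside the GC_ENTER/GC_EXIT callbacks; the threshold and record_gc_sample helper are illustrative):

  class GcStepBatcher
    FLUSH_THRESHOLD_NS = 10_000_000 # flush once ~10ms of GC time accumulated

    def initialize
      @accumulated_gc_time_ns = 0
    end

    # Called on every GC step, possibly hundreds/thousands of times per second
    def on_gc_step(step_duration_ns)
      @accumulated_gc_time_ns += step_duration_ns
      return if @accumulated_gc_time_ns < FLUSH_THRESHOLD_NS

      record_gc_sample(@accumulated_gc_time_ns) # hypothetical helper
      @accumulated_gc_time_ns = 0
    end
  end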


37 of 59

Gathering Garbage Collection (*simplified!)

Ruby does GC in whatever thread was active

➜ Create “virtual” thread to centralize it


38 of 59

Gathering Allocations (*simplified!)

➜ Create tracepoint for RUBY_INTERNAL_EVENT_NEWOBJ

(only available to native extensions)

➜ Every N objects allocated, record:

  • Backtrace
  • Class
  • Active web request
  • Extra metadata


39 of 59

Gathering Allocations (*simplified!)

Every N objects?

Challenge:

➜ What’s a good value for N that keeps overhead low?

(Trick question…)

Solution:

➜ Actively adjust N based on current overhead + activity of app
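
A sketch of sampling every Nth allocation with an adaptive N (the real profiler drives this from the native NEWOBJ tracepoint; class, method names, and numbers here are illustrative):

  class AllocationSampler
    TARGET_SAMPLES_PER_SECOND = 100 # illustrative target, not the real value

    def initialize
      @every_n = 1
      @allocations_since_sample = 0
      @samples_this_second = 0
    end

    # Called for every new object (from the NEWOBJ tracepoint in reality)
    def on_new_object(backtrace, klass)
      @allocations_since_sample += 1
      return if @allocations_since_sample < @every_n

      @allocations_since_sample = 0
      @samples_this_second += 1
      # weight: each sample stands in for N allocations, keeping totals honest
      record_allocation_sample(backtrace, klass, weight: @every_n) # hypothetical
    end

    # Called once per second: sampling too often? raise N. Very quiet app? lower N.
    def adjust
      if @samples_this_second > TARGET_SAMPLES_PER_SECOND
        @every_n *= 2
      elsif @every_n > 1
        @every_n /= 2
      end
      @samples_this_second = 0
    end
  end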


40 of 59

Gathering Allocations (*simplified!)

Very well-tuned application that allocates very little:

➜ N will be close to 1

➜ Otherwise profiler would collect too little data

Application that is allocating a lot of objects:

➜ N will be a large number

➜ Otherwise profiler would collect too much data


41 of 59

Gathering Heap Live Objects (*simplified!)

➜ Reuse allocations tracepoint

➜ Every M objects that the allocation profiler picked, record:

  • Backtrace
  • Object id
  • Extra metadata


42 of 59

Gathering Heap Live Objects (*simplified!)

Every ~minute:

➜ Use ObjectSpace._id2ref to check if the object is still alive

➜ Include object in profile if still alive

➜ Clean up tracking if object was GC’d
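
A sketch of that check, assuming each tracked allocation stored the object's object_id (tracked_objects and heap_profile are illustrative; the real implementation needs extra care around recycled ids):

  def still_alive?(object_id)
    ObjectSpace._id2ref(object_id) # raises RangeError once the object is gone
    true
  rescue RangeError
    false
  end

  # tracked_objects: { object_id => recorded allocation sample }
  tracked_objects.delete_if do |object_id, sample|
    if still_alive?(object_id)
      heap_profile.add(sample) # still alive: include in this profile
      false                    # and keep tracking it
    else
      true                     # object was GC'd: clean up tracking
    end
  end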


43 of 59

Gathering Heap Live Objects (*simplified!)

Hard to catch a very small number of large objects…

…but we have some ideas for this 😉


Not every object is tracked, but memory leaks will show up as more and more objects stay alive

44 of 59

Challenges && Workarounds

45 of 59

TL;DR: Profiling stretches the VM to its limits to get all this data with low overhead

There are a few things we wish we could do to make profiling Ruby rock even more ;) 🚀🚀🚀

There’s too much for this presentation…
…we’re collecting our use-cases at dtdg.co/ruby-profiler-challenges


46 of 59

Lowering overhead of Allocations/Heap profiling

➜ The Ruby object allocation code has both a fast and a slow path

➜ Creating a RUBY_INTERNAL_EVENT_NEWOBJ tracepoint forces all object allocations to always go through the slow path 🐌🐌🐌

➜ Being able to have the tracepoint active and still use the fast path would lower the overhead of profiling objects 🚅🚅🚅


47 of 59

Profiling and Ractors

➜ Currently there’s a weird behavior in the VM where a Ractor getting garbage collected disables active tracepoints

➜ This means that some profiler features stop collecting data when Ractors are active

🤔 bugs.ruby-lang.org/issues/19112
🏗 github.com/ruby/ruby/pull/7184


48 of 59

Safe sampling from signal handler

➜ Getting a stack trace sample from a signal handler enables highly accurate CPU profiling (Go + others use this successfully!)

➜ Initializing/updating the frame pointer is not guaranteed to be atomic in vm_push_frame (in practice it often is)
➜ A signal interruption could observe an incompletely-initialized frame with some compilers

➜ Currently we use rb_postponed_job_trigger as a workaround


49 of 59

Also a big thanks to the community!

A lot of fixes/improvements have landed thanks to

@kjtsanaktsidis
@byroot
@mame
@ko1
@luke-gru

and many others (I definitely missed folks, and you all rock!)


50 of 59

Using Rust in a gem

51 of 59

Using Rust in a gem

➜ Datadog Ruby profiler uses Rust to implement profile data storage and serialization

➜ Key to having a very efficient + safe implementation


52 of 59

Using Rust in a gem

➜ libdatadog gem

github.com/datadog/libdatadog (open-source)

➜ Distributed as a binary gem on rubygems.org

➜ Installs without needing a Rust compiler around


53 of 59

Full solution: C + Rust

On installation of ddtrace/datadog gem:

➜ Build native C extension from source

➜ Pull in libdatadog gem as dependency

➜ Link to binary libdatadog_profiling.so inside it

➜ Ruby-specific parts in C

➜ One libdatadog_profiling.so for all Ruby VM versions

➜ …need binaries for every platform we support
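
A simplified extconf.rb sketch of that linking step (the real one in dd-trace-rb does quite a bit more; the directory layout, library name, and extension name below are illustrative):

  # extconf.rb: link the C extension against the prebuilt shared library
  # shipped inside the libdatadog binary gem
  require "mkmf"

  libdatadog_dir = Gem::Specification.find_by_name("libdatadog").gem_dir

  $CPPFLAGS += " -I#{libdatadog_dir}/include"
  $LDFLAGS  += " -L#{libdatadog_dir}/lib -ldatadog_profiling"
  $LDFLAGS  += " -Wl,-rpath,#{libdatadog_dir}/lib" # so the .so is found at runtime

  create_makefile("datadog_profiling_native_extension")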


54 of 59

Always-on production profiling:

What? and Why?

55 of 59

1. Investigate problems

➜ What changed and is causing issues?

➜ What’s using so much CPU/memory/…?

➜ Why was this request slower than usual?

➜ Hard to reproduce in staging

➜ Unknown-unknowns, not covered by instrumentation


56 of 59

2. Reduce latency

➜ Happy users

➜ Always check end-to-end (Ruby may not be the biggest source!)

➜ Profile-guided optimization…?

3. Reduce costs

➜ Find low-hanging fruit

➜ Find biggest resource consumers


57 of 59

4. Learning && Discovery tool

➜ What are the important code paths?

➜ Does it match what I expected?

It’s spending most of the time on that part!?!

I thought we had removed that feature flag!


58 of 59

Let’s make Ruby apps even faster together! 🙏

59 of 59

Let’s make Ruby apps even faster together! 🙏

Ivo Anjo / ivo.anjo@datadoghq.com / @knux

slides dtdg.co/rubykaigi2024

vm wishlist dtdg.co/ruby-profiler-challenges

Datadog is hiring!