Optimizing Ruby:
Building an
Always-On Production Profiler
Who am I
Ivo Anjo
Senior Software Engineer @ Datadog
❤ Ruby since 2014
❤ Exploring language runtimes
❤ Application Performance
❤ Making tools to uncover perf insights 👀
Who am I
Ivo Anjo
Senior Software Engineer @ Datadog
I work on the Datadog Ruby Continuous Profiler! 🎉
We provide Cloud Monitoring as a Service 😃
Including profiling, distributed tracing, security, errors, logs for Ruby
Datadog Ruby Continuous Profiler
➜ Part of ddtrace gem
➜ github.com/datadog/dd-trace-rb + Open-source (Apache2 or BSD-3)
➜ Supports CRuby/MRI Ruby 2.5+
➜ Profiler: Linux + hidden experimental macOS support 😉 (Docker on macOS works!)
➜ Gem: Linux + macOS
➜ Ruby + C + Rust
➜ Produces data in the pprof format created by Google
➜ Built to be always-on in production
How to gather hundreds/thousands of profiling data points per second with low overhead
Or,
Having fun pushing the Ruby VM to its limits 😉
There’s already a bunch of great Ruby profilers out there…
…often with >1 caveat for prod:
➜ Add a lot of overhead
➜ Produce a lot of data
(store? upload?)
➜ Gather sensitive data
(heap dumps and similar)
➜ Track only one type of data at a time
Always-on in production:
➜ Low overhead (sampling!)
➜ Efficiently manage data (in-memory + uploaded)
➜ No sensitive data
➜ Track multiple types of data simultaneously
Outline
00
Profiling in practice
01
Types of profiling data
02
Gathering profiles
03
Challenges && workarounds
04
Using Rust in a gem
05
Always-on production profiling: What? and Why?
Profiling in practice
The case of the internal server errors
Setup:
➜ Datadog self-hosted GitLab
➜ Requests failing with Rack timeouts after 60 seconds 🤔
The case of the internal server errors
🔍 Almost all time spent for this request was on CPU in match?
The case of the internal server errors
Separate branch and tag names into two groups: with a slash and without
Build a regular expression from all the names without a slash
Check whether the names with a slash match that regular expression
The case of the internal server errors
🤔
Wait… What repo is this again?
How many branches and tags are there?
The case of the internal server errors
🤯
Big datadog frontend repo
> 50000 branches
> 130000 tags
The case of the internal server errors
Colleague:
> I did a huge cleanup on (...repo…) to delete 50000 branches, and now the route is working again on GitLab.
🎉
The case of the slow ClassCount check in ddtrace
Bonus: How to use locally without a Datadog account
Types of profiling data
CPU Time
➜ How much time was spent by code executing on CPU, and where
➜ ≈ Time spent by Ruby VM actually running code
➜ Does not include network, file access, waiting, …
Wall Time
➜ How much time was spent by what code in total
➜ sleep(10): 0 seconds cpu time, 10 seconds wall time
➜ Includes on-CPU and off-CPU time (network, waiting for the GVL, etc.)
➜ Wall time of a web request: latency your users experience
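The sleep example above can be verified directly from plain Ruby with Process.clock_gettime — a minimal sketch, no profiler involved:

```ruby
# Contrast wall time vs CPU time around a sleep: the process is blocked,
# so wall time advances while CPU time barely moves
cpu_before  = Process.clock_gettime(Process::CLOCK_PROCESS_CPUTIME_ID)
wall_before = Process.clock_gettime(Process::CLOCK_MONOTONIC)

sleep(0.2) # off-CPU: waiting, not executing

cpu_time  = Process.clock_gettime(Process::CLOCK_PROCESS_CPUTIME_ID) - cpu_before
wall_time = Process.clock_gettime(Process::CLOCK_MONOTONIC) - wall_before
```

After this runs, wall_time is at least 0.2s while cpu_time stays close to zero.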
Allocations
➜ How many objects are getting created and where
➜ More objects means more work for Garbage Collector
➜ GC work impacts the whole application, not just the parts that create a lot of objects
Garbage Collection
➜ How much time was spent by the Garbage Collector
➜ Hidden cost of too many objects
Heap Live Objects
➜ What objects are still alive after GC
➜ Useful for tracking object memory leaks
(objects that are kept alive and should not)
Gathering profiles
Controlling overhead: Sampling!
Recording every method call, object created, event, … is costly
Challenge:
➜ Profiling should not make the app slower, use more memory, …
Sampling:
➜ Don’t record every event, instead record periodically*
➜ Create a statistically significant approximation of app workload
Gathering cpu and wall time (*simplified!)
➜ From a background thread, every 10ms record the state of other threads
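A plain-Ruby sketch of this idea (hypothetical and much simpler than the real profiler, which does this from C without the GVL):

```ruby
# Background thread that, every 10ms, records the top stack frame of
# every other thread; frame counts approximate where time is spent
samples = Hash.new(0)

sampler = Thread.new do
  loop do
    Thread.list.each do |thread|
      next if thread == Thread.current
      stack = thread.backtrace # may be nil if the thread just died
      samples[stack.first] += 1 if stack && stack.first
    end
    sleep 0.01 # sampling interval
  end
end
```

Frames that show up often in `samples` are where the app spends its time.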
Gathering cpu and wall time (*simplified!)
Wall time:
➜ Record time since last sample
CPU time:
➜ Ask the OS for the time the thread spent on CPU
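A sketch of the per-sample bookkeeping, assuming wall time is the monotonic-clock delta since the previous sample and CPU time comes from the thread CPU clock. (From plain Ruby you can only read the calling thread's CPU clock; the real profiler reads other threads' clocks from native code.)

```ruby
# Tracks wall/CPU time deltas between consecutive samples for the calling thread
class SampleClock
  def initialize
    @last_wall = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    @last_cpu  = Process.clock_gettime(Process::CLOCK_THREAD_CPUTIME_ID)
  end

  # Returns [wall_delta, cpu_delta] accumulated since the previous call
  def sample
    wall = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    cpu  = Process.clock_gettime(Process::CLOCK_THREAD_CPUTIME_ID)
    deltas = [wall - @last_wall, cpu - @last_cpu]
    @last_wall, @last_cpu = wall, cpu
    deltas
  end
end
```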
Gathering cpu and wall time (*simplified!)
Timestamp:
➜ Used to build timeline view where events are shown in order
Gathering cpu and wall time (*simplified!)
Active web request:
➜ Read state of distributed tracing from fiber-local storage
➜ Can filter profile for a single request (“code hotspots”) or a single endpoint (“endpoint profiling”)
➜ Especially useful for multi-threaded apps
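This relies on Thread#[] being fiber-local in Ruby (Thread#thread_variable_get is the thread-local variant). A toy sketch with a made-up :active_trace key:

```ruby
# Store hypothetical tracing state in fiber-local storage
Thread.current[:active_trace] = { trace_id: 1234, endpoint: "GET /users" }

# The same fiber sees it...
same_fiber = Thread.current[:active_trace]

# ...but a different fiber on the same thread does not (fiber-local!)
other_fiber = Fiber.new { Thread.current[:active_trace] }.resume
```

Because the storage is per-fiber, each concurrently handled request keeps its own tracing state.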
Gathering cpu and wall time (*simplified!)
Gathering cpu and wall time (*simplified!)
From a background thread, every 10ms record the state of other threads
Challenge:
➜ What about the Global VM Lock (GVL)?
➜ Does the profiler need to wait for its turn to execute?
➜ What if that takes longer than 10ms?
Gathering cpu and wall time (*simplified!)
Solution:
➜ Background thread runs in C without the GVL
➜ Interrupt currently-active thread using the Unix SIGPROF signal
➜ Use rb_postponed_job_trigger to ask the VM to record info
From a background thread, every 10ms record the state of other threads
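A toy plain-Ruby analogue of this flow: Signal.trap handlers, like postponed jobs, only run when the VM reaches a safe point, so the sampled stack is always consistent. (The real profiler sends SIGPROF from C without the GVL; here a Ruby thread sends it.)

```ruby
samples = []

# Handler runs in the main thread at a VM safe point; `caller` captures
# the stack that was interrupted
Signal.trap("PROF") { samples << caller }

# Hypothetical "profiler" thread: deliver SIGPROF every 10ms
sender = Thread.new do
  10.times do
    Process.kill("PROF", Process.pid)
    sleep 0.01
  end
end

# Busy-loop on the main thread so there is something to sample
deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + 0.3
iterations = 0
iterations += 1 while Process.clock_gettime(Process::CLOCK_MONOTONIC) < deadline

sender.join
Signal.trap("PROF", "DEFAULT") # restore default signal handling
```

Unix-only, and deliberately simplified: it samples only the main thread and holds the GVL while recording.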
Gathering cpu and wall time (*simplified!)
Challenge:
➜ What if doing this every 10ms is still too much overhead?
Solution:
➜ Profiler measures time spent sampling, sleeps for longer to reduce overhead if needed
From a background thread, every 10ms record the state of other threads
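That back-off logic can be sketched as follows, with an assumed 2% overhead target (not the real profiler's numbers): if a sample took t seconds, sleep at least t / target so sampling stays under the target fraction of time.

```ruby
TARGET_OVERHEAD = 0.02 # assumed: spend at most ~2% of time sampling
BASE_INTERVAL   = 0.01 # the usual 10ms

# Given how long the last sample took, how long should we sleep before the next one?
def next_sleep(sampling_duration)
  [BASE_INTERVAL, sampling_duration / TARGET_OVERHEAD].max
end
```

A cheap 0.01ms sample keeps the normal 10ms cadence, while an expensive 1ms sample forces a ~50ms sleep.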
Gathering Garbage Collection (*simplified!)
Create tracepoint for
➜ RUBY_INTERNAL_EVENT_GC_ENTER
➜ RUBY_INTERNAL_EVENT_GC_EXIT
(These tracepoint events are only available to native extensions)
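Those internal events are C-only, but plain Ruby can approximate the same measurement with GC::Profiler — a rough sketch of the idea, not what ddtrace does:

```ruby
# Measure time spent in the garbage collector while churning through
# short-lived objects
GC::Profiler.enable

100_000.times { Object.new } # allocate garbage
GC.start                     # force at least one full collection

gc_time = GC::Profiler.total_time # seconds spent in GC while enabled
GC::Profiler.disable
```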
Gathering Garbage Collection (*simplified!)
GC_ENTER and GC_EXIT are special tracepoints: you cannot allocate objects during GC
➜ Need to delay some of the recording work until after GC
Gathering Garbage Collection (*simplified!)
Ruby splits GC work into many small steps
➜ Tracepoints get called for every one of them (up to hundreds/thousands of times per second)
➜ Batch small steps to keep overhead low
Gathering Garbage Collection (*simplified!)
Ruby does GC in whatever thread was active
➜ Create “virtual” thread to centralize it
Gathering Allocations (*simplified!)
➜ Create tracepoint for RUBY_INTERNAL_EVENT_NEWOBJ
(only available to native extensions)
➜ Every N objects allocated, record a sample
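The NEWOBJ event itself is only reachable from C, but GC.stat exposes the counter such a tracepoint would be sampling — a plain-Ruby observation, not the profiler's actual mechanism:

```ruby
# Count how many objects a piece of code allocates
before = GC.stat(:total_allocated_objects)

1_000.times { Object.new }

allocated = GC.stat(:total_allocated_objects) - before
# With sampling every N allocations, this code would yield roughly
# allocated / N samples
```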
Gathering Allocations (*simplified!)
Every N objects?
Challenge:
➜ What’s a good value for N that keeps overhead low?
(Trick question…)
Solution:
➜ Actively adjust N based on current overhead + activity of the app
Gathering Allocations (*simplified!)
Very well-tuned application that allocates very little:
➜ N will be close to 1
➜ Otherwise profiler would collect too little data
Application that is allocating a lot of objects:
➜ N will be a large number
➜ Otherwise profiler would collect too much data
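The adjustment can be sketched as keeping samples-per-second roughly constant; the target below is an assumed value, not the real profiler's:

```ruby
TARGET_SAMPLES_PER_SEC = 100.0 # assumed target sampling rate

# Pick N so that (allocations per second) / N ≈ the target rate,
# never going below 1 (sample every allocation in quiet apps)
def next_n(allocations_per_sec)
  [(allocations_per_sec / TARGET_SAMPLES_PER_SEC).round, 1].max
end
```

A quiet app allocating 50 objects/s gets N = 1; one allocating a million objects/s gets N = 10000.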
Gathering Heap Live Objects (*simplified!)
➜ Reuse allocations tracepoint
➜ Every M objects that the allocation profiler picked, record one for heap tracking
Gathering Heap Live Objects (*simplified!)
Every ~minute:
➜ Use ObjectSpace._id2ref to check if each tracked object is still alive
➜ Include object in profile if still alive
➜ Clean up tracking if object was GC’d
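A liveness check along these lines can be sketched with ObjectSpace._id2ref, which raises RangeError when an id no longer refers to a live object (simplified; the real profiler also has to avoid keeping the objects alive itself):

```ruby
# Returns true if the object behind this id is still alive
def alive?(object_id)
  ObjectSpace._id2ref(object_id)
  true
rescue RangeError # raised for recycled or unknown ids
  false
end

live = Object.new # held by a local variable, so it stays alive
```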
Gathering Heap Live Objects (*simplified!)
Hard to catch a very small number of large objects…
…but we have some ideas for this 😉
Not every object is tracked, but memory leaks will show up as more and more objects stay alive
Challenges && Workarounds
TL;DR: Profiling stretches the VM to its limits to get all this data with low overhead
There are a few things we wish we could do to make profiling Ruby rock even more ;) 🚀🚀🚀
There’s too much for this presentation…
…collecting our use-cases at dtdg.co/ruby-profiler-challenges
Lowering overhead of Allocations/Heap profiling
➜ The Ruby object allocation code has both a fast and a slow path
➜ Creating a RUBY_INTERNAL_EVENT_NEWOBJ tracepoint forces all object allocations to always go through the slow path 🐌🐌🐌
➜ Being able to have the tracepoint active and still use the fast path would lower the overhead of profiling objects 🚅🚅🚅
Profiling and Ractors
➜ Currently there’s a weird behavior in the VM where a Ractor getting garbage collected disables active tracepoints
➜ This means that some profiler features stop collecting data when Ractors are active
🤔 bugs.ruby-lang.org/issues/19112�🏗 github.com/ruby/ruby/pull/7184
Safe sampling from signal handler
➜ Getting a stack trace sample from a signal handler enables highly accurate CPU profiling (Go + others use this successfully!)
➜ Initializing and updating the frame pointer is not guaranteed to be atomic in vm_push_frame (in practice it often is)
➜ Signal interruption could see an incompletely initialized frame on some compilers
➜ Currently we use rb_postponed_job_trigger as a workaround
Also a big thanks to the community!
A lot of fixes/improvements have landed thanks to
@kjtsanaktsidis
@byroot
@mame
@ko1
@luke-gru
and many others (I definitely missed folks, and you all rock!)
Using Rust in a gem
Using Rust in a gem
➜ Datadog Ruby profiler uses Rust to implement profile data storage and serialization
➜ Key to having very efficient + safe implementation
Using Rust in a gem
➜ libdatadog gem
➜ github.com/datadog/libdatadog (open-source)
➜ Distributed as a binary gem on rubygems.org
➜ Installs without needing a Rust compiler around
Full solution: C + Rust
On installation of ddtrace/datadog gem:
➜ Build native C extension from source
➜ Pull in libdatadog gem as dependency
➜ Link to binary libdatadog_profiling.so inside it
➜ Ruby-specific parts in C
➜ One libdatadog_profiling.so for all Ruby VM versions
➜ …need binaries for every platform we support
Always-on production profiling:
What? and Why?
1. Investigate problems
➜ What changed and is causing issues?
➜ What’s using so much CPU/memory/…?
➜ Why was this request slower than usual?
➜ Hard to reproduce in staging
➜ Unknown-unknowns, not covered by instrumentation
2. Reduce latency
➜ Happy users
➜ Always check end-to-end (Ruby may not be the biggest source!)
➜ Profile-guided optimization…?
3. Reduce costs
➜ Find low-hanging fruit
➜ Find biggest resource consumers
4. Learning && Discovery tool
➜ What are the important code paths?
➜ Does it match what I expected?
It’s spending most of the time on that part!?!
I thought we had removed that feature flag!
Let’s make Ruby apps even faster together! 🙏
Ivo Anjo / ivo.anjo@datadoghq.com / @knux
slides
vm wishlist
Datadog is hiring!