1 of 30

Evolution of stack trace capture with BPF

Andrii Nakryiko

2 of 30

BPF_MAP_TYPE_STACK_TRACE

struct {
    __uint(type, BPF_MAP_TYPE_STACK_TRACE);
    __uint(max_entries, MAX_STACK_TRACE_CNT);
    __type(key, u32);
    __type(value, u64[PERF_MAX_STACK_DEPTH]);
} stacks SEC(".maps");

long bpf_get_stackid(void *ctx, void *map, __u64 flags);

3 of 30

BPF side

id = bpf_get_stackid(ctx, &stacks, BPF_F_USER_STACK);
if (id < 0) {
    /* failure */
}

sample.ustack_id = id;
bpf_perf_event_output(ctx, …, &sample, sizeof(sample));

4 of 30

User space side

u64 addrs[PERF_MAX_STACK_DEPTH];

err = bpf_map_lookup_elem(map_fd, &sample.ustack_id, addrs);
if (err) {
    /* error handling */
}

/* first N elements of addrs[] contain captured addresses */

5 of 30

Build ID support

  • Possible to capture (build ID + file offset) instead of absolute address
  • .map_flags = BPF_F_STACK_BUILD_ID
  • Special per-stack frame type:

#define BPF_BUILD_ID_SIZE 20

struct bpf_stack_build_id {
    __s32 status;
    unsigned char build_id[BPF_BUILD_ID_SIZE];
    union {
        __u64 offset;
        __u64 ip;
    };
};
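
For illustration (not part of the original slides), a build-ID-enabled setup might look roughly like this, reusing the earlier map definition; MAX_STACK_TRACE_CNT and stack_id are assumed to be defined elsewhere, and the decoding loop is only a sketch:

struct {
    __uint(type, BPF_MAP_TYPE_STACK_TRACE);
    __uint(max_entries, MAX_STACK_TRACE_CNT);
    __uint(map_flags, BPF_F_STACK_BUILD_ID);
    __type(key, u32);
    __type(value, struct bpf_stack_build_id[PERF_MAX_STACK_DEPTH]);
} stacks SEC(".maps");

/* user space: decode one captured stack */
struct bpf_stack_build_id frames[PERF_MAX_STACK_DEPTH];

if (!bpf_map_lookup_elem(map_fd, &stack_id, frames)) {
    for (int i = 0; i < PERF_MAX_STACK_DEPTH; i++) {
        if (frames[i].status == BPF_STACK_BUILD_ID_VALID)
            ; /* frames[i].build_id + frames[i].offset identify the frame */
        else if (frames[i].status == BPF_STACK_BUILD_ID_IP)
            ; /* fallback: only the absolute frames[i].ip is available */
    }
}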

6 of 30

Quirks of STACK_TRACE API

  • Returns 32-bit stack ID (convenient!)
  • Captures user space stack trace (BPF_F_USER_STACK)
  • … or kernel stack trace (omit BPF_F_USER_STACK)
    • can’t capture both (and no one complained so far!)
  • actual number of captured addresses is implicit (!)
  • automatic stack deduplication

7 of 30

Implementation: the good

Specialized hash map implementation.

Stack deduplication can save space.

Design favors space efficiency and performance.

Does not support hash collisions.

8 of 30

Implementation: the bad

Hash collisions are pretty frequent and unavoidable!

Hash collision handling and trade-offs are controlled through flags:

  • BPF_F_FAST_STACK_CMP – compare only hashes
  • BPF_F_REUSE_STACKID – overwrite previous stack trace
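
For illustration (not from the original slides), both are passed in the flags argument of bpf_get_stackid() and can be combined with BPF_F_USER_STACK:

id = bpf_get_stackid(ctx, &stacks,
                     BPF_F_USER_STACK | BPF_F_FAST_STACK_CMP | BPF_F_REUSE_STACKID);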

9 of 30

Implementation: the ugly

Choice between two bad options:

  • Lose data
    • Without BPF_F_REUSE_STACKID – drop stack trace even if there is space available

  • Corrupt data
    • With BPF_F_REUSE_STACKID – corrupt all previous references to the same stack ID

Our production never uses BPF_F_REUSE_STACKID!

10 of 30

Implementation: the ugly

  • Stack dedup makes removal from STACK_TRACE inherently racy.
  • While user space deletes element, BPF side might use that stack ID.
  • Can't free up space as soon as user space consumed stack trace (!)
  • STACK_TRACE is not well suited for longer-running sessions.

11 of 30

Making it work in practice

"Double buffering" approach:

  • two STACK_TRACE maps, one active at a time
  • the other is read and cleared by user space

Cons:

  • wastes memory
  • complicates setup
  • a small transition window:
    • user space starts consuming stack traces
    • while the BPF side is still completing writes into that map
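
A minimal sketch of the double-buffering idea (illustrative only; the names stacks_a, stacks_b, and active are not from the slides), assuming user space flips active and then drains and clears the now-inactive map after a short grace period:

struct stack_map {
    __uint(type, BPF_MAP_TYPE_STACK_TRACE);
    __uint(max_entries, MAX_STACK_TRACE_CNT);
    __type(key, u32);
    __type(value, u64[PERF_MAX_STACK_DEPTH]);
};

struct stack_map stacks_a SEC(".maps");
struct stack_map stacks_b SEC(".maps");

volatile __u32 active; /* flipped by user space: 0 -> stacks_a, 1 -> stacks_b */

/* inside the profiling BPF program: */
long id;

if (active)
    id = bpf_get_stackid(ctx, &stacks_b, BPF_F_USER_STACK);
else
    id = bpf_get_stackid(ctx, &stacks_a, BPF_F_USER_STACK);

/* record which buffer the ID refers to alongside the ID itself */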

12 of 30

Observations from production

CPU profiling didn't benefit much from deduplication of stacks.

Stack traces are pretty unique, overall.

Let users manage memory.

13 of 30

Evolution: bpf_get_stack()

long bpf_get_stack(void *ctx, void *buf, __u32 size, __u64 flags);

  • captures stack trace into user-supplied buffer
  • returns the amount of data actually captured
    • clears the tail, making it usable as part of hash map key (!)
  • up to user how to use it afterwards:
    • dedup as part of HASH map key (see the sketch below)
    • send to user space with BPF ringbuf
    • analyze in BPF code (but I'm not aware of anyone doing it)
  • All but one use case at Meta has switched to bpf_get_stack()!
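
A hedged sketch (not verbatim from any production code) of the "dedup as HASH map key" pattern; MAX_DEPTH, the map sizing, and the program section are illustrative, and vmlinux.h plus bpf_helpers.h are assumed:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

#define MAX_DEPTH 32 /* kept small so the key fits on the BPF stack */

struct stack_key {
    __u64 addrs[MAX_DEPTH];
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, struct stack_key);
    __type(value, __u64);
} counts SEC(".maps");

SEC("perf_event")
int profile(void *ctx)
{
    struct stack_key key = {};
    __u64 one = 1, *cnt;
    long len;

    len = bpf_get_stack(ctx, key.addrs, sizeof(key.addrs), BPF_F_USER_STACK);
    if (len <= 0)
        return 0;

    /* unused tail of key.addrs stays zeroed, so the whole struct is a stable key */
    cnt = bpf_map_lookup_elem(&counts, &key);
    if (cnt)
        __sync_fetch_and_add(cnt, 1);
    else
        bpf_map_update_elem(&counts, &key, &one, BPF_NOEXIST);

    return 0;
}

char LICENSE[] SEC("license") = "GPL";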

14 of 30

Are we done yet?

Not quite.

There are still problems.

15 of 30

Synchronous API: assumptions

  • stacks are captured synchronously
  • assume worst case (i.e., NMI context)
  • no page faults allowed, memory has to be physically present

16 of 30

Synchronous API: consequences

  • user stack trace capture can be unreliable
  • build ID support is restricted and unreliable
    • again, worst-case NMI assumptions
    • fails if build ID ELF note is not physically present
    • fails if build ID is not within first 4KB of ELF file (!)
    • there were attempts to add build ID caching (NACKed, though)
  • kernel stack traces are oblivious to this (reliable!)

17 of 30

Synchronous API: limitations

  • (fundamentally) incompatible with SFrame or .eh_frame (DWARF) stack unwinding approaches
  • can’t wait for necessary data to be paged in

18 of 30

We need a new API

This time, asynchronous!

19 of 30

Asynchronous API: kernel stacks

  • can't be done for kernel stack traces
  • they are needed here and now (perf_event, kprobe, tracepoint)
  • good news: it already works well even with synchronous API

20 of 30

Asynchronous API: user stacks

  • Key observation: user stacks can be postponed
  • requested in NMI – captured just before returning to user space
  • user stack trace is still the same (user thread is frozen)
  • do it in faultable (a.k.a. "sleepable") context
    • means we can wait for ELF data to be paged in, if necessary

21 of 30

API design: overview

  • bpf_get_stackid()-like API, returning 32-bit stack ID
  • ID is a reservation, stable and can be recorded upfront
  • kernel stack trace is captured synchronously
  • user stack trace capture is deferred until return to user space
  • bpf_map_lookup_elem() returns -EAGAIN if the stack is not ready yet (see the sketch below)
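
Purely as an illustration of the proposed flow (the map type and the BPF-side helper are not upstream; only bpf_map_lookup_elem() and the -EAGAIN convention come from this slide), user-space consumption might look like this, assuming libbpf 1.0+ error conventions:

__u64 addrs[PERF_MAX_STACK_DEPTH];
int err;

err = bpf_map_lookup_elem(map_fd, &stack_id, addrs);
if (err == -EAGAIN) {
    /* user stack not captured yet: retry later, or wait for a
     * notification (see the following slides) */
} else if (err) {
    /* real error */
} else {
    /* addrs[] holds the completed stack trace */
}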

22 of 30

API design: deduplication

  • STACK_TRACE map is notoriously hard to use reliably
  • Stack deduplication has to be dropped from the public API.
  • One ID – one unique stack.
  • Makes bpf_map_delete_elem() race-free (no risk of reusing ID)

23 of 30

API design: deduplication

  • internal dedup is possible (but hidden from user)
    • Internal refcounting
    • bpf_map_delete_elem() drops refcount of underlying stack trace memory
  • CPU vs. memory trade-off
    • complexity and CPU overhead with dedup
    • race-free deletes allow fast memory reuse!

Opinion: it doesn't seem worth the bother.

24 of 30

API design: notifications

How to notify user that stack trace is ready?

25 of 30

API design: notifications

  • trivial: no notification
    • (+) no code is best code
    • (-) user code forced to periodically retry (but maybe that's ok?)

26 of 30

API design: notifications

  • easy: map-wide epoll notification whenever any stack trace is ready
    • (+) cheap and simple
    • (-) might be wasteful for user, causing many retries

27 of 30

API design: notifications

  • wasteful: each slot supports epoll
    • (+) user can poll on each stack ID
    • (-) need to create FD for each ID
    • (-) each slot embeds wait_queue_head_t

28 of 30

API design: notifications

  • (?) efficient: BPF ringbuf as a delivery mechanism (consumer side sketched below)
    • (+) IDs are sent as they become "ready"
    • (+) Very efficient notification and consumption
    • (-) What to do if BPF ringbuf is full?
      • (?) User problem
      • (?) Some map stats
  • (?) Send entire stack trace?
    • (+) variable-length data is possible, no space waste
    • (+) extensible (the BPF ringbuf record size is reported to the user)
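
A sketch of what the consumer side could look like, assuming the "ready" stack IDs were delivered through a regular BPF ringbuf map (handle_ready_id, the record layout, ringbuf_map_fd, and exiting are all illustrative):

static int handle_ready_id(void *ctx, void *data, size_t len)
{
    __u32 stack_id = *(__u32 *)data;

    /* look up the now-complete stack trace by stack_id */
    return 0;
}

struct ring_buffer *rb = ring_buffer__new(ringbuf_map_fd, handle_ready_id, NULL, NULL);

while (!exiting)
    ring_buffer__poll(rb, 100 /* ms */);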

29 of 30

API design: customization

  • should we allow custom BPF program for stack unwinding?
    • bpf_wq should be flexible and sufficient for that?
  • good built-in kernel support is important
    • uretprobe "corrupting" stack trace
    • kernel can fix this up ([0])
  • SFrame is coming?
  • is limited .eh_frame (DWARF) support feasible?

[0] https://lore.kernel.org/all/20240508212605.4012172-3-andrii@kernel.org/

30 of 30

Thank you!