1 of 36

THE INTEL PROCESSOR TRACE

BOGDAN TANASA

28th of May, 2025

Lund Linux Con

2 of 36

AGENDA

Introduction
Use Cases
Challenges

Processor Performance Insights and Optimization

(from a tooling perspective)

3 of 36

INTRODUCTION (INTEL PROCESSOR TRACE)

HW that captures information about software execution

with minimal performance perturbation to the SW being traced

Generates a variety of packets that, when combined with binaries, can:

(re)produce an exact execution trace
reveal timing and program flow information

enables both functional and performance debugging of applications

correlate with Processor Event-Based Sampling (PEBS)

4 of 36

INTRODUCTION (PEBS)

A HW feature providing more precise profiling than traditional PMU capabilities

How does it work?

Configure a PMU event
When the event occurs, the CPU records current CPU state into a memory buffer

CPU state is usually defined by:

Instruction Pointer
Some other Registers (check the CPU’s manual!)
Memory Address (for some memory events)

Main Advantage

Rich data: includes the CPU state, not just the values of the counters

5 of 36

INTRODUCTION (INTEL PROCESSOR TRACE)

Can log software-generated 8-byte packets

via the ptwrite instruction

Software-generated packets:

are bound to the associated ptwrite instruction in the binary

by its address as seen by the program counter

are timestamped by Intel PT

Intel PT has a dedicated counter different from the TSC

/* One asm statement, so the compiler cannot reuse rax between the mov and the ptwrite */
asm volatile ("mov %0, %%rax\n\tptwrite %%rax" : : "r"(my_var) : "rax");

6 of 36

HOW TO USE INTEL PT?

Configuration of packet generation and capabilities via a set of MSRs

Use the cpuid instruction to detect Intel PT capabilities

On Linux systems

via the perf_event_open syscall
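The capability check mentioned above can be sketched as follows. This is a minimal sketch, not a complete enumeration: the struct pt_caps type and the detect_pt_caps() helper are hypothetical names introduced here, and only a few bits of CPUID leaf 14H are read. It relies on the __get_cpuid_count() helper from the compiler's cpuid.h and reports "not supported" on non-x86 builds.

```c
#include <stdbool.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Hypothetical summary of a few Intel PT capability bits. */
struct pt_caps {
    bool intel_pt; /* CPUID.(EAX=07H,ECX=0):EBX[25]: Intel PT present   */
    bool cyc_psb;  /* CPUID.(EAX=14H,ECX=0):EBX[1]: CYC / PSB config    */
    bool mtc;      /* CPUID.(EAX=14H,ECX=0):EBX[3]: MTC packets         */
    bool ptwrite;  /* CPUID.(EAX=14H,ECX=0):EBX[4]: ptwrite instruction */
};

struct pt_caps detect_pt_caps(void)
{
    struct pt_caps caps = { false, false, false, false };
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    /* __get_cpuid_count() returns 0 when the leaf is not implemented. */
    if (!__get_cpuid_count(0x07, 0, &eax, &ebx, &ecx, &edx))
        return caps;
    caps.intel_pt = (ebx >> 25) & 1;
    if (!caps.intel_pt)
        return caps;

    /* Leaf 14H describes what the Intel PT implementation can do. */
    if (!__get_cpuid_count(0x14, 0, &eax, &ebx, &ecx, &edx))
        return caps;
    caps.cyc_psb = (ebx >> 1) & 1;
    caps.mtc     = (ebx >> 3) & 1;
    caps.ptwrite = (ebx >> 4) & 1;
#endif
    return caps;
}
```

A tool would typically call detect_pt_caps() once at start-up and refuse to emit ptwrite markers when caps.ptwrite is false, since the instruction is not available on every CPU that has Intel PT.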

7 of 36

PERF_EVENT_OPEN (https://man7.org/linux/man-pages/man2/perf_event_open.2.html)

Populate struct perf_event_attr

.config -> Configure Intel PT’s capabilities
.type -> Read the value from /sys/bus/event_source/devices/intel_pt/type
Save the returned file descriptor (fd)

Invoke mmap(fd) so that Intel PT outputs the packets to memory

Decode PERF_RECORD_AUX records
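The first steps above (read the PMU type, populate struct perf_event_attr, issue the syscall) could look roughly like the sketch below. The helper names read_pmu_type(), init_intel_pt_attr() and open_intel_pt() are hypothetical, and the PT_CFG_* bit positions are assumptions that should be checked against the files in /sys/bus/event_source/devices/intel_pt/format/ on the target machine. Mapping the buffers and decoding are left out.

```c
#include <linux/perf_event.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Read a PMU type id from a sysfs file such as
 * /sys/bus/event_source/devices/intel_pt/type. Returns -1 on failure. */
int read_pmu_type(const char *path)
{
    FILE *f = fopen(path, "r");
    int type = -1;

    if (!f)
        return -1;
    if (fscanf(f, "%d", &type) != 1)
        type = -1;
    fclose(f);
    return type;
}

/* Assumed config bit positions; verify them in .../intel_pt/format/. */
#define PT_CFG_FUP_ON_PTW (1ULL << 5)  /* emit a FUP after each PTW packet */
#define PT_CFG_TSC        (1ULL << 10) /* TSC packets                      */
#define PT_CFG_PTW        (1ULL << 12) /* ptwrite packets                  */
#define PT_CFG_BRANCH     (1ULL << 13) /* branch (TNT/TIP) packets         */

void init_intel_pt_attr(struct perf_event_attr *attr, int pmu_type)
{
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    attr->type = (unsigned int)pmu_type;
    attr->config = PT_CFG_TSC | PT_CFG_PTW | PT_CFG_FUP_ON_PTW | PT_CFG_BRANCH;
    attr->exclude_kernel = 1; /* trace user space only */
    attr->disabled = 1;       /* enable later via ioctl */
}

/* glibc has no wrapper for perf_event_open, so go through syscall(2).
 * Returns the event fd, or -1 with errno set. */
int open_intel_pt(struct perf_event_attr *attr, pid_t pid, int cpu)
{
    return (int)syscall(SYS_perf_event_open, attr, pid, cpu, -1, 0UL);
}
```

The returned fd is the one to keep: it is passed to mmap and later serves as the group leader for other events.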

8 of 36

PERF_RECORD_AUX ENTRIES

Stream of bytes containing Intel PT packets

Chapter 33.4 TRACE PACKETS AND DATA TYPES of Intel’s Software Developer’s Manual

The PERF_RECORD_AUX entries have to be consumed quickly, otherwise:

trace data is lost
the perf subsystem may hang
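Consuming the trace quickly mostly means copying the unread bytes out of the AUX ring buffer and advancing the tail. The sketch below shows only the wrap-around copy on a plain byte array. aux_drain() is a hypothetical helper, and it assumes free-running head and tail byte counters (in the style of aux_head and aux_tail in struct perf_event_mmap_page) and a power-of-two buffer size. A real consumer also needs the memory barriers described in the perf_event_open man page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy the unread bytes [tail, head) out of a power-of-two sized ring.
 * Returns the number of bytes copied, at most out_cap.
 */
size_t aux_drain(const uint8_t *ring, size_t ring_size,
                 uint64_t head, uint64_t tail,
                 uint8_t *out, size_t out_cap)
{
    assert(ring_size != 0 && (ring_size & (ring_size - 1)) == 0);

    size_t avail = (size_t)(head - tail);
    if (avail > ring_size)
        avail = ring_size;            /* never read more than one lap */

    size_t n = avail < out_cap ? avail : out_cap;
    size_t start = (size_t)(tail & (ring_size - 1));
    size_t first = ring_size - start; /* bytes before the wrap point */
    if (first > n)
        first = n;

    memcpy(out, ring + start, first);
    memcpy(out + first, ring, n - first); /* wrapped part, may be empty */
    return n;
}
```

After a successful copy the caller would add the returned count to the tail and publish it, which lets the kernel reuse that part of the buffer.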

9 of 36

USE CASES

Timing

?

Processor Event Sampling

Instruction Tracing

Record Execution Flow

?

10 of 36

USE CASES

All use cases require a decoder that consumes the Intel PT packets!

Some resources need to be given to the Intel PT decoder

CPU
Memory Bandwidth

11 of 36

USE CASES

[Diagram: the traced CPU runs do_work while its Intel PT HW writes packets to main memory. The tracer CPU runs the Intel PT decoder, reads the Intel PT packets through cache memory and writes the filtered data to main memory, network or storage.]

12 of 36

TIME MEASUREMENTS

Traditional way

read tsc ← 25-35 cc
do_work
read tsc ← 25-35 cc
record time difference ← few cc

PTWRITE Instrumentation

mov RAX Marker_Start ← X cc
ptwrite RAX ← 1 cc
do_work
mov RAX Marker_Stop ← Y cc
ptwrite RAX ← 1 cc

X = Y = 1 if Marker_Start and Marker_Stop are constants

13 of 36

TIME MEASUREMENTS

Low overhead instrumentation
Suitable for small pieces of code

The rdtsc overhead (25-35 cc per read) is too large and too noisy for such short code

Caveats:

ptwrite, like rdtsc, is NOT a serializing instruction

The exact timing is affected by out-of-order execution

14 of 36

TIME MEASUREMENTS (HOW DOES IT WORK?)

Intel PT HW uses its own Cycle Counter (CYC)

Runs at Core:Bus ratio frequency
TSC is synchronized over all cores

One of the reasons why rdtsc has larger overhead

Intel PT outputs TSC packets to inform the decoder about the wall-clock time

To keep in mind:

CYC and TSC frequencies may not be the same

a transform is needed!
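One possible shape of that transform is sketched below. It assumes the decoder already knows the current core:bus ratio (Intel PT reports it in CBR packets), the bus clock and the TSC frequency. cyc_to_tsc() and tsc_to_ns() are hypothetical helper names, and frequencies are passed in MHz to keep the integer arithmetic well within 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a CYC delta (core clock cycles) into TSC ticks.
 * core frequency = core_bus_ratio * bus_mhz. */
uint64_t cyc_to_tsc(uint64_t cyc, uint32_t core_bus_ratio,
                    uint32_t bus_mhz, uint32_t tsc_mhz)
{
    uint64_t core_mhz = (uint64_t)core_bus_ratio * bus_mhz;

    assert(core_mhz != 0);
    return cyc * tsc_mhz / core_mhz; /* truncates towards zero */
}

/* Convert TSC ticks into nanoseconds. */
uint64_t tsc_to_ns(uint64_t tsc, uint32_t tsc_mhz)
{
    assert(tsc_mhz != 0);
    return tsc * 1000u / tsc_mhz;
}
```

For example, with a ratio of 40, a 100 MHz bus clock and a 2000 MHz TSC, 400 core cycles correspond to 200 TSC ticks, that is 100 ns. The ratio can change at run time, so a decoder would apply the conversion per interval between frequency changes.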

15 of 36

PROCESSOR EVENT-BASED SAMPLING – PEBS

Check with CPUID if PEBS output to Intel PT is supported
Some HW PMU counters are PEBS compatible

16 of 36

PROCESSOR EVENT-BASED SAMPLING – PEBS (HOW DOES IT WORK?)

Issue another perf_event_open syscall

Set struct perf_event_attr.config to a PEBS-compatible event

INST_RETIRED.ANY
BR_INST_RETIRED.ALL_BRANCHES
BR_MISP_RETIRED.ALL_BRANCHES

Set struct perf_event_attr.aux_output = 1
[ Pass the Intel PT fd as group_fd so that Intel PT is the group leader ]

The Intel PT HW of the traced CPU will generate PEBS records

encoded as BBP, BIP and BEP packets
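The second perf_event_open call could be prepared along these lines. This is a sketch under assumptions: init_pebs_attr() and open_pebs_in_pt_group() are hypothetical helpers, the raw encoding 0x00c4 (event 0xC4, umask 0x00) for BR_INST_RETIRED.ALL_BRANCHES should be checked against the event list of the target CPU, and PERF_TYPE_RAW may have to be replaced by a PMU-specific type on hybrid systems. The aux_output field requires reasonably recent kernel headers.

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed raw encoding: event select 0xC4, umask 0x00. */
#define RAW_BR_INST_RETIRED_ALL_BRANCHES 0x00c4ULL

void init_pebs_attr(struct perf_event_attr *attr,
                    uint64_t raw_config, uint64_t period)
{
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    attr->type = PERF_TYPE_RAW;
    attr->config = raw_config;
    attr->sample_period = period; /* counter reload value            */
    attr->precise_ip = 2;         /* non-zero: ask for PEBS          */
    attr->aux_output = 1;         /* route the records into Intel PT */
    attr->exclude_kernel = 1;
}

/* pt_fd is the fd returned by the Intel PT perf_event_open call; passing
 * it as group_fd makes Intel PT the group leader. Returns -1 on error. */
int open_pebs_in_pt_group(struct perf_event_attr *attr,
                          pid_t pid, int cpu, int pt_fd)
{
    return (int)syscall(SYS_perf_event_open, attr, pid, cpu, pt_fd, 0UL);
}
```

The pid and cpu arguments are expected to match those used for the Intel PT event, since both events belong to the same group.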

17 of 36

PROCESSOR EVENT-BASED SAMPLING – PEBS (HOW DOES IT WORK?)

The BBP, BIP and BEP packets

CPU State

18 of 36

PEBS + PDIST

PDIST = Precise Distribution

PEBS records are generated precisely upon completion of the instruction

perf_event_open does not allow selecting the PMU counter

The first call happens to be assigned PMU counter 0

19 of 36

PEBS + PDIST

PDIST is a HW feature that makes PEBS records be generated precisely upon completion of the instruction

Limitations:

Available only on selected counters
The counter reload value must not be less than 256 for PDIST to operate

In other words, PDIST needs a “distance” of at least 256 events (e.g. retired instructions) between records

20 of 36

TRACE OF EXECUTION

/* a, b, c and d are global int counters; read_tsc() returns the current TSC value */
int main(int argc, char* argv[]) {
    for (;;) {
        unsigned long long start = read_tsc();
        unsigned long long elapsed = start & 0xFFFllu;
        /* one asm statement: rax cannot be reused between the mov and the ptwrite */
        asm volatile ("mov %0, %%rax\n\tptwrite %%rax" : : "r"(elapsed) : "rax");
        asm volatile ("mfence" : : : "memory");
        if (elapsed > 1000llu) {
            a++;
        } else if (elapsed > 500llu) {
            b++;
        } else if (elapsed > 100llu) {
            c++;
        } else {
            d++;
        }
    }
    return 0;
}

elapsed == 36 leads to d++

21 of 36

TRACE OF EXECUTION

PTW = 36

FUP = 40178c :: PTWRITE PTWRITE :: 17 140862

CYC = 14

SHORT_TNT :: 06

CYC = e7

BBP :: SZ = 00000000 TYPE = 00000004

CYC = 5

BIP :: ID = 00000000 :: PAYLOAD = 40179c :: COND_BR JBE

CYC = 11

BIP :: ID = 00000001 :: PAYLOAD = 1

CYC = 16

BIP :: ID = 00000002 :: PAYLOAD = e174078715c

CYC = af

BEP :: IP = 00000001

FUP = 4017af :: BINARY CMP :: 17 140862

CYC = 90

SHORT_TNT :: 0e

CYC = 2d

PTW = 914

elapsed == 36 leads to d++
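The CYC values in the dump above appear to be printed in hexadecimal. As an illustration of how such a value comes out of the raw byte stream, here is a minimal sketch of a CYC packet decoder. decode_cyc() is a hypothetical helper, and the layout it assumes (low two bits 11, an extension bit, 5 counter bits in the first byte and 7 in each following byte) should be checked against the packet definitions in the SDM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Decode one CYC packet starting at buf.
 * Returns the number of bytes consumed, or 0 if buf does not start
 * with a complete CYC packet.
 */
size_t decode_cyc(const uint8_t *buf, size_t len, uint64_t *cycles)
{
    assert(buf != NULL && cycles != NULL);

    if (len == 0 || (buf[0] & 0x03) != 0x03)
        return 0;                      /* not a CYC header */

    uint64_t value = buf[0] >> 3;      /* counter bits 4:0 */
    int more = buf[0] & 0x04;          /* extension bit of the first byte */
    unsigned shift = 5;
    size_t pos = 1;

    while (more) {
        if (pos >= len || shift >= 64)
            return 0;                  /* truncated or malformed */
        value |= (uint64_t)(buf[pos] >> 1) << shift; /* next 7 bits */
        more = buf[pos] & 0x01;        /* extension bit of this byte */
        shift += 7;
        pos++;
    }
    *cycles = value;
    return pos;
}
```

Under this layout the single byte 0xA3 decodes to 0x14 cycles, and the two bytes 0x3F 0x0E decode to 0xE7 cycles.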

22 of 36

TRACE OF EXECUTION

PTW = 36

FUP = 40178c :: PTWRITE PTWRITE :: 17 140862

CYC = 14

SHORT_TNT :: 06 => first branch after ptwrite is taken as 6₁₆=0110₂

CYC = e7

BBP :: SZ = 00000000 TYPE = 00000004

CYC = 5

BIP :: ID = 00000000 :: PAYLOAD = 40179c :: COND_BR JBE

CYC = 11

BIP :: ID = 00000001 :: PAYLOAD = 1

CYC = 16

BIP :: ID = 00000002 :: PAYLOAD = e174078715c

CYC = af

BEP :: IP = 00000001

FUP = 4017af :: BINARY CMP :: 17 140862

CYC = 90

SHORT_TNT :: 0e => the upcoming two branches after CMP are taken e₁₆=1110₂

CYC = 2d

PTW = 914

elapsed == 36 leads to d++

23 of 36

TRACE OF EXECUTION

40178c: ptwrite %rax

401791: mfence

if (elapsed > 1000llu) {

401794: cmpq $0x3e8,-0x8(%rbp)

40179c: jbe 4017af <main+0x6a>

a++;

40179e: mov 0xc5b4c(%rip),%eax # 4c72f0 <a>

4017a4: add $0x1,%eax

4017a7: mov %eax,0xc5b43(%rip) # 4c72f0 <a>

4017ad: jmp 401754 <main+0xf>

} else if (elapsed > 500llu) {

4017af: cmpq $0x1f4,-0x8(%rbp)

4017b7: jbe 4017ca <main+0x85>

b++;

4017b9: mov 0xc5b35(%rip),%eax # 4c72f4 <b>

4017bf: add $0x1,%eax

4017c2: mov %eax,0xc5b2c(%rip) # 4c72f4 <b>

4017c8: jmp 401754 <main+0xf>

} else if (elapsed > 100llu) {

4017ca: cmpq $0x64,-0x8(%rbp)

4017cf: jbe 4017e5 <main+0xa0>

c++;

4017d1: mov 0xc5b21(%rip),%eax # 4c72f8 <c>

4017d7: add $0x1,%eax

4017da: mov %eax,0xc5b18(%rip) # 4c72f8 <c>

4017e0: jmp 401754 <main+0xf>

} else {

d++;

4017e5: mov 0xc5b11(%rip),%eax # 4c72fc <d>

4017eb: add $0x1,%eax

4017ee: mov %eax,0xc5b08(%rip) # 4c72fc <d>

for (;;) {

4017f4: jmp 401754 <main+0xf>

4017f9: nopl 0x0(%rax)

elapsed == 36 leads to d++

PTW = 36

SHORT_TNT :: 06 => first branch after ptwrite is taken as 6₁₆=0110₂

BIP :: 40179c :: COND_BR JBE

FUP :: 4017af :: BINARY CMP

SHORT_TNT :: 0e => the upcoming two branches after CMP are taken e₁₆=1110₂

24 of 36

TRACE OF EXECUTION

Show objdump output and explain XED instruction decoding

https://github.com/intelxed/xed

25 of 36

26 of 36

CPU State

27 of 36

INSTRUCTION TRACING

Live instruction tracing on real HW

requires a modified Linux kernel

CPU runs in single-step execution mode

each instruction raises a debug trap, delivered as SIGTRAP
the handler outputs the program counter via ptwrite
and returns from the trap as soon as the PC is known

push RFLAGS to the stack

set the single-step (trap) flag

pop RFLAGS from the stack

ptwrite the PC in the SIGTRAP handler

(x86_64 cannot read RIP as an ordinary register, so the PC is obtained in the trap handler)

28 of 36

LIVE INSTRUCTION TRACING

“Accurate” timing

time of instruction = time between two consecutive traps
assumes that the overhead added by the SIGTRAP handler is roughly constant!

Actual content of the registers

get the most probable input for that sequence of instructions

Power/Thermal insights

29 of 36

LIVE INSTRUCTION TRACING (CONCLUSIONS)

Method 1:

Single step execution with help of ptwrite
Disadvantages:

Slow method
Software needs to handle timeouts

Advantages:

Software execution can be correlated with the state of the kernel, not only the state of the CPU

Method 2:

Natural execution of SW
Disadvantages:

Intel PT may generate too much data

Especially when analysing branches

No correlation with the state of the kernel

Advantages:

No need to modify the SW and/or the kernel

30 of 36

MORE ADVANCED USE CASES

31 of 36

STACK UNWINDING

perf copies the content of the stack into a user-space buffer

not useful for time-sensitive code

Use ptwrite to dump the stack pointers

much faster than libunwind

combined with special DWARF information
additional ELF section embedded in the binary

the kernel uses this section to perform stack unwinding

32 of 36

COMPILE TIME INSTRUMENTATION

The compiler adds ptwrite instructions to measure the timing of different pieces of code

Enhanced PGO

A GCC plugin is under development

A demo for another time

33 of 36

CHALLENGES

The Intel PT decoder needs to be fast

write the decoded data back to memory with minimal filtering

use non-temporal stores to avoid polluting the LLC (L3 cache)

Make use of Memory Bandwidth Allocation

The tracer CPU can be restricted to a minimal share of the memory bandwidth
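As an illustration of the non-temporal store idea, the sketch below writes 8-byte decoded records with the _mm_stream_si64 intrinsic, which bypasses the cache hierarchy on x86_64. store_records_nt() is a hypothetical helper. The plain-store fallback is only there so that the sketch also builds on other targets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__x86_64__)
#include <emmintrin.h>
#endif

/* Copy count 8-byte records to dst without pulling dst into the caches. */
void store_records_nt(long long *dst, const long long *src, size_t count)
{
    assert(((uintptr_t)dst & 7) == 0); /* naturally aligned destination */
#if defined(__x86_64__)
    for (size_t i = 0; i < count; i++)
        _mm_stream_si64(&dst[i], src[i]); /* non-temporal 64-bit store */
    _mm_sfence(); /* order the streaming stores before later stores */
#else
    for (size_t i = 0; i < count; i++)
        dst[i] = src[i];
#endif
}
```

Non-temporal stores work best when whole cache lines are written back to back, so a decoder would usually batch records before flushing them this way.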

34 of 36

OTHER FINDINGS

Bug in the 6.11 kernel on hybrid systems (Raptor Lake)

confirmed by Alexander Shishkin (alexander.shishkin@linux.intel.com)

aux_output for PEBS events returns “Operation Not Supported”

35 of 36

VISION

Make the perf subsystem great again!

Full user-space toolkit
