1 of 36

THE INTEL PROCESSOR TRACE

BOGDAN TANASA

28th of May, 2025

Lund Linux Con

2 of 36

AGENDA

  • Introduction
  • Use Cases
  • Challenges

Processor Performance Insights and Optimization

(from a tooling perspective)

3 of 36

INTRODUCTION (INTEL PROCESSOR TRACE)

  • HW that captures information about software execution
    • with minimal performance perturbation to the SW being traced

  • Generates a variety of packets that, when combined with binaries, can:
    • (re)produce an exact execution trace
    • reveal timing and program flow information
      • enables both functional and performance debugging of applications
    • correlate with Processor Event-Based Sampling (PEBS)

4 of 36

INTRODUCTION (PEBS)

  • A HW feature providing more precise profiling than traditional PMU capabilities

  • How it works?
    • Configure a PMU event
    • When the event occurs, the CPU records current CPU state into a memory buffer
      • CPU state is usually defined by:
        • Instruction Pointer
        • Some other Registers (check the CPU’s manual!)
        • Memory Address (for some memory events)

Main Advantage

Rich Data – Includes the CPU state, not just the values of the counters

5 of 36

INTRODUCTION (INTEL PROCESSOR TRACE)

  • Can log software-generated 8-byte packets
    • via the ptwrite instruction

  • Software generated packets:
    • bind to the associated ptwrite instruction in the binary
      • the address of it as seen by the program counter
    • are timestamped by Intel PT
      • Intel PT has a dedicated counter different from the TSC

/* One asm statement, so the compiler cannot reuse rax between the load and ptwrite. */
asm volatile ("ptwrite %%rax" : : "a"(my_var));

6 of 36

HOW TO USE INTEL PT?

  • Configuration of packet generation and capabilities via a set of MSRs
    • Use the cpuid instruction to detect Intel PT capabilities

  • On Linux systems
    • via the perf_event_open syscall
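
The capability check mentioned above can be sketched in C. This is a minimal sketch, assuming GCC or Clang (for `<cpuid.h>`) and assuming the CPUID bit positions noted in the comments: leaf 07H, EBX bit 25 for Intel PT and leaf 14H, EBX bit 4 for PTWRITE. The struct and function names are hypothetical, and the bit positions should be checked against the CPU's manual.

```c
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Hypothetical container for the two capabilities this sketch looks at. */
struct pt_caps {
    bool intel_pt; /* assumed: CPUID.(EAX=07H,ECX=0):EBX[25] */
    bool ptwrite;  /* assumed: CPUID.(EAX=14H,ECX=0):EBX[4]  */
};

static struct pt_caps detect_pt_caps(void)
{
    struct pt_caps caps = { false, false };
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    /* __get_cpuid_count() returns 0 when the requested leaf is not available. */
    if (__get_cpuid_count(0x07, 0, &eax, &ebx, &ecx, &edx))
        caps.intel_pt = (ebx >> 25) & 1u;

    /* Leaf 14H enumerates Intel PT details; only query it if PT exists. */
    if (caps.intel_pt &&
        __get_cpuid_count(0x14, 0, &eax, &ebx, &ecx, &edx))
        caps.ptwrite = (ebx >> 4) & 1u;
#endif
    return caps;
}
```

A caller would refuse to emit `ptwrite` when `detect_pt_caps().ptwrite` is false, since the instruction is not usable on CPUs that lack it.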

7 of 36

PERF_EVENT_OPEN (https://man7.org/linux/man-pages/man2/perf_event_open.2.html)

  • Populate struct perf_event_attr
    • .config -> Configure Intel PT’s capabilities
    • .type -> Get the value from /sys/bus/event_source/devices/intel_pt/type
    • Save the returned file descriptor (fd)

  • Invoke mmap(fd) so that Intel PT outputs the packets to memory
    • Decode PERF_RECORD_AUX records
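
The steps on this slide might look roughly as follows in C. This is a sketch under assumptions, not the speaker's code: the helper names and the `PT_CFG_*` macros are hypothetical, and the bit positions are assumed to follow the entries under /sys/bus/event_source/devices/intel_pt/format/, which remains the authoritative source on a given machine.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed config bits; verify against the intel_pt/format/ sysfs files. */
#define PT_CFG_CYC    (1ULL << 1)
#define PT_CFG_TSC    (1ULL << 10)
#define PT_CFG_PTW    (1ULL << 12)
#define PT_CFG_BRANCH (1ULL << 13)

/* Read the dynamic PMU type; returns -1 if the file is missing or malformed. */
static int read_pmu_type(const char *path)
{
    FILE *f = fopen(path, "r");
    int type = -1;

    if (!f)
        return -1;
    if (fscanf(f, "%d", &type) != 1)
        type = -1;
    fclose(f);
    return type;
}

/* Populate perf_event_attr for an Intel PT event (user space only, disabled). */
static void init_pt_attr(struct perf_event_attr *attr, int pmu_type,
                         uint64_t config)
{
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    attr->type = (uint32_t)pmu_type;
    attr->config = config;
    attr->exclude_kernel = 1;
    attr->disabled = 1;
}

/* glibc provides no wrapper, so the raw syscall is used; returns fd or -1. */
static int open_pt_event(struct perf_event_attr *attr, pid_t pid, int cpu)
{
    return (int)syscall(SYS_perf_event_open, attr, pid, cpu, -1, 0UL);
}
```

A typical call sequence would be `read_pmu_type("/sys/bus/event_source/devices/intel_pt/type")`, then `init_pt_attr(&attr, type, PT_CFG_TSC | PT_CFG_PTW)`, then `open_pt_event(&attr, 0, -1)`. The returned fd is then mapped twice: once for the regular perf ring buffer, and once for the AUX area after `aux_offset` and `aux_size` have been set in the `perf_event_mmap_page`.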

8 of 36

PERF_RECORD_AUX ENTRIES

  • Stream of bytes containing Intel PT packets
    • Chapter 33.4 TRACE PACKETS AND DATA TYPES of Intel’s Software Developer’s Manual

  • The PERF_RECORD_AUX entries have to be decoded fast, otherwise:
    • information is lost
    • the perf subsystem may hang
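
Draining the AUX area quickly comes down to copying the bytes between the consumer position and the producer position out of a ring buffer. The sketch below shows only that copy, with wrap-around handling; the function name is hypothetical and the buffer size is assumed to be a power of two. In a real consumer, `head` and `tail` would come from the `aux_head` and `aux_tail` fields of `struct perf_event_mmap_page`, read and written with suitable memory barriers, which this sketch leaves out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy the bytes in [tail, head) out of a ring buffer of aux_size bytes
 * (aux_size must be a power of two). At most dst_cap bytes are copied.
 * Returns the number of bytes copied; the caller then advances tail by it.
 */
static size_t aux_drain(const uint8_t *aux, size_t aux_size,
                        uint64_t tail, uint64_t head,
                        uint8_t *dst, size_t dst_cap)
{
    size_t avail = (size_t)(head - tail);
    size_t off = (size_t)(tail & (aux_size - 1));
    size_t first;

    if (avail > dst_cap)
        avail = dst_cap;

    /* First chunk: from the tail offset up to the end of the buffer. */
    first = aux_size - off;
    if (first > avail)
        first = avail;
    memcpy(dst, aux + off, first);

    /* Second chunk: whatever wrapped around to the start of the buffer. */
    memcpy(dst + first, aux, avail - first);
    return avail;
}
```

Copying out first and decoding from the private copy lets the consumer release the AUX space early, which is one way to reduce the risk of lost data.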

9 of 36

USE CASES

  • Timing
  • Processor Event Sampling
  • Instruction Tracing
  • Record Execution Flow
  • ?

10 of 36

USE CASES

All use cases require a decoder that consumes the Intel PT packets!

  • Some resources need to be given to the Intel PT decoder
    • CPU
    • Memory Bandwidth

11 of 36

USE CASES

[Diagram: the traced CPU runs do_work while its Intel PT HW writes packets to main memory. The tracer CPU runs the Intel PT decoder, which reads the Intel PT packets and writes filtered data to main memory, network or storage. Each CPU has its own cache memory.]

12 of 36

TIME MEASUREMENTS

  • Traditional way
    • read tsc ← 25-35 cc (clock cycles)
    • do_work
    • read tsc ← 25-35 cc
    • record time difference ← few cc

  • PTWRITE Instrumentation
    • mov RAX, Marker_Start ← X cc
    • ptwrite RAX ← 1 cc
    • do_work
    • mov RAX, Marker_Stop ← Y cc
    • ptwrite RAX ← 1 cc

X = Y = 1 if Marker_Start and Marker_Stop are constants
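
The two approaches can be put side by side in C. This is a sketch assuming x86-64 with GCC or Clang and an assembler that knows `ptwrite`; the marker values, `spin_work` and the function names are hypothetical. The ptwrite variant returns nothing because the timing is reconstructed later by the Intel PT decoder, and it must only be executed on CPUs that support PTWRITE.

```c
#include <stdint.h>
#include <x86intrin.h>

/* Hypothetical marker values; constants keep the mov cost minimal. */
enum { MARKER_START = 0x1, MARKER_STOP = 0x2 };

/* Emit one PTWRITE packet carrying `marker` (needs PTWRITE support). */
static inline void pt_mark(uint64_t marker)
{
    __asm__ volatile ("ptwrite %0" : : "r"(marker));
}

/* Traditional way: two TSC reads around the work, difference in TSC ticks. */
static inline uint64_t time_with_tsc(void (*do_work)(void))
{
    uint64_t start = __rdtsc();
    do_work();
    return __rdtsc() - start;
}

/* PTWRITE way: only markers are emitted; the decoder derives the timing. */
static inline void time_with_ptwrite(void (*do_work)(void))
{
    pt_mark(MARKER_START);
    do_work();
    pt_mark(MARKER_STOP);
}

/* Hypothetical workload used for illustration. */
static void spin_work(void)
{
    volatile unsigned int sink = 0;
    for (unsigned int i = 0; i < 1000; i++)
        sink += i;
}
```

Neither variant serializes execution, so for very short regions the measured interval still depends on out-of-order effects.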

13 of 36

TIME MEASUREMENTS

  • Low overhead instrumentation
  • Suitable for small pieces of code
    • TSC overhead is too noisy

  • Caveats:
    • ptwrite, like rdtsc, is NOT a serializing instruction
      • The exact timing is affected by out-of-order execution

14 of 36

TIME MEASUREMENTS (HOW IT WORKS?)

  • Intel PT HW uses its own Cycle Counter (CYC)
    • Runs at Core:Bus ratio frequency
    • TSC is synchronized over all cores
      • One of the reasons why rdtsc has larger overhead
    • Intel PT outputs TSC packets to inform the decoder about the wall time

  • To keep in mind:
    • CYC and TSC frequencies may not be the same
      • a transform is needed!
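
One possible form of that transform is sketched here. It assumes the TSC ticks at bus clock × maximum non-turbo ratio while the core (and therefore CYC) ticks at bus clock × current core:bus ratio, so a CYC delta scales by the quotient of the two ratios. The function and parameter names are hypothetical; the current ratio is assumed to be known to the decoder, and the maximum non-turbo ratio is treated as an input obtained elsewhere.

```c
#include <stdint.h>

/*
 * Convert a CYC delta (core clock cycles) into TSC ticks.
 * Assumption: tsc_ticks = cycles * max_nonturbo_ratio / core_bus_ratio.
 * Returns 0 when the current ratio is unknown (reported as 0).
 */
static uint64_t cyc_to_tsc(uint64_t cycles, uint32_t max_nonturbo_ratio,
                           uint32_t core_bus_ratio)
{
    if (core_bus_ratio == 0)
        return 0;
    return cycles * max_nonturbo_ratio / core_bus_ratio;
}
```

For example, with a maximum non-turbo ratio of 20 and a core currently running at ratio 40, 1000 core cycles correspond to 500 TSC ticks. The integer division truncates, so a decoder that accumulates many small deltas may prefer fixed-point arithmetic.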

15 of 36

PROCESSOR EVENT-BASED SAMPLING - PEBS

  • Check with CPUID if PEBS output to Intel PT is supported
  • Some HW PMU counters are PEBS compatible

16 of 36

PROCESSOR EVENT-BASED SAMPLING – PEBS (HOW IT WORKS?)

  • Do another perf_event_open syscall
    • struct perf_event_attr.config set to PEBS compatible event
      • INST_RETIRED.ANY
      • BR_INST_RETIRED.ALL_BRANCHES
      • BR_MISP_RETIRED.ALL_BRANCHES
    • struct perf_event_attr.aux_output = 1
    • [ Use the fd of Intel PT as a group leader ]

  • The Intel PT HW of the traced CPU will generate PEBS records
    • encoded as BBP (Block Begin), BIP (Block Item) and BEP (Block End) packets
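
The second perf_event_open call could be prepared along these lines. This is a sketch under assumptions: it needs kernel headers that define `aux_output` (Linux 5.4 or later), the raw encoding 0x00c4 for BR_INST_RETIRED.ALL_BRANCHES (event 0xC4, umask 0x00) should be confirmed for the target CPU, and `precise_ip = 2` is an assumed level for requesting PEBS. The macro and function names are hypothetical.

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>

/* Assumed raw encoding: event 0xC4, umask 0x00; check the CPU's event list. */
#define RAW_BR_INST_RETIRED_ALL_BRANCHES 0x00c4ULL

/* Populate perf_event_attr for a PEBS event whose records go to Intel PT. */
static void init_pebs_to_pt_attr(struct perf_event_attr *attr,
                                 uint64_t raw_event, uint64_t period)
{
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    attr->type = PERF_TYPE_RAW;
    attr->config = raw_event;
    attr->sample_period = period; /* counter reload value */
    attr->precise_ip = 2;         /* assumed level that requests PEBS */
    attr->aux_output = 1;         /* route PEBS records into the PT stream */
    attr->exclude_kernel = 1;
}
```

When the event is opened, the fd of the Intel PT event is passed as the `group_fd` argument of perf_event_open, which makes Intel PT the group leader.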

17 of 36

PROCESSOR EVENT-BASED SAMPLING – PEBS (HOW IT WORKS?)

  • The BBP, BIP and BEP packets carry the recorded CPU state

18 of 36

PEBS + PDIST

  • PDIST = Precise Distribution
    • PEBS records are generated precisely upon completion of the instruction

  • perf_event_open doesn’t allow for selecting the PMU counter
    • The first call happens to be assigned PMU counter 0

19 of 36

PEBS + PDIST

  • PDIST is a HW feature that generates PEBS records precisely upon completion of the instruction

  • Limitations:
    • Selected counters
    • Counter reload value must not be less than 256 for PDIST to operate
      • In other words, PDIST needs a “distance” of at least 256 retired instructions between records

20 of 36

TRACE OF EXECUTION

/* a, b, c and d are global counters; read_tsc() reads the TSC (definitions not shown) */
int main(int argc, char* argv[ ]) {
    for (;;) {
        unsigned long long start = read_tsc();
        unsigned long long elapsed = start & 0xFFFllu;

        /* One asm statement pins the value to rax for ptwrite. */
        asm volatile ("ptwrite %%rax" : : "a"(elapsed));
        asm volatile ("mfence" : : : "memory");

        if (elapsed > 1000llu) {
            a++;
        } else if (elapsed > 500llu) {
            b++;
        } else if (elapsed > 100llu) {
            c++;
        } else {
            d++;
        }
    }
    return 0;
}

elapsed == 36 leads to d++

21 of 36

TRACE OF EXECUTION

PTW = 36

FUP = 40178c :: PTWRITE PTWRITE :: 17 140862

CYC = 14

SHORT_TNT :: 06

CYC = e7

BBP :: SZ = 00000000 TYPE = 00000004

CYC = 5

BIP :: ID = 00000000 :: PAYLOAD = 40179c :: COND_BR JBE

CYC = 11

BIP :: ID = 00000001 :: PAYLOAD = 1

CYC = 16

BIP :: ID = 00000002 :: PAYLOAD = e174078715c

CYC = af

BEP :: IP = 00000001

FUP = 4017af :: BINARY CMP :: 17 140862

CYC = 90

SHORT_TNT :: 0e

CYC = 2d

PTW = 914

elapsed == 36 leads to d++

22 of 36

TRACE OF EXECUTION

PTW = 36

FUP = 40178c :: PTWRITE PTWRITE :: 17 140862

CYC = 14

SHORT_TNT :: 06 => first branch after ptwrite is taken, as 0x6 = 0110 in binary

CYC = e7

BBP :: SZ = 00000000 TYPE = 00000004

CYC = 5

BIP :: ID = 00000000 :: PAYLOAD = 40179c :: COND_BR JBE

CYC = 11

BIP :: ID = 00000001 :: PAYLOAD = 1

CYC = 16

BIP :: ID = 00000002 :: PAYLOAD = e174078715c

CYC = af

BEP :: IP = 00000001

FUP = 4017af :: BINARY CMP :: 17 140862

CYC = 90

SHORT_TNT :: 0e => the upcoming two branches after CMP are taken, as 0xe = 1110 in binary

CYC = 2d

PTW = 914

elapsed == 36 leads to d++
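
The reading of the two SHORT_TNT bytes above can be reproduced with a small decoder. This is a sketch assuming the short TNT layout in which bit 0 is clear, the highest set bit is a stop bit, and the bits between the stop bit and bit 0 are the taken/not-taken flags with the oldest branch first. The function name is hypothetical.

```c
#include <stdint.h>

/*
 * Decode a short TNT packet byte.
 * Returns the number of branches encoded (1..6) and fills taken[] with
 * 1 (taken) or 0 (not taken), oldest branch first.
 * Returns -1 if the byte does not look like a short TNT packet.
 */
static int decode_short_tnt(uint8_t byte, uint8_t taken[6])
{
    int stop, n, i;

    if (byte & 1u)
        return -1; /* short TNT has bit 0 clear */

    /* Find the stop bit: the highest set bit in bits 7..1. */
    for (stop = 7; stop >= 1; stop--)
        if (byte & (1u << stop))
            break;

    if (stop < 2)
        return -1; /* no payload bits below the stop bit */

    n = stop - 1;
    for (i = 0; i < n; i++)
        taken[i] = (byte >> (stop - 1 - i)) & 1u;
    return n;
}
```

For 0x06 this yields one branch, taken; for 0x0e it yields two branches, both taken. Together that accounts for the three jbe instructions that fall through to d++ when elapsed == 36.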

23 of 36

TRACE OF EXECUTION

40178c: ptwrite %rax

401791: mfence

if (elapsed > 1000llu) {

401794: cmpq $0x3e8,-0x8(%rbp)

40179c: jbe 4017af <main+0x6a>

a++;

40179e: mov 0xc5b4c(%rip),%eax # 4c72f0 <a>

4017a4: add $0x1,%eax

4017a7: mov %eax,0xc5b43(%rip) # 4c72f0 <a>

4017ad: jmp 401754 <main+0xf>

} else if (elapsed > 500llu) {

4017af: cmpq $0x1f4,-0x8(%rbp)

4017b7: jbe 4017ca <main+0x85>

b++;

4017b9: mov 0xc5b35(%rip),%eax # 4c72f4 <b>

4017bf: add $0x1,%eax

4017c2: mov %eax,0xc5b2c(%rip) # 4c72f4 <b>

4017c8: jmp 401754 <main+0xf>

} else if (elapsed > 100llu) {

4017ca: cmpq $0x64,-0x8(%rbp)

4017cf: jbe 4017e5 <main+0xa0>

c++;

4017d1: mov 0xc5b21(%rip),%eax # 4c72f8 <c>

4017d7: add $0x1,%eax

4017da: mov %eax,0xc5b18(%rip) # 4c72f8 <c>

4017e0: jmp 401754 <main+0xf>

} else {

d++;

4017e5: mov 0xc5b11(%rip),%eax # 4c72fc <d>

4017eb: add $0x1,%eax

4017ee: mov %eax,0xc5b08(%rip) # 4c72fc <d>

for (;;) {

4017f4: jmp 401754 <main+0xf>

4017f9: nopl 0x0(%rax)

elapsed == 36 leads to d++

PTW = 36

SHORT_TNT :: 06 => first branch after ptwrite is taken, as 0x6 = 0110 in binary

BIP :: 40179c :: COND_BR JBE

FUP :: 4017af :: BINARY CMP

SHORT_TNT :: 0e => the upcoming two branches after CMP are taken, as 0xe = 1110 in binary

24 of 36

TRACE OF EXECUTION

  • Show objdump output and explain XED instruction decoding
    • https://github.com/intelxed/xed

25 of 36

26 of 36

CPU State

27 of 36

INSTRUCTION TRACING

  • Live instruction tracing on real HW
    • requires a modified Linux kernel

    • CPU runs in single step execution mode
      • raises a debug exception after every instruction, delivered as SIGTRAP
      • output the program counter via ptwrite
      • return from the handler right after knowing the PC

push EFLAGS to stack

modify single step flag (TF)

pop EFLAGS from stack

ptwrite PC in the SIGTRAP handler

(x86_64 cannot read RIP with a plain mov; the handler takes the PC of the traced instruction from the saved exception frame)

28 of 36

LIVE INSTRUCTION TRACING

  • “Accurate” timing
    • time of instruction = time between two consecutive interrupts
    • assumes that the overhead added by the SIGTRAP handling is roughly constant!

  • Actual content of the registers
    • get the most probable input for that sequence of instructions

  • Power/Thermal insights

29 of 36

LIVE INSTRUCTION TRACING (CONCLUSIONS)

  • Method 1:
    • Single step execution with help of ptwrite
    • Disadvantages:
      • Slow method
      • Software needs to handle timeouts

    • Advantages:
      • Software execution can be correlated with the state of the kernel, not only the state of the CPU
  • Method 2:
    • Natural execution of SW
    • Disadvantages:
      • Intel PT may generate too much data
        • Especially when analysing branches
      • No correlation to state of the kernel
    • Advantages:
      • No need to modify the SW and/or the kernel

30 of 36

MORE ADVANCED USE CASES

31 of 36

STACK UNWINDING

  • perf copies the content of the stack into a user-space buffer
    • not useful for time-sensitive code

  • Use ptwrite to dump the stack pointers
    • much faster than libunwind
      • combined with special DWARF information
      • additional ELF section embedded in the binary
        • the kernel uses this section to perform stack unwinding

32 of 36

COMPILE TIME INSTRUMENTATION

  • The compiler adds ptwrite instructions to measure the timing of different pieces of code
    • Enhanced PGO

  • A GCC plugin is under development
    • A demo for another time

33 of 36

CHALLENGES

  • The Intel PT decoder needs to be fast
    • writing the decoded data back to memory with minimal filtering
      • use non-temporal stores to avoid polluting the LLC (L3 cache)

  • Make use of Memory Bandwidth Allocation
    • The tracer CPU can be restricted to a minimal share of the memory bandwidth
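
Writing decoded data with non-temporal stores might look like the sketch below. It assumes x86 with SSE2 (baseline on x86-64), a 16-byte aligned destination and a length that is a multiple of 16; the function name is hypothetical. On other architectures the sketch falls back to memcpy, which does not have the cache-bypassing effect.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) || defined(__SSE2__)
#include <emmintrin.h>
#endif

/*
 * Copy len bytes from src to dst using non-temporal (streaming) stores.
 * Requirements assumed by this sketch: dst is 16-byte aligned and len is
 * a multiple of 16. src may be unaligned.
 */
static void copy_nontemporal(void *dst, const void *src, size_t len)
{
#if defined(__x86_64__) || defined(__SSE2__)
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    size_t i;

    for (i = 0; i < len / 16; i++)
        _mm_stream_si128(&d[i], _mm_loadu_si128(&s[i]));

    /* Make the streaming stores globally visible before returning. */
    _mm_sfence();
#else
    memcpy(dst, src, len); /* fallback without cache-bypass semantics */
#endif
}
```

The streaming stores bypass the cache hierarchy on the write side, so the decoder's output does not evict lines that the traced workload still needs.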

34 of 36

OTHER FINDINGS

  • Bug in the 6.11 kernel on hybrid systems (Raptor Lake)
    • confirmed by Alexander Shishkin alexander.shishkin@linux.intel.com

  • aux_output for PEBS events returns “Operation Not Supported”

35 of 36

VISION

  • Make the perf subsystem great again!
    • Full user-space toolkit

36 of 36