THE INTEL PROCESSOR TRACE
BOGDAN TANASA
28th of May, 2025
Lund Linux Con
AGENDA
Processor Performance Insights and Optimization
(from a tooling perspective)
INTRODUCTION (INTEL PROCESSOR TRACE)
INTRODUCTION (PEBS)
Main Advantage
Rich data: each record includes the CPU state, not just the values of the counters
INTRODUCTION (INTEL PROCESSOR TRACE)
asm volatile ("mov %0, %%rax\n\t"
              "ptwrite %%rax" : : "r"(my_var) : "rax");
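The snippet above faults on CPUs without PTWRITE support. A minimal sketch of guarding it, assuming GCC or Clang on x86_64: the helper names cpu_has_ptwrite and pt_mark are hypothetical, and the check reads CPUID leaf 0x14, sub-leaf 0, EBX bit 4, which is where Intel documents PTWRITE support. The value is passed to ptwrite directly as a register operand instead of going through rax.

```c
#include <cpuid.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if CPUID reports PTWRITE support, else 0. */
static int cpu_has_ptwrite(void) {
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid_count(0x14, 0, &eax, &ebx, &ecx, &edx))
        return 0;                      /* leaf 0x14 is not available */
    return (int)((ebx >> 4) & 1u);     /* EBX bit 4: PTWRITE supported */
}

/* Hypothetical helper: emits `value` as a PTW packet when the CPU supports it.
 * Returns 1 if ptwrite was executed, 0 if it was skipped. */
static int pt_mark(uint64_t value) {
    static int supported = -1;         /* cached CPUID result */
    if (supported < 0)
        supported = cpu_has_ptwrite();
    if (!supported)
        return 0;
    /* The compiler picks the 64-bit register, so nothing can clobber it
     * between loading the value and executing ptwrite. */
    __asm__ volatile ("ptwrite %0" : : "r"(value));
    return 1;
}
```

When tracing is not enabled, ptwrite on a supporting CPU produces no packet, so pt_mark(my_var) can stay in the code permanently.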
HOW TO USE INTEL PT?
PERF_EVENT_OPEN (https://man7.org/linux/man-pages/man2/perf_event_open.2.html)
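A minimal sketch of opening an Intel PT event through perf_event_open, not the exact setup used in the talk. The helper names are hypothetical. The config bit positions are assumed to mirror the IA32_RTIT_CTL layout (cyc = 1, fup_on_ptw = 5, ptw = 12, branch = 13) and should be checked against /sys/bus/event_source/devices/intel_pt/format on the target machine.

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed bit positions; verify against the sysfs format directory. */
#define PT_CFG_CYC        (1ULL << 1)
#define PT_CFG_FUP_ON_PTW (1ULL << 5)
#define PT_CFG_PTW        (1ULL << 12)
#define PT_CFG_BRANCH     (1ULL << 13)

/* Reads the dynamic PMU type of intel_pt; returns -1 if the PMU is absent. */
static int intel_pt_pmu_type(void) {
    FILE *f = fopen("/sys/bus/event_source/devices/intel_pt/type", "r");
    int type = -1;
    if (!f)
        return -1;
    if (fscanf(f, "%d", &type) != 1)
        type = -1;
    fclose(f);
    return type;
}

/* Fills `attr` for user-space tracing with branch, CYC and PTW packets. */
static void intel_pt_fill_attr(struct perf_event_attr *attr, int pmu_type) {
    memset(attr, 0, sizeof *attr);
    attr->size = sizeof *attr;
    attr->type = (uint32_t)pmu_type;
    attr->config = PT_CFG_BRANCH | PT_CFG_CYC | PT_CFG_PTW | PT_CFG_FUP_ON_PTW;
    attr->exclude_kernel = 1;
    attr->exclude_hv = 1;
    attr->disabled = 1;                /* enable later with an ioctl */
}

/* glibc has no perf_event_open() wrapper, so the raw syscall is used.
 * Returns the event file descriptor, or -1 on failure. */
static long intel_pt_open(pid_t pid, int cpu) {
    struct perf_event_attr attr;
    int type = intel_pt_pmu_type();
    if (type < 0)
        return -1;
    intel_pt_fill_attr(&attr, type);
    return syscall(SYS_perf_event_open, &attr, pid, cpu, -1, 0UL);
}
```

The returned descriptor still needs the regular ring buffer and the AUX area mapped with mmap before any trace data arrives, as described in the man page.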
PERF_RECORD_AUX ENTRIES
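A minimal sketch of reading one PERF_RECORD_AUX entry from the perf ring buffer, assuming the record layout given in the perf_event_open man page (header, aux_offset, aux_size, flags). The struct and function names are hypothetical, and the optional sample_id trailer is ignored.

```c
#include <linux/perf_event.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fixed part of a PERF_RECORD_AUX record. */
struct aux_record {
    struct perf_event_header header;
    uint64_t aux_offset;   /* where the new data starts in the AUX area */
    uint64_t aux_size;     /* how many bytes of trace data were added */
    uint64_t flags;        /* PERF_AUX_FLAG_* bits */
};

/* Copies one record out of `buf`.
 * Returns 1 for a well-formed PERF_RECORD_AUX record, 0 otherwise. */
static int parse_aux_record(const void *buf, size_t len, struct aux_record *out) {
    struct perf_event_header hdr;
    if (len < sizeof hdr)
        return 0;
    memcpy(&hdr, buf, sizeof hdr);
    if (hdr.type != PERF_RECORD_AUX)
        return 0;
    if (hdr.size < sizeof *out || len < sizeof *out)
        return 0;
    memcpy(out, buf, sizeof *out);
    return 1;
}

/* Returns 1 if the kernel reports that trace data was dropped. */
static int aux_was_truncated(const struct aux_record *r) {
    return (r->flags & PERF_AUX_FLAG_TRUNCATED) != 0;
}
```

Each such record tells the decoder which byte range of the AUX area holds fresh Intel PT packets.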
USE CASES
Timing
?
Processor Event Sampling
Instruction Tracing
Record Execution Flow
?
USE CASES
All use cases require a decoder that consumes the Intel PT packets!
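As an illustration of the first step such a decoder takes, here is a minimal sketch that classifies a packet by its first byte. The enum and function names are hypothetical, the header values are the ones I recall from the Intel SDM packet definitions, and only a subset of packet types is covered.

```c
#include <stdint.h>

enum pt_kind {
    PT_PAD, PT_EXT, PT_SHORT_TNT, PT_CYC, PT_TIP, PT_TIP_PGE,
    PT_TIP_PGD, PT_FUP, PT_TSC, PT_MTC, PT_MODE, PT_UNKNOWN
};

/* Classifies an Intel PT packet from its first header byte. */
static enum pt_kind pt_classify(uint8_t b) {
    if (b == 0x00) return PT_PAD;
    if (b == 0x02) return PT_EXT;            /* second byte selects PSB, PTW, ... */
    if ((b & 0x01) == 0) return PT_SHORT_TNT; /* bit 0 clear: TNT bits follow */
    if ((b & 0x03) == 0x03) return PT_CYC;    /* low two bits set: cycle count */
    if (b == 0x19) return PT_TSC;
    if (b == 0x59) return PT_MTC;
    if (b == 0x99) return PT_MODE;
    switch (b & 0x1f) {                       /* upper 3 bits encode IP width */
    case 0x0d: return PT_TIP;
    case 0x11: return PT_TIP_PGE;
    case 0x01: return PT_TIP_PGD;
    case 0x1d: return PT_FUP;
    default:   return PT_UNKNOWN;
    }
}
```

A full decoder would then consume the payload bytes that belong to each packet type before classifying the next header.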
USE CASES
Traced CPU
do_work
Tracer CPU
Intel PT decoder
Main Memory
Intel PT HW
Cache Memory
Cache Memory
Read Intel PT packets
Write filtered data
Main Memory
Network
Storage
TIME MEASUREMENTS
X = Y = 1 if Marker_Start and Marker_Stop are constants
TIME MEASUREMENTS
TIME MEASUREMENTS (HOW IT WORKS?)
PROCESSOR EVENT-BASED SAMPLING - PEBS
PROCESSOR EVENT-BASED SAMPLING – PEBS (HOW IT WORKS?)
PROCESSOR EVENT-BASED SAMPLING – PEBS (HOW IT WORKS?)
CPU State
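A minimal sketch of requesting precise samples that carry register state through perf_event_open, which on Intel CPUs is typically served by PEBS. The function name pebs_fill_attr, the chosen event and the caller-supplied register mask are illustrative assumptions, not the configuration used in the talk.

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fills `attr` for precise sampling of retired
 * instructions every `period` events, recording the registers selected
 * by `regs_mask` with each sample. */
static void pebs_fill_attr(struct perf_event_attr *attr,
                           uint64_t period, uint64_t regs_mask) {
    memset(attr, 0, sizeof *attr);
    attr->size = sizeof *attr;
    attr->type = PERF_TYPE_HARDWARE;
    attr->config = PERF_COUNT_HW_INSTRUCTIONS;
    attr->sample_period = period;
    attr->precise_ip = 2;              /* request zero skid */
    attr->sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID |
                        PERF_SAMPLE_TIME | PERF_SAMPLE_REGS_INTR;
    attr->sample_regs_intr = regs_mask; /* CPU state captured at the sample */
    attr->exclude_kernel = 1;
    attr->disabled = 1;
}
```

The register bit numbers for the mask are architecture specific; on x86 they are listed in the kernel's perf_regs.h header.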
PEBS + PDIST
PEBS + PDIST
TRACE OF EXECUTION
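#include <x86intrin.h>   /* __rdtsc */

/* Branch counters; globals, matching the <a>..<d> symbols in the disassembly. */
int a, b, c, d;

/* Assumption: read_tsc() is not shown on the slide; here it wraps RDTSC. */
static inline unsigned long long read_tsc(void) { return __rdtsc(); }

int main(int argc, char* argv[]) {
    for (;;) {
        unsigned long long start = read_tsc();
        /* Keep the low 12 bits of the TSC: a varying value in 0..4095. */
        unsigned long long elapsed = start & 0xFFFllu;
        /* Emit the value as a PTW packet. One asm statement, so the compiler
           cannot reuse rax between the mov and the ptwrite. */
        asm volatile ("mov %0, %%rax\n\t"
                      "ptwrite %%rax" : : "r"(elapsed) : "rax");
        asm volatile ("mfence" : : : "memory");
        if (elapsed > 1000llu) {
            a++;
        } else if (elapsed > 500llu) {
            b++;
        } else if (elapsed > 100llu) {
            c++;
        } else {
            d++;
        }
    }
    return 0;   /* not reached */
}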
elapsed == 36 leads to d++
TRACE OF EXECUTION
PTW = 36
FUP = 40178c :: PTWRITE PTWRITE :: 17 140862
CYC = 14
SHORT_TNT :: 06
CYC = e7
BBP :: SZ = 00000000 TYPE = 00000004
CYC = 5
BIP :: ID = 00000000 :: PAYLOAD = 40179c :: COND_BR JBE
CYC = 11
BIP :: ID = 00000001 :: PAYLOAD = 1
CYC = 16
BIP :: ID = 00000002 :: PAYLOAD = e174078715c
CYC = af
BEP :: IP = 00000001
FUP = 4017af :: BINARY CMP :: 17 140862
CYC = 90
SHORT_TNT :: 0e
CYC = 2d
PTW = 914
elapsed == 36 leads to d++
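As a worked example of timing with this dump, the sketch below adds up the CYC values that appear between the two PTW packets. It assumes the CYC values are hexadecimal cycle deltas, which the dump does not state explicitly. The packet struct and function name are hypothetical, and payloads of non-CYC packets are left at 0 because they are not needed here.

```c
#include <stddef.h>
#include <stdint.h>

enum pkt_type { PKT_PTW, PKT_FUP, PKT_CYC, PKT_TNT, PKT_BBP, PKT_BIP, PKT_BEP };

struct pkt {
    enum pkt_type type;
    uint64_t value;      /* only meaningful for PKT_CYC in this sketch */
};

/* The packet sequence from the slide, in order. */
static const struct pkt slide_trace[] = {
    {PKT_PTW, 0}, {PKT_FUP, 0}, {PKT_CYC, 0x14}, {PKT_TNT, 0},
    {PKT_CYC, 0xe7}, {PKT_BBP, 0}, {PKT_CYC, 0x5}, {PKT_BIP, 0},
    {PKT_CYC, 0x11}, {PKT_BIP, 0}, {PKT_CYC, 0x16}, {PKT_BIP, 0},
    {PKT_CYC, 0xaf}, {PKT_BEP, 0}, {PKT_FUP, 0}, {PKT_CYC, 0x90},
    {PKT_TNT, 0}, {PKT_CYC, 0x2d}, {PKT_PTW, 0},
};

/* Sums all CYC values between the first and the second PTW packet.
 * Returns -1 if the trace does not contain two PTW packets. */
static int64_t cycles_between_first_two_ptw(const struct pkt *p, size_t n) {
    int64_t total = 0;
    int inside = 0;
    for (size_t i = 0; i < n; i++) {
        if (p[i].type == PKT_PTW) {
            if (inside)
                return total;          /* second marker reached */
            inside = 1;                /* first marker: start counting */
        } else if (inside && p[i].type == PKT_CYC) {
            total += (int64_t)p[i].value;
        }
    }
    return -1;
}
```

Under that assumption, one loop iteration between the two ptwrite markers costs 0x14 + 0xe7 + 0x5 + 0x11 + 0x16 + 0xaf + 0x90 + 0x2d = 659 cycles.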
TRACE OF EXECUTION
PTW = 36
FUP = 40178c :: PTWRITE PTWRITE :: 17 140862
CYC = 14
SHORT_TNT :: 06 => first branch after ptwrite is taken, as 0x6 = 0b0110
CYC = e7
BBP :: SZ = 00000000 TYPE = 00000004
CYC = 5
BIP :: ID = 00000000 :: PAYLOAD = 40179c :: COND_BR JBE
CYC = 11
BIP :: ID = 00000001 :: PAYLOAD = 1
CYC = 16
BIP :: ID = 00000002 :: PAYLOAD = e174078715c
CYC = af
BEP :: IP = 00000001
FUP = 4017af :: BINARY CMP :: 17 140862
CYC = 90
SHORT_TNT :: 0e => the next two branches after CMP are taken, as 0xe = 0b1110
CYC = 2d
PTW = 914
elapsed == 36 leads to d++
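The two SHORT_TNT bytes can be decoded mechanically. A minimal sketch, assuming the short TNT layout from the Intel SDM: bit 0 is the header bit and is 0, the highest set bit is a stop bit, and the bits between them are branch outcomes with the oldest branch just below the stop bit. The function name is hypothetical.

```c
#include <stdint.h>

/* Decodes a short TNT byte.
 * Writes up to 6 outcomes (1 = taken, 0 = not taken) to `out`, oldest
 * branch first, and returns how many were written.
 * Returns -1 if the byte is not a short TNT packet. */
static int decode_short_tnt(uint8_t byte, int out[6]) {
    int stop = 7;
    int n = 0;
    /* Bit 0 must be clear; 0x00 is PAD and 0x02 is the extended opcode. */
    if ((byte & 1) != 0 || byte < 0x04)
        return -1;
    /* The stop bit is the highest set bit. */
    while (((byte >> stop) & 1) == 0)
        stop--;
    /* Everything between the stop bit and bit 0 is a branch outcome. */
    for (int bit = stop - 1; bit >= 1; bit--)
        out[n++] = (byte >> bit) & 1;
    return n;
}
```

For 0x06 this yields one taken branch, and for 0x0e two taken branches, matching the annotations in the dump.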
TRACE OF EXECUTION
40178c: ptwrite %rax
401791: mfence
if (elapsed > 1000llu) {
401794: cmpq $0x3e8,-0x8(%rbp)
40179c: jbe 4017af <main+0x6a>
a++;
40179e: mov 0xc5b4c(%rip),%eax # 4c72f0 <a>
4017a4: add $0x1,%eax
4017a7: mov %eax,0xc5b43(%rip) # 4c72f0 <a>
4017ad: jmp 401754 <main+0xf>
} else if (elapsed > 500llu) {
4017af: cmpq $0x1f4,-0x8(%rbp)
4017b7: jbe 4017ca <main+0x85>
b++;
4017b9: mov 0xc5b35(%rip),%eax # 4c72f4 <b>
4017bf: add $0x1,%eax
4017c2: mov %eax,0xc5b2c(%rip) # 4c72f4 <b>
4017c8: jmp 401754 <main+0xf>
} else if (elapsed > 100llu) {
4017ca: cmpq $0x64,-0x8(%rbp)
4017cf: jbe 4017e5 <main+0xa0>
c++;
4017d1: mov 0xc5b21(%rip),%eax # 4c72f8 <c>
4017d7: add $0x1,%eax
4017da: mov %eax,0xc5b18(%rip) # 4c72f8 <c>
4017e0: jmp 401754 <main+0xf>
} else {
d++;
4017e5: mov 0xc5b11(%rip),%eax # 4c72fc <d>
4017eb: add $0x1,%eax
4017ee: mov %eax,0xc5b08(%rip) # 4c72fc <d>
for (;;) {
4017f4: jmp 401754 <main+0xf>
4017f9: nopl 0x0(%rax)
elapsed == 36 leads to d++
PTW = 36
SHORT_TNT :: 06 => first branch after ptwrite is taken, as 0x6 = 0b0110
BIP :: 40179c :: COND_BR JBE
FUP :: 4017af :: BINARY CMP
SHORT_TNT :: 0e => the next two branches after CMP are taken, as 0xe = 0b1110
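The reconstruction above can be sketched as a replay of the branch outcomes against the three jbe instructions in the listing. A taken jbe means the corresponding "elapsed > N" test was false, so three taken branches in a row end in d++. The function names are hypothetical; the unconditional jmp back to the loop head produces no TNT bit and is therefore not modelled.

```c
/* Replays TNT outcomes (1 = taken) against the if / else-if chain.
 * Returns the counter that was incremented: 'a', 'b', 'c' or 'd',
 * or '?' if the outcomes run out before the chain is resolved. */
static char replay_branches(const int *taken, int n) {
    static const char counters[] = { 'a', 'b', 'c' };
    for (int i = 0; i < 3; i++) {
        if (i >= n)
            return '?';
        if (!taken[i])
            return counters[i];   /* jbe fell through: this arm ran */
    }
    return 'd';                   /* all three jbe taken: final else */
}

/* Reference: the same decision computed directly from the value. */
static char classify(unsigned long long elapsed) {
    if (elapsed > 1000llu) return 'a';
    if (elapsed > 500llu)  return 'b';
    if (elapsed > 100llu)  return 'c';
    return 'd';
}
```

Feeding the three taken outcomes from the two SHORT_TNT packets into replay_branches gives 'd', the same answer classify(36) gives from the PTW payload.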
TRACE OF EXECUTION
CPU State
INSTRUCTION TRACING
push RFLAGS to the stack
set the single-step (trap) flag, TF
pop RFLAGS from the stack
ptwrite the PC in the SIGTRAP handler
(x86_64 has no instruction that reads the PC directly)
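A minimal sketch of these steps on x86_64 Linux, assuming GCC or Clang. All names are hypothetical. The handler takes the PC from the saved signal context and only counts steps; the ptwrite call is left out so the sketch also runs on CPUs without Intel PT.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <ucontext.h>

#ifdef REG_RIP
#define GREG_PC REG_RIP
#else
#define GREG_PC 16   /* value of REG_RIP on x86_64 Linux */
#endif

static volatile sig_atomic_t steps;   /* single-step traps seen */
static volatile uint64_t last_pc;     /* PC reported by the last trap */

/* Runs after every instruction while TF is set. */
static void on_sigtrap(int sig, siginfo_t *info, void *ctx) {
    ucontext_t *uc = (ucontext_t *)ctx;
    (void)sig;
    (void)info;
    last_pc = (uint64_t)uc->uc_mcontext.gregs[GREG_PC];
    steps++;
    /* The slide emits last_pc with ptwrite at this point. */
}

/* Push RFLAGS, set TF (bit 8), pop RFLAGS. The lea pair skips the red zone. */
static inline void set_trap_flag(void) {
    __asm__ volatile ("lea -128(%%rsp), %%rsp\n\t"
                      "pushfq\n\t"
                      "orq $0x100, (%%rsp)\n\t"
                      "popfq\n\t"
                      "lea 128(%%rsp), %%rsp" : : : "cc", "memory");
}

/* Same sequence with TF cleared (-257 is ~0x100). */
static inline void clear_trap_flag(void) {
    __asm__ volatile ("lea -128(%%rsp), %%rsp\n\t"
                      "pushfq\n\t"
                      "andq $-257, (%%rsp)\n\t"
                      "popfq\n\t"
                      "lea 128(%%rsp), %%rsp" : : : "cc", "memory");
}

/* Small workload to trace. */
static void demo_work(void) {
    volatile int x = 0;
    for (int i = 0; i < 10; i++)
        x += i;
}

/* Single-steps fn() and returns the number of traps, or -1 on error. */
static long trace_region(void (*fn)(void)) {
    struct sigaction sa, old;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_sigtrap;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGTRAP, &sa, &old) != 0)
        return -1;
    steps = 0;
    set_trap_flag();
    fn();
    clear_trap_flag();
    sigaction(SIGTRAP, &old, NULL);
    return (long)steps;
}
```

Calling trace_region(demo_work) returns the number of instructions that trapped. The kernel clears TF while the handler runs and restores it on return, so the handler itself is not single-stepped.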
LIVE INSTRUCTION TRACING
LIVE INSTRUCTION TRACING (CONCLUSIONS)
MORE ADVANCED USE CASES
STACK UNWINDING
COMPILE TIME INSTRUMENTATION
CHALLENGES
OTHER FINDINGS
VISION