1 of 38

Leveraging the Perf Tool for PostgreSQL Performance Optimization

POSTGRES BANGALORE MEETUP 7

Nitin Jadhav

Senior Software Engineer at Microsoft

2 of 38

Speaker: Nitin Jadhav

Senior Software Engineer @ Microsoft
Azure Database for PostgreSQL team
Experience on onboarding new major and minor version of PostgreSQL
Experience on integrating extensions into the managed database.
Handling customers issues including data corruption issues.
Contribute to the community.

3 of 38

Agenda

Context and Scope

CPU Performance Fundamentals

The Perf Tool

Connect CPU Concepts with Perf (with examples)

Apply Perf Insights to PostgreSQL (with a practical example)

4 of 38

Context and Scope

5 of 38

The path to CPU performance in PostgreSQL

Step	Category	Performance improvement method
1	DBA	Tune postgresql.conf parameters (e.g., work_mem, shared_buffers)
2	DBA	Optimize queries: add indexes, rewrite queries, update statistics
3	PostgreSQL hacker	Modify code: implement better algorithms
4	PostgreSQL hacker	Modify code: Make the planner smarter
5	PostgreSQL hacker	Modify code: leverage modern CPU architecture
6	Compiler hacker	Enhance compiler optimizations
7	CPU designer	Design faster CPUs. Utilise more transistors

6 of 38

The path to CPU performance in PostgreSQL

Step	Category	Performance improvement method
1	DBA	Tune postgresql.conf parameters (e.g., work_mem, shared_buffers)
2	DBA	Optimize queries: add indexes, rewrite queries, update statistics
3	PostgreSQL hacker	Modify code: implement better algorithms
4	PostgreSQL hacker	Modify code: Make the planner smarter
5	PostgreSQL hacker	Modify code: leverage modern CPU architecture
6	Compiler hacker	Enhance compiler optimizations
7	CPU designer	Design faster CPUs. Utilise more transistors

7 of 38

CPU Performance Fundamentals

Cache Locality

Branch Misprediction

Pipelined Execution

8 of 38

Cache Locality

CPUs rely on fast caches to avoid expensive memory access.
Access latency hierarchy:

CPU Registers: ~0.3 ns
L1 Cache: ~1 ns
L2 Cache: ~4 ns
L3 Cache: ~10 ns
RAM: ~100 ns+
Disk: ~10 ms (orders of magnitude slower)

Sequential access is efficient because it leverages spatial locality.
Scattered/random access is inefficient → causes cache misses.
Cache misses stall execution while waiting for data from slower memory.
Frequent function calls can evict hot code from the instruction cache, increasing stalls.

9 of 38

Branch Misprediction

Wastes cycles by flushing the pipeline and reorder buffer.

Impact:

Unpredictable branches (data-dependent conditions).
Indirect branches (function pointers, virtual calls).

Primary Causes:

Flatten execution paths to remove unnecessary branches.
Replace indirect calls with direct calls where possible.
Loop unrolling to minimize loop exit mispredictions.

How to Reduce Mispredictions:

10 of 38

Instruction Execution Stages

Instruction Fetch

Instruction Decode

Execute

Memory Access

Write Back

11 of 38

Pipelined Execution

Serial Execution

Pipelined Execution

12 of 38

Pipelined Execution – Data Hazards

Problem: Conflict of operands between instructions

Example: Need result of previous instruction

13 of 38

Pipelined Execution – Control Hazards

Branch: Two possible paths in control flow.
Which instruction to fetch next?
Decision depends on EX stage

14 of 38

Pipelined Execution – Key Insights

Pipeline stall = Cycles where the CPU cannot do useful operations.

Common stall causes:

Cache misses → Waiting for slow memory access (~100ns+).
Branch misprediction → Flushing and restarting the pipeline.

Fewer stalls = higher throughput.

15 of 38

The Perf Tool

16 of 38

Perf - Introduction

What is perf?�A powerful Linux performance analysis tool for developers and system administrators.

Purpose:

Understand performance of processes and the Linux kernel.
Diagnose bottlenecks, tune applications, and optimize system performance.

Capabilities:

Monitor CPU usage and hardware events.
Analyze detailed function calls in complex software stacks.
Flexible interface for retrieving and displaying performance metrics.

17 of 38

Perf - Workflows

1. perf list to find events.

2. perf stat to count the events.

3. perf record to write events to a file.

4. perf report to browse the recorded file.

18 of 38

Perf - List the Available Events

Purpose: Displays all available performance monitoring events.

Event Types Included:

Hardware events (e.g., CPU cycles, instructions)
Software events (e.g., context switches, page faults)
Tracepoints, cache events, and more

Behavior:

Full list requires sudo
Without sudo: Shows a limited set of events

19 of 38

Perf - View CPU Real-Time System Profile

Command: sudo perf top

Purpose: Monitor CPU activity and performance in real time.

What It Shows:

Live view of functions consuming CPU cycles.
Sorted by highest usage for quick hotspot identification.

Key Features:

Updates continuously (like top for performance events).
Helps identify performance bottlenecks at function level.

20 of 38

Perf - View CPU Performance Statistics with perf

Purpose: Collect detailed CPU performance metrics for a command or workload.

What It Provides:

CPU cycles, instructions executed
Cache references and misses
Branch instructions and mispredictions
Context switches and page faults

Use Cases:

Measure efficiency of code execution
Compare performance across different configurations

21 of 38

Perf - View CPU Performance for a Command

Purpose: Measure CPU performance metrics while running a specific command.

What It Provides:

CPU cycles, instructions executed
Cache references and misses
Branch instructions and mispredictions
Context switches and page faults

Use Cases:

Measure efficiency of code execution
Compare performance across different configurations

22 of 38

Perf - Record and View Performance Results

Purpose: Capture detailed performance data for later analysis.

What It Does:

Records CPU cycle events during command execution.
Stores data in perf.data file for post-analysis.

Follow-Up Analysis:

Use perf report to view recorded data:

Benefits:

Identify hotspots and functions consuming most CPU cycles.
Enables deep dive into performance bottlenecks.

23 of 38

Connect CPU Concepts with Perf (with examples)

24 of 38

Example 1

Program1.c

Program2.c

25 of 38

Example 1 – Perf Results

Program1.c

Program2.c

Conclusion: Since Program2.c

allocates and stores data

sequentially, it achieves better

cache utilization, which reduces

CPU cycles and ultimately speeds

up execution.

26 of 38

Example 2

Program3.c

Program4.c

27 of 38

Example 2 – Perf Results

Program3.c

Program4.c

Conclusion: Since Program2.c

enables the CPU to predict

branches accurately, it minimizes

branch mispredictions, which in

turn speeds up execution.

28 of 38

Apply Perf Insights to PostgreSQL

29 of 38

Refactor ExecScan() to allow inlining of its core logic

30 of 38

Problem

31 of 38

Solution

32 of 38

Solution

33 of 38

Solution

34 of 38

Perf Results

Before

After

Conclusion: This leads to fewer instructions and branches, reducing branch misses

and ultimately improving overall performance.

35 of 38

Perf Results - Before

36 of 38

Perf Results - After

37 of 38

Acknowledgements

Amit Langote
Andres Freund
David Rowley
Divya Bhargov

38 of 38

Thank you