1 of 38

Leveraging the Perf Tool for PostgreSQL Performance Optimization

POSTGRES BANGALORE MEETUP 7

Nitin Jadhav

Senior Software Engineer at Microsoft

2 of 38

Speaker: Nitin Jadhav

  • Senior Software Engineer @ Microsoft
  • Azure Database for PostgreSQL team
  • Experience on onboarding new major and minor version of PostgreSQL
  • Experience on integrating extensions into the managed database.
  • Handling customers issues including data corruption issues.
  • Contribute to the community.

3 of 38

Agenda

  • Context and Scope

  • CPU Performance Fundamentals

  • The Perf Tool

  • Connect CPU Concepts with Perf (with examples)

  • Apply Perf Insights to PostgreSQL (with a practical example)

4 of 38

Context and Scope

5 of 38

The path to CPU performance in PostgreSQL

Step

Category

Performance improvement method

1

DBA

Tune postgresql.conf parameters (e.g., work_mem, shared_buffers)

2

DBA

Optimize queries: add indexes, rewrite queries, update statistics

3

PostgreSQL hacker

Modify code: implement better algorithms

4

PostgreSQL hacker

Modify code: Make the planner smarter

5

PostgreSQL hacker

Modify code: leverage modern CPU architecture

6

Compiler hacker

Enhance compiler optimizations

7

CPU designer

Design faster CPUs. Utilise more transistors

6 of 38

The path to CPU performance in PostgreSQL

Step

Category

Performance improvement method

1

DBA

Tune postgresql.conf parameters (e.g., work_mem, shared_buffers)

2

DBA

Optimize queries: add indexes, rewrite queries, update statistics

3

PostgreSQL hacker

Modify code: implement better algorithms

4

PostgreSQL hacker

Modify code: Make the planner smarter

5

PostgreSQL hacker

Modify code: leverage modern CPU architecture

6

Compiler hacker

Enhance compiler optimizations

7

CPU designer

Design faster CPUs. Utilise more transistors

7 of 38

CPU Performance Fundamentals

Cache Locality

Branch Misprediction

Pipelined Execution

8 of 38

Cache Locality

  • CPUs rely on fast caches to avoid expensive memory access.
  • Access latency hierarchy:
    • CPU Registers: ~0.3 ns
    • L1 Cache: ~1 ns
    • L2 Cache: ~4 ns
    • L3 Cache: ~10 ns
    • RAM: ~100 ns+
    • Disk: ~10 ms (orders of magnitude slower)
  • Sequential access is efficient because it leverages spatial locality.
  • Scattered/random access is inefficient → causes cache misses.
  • Cache misses stall execution while waiting for data from slower memory.
  • Frequent function calls can evict hot code from the instruction cache, increasing stalls.

9 of 38

Branch Misprediction

    • Wastes cycles by flushing the pipeline and reorder buffer.

Impact:

    • Unpredictable branches (data-dependent conditions).
    • Indirect branches (function pointers, virtual calls).

Primary Causes:

    • Flatten execution paths to remove unnecessary branches.
    • Replace indirect calls with direct calls where possible.
    • Loop unrolling to minimize loop exit mispredictions.

How to Reduce Mispredictions:

10 of 38

Instruction Execution Stages

Instruction Fetch

Instruction Decode

Execute

Memory Access

Write Back

11 of 38

Pipelined Execution

  • Serial Execution
  • Pipelined Execution

12 of 38

Pipelined Execution – Data Hazards

Problem: Conflict of operands between instructions

Example: Need result of previous instruction

13 of 38

Pipelined Execution – Control Hazards

  • Branch: Two possible paths in control flow.
  • Which instruction to fetch next?
  • Decision depends on EX stage

14 of 38

Pipelined Execution – Key Insights

  • Pipeline stall = Cycles where the CPU cannot do useful operations.

  • Common stall causes:
    • Cache misses → Waiting for slow memory access (~100ns+).
    • Branch misprediction → Flushing and restarting the pipeline.

  • Fewer stalls = higher throughput.

15 of 38

The Perf Tool

16 of 38

Perf - Introduction

What is perf?�A powerful Linux performance analysis tool for developers and system administrators.

Purpose:

    • Understand performance of processes and the Linux kernel.
    • Diagnose bottlenecks, tune applications, and optimize system performance.

Capabilities:

    • Monitor CPU usage and hardware events.
    • Analyze detailed function calls in complex software stacks.
    • Flexible interface for retrieving and displaying performance metrics.

17 of 38

Perf - Workflows

1. perf list to find events.

2. perf stat to count the events.

3. perf record to write events to a file.

4. perf report to browse the recorded file.

18 of 38

Perf - List the Available Events

Purpose: Displays all available performance monitoring events.

Event Types Included:

  • Hardware events (e.g., CPU cycles, instructions)
  • Software events (e.g., context switches, page faults)
  • Tracepoints, cache events, and more

Behavior:

  • Full list requires sudo
  • Without sudo: Shows a limited set of events

19 of 38

Perf - View CPU Real-Time System Profile

Command: sudo perf top

Purpose: Monitor CPU activity and performance in real time.

What It Shows:

  • Live view of functions consuming CPU cycles.
  • Sorted by highest usage for quick hotspot identification.

Key Features:

  • Updates continuously (like top for performance events).
  • Helps identify performance bottlenecks at function level.

20 of 38

Perf - View CPU Performance Statistics with perf

Purpose: Collect detailed CPU performance metrics for a command or workload.

What It Provides:

  • CPU cycles, instructions executed
  • Cache references and misses
  • Branch instructions and mispredictions
  • Context switches and page faults

Use Cases:

  • Measure efficiency of code execution
  • Compare performance across different configurations

21 of 38

Perf - View CPU Performance for a Command

Purpose: Measure CPU performance metrics while running a specific command.

What It Provides:

  • CPU cycles, instructions executed
  • Cache references and misses
  • Branch instructions and mispredictions
  • Context switches and page faults

Use Cases:

  • Measure efficiency of code execution
  • Compare performance across different configurations

22 of 38

Perf - Record and View Performance Results

Purpose: Capture detailed performance data for later analysis.

What It Does:

  • Records CPU cycle events during command execution.
  • Stores data in perf.data file for post-analysis.

Follow-Up Analysis:

  • Use perf report to view recorded data:

Benefits:

  • Identify hotspots and functions consuming most CPU cycles.
  • Enables deep dive into performance bottlenecks.

23 of 38

Connect CPU Concepts with Perf (with examples)

24 of 38

Example 1

Program1.c

Program2.c

25 of 38

Example 1 – Perf Results

Program1.c

Program2.c

Conclusion: Since Program2.c

allocates and stores data

sequentially, it achieves better

cache utilization, which reduces

CPU cycles and ultimately speeds

up execution.

26 of 38

Example 2

Program3.c

Program4.c

27 of 38

Example 2 – Perf Results

Program3.c

Program4.c

Conclusion: Since Program2.c

enables the CPU to predict

branches accurately, it minimizes

branch mispredictions, which in

turn speeds up execution.

28 of 38

Apply Perf Insights to PostgreSQL

29 of 38

Refactor ExecScan() to allow inlining of its core logic

30 of 38

Problem

31 of 38

Solution

32 of 38

Solution

33 of 38

Solution

34 of 38

Perf Results

Before

After

Conclusion: This leads to fewer instructions and branches, reducing branch misses

and ultimately improving overall performance.

35 of 38

Perf Results - Before

36 of 38

Perf Results - After

37 of 38

Acknowledgements

  • Amit Langote
  • Andres Freund
  • David Rowley
  • Divya Bhargov

38 of 38

Thank you