1 of 27

CPython 3.13 Bytecode Optimization Proposal

Ken Jin & Jules Poon

NUS TEST Lab

2 of 27

Background

  • CPython has a DSL for its bytecode since 3.12.
  • CPython has started splitting its bytecode instructions into simpler micro-ops “uops”, that are closer to a RISC form than a CISC form the current bytecode are.
  • We shall call the original CISC bytecode Tier 1 bytecode, and the smaller uops Tier 2 bytecode.
  • The aim of this is to help the optimizer identify redundant/repeated work done in each Tier 1 bytecode, and remove that.
  • We will trace backward jumps and form trees of superblocks.

3 of 27

Proposed Pipeline

Tier 1 bytecode

Tier 2 bytecode + watchers

Jump target calculation and setup

Cleanup

Partial eval + Type prop

Jump relocation

Undecided:

Use-def/liveness analysis for dead store elimination. Might be too expensive to compute for a Tier 2 interpreter.

4 of 27

Tier 1 Bytecode & Conversion to Tier 2 + Watchers

  • In progress, WIP by Mark, Guido, Brandt, Irit.

For watchers:

  • Convert LOAD_ATTR and LOAD_GLOBAL to LOAD_CONST by adding to the constant pool and installing watchers.
  • Install watchers for CALL as well.
  • Only do this for LOAD_ATTR that are relatively stable (inspect the counter), LOAD_ATTR is rather polymorphic even in pyperformance. So we don’t want to hurt polymorphic performance.
  • We don’t want to do this at this pass because doing it at the uops level is complicated.

5 of 27

Jump target calculation and setup

Problem: jumps and their targets will move after optimization passes, as code size shrinks.

Solution: Single pass to replace all jump instructions and their target instructions’ opcodes with negative index. Map the index to the original instruction in an array. This will allow us to restore them after we move them around during optimization passes.

Already done.

6 of 27

Partial Evaluation + Type prop

  • Done in a single pass because it uses the same algorithm.
  • Constant propagation, type guard elimination, typed instructions.
  • More about the algorithm in a later section.
  • WIP, partially done.

Inlining heuristics:

  • No need for that, because we trace into function calls during the region formation phase.

7 of 27

Cleanup

  • After optimization passes, there will be lots of redundant SAVE_IPs left.
  • Need a single pass to clean that up.
  • Already done.

8 of 27

Jump relocation

  • Replace all the negative opcodes back with our jump instructions and targets.
  • Re-compute our jump targets in that same pass, and populate them.
  • Already done.

9 of 27

The Abstract Interpretation Algorithm

10 of 27

Partial evaluation

The general idea:

Evaluating a program with respect to its “staticness” produces a residual program that should be faster.

Simplified version, adapted for CPython from https://pages.cs.wisc.edu/~horwitz/CS704-NOTES/9.PARTIAL-EVALUATION.html

  • We need a partition of all variables as either “static” or “dynamic”.
  • Operations on static variables do not need to be emitted, just compute their value and store it virtually.
  • Operations on dynamic variables need to be emitted, we need to re-materialize the static operands.

11 of 27

Partial evaluation via abstract interpretation

  • Create an interpreter that walks over the instructions.
  • The interpreter is responsible for creating the partition of variables and maintaining an accurate representation at any point of time.
  • We use this partition to determine the static/dynamic split.
  • Introduce the notion of “pure” instructions in the DSL. An instruction is pure iff its inputs are static then it can be computed without any side effects.
  • The abstract interpreter interprets but does not emit static pure instructions.
  • When encountering a dynamic/non-pure instruction, re-virtualize static operands by adding a LOAD_CONST if necessary.
  • We need to emit an INSERT instruction which inserts TOS into the right place, as the real stack might be incongruent with the virtual stack.
  • Note: we can only do limited dead store elimination in traces, as stores can be used outside of out trace.

12 of 27

The partitioning abstract interpreter

  • Number of fixed nodes = stack maxsize + local size.
  • A node is a tagged pointer.
  • Nodes point to other nodes, to show that they originated from/have the same properties as their parent.
  • This relationship is modelled as a tree.
  • The tree represents a partition.
  • The root node of a tree points to a custom RootNode PyObject that contains all the partitions properties (static/dynamic, constant value, its type, etc.).
  • The RootNode’s refcount is determined by its direct children.
  • Prevent cycles from forming by ensuring node operations on a same tree are non-cycle forming.

13 of 27

Partition view

l1

l2

l3

l4

l5

s5

s4

s3

s2

s1

RootNode

Refcount: 1

static_or_dynamic: static

const_val: 5

type: PyLong_Type

Locals:

Stack:

14 of 27

Node operations

OVERWRITE(src, dst): Make dst belong to src’s partition, and remove it from its old partition. There are some special cases that are handled by our algorithm and documented in the paper.

SET(src, dst): Union dst and src partition. The overall partition root node is src’s root node.

SWAP(src, dst): Swap src and dst between their respective partitions. This is just to handle SWAP in CPython bytecode.

15 of 27

A demo of abstract interpretation + partial evaluation

LOAD_CONST 1

STORE_FAST x

LOAD_FAST x

LOAD_CONST 2

ADD

RETURN_VALUE

<Abstract locals and stack>

16 of 27

A demo of abstract interpretation

ptr

LOAD_CONST 1

STORE_FAST x

LOAD_FAST x

LOAD_CONST 2

ADD

RETURN_VALUE

RootNode

Refcount: 1

static_or_dynamic: static

const_val: 1

17 of 27

A demo of abstract interpretation

ptr

ptr

LOAD_CONST 1

STORE_FAST x

LOAD_FAST x

LOAD_CONST 2

ADD

RETURN_VALUE

RootNode

Refcount: 1

static_or_dynamic: static

const_val: 1

18 of 27

A demo of abstract interpretation

ptr

LOAD_CONST 1

STORE_FAST x

LOAD_FAST x

LOAD_CONST 2

ADD

RETURN_VALUE

RootNode

Refcount: 1

static_or_dynamic: static

const_val: 1

19 of 27

A demo of abstract interpretation

ptr

ptr

LOAD_CONST 1

STORE_FAST x

LOAD_FAST x

LOAD_CONST 2

ADD

RETURN_VALUE

RootNode

Refcount: 1

static_or_dynamic: static

const_val: 1

20 of 27

A demo of abstract interpretation

ptr

ptr

ptr

LOAD_CONST 1

STORE_FAST x

LOAD_FAST x

LOAD_CONST 2

ADD

RETURN_VALUE

RootNode

Refcount: 1

static_or_dynamic: static

const_val: 1

RootNode

Refcount: 1

static_or_dynamic: static

const_val: 2

21 of 27

A demo of abstract interpretation

ptr

ptr

ptr

LOAD_CONST 1

STORE_FAST x

LOAD_FAST x

LOAD_CONST 2

ADD

RETURN_VALUE

RootNode

Refcount: 1

static_or_dynamic: static

const_val: 1

RootNode

Refcount: 1

static_or_dynamic: static

const_val: 2

22 of 27

A demo of abstract interpretation

ptr

ptr

LOAD_CONST 1

STORE_FAST x

LOAD_FAST x

LOAD_CONST 2

ADD

RETURN_VALUE

RootNode

Refcount: 1

static_or_dynamic: static

const_val: 1

RootNode

Refcount: 1

static_or_dynamic: static

const_val: 3

23 of 27

A demo of abstract interpretation

ptr

ptr

LOAD_CONST 1

STORE_FAST x

LOAD_FAST x

LOAD_CONST 2

ADD

RETURN_VALUE

RootNode

Refcount: 1

static_or_dynamic: static

const_val: 1

RootNode

Refcount: 1

static_or_dynamic: static

const_val: 3

24 of 27

A demo of abstract interpretation

ptr

ptr

LOAD_CONST 1

STORE_FAST x

LOAD_FAST x

LOAD_CONST 2

ADD

RETURN_VALUE

RootNode

Refcount: 1

static_or_dynamic: static

const_val: 1

Emitted:

LOAD_CONST 3

SAVE_IP

RETURN_VALUE

RootNode

Refcount: 1

static_or_dynamic: static

const_val: 3

25 of 27

Stack accessors

In short:

  • Replace stack accessors with `get_const` if they are pure+static-const. As this implies all operands can be resolved.
  • There are 3 types of variables:
    • Static-const
    • Static-non-const (can be used to elide stack operations for locals loads and stores)
    • Dynamic

26 of 27

Stack accessors

27 of 27

DSL Changes

  • Indicating whether an operation is pure or not.
  • Making implicit sources explicit. E.g. LOAD_FAST has no stack sources but it actually has an implicit source from the locals array. Likewise for LOAD_CONST.
  • Generate the path for const+static with stack accessors.