1 of 29

STRUCTURED KERNELS

A better (more structured) way to write kernels!

2 of 29

Hackathon: The Plan

  • [This powerpoint]
    • What are structured kernels
    • Why are they useful
    • High level design
  • [Code walkthrough]
    • Briefly look at the code
  • [Porting tips]
  • Use the chat for questions/discussions throughout the hackathon!

3 of 29

Contents

  • Benefits of structured kernels
  • How to write a new operator: Recap
  • How to make an operator structured
  • Under the hood
  • Current state

4 of 29

WHAT ARE THEY?

5 of 29

What are they?

  • A new way of writing kernels
  • Relies on the improved codegen to generate more of the boilerplate code
    • Generates the functional/inplace/out variants
    • Handles Tensor creation and resizing
  • What you need to write:
    • A shape-checking function
      • Asserts on inputs
      • Determines size of the output
    • An out kernel
      • Can focus on the kernel logic
  • Idea: it saves you from writing several lines of error-prone code (see the sketch below).
    • The effect is multiplied across the hundreds of kernels in PyTorch
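A conceptual sketch of the two pieces you write (illustrative names only, not the real structured-kernel API; the actual mechanics are covered later in the deck):

```cpp
#include <ATen/ATen.h>

// 1) Shape-checking ("meta") function: asserts on inputs, determines the
//    output size. (Illustrative signature, not the actual API.)
void my_op_meta(const at::Tensor& self) {
  TORCH_CHECK(self.dim() == 2, "my_op: expected a 2-D tensor");
  // ... report that the output should have shape self.sizes() ...
}

// 2) out kernel: by the time this runs, the output already exists with the
//    right size, so it can focus purely on the kernel logic.
void my_op_impl(const at::Tensor& self, const at::Tensor& out) {
  // ... computation only: no allocation, no resizing, no input checks ...
}
```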

6 of 29

Benefit: Shape Checking

  • Structured kernels clearly separate shape checking from actual computation.
  • New Shape Checking API
    • Replace your input tensors with “meta” tensors
    • Run the operator
    • No tensor storage is allocated! (memory)
    • No kernel calculation is performed! (compute)
  • It can be useful to know what the shape of the output would be without running the op.

* Exact meta tensor user API is subject to change
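As a rough illustration of the idea (hedged: per the note above the exact user-facing API may differ; this assumes the meta device and the at::kMeta constant are available, and that add has a meta kernel):

```cpp
#include <ATen/ATen.h>
#include <iostream>

int main() {
  // Meta tensors carry sizes/strides/dtype but own no storage.
  at::Tensor a = at::empty({128, 256}, at::TensorOptions().device(at::kMeta));
  at::Tensor b = at::empty({128, 256}, at::TensorOptions().device(at::kMeta));

  // Dispatches to the op's meta function: shape checks run and the output
  // shape is computed, but no memory is allocated and no kernel executes.
  at::Tensor c = at::add(a, b);

  std::cout << c.sizes() << "\n";  // [128, 256]
  return 0;
}
```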

7 of 29

Benefit: Shape Checking – Under the Hood

  • “Meta” tensor is just an ordinary tensor with:
    • no underlying storage
    • “Meta” Dispatch Key
    • (we’ll talk about internals soon)

* Exact meta tensor user API is subject to change

8 of 29

Benefit: Shape Checking – Potential Use Cases

  • torch.jit.trace()
    • Tracing a model (usually) only requires knowledge of the input/output shapes of each op
  • Lazy Tensor
    • Some operators only require knowledge of shapes, e.g. size()
    • (a+b).size() doesn’t actually need to compute a+b
    • XLA and MLC are both examples (XLA actually had to implement its own shape logic)

* Exact meta tensor user API is subject to change

9 of 29

HOW TO WRITE A NEW OPERATOR: RECAP

10 of 29

How to write a new operator: Recap

  • Add an entry to native_functions.yaml
  • Specify a function for each backend

11 of 29

How to write a new operator: Recap

  • Add another entry for the out= variant
  • Specify a function for each backend

12 of 29

How to write a new operator: Recap

Each hand-written kernel has to handle many things:

  • Error checking inputs
  • Resizing the output
  • Actual kernel computation

(Note from the slide's code example: we could call at::native::empty_cpu when allocating the output for better perf, but that call would be different for CUDA! See the sketch below.)

  • Write a matching kernel in the at::native namespace
  • Total kernels to write: (# variants) * (# backends)
  • Codegen machinery glues together everything else!
    • Python bindings
    • C++ API
    • Dispatcher registration
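For concreteness, a hedged sketch of what an unstructured kernel pair looks like (hypothetical op my_op, not real in-tree code), showing the boilerplate each hand-written kernel repeats:

```cpp
#include <ATen/ATen.h>
#include <ATen/native/Resize.h>

namespace at { namespace native {

Tensor& my_op_out_cpu(const Tensor& self, Tensor& out) {
  // 1) Error checking on the inputs
  TORCH_CHECK(self.scalar_type() == out.scalar_type(),
              "my_op: expected out to have dtype ", self.scalar_type());
  // 2) Resizing the output to the correct shape
  at::native::resize_output(out, self.sizes());
  // 3) The actual kernel computation
  // ... elementwise loop / dispatch to a stub ...
  return out;
}

Tensor my_op_cpu(const Tensor& self) {
  // Functional variant: allocate an output, then reuse the out= kernel.
  // (We could call at::native::empty_cpu here for better perf, but that
  //  would be CPU-specific; at::empty dispatches generically.)
  Tensor out = at::empty({0}, self.options());
  return my_op_out_cpu(self, out);
}

}} // namespace at::native
```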

13 of 29

MAKING IT STRUCTURED

14 of 29

Making it Structured

  • (1) Add two new keywords in native_functions.yaml

15 of 29

Making it Structured

  • (1) Add two new keywords in native_functions.yaml
  • (2) Write a meta function
    • Checks inputs
    • Determines output size
    • Calls set_output (codegen’d)
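A simplified sketch of such a meta function, written with the TORCH_META_FUNC macro from aten/src/ATen/TensorMeta.h (hedged: this is not the exact in-tree code, and the set_output signature has varied across versions; it assumes the two keywords from step (1) are structured, on the out= entry, and structured_delegate, on the functional entry):

```cpp
// Simplified sketch (not the exact in-tree code) of a structured meta function.
// The class this method belongs to is codegen'd into build/aten/src/ATen/MetaFunctions.h.
namespace at { namespace meta {

TORCH_META_FUNC(upsample_nearest1d) (
    const Tensor& input, IntArrayRef output_size, c10::optional<double> scales) {
  // Checks inputs
  TORCH_CHECK(output_size.size() == 1, "expected output_size to have 1 element");
  TORCH_CHECK(input.dim() == 3, "expected a 3-D input tensor");
  // Determines the output size and hands it to the codegen'd set_output()
  set_output({input.size(0), input.size(1), output_size[0]}, input.options());
}

}} // namespace at::meta
```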

16 of 29

Making it Structured

  • (1) Add two new keywords in native_functions.yaml
  • (2) Write a meta function
    • Checks inputs
    • Determines output size
    • Calls set_output (codegen’d)
  • (3) Write out= kernel (once per backend)
    • Output is already allocated and properly resized
    • Shape checks have passed
    • Device guards are set
    • Version counter bumps are all handled
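And a matching sketch of the out= kernel, one per backend, written with the companion TORCH_IMPL_FUNC macro (again simplified and hedged, not the exact in-tree code):

```cpp
// Simplified sketch of the CPU out= kernel for a structured op. Thanks to the
// codegen'd base class, `output` is already allocated and correctly sized,
// and the meta() checks have already run.
namespace at { namespace native {

TORCH_IMPL_FUNC(upsample_nearest1d_out_cpu) (
    const Tensor& input, IntArrayRef output_size, c10::optional<double> scales,
    const Tensor& output) {
  // Only the kernel logic lives here: no allocation, no shape checks,
  // no device guards, no version-counter bookkeeping.
  // ... actual upsampling loop (typically dispatched through a stub) ...
}

}} // namespace at::native
```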

17 of 29

WHAT’S GOING ON UNDER THE HOOD

18 of 29

What’s going on under the hood

  • The design is class-based!
    • meta(), impl(), and set_output() are class methods
  • Boilerplate is factored into several classes
  • Codegen creates most of these classes

19 of 29

What’s going on under the hood

  • The design is class-based!
  • Boilerplate is factored into several classes
  • Codegen creates most of these classes

[Diagram: class hierarchy (meta(), impl(), and set_output() are class methods)]
  • at::impl::MetaBase (aten/src/ATen/TensorMeta.h)
    • at::meta::upsample_nearest1d (build/aten/src/ATen/MetaFunctions.h, codegen'd): meta() is defined per op!
    • …

20 of 29

What’s going on under the hood

  • The design is class-based!
  • Boilerplate is factored into several classes
  • Codegen creates most of these classes

[Diagram: class hierarchy (meta(), impl(), and set_output() are class methods)]
  • at::impl::MetaBase (aten/src/ATen/TensorMeta.h)
    • at::meta::upsample_nearest1d (build/aten/src/ATen/MetaFunctions.h, codegen'd): meta() is defined per op!
      • at::native::structured_upsample_nearest1d_out_cpu (build/aten/src/ATen/NativeFunctions.h, codegen'd)
      • at::native::structured_upsample_nearest1d_out_cuda (build/aten/src/ATen/NativeFunctions.h, codegen'd)
      • impl() is defined per backend!

21 of 29

What’s going on under the hood

  • The design is class-based!
  • Boilerplate is factored into several classes
  • Codegen creates most of these classes

[Diagram: class hierarchy (meta(), impl(), and set_output() are class methods)]
  • at::impl::MetaBase (aten/src/ATen/TensorMeta.h)
    • at::meta::upsample_nearest1d (build/aten/src/ATen/MetaFunctions.h, codegen'd): meta() is defined per op!
      • at::native::structured_upsample_nearest1d_out_cpu (build/aten/src/ATen/NativeFunctions.h, codegen'd): impl() is defined per backend!
        • structured_upsample_nearest1d_out_cpu_out
        • structured_upsample_nearest1d_out_cpu_functional
      • at::native::structured_upsample_nearest1d_out_cuda (build/aten/src/ATen/NativeFunctions.h, codegen'd)
        • structured_upsample_nearest1d_out_cuda_out
        • structured_upsample_nearest1d_out_cuda_functional
  • The leaf classes live in build/aten/src/ATen/RegisterCPU.cpp (codegen'd); set_output() is defined there for the functional/inplace/out variants, and handles empty_cpu / empty_cuda / resize_output
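To make the hierarchy concrete, a heavily simplified sketch of what the generated classes roughly look like (illustrative only; the real codegen'd code has more machinery and slightly different signatures):

```cpp
#include <ATen/ATen.h>
#include <array>

// aten/src/ATen/TensorMeta.h (hand-written): the common base class.
struct MetaBase {
  virtual void set_output(at::IntArrayRef sizes, at::TensorOptions options) = 0;
  virtual ~MetaBase() = default;
};

// build/aten/src/ATen/MetaFunctions.h (codegen'd): one class per structured op.
struct structured_upsample_nearest1d : MetaBase {
  // meta() is declared here; you define it (via TORCH_META_FUNC).
  void meta(const at::Tensor& input, at::IntArrayRef output_size,
            c10::optional<double> scales);
};

// build/aten/src/ATen/NativeFunctions.h (codegen'd): one class per backend.
struct structured_upsample_nearest1d_out_cpu : structured_upsample_nearest1d {
  // impl() is declared here; you define it (via TORCH_IMPL_FUNC).
  void impl(const at::Tensor& input, at::IntArrayRef output_size,
            c10::optional<double> scales, const at::Tensor& output);
};

// build/aten/src/ATen/RegisterCPU.cpp (codegen'd): one leaf class per variant.
struct structured_upsample_nearest1d_out_cpu_functional final
    : structured_upsample_nearest1d_out_cpu {
  void set_output(at::IntArrayRef sizes, at::TensorOptions options) override {
    // Functional variant: allocate a fresh output.
    // (The real codegen calls at::native::empty_cpu here for speed.)
    outputs_[0] = at::empty(sizes, options);
  }
  std::array<at::Tensor, 1> outputs_;
};
// The *_out leaf instead resizes the user-provided out tensor (resize_output).
```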

22 of 29

Under the hood: Dispatcher Registration

Before (unstructured kernel), in build/aten/src/ATen/RegisterCPU.cpp:

  • Wrapper just calls the kernel!

After (structured kernel), in build/aten/src/ATen/RegisterCPU.cpp:

  • The kernel doesn’t “exist” yet
  • We just have the pieces spread across the class hierarchy
  • Wrapper composes the pieces
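Roughly, the codegen'd wrapper for the functional variant composes the pieces like this (a sketch against the simplified classes from the previous slide's example; actual generated names and signatures differ):

```cpp
// Simplified sketch of a codegen'd wrapper in build/aten/src/ATen/RegisterCPU.cpp.
at::Tensor wrapper_upsample_nearest1d(
    const at::Tensor& input, at::IntArrayRef output_size,
    c10::optional<double> scales) {
  structured_upsample_nearest1d_out_cpu_functional op;
  // meta(): input checks + output size; its set_output() call allocates the result.
  op.meta(input, output_size, scales);
  // impl(): the actual kernel, writing into the already-allocated output.
  op.impl(input, output_size, scales, op.outputs_[0]);
  return std::move(op.outputs_[0]);
}
```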

23 of 29

Under the hood: TensorIterator Integration

  • TensorIterator is used in ~1/3 of kernels
  • So we need some amount of integration with structured kernels
  • Example: add
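For a TensorIterator op like add, the hand-written pieces are tiny, because the codegen'd meta class derives from TensorIteratorBase and reuses its broadcasting and type-promotion logic. A hedged sketch (not the exact in-tree code; the helper names have changed across versions):

```cpp
// Simplified sketch of add's structured meta/impl (TensorIterator integration).
namespace at { namespace meta {

TORCH_META_FUNC2(add, Tensor) (
    const Tensor& self, const Tensor& other, const Scalar& alpha) {
  // Building the binary op performs broadcasting, dtype promotion, and the
  // set_output() call all at once, via TensorIteratorBase.
  build_binary_op(maybe_get_output(), self, other);
}

}} // namespace at::meta

namespace at { namespace native {

TORCH_IMPL_FUNC(add_out) (
    const Tensor& self, const Tensor& other, const Scalar& alpha,
    const Tensor& result) {
  // The impl just runs the elementwise kernel over the prepared iterator.
  add_stub(device_type(), *this, alpha);
}

}} // namespace at::native
```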

24 of 29

Under the hood: TensorIterator Integration

  • TensorIterator is used in ~1/3 of kernels
  • So we need some amount of integration with structured kernels
  • Example: add

[Diagram: class hierarchy for add, integrated with TensorIterator]
  • at::impl::MetaBase
    • at::TensorIteratorBase (at::TensorIterator also derives from it)
      • at::meta::add_Tensor: implements meta() (you write it)
        • at::native::structured_add_out: implements impl() (you write it)
          • CPU: structured_add_out_out / structured_add_out_functional / structured_add_out_inplace, in build/aten/src/ATen/RegisterCPU.cpp (codegen'd)
          • CUDA: structured_add_out_out / structured_add_out_functional / structured_add_out_inplace, in build/aten/src/ATen/RegisterCUDA.cpp (codegen'd)
          • These leaf classes implement set_output() (codegen'd); set_output() is a method on TensorIteratorBase

25 of 29

Under the hood: Shape Computation

  • Structured kernels already split up the shape computation and the actual kernel
    • To compute only shapes, just don’t call impl()
  • op.meta()
    • Asserts valid inputs
    • Computes output size
    • Creates the output tensor
      • Doesn’t allocate data, though! (unique to the Meta key)
  • In build/aten/src/ATen/RegisterMeta.cpp, the codegen’d wrapper skips the call to op.impl() (see the sketch below)
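Roughly what the Meta-key wrapper amounts to (a simplified sketch, not the generated code verbatim; names are illustrative):

```cpp
// Simplified sketch of the codegen'd wrapper in build/aten/src/ATen/RegisterMeta.cpp.
at::Tensor wrapper_meta_upsample_nearest1d(
    const at::Tensor& input, at::IntArrayRef output_size,
    c10::optional<double> scales) {
  structured_upsample_nearest1d_meta op;  // its set_output() makes a meta tensor
  // Runs the shape checks and creates the output: sizes/strides/dtype are set,
  // but no storage is allocated.
  op.meta(input, output_size, scales);
  // Note: op.impl() is never called, so no kernel computation happens.
  return std::move(op.outputs_[0]);
}
```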

26 of 29

CURRENT STATE

27 of 29

Current State

  • Changes have landed, and a few ops have already been ported!
  • Some ops don’t have out variants, and will remain unstructured
    • Factory ops
    • View ops
  • Currently there is support for CPU, CUDA, and DefaultBackend kernels
    • More eventually (e.g. Quantized)
  • Out-of-tree support for structured kernels is coming eventually
    • Goal: external backends can write their out kernels with the same structured kernel guarantees
      • Output is already allocated and resized, shape checks have passed, device guards set, etc.
    • Can be implemented with another dispatch key (Common)

28 of 29

Current State

  • … But the journey isn’t over (PyTorch has a lot of operators)
    • We’ll probably make a GitHub issue for porting soon

29 of 29

QUESTIONS?