1 of 29

STRUCTURED KERNELS

A better (more structured) way to write kernels!

2 of 29

Hackathon: The Plan

  • [This powerpoint]
    • What are structured kernels
    • Why are they useful
    • High level design
  • [Code walkthrough]
    • Briefly look at the code
  • [Porting tips]
  • Use the chat for questions/discussions throughout the hackathon!

3 of 29

Contents

  • Benefits of structured kernels
  • How to write a new operator: Recap
  • How to make an operator structured
  • Under the hood
  • Current state

4 of 29

WHAT ARE THEY?

5 of 29

What are they?

  • A new way of writing kernels
  • Relies on the improved codegen to generate more of the boilerplate code
    • Generates the functional/inplace/out variants
    • Handles Tensor creation and resizing
  • What you need to write:
    • A shape-checking function
      • Asserts on inputs
      • Determines size of the output
    • An out kernel
      • Can focus on the kernel logic
  • Idea: it saves you from writing several lines of error-prone code (see the sketch below).
    • The effect is multiplied across the hundreds of kernels in PyTorch
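A conceptual sketch of the two pieces you write (illustrative names only, not the real structured-kernel API; the actual mechanics are covered later in the deck):

```cpp
#include <ATen/ATen.h>

// 1) Shape-checking ("meta") function: asserts on inputs, determines the
//    output size. (Illustrative signature, not the actual API.)
void my_op_meta(const at::Tensor& self) {
  TORCH_CHECK(self.dim() == 2, "my_op: expected a 2-D tensor");
  // ... report that the output should have shape self.sizes() ...
}

// 2) out kernel: by the time this runs, the output already exists with the
//    right size, so it can focus purely on the kernel logic.
void my_op_impl(const at::Tensor& self, const at::Tensor& out) {
  // ... computation only: no allocation, no resizing, no input checks ...
}
```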

6 of 29

Benefit: Shape Checking

  • Structured kernels clearly separate shape checking from actual computation.
  • New Shape Checking API
    • Replace your input tensors with “meta” tensors
    • Run the operator
    • No tensor storage is allocated! (memory)
    • No kernel calculation is performed! (compute)
  • It can be useful to know what the shape of the output would be without running the op.

* Exact meta tensor user API is subject to change
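As a rough illustration of the idea (hedged: per the note above the exact user-facing API may differ; this assumes the meta device and the at::kMeta constant are available, and that add has a meta kernel):

```cpp
#include <ATen/ATen.h>
#include <iostream>

int main() {
  // Meta tensors carry sizes/strides/dtype but own no storage.
  at::Tensor a = at::empty({128, 256}, at::TensorOptions().device(at::kMeta));
  at::Tensor b = at::empty({128, 256}, at::TensorOptions().device(at::kMeta));

  // Dispatches to the op's meta function: shape checks run and the output
  // shape is computed, but no memory is allocated and no kernel executes.
  at::Tensor c = at::add(a, b);

  std::cout << c.sizes() << "\n";  // [128, 256]
  return 0;
}
```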

7 of 29

Benefit: Shape Checking – Under the Hood

  • “Meta” tensor is just an ordinary tensor with:
    • no underlying storage
    • “Meta” Dispatch Key
    • (we’ll talk about internals soon)

* Exact meta tensor user API is subject to change

8 of 29

Benefit: Shape Checking – Potential Use Cases

  • torch.jit.trace()
    • Tracing a model (usually) only requires knowledge of the input/output shapes of each op
  • Lazy Tensor
    • Some operators only require knowledge of shapes, e.g. size()
    • (a+b).size() doesn’t actually need to compute a+b
    • XLA and MLC are both examples (XLA actually had to implement its own shape logic)

* Exact meta tensor user API is subject to change

9 of 29

HOW TO WRITE A NEW OPERATOR: RECAP

10 of 29

How to write a new operator: Recap

  • Add an entry to native_functions.yaml
  • Specify a function for each backend

11 of 29

How to write a new operator: Recap

  • Add another entry for the out= variant
  • Specify a function for each backend

12 of 29

How to write a new operator: Recap

Each hand-written kernel has to handle many things:

  • Error checking inputs
  • Resizing the output
  • Actual kernel computation

(Note from the slide's code example: we could call at::native::empty_cpu when allocating the output for better perf, but that call would be different for CUDA! See the sketch below.)

  • Write a matching kernel in the at::native namespace
  • Total kernels to write: (# variants) * (# backends)
  • Codegen machinery glues together everything else!
    • Python bindings
    • C++ API
    • Dispatcher registration
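For concreteness, a hedged sketch of what an unstructured kernel pair looks like (hypothetical op my_op, not real in-tree code), showing the boilerplate each hand-written kernel repeats:

```cpp
#include <ATen/ATen.h>
#include <ATen/native/Resize.h>

namespace at { namespace native {

Tensor& my_op_out_cpu(const Tensor& self, Tensor& out) {
  // 1) Error checking on the inputs
  TORCH_CHECK(self.scalar_type() == out.scalar_type(),
              "my_op: expected out to have dtype ", self.scalar_type());
  // 2) Resizing the output to the correct shape
  at::native::resize_output(out, self.sizes());
  // 3) The actual kernel computation
  // ... elementwise loop / dispatch to a stub ...
  return out;
}

Tensor my_op_cpu(const Tensor& self) {
  // Functional variant: allocate an output, then reuse the out= kernel.
  // (We could call at::native::empty_cpu here for better perf, but that
  //  would be CPU-specific; at::empty dispatches generically.)
  Tensor out = at::empty({0}, self.options());
  return my_op_out_cpu(self, out);
}

}} // namespace at::native
```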

13 of 29

MAKING IT STRUCTURED

14 of 29

Making it Structured

  • (1) Add two new keywords in native_functions.yaml

15 of 29

Making it Structured

  • (1) Add two new keywords in native_functions.yaml
  • (2) Write a meta function
    • Checks inputs
    • Determines output size
    • Calls set_output (codegen’d)
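A simplified sketch of such a meta function, written with the TORCH_META_FUNC macro from aten/src/ATen/TensorMeta.h (hedged: this is not the exact in-tree code, and the set_output signature has varied across versions; it assumes the two keywords from step (1) are structured, on the out= entry, and structured_delegate, on the functional entry):

```cpp
// Simplified sketch (not the exact in-tree code) of a structured meta function.
// The class this method belongs to is codegen'd into build/aten/src/ATen/MetaFunctions.h.
namespace at { namespace meta {

TORCH_META_FUNC(upsample_nearest1d) (
    const Tensor& input, IntArrayRef output_size, c10::optional<double> scales) {
  // Checks inputs
  TORCH_CHECK(output_size.size() == 1, "expected output_size to have 1 element");
  TORCH_CHECK(input.dim() == 3, "expected a 3-D input tensor");
  // Determines the output size and hands it to the codegen'd set_output()
  set_output({input.size(0), input.size(1), output_size[0]}, input.options());
}

}} // namespace at::meta
```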

16 of 29

Making it Structured

  • (1) Add two new keywords in native_functions.yaml
  • (2) Write a meta function
    • Checks inputs
    • Determines output size
    • Calls set_output (codegen’d)
  • (3) Write out= kernel (once per backend)
    • Output is already allocated and properly resized
    • Shape checks have passed
    • Device guards are set
    • Version counter bumps are all handled
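And a matching sketch of the out= kernel, one per backend, written with the companion TORCH_IMPL_FUNC macro (again simplified and hedged, not the exact in-tree code):

```cpp
// Simplified sketch of the CPU out= kernel for a structured op. Thanks to the
// codegen'd base class, `output` is already allocated and correctly sized,
// and the meta() checks have already run.
namespace at { namespace native {

TORCH_IMPL_FUNC(upsample_nearest1d_out_cpu) (
    const Tensor& input, IntArrayRef output_size, c10::optional<double> scales,
    const Tensor& output) {
  // Only the kernel logic lives here: no allocation, no shape checks,
  // no device guards, no version-counter bookkeeping.
  // ... actual upsampling loop (typically dispatched through a stub) ...
}

}} // namespace at::native
```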

17 of 29

WHAT’S GOING ON UNDER THE HOOD

18 of 29

What’s going on under the hood

  • The design is class-based!
    • meta(), impl(), and set_output() are class methods
  • Boilerplate is factored into several classes
  • Codegen creates most of these classes

19 of 29

What’s going on under the hood

  • The design is class-based!
  • Boilerplate is factored into several classes
  • Codegen creates most of these classes

[Diagram: class hierarchy (meta(), impl(), and set_output() are class methods)]
  • at::impl::MetaBase (aten/src/ATen/TensorMeta.h)
    • at::meta::upsample_nearest1d (build/aten/src/ATen/MetaFunctions.h, codegen'd): meta() is defined per op!
    • …

20 of 29

What’s going on under the hood

  • The design is class-based!
  • Boilerplate is factored into several classes
  • Codegen creates most of these classes

[Diagram: class hierarchy (meta(), impl(), and set_output() are class methods)]
  • at::impl::MetaBase (aten/src/ATen/TensorMeta.h)
    • at::meta::upsample_nearest1d (build/aten/src/ATen/MetaFunctions.h, codegen'd): meta() is defined per op!
      • at::native::structured_upsample_nearest1d_out_cpu (build/aten/src/ATen/NativeFunctions.h, codegen'd)
      • at::native::structured_upsample_nearest1d_out_cuda (build/aten/src/ATen/NativeFunctions.h, codegen'd)
      • impl() is defined per backend!

21 of 29

What’s going on under the hood

  • The design is class-based!
  • Boilerplate is factored into several classes
  • Codegen creates most of these classes

[Diagram: class hierarchy (meta(), impl(), and set_output() are class methods)]
  • at::impl::MetaBase (aten/src/ATen/TensorMeta.h)
    • at::meta::upsample_nearest1d (build/aten/src/ATen/MetaFunctions.h, codegen'd): meta() is defined per op!
      • at::native::structured_upsample_nearest1d_out_cpu (build/aten/src/ATen/NativeFunctions.h, codegen'd): impl() is defined per backend!
        • structured_upsample_nearest1d_out_cpu_out
        • structured_upsample_nearest1d_out_cpu_functional
      • at::native::structured_upsample_nearest1d_out_cuda (build/aten/src/ATen/NativeFunctions.h, codegen'd)
        • structured_upsample_nearest1d_out_cuda_out
        • structured_upsample_nearest1d_out_cuda_functional
  • The leaf classes live in build/aten/src/ATen/RegisterCPU.cpp (codegen'd); set_output() is defined there for the functional/inplace/out variants, and handles empty_cpu / empty_cuda / resize_output
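To make the hierarchy concrete, a heavily simplified sketch of what the generated classes roughly look like (illustrative only; the real codegen'd code has more machinery and slightly different signatures):

```cpp
#include <ATen/ATen.h>
#include <array>

// aten/src/ATen/TensorMeta.h (hand-written): the common base class.
struct MetaBase {
  virtual void set_output(at::IntArrayRef sizes, at::TensorOptions options) = 0;
  virtual ~MetaBase() = default;
};

// build/aten/src/ATen/MetaFunctions.h (codegen'd): one class per structured op.
struct structured_upsample_nearest1d : MetaBase {
  // meta() is declared here; you define it (via TORCH_META_FUNC).
  void meta(const at::Tensor& input, at::IntArrayRef output_size,
            c10::optional<double> scales);
};

// build/aten/src/ATen/NativeFunctions.h (codegen'd): one class per backend.
struct structured_upsample_nearest1d_out_cpu : structured_upsample_nearest1d {
  // impl() is declared here; you define it (via TORCH_IMPL_FUNC).
  void impl(const at::Tensor& input, at::IntArrayRef output_size,
            c10::optional<double> scales, const at::Tensor& output);
};

// build/aten/src/ATen/RegisterCPU.cpp (codegen'd): one leaf class per variant.
struct structured_upsample_nearest1d_out_cpu_functional final
    : structured_upsample_nearest1d_out_cpu {
  void set_output(at::IntArrayRef sizes, at::TensorOptions options) override {
    // Functional variant: allocate a fresh output.
    // (The real codegen calls at::native::empty_cpu here for speed.)
    outputs_[0] = at::empty(sizes, options);
  }
  std::array<at::Tensor, 1> outputs_;
};
// The *_out leaf instead resizes the user-provided out tensor (resize_output).
```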

22 of 29

Under the hood: Dispatcher Registration

Before (unstructured kernel), in build/aten/src/ATen/RegisterCPU.cpp:

  • Wrapper just calls the kernel!

After (structured kernel), in build/aten/src/ATen/RegisterCPU.cpp:

  • The kernel doesn’t “exist” yet
  • We just have the pieces spread across the class hierarchy
  • Wrapper composes the pieces
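Roughly, the codegen'd wrapper for the functional variant composes the pieces like this (a sketch against the simplified classes from the previous slide's example; actual generated names and signatures differ):

```cpp
// Simplified sketch of a codegen'd wrapper in build/aten/src/ATen/RegisterCPU.cpp.
at::Tensor wrapper_upsample_nearest1d(
    const at::Tensor& input, at::IntArrayRef output_size,
    c10::optional<double> scales) {
  structured_upsample_nearest1d_out_cpu_functional op;
  // meta(): input checks + output size; its set_output() call allocates the result.
  op.meta(input, output_size, scales);
  // impl(): the actual kernel, writing into the already-allocated output.
  op.impl(input, output_size, scales, op.outputs_[0]);
  return std::move(op.outputs_[0]);
}
```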

23 of 29

Under the hood: TensorIterator Integration

  • TensorIterator is used in ~1/3 of kernels
  • So we need some amount of integration with structured kernels
  • Example: add
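For a TensorIterator op like add, the hand-written pieces are tiny, because the codegen'd meta class derives from TensorIteratorBase and reuses its broadcasting and type-promotion logic. A hedged sketch (not the exact in-tree code; the helper names have changed across versions):

```cpp
// Simplified sketch of add's structured meta/impl (TensorIterator integration).
namespace at { namespace meta {

TORCH_META_FUNC2(add, Tensor) (
    const Tensor& self, const Tensor& other, const Scalar& alpha) {
  // Building the binary op performs broadcasting, dtype promotion, and the
  // set_output() call all at once, via TensorIteratorBase.
  build_binary_op(maybe_get_output(), self, other);
}

}} // namespace at::meta

namespace at { namespace native {

TORCH_IMPL_FUNC(add_out) (
    const Tensor& self, const Tensor& other, const Scalar& alpha,
    const Tensor& result) {
  // The impl just runs the elementwise kernel over the prepared iterator.
  add_stub(device_type(), *this, alpha);
}

}} // namespace at::native
```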

24 of 29

Under the hood: TensorIterator Integration

  • TensorIterator is used in ~1/3 of kernels
  • So we need some amount of integration with structured kernels
  • Example: add

[Diagram: class hierarchy for add, integrated with TensorIterator]
  • at::impl::MetaBase
    • at::TensorIteratorBase (at::TensorIterator also derives from it)
      • at::meta::add_Tensor: implements meta() (you write it)
        • at::native::structured_add_out: implements impl() (you write it)
          • CPU: structured_add_out_out / structured_add_out_functional / structured_add_out_inplace, in build/aten/src/ATen/RegisterCPU.cpp (codegen'd)
          • CUDA: structured_add_out_out / structured_add_out_functional / structured_add_out_inplace, in build/aten/src/ATen/RegisterCUDA.cpp (codegen'd)
          • These leaf classes implement set_output() (codegen'd); set_output() is a method on TensorIteratorBase

25 of 29

Under the hood: Shape Computation

  • Structured kernels already split up the shape computation and the actual kernel
    • To compute only shapes, just don’t call impl()
  • op.meta()
    • Asserts valid inputs
    • Computes output size
    • Creates the output tensor
      • Doesn’t allocate data, though! (unique to the Meta key)
  • In build/aten/src/ATen/RegisterMeta.cpp, the codegen’d wrapper skips the call to op.impl() (see the sketch below)
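Roughly what the Meta-key wrapper amounts to (a simplified sketch, not the generated code verbatim; names are illustrative):

```cpp
// Simplified sketch of the codegen'd wrapper in build/aten/src/ATen/RegisterMeta.cpp.
at::Tensor wrapper_meta_upsample_nearest1d(
    const at::Tensor& input, at::IntArrayRef output_size,
    c10::optional<double> scales) {
  structured_upsample_nearest1d_meta op;  // its set_output() makes a meta tensor
  // Runs the shape checks and creates the output: sizes/strides/dtype are set,
  // but no storage is allocated.
  op.meta(input, output_size, scales);
  // Note: op.impl() is never called, so no kernel computation happens.
  return std::move(op.outputs_[0]);
}
```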

26 of 29

CURRENT STATE

27 of 29

Current State

  • Changes have landed, and a few ops have already been ported!
  • Some ops don’t have out variants, and will remain unstructured
    • Factory ops
    • View ops
  • Currently there is support for CPU, CUDA, and DefaultBackend kernels
    • More eventually (e.g. Quantized)
  • Out-of-tree support for structured kernels is coming eventually
    • Goal: external backends can write their out kernels with the same structured kernel guarantees
      • Output is already allocated and resized, shape checks have passed, device guards set, etc.
    • Can be implemented with another dispatch key (Common)

28 of 29

Current State

  • … But the journey isn’t over (PyTorch has a lot of operators)
    • We’ll probably make a GitHub issue for porting soon

29 of 29

QUESTIONS?