1 of 17

Substrait    

Engine & Language Independent Data Compute Instructions

2 of 17

Who?

OSS

  • Substrait: Co-creator, SMC member
  • Apache Arrow: Co-creator, Founding PMC Chair
  • Apache Calcite: Founding PMC member
  • Apache Drill: Co-creator, Founding PMC Chair

Commercial

  • Sundeck: CEO & Co-founder
  • Dremio: CTO & Co-founder

3 of 17

Warehouse, Lakehouse, soon we’ll see the Fairhouse

Fairhouse

Lakehouse

Query Engines

APIs

Tables & Catalogs

Cloud Warehouse

SQL API

Processing Engine

Storage (internal)

Spark API/SQL

Spark Engine

2015

2020

Delta Lake

Apache Hudi

Compute Engines

Spark API/SQL

Spark Engine

Snowflake SQL

Snowflake Engine

Redshift SQL

Dremio Engine

Velox Engine

Arrow Engine

Pandas

Ibis

Tables & Catalogs

Apache Iceberg

Delta Lake

Hudi

Tabular

Datalake Formation

2025

  • Built on cloud storage
  • Unlimited Scale

Snowflake, Redshift

Dremio SQL

Dremio Engine

Snowflake SQL

Snowflake Engine

Apache Iceberg

  • Shared data ownership
  • API choice independent of storage format
  • Reduction in data gravity

Databricks, Dremio, Starburst

Warehouse Appliance

SQL API

Processing Engine

Storage (internal)

  • Specialized hardware
  • Designed specifically for analytical use

Teradata, Netezza

2000

  • Rise of the generic engine
  • Engine options independent of api choice

4 of 17

Best-of-breed Decomposition Requires Components

API

Engine

API

Engine

Compiler

Storage

Storage

actually more like this…

Computation

Data

Data

Instruction

Instruction

Instruction

5 of 17

Substrait

How to collaborate on these layers?

API

Engine

Compiler

Storage

Computation

Data

Data

Instruction

Instruction

Instruction

  • API independent computation definition
  • Engine independent compilers
  • Engine independent computational storage
  • High performance independent in-memory format
  • In-engine optimized wire-friendly representation

6 of 17

The Power of Intermediate Representations

JVM Bytecode

LLVM IR

Substrait

Abstract Level

Low

Low

High

FE Innovations

Scala, Clojure, Kotlin

Rust, Swift

New Languages, abstractions

BE Innovations

Dalvik, Graal

WASM

Computational storage, hardware accelerators, specialized engines

7 of 17

Substrait: Cross-Language Serialization for Relational Algebra

Status

  • Formed September 2021
  • Several integrations ongoing, 40+ contributors from multiple companies

Purpose

  • Create a well-defined, cross-language specification for data compute operations

Why

  • New kernels/engines should work with existing analysis experiences
  • It should be easy to create new computation design languages/platforms
  • Innovation is stifled when each new data system needs to solve all FE and BE problems

8 of 17

Theoretical Integrations

substrait

C++ Kernel

ibis

python

r

babel sql

views

duck sql

scala

9 of 17

Principles

  • Specification-first
  • Language independent & serializable
  • Plans are self-contained, have clear intention
    • Allow for dumb consumers
  • Structured hierarchy of primitives
  • Common primitive definitions within project
    • Common types, functions, relational operators
  • Extensibility with discipline

10 of 17

Primitives

  • Types
    • Simple (e.g. i32, fp32, string)
    • Compound (e.g. varchar<N>, fixedbinary<N>)
    • Complex (e.g. List<E>, Map<K,V>, Struct<T,U,...Z>)
  • Expressions
    • Switch statements
    • Field selection (simple and complex)
    • Literals
  • Functions
    • Scalar
    • Aggregate
    • Window
    • Table
  • Relations
    • Production
    • Consumption
    • Distribution
    • Transformation
  • Plans
    • Splittable
    • Normalized for space efficiency
  • Serialization
    • Binary (currently protobuf)
    • Text (tbd, likely yaml)

Plan

Relations

Types

i8

i32

fp32

struct

Expressions

Functions

add

avg

rank

sum

join

agg

filter

read

case

field

literal

orlist

Serialization

protobuf

text

11 of 17

Components

  • Specification/Format
  • Language specific helper libraries (Java, C++, C#, Go, Rust)
  • Plan Validator
  • Integration Tests
  • Network Protocol (On top of Arrow Flight SQL)
  • Integations
    • Next page…

12 of 17

Extensibility with Discipline

  • Project inclusive of patterns that show up in most projects
    • int32, decimal, add, subtract, aggregate, join, hash join, etc.
  • Specification defined extensibility
    • Separation between optimization and semantic differences
    • Well-defined ways to sync independent systems around extensions
  • Types
    • Support for physical variations of existing types (row-wise vs columnar, rle or not, etc)
    • Declare custom types in YAML and use in functions, expressions, etc.
  • Functions
    • Declare custom functions via YAML, standard referencing scheme
    • User-defined functions (write once, use many times)
    • Embedded (business logic closure in scala, python, llvm, webassembly)
  • Relations
    • Extend existing relations for execution optimization information
    • Declare new relations via serialization extensions (such as protobuf
    • User-defined and Embedded patterns

13 of 17

No Formal Concept of "logical" vs "physical"

Project/User Definitions are inconsistent.

There are just things that are closer to and further from implementation

More General

More specific

Gluten

14 of 17

Project Governance

  • Similar to Apache
  • Consensus-driven
  • SMC & Committers
  • All work and decision-making is done in the public (no private lists)
  • Avoid control by any particular organization or person (part of governance)
  • Apache 2.0 Licensed
  • Github Project
  • Contributions from several companies

substrait.io/governance/

15 of 17

Theoretical Integrations

substrait

C++ Kernel

ibis

python

r

babel sql

views

duck sql

scala

16 of 17

Actual Implementations

substrait

C++ Kernel

ibis

python

r

babel sql

views

duck sql

scala

1

2

3

8

4

8

5

4

6

7

17 of 17

Join the Community

  • Start using Substrait
  • Join the community
  • Share your feedback
  • Substrait Slack
    • Subscribe
  • Github