1 of 20

The Types

Piotr Findeisen @SDF

2 of 20

Agenda

  • argument from authority (aka "about me")
  • type systems
  • Apache Arrow types & functions
  • curr: Apache DataFusion types & functions
  • next: The Types - the future


3 of 20

About me

  • too many years to count working with relational databases
  • core Trino maintainer since '17 & Starburst founding engineer
    • by the numbers, top contributor & top maintainer
    • timestamps, time zones, nulls, NaNs, decimals, domains, predicate derivation
    • type mappings in SQL connectors (Trino ↔ other systems)
    • predicate pushdowns, connectors, SPI, DSL for pushdown into SQL databases
    • memory management & revocable memory, spill, bytecode generation (JIT)
    • data lakes: Hive, Delta, Iceberg
  • Apache Iceberg committer, Apache DataFusion contributor
  • building a multi-dialect SQL compiler, analysis, and execution toolchain at SDF on top of DataFusion


4 of 20

What is a type system?

  • what is the value?
    • e.g. integer numbers, floats, strings, date/time, collections
  • what is a valid value?
    • e.g. -1 is not a valid unsigned integer value, 2024-09-37 is not a valid date, NaN is not a valid integer
  • relations between types
    • inheritance, polymorphism, implicit coercions?
  • can be abstract, but usually defines the value representation in memory
  • part of a bigger picture: it defines what can be done with a value


5 of 20

What do we need a type system for?

  • necessary building block for all programming languages
    • values, variables, function parameters have types
  • if f is a function and x is a variable, is f(x) valid?
    • type checking: is "ten" * 10 valid?
    • f(x) may be valid, but only after "adjusting" x's type (coercion)
    • static vs dynamic languages
  • overload resolution
    • e.g. how to display an x?
    • static languages only


functions!
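
A minimal Rust sketch of the points above (the show helper is made up for the example): type checking rejects "ten" * 10 at compile time, a cast plays the role of a coercion that makes f(x) valid, and the implementation used to display a value is picked from its static type.

use std::fmt::Display;

// Type checking: `"ten" * 10` is rejected at compile time in Rust;
// a dynamic language would only fail (or do something surprising) at runtime.

// f(x) may become valid only after "adjusting" x's type (coercion).
fn f(x: i64) -> i64 {
    x + 1
}

// Overload resolution in a static language: how to display an x is decided
// from the static type of x.
fn show<T: Display>(x: T) -> String {
    format!("{x}")
}

fn main() {
    let x: i32 = 10;
    // Rust requires an explicit i32 -> i64 cast; SQL engines insert such
    // coercions implicitly during analysis.
    println!("{}", f(x as i64));
    println!("{}", show(42));    // uses Display for i32
    println!("{}", show("ten")); // uses Display for &str
}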


6 of 20

f(x)

  • types: the x, the data
  • functions: the f, the calculus


7 of 20

Arrow types

  • Arrow is a columnar memory format
    • … for in-memory data processing, optimized, vectorized, cross-language, with well-defined serialization, along with an ecosystem of tools
    • everything is a vector
  • Example: a collection of English words can be stored as a StringArray, LargeStringArray, StringViewArray, DictionaryArray<Key>, or RunArray<Key> (see the sketch after this list)
    • LargeStringViewArray, RunArray for the large string variant - to be added?
  • Arrow DataType is… the type of a vector. Each of the above has a different type
    • denotes the element's logical value (e.g. string) + the physical representation of the vector (collection)
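
For illustration, a small sketch using the arrow crate (assuming a recent arrow-rs release that includes StringViewArray): the same logical content, three different vector types, three different DataTypes.

use arrow::array::{Array, DictionaryArray, StringArray, StringViewArray};
use arrow::datatypes::{DataType, Int32Type};

fn main() {
    let words = vec!["the", "quick", "brown", "the", "fox"];

    // Same logical content (English words), different vector types / encodings:
    let plain = StringArray::from(words.clone());
    let view = StringViewArray::from(words.clone());
    let dict: DictionaryArray<Int32Type> = words.into_iter().collect();

    // Each vector type reports a different Arrow DataType:
    assert_eq!(plain.data_type(), &DataType::Utf8);
    assert_eq!(view.data_type(), &DataType::Utf8View);
    assert_eq!(
        dict.data_type(),
        &DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8))
    );
}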


8 of 20

Arrow functions

  • Arrow provides a rich library of functions that operate on vectors ("kernels"), e.g. the add kernel below
  • dynamic typing!


/// Perform `lhs + rhs`, returning an error on overflow
pub fn add(lhs: &dyn Datum, rhs: &dyn Datum) -> Result<ArrayRef, ArrowError> {
    arithmetic_op(Op::Add, lhs, rhs)
}
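
A possible call site for this kernel, sketched against the arrow crate's arrow::compute::kernels::numeric module (assuming a recent arrow-rs release): argument types are inspected at runtime, and the result comes back type-erased.

use arrow::array::{Array, Int32Array, Scalar};
use arrow::compute::kernels::numeric::add;
use arrow::error::ArrowError;

fn main() -> Result<(), ArrowError> {
    let a = Int32Array::from(vec![1, 2, 3]);
    let b = Int32Array::from(vec![10, 20, 30]);

    // Array + array: argument types are checked at runtime ("dynamic typing"),
    // the result is a type-erased ArrayRef.
    let sum = add(&a, &b)?;
    let sum = sum.as_any().downcast_ref::<Int32Array>().unwrap();
    assert_eq!(sum.value(0), 11);

    // Array + constant: a constant is passed as a one-element Scalar Datum.
    let plus_one = add(&a, &Scalar::new(Int32Array::from(vec![1])))?;
    assert_eq!(plus_one.len(), 3);
    Ok(())
}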


9 of 20

SQL types

  • SQL is "relational algebra" - operations on multisets of tuples (rows)
    • columnar by nature!
  • the relation is the first-class citizen, and the table schema is the relation type: R(a, b, c)
  • rows hold values: relation type R(a, b, c) holds tuples consisting of a, b and c
    • values have statically known types


10 of 20

SQL functions

  • projections, predicates, joins are defined in terms of row-level operations
  • expression language
    • functions defined directly on scalar values: = || + - * / CAST concat substring
    • aggregate functions defined on abstract collections of scalar values: min max sum
    • static typing


11 of 20

Apache DataFusion

"SQL query engine built on Apache Arrow"�


12 of 20

curr: Apache DataFusion

  • "SQL query engine built on Apache Arrow"�
  • DF execution is columnar & uses Arrow, Arrow types are vectors (columns)
    • match made in heaven
  • DataFusion types = Apache Arrow types
    • neat!
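
A quick sketch of what that means in code (assuming a recent DataFusion release; DFSchema accessors have shifted between versions): the logical schema is built straight from Arrow fields, so a column's "logical" type is literally an Arrow DataType.

use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::common::{DFSchema, Result};

fn main() -> Result<()> {
    // Logical planning uses the Arrow schema as-is:
    let arrow_schema = Schema::new(vec![
        Field::new("word", DataType::Utf8, true),
        Field::new("count", DataType::Int64, false),
    ]);
    let df_schema = DFSchema::try_from(arrow_schema)?;

    // The "type" of a column in a logical plan is an Arrow DataType:
    assert_eq!(df_schema.field(0).data_type(), &DataType::Utf8);
    Ok(())
}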


13 of 20

Problems? - too much

The Arrow type system represents what the value is (and how it's stored right now)

The SQL type system represents what a value can be (storage-agnostic)


14 of 20

Problems? - too little

  • DataFusion lacks types because Arrow lacks them!
    • char(n)
    • varchar(n)
    • timestamp with time zone (that's my favorite)
    • timestamp with local time zone (point in time without zone information)
    • time with time zone (SQL's biggest abomination)
    • JSON, VARIANT, Geometry, HLL, digests
    • extensions for applications building on top of DF, including user-defined types (UDTs)


15 of 20

Problems? - complexity

  • DataFusion has an expression language for projections & filters, represented by the Expr enum
  • expressions (Like, IsNotNull, Cast, Case, …) are defined with row-oriented semantics, but they have to discern between vector representations
  • it's not possible to simply define a function as e.g. regexp_match(string, string)
    • the function implementor needs to explicitly support 5 different kinds of array encodings (sketched after this list)
  • constants have a type like everything else, and it's inefficient to represent them as arrays ⇒ ScalarValue (also 5 different kinds)
    • the function implementor needs to explicitly support different representations for constants vs arrays, e.g. regexp_match needs to accept 10 × 10 = 100 argument type combinations
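
A simplified sketch of that dispatch (assuming a recent DataFusion release; describe_string_arg is an illustrative helper, not DataFusion API). Every encoding needs its own arm, and constants arrive as ScalarValue rather than arrays; with two string arguments the arms multiply out to the 10 × 10 explosion above.

use datafusion::arrow::datatypes::DataType;
use datafusion::common::ScalarValue;
use datafusion::logical_expr::ColumnarValue;

// What "a string argument" can look like when a UDF is invoked today:
fn describe_string_arg(arg: &ColumnarValue) -> Result<&'static str, String> {
    match arg {
        // every physical encoding of a string column needs its own arm ...
        ColumnarValue::Array(array) => match array.data_type() {
            DataType::Utf8 => Ok("flat string array"),
            DataType::LargeUtf8 => Ok("large string array"),
            DataType::Utf8View => Ok("string view array"),
            DataType::Dictionary(_, _) => Ok("dictionary-encoded strings"),
            DataType::RunEndEncoded(_, _) => Ok("run-end-encoded strings"),
            other => Err(format!("unsupported array encoding: {other}")),
        },
        // ... and constants show up as ScalarValue variants, not arrays
        ColumnarValue::Scalar(ScalarValue::Utf8(_)) => Ok("string constant"),
        ColumnarValue::Scalar(ScalarValue::LargeUtf8(_)) => Ok("large string constant"),
        ColumnarValue::Scalar(ScalarValue::Utf8View(_)) => Ok("string view constant"),
        ColumnarValue::Scalar(other) => Err(format!("unsupported scalar: {other}")),
    }
}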


16 of 20

Problems? - overhead

  • relational operations
    • these could benefit from a statically-known columnar runtime representation, right?
  • table scan
    • e.g. each Parquet file can have a different encoding for a particular column (dictionary or not); being able to funnel the different representations downstream would be more efficient
  • UNION
    • if sources produce different representations of a string column (e.g. unioning a table with VALUES), is it necessary to convert them to a common representation?
  • run-time adaptive processing
    • "it seemed the customer table would use a flat representation, but there happen to be only a few distinct values in the country column"


17 of 20

🤔

SQL is a statically typed, abstract language.

Should a SQL query engine use a statically typed, physical representation?


18 of 20

next: Apache DataFusion - The Types

  • the DataFusion type system needs to express more SQL types (expressivity!) and allow extensions for applications building on top of DataFusion
    • ⇒ the new (logical | DF-native) type system - The Types
  • Logical plans need to use The Types
    • functions need to bind to The Types (simplicity!)
    • the expression language needs to use them as well
  • Physical plans could use Arrow data types
    • but that would prevent runtime adaptivity
    • ⇒ should use The Types (performance!)


19 of 20

Logical → Physical

  • a Type must map to physical Arrow representations (sketched below)
    • e.g. integer → DataType::Int32; JSON → DataType::Utf8
  • any of these can be RLE/REE or Dictionary at runtime
  • make implementing functions sane
    • boring physical representation should be abstracted away
      • e.g. for dictionary data, calculating f(x) directly on the dictionary is always a good idea, no need to bother f with that
      • null-checks are boring too (and very boring to do optimally)
    • the representation of x, as seen by f, should be as abstract & simple as possible
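
An illustrative sketch only (neither LogicalType nor its variants are existing DataFusion API): a logical type declares a default Arrow representation, while the runtime stays free to use dictionary or REE encodings for the same logical type.

use arrow::datatypes::{DataType, TimeUnit};

// Hypothetical logical types ("The Types"); names are illustrative only.
enum LogicalType {
    Integer,
    Varchar(Option<u32>),
    Json,
    TimestampWithTimeZone,
}

impl LogicalType {
    /// Default physical Arrow representation. At runtime the same logical type
    /// may also arrive dictionary- or run-end-encoded; functions shouldn't care.
    fn default_physical(&self) -> DataType {
        match self {
            LogicalType::Integer => DataType::Int32,
            LogicalType::Varchar(_) | LogicalType::Json => DataType::Utf8,
            LogicalType::TimestampWithTimeZone => {
                DataType::Timestamp(TimeUnit::Microsecond, Some("UTC".into()))
            }
        }
    }
}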


20 of 20

Challenges
