1 of 58

MotherDuck team

(Peter Boncz)

MotherDuck:

DuckDB in the cloud

and in the client

2 of 58

What is DuckDB?

�A lightweight, in-process SQL Analytics Engine that is taking the data world by storm.

3 of 58

4 of 58

A great burger is more than just good beef!

MAKING ANALYTICS EASY

Easy to install / no dependencies

Run anywhere (including the browser)

Query dataframes directly

Friendliest SQL syntax in the world

5 of 58

6 of 58

7 of 58

DuckDB: embedded analytics

  • Created by Hannes Mühleisen and Mark Raasveldt
  • Idea: analytical SQL system as a linkable library
  • From research on data systems support for data science:
    • why don’t data scientists use database systems?

⇒ make database technology better suited for data science

(+ data decentralization, fast encrypted execution, …)

  • Active discord, blog, starting events, traction:
    • >10K github stars, >1M downloads/month (9x increase YoY)
    • DuckDB Labs spin-off (+MotherDuck)

8 of 58

DuckDB - overview

Mark Raasveldt (April 2023)

CMU DUCKDB TALK

9 of 58

DuckDB - Table Storage

Mark Raasveldt (April 2023)

CMU DUCKDB TALK

10 of 58

DuckDB - Compressed Storage

Mark Raasveldt (April 2023)

CMU DUCKDB TALK

11 of 58

DuckDB - vectors

Mark Raasveldt (April 2023)

CMU DUCKDB TALK

12 of 58

DuckDB - vectors

Mark Raasveldt (April 2023)

CMU DUCKDB TALK

13 of 58

DuckDB - compressed vectors

Mark Raasveldt (April 2023)

CMU DUCKDB TALK

14 of 58

DuckDB - vectors

Mark Raasveldt (April 2023)

CMU DUCKDB TALK

15 of 58

DuckDB - Out-of-Core

Mark Raasveldt (April 2023)

CMU DUCKDB TALK

16 of 58

DuckDB - Extensions

Mark Raasveldt (April 2023)

CMU DUCKDB TALK

17 of 58

DuckDB - External Formats

Mark Raasveldt (April 2023)

CMU DUCKDB TALK

18 of 58

DuckDB - Pluggable Catalog

Mark Raasveldt (April 2023)

CMU DUCKDB TALK

19 of 58

DuckDB - Pluggable Filesystem

Mark Raasveldt (April 2023)

CMU DUCKDB TALK

20 of 58

DuckDB - WASM

Mark Raasveldt (April 2023)

CMU DUCKDB TALK

21 of 58

Incubating MotherDuck

22 of 58

Incubating MotherDuck

23 of 58

What is MotherDuck?

�A serverless DuckDB platform for low-cost, low-latency analytics that combines the power of your laptop and the modern cloud.

24 of 58

What is MotherDuck?

�A serverless DuckDB platform for low-cost, low-latency analytics that combines the power of your laptop and the modern cloud.

With Hybrid Query Processing!

25 of 58

What is Hybrid Query Processing?

  • Every client has a DuckDB
    • DuckDB is an embedded DBMS
    • So.. JDBC driver links DuckDB into your application
  • Every DuckDB client can contact MotherDuck
    • install motherduck;
    • load motherduck;
  • Local Databases and Remote Databases
    • Can be queried as one
    • Some execution local, some in the cloud

26 of 58

Why Hybrid Query Processing?

  • Scale Existing applications
    • Enable data science team-work (database sharing)
    • Run workloads in the cloud, put pipelines in production
  • Enable New Applications
    • DuckDB runs as WASM in web page. + Laptops + Mobile apps
      • Ultra-low latency visualization (on local queries)
    • Secure Applications, Multi-Region cloud execution..
  • Reduce Cloud Compute
    • Use local compute resources
    • Efficient single node architecture

27 of 58

MotherDuck Architecture

28 of 58

MotherDuck Architecture

Client Extension

29 of 58

Hybrid Query Processing

local (client)

remote (duckling)

Extension

DuckDB

Parse

Bind

Optimize

Execute

Bind

Optimize

Execute

Execute

results

Parse

Bind

30 of 58

Hybrid Query Processing

local (client)

remote (duckling)

Extension

DuckDB

Parse

Bind

Optimize

Execute

Bind

Optimize

Execute

Execute

results

Parse

Bind

31 of 58

Hybrid Query Processing

local (client)

remote (duckling)

Extension

DuckDB

Parse

Bind

Optimize

Execute

Bind

Optimize

Execute

Execute

results

Parse

FROM parquet_scan(“/home/tab.parquet”) t SELECT sum(t.c)

“recognize tables &

where they

come from”

Bind

32 of 58

Hybrid Query Processing

local (client)

remote (duckling)

Extension

DuckDB

Parse

Bind

Optimize

Execute

Bind

Optimize

Execute

Execute

results

Parse

FROM parquet_scan(“s3://tab.parquet”) t SELECT sum(t.c)

Bind

“recognize tables &

where they

come from”

33 of 58

Hybrid Query Processing

local (client)

remote (duckling)

Extension

DuckDB

Parse

Bind

Optimize

Execute

Bind

Optimize

Execute

Execute

results

Parse

FROM local_db.tab t SELECT sum(t.c)

Bind

“recognize tables &

where they

come from”

34 of 58

Hybrid Query Processing

local (client)

remote (duckling)

Extension

DuckDB

Parse

Bind

Optimize

Execute

Bind

Optimize

Execute

Execute

results

Parse

FROM remote_db.tab t SELECT sum(t.c)

Bind

“recognize tables &

where they

come from”

35 of 58

Hybrid Query Processing

local (client)

remote (duckling)

Extension

DuckDB

Parse

Bind

Optimize

Execute

Bind

Optimize

Execute

Execute

results

Parse

FROM remote_db.tab t SELECT sum(t.c)

Virtual catalog

Bind

“recognize tables &

where they

come from”

36 of 58

Hybrid Query Processing

local (client)

remote (duckling)

Extension

DuckDB

Parse

Bind

Optimize

Execute

Bind

Optimize

Execute

Execute

results

Parse

FROM range(1,100000000) LIMIT 100

“location agnostic” data source

Bind

“recognize tables &

where they

come from”

37 of 58

Query Pipelines

SELECT count(*) FROM t1 JOIN t2 ON t1.a = t2.a JOIN t3 ON t3.b = t2.b

38 of 58

Query Pipelines

SELECT count(*) FROM t1 JOIN t2 ON t1.a = t2.a JOIN t3 ON t3.b = t2.b

aggregate

join

join

scan(t1)

scan(t2)

scan(t3)

39 of 58

Query Pipelines

SELECT count(*) FROM t1 JOIN t2 ON t1.a = t2.a JOIN t3 ON t3.b = t2.b

aggregate

join

join

scan(t1)

scan(t2)

scan(t3)

40 of 58

Hybrid Query Processing

local (client)

remote (duckling)

Extension

DuckDB

Parse

Bind

Optimize

Execute

Bind

Optimize

Execute

Execute

results

Parse

“decide which pipeline to run where”

Bind

41 of 58

Local-Remote Planning

SELECT count(*) FROM t1 JOIN t2 ON t1.a = t2.a JOIN t3 ON t3.b = t2.b

aggregate

join

join

scan(t1)

scan(t2)

scan(t3)

42 of 58

Local-Remote Planning

SELECT count(*) FROM t1 JOIN t2 ON t1.a = t2.a JOIN t3 ON t3.b = t2.b

SELECT count(*) FROM t1 JOIN t2 ON t1.a = t2.a JOIN t3 ON t3.b = t2.b

aggregate

join

join

scan(t1)

scan(t2)

scan(t3)

local

remote

agnostic

43 of 58

Local-Remote Planning

SELECT count(*) FROM t1 JOIN t2 ON t1.a = t2.a JOIN t3 ON t3.b = t2.b

SELECT count(*) FROM t1 JOIN t2 ON t1.a = t2.a JOIN t3 ON t3.b = t2.b

aggregate

join

join

scan(t1)

scan(t2)

scan(t3)

44 of 58

Bridge Operators

SELECT count(*) FROM t1 JOIN t2 ON t1.a = t2.a JOIN t3 ON t3.b = t2.b

SELECT count(*) FROM t1 JOIN t2 ON t1.a = t2.a JOIN t3 ON t3.b = t2.b

aggregate

join

join

scan(t1)

scan(t2)

scan(t3)

bridge operator

45 of 58

Hybrid Query Processing

local (client)

remote (duckling)

Extension

DuckDB

Parse

Bind

Optimize

Execute

Bind

Optimize

Execute

Execute

results

Parse

“bridge operators transport data in flow”

Bind

46 of 58

MotherDuck Architecture

Storage Extension

47 of 58

DuckDB Storage

Commit() => Write Ahead Log => Checkpoint() => Database File

DuckDB

WAL

Database File

many small appends

random writes

48 of 58

MotherDuck Storage

Mapping the same database format on cloud resources..

DuckDB

WAL

Database File

many small appends

random writes

WAL

Database File

(FUSE)

S3

EFS

49 of 58

Differential Storage

50 of 58

Read

51 of 58

Write

DuckDB:

  • Range 1: 200 bytes from [400, 600]
  • Range 2: 100 bytes from [0, 100]
  • Range 3: 300 bytes from [1000, 1300]
  • Range 4: 50 bytes from [575, 625]

MotherDuck:

  • Range 1: 200 bytes from [0, 200]
  • Range 2: 100 bytes from [200, 300]
  • Range 3: 300 bytes from [300, 600]
  • Range 4: 50 bytes from [600,650]

52 of 58

Checkpoint

DuckDB:

  • Write all dirty pages and page meta-data to the database file
  • Delete WAL

MotherDuck:

  • Migrate active snapshot layer to S3 (background)
  • Create new empty active snapshot layer

53 of 58

Snapshot (“time travel”):

Fork:

54 of 58

MotherDuck Summary

  • Hybrid query processing
    • Remote-local planning, extension modules
    • Container-based scaling up, and down to 0
  • Cloud Storage
    • Scalable Cloud storage, decoupled from Compute
    • Coming: Immutable storage & time travel with DuckDB databases
  • User Interface
    • Tad: Notebook++ (DuckDB-WASM powered): Local pivot table

55 of 58

Research Opportunities

  • Query Optimization
    • Cost based local-remote planning ⇒ asymmetrical client vs server
    • Driving cloud scheduling for query balance
  • Taking Hybrid query processing multi-region
    • Client ⇒ server1 ⇒ server2 ⇒ ..
  • Different Parties owning Client and Server
    • Privacy-preserving decentralized architectures
  • AI features
    • Inside the Database Service and the user interfaces
    • AI-powered data science functionalities
    • Making use of local and server-side GPU resources

56 of 58

57 of 58

Thank You!

58 of 58

CWI Database Architectures (DA)

Dutch Seminar on Data Systems Design (with UvA, TU/e, TuD)

subscribe: https://dsdsd.da.cwi.nl

By Fall 2024 and 2025 we will be looking for motivated MSc students (final projects)

  • CWI, Databricks, DuckDB Labs & MotherDuck

See: https://bit.ly/cwida-msc-2023 (current cohort topics)