MotherDuck team
(Peter Boncz)
MotherDuck:
DuckDB in the cloud
and in the client
What is DuckDB?
�A lightweight, in-process SQL Analytics Engine that is taking the data world by storm.
A great burger is more than just good beef!
MAKING ANALYTICS EASY
Easy to install / no dependencies
Run anywhere (including the browser)
Query dataframes directly
Friendliest SQL syntax in the world
DuckDB: embedded analytics
⇒ make database technology better suited for data science
(+ data decentralization, fast encrypted execution, …)
DuckDB - overview
Mark Raasveldt (April 2023)
CMU DUCKDB TALK
DuckDB - Table Storage
Mark Raasveldt (April 2023)
CMU DUCKDB TALK
DuckDB - Compressed Storage
Mark Raasveldt (April 2023)
CMU DUCKDB TALK
DuckDB - vectors
Mark Raasveldt (April 2023)
CMU DUCKDB TALK
DuckDB - vectors
Mark Raasveldt (April 2023)
CMU DUCKDB TALK
DuckDB - compressed vectors
Mark Raasveldt (April 2023)
CMU DUCKDB TALK
DuckDB - vectors
Mark Raasveldt (April 2023)
CMU DUCKDB TALK
DuckDB - Out-of-Core
Mark Raasveldt (April 2023)
CMU DUCKDB TALK
DuckDB - Extensions
Mark Raasveldt (April 2023)
CMU DUCKDB TALK
DuckDB - External Formats
Mark Raasveldt (April 2023)
CMU DUCKDB TALK
DuckDB - Pluggable Catalog
Mark Raasveldt (April 2023)
CMU DUCKDB TALK
DuckDB - Pluggable Filesystem
Mark Raasveldt (April 2023)
CMU DUCKDB TALK
DuckDB - WASM
Mark Raasveldt (April 2023)
CMU DUCKDB TALK
Incubating MotherDuck
Incubating MotherDuck
What is MotherDuck?
�A serverless DuckDB platform for low-cost, low-latency analytics that combines the power of your laptop and the modern cloud.
What is MotherDuck?
�A serverless DuckDB platform for low-cost, low-latency analytics that combines the power of your laptop and the modern cloud.
With Hybrid Query Processing!
What is Hybrid Query Processing?
Why Hybrid Query Processing?
MotherDuck Architecture
MotherDuck Architecture
Client Extension
Hybrid Query Processing
local (client)
remote (duckling)
Extension
DuckDB
Parse
Bind
Optimize
Execute
Bind
Optimize
Execute
Execute
results
Parse
Bind
Hybrid Query Processing
local (client)
remote (duckling)
Extension
DuckDB
Parse
Bind
Optimize
Execute
Bind
Optimize
Execute
Execute
results
Parse
Bind
Hybrid Query Processing
local (client)
remote (duckling)
Extension
DuckDB
Parse
Bind
Optimize
Execute
Bind
Optimize
Execute
Execute
results
Parse
FROM parquet_scan(“/home/tab.parquet”) t SELECT sum(t.c)
“recognize tables &
where they
come from”
Bind
Hybrid Query Processing
local (client)
remote (duckling)
Extension
DuckDB
Parse
Bind
Optimize
Execute
Bind
Optimize
Execute
Execute
results
Parse
FROM parquet_scan(“s3://tab.parquet”) t SELECT sum(t.c)
Bind
“recognize tables &
where they
come from”
Hybrid Query Processing
local (client)
remote (duckling)
Extension
DuckDB
Parse
Bind
Optimize
Execute
Bind
Optimize
Execute
Execute
results
Parse
FROM local_db.tab t SELECT sum(t.c)
Bind
“recognize tables &
where they
come from”
Hybrid Query Processing
local (client)
remote (duckling)
Extension
DuckDB
Parse
Bind
Optimize
Execute
Bind
Optimize
Execute
Execute
results
Parse
FROM remote_db.tab t SELECT sum(t.c)
Bind
“recognize tables &
where they
come from”
Hybrid Query Processing
local (client)
remote (duckling)
Extension
DuckDB
Parse
Bind
Optimize
Execute
Bind
Optimize
Execute
Execute
results
Parse
FROM remote_db.tab t SELECT sum(t.c)
Virtual catalog
Bind
“recognize tables &
where they
come from”
Hybrid Query Processing
local (client)
remote (duckling)
Extension
DuckDB
Parse
Bind
Optimize
Execute
Bind
Optimize
Execute
Execute
results
Parse
FROM range(1,100000000) LIMIT 100
“location agnostic” data source
Bind
“recognize tables &
where they
come from”
Query Pipelines
SELECT count(*) FROM t1 JOIN t2 ON t1.a = t2.a JOIN t3 ON t3.b = t2.b
Query Pipelines
SELECT count(*) FROM t1 JOIN t2 ON t1.a = t2.a JOIN t3 ON t3.b = t2.b
aggregate
join
join
scan(t1)
scan(t2)
scan(t3)
Query Pipelines
SELECT count(*) FROM t1 JOIN t2 ON t1.a = t2.a JOIN t3 ON t3.b = t2.b
aggregate
join
join
scan(t1)
scan(t2)
scan(t3)
Hybrid Query Processing
local (client)
remote (duckling)
Extension
DuckDB
Parse
Bind
Optimize
Execute
Bind
Optimize
Execute
Execute
results
Parse
“decide which pipeline to run where”
Bind
Local-Remote Planning
SELECT count(*) FROM t1 JOIN t2 ON t1.a = t2.a JOIN t3 ON t3.b = t2.b
aggregate
join
join
scan(t1)
scan(t2)
scan(t3)
Local-Remote Planning
SELECT count(*) FROM t1 JOIN t2 ON t1.a = t2.a JOIN t3 ON t3.b = t2.b
SELECT count(*) FROM t1 JOIN t2 ON t1.a = t2.a JOIN t3 ON t3.b = t2.b
aggregate
join
join
scan(t1)
scan(t2)
scan(t3)
local
remote
agnostic
Local-Remote Planning
SELECT count(*) FROM t1 JOIN t2 ON t1.a = t2.a JOIN t3 ON t3.b = t2.b
SELECT count(*) FROM t1 JOIN t2 ON t1.a = t2.a JOIN t3 ON t3.b = t2.b
aggregate
join
join
scan(t1)
scan(t2)
scan(t3)
Bridge Operators
SELECT count(*) FROM t1 JOIN t2 ON t1.a = t2.a JOIN t3 ON t3.b = t2.b
SELECT count(*) FROM t1 JOIN t2 ON t1.a = t2.a JOIN t3 ON t3.b = t2.b
aggregate
join
join
scan(t1)
scan(t2)
scan(t3)
bridge operator
Hybrid Query Processing
local (client)
remote (duckling)
Extension
DuckDB
Parse
Bind
Optimize
Execute
Bind
Optimize
Execute
Execute
results
Parse
“bridge operators transport data in flow”
Bind
MotherDuck Architecture
Storage Extension
DuckDB Storage
Commit() => Write Ahead Log => Checkpoint() => Database File
DuckDB
WAL
Database File
many small appends
random writes
MotherDuck Storage
Mapping the same database format on cloud resources..
DuckDB
WAL
Database File
many small appends
random writes
WAL
Database File
(FUSE)
S3
EFS
Differential Storage
Read
Write
DuckDB:
MotherDuck:
Checkpoint
DuckDB:
MotherDuck:
Snapshot (“time travel”):
Fork:
MotherDuck Summary
Research Opportunities
Thank You!
CWI Database Architectures (DA)
Dutch Seminar on Data Systems Design (with UvA, TU/e, TuD)
subscribe: https://dsdsd.da.cwi.nl
By Fall 2024 and 2025 we will be looking for motivated MSc students (final projects)
See: https://bit.ly/cwida-msc-2023 (current cohort topics)
�