1 of 58

MotherDuck team

(Peter Boncz)

MotherDuck:

DuckDB in the cloud

and in the client

2 of 58

What is DuckDB?

�A lightweight, in-process SQL Analytics Engine that is taking the data world by storm.

3 of 58

4 of 58

A great burger is more than just good beef!

MAKING ANALYTICS EASY

Easy to install / no dependencies

Run anywhere (including the browser)

Query dataframes directly

Friendliest SQL syntax in the world

5 of 58

6 of 58

7 of 58

DuckDB: embedded analytics

Created by Hannes Mühleisen and Mark Raasveldt
Idea: analytical SQL system as a linkable library
From research on data systems support for data science:

why don’t data scientists use database systems?

⇒ make database technology better suited for data science

(+ data decentralization, fast encrypted execution, …)

Active discord, blog, starting events, traction:

>10K github stars, >1M downloads/month (9x increase YoY)
DuckDB Labs spin-off (+MotherDuck)

8 of 58

DuckDB - overview

Mark Raasveldt (April 2023)

CMU DUCKDB TALK

9 of 58

DuckDB - Table Storage

Mark Raasveldt (April 2023)

CMU DUCKDB TALK

10 of 58

DuckDB - Compressed Storage

Mark Raasveldt (April 2023)

CMU DUCKDB TALK

11 of 58

DuckDB - vectors

Mark Raasveldt (April 2023)

CMU DUCKDB TALK

12 of 58

DuckDB - vectors

Mark Raasveldt (April 2023)

CMU DUCKDB TALK

13 of 58

DuckDB - compressed vectors

Mark Raasveldt (April 2023)

CMU DUCKDB TALK

14 of 58

DuckDB - vectors

Mark Raasveldt (April 2023)

CMU DUCKDB TALK

15 of 58

DuckDB - Out-of-Core

Mark Raasveldt (April 2023)

CMU DUCKDB TALK

16 of 58

DuckDB - Extensions

Mark Raasveldt (April 2023)

CMU DUCKDB TALK

17 of 58

DuckDB - External Formats

Mark Raasveldt (April 2023)

CMU DUCKDB TALK

18 of 58

DuckDB - Pluggable Catalog

Mark Raasveldt (April 2023)

CMU DUCKDB TALK

19 of 58

DuckDB - Pluggable Filesystem

Mark Raasveldt (April 2023)

CMU DUCKDB TALK

20 of 58

DuckDB - WASM

Mark Raasveldt (April 2023)

CMU DUCKDB TALK

21 of 58

Incubating MotherDuck

22 of 58

Incubating MotherDuck

23 of 58

What is MotherDuck?

�A serverless DuckDB platform for low-cost, low-latency analytics that combines the power of your laptop and the modern cloud.

24 of 58

What is MotherDuck?

�A serverless DuckDB platform for low-cost, low-latency analytics that combines the power of your laptop and the modern cloud.

With Hybrid Query Processing!

25 of 58

What is Hybrid Query Processing?

Every client has a DuckDB

DuckDB is an embedded DBMS
So.. JDBC driver links DuckDB into your application

Every DuckDB client can contact MotherDuck

install motherduck;
load motherduck;

Local Databases and Remote Databases

Can be queried as one
Some execution local, some in the cloud

26 of 58

Why Hybrid Query Processing?

Scale Existing applications

Enable data science team-work (database sharing)
Run workloads in the cloud, put pipelines in production

Enable New Applications

DuckDB runs as WASM in web page. + Laptops + Mobile apps

Ultra-low latency visualization (on local queries)

Secure Applications, Multi-Region cloud execution..

Reduce Cloud Compute

Use local compute resources
Efficient single node architecture

We think there are cool use cases for Hybrid Execution:

It allows existing data science users of DuckDB to scale their applications: start to share databases in a team, rather than work alone on a laptop. Deploy data pipelines that you designed on your laptop to the cloud, to takes these pipelines into production
It allows new classes of applications. DuckDB does not have any software dependencies. Therefore it works in Web Assembly (wasm). So it can run in the browser. And there, it can power interactive visualizations: because queries do not require a network round-trip and run in a millisecond. But we can also envision other applications, where encrypted data is decrypted only on the client. Or applications where the client is a database owned by the user. Or multi-region applications with multiple servers performing hybrid execution placed in multiple clouds, to reduce cloud egress cost by running queries where the data is.
DuckDB is a much more an efficient architecture than Spark: it allows to do more with less hardware. And it makes use of your local hardware where possible. This adds to significantly lower cloud usage.

27 of 58

MotherDuck Architecture

28 of 58

MotherDuck Architecture

Client Extension

29 of 58

Hybrid Query Processing

local (client)

remote (duckling)

Extension

DuckDB

Parse

Bind

Optimize

Execute

Bind

Optimize

Execute

results

Parse

Bind

30 of 58

Hybrid Query Processing

local (client)

remote (duckling)

Extension

DuckDB

Parse

Bind

Optimize

Execute

Bind

Optimize

Execute

results

Parse

Bind

31 of 58

Hybrid Query Processing

local (client)

remote (duckling)

Extension

DuckDB

Parse

Bind

Optimize

Execute

Bind

Optimize

Execute

results

Parse

FROM parquet_scan(“/home/tab.parquet”) t SELECT sum(t.c)

“recognize tables &

where they

come from”

Bind

32 of 58

Hybrid Query Processing

local (client)

remote (duckling)

Extension

DuckDB

Parse

Bind

Optimize

Execute

Bind

Optimize

Execute

results

Parse

FROM parquet_scan(“s3://tab.parquet”) t SELECT sum(t.c)

Bind

“recognize tables &

where they

come from”

33 of 58

Hybrid Query Processing

local (client)

remote (duckling)

Extension

DuckDB

Parse

Bind

Optimize

Execute

Bind

Optimize

Execute

results

Parse

FROM local_db.tab t SELECT sum(t.c)

Bind

“recognize tables &

where they

come from”

34 of 58

Hybrid Query Processing

local (client)

remote (duckling)

Extension

DuckDB

Parse

Bind

Optimize

Execute

Bind

Optimize

Execute

results

Parse

FROM remote_db.tab t SELECT sum(t.c)

Bind

“recognize tables &

where they

come from”

35 of 58

Hybrid Query Processing

local (client)

remote (duckling)

Extension

DuckDB

Parse

Bind

Optimize

Execute

Bind

Optimize

Execute

results

Parse

FROM remote_db.tab t SELECT sum(t.c)

Virtual catalog

Bind

“recognize tables &

where they

come from”

36 of 58

Hybrid Query Processing

local (client)

remote (duckling)

Extension

DuckDB

Parse

Bind

Optimize

Execute

Bind

Optimize

Execute

results

Parse

FROM range(1,100000000) LIMIT 100

“location agnostic” data source

Bind

“recognize tables &

where they

come from”

37 of 58

Query Pipelines

SELECT count(*) FROM t1 JOIN t2 ON t1.a = t2.a JOIN t3 ON t3.b = t2.b

38 of 58

Query Pipelines

SELECT count(*) FROM t1 JOIN t2 ON t1.a = t2.a JOIN t3 ON t3.b = t2.b

aggregate

join

scan(t1)

scan(t2)

scan(t3)

39 of 58

Query Pipelines

SELECT count(*) FROM t1 JOIN t2 ON t1.a = t2.a JOIN t3 ON t3.b = t2.b

aggregate

join

scan(t1)

scan(t2)

scan(t3)

40 of 58

Hybrid Query Processing

local (client)

remote (duckling)

Extension

DuckDB

Parse

Bind

Optimize

Execute

Bind

Optimize

Execute

results

Parse

“decide which pipeline to run where”

Bind

41 of 58

Local-Remote Planning

SELECT count(*) FROM t1 JOIN t2 ON t1.a = t2.a JOIN t3 ON t3.b = t2.b

aggregate

join

scan(t1)

scan(t2)

scan(t3)

42 of 58

Local-Remote Planning

SELECT count(*) FROM t1 JOIN t2 ON t1.a = t2.a JOIN t3 ON t3.b = t2.b

aggregate

join

scan(t1)

scan(t2)

scan(t3)

local

remote

agnostic

43 of 58

Local-Remote Planning

SELECT count(*) FROM t1 JOIN t2 ON t1.a = t2.a JOIN t3 ON t3.b = t2.b

aggregate

join

scan(t1)

scan(t2)

scan(t3)

44 of 58

Bridge Operators

SELECT count(*) FROM t1 JOIN t2 ON t1.a = t2.a JOIN t3 ON t3.b = t2.b

aggregate

join

scan(t1)

scan(t2)

scan(t3)

bridge operator

45 of 58

Hybrid Query Processing

local (client)

remote (duckling)

Extension

DuckDB

Parse

Bind

Optimize

Execute

Bind

Optimize

Execute

results

Parse

“bridge operators transport data in flow”

Bind

46 of 58

MotherDuck Architecture

Storage Extension

47 of 58

DuckDB Storage

Commit() => Write Ahead Log => Checkpoint() => Database File

DuckDB

WAL

Database File

many small appends

random writes

48 of 58

MotherDuck Storage

Mapping the same database format on cloud resources..

DuckDB

WAL

Database File

many small appends

random writes

WAL

Database File

(FUSE)

S3

EFS

49 of 58

Differential Storage

50 of 58

Read

51 of 58

Write

DuckDB:

Range 1: 200 bytes from [400, 600]
Range 2: 100 bytes from [0, 100]
Range 3: 300 bytes from [1000, 1300]
Range 4: 50 bytes from [575, 625]

MotherDuck:

Range 1: 200 bytes from [0, 200]
Range 2: 100 bytes from [200, 300]
Range 3: 300 bytes from [300, 600]
Range 4: 50 bytes from [600,650]

52 of 58

Checkpoint

DuckDB:

Write all dirty pages and page meta-data to the database file
Delete WAL

MotherDuck:

Migrate active snapshot layer to S3 (background)
Create new empty active snapshot layer

53 of 58

Snapshot (“time travel”):

Fork:

54 of 58

MotherDuck Summary

Hybrid query processing

Remote-local planning, extension modules
Container-based scaling up, and down to 0

Cloud Storage

Scalable Cloud storage, decoupled from Compute
Coming: Immutable storage & time travel with DuckDB databases

User Interface

Tad: Notebook++ (DuckDB-WASM powered): Local pivot table

Thi wraps up the discussion of the MotherDuck Architecture.

It makes use of DuckDB extension modules, introducing parser, catalog, optimizer and operator extensions. The server backbowns are containers that run DuckDB (ducklings) and are variable in their resource footprint (number of cores and memory).

Cloud storage makes use of the DuckDB standard database format, that is single-file and does in-place updates; providing infrastructure to turn that into differential storage and mapping DuckDB storage onto cloud storage efficiently. Differential storage better suits cloud storage and also can power features like time travel.

On top, DuckDB provides an interactive notebook GUI, with a DuckDB-wasm powered pivot table that allows very fast slicing, dicing and reordering thanks to purely local follow-up queries.

55 of 58

Research Opportunities

Query Optimization

Cost based local-remote planning ⇒ asymmetrical client vs server
Driving cloud scheduling for query balance

Taking Hybrid query processing multi-region

Client ⇒ server1 ⇒ server2 ⇒ ..

Different Parties owning Client and Server

Privacy-preserving decentralized architectures

AI features

Inside the Database Service and the user interfaces
AI-powered data science functionalities
Making use of local and server-side GPU resources

56 of 58

57 of 58

Thank You!

58 of 58

CWI Database Architectures (DA)

Dutch Seminar on Data Systems Design (with UvA, TU/e, TuD)

subscribe: https://dsdsd.da.cwi.nl

By Fall 2024 and 2025 we will be looking for motivated MSc students (final projects)

CWI, Databricks, DuckDB Labs & MotherDuck

See: https://bit.ly/cwida-msc-2023 (current cohort topics)

�