1 of 36

CSE 344: Section 10

Parallel DBs & NoSQL

Dec 04th, 2025

2 of 36

Administrivia

Homework 7:

Due at 10:00 pm on Friday, Dec 5th
ONE late day allowed!

Take-home final

Due 10:20 am on Wednesday, Dec 10th
This is a morning due date!

All regular OH will be canceled during finals week

3 of 36

Parallel DBs

4 of 36

Parallel Query Processing

Data Replication: Copying the entire dataset

Expensive
OLAP workloads (i.e. mostly read-only) unlikely to cause consistency issues
Mostly used for intra-query parallelism

Data Partitioning (“sharding”): splitting the dataset into heterogeneous pieces

Avoiding consistency issues is complicated
Can accommodate OLTP workloads

Rarely compatible with OLAP workloads, which usually require the full dataset

Commonly used in inter-operator parallelism

5 of 36

Horizontal vs Vertical Partitioning

Horizontal partitioning: Every attribute is available on a replica

But not every tuple is present

Vertical partitioning: All tuple values are available on a replica

But not every attribute is present

6 of 36

Horizontal Data Partitioning

Payroll

…

1

2

3

ID	Name	Job	Salary

…

7 of 36

Horizontal Data Partitioning

Payroll

…

1

2

3

ID	Name	Job	Salary

…

8 of 36

Horizontal Data Partitioning

Payroll

ID	Name	Job	Salary

…

1

2

3

ID	Name	Job	Salary

ID	Name	Job	Salary

ID	Name	Job	Salary

Payroll1

Payroll2

Payroll3

9 of 36

Vertical Data Partitioning

Payroll

…

1

2

3

ID	Name	Job	Salary

…

10 of 36

Vertical Data Partitioning

Payroll

…

1

2

3

ID	Name	Job	Salary

…

11 of 36

Vertical Data Partitioning

11

Payroll

ID	Name	Job	Salary

…

1

2

3

ID	Name

Job

Salary

Payroll1

Payroll2

Payroll3

12 of 36

Horizontal Partitioning Strategies

Block Partition: Tuples partitioned by a raw size of chunk randomly
Range Partition: Each chunk contains tuples in a specific attribute’s range
Hash Partition: Each node contains tuples with a range of attribute hashes

Variables:

Input data size = N
Number of servers/nodes = p

13 of 36

Block Partitioning

…

T(R) = N

T(R₂) = N/p

T(R_p) = N/p

T(R₁) = N/p

p nodes






…	…

14 of 36

Range Partitioning

A






…	…

A

A

A

…

R₂, v₁ < A <= v₂

R_N, v_N < A < inf

R₁, -inf < A <= v₁

p nodes

15 of 36

Hash Partitioning

A






…	…

A

A

A

…

R₂, 2 = h(A)%p

R_N, 0 = h(A)%p

R₁, 1 = h(A)%p

p nodes

h(A)

16 of 36

Partition Quality

Uniform Partition: Chunks have roughly the same size

Skewed Partition: Some chunks are much larger than others, leading to a computational bottleneck for some queries

Block partition: Always uniform
Range partition: May be skewed
Hash partition: May be skewed

17 of 36

Shared-Nothing Model

18 of 36

Data Partitioning

We focus on shared-nothing architecture and intra-operator parallelism.

Block Partition

Tuples partitioned by raw size (no ordering considered)

Hash partitioned on attribute A

Node contains tuples with chosen attribute hashes

Range partitioned on attribute A

Node contains tuples in chosen attribute ranges

Ask in class what shared-nothing and intra-operator parallelism mean

A shared-nothing architecture (SN) is a distributed-computing architecture in which each node is independent and self-sufficient, and there is no single point of contention across the system. More specifically, none of the nodes share memory or disk storage.

Intra-operator parallelism – process each operator on multiple nodes

Block Partition - Tuples are partitioned by size

Hash Partition - Node contains tuples with chosen attribute hashes

hash key is used to distribute rows evenly across the different partitions

data is randomly distributed across the partitions rather than grouped

https://orahow.com/hash-partitioning-with-examples-in-oracle/

Range Partition - Node contains tuples in chosen attribute ranges

Hash partitioning is used where ranges aren’t appropriate, i.e. employee number, customer ID, etc

Range partitioning is often used to organize data by time intervals on a column of type DATE

https://web.stanford.edu/dept/itss/docs/oracle/10gR2/server.102/b14223/parpart.htm#:~:text=You%20do%20this%20by%20specifying,row%2Dto%2Dpartition%20mapping.

19 of 36

Moving Data

We have a “network layer” to move tuples temporarily between nodes.

Transferring data is expensive, so we need to be efficient (especially on joins and grouping).

20 of 36

Moving Data:

Partitioned Hash-Join Mechanism

Join on some attribute (e.g. R.x and S.y)
Call hash function h(z)

Key Points:

Hash shuffle tuples on join attributes

R1, S1

R2, S2

Rn, Sn

R1’, S1’

R2’, S2’

Rn’, Sn’

...

Contains tuples s.t.

h(R.x) = h(S.y) = red

Contains tuples s.t.

h(R.x) = h(S.y) = green

Contains

tuples s.t.

h(R.x) = h(S.y) = blue

21 of 36

Moving Data:

Broadcast Join Mechanism

Takes advantage of small datasets (can all fit into main memory)

Key Points:

Partition type of R doesn’t matter
S is unpartitioned and small
Distribute S across all R

R1

R2

Rn

R1’, S

R2’, S

Rn’, S

...

S

Contains all of S

22 of 36

Parallel Query Plans

Now, we need to know how to derive parallel plans from single node plans.

Which RA operations can you do without talking to other nodes?
Which RA operations require moving tuples?

⋈

σ

π

23 of 36

Parallel DB Practice!

24 of 36

We have a distributed database that holds the relations:

Drug(spec VARCHAR(255), compatibility INT)

Person(name VARCHAR(100) primary key, compatibility INT)

We want to compute:

SELECT P.name, count(D.spec)

FROM Person AS P, Drug AS D

WHERE P.compatibility = D.compatibility

GROUP BY P.name;

Drug is block-partitioned

Person is hash-partitioned on compatibility [h(n)]

You have three nodes. Draw a parallel query plan.

25 of 36

Ɣ_{P.name, count(D.spec)}(P ⋈ D)

Node 1

Node 2

Node 3

P ⋈ D

⋈

Ɣ_{P.name,count(D.spec)}

Ɣ

Hash [h(n)] Drug on compatibility

Takes advantage of:

Hash partitioning of [h(n)]
PK uniqueness of name

Block partitioning of Drug

26 of 36

NoSQL

27 of 36

Data Models

Key-value stores

Opaque value
e.g., Memcached, Redis

Document stores

“key-object”
e.g., SimpleDB, CouchDB, MongoDB, AsterixDB

Column-oriented

“column groups”
e.g., Bigtable, HBase, Cassandra, ClickHouse

Graph

E.g. Neo4j

28 of 36

Structured Data

29 of 36

Unstructured Data

30 of 36

Semi-structured Data

31 of 36

JSON

32 of 36

JSON and Semi-Structured Data

JSON, XML, Protobuf (also an IDL)

Familiar - as your HTTP request/response

Good for data exchange
Maps to OOP paradigm

Also - as a database file

Flexible tree-structured model
Query langs: XQuery, XPath, etc.

33 of 36

NoSQL

No clear definition :\

Non-relational
+ scalability, + availability, + flexibility
- consistency, - OLAP performance
Open source implementations

Motivation

The need to scale
Lots of web apps mostly OLTP queries

Read/write intensive
but fewer joins & aggregates

SQL vs. NoSQL Databases: What’s the Difference?

34 of 36

SQL vs NoSQL

Structure

Language

Scaling

Convergence

35 of 36

K/V Store Example

https://docs.google.com/document/d/1PASC_IcB5WfQuNJk7-HwUWLK2Sp56qTKwAW_7Kaj9pk/edit?tab=t.0

36 of 36

Section Worksheet!