1 of 36

CSE 344: Section 10

Parallel DBs & NoSQL

Dec 04th, 2025

2 of 36

Administrivia

  • Homework 7:
    • Due at 10:00 pm on Friday, Dec 5th
    • ONE late day allowed!

  • Take-home final
    • Due 10:20 am on Wednesday, Dec 10th
    • This is a morning due date!

  • All regular OH will be canceled during finals week

3 of 36

Parallel DBs

4 of 36

Parallel Query Processing

  • Data Replication: Copying the entire dataset
    • Expensive
    • OLAP workloads (i.e. mostly read-only) unlikely to cause consistency issues
    • Mostly used for intra-query parallelism
  • Data Partitioning (“sharding”): splitting the dataset into heterogeneous pieces
    • Avoiding consistency issues is complicated
    • Can accommodate OLTP workloads
      • Rarely compatible with OLAP workloads, which usually require the full dataset
    • Commonly used in inter-operator parallelism

5 of 36

Horizontal vs Vertical Partitioning

  • Horizontal partitioning: Every attribute is available on a replica
    • But not every tuple is present

  • Vertical partitioning: All tuple values are available on a replica
    • But not every attribute is present

6 of 36

Horizontal Data Partitioning

Payroll

1

2

3

ID

Name

Job

Salary

7 of 36

Horizontal Data Partitioning

Payroll

1

2

3

ID

Name

Job

Salary

8 of 36

Horizontal Data Partitioning

Payroll

ID

Name

Job

Salary

1

2

3

ID

Name

Job

Salary

ID

Name

Job

Salary

ID

Name

Job

Salary

Payroll1

Payroll2

Payroll3

9 of 36

Vertical Data Partitioning

Payroll

1

2

3

ID

Name

Job

Salary

10 of 36

Vertical Data Partitioning

Payroll

1

2

3

ID

Name

Job

Salary

11 of 36

Vertical Data Partitioning

11

Payroll

ID

Name

Job

Salary

1

2

3

ID

Name

Job

Salary

Payroll1

Payroll2

Payroll3

12 of 36

Horizontal Partitioning Strategies

  • Block Partition: Tuples partitioned by a raw size of chunk randomly
  • Range Partition: Each chunk contains tuples in a specific attribute’s range
  • Hash Partition: Each node contains tuples with a range of attribute hashes

  • Variables:
    • Input data size = N
    • Number of servers/nodes = p

13 of 36

Block Partitioning

T(R) = N

T(R2) = N/p

T(Rp) = N/p

T(R1) = N/p

p nodes

14 of 36

Range Partitioning

A

A

A

A

R2, v1 < A <= v2

RN, vN < A < inf

R1, -inf < A <= v1

p nodes

15 of 36

Hash Partitioning

A

A

A

A

R2, 2 = h(A)%p

RN, 0 = h(A)%p

R1, 1 = h(A)%p

p nodes

h(A)

16 of 36

Partition Quality

  • Uniform Partition: Chunks have roughly the same size

  • Skewed Partition: Some chunks are much larger than others, leading to a computational bottleneck for some queries
    • Block partition: Always uniform
    • Range partition: May be skewed
    • Hash partition: May be skewed

17 of 36

Shared-Nothing Model

18 of 36

Data Partitioning

We focus on shared-nothing architecture and intra-operator parallelism.

  • Block Partition
    • Tuples partitioned by raw size (no ordering considered)
  • Hash partitioned on attribute A
    • Node contains tuples with chosen attribute hashes
  • Range partitioned on attribute A
    • Node contains tuples in chosen attribute ranges

19 of 36

Moving Data

We have a “network layer” to move tuples temporarily between nodes.

Transferring data is expensive, so we need to be efficient (especially on joins and grouping).

20 of 36

Moving Data:

Partitioned Hash-Join Mechanism

  • Join on some attribute (e.g. R.x and S.y)
  • Call hash function h(z)

Key Points:

  • Hash shuffle tuples on join attributes

R1, S1

R2, S2

Rn, Sn

R1’, S1’

R2’, S2’

Rn’, Sn’

...

...

Contains tuples s.t.

h(R.x) = h(S.y) = red

Contains tuples s.t.

h(R.x) = h(S.y) = green

Contains

tuples s.t.

h(R.x) = h(S.y) = blue

21 of 36

Moving Data:

Broadcast Join Mechanism

Takes advantage of small datasets (can all fit into main memory)

Key Points:

  • Partition type of R doesn’t matter
  • S is unpartitioned and small
  • Distribute S across all R

R1

R2

Rn

R1’, S

R2’, S

Rn’, S

...

...

S

Contains all of S

Contains all of S

Contains all of S

22 of 36

Parallel Query Plans

Now, we need to know how to derive parallel plans from single node plans.

  • Which RA operations can you do without talking to other nodes?
  • Which RA operations require moving tuples?

σ

π

23 of 36

Parallel DB Practice!

24 of 36

We have a distributed database that holds the relations:

Drug(spec VARCHAR(255), compatibility INT)

Person(name VARCHAR(100) primary key, compatibility INT)

We want to compute:

SELECT P.name, count(D.spec)

FROM Person AS P, Drug AS D

WHERE P.compatibility = D.compatibility

GROUP BY P.name;

Drug is block-partitioned

Person is hash-partitioned on compatibility [h(n)]

You have three nodes. Draw a parallel query plan.

25 of 36

ƔP.name, count(D.spec)(P D)

Node 1

Node 2

Node 3

P D

ƔP.name,count(D.spec)

Ɣ

Ɣ

Hash [h(n)] Drug on compatibility

Takes advantage of:

  1. Hash partitioning of [h(n)]
  2. PK uniqueness of name

Block partitioning of Drug

26 of 36

NoSQL

27 of 36

Data Models

  • Key-value stores
    • Opaque value
    • e.g., Memcached, Redis
  • Document stores
    • “key-object”
    • e.g., SimpleDB, CouchDB, MongoDB, AsterixDB
  • Column-oriented
    • “column groups”
    • e.g., Bigtable, HBase, Cassandra, ClickHouse
  • Graph
    • E.g. Neo4j

28 of 36

Structured Data

29 of 36

Unstructured Data

30 of 36

Semi-structured Data

31 of 36

JSON

32 of 36

JSON and Semi-Structured Data

JSON, XML, Protobuf (also an IDL)

Familiar - as your HTTP request/response

  • Good for data exchange
  • Maps to OOP paradigm

Also - as a database file

  • Flexible tree-structured model
  • Query langs: XQuery, XPath, etc.

33 of 36

NoSQL

  • No clear definition :\
    • Non-relational
    • + scalability, + availability, + flexibility
    • - consistency, - OLAP performance
    • Open source implementations
  • Motivation
    • The need to scale
    • Lots of web apps mostly OLTP queries
      • Read/write intensive
      • but fewer joins & aggregates

34 of 36

SQL vs NoSQL

Structure

Language

Scaling

Convergence

35 of 36

36 of 36

Section Worksheet!