1 of 35

CSE 344: Section 9

Parallel DBs & Data Governance

May 29th, 2025

2 of 35

Administrivia

Homework 6:

Due at 10:00 pm on Friday, May 30th (tomorrow!).
Late deadline Sunday, June 1st

3 of 35

Parallel DBs

4 of 35

Parallel Query Processing

Data Replication: Copying the entire dataset

Expensive
OLAP workloads (i.e. mostly read-only) unlikely to cause consistency issues
Mostly used for intra-query parallelism

Data Partitioning (“sharding”): splitting the dataset into heterogeneous pieces

Avoiding consistency issues is complicated
Can accommodate OLTP workloads

Rarely compatible with OLAP workloads, which usually require the full dataset

Commonly used in inter-operator parallelism

5 of 35

Horizontal vs Vertical Partitioning

Horizontal partitioning: Every attribute is available on a replica

But not every tuple is present

Vertical partitioning: All tuple values are available on a replica

But not every attribute is present

6 of 35

Horizontal Data Partitioning

Payroll

…

1

2

3

ID	Name	Job	Salary

…

7 of 35

Horizontal Data Partitioning

Payroll

…

1

2

3

ID	Name	Job	Salary

…

8 of 35

Horizontal Data Partitioning

Payroll

ID	Name	Job	Salary

…

1

2

3

ID	Name	Job	Salary

ID	Name	Job	Salary

ID	Name	Job	Salary

Payroll1

Payroll2

Payroll3

9 of 35

Vertical Data Partitioning

Payroll

…

1

2

3

ID	Name	Job	Salary

…

10 of 35

Vertical Data Partitioning

Payroll

…

1

2

3

ID	Name	Job	Salary

…

11 of 35

Vertical Data Partitioning

11

Payroll

ID	Name	Job	Salary

…

1

2

3

ID	Name

Job

Salary

Payroll1

Payroll2

Payroll3

12 of 35

Horizontal Partitioning Strategies

Block Partition: Tuples partitioned by a raw size of chunk randomly
Range Partition: Each chunk contains tuples in a specific attribute’s range
Hash Partition: Each node contains tuples with a range of attribute hashes

Variables:

Input data size = N
Number of servers/nodes = p

13 of 35

Block Partitioning

…

T(R) = N

T(R₂) = N/p

T(R_p) = N/p

T(R₁) = N/p

p nodes






…	…

14 of 35

Range Partitioning

A






…	…

A

A

A

…

R₂, v₁ < A <= v₂

R_N, v_N < A < inf

R₁, -inf < A <= v₁

p nodes

15 of 35

Hash Partitioning

A






…	…

A

A

A

…

R₂, 2 = h(A)%p

R_N, 0 = h(A)%p

R₁, 1 = h(A)%p

p nodes

h(A)

16 of 35

Partition Quality

Uniform Partition: Chunks have roughly the same size

Skewed Partition: Some chunks are much larger than others, leading to a computational bottleneck for some queries

Block partition: Always uniform
Range partition: May be skewed
Hash partition: May be skewed

17 of 35

Worksheet Question 3 & 4

18 of 35

Question 3

The most common query does a join on the mid attribute of both tables, and a group by on the same attribute. We’ll get the best performance by hash partitioning both tables on mid to avoid the shuffle required for both the join and group-by. Range partitioning also works but is less common in practice.

While we can’t minimize skew with respect to the above query, we can spread the both tables as evenly as possible among the machines with horizontal partitioning. Of the schemes learned in class, we could block partition or hash partition on the primary key of each table. It may even be possible to range partition but we’d need to know the statistics of the attributes to choose the correct range values.

19 of 35

Question 4

20 of 35

Question 4

Since the query load consisted solely of counts of in-transit letters, the majority of the other columns in the table are not used, so one way is to vertically partition the data by the Status attribute

Since no mention is made of expected query load, we should choose a horizontal partitioning scheme because we don’t know which columns are needed. So for this case Block partitioning would work well; range partitioning or hash partitioning would also work well if we do not choose an attribute set consisting solely of the recipient column.

21 of 35

Question 4 cont.

The query load consists primarily but not solely of randomly-sampled letter contents, grouped by sender. The database contains every letter sent since the creation of the Postal Service in 1771.

As before, we should choose a horizontal partitioning scheme (although we do know that the letter contents attribute will be used eventually, we don’t have information on if/which other columns are needed). If we are grouping primarily by sender, this would be a reasonable attribute to partition on.

(The founding of the postal service is a red herring in this question)

22 of 35

Shared-Nothing Model

23 of 35

Data Partitioning

We focus on shared-nothing architecture and intra-operator parallelism.

Block Partition

Tuples partitioned by raw size (no ordering considered)

Hash partitioned on attribute A

Node contains tuples with chosen attribute hashes

Range partitioned on attribute A

Node contains tuples in chosen attribute ranges

Ask in class what shared-nothing and intra-operator parallelism mean

A shared-nothing architecture (SN) is a distributed-computing architecture in which each node is independent and self-sufficient, and there is no single point of contention across the system. More specifically, none of the nodes share memory or disk storage.

Intra-operator parallelism – process each operator on multiple nodes

Block Partition - Tuples are partitioned by size

Hash Partition - Node contains tuples with chosen attribute hashes

hash key is used to distribute rows evenly across the different partitions

data is randomly distributed across the partitions rather than grouped

https://orahow.com/hash-partitioning-with-examples-in-oracle/

Range Partition - Node contains tuples in chosen attribute ranges

Hash partitioning is used where ranges aren’t appropriate, i.e. employee number, customer ID, etc

Range partitioning is often used to organize data by time intervals on a column of type DATE

https://web.stanford.edu/dept/itss/docs/oracle/10gR2/server.102/b14223/parpart.htm#:~:text=You%20do%20this%20by%20specifying,row%2Dto%2Dpartition%20mapping.

24 of 35

Moving Data

We have a “network layer” to move tuples temporarily between nodes.

Transferring data is expensive, so we need to be efficient (especially on joins and grouping).

25 of 35

Moving Data:

Partitioned Hash-Join Mechanism

Join on some attribute (e.g. R.x and S.y)
Call hash function h(z)

Key Points:

Hash shuffle tuples on join attributes

R1, S1

R2, S2

Rn, Sn

R1’, S1’

R2’, S2’

Rn’, Sn’

...

Contains tuples s.t.

h(R.x) = h(S.y) = red

Contains tuples s.t.

h(R.x) = h(S.y) = green

Contains

tuples s.t.

h(R.x) = h(S.y) = blue

26 of 35

Moving Data:

Broadcast Join Mechanism

Takes advantage of small datasets (can all fit into main memory)

Key Points:

Partition type of R doesn’t matter
S is unpartitioned and small
Distribute S across all R

R1

R2

Rn

R1’, S

R2’, S

Rn’, S

...

S

Contains all of S

27 of 35

Parallel Query Plans

Now, we need to know how to derive parallel plans from single node plans.

Which RA operations can you do without talking to other nodes?
Which RA operations require moving tuples?

⋈

σ

π

28 of 35

Parallel DB Practice!

29 of 35

We have a distributed database that holds the relations:

Drug(spec VARCHAR(255), compatibility INT)

Person(name VARCHAR(100) primary key, compatibility INT)

We want to compute:

SELECT P.name, count(D.spec)

FROM Person AS P, Drug AS D

WHERE P.compatibility = D.compatibility

GROUP BY P.name;

Drug is block-partitioned

Person is hash-partitioned on compatibility [h(n)]

You have three nodes. Draw a parallel query plan.

30 of 35

Ɣ_{P.name, count(D.spec)}(P ⋈ D)

Node 1

Node 2

Node 3

P ⋈ D

⋈

Ɣ_{P.name,count(D.spec)}

Ɣ

Hash [h(n)] Drug on compatibility

Takes advantage of:

Hash partitioning of [h(n)]
PK uniqueness of name

Block partitioning of Drug

31 of 35

Data Governance

32 of 35

The Motivation

People, organizations and the law can expect platforms to protect valuable data.

If you don’t need sensitive data, don’t save it!

If you do need data, anonymize it before saving or working with it.

Get rid of data you no longer need.

33 of 35

Anonymization

Masking/Obfuscation

Block access to sensitive data or make it unreadable
Most DBMSs have built in ways to do this

Psuedonymization

Use valid, but fake replacements for sensitive data

Perturbation/Generalization

Rounding/bucketing: often means providing a close, but slightly off value from some kind of ‘rounding’
Return partial data
Noise: appending unnecessary information

34 of 35

Access Control

Question who needs to know what!

Who has read access? Write access?
What columns do people need? What rows?

It’s always nice to be able to generalize permissions based on roles.

This can go hand in hand with anonymization. If I block access to something, that’s a form of masking/obfuscation.

35 of 35

Section Worksheet!