CSE 344: Section 9
Parallel DBs & Data Governance
May 29th, 2025
Administrivia
Parallel DBs
Parallel Query Processing
Horizontal vs Vertical Partitioning
Horizontal Data Partitioning
Payroll
…
1
2
3
ID | Name | Job | Salary |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
…
Horizontal Data Partitioning
Payroll
…
1
2
3
ID | Name | Job | Salary |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
…
Horizontal Data Partitioning
Payroll
ID | Name | Job | Salary |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
…
1
2
3
ID | Name | Job | Salary |
| | | |
| | | |
| | | |
ID | Name | Job | Salary |
| | | |
| | | |
| | | |
| | | |
ID | Name | Job | Salary |
| | | |
| | | |
Payroll1
Payroll2
Payroll3
Vertical Data Partitioning
Payroll
…
1
2
3
ID | Name | Job | Salary |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
…
Vertical Data Partitioning
Payroll
…
1
2
3
ID | Name | Job | Salary |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
…
Vertical Data Partitioning
11
Payroll
ID | Name | Job | Salary |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
…
…
1
2
3
ID | Name |
| |
| |
| |
| |
| |
| |
| |
| |
| |
Job |
|
|
|
|
|
|
|
|
|
Salary |
|
|
|
|
|
|
|
|
|
Payroll1
Payroll2
Payroll3
Horizontal Partitioning Strategies
Block Partitioning
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
…
T(R) = N
T(R2) = N/p
T(Rp) = N/p
T(R1) = N/p
p nodes
| |
| |
| |
| |
| |
… | … |
| |
| |
| |
| |
Range Partitioning
A | |
| |
| |
| |
| |
| |
| |
… | … |
A | |
| |
| |
A | |
| |
| |
A | |
| |
| |
…
R2, v1 < A <= v2
RN, vN < A < inf
R1, -inf < A <= v1
p nodes
Hash Partitioning
A | |
| |
| |
| |
| |
| |
| |
… | … |
A | |
| |
| |
A | |
| |
| |
A | |
| |
| |
…
R2, 2 = h(A)%p
RN, 0 = h(A)%p
R1, 1 = h(A)%p
p nodes
h(A)
Partition Quality
Worksheet Question 3 & 4
Question 3
Question 4
Question 4
Question 4 cont.
The query load consists primarily but not solely of randomly-sampled letter contents, grouped by sender. The database contains every letter sent since the creation of the Postal Service in 1771.
As before, we should choose a horizontal partitioning scheme (although we do know that the letter contents attribute will be used eventually, we don’t have information on if/which other columns are needed). If we are grouping primarily by sender, this would be a reasonable attribute to partition on.
(The founding of the postal service is a red herring in this question)
Shared-Nothing Model
Data Partitioning
We focus on shared-nothing architecture and intra-operator parallelism.
Moving Data
We have a “network layer” to move tuples temporarily between nodes.
Transferring data is expensive, so we need to be efficient (especially on joins and grouping).
Moving Data:
Partitioned Hash-Join Mechanism
Key Points:
R1, S1
R2, S2
Rn, Sn
R1’, S1’
R2’, S2’
Rn’, Sn’
...
...
Contains tuples s.t.
h(R.x) = h(S.y) = red
Contains tuples s.t.
h(R.x) = h(S.y) = green
Contains
tuples s.t.
h(R.x) = h(S.y) = blue
Moving Data:
Broadcast Join Mechanism
Takes advantage of small datasets (can all fit into main memory)
Key Points:
R1
R2
Rn
R1’, S
R2’, S
Rn’, S
...
...
S
Contains all of S
Contains all of S
Contains all of S
Parallel Query Plans
Now, we need to know how to derive parallel plans from single node plans.
⋈
σ
π
Parallel DB Practice!
We have a distributed database that holds the relations:
Drug(spec VARCHAR(255), compatibility INT)
Person(name VARCHAR(100) primary key, compatibility INT)
We want to compute:
SELECT P.name, count(D.spec)
FROM Person AS P, Drug AS D
WHERE P.compatibility = D.compatibility
GROUP BY P.name;
Drug is block-partitioned
Person is hash-partitioned on compatibility [h(n)]
You have three nodes. Draw a parallel query plan.
ƔP.name, count(D.spec)(P ⋈ D)
Node 1
Node 2
Node 3
P ⋈ D
⋈
⋈
ƔP.name,count(D.spec)
Ɣ
Ɣ
Hash [h(n)] Drug on compatibility
Takes advantage of:
Block partitioning of Drug
Data Governance
The Motivation
People, organizations and the law can expect platforms to protect valuable data.
If you don’t need sensitive data, don’t save it!
If you do need data, anonymize it before saving or working with it.
Get rid of data you no longer need.
Anonymization
Access Control
Question who needs to know what!
It’s always nice to be able to generalize permissions based on roles.
This can go hand in hand with anonymization. If I block access to something, that’s a form of masking/obfuscation.
Section Worksheet!