Data 101 – Data Engineering
Lecture 2
1
The Basics
Waitlist Policy
Discussions
Office Hours
Late Policy
6
Inclusivity
This class (and associated platforms like Ed, OH) is meant to be an open, equitable, and inclusive environment.
See Berkeley’s principles of community: https://diversity.berkeley.edu/principles-community
No harassment, trolling, abuse etc. will be tolerated.
Please bring any issues to our attention.
Questions?
Data Engineering
Relational Algebra
9
Today: The Relational Data Model & Algebra
A Bit of History
The Relational Data Model
A relation = a set of tuples, each w/ a predefined set of attributes
Often have many relations (in data warehouse, database, or data lake)
12
Tuples |
Rows |
Records
Attributes | Columns
Table/Relation
(an instance of)
id | race | gender | age | warning | citation | arrest |
17213 | asian | F | 23 | False | True | False |
1 | white | M | 45 | True | False | False |
2 | black | M | 37 | False | False | False |
19 | hispanic | F | 58 | False | True | False |
... | … | … | … | … | … | … |
Example: Police Stop Relation
From the Stanford Open Policing project
Goal: to study racial disparities in policing
Learn more at https://openpolicing.stanford.edu/findings/
The example shows the first four traffic violation-based stops from Oakland, and a subset of the columns. (Note: the age attribute is made up)
How did we extract the data to get to this tabular form? Will discuss later!
13
id | race | gender | age | warning | citation | arrest |
17213 | asian | F | 23 | False | True | False |
1 | white | M | 45 | True | False | False |
2 | black | M | 37 | False | False | False |
19 | hispanic | F | 58 | False | True | False |
… | … | … | … | … | … | … |
The Relational Data Model
Each relation is defined by a set of attributes
Each attribute has a type, called domain
The domain must be atomic: string, integer, …
Each relation contains a set of tuples or records
Each tuple has values for each attribute
14
Tuples |
Rows |
Records
Attributes | Columns
Table/Relation
(an instance of)
id | race | gender | age | warning | citation | arrest |
17213 | asian | F | 23 | False | True | False |
1 | white | M | 45 | True | False | False |
2 | black | M | 37 | False | False | False |
19 | hispanic | F | 58 | False | True | False |
… | … | … | … | … | … | … |
Equivalent Representations
15
race | gender | age | citation | arrest |
asian | F | 23 | True | False |
white | M | 45 | False | False |
black | M | 37 | False | False |
hispanic | F | 58 | True | False |
race | citation | gender | age | arrest |
black | True | F | 23 | False |
asian | False | M | 45 | False |
white | False | M | 37 | False |
hispanic | True | F | 58 | False |
Relational Schema & Instance
16
Metadata
Data
Name | Artist | Album | Peak |
22 | Taylor Swift | Red | 20 |
Blinding Lights | The Weeknd | After Hours | 1 |
Truly Madly Deeply | Savage Garden | Savage Garden | 1 |
2 soon | keshi | THE REAPER | 78 |
Shelter | Porter Robinson | Shelter (single) | 16 |
… | … | … | … |
Schema info is just one type of metadata, more on that later
Recap: The Relational Data Model
17
Set Operations: Union and Difference
18
Expressing Union in Data Systems
Spark
>>> BerkeleyStops.union (OaklandStops)
Pandas
>>> pandas.concat([BerkeleyStops, OaklandStops], axis= 0)
SQL
>>> (SELECT * FROM BerkeleyStops) UNION (SELECT * FROM OaklandStops)
Can I store the transformed (unioned) result somewhere? Perhaps as another relation? Yes!
19
Basic SQL Unit
Expressing Union in Data Systems
Spark
>>> EastBayStops = BerkeleyStops.union (OaklandStops)
Pandas
>>> EastBayStops = pandas.concat([BerkeleyStops, OaklandStops], axis= 0)
SQL
>>> CREATE TABLE EastBayStops AS
((SELECT * FROM BerkeleyStops) UNION (SELECT * FROM OaklandStops))
plus a few other ways (we’ll discuss later)
We’ll skip storing the transformed result someplace, knowing that it can be done.
20
Unary Op.: Selection (Cross out rows)
21
Unary Op.: Selection (Cross out rows)
22
Expressing Selection in Data Systems
Spark
>>> Stops.filter (”age < 40”)
>>> Stops.filter (Stops.age < 40)
>>> Stops.where (”age < 40”)
Pandas
>>> Stops[Stops[‘age’] < 40]
SQL
>>> SELECT * FROM Stops WHERE age < 40
23
Another Selection Example
Only retain the tuples where gender = M and age > 40, maybe to study if racial disparities are greater in this subpopulation
24
id | race | gender | age | warning | citation | arrest |
17213 | asian | F | 23 | False | True | False |
1 | white | M | 45 | True | False | False |
2 | black | M | 37 | False | False | False |
19 | hispanic | F | 58 | False | True | False |
Another Selection Example
Only retain the tuples where gender = M and age > 40
25
id | race | gender | age | warning | citation | arrest |
17213 | asian | F | 23 | False | True | False |
1 | white | M | 45 | True | False | False |
2 | black | M | 37 | False | False | False |
19 | hispanic | F | 58 | False | True | False |
Another Selection Example
Only retain the tuples where gender = M and age > 40
26
id | race | gender | age | warning | citation | arrest |
17213 | asian | F | 23 | False | True | False |
1 | white | M | 45 | True | False | False |
2 | black | M | 37 | False | False | False |
19 | hispanic | F | 58 | False | True | False |
Another Selection Example
Only retain the tuples where gender = M and age > 40
27
id | race | gender | age | warning | citation | arrest |
1 | white | M | 45 | True | False | False |
Unary Op.: Projection (Cross Out Columns)
28
Why might getting rid of unwanted attributes be a potentially good idea?
Unary Op.: Projection (Cross Out Columns)
29
Unary Op.: Projection (Cross Out Columns)
30
Expressing Projection in Data Systems
Spark
>>> Stops.select (‘id’, ‘race’, ‘gender’, ‘age’, ‘arrest’)
>>> Stops.drop (‘warning’, ‘citation’)
Pandas
>>> Stops[[‘id’, ‘race’, ‘gender’, ‘age’, ‘arrest’]]
>>> Stops.drop ([‘warning’, ‘citation’], axis=1)
SQL
>>> SELECT id, race, gender, age, arrest FROM Stops
31
Another Projection (+ Selection) Example
32
id | race | gender | age | warning | citation | arrest |
17213 | asian | F | 23 | False | True | False |
1 | white | M | 45 | True | False | False |
2 | black | M | 37 | False | False | False |
19 | hispanic | F | 58 | False | True | False |
Another Projection (+ Selection) Example
33
id | race | gender | age | warning | citation | arrest |
17213 | asian | F | 23 | False | True | False |
1 | white | M | 45 | True | False | False |
2 | black | M | 37 | False | False | False |
19 | hispanic | F | 58 | False | True | False |
Another Projection (+ Selection) Example
34
id | race | gender | age | warning | citation | arrest |
17213 | asian | F | 23 | False | True | False |
1 | white | M | 45 | True | False | False |
2 | black | M | 37 | False | False | False |
19 | hispanic | F | 58 | False | True | False |
id | race | gender | age |
17213 | asian | F | 23 |
19 | hispanic | F | 58 |
Cartesian (or Cross) Product
35
36
id | race | location | age | warning | citation | arrest |
17213 | asian | MacArthur | 23 | False | True | False |
1 | white | West Oakland | 45 | True | False | False |
2 | black | West Oakland | 37 | False | False | False |
19 | hispanic | Civic Center | 58 | False | True | False |
location | zipcode |
MacArthur | 94621 |
West Oakland | 94609 |
Civic Center | 94612 |
Stops
Zips
X
37
id | race | location | age | warning | citation | arrest |
17213 | asian | MacArthur | 23 | False | True | False |
1 | white | West Oakland | 45 | True | False | False |
2 | black | West Oakland | 37 | False | False | False |
19 | hispanic | Civic Center | 58 | False | True | False |
location | zipcode |
MacArthur | 94621 |
West Oakland | 94609 |
Civic Center | 94612 |
Stops
Zips
X
38
id | race | location | age | warning | citation | arrest |
17213 | asian | MacArthur | 23 | False | True | False |
1 | white | West Oakland | 45 | True | False | False |
2 | black | West Oakland | 37 | False | False | False |
19 | hispanic | Civic Center | 58 | False | True | False |
location | zipcode |
MacArthur | 94621 |
West Oakland | 94609 |
Civic Center | 94612 |
id | race | Stops.location | age | warning | citation | arrest | Zips.location | zipcode |
17213 | asian | MacArthur | 23 | False | True | False | MacArthur | 94621 |
1 | white | West Oakland | 45 | True | False | False | MacArthur | 94621 |
2 | black | West Oakland | 37 | False | False | False | MacArthur | 94621 |
19 | hispanic | Civic Center | 58 | False | True | False | MacArthur | 94621 |
17213 | asian | MacArthur | 23 | False | True | False | West Oakland | 94609 |
1 | white | West Oakland | 45 | True | False | False | West Oakland | 94609 |
2 | black | West Oakland | 37 | False | False | False | West Oakland | 94609 |
19 | hispanic | Civic Center | 58 | False | True | False | West Oakland | 94609 |
17213 | asian | MacArthur | 23 | False | True | False | Civic Center | 94612 |
1 | white | West Oakland | 45 | True | False | False | Civic Center | 94612 |
2 | black | West Oakland | 37 | False | False | False | Civic Center | 94612 |
19 | hispanic | Civic Center | 58 | False | True | False | Civic Center | 94612 |
39
id | race | location | age | warning | citation | arrest |
17213 | asian | MacArthur | 23 | False | True | False |
1 | white | West Oakland | 45 | True | False | False |
2 | black | West Oakland | 37 | False | False | False |
19 | hispanic | Civic Center | 58 | False | True | False |
location | zipcode |
MacArthur | 94621 |
West Oakland | 94609 |
Civic Center | 94612 |
id | race | Stops.location | age | warning | citation | arrest | Zips.location | zipcode |
17213 | asian | MacArthur | 23 | False | True | False | MacArthur | 94621 |
1 | white | West Oakland | 45 | True | False | False | MacArthur | 94621 |
2 | black | West Oakland | 37 | False | False | False | MacArthur | 94621 |
19 | hispanic | Civic Center | 58 | False | True | False | MacArthur | 94621 |
17213 | asian | MacArthur | 23 | False | True | False | West Oakland | 94609 |
1 | white | West Oakland | 45 | True | False | False | West Oakland | 94609 |
2 | black | West Oakland | 37 | False | False | False | West Oakland | 94609 |
19 | hispanic | Civic Center | 58 | False | True | False | West Oakland | 94609 |
17213 | asian | MacArthur | 23 | False | True | False | Civic Center | 94612 |
1 | white | West Oakland | 45 | True | False | False | Civic Center | 94612 |
2 | black | West Oakland | 37 | False | False | False | Civic Center | 94612 |
19 | hispanic | Civic Center | 58 | False | True | False | Civic Center | 94612 |
Is this really what we might want?
Expressing Cross Product in Data Systems
Spark
>>> Stops.crossJoin (Zips)
Pandas
No direct way to do this.
SQL
>>> SELECT * FROM Stops, Zips
40
Renaming
41
Example
42
Stops
CAStops
id | race | arrest |
17213 | asian | False |
1 | white | False |
2 | black | False |
19 | hispanic | False |
number | race | arrested |
17213 | asian | False |
1 | white | False |
2 | black | False |
19 | hispanic | False |
Expressing Renaming in Data Systems
Spark
>>> Stops.withColumnRenamed(‘id’,‘number’)
.withColumnRenamed(‘arrest’,‘arrested’)
Pandas
>>> Stops.rename(columns = {‘id’: ’number’, ‘arrest’: ‘arrested’})
SQL
>>> SELECT id AS number, race, arrest AS arrested FROM R
Q: What about renaming the dataframe or relation name?
43
Recap: Relational Algebra Operations
44
Derived Operations
45
Theta Join
46
47
id | race | location | age | warning | citation | arrest |
17213 | asian | MacArthur | 23 | False | True | False |
1 | white | West Oakland | 45 | True | False | False |
2 | black | West Oakland | 37 | False | False | False |
19 | hispanic | Civic Center | 58 | False | True | False |
location | zipcode |
MacArthur | 94621 |
West Oakland | 94609 |
Civic Center | 94612 |
48
id | race | location | age | warning | citation | arrest |
17213 | asian | MacArthur | 23 | False | True | False |
1 | white | West Oakland | 45 | True | False | False |
2 | black | West Oakland | 37 | False | False | False |
19 | hispanic | Civic Center | 58 | False | True | False |
location | zipcode |
MacArthur | 94621 |
West Oakland | 94609 |
Civic Center | 94612 |
49
id | race | location | age | warning | citation | arrest |
17213 | asian | MacArthur | 23 | False | True | False |
1 | white | West Oakland | 45 | True | False | False |
2 | black | West Oakland | 37 | False | False | False |
19 | hispanic | Civic Center | 58 | False | True | False |
location | zipcode |
MacArthur | 94621 |
West Oakland | 94609 |
Civic Center | 94612 |
id | race | Stops.location | age | warning | citation | arrest | Zips.location | zipcode |
17213 | asian | MacArthur | 23 | False | True | False | MacArthur | 94621 |
1 | white | West Oakland | 45 | True | False | False | West Oakland | 94609 |
2 | black | West Oakland | 37 | False | False | False | West Oakland | 94609 |
19 | hispanic | Civic Center | 58 | False | True | False | Civic Center | 94612 |
50
id | race | location | age | warning | citation | arrest |
17213 | asian | MacArthur | 23 | False | True | False |
1 | white | West Oakland | 45 | True | False | False |
2 | black | West Oakland | 37 | False | False | False |
19 | hispanic | Civic Center | 58 | False | True | False |
location | zipcode |
MacArthur | 94621 |
West Oakland | 94609 |
Civic Center | 94612 |
id | race | Stops.location | age | warning | citation | arrest | Zips.location | zipcode |
17213 | asian | MacArthur | 23 | False | True | False | MacArthur | 94621 |
1 | white | West Oakland | 45 | True | False | False | West Oakland | 94609 |
2 | black | West Oakland | 37 | False | False | False | West Oakland | 94609 |
19 | hispanic | Civic Center | 58 | False | True | False | Civic Center | 94612 |
We often select on the same attributes having the
same values; two copies are not needed
51
Natural Join
id | race | location | age | warning | citation | arrest |
17213 | asian | MacArthur | 23 | False | True | False |
1 | white | West Oakland | 45 | True | False | False |
2 | black | West Oakland | 37 | False | False | False |
19 | hispanic | Civic Center | 58 | False | True | False |
location | zipcode |
MacArthur | 94621 |
West Oakland | 94609 |
Civic Center | 94612 |
id | race | location | age | warning | citation | arrest | zipcode |
17213 | asian | MacArthur | 23 | False | True | False | 94621 |
1 | white | West Oakland | 45 | True | False | False | 94609 |
2 | black | West Oakland | 37 | False | False | False | 94609 |
19 | hispanic | Civic Center | 58 | False | True | False | 94612 |
Natural Join
52
Exercise�
Express Stops Natural Join Zips using other operators
Stops (I, R, L, A), Zips (L, Z)
53
id | race | location | age |
17213 | asian | MacArthur | 23 |
1 | white | West Oakland | 45 |
2 | black | West Oakland | 37 |
19 | hispanic | Civic Center | 58 |
location | zipcode |
MacArthur | 94621 |
West Oakland | 94609 |
Civic Center | 94612 |
Exercise: Express Natural Join using other operators
Stops (I, R, L, A), Zips (L, Z)
54
Quick Sanity Checks
55
Expressing Joins in Data Systems
Spark Theta Join
>>> Stops.join(Zips, [Stops.location == Zips.location])
Spark Natural Join
No convenient way; use theta join w/ other operators
SQL Theta Join
>>> SELECT * FROM Stops, Zips
WHERE Stops.location = Zips.location
SQL Natural Join
>>> SELECT * FROM Stops NATURAL JOIN Zips
56
Expressing Joins in Data Systems
Pandas Theta Join/Natural Join
Neither directly supported, but a weird hybrid supported…
>>> Stops.merge(Zips, on=‘location’)
Will perform a Theta (rather Equi-Join) on location, and then drop one copy of location
Somewhere between Theta and Natural Join!
57
Recap: Relational Algebra Operations
58
Exercise (If there is time)
59
id | race | location | age | warning | citation | arrest |
17213 | asian | MacArthur | 23 | False | True | False |
1 | white | West Oakland | 45 | True | False | False |
2 | black | West Oakland | 37 | False | False | False |
19 | hispanic | Civic Center | 58 | False | True | False |
location | zipcode |
MacArthur | 94621 |
West Oakland | 94609 |
Civic Center | 94612 |
Exercise (If there is time)
60
id | race | location | age | warning | citation | arrest |
17213 | asian | MacArthur | 23 | False | True | False |
1 | white | West Oakland | 45 | True | False | False |
2 | black | West Oakland | 37 | False | False | False |
19 | hispanic | Civic Center | 58 | False | True | False |
location | zipcode |
MacArthur | 94621 |
West Oakland | 94609 |
Civic Center | 94612 |
Exercise (If there is time)
61
location | zipcode |
MacArthur | 94621 |
West Oakland | 94609 |
Civic Center | 94612 |
Exercise (If there is time)
62
Recap: Relational Algebra Operations
63
Extended Relational Algebra Operations
Will discuss all of these in the context of SQL!
What can RA not do?
Example 1: Transpose
location | zipcode |
MacArthur | 94621 |
West Oakland | 94609 |
Civic Center | 94612 |
MacArthur | West Oakland | Civic Center |
94621 | 94609 | 94612 |
MacArthur | 94621 |
West Oakland | 94609 |
Civic Center | 94612 |
Data Matrix
Transposed Data Matrix
Example 2: One Hot Encoding
id | race | arrest |
17213 | asian | False |
1 | white | False |
2 | black | False |
19 | hispanic | False |
id | arrest | asian | white | black | hispanic |
17213 | False | True | False | False | False |
1 | False | False | True | False | False |
2 | False | False | False | True | False |
19 | False | False | False | False | True |
Example 3: Pivot Tables
From Petersohn et al. VLDB 2020
Transpose!
What was the point?
69
Sets vs. Bags
70
Operations on Bags
71
Operations on Bags
72