1 of 72

Data 101 – Data Engineering

Lecture 2

1

2 of 72

The Basics

  • Class website: data101.org
    • If enrolled or waitlisted, please join Ed and Gradescope
    • Links on the course website!
  • Please do not email course staff unless there are sensitive issues
    • Private Ed post is almost always a better idea

3 of 72

Waitlist Policy

  • Will let the waitlist play out
  • Cannot take concurrent enrollment students at this time
    • You are welcome to audit!

4 of 72

Discussions

  • All in-person discussions held in Evans 458 (pre-semester survey form had a typo)
    • M 12-1 (Cathy)
    • M 1-2 (Audrey)
    • M 2-3 (Shadaj)
    • M 3-4 (Evelyn)
  • One virtual section, taught by Shreya
    • Tues 4-5
    • May not be recorded.
  • Section begins week of Sept. 12

5 of 72

Office Hours

  • Check course calendar on website.
  • All OH held in Warren 101
  • Begin week of Sept 5

6 of 72

Late Policy

  • Up to four late days for projects and multivitamins each, to be used in any way that makes sense, without any explanation.
    • Cannot mix and match across categories
    • But in our experience this should be plenty!
  • Any submission after the deadline will be rounded up to the closest day (i.e., 24 hour), so even one hour late will count as the use of a full day.
  • If you go beyond your late day allocation, you lose 15% of the grade for that project per day late
    • Not applicable to multivitamins – would like you to stay on schedule
  • In case of medical issues, please contact the staff as soon as you are able. (private Ed post works best)

6

7 of 72

Inclusivity

This class (and associated platforms like Ed, OH) is meant to be an open, equitable, and inclusive environment.

See Berkeley’s principles of community: https://diversity.berkeley.edu/principles-community

No harassment, trolling, abuse etc. will be tolerated.

Please bring any issues to our attention.

8 of 72

Questions?

9 of 72

Data Engineering

Relational Algebra

9

10 of 72

Today: The Relational Data Model & Algebra

  • Just like:
    • Matrix manipulation (and a lot of ML!) is built on linear algebra
    • Arithmetic & calculus is built on elementary algebra

  • A firm grounding in the relational data model & algebra allows us to understand and quantify the transformational capabilities of various data systems
    • Provides a “baseline” of functionality, and
    • Study the impact of deviations from the data model and algebra

11 of 72

A Bit of History

  • Introduced in a seminal paper by Edgar F. Codd in 1970

  • Formed the basis for the relational database revolution and many subsequent data systems

12 of 72

The Relational Data Model

A relation = a set of tuples, each w/ a predefined set of attributes

Often have many relations (in data warehouse, database, or data lake)

12

Tuples |

Rows |

Records

Attributes | Columns

Table/Relation

(an instance of)

id

race

gender

age

warning

citation

arrest

17213

asian

F

23

False

True

False

1

white

M

45

True

False

False

2

black

M

37

False

False

False

19

hispanic

F

58

False

True

False

...

13 of 72

Example: Police Stop Relation

From the Stanford Open Policing project

Goal: to study racial disparities in policing

Learn more at https://openpolicing.stanford.edu/findings/

The example shows the first four traffic violation-based stops from Oakland, and a subset of the columns. (Note: the age attribute is made up)

How did we extract the data to get to this tabular form? Will discuss later!

13

id

race

gender

age

warning

citation

arrest

17213

asian

F

23

False

True

False

1

white

M

45

True

False

False

2

black

M

37

False

False

False

19

hispanic

F

58

False

True

False

14 of 72

The Relational Data Model

Each relation is defined by a set of attributes

Each attribute has a type, called domain

The domain must be atomic: string, integer, …

Each relation contains a set of tuples or records

Each tuple has values for each attribute

14

Tuples |

Rows |

Records

Attributes | Columns

Table/Relation

(an instance of)

id

race

gender

age

warning

citation

arrest

17213

asian

F

23

False

True

False

1

white

M

45

True

False

False

2

black

M

37

False

False

False

19

hispanic

F

58

False

True

False

15 of 72

Equivalent Representations

  • Since a relation has a set of attributes and a set of tuples, we can reorder both and not change the relation

15

race

gender

age

citation

arrest

asian

F

23

True

False

white

M

45

False

False

black

M

37

False

False

hispanic

F

58

True

False

race

citation

gender

age

arrest

black

True

F

23

False

asian

False

M

45

False

white

False

M

37

False

hispanic

True

F

58

False

16 of 72

Relational Schema & Instance

  • Schema: The structure, format, or scaffolding
    • Schema for a Relation: Relation names plus attribute names & domains
      • Songs (name String, artist String, album String, peak Integer, …)
    • Overall schema: Schemas for many relations
      • Songs (…)
      • Artists (name String, grammys Integer, age Integer)
  • Instance: Specific instantiation of the Relation
    • A Relation with “values filled in”
  • Schema = metadata; Instance = data

16

Metadata

Data

Name

Artist

Album

Peak

22

Taylor Swift

Red

20

Blinding Lights

The Weeknd

After Hours

1

Truly Madly Deeply

Savage Garden

Savage Garden

1

2 soon

keshi

THE REAPER

78

Shelter

Porter Robinson

Shelter (single)

16

Schema info is just one type of metadata, more on that later

17 of 72

Recap: The Relational Data Model

  • Data is stored in relations
  • Relations are defined by a set of attributes/columns w/ specific domains
    • Together this defines the “schema” of a relation
  • At any time, relations comprise of a set of tuples/rows/records
    • This constitutes the “instance” of a relation
  • There may be many relations.

  • Next, we will define Relational Algebra: the operations that help us transform relations (the operands)
    • Parallels to linear algebra or set algebra

17

18 of 72

Set Operations: Union and Difference

  • Union & Difference
    • Binary operations from set algebra
    • Recall that relations are sets
  • Union Notation:
    • R and S must have same schema
    • Output schema: same as input
  • Example: OaklandStops (id, race, arrest) U BerkeleyStops (id, race, arrest)

  • Difference is similar : tuples in R and not in S

18

 

 

 

 

19 of 72

Expressing Union in Data Systems

Spark

>>> BerkeleyStops.union (OaklandStops)

Pandas

>>> pandas.concat([BerkeleyStops, OaklandStops], axis= 0)

SQL

>>> (SELECT * FROM BerkeleyStops) UNION (SELECT * FROM OaklandStops)

Can I store the transformed (unioned) result somewhere? Perhaps as another relation? Yes!

19

 

Basic SQL Unit

20 of 72

Expressing Union in Data Systems

Spark

>>> EastBayStops = BerkeleyStops.union (OaklandStops)

Pandas

>>> EastBayStops = pandas.concat([BerkeleyStops, OaklandStops], axis= 0)

SQL

>>> CREATE TABLE EastBayStops AS

((SELECT * FROM BerkeleyStops) UNION (SELECT * FROM OaklandStops))

plus a few other ways (we’ll discuss later)

We’ll skip storing the transformed result someplace, knowing that it can be done.

20

 

21 of 72

Unary Op.: Selection (Cross out rows)

  • Returns all tuples that satisfy a condition (also called filter)
  • Notation
    • Condition can involve attributes in with =, >, <, AND, OR, etc.
  • Output schema: same as input
  • Say we want to only study the stops where the driver age is < 40

21

 

 

 

22 of 72

Unary Op.: Selection (Cross out rows)

  • Returns all tuples that satisfy a condition (also called filter)
  • Notation
    • Condition can involve attributes in with =, >, <, AND, OR, etc.
  • Output schema: same as input
  • Say we want to only study the stops where the driver age is < 40

22

 

 

 

 

23 of 72

Expressing Selection in Data Systems

Spark

>>> Stops.filter (”age < 40”)

>>> Stops.filter (Stops.age < 40)

>>> Stops.where (”age < 40”)

Pandas

>>> Stops[Stops[‘age’] < 40]

SQL

>>> SELECT * FROM Stops WHERE age < 40

23

 

24 of 72

Another Selection Example

Only retain the tuples where gender = M and age > 40, maybe to study if racial disparities are greater in this subpopulation

24

id

race

gender

age

warning

citation

arrest

17213

asian

F

23

False

True

False

1

white

M

45

True

False

False

2

black

M

37

False

False

False

19

hispanic

F

58

False

True

False

25 of 72

Another Selection Example

Only retain the tuples where gender = M and age > 40

25

 

id

race

gender

age

warning

citation

arrest

17213

asian

F

23

False

True

False

1

white

M

45

True

False

False

2

black

M

37

False

False

False

19

hispanic

F

58

False

True

False

26 of 72

Another Selection Example

Only retain the tuples where gender = M and age > 40

26

id

race

gender

age

warning

citation

arrest

17213

asian

F

23

False

True

False

1

white

M

45

True

False

False

2

black

M

37

False

False

False

19

hispanic

F

58

False

True

False

 

27 of 72

Another Selection Example

Only retain the tuples where gender = M and age > 40

27

 

id

race

gender

age

warning

citation

arrest

1

white

M

45

True

False

False

28 of 72

Unary Op.: Projection (Cross Out Columns)

  • Notation
  • Where

  • That is, you can only sub select from columns that are present in R
  • Output Schema
  • Example: say we want to predict arrests (rather than citations or warrants) from Stops (id, race, gender, age, warning, arrest, citation), we can get rid of those attributes

28

 

 

 

 

Why might getting rid of unwanted attributes be a potentially good idea?

29 of 72

Unary Op.: Projection (Cross Out Columns)

  • Notation
  • Where

  • That is, you can only sub select from columns that are present in R
  • Output Schema
  • Example: say we want to predict arrests (rather than citations or warnings) from Stops (id, race, gender, age, warning, arrest, citation), we can get rid of those attributes

29

 

 

 

 

 

30 of 72

Unary Op.: Projection (Cross Out Columns)

  • Notation
  • Where

  • That is, you can only sub select from columns that are present in R
  • Output Schema
  • Example: say we want to predict arrests from Stops

  • Note: Automatically deletes any duplicate rows to return a set!

30

 

 

 

 

 

31 of 72

Expressing Projection in Data Systems

Spark

>>> Stops.select (‘id’, ‘race’, ‘gender’, ‘age’, ‘arrest’)

>>> Stops.drop (‘warning’, ‘citation’)

Pandas

>>> Stops[[‘id’, ‘race’, ‘gender’, ‘age’, ‘arrest’]]

>>> Stops.drop ([‘warning’, ‘citation’], axis=1)

SQL

>>> SELECT id, race, gender, age, arrest FROM Stops

31

 

32 of 72

Another Projection (+ Selection) Example

  • Say we want to retain all records pertaining to citations (i.e., citations are true), and also drop arrest, warning, and citation attributes.

32

id

race

gender

age

warning

citation

arrest

17213

asian

F

23

False

True

False

1

white

M

45

True

False

False

2

black

M

37

False

False

False

19

hispanic

F

58

False

True

False

33 of 72

Another Projection (+ Selection) Example

33

id

race

gender

age

warning

citation

arrest

17213

asian

F

23

False

True

False

1

white

M

45

True

False

False

2

black

M

37

False

False

False

19

hispanic

F

58

False

True

False

 

  • Say we want to retain all records pertaining to citations (i.e., citations are true), and also drop arrest, warning, and citation attributes.

34 of 72

Another Projection (+ Selection) Example

34

id

race

gender

age

warning

citation

arrest

17213

asian

F

23

False

True

False

1

white

M

45

True

False

False

2

black

M

37

False

False

False

19

hispanic

F

58

False

True

False

 

id

race

gender

age

17213

asian

F

23

19

hispanic

F

58

  • Say we want to retain all records pertaining to citations (i.e., citations are true), and also drop arrest, warning, and citation attributes.

35 of 72

Cartesian (or Cross) Product

  • Notation
  • Where

  • Output Schema
  • Meaning: associate each tuple on LHS with RHS
  • Why would we ever want to do this? Hold your horses!
  • One nuance:

35

 

 

 

 

 

36 of 72

36

id

race

location

age

warning

citation

arrest

17213

asian

MacArthur

23

False

True

False

1

white

West Oakland

45

True

False

False

2

black

West Oakland

37

False

False

False

19

hispanic

Civic Center

58

False

True

False

location

zipcode

MacArthur

94621

West Oakland

94609

Civic Center

94612

Stops

Zips

X

37 of 72

37

id

race

location

age

warning

citation

arrest

17213

asian

MacArthur

23

False

True

False

1

white

West Oakland

45

True

False

False

2

black

West Oakland

37

False

False

False

19

hispanic

Civic Center

58

False

True

False

location

zipcode

MacArthur

94621

West Oakland

94609

Civic Center

94612

Stops

Zips

X

38 of 72

38

id

race

location

age

warning

citation

arrest

17213

asian

MacArthur

23

False

True

False

1

white

West Oakland

45

True

False

False

2

black

West Oakland

37

False

False

False

19

hispanic

Civic Center

58

False

True

False

location

zipcode

MacArthur

94621

West Oakland

94609

Civic Center

94612

id

race

Stops.location

age

warning

citation

arrest

Zips.location

zipcode

17213

asian

MacArthur

23

False

True

False

MacArthur

94621

1

white

West Oakland

45

True

False

False

MacArthur

94621

2

black

West Oakland

37

False

False

False

MacArthur

94621

19

hispanic

Civic Center

58

False

True

False

MacArthur

94621

17213

asian

MacArthur

23

False

True

False

West Oakland

94609

1

white

West Oakland

45

True

False

False

West Oakland

94609

2

black

West Oakland

37

False

False

False

West Oakland

94609

19

hispanic

Civic Center

58

False

True

False

West Oakland

94609

17213

asian

MacArthur

23

False

True

False

Civic Center

94612

1

white

West Oakland

45

True

False

False

Civic Center

94612

2

black

West Oakland

37

False

False

False

Civic Center

94612

19

hispanic

Civic Center

58

False

True

False

Civic Center

94612

39 of 72

39

id

race

location

age

warning

citation

arrest

17213

asian

MacArthur

23

False

True

False

1

white

West Oakland

45

True

False

False

2

black

West Oakland

37

False

False

False

19

hispanic

Civic Center

58

False

True

False

location

zipcode

MacArthur

94621

West Oakland

94609

Civic Center

94612

id

race

Stops.location

age

warning

citation

arrest

Zips.location

zipcode

17213

asian

MacArthur

23

False

True

False

MacArthur

94621

1

white

West Oakland

45

True

False

False

MacArthur

94621

2

black

West Oakland

37

False

False

False

MacArthur

94621

19

hispanic

Civic Center

58

False

True

False

MacArthur

94621

17213

asian

MacArthur

23

False

True

False

West Oakland

94609

1

white

West Oakland

45

True

False

False

West Oakland

94609

2

black

West Oakland

37

False

False

False

West Oakland

94609

19

hispanic

Civic Center

58

False

True

False

West Oakland

94609

17213

asian

MacArthur

23

False

True

False

Civic Center

94612

1

white

West Oakland

45

True

False

False

Civic Center

94612

2

black

West Oakland

37

False

False

False

Civic Center

94612

19

hispanic

Civic Center

58

False

True

False

Civic Center

94612

Is this really what we might want?

40 of 72

Expressing Cross Product in Data Systems

Spark

>>> Stops.crossJoin (Zips)

Pandas

No direct way to do this.

SQL

>>> SELECT * FROM Stops, Zips

40

 

41 of 72

Renaming

  • Does not change the data/instance, only the metadata/schema (the relation and attribute names)
  • Notation

  • Input schema

  • Output schema (NB: can omit output name if we don’t want to rename)

  • Example:

41

 

 

 

 

 

 

 

42 of 72

Example

42

Stops

CAStops

 

id

race

arrest

17213

asian

False

1

white

False

2

black

False

19

hispanic

False

number

race

arrested

17213

asian

False

1

white

False

2

black

False

19

hispanic

False

43 of 72

Expressing Renaming in Data Systems

Spark

>>> Stops.withColumnRenamed(‘id’,‘number’)

.withColumnRenamed(‘arrest’,‘arrested’)

Pandas

>>> Stops.rename(columns = {‘id’: ’number’, ‘arrest’: ‘arrested’})

SQL

>>> SELECT id AS number, race, arrest AS arrested FROM R

Q: What about renaming the dataframe or relation name?

43

 

44 of 72

Recap: Relational Algebra Operations

  • Set ops. — union, difference (expand/shrink vertically)
  • Selection (cross out rows)
  • Projection (cross out columns)
  • Cartesian product (expand horizontally and vertically)
  • Renaming (change schema)

44

45 of 72

Derived Operations

  • Set operations: can derive intersection from union and difference (Exercise!)
  • Joins!
    • Special case of cartesian product and selection that appears frequently in queries

45

46 of 72

Theta Join

  • A Cartesian product followed by a selection
  • Notation

  • Input Schemas

  • Output Schema

  • Perform the cross-product, remove tuples that don’t satisfy
  • Special case is an equality condition: Equi join
  • Like cross-product use prefixes to distinguish common attribs

46

 

 

 

 

 

 

47 of 72

47

id

race

location

age

warning

citation

arrest

17213

asian

MacArthur

23

False

True

False

1

white

West Oakland

45

True

False

False

2

black

West Oakland

37

False

False

False

19

hispanic

Civic Center

58

False

True

False

location

zipcode

MacArthur

94621

West Oakland

94609

Civic Center

94612

 

48 of 72

48

id

race

location

age

warning

citation

arrest

17213

asian

MacArthur

23

False

True

False

1

white

West Oakland

45

True

False

False

2

black

West Oakland

37

False

False

False

19

hispanic

Civic Center

58

False

True

False

location

zipcode

MacArthur

94621

West Oakland

94609

Civic Center

94612

 

49 of 72

49

id

race

location

age

warning

citation

arrest

17213

asian

MacArthur

23

False

True

False

1

white

West Oakland

45

True

False

False

2

black

West Oakland

37

False

False

False

19

hispanic

Civic Center

58

False

True

False

location

zipcode

MacArthur

94621

West Oakland

94609

Civic Center

94612

id

race

Stops.location

age

warning

citation

arrest

Zips.location

zipcode

17213

asian

MacArthur

23

False

True

False

MacArthur

94621

1

white

West Oakland

45

True

False

False

West Oakland

94609

2

black

West Oakland

37

False

False

False

West Oakland

94609

19

hispanic

Civic Center

58

False

True

False

Civic Center

94612

 

50 of 72

50

id

race

location

age

warning

citation

arrest

17213

asian

MacArthur

23

False

True

False

1

white

West Oakland

45

True

False

False

2

black

West Oakland

37

False

False

False

19

hispanic

Civic Center

58

False

True

False

location

zipcode

MacArthur

94621

West Oakland

94609

Civic Center

94612

id

race

Stops.location

age

warning

citation

arrest

Zips.location

zipcode

17213

asian

MacArthur

23

False

True

False

MacArthur

94621

1

white

West Oakland

45

True

False

False

West Oakland

94609

2

black

West Oakland

37

False

False

False

West Oakland

94609

19

hispanic

Civic Center

58

False

True

False

Civic Center

94612

We often select on the same attributes having the

same values; two copies are not needed

 

51 of 72

51

 

Natural Join

id

race

location

age

warning

citation

arrest

17213

asian

MacArthur

23

False

True

False

1

white

West Oakland

45

True

False

False

2

black

West Oakland

37

False

False

False

19

hispanic

Civic Center

58

False

True

False

location

zipcode

MacArthur

94621

West Oakland

94609

Civic Center

94612

id

race

location

age

warning

citation

arrest

zipcode

17213

asian

MacArthur

23

False

True

False

94621

1

white

West Oakland

45

True

False

False

94609

2

black

West Oakland

37

False

False

False

94609

19

hispanic

Civic Center

58

False

True

False

94612

52 of 72

Natural Join

  • Notation

  • Input Schemas

  • Output Schema

    • But with duplicate attributes removed
  • Sequence:
    • cross product
    • match tuples based on values of shared attributes
    • remove duplicate attributes

52

 

 

 

 

53 of 72

Exercise�

Express Stops Natural Join Zips using other operators

Stops (I, R, L, A), Zips (L, Z)

53

id

race

location

age

17213

asian

MacArthur

23

1

white

West Oakland

45

2

black

West Oakland

37

19

hispanic

Civic Center

58

location

zipcode

MacArthur

94621

West Oakland

94609

Civic Center

94612

54 of 72

Exercise: Express Natural Join using other operators

Stops (I, R, L, A), Zips (L, Z)

54

 

55 of 72

Quick Sanity Checks

  • Say R (A, B), S (A, B) are natural joined. What would the result be?
    • Intersection of R and S

  • Say R (A, B), S (C, D) are natural joined. What would the result be?

55

56 of 72

Expressing Joins in Data Systems

Spark Theta Join

>>> Stops.join(Zips, [Stops.location == Zips.location])

Spark Natural Join

No convenient way; use theta join w/ other operators

SQL Theta Join

>>> SELECT * FROM Stops, Zips

WHERE Stops.location = Zips.location

SQL Natural Join

>>> SELECT * FROM Stops NATURAL JOIN Zips

56

 

 

57 of 72

Expressing Joins in Data Systems

Pandas Theta Join/Natural Join

Neither directly supported, but a weird hybrid supported…

>>> Stops.merge(Zips, on=‘location’)

Will perform a Theta (rather Equi-Join) on location, and then drop one copy of location

Somewhere between Theta and Natural Join!

57

 

 

58 of 72

Recap: Relational Algebra Operations

  • Set ops. — union, difference (expand/shrink vertically)
  • Selection (cross out rows)
  • Projection (cross out columns)
  • Cartesian product (expand horizontally and vertically)
  • Renaming (change schema)

  • Intersection
  • Theta join and Natural join (match and combine tuples across tables)

58

59 of 72

Exercise (If there is time)

  • Given Stops and Zips, collect the ids of the records from 94705 that correspond to asians.

59

id

race

location

age

warning

citation

arrest

17213

asian

MacArthur

23

False

True

False

1

white

West Oakland

45

True

False

False

2

black

West Oakland

37

False

False

False

19

hispanic

Civic Center

58

False

True

False

location

zipcode

MacArthur

94621

West Oakland

94609

Civic Center

94612

60 of 72

Exercise (If there is time)

60

 

id

race

location

age

warning

citation

arrest

17213

asian

MacArthur

23

False

True

False

1

white

West Oakland

45

True

False

False

2

black

West Oakland

37

False

False

False

19

hispanic

Civic Center

58

False

True

False

location

zipcode

MacArthur

94621

West Oakland

94609

Civic Center

94612

  • Given Stops and Zips, collect the ids of the records from 94705 that correspond to asians.

61 of 72

Exercise (If there is time)

  • Find the Zipcodes that have multiple locations associated with it from Zips (location, zipcode)

61

location

zipcode

MacArthur

94621

West Oakland

94609

Civic Center

94612

62 of 72

Exercise (If there is time)

  • Find the Zipcodes that have multiple locations associated with it from Zips (location, zipcode)

62

 

63 of 72

Recap: Relational Algebra Operations

  • Set ops. — union, difference (expand/shrink vertically)
  • Selection (cross out rows)
  • Projection (cross out columns)
  • Cartesian product (expand horizontally and vertically)
  • Renaming (change schema)

  • Intersection
  • Theta join and Natural join (match and combine tuples across tables)

63

64 of 72

Extended Relational Algebra Operations

  • Extended Projection
    • Creation of new attributes
  • Extended Joins
    • Outerjoins (left/right)
  • Grouping & Aggregation
  • Sorting
  • Duplicate Elimination
  • Limiting results

Will discuss all of these in the context of SQL!

65 of 72

What can RA not do?

  • Thoughts?

66 of 72

Example 1: Transpose

  • Common operation in LA
    • Other operations in LA are similarly challenging
  • Q: Why is this unsupported in RA?
    • Breaks the assumption that there is a predefined type per column
    • Pragmatics: size/type of output relation is unclear. More next

location

zipcode

MacArthur

94621

West Oakland

94609

Civic Center

94612

MacArthur

West Oakland

Civic Center

94621

94609

94612

MacArthur

94621

West Oakland

94609

Civic Center

94612

Data Matrix

Transposed Data Matrix

67 of 72

Example 2: One Hot Encoding

  • Encoding categorical (string) values as features
  • Useful primitive in ML, which prefers Booleans and numbers rather than categorical values
  • Moving data to schema

  • Unsupported because it is hard to determine output shape w/o looking at data
    • Cannot type-check a RA expression w/o data
      • E.g., union of a one-hot encoded table with another table
    • Harder to optimize the execution of RA exps

id

race

arrest

17213

asian

False

1

white

False

2

black

False

19

hispanic

False

id

arrest

asian

white

black

hispanic

17213

False

True

False

False

False

1

False

False

True

False

False

2

False

False

False

True

False

19

False

False

False

False

True

68 of 72

Example 3: Pivot Tables

  • Reshaping the data to present it in a more intuitive way
    • Facilitating comparisons
  • Invented in the spreadsheet context by Pito Salas in 1986
    • First appearance in Lotus Improv

  • Again, moving data to schema
    • Again, hard to determine output shape without looking at data

From Petersohn et al. VLDB 2020

Transpose!

69 of 72

What was the point?

  • Relational algebra forms the basics of data processing, i.e., “querying” your data
  • As you can see, a small # of operators lets you do a LOT!
  • Relational databases (and many other data systems) are built on relational algebra
  • Understanding and being able to effectively use relational algebra sets you up for
    • being able to query your data
    • quantifying the capabilities of data systems

69

70 of 72

Sets vs. Bags

  • So far, we have focused on set-oriented relational algebra
  • There is also an analogous “bag”-oriented relational algebra
    • Where you can have multiple copies of each tuple
    • Multi-set instead of a set
  • Why does this matter?
    • Most data systems (rel. databases included) use bags instead of sets for most operations
    • Easier to implement, no need to constantly check for multiple copies
    • The multiple copies may signify some meaning

70

71 of 72

Operations on Bags

  • Selection: preserve # of occurrences
    • Roughly as fast as with sets
  • Projection: preserve # of occurrences
    • Faster than sets, no elimination of duplicates
    • Projection is a very common operation, so we want this to be fast
  • Cartesian product, join: preserve # of occurrences
    • Every copy “matches” with every copy
    • Roughly as fast as with sets

71

72 of 72

Operations on Bags

  • Union {a, b, b, c} U {a, b, c, d} = {a, a, b, b, b, c, c, d}
    • Add number of occurrences
    • Faster than sets, no need to look for duplicate tuples
  • Difference {a, a, a, b, c} — {a, b, b} = {a, a, c}
    • Subtract number of occurrences
    • Roughly as fast as with sets
    • Intersection is similar

72