1 of 64

Relational Algebra

1

Data 101, Fall 2025 @ UC Berkeley

Aditya Parameswaran

2 of 64

Announcements

  • Project 0 is out soon
    • All assignments are due at 5pm, not midnight!
  • Class OH, discussions start next week
  • Data Scholars tutoring starts next week

2

3 of 64

Recall: Colloquial terminology

Table/Relation: Collection of tuples w/a predefined collection of attributes.

Schema: The structure, format, or scaffolding defining a relation.

  • Relation name, attribute names, attribute domains
  • (Domains, for now, similar to “data type”)

Database schema: The structure, format, �or scaffolding defining a collection of relations, �e.g., in a database.

3

We will now define these terms more rigorously with Relational Algebra.

name

artist

album

peak

22

Taylor Swift

Red

20

Blinding Lights

The Weeknd

After Hours

1

360

Charli XCX

Brat

1

Shelter

Porter Robinson

Shelter (single)

16

Table “songs”

Column | Type

----------|---------

name | text

artist | text

album | text

peak | integer

Relation Schema

songs(name TEXT, artist TEXT, � album TEXT, peak INTEGER, …)

Relation “instance”

4 of 64

4

The Relational Data Model

Projection, Selection, Renaming

Unary Operator Practice

Binary Operator: Cartesian Product

Set Operations

Theta Join (and Equi Join)

Natural Join

From RA to Data Systems

From RA to Data Systems: Sets vs. Bags

Outline

5 of 64

Relational Algebra

Algebra (formulaic operations on variables) is core to several disciplines:

  • Elementary algebra is the foundation of arithmetic and calculus
  • Linear Algebra is the foundation of matrix manip, and ⇒ machine learning (ML)

Relational Algebra (RA) is the theory of operations that help us transform relations.

  • Relations are the operands!

5

RA is the core of query optimization, which we will cover later in the semester.

Relational Algebra is the foundation of the relational data model.�A grounding in RA allows us:

  • To understand/quantify the capabilities of data systems.
  • To provide a “baseline” of functionality, and
  • To study the impact of deviations from the data model and algebra.

6 of 64

A Bit of History

RA introduced in a seminal paper by Edgar F. Codd in 1970.

Became the basis for the relational database revolution and many subsequent data systems.

  • SQL: first commercial language using Codd’s model.
  • SQL as ANSI standard in 1986

6

Edgar F. Codd

7 of 64

The formal Relational Data Model

Relation: A set of tuples with a predefined set of attributes.

  • Each attribute has a type, called domain.
  • Each domain is atomic (“simple as possible”): string, integer, etc.
  • Each tuple in the relation has values for each attribute.

Schema: The structure, format, or scaffolding defining a relation.

  • Relation name, attribute names, attribute domains

Instance: A relation with values “filled in”, a specific instantiation

7

A Relation = Schema + Instance.

8 of 64

Set are Unordered: Equivalent Representations

Relations are sets of tuples w/ sets of attribs.

We can reorder both and not change the relation:

8

name

artist

album

peak

22

Taylor Swift

Red

20

Blinding Lights

The Weeknd

After Hours

1

360

Charli XCX

Brat

1

Shelter

Porter Robinson

Shelter (single)

16

name

peak

artist

album

Blinding Lights

1

The Weeknd

After Hours

360

Charli XCX

Brat

1

Shelter

16

Porter Robinson

Shelter (single)

22

20

Taylor Swift

Red

=

9 of 64

Relational Algebra Notation

9

Songs (name String, artist � String, album String,� peak Integer, …)

relation name

relation schema

relation name

relation schema

10 of 64

10

The Relational Data Model

Projection, Selection, Renaming

Unary Operator Practice

Binary Operator: Cartesian Product

Set Operations

Theta Join (and Equi Join)

Natural Join

From RA to Data Systems

From RA to Data Systems: Sets vs. Bags

Outline

11 of 64

Relational Algebra Primitive Operations

There are six primitive RA operations, from which all others can be built:

  • selection
  • projection
  • renaming
  • cartesian product (or simply, cross product)
  • union
  • difference

11

12 of 64

Relational Algebra Primitive Operations

Let’s first focus on unary operators:

  • Operators that
    • take in exactly one operand, relation R
    • and output relation S

There are six primitive RA operations, from which all others can be built:

  • selection
  • projection
  • renaming
  • cartesian product (or simply, cross product)
  • union
  • difference

12

As we cover these next three operators, try brainstorming the corresponding SQL keywords/queries!

13 of 64

Projection: Cross Out Columns

Operation: Keep only columns A1, …, An�

Input: Output:�

Notes:

  • Must select from columns already �present in R.

13

Example:

Column | Type

-----------------+--------

title_id | text

primary_title | text

Titles

Column | Type

-----------------+--------

title_id | text

type | text

primary_title | text

original_title | text

is_adult | bigint

premiered | bigint

ended | bigint

runtime_minutes | bigint

genres | text

same tuples, but with attributes removed

14 of 64

Selection: Cross out rows

Operation: Filter out rows, i.e., keep only rows that satisfy some condition C�

Input: Output: �

Notes:

  • Output has same schema as input
  • Condition C can involve attributes in �R with boolean expressions: �=, >, <, AND, OR, etc.

14

Example:

Column | Type -----------+-------- � person_id | text

name | text

born | bigint

died | bigint

People

Column | Type -----------+-------- � person_id | text

name | text

born | bigint

died | bigint

rows where born > 1980

15 of 64

Renaming: Rename attributes and/or relation name

Operation: Change not the data, but the metadata (relation name, attribute names)

Input: Output: �

15

Notes:

  • If we don’t want to rename the relation name R to S, then drop S from the notation.
  • The → based notation only lists the attributes to rename, rest stay same.

equivalently

Example:

Persons

Column | Type -----------+-------- � person_id | text

name | text

birth | bigint

death | bigint

People

Column | Type -----------+-------- � person_id | text

name | text

born | bigint

died | bigint

equivalently

16 of 64

Relational Algebra so far

Basic unary operations:

  • Projection: Filter relation R to only include � attributes Bi1, Bi2, …, Bin.��
  • Selection: Filter relation R for tuples that � match a condition C.

  • Renaming: Rename attributes Bi1, Bi2, …, Bin � to A1, A2, …, An. Could also � rename relation name itself.

Relational Algebra (RA) is the theory of operations that transform relations.

  • In RA, relations are the operands!

16

Change metadata (schema)?

❌�

17 of 64

17

The Relational Data Model

Projection, Selection, Renaming

Unary Operator Practice

Binary Operator: Cartesian Product

Set Operations

Theta Join (and Equi Join)

Natural Join

From RA to Data Systems

From RA to Data Systems: Sets vs. Bags

Outline

18 of 64

SFW as Relational Algebra

A SELECT FROM WHERE (SFW) query is the most standard SQL query to extract records from a relation.

18

🤔

A. WHERE: Projection, SELECT: Projection, AS: Renaming

B. WHERE: Projection, SELECT: Selection, AS: Renaming

C. WHERE: Selection, SELECT: Projection, AS: Renaming

D. WHERE: Selection, SELECT: Selection, AS: Renaming

E. Something else

WHERE, SELECT, AS - Which RA operators do these intuitively map to?

SELECT Bi1 AS A1,

Bi2 AS A2,

BiN AS AN

FROM R

WHERE C;

19 of 64

SFW as Relational Algebra

SQL “Order of execution”:

  1. FROM clause: get the table.
  2. WHERE clause: filters for the matching tuples.
  3. SELECT clause: filters for the specified attributes.
  4. Rename attributes with AS.

A SELECT FROM WHERE (SFW) query is the most standard SQL query to extract records from a relation.

19

SELECT Bi1 AS A1,

Bi2 AS A2,

BiN AS AN

FROM R

WHERE C;

Relational Algebra steps:

For the Relation R,

1. Select tuples that �match condition C.

2. Project to include�attributes Bi1, Bi2, …, Bin .

3. Rename attributes �Bi1, Bi2, …, Bin to A1, A2, …, An.

20 of 64

Caution! Terminology

SQL “Order of execution”:

  • FROM clause: get the table.
  • WHERE clause: filters for the matching tuples.
  • SELECT clause: filters for the specified attributes.
  • Rename attributes with AS.

A SELECT FROM WHERE (SFW) query is the most standard SQL query to extract records from a relation.

20

SELECT Bi1 AS A1,

Bi2 AS A2,

BiN AS AN

FROM R

WHERE C;

Relational Algebra steps:

For the Relation R,

1. Select tuples that �match condition C.

2. Project to include�attributes Bi1, Bi2, …, Bin .

3. Rename attributes �Bi1, Bi2, …, Bin to A1, A2, …, An.

⚠️ SQL SELECT is RA projection! ⚠️

This is a decades-old mismatch of terminology!!

21 of 64

Practice: Composing relational algebra operations

How would you write a relational algebra expression that represents the following?

21

title_id | primary_title | premiered

------------+-----------------------------------------+-----------

tt0178116 | Zhenatyy kholostyak | 1982

tt1681370 | The Algerian | 2014

tt3501632 | Thor: Ragnarok | 2017

tt9535620 | Vis à vis: Beyond the Veil | 1998

tt3582172 | Luanshi Jiaren | 1941

Titles

Column | Type

-----------------+--------

title_id | text

type | text

primary_title | text

original_title | text

is_adult | bigint

premiered | bigint

ended | bigint

runtime_minutes | bigint

genres | text

(input relation schema)

(output relation instance)

Return the title_id, primary_title, premiered of all movies (i.e., where type is 'movie').

22 of 64

Practice: Composing relational algebra operations

How would you write a relational algebra expression that represents the following?

22

title_id | primary_title | premiered

------------+-----------------------------------------+-----------

tt0178116 | Zhenatyy kholostyak | 1982

tt1681370 | The Algerian | 2014

tt3501632 | Thor: Ragnarok | 2017

tt9535620 | Vis à vis: Beyond the Veil | 1998

tt3582172 | Luanshi Jiaren | 1941

Titles

Column | Type

-----------------+--------

title_id | text

type | text

primary_title | text

original_title | text

is_adult | bigint

premiered | bigint

ended | bigint

runtime_minutes | bigint

genres | text

(input relation schema)

(output relation instance)

Return the title_id, primary_title, premiered of all movies (i.e., where type is 'movie').

1. selection

2. projection

23 of 64

Practice: Composing relational algebra operations

How would you write a relational algebra expression that represents the following?

23

Titles

Column | Type

-----------------+--------

title_id | text

type | text

primary_title | text

original_title | text

is_adult | bigint

premiered | bigint

ended | bigint

runtime_minutes | bigint

genres | text

(input relation schema)

1. selection

2. projection

Pi_{title_id, primary_title, premiered} (

Sigma_{type = movie} (

Pi_{title_id, primary_title, premiered, type}

(Titles)))

24 of 64

24

25 of 64

25

The Relational Data Model

Projection, Selection, Renaming

Unary Operator Practice

Binary Operator: Cartesian Product

Set Operations

Theta Join (and Equi Join)

Natural Join

From RA to Data Systems

From RA to Data Systems: Sets vs. Bags

Outline

26 of 64

Cross Product (i.e., Cartesian Product, Product)

Operation: Associate each tuple on LHS with RHS�

Inputs: Output: �

Notes:

  • If attributes ,�then rename attributes as .

26

27 of 64

Example: Cross Product

27

0

Apricot

2

Cally

4

Eugene

1

Boots

ragdoll

persian

2

4

bengal

5

1

persian

S

T

id

name

breed

id

S.id

name

breed

T.id

0

Apricot

2

Cally

4

Eugene

2

ragdoll

1

persian

0

Apricot

4

Eugene

1

Boots

1

persian

1

persian

2

ragdoll

2

ragdoll

1

Boots

2

Cally

1

persian

2

ragdoll

4

bengal

0

Apricot

2

Cally

1

Boots

0

Apricot

2

Cally

4

Eugene

1

Boots

4

bengal

4

bengal

5

persian

5

persian

5

persian

5

persian

bengal

4

Eugene

4

continued…

By itself, a cross product is not very meaningful, as it combines records that represent different individuals.

Cross products are composed with other operations to create a richer expression of RA.

28 of 64

28

The Relational Data Model

Projection, Selection, Renaming

Unary Operator Practice

Binary Operator: Cartesian Product

Set Operations

Theta Join (and Equi Join)

Natural Join

From RA to Data Systems

From RA to Data Systems: Sets vs. Bags

Outline

29 of 64

Relational Algebra Primitive Operations

There are six primitive RA operations, from which all others can be built:

  • selection
  • projection
  • renaming
  • cartesian product (or simply, cross product)
  • union
  • difference

29

Fundamental binary operators in set algebra!

30 of 64

Set Operations: Union and Difference

Inputs: Output:

Notes:

  • Unions/difference both combine tuples from different relations.
  • Therefore all input/output relations must have the same schema.
  • Set RA: In union there is one copy of each tuple in the output relation.

Union:

Operation:��All rows in R1 or R2.

If relations are sets, then the union and difference operators work �just like in set algebra.

30

Difference:

Operation:��All rows in R1 and not in R2.

31 of 64

Summary of Relational Algebra

There are six primitive relational algebra (RA) operations, from which all others can be built:

  • selection
  • projection
  • renaming

31

Intersection:

Operation:��All rows that are in� both R1 and R2

  • cross product
  • union
  • difference

All other operations can be derived from these primitives.

Exercise for home: How can you derive intersection from a composition of union and difference operations?

🤔🏡

Summary

So Far

32 of 64

32

The Relational Data Model

Projection, Selection, Renaming

Unary Operator Practice

Binary Operator: Cartesian Product

Set Operations

Theta Join (and Equi Join)

Natural Join

From RA to Data Systems

From RA to Data Systems: Sets vs. Bags

Outline

33 of 64

Summary of Relational Algebra

There are six primitive relational algebra (RA) operations, from which all others can be built:

  • selection
  • projection
  • renaming

33

  • cross product
  • union
  • difference

All other operations can be derived from these primitives.

Joins are therefore also a composition of primitive operations!

We’ll derive two new joins today: theta join and natural join. Try out the others yourself!

Summary

So Far

34 of 64

Theta Join (and Equi Join)

Operation: Join two relations such that rows satisfy the condition Θ

More formally: A cross product of R1 and R2, followed by a selection on Θ

Inputs: Output:

34

Notes:

  • We call this join an equi join if Θ is an equality condition.
  • Like cross product, prefixes distinguish common attributes in output schema,�e.g., if .

35 of 64

Theta Join (specifically, Equi Join) SQL Example

35

SELECT *�FROM Crew, People�WHERE � Crew.person_id = � People.person_id;

or equivalently

SELECT *�FROM Crew� INNER JOIN People� ON Crew.person_id = � People.person_id;

Crew

Column | Type -----------+-------- � title_id | text

person_id | text

category | text

job | text

People

Column | Type -----------+--------

person_id | text

name | text

born | bigint

died | bigint

Column | Type -----------------+--------

title_id | text

Crew.person_id | text

category | text

job | text

People.person_id| text

name | text

born | bigint

died | bigint

(using RA notation for attributes)

Theta Join is SQL inner join!

36 of 64

Composing Theta Join Using RA Primitives

Operation: Join two relations such that rows satisfy the condition Θ.

With primitives: A cross product of R1 and R2, followed by a selection on Θ.

Theta Join is an explicit composition of primitive RA operations:

36

🤔

Which of the following expressions are equivalent to this statement?

G.

H.

I.

J. Something else

37 of 64

Composing Theta Join Using RA Primitives

Operation: Join two relations such that rows satisfy the condition Θ.

With primitives: A cross product of R1 and R2, followed by a selection on Θ.

Theta Join is an explicit composition of primitive RA operations. From before:

37

This composition is the formal definition of theta join.

38 of 64

38

The Relational Data Model

Projection, Selection, Renaming

Unary Operator Practice

Binary Operator: Cartesian Product

Set Operations

Theta Join (and Equi Join)

Natural Join

From RA to Data Systems

From RA to Data Systems: Sets vs. Bags

Outline

39 of 64

A new join “naturally” arises!

This is a very common operation!

39

Crew

Column | Type -----------+-------- � title_id | text

person_id | text

category | text

job | text

People

Column | Type -----------+--------

person_id | text

name | text

born | bigint

died | bigint

equi join on person_id

Column | Type -----------------+--------

title_id | text

Crew.person_id | text

category | text

job | text

People.person_id| text

name | text

born | bigint

died | bigint

project/�rename

Column | Type -----------------+--------

title_id | text

person_id | text

category | text

job | text � name | text

born | bigint

died | bigint

40 of 64

A new join “naturally” arises!

This is a very common operation! Often times:

  • Join on same attributes having same values (i.e., equi join on all same attributes)
  • Then, drop one of the duplicate columns and rename the attributes back to

40

Enter the natural join!

Crew

Column | Type -----------+-------- � title_id | text

person_id | text

category | text

job | text

People

Column | Type -----------+--------

person_id | text

name | text

born | bigint

died | bigint

equi join on person_id

Column | Type -----------------+--------

title_id | text

Crew.person_id | text

category | text

job | text

People.person_id| text

name | text

born | bigint

died | bigint

project/�rename

Column | Type -----------------+--------

title_id | text

person_id | text

category | text

job | text � name | text

born | bigint

died | bigint

41 of 64

Natural Join

Operation: Join two relations such that rows are matched on shared� attributes, with no duplicate attribute renaming.

�Inputs: Output:

41

with duplicate attributes removed�from schema

42 of 64

Example: Natural Join in SQL

42

Crew

Column | Type -----------+-------- � title_id | text

person_id | text

category | text

job | text

People

Column | Type -----------+--------

person_id | text

name | text

born | bigint

died | bigint

natural join

Column | Type -----------------+--------

title_id | text

person_id | text

category | text

job | text � name | text

born | bigint

died | bigint

SELECT� title_id,� Crew.person_id AS person_id,� category,� job,� name,� born,� died FROM Crew� INNER JOIN People� ON Crew.person_id � = People.person_id;

SELECT *�FROM Crew� NATURAL JOIN People;

🥹

equivalently

43 of 64

Composing Natural Join

  1. Cross product.
  2. Match tuples based on values of shared attributes.
  3. Rename back to for each .
  4. Drop .

43

Crew

Column | Type -----------+-------- � title_id | text

person_id | text

category | text

job | text

People

Column | Type -----------+--------

person_id | text

name | text

born | bigint

died | bigint

natural join

Column | Type -----------------+--------

title_id | text

person_id | text

category | text

job | text � name | text

born | bigint

died | bigint

1.Cross Product

2. Match tuples

3. Rename same attributes

4. Drop duplicate attributes

Like Theta Join, Natural Join is an explicit composition of RA primitives:

44 of 64

Exercise for Home

Suppose we have the following relations:

44

What is an equivalent operation for the following natural joins?

🤔

  • Union
  • Difference
  • Intersection
  • Cross product
  • Something else
  • None of the above

45 of 64

Exercise for Home

Suppose we have the following relations:

Natural Join as RA primitives:

  • Cross product.
  • Match tuples based on values of shared attributes.
  • Rename back to for each .
  • Drop .

45

What is an equivalent operation for the following natural joins?

C. intersection

R and S have same schema, so any rows that don’t exactly match will be dropped!

46 of 64

Exercise for Home

2.

Suppose we have the following relations:

Natural Join as RA primitives:

  • Cross product.
  • Match tuples based on values of shared attributes.
  • Rename back to for each .
  • Drop .

46

What is an equivalent operation for the following natural joins?

No shared attributes!

  • No rows are dropped from #2
  • No columns are dropped from #3, #4

D. cross product

47 of 64

Summary of Relational Algebra

There are six primitive relational algebra (RA) operations, from which all others can be built:

  • selection
  • projection
  • renaming

47

  • cross product
  • union
  • difference

All other operations can be derived from these primitives:

  • Theta Join, Equi Join
  • Natural Join
  • Intersection (on your own)
  • Other joins
    • For more: outer join, extended RA.

Summary

So Far

48 of 64

48

The Relational Data Model

Projection, Selection, Renaming

Unary Operator Practice

Binary Operator: Cartesian Product

Set Operations

Theta Join (and Equi Join)

Natural Join

From RA to Data Systems

From RA to Data Systems: Sets vs. Bags

Outline

49 of 64

Briefly, Three Types of Data Systems

Let’s ground the utility of RA by using it to understand real systems.

(We’ll cover these systems in more depth later). For now, don’t need to know details or API/syntax—just that they exist.

49

Dataframes

Relational Databases

Spark

50 of 64

Briefly, Three Types of Data Systems

Let’s ground the utility of RA by using it to understand real systems.

(We’ll cover these systems in more depth later). For now, don’t need to know details or API/syntax—just that they exist.

50

Dataframes

Relational Databases

Abstraction/API to represent, clean, and structure your data

  • Incremental, composable operations
  • Embedded into common PLs (pandas: “Python Data Analysis Library”)
  • Originally emerged in the late 70s when statisticians wanted to operate on heterogeneous datasets

Spark

51 of 64

Briefly, Three Types of Data Systems

Let’s ground the utility of RA by using it to understand real systems.

(We’ll cover these systems in more depth later). For now, don’t need to know details or API/syntax—just that they exist.

51

Dataframes

Relational Databases

Abstraction/API to represent, clean, and structure your data

  • Incremental, composable operations
  • Embedded into common PLs (pandas: “Python Data Analysis Library”)
  • Originally emerged in the late 70s when statisticians wanted to operate on heterogeneous datasets

Traditional, time-tested solutions to managing data at scale

  • SQL
  • Declarative
  • Originated in the 70s via parallel efforts at IBM and UC Berkeley

Spark

52 of 64

Briefly, Three Types of Data Systems

Let’s ground the utility of RA by using it to understand real systems.

(We’ll cover these systems in more depth later). For now, don’t need to know details or API/syntax—just that they exist.

52

Dataframes

Relational Databases

Spark

Abstraction/API to represent, clean, and structure your data

  • Incremental, composable operations
  • Embedded into common PLs (pandas: “Python Data Analysis Library”)
  • Originally emerged in the late 70s when statisticians wanted to operate on heterogeneous datasets

Traditional, time-tested solutions to managing data at scale

  • SQL
  • Declarative
  • Originated in the 70s via parallel efforts at IBM and UC Berkeley

Distributed, parallel computing framework

  • Supports various interfaces, all of which are relation-like:
    • Pandas-like DataFrame API, Spark DataFrame API, SQL
  • Originally developed at UC Berkeley in 2012

53 of 64

Expressing Projection across Data Systems

53

Dataframes

Relational Databases

Spark

Titles[['title_id', 'primary_title']]

or equivalently

Titles.drop(columns=['type','original_title',…,'genres'])

SELECT title_id, primary_title

FROM titles;

Titles.select('title_id', 'primary_title')

54 of 64

Expressing Selection across Data Systems

54

Dataframes

Relational Databases

Spark

People[People[‘born’] > 1980]

SELECT *

FROM people

WHERE born > 1980;

People.filter("born > 1980")

or equivalently

People.filter(People.born > 1980)

or equivalently

People.where("born > 1980")

55 of 64

Expressing Renaming across Data Systems (attributes only for now)

55

Dataframes

Relational Databases

Spark

People.rename(columns={'born': 'birth', 'died': 'death'})

SELECT person_id, name, born AS birth, died AS death

FROM people;

People

.withColumnRenamed('born','birth')

.withColumnRenamed('died','death')

Renaming relations: In Spark/pandas, use assignment (=) operator. In SQL, use DML (more next time).

56 of 64

Expressing Cross Product across Data Systems

56

Dataframes

Relational Databases

Spark

SELECT *

FROM titles, crew;

Titles.crossJoin(Crew)

Crew

Column | Type -----------+-------- � title_id | text

person_id | text

category | text

job | text

Titles

Column | Type

-----------------+--------

title_id | text

type | text

primary_title | text

original_title | text

is_adult | bigint

premiered | bigint

ended | bigint

runtime_minutes | bigint

genres | text

Titles.merge(Crew, how=”cross”)

57 of 64

Expressing Union across Data Systems

57

Dataframes

Relational Databases

Spark

pd.concat([Title2019, Title2023],axis=0)

(SELECT * FROM titles_2019)�UNION�(SELECT * FROM titles_2023);

Titles2019.union(Titles2023)

58 of 64

Expressing Theta Join Across Data Systems

58

Dataframes

Relational Databases

Spark

Not supported, but instead hybrid theta join + natural join with pd.merge:

Titles.merge(Crew, on='title_id')

  • Equi join on title_id,then drop one copy of column
  • (Note: pd.join slightly different)

SELECT *�FROM Crew� INNER JOIN People� ON Crew.person_id = People.person_id;

Crew.join(People, [Crew.person_id == People.person_id])

59 of 64

Expressing Natural Join Across Data Systems

59

Dataframes

pd.merge(Titles, Crew)

Relational Databases

Spark

SELECT *

FROM Crew

NATURAL JOIN People;

(No convenient way; use theta join w/ other operators)

60 of 64

60

The Relational Data Model

Projection, Selection, Renaming

Unary Operator Practice

Binary Operator: Cartesian Product

Set Operations

Theta Join (and Equi Join)

Natural Join

From RA to Data Systems

From RA to Data Systems: Sets vs. Bags

Outline

61 of 64

What was the point?

Relational algebra forms the basics of data processing

  • Relational databases (and many other data systems)�are built on relational algebra.
  • A small set of six primitive operators lets you do a LOT!

→ understanding RA lets you quantify the capabilities of data systems.

Small caveat:

  • Focused on set-oriented relational algebra, i.e., relations = as sets of tuples.
  • Most data systems use bags instead of sets for most operations!

61

62 of 64

Operations on Bags

Briefly, bags are sets that allow duplicate elements. Why use bags?

  • Easier to implement; no need to constantly check for multiple copies.
  • Also—multiple copies may signify some meaning, depending on your domain!

62

Bag operation

Perf vs. Sets

Comments

Selection

Preserve # of occurrences

Roughly as fast

Projection

"

Faster than sets

Common operation (SQL SELECT). Want this to be fast!

No checking of dupes

Cartesian Product

"

Roughly as fast

Every copy “matches” with your copy. Includes join

Union

Add # of occurrences

Faster than sets

No checking of dupes

Difference

Subtract # of occurrences

Roughly as fast

Both require checking dupes

Renaming

(unchanged)

63 of 64

SQL: Bag/Set operations with subqueries

Bag (i.e., multiset) operations for union, difference, and intersection, respectively:

  • <subquery> UNION ALL <subquery>
  • <subquery> EXCEPT ALL <subquery>
  • <subquery> INTERSECT ALL <subquery>

Enforce set behavior instead

  • <subquery> UNION <subquery>
  • <subquery> EXCEPT <subquery>
  • <subquery> INTERSECT <subquery>
  • Effectively SELECT DISTINCT … on the bag operation

63

Example: Find members of Crew that are not in People:

(SELECT person_id FROM Crew)

EXCEPT ALL

(SELECT person_id FROM People)

A subquery is a parenthesized SQL query. More next time!

64 of 64

Summary of Relational Algebra

There are six primitive relational algebra (RA) operations, from which all others can be built:

  • selection
  • projection
  • renaming

64

  • cross product
  • union
  • difference

All other operations can be derived from these primitives:

  • Theta Join, Equi Join
  • Natural Join
  • Other joins (on your own)
    • For more: outer join, extended RA.
  • Intersection (on your own)

While RA is defined on sets, many data systems actually implement RA with bags. It’s faster and supports duplicates!

Summary

So Far