1 of 44

Introduction to Data Management�DSC 100

Lecture 4 Part 1:

Relational Algebra

DSC 100

1

2 of 44

Relational Algebra

DSC 100

2

3 of 44

Relational Algebra

  • In SQL we say what we want
  • In RA we can express how to get it
  • RA = set-at-a-time algebra for relations

  • Every DBMS implementations converts a SQL query to RA in order to execute it
  • An RA expression is called a query plan

DSC 100

3

4 of 44

Basics

  • Inputs: Relations (with attributes)
  • RA: defines a function on relations
    • Returns a relation
    • Can be composed together
    • Often displayed using a tree rather than linearly
    • Use Greek symbols: σ, π, δ, etc

DSC 100

4

5 of 44

Sets v.s. Bags

  • Sets: {a,b,c}, {a,d,e,f}, { }, . . .
  • Bags: {a, a, b, c}, {b, b, b, b, b}, . . .

Relational Algebra has two flavors:

  • Set semantics = standard Relational Algebra
  • Bag semantics = extended Relational Algebra

DB systems implement bag semantics (Why?)

DSC 100

5

6 of 44

Relational Algebra Operators

  • Union , intersection , difference -
  • Selection σ
  • Projection π
  • Cartesian product ×, join
  • (Rename ρ)
  • Duplicate elimination δ
  • Grouping and aggregation ɣ
  • Sorting 𝛕

DSC 100

6

RA

Extended RA

All operators take in 1 or 2 relations as inputs �and return another relation

7 of 44

Union and Difference

DSC 100

7

What do they mean over bags ?

R1 ∪ R2

R1 – R2

Only make sense if R1, R2 have the same schema

8 of 44

What about Intersection ?

  • Derived operator using minus

  • Derived using join

DSC 100

8

R1 ∩ R2 = R1 – (R1 – R2)

R1 ∩ R2 = R1 ⨝ R2

9 of 44

Selection

  • Returns all tuples which satisfy a condition

  • Examples
    • σSalary > 40000 (Employee)
    • σname = “Smith” (Employee)
  • The condition c can be =, <, <=, >, >=, <>�combined with AND, OR, NOT

DSC 100

9

σc(R)

10 of 44

DSC 100

10

σSalary > 40000 (Employee)

SSN

Name

Salary

1234545

John

20000

5423341

Smith

60000

4352342

Fred

50000

SSN

Name

Salary

5423341

Smith

60000

4352342

Fred

50000

Employee

11 of 44

Projection

  • Eliminates columns

  • Example: project social-security number and names:
    • πSSN, Name (Employee) 🡪 Answer(SSN, Name)

DSC 100

11

π A1,…,An (R)

Different semantics over sets or bags! Why?

12 of 44

DSC 100

12

π Name,Salary (Employee)

SSN

Name

Salary

1234545

John

20000

5423341

John

60000

4352342

John

20000

Name

Salary

John

20000

John

60000

John

20000

Employee

Name

Salary

John

20000

John

60000

Bag semantics

Set semantics

Which is more efficient?

13 of 44

Composing RA Operators

DSC 100

13

no

name

zip

disease

1

p1

98125

flu

2

p2

98125

heart

3

p3

98120

lung

4

p4

98120

heart

Patient

σdisease=‘heart’(Patient)

no

name

zip

disease

2

p2

98125

heart

4

p4

98120

heart

zip

disease

98125

flu

98125

heart

98120

lung

98120

heart

πzip,disease(Patient)

πzip,disease(σdisease=‘heart’(Patient))

zip

disease

98125

heart

98120

heart

14 of 44

Cartesian Product

  • Each tuple in R1 with each tuple in R2

  • Rare in practice; mainly used to express joins

DSC 100

14

R1 × R2

15 of 44

Cross-Product Example

DSC 100

15

Name

SSN

John

999999999

Tony

777777777

Employee

EmpSSN

DepName

999999999

Emily

777777777

Joe

Dependent

Employee × Dependent

Name

SSN

EmpSSN

DepName

John

999999999

999999999

Emily

John

999999999

777777777

Joe

Tony

777777777

999999999

Emily

Tony

777777777

777777777

Joe

16 of 44

Renaming

  • Changes the schema, not the instance

  • Example:
    • Given Employee(Name, SSN)
    • ρN, S(Employee) 🡪 Answer(N, S)

DSC 100

16

ρB1,…,Bn (R)

17 of 44

Natural Join

  • Meaning: R1⨝ R2 = ΠA(σθ (R1 × R2))

  • Where:
    • Selection σθ checks equality of all common attributes (i.e., attributes with same names)
    • Projection ΠA eliminates duplicate common attributes

DSC 100

17

R1 ⨝ R2

18 of 44

Natural Join Example

DSC 100

18

A

B

X

Y

X

Z

Y

Z

Z

V

B

C

Z

U

V

W

Z

V

A

B

C

X

Z

U

X

Z

V

Y

Z

U

Y

Z

V

Z

V

W

R

S

R S =

ΠABC(σR.B=S.B(R × S))

19 of 44

Natural Join Example 2

DSC 100

19

age

zip

disease

54

98125

heart

20

98120

flu

AnonPatient P

Voters V

P V

name

age

zip

Alice

54

98125

Bob

20

98120

age

zip

disease

name

54

98125

heart

Alice

20

98120

flu

Bob

20 of 44

Natural Join

  • Given schemas R(A, B, C, D), S(A, C, E), what is the schema of R ⨝ S ?

  • Given R(A, B, C), S(D, E), what is R ⨝ S?

  • Given R(A, B), S(A, B), what is R ⨝ S?

DSC 100

20

21 of 44

Introduction to Data Management�DSC 100

Lecture 4 Part 2:

Relational Algebra

DSC 100

21

22 of 44

Theta Join

  • A join that involves a predicate

  • Here θ can be any condition
  • No projection in this case!
  • For our voters/patients example:

DSC 100

22

R1 ⨝θ R2 = σθ (R1× R2)

P ⨝ P.zip = V.zip and P.age >= V.age -1 and P.age <= V.age +1 V

AnonPatient (age, zip, disease)�Voters (name, age, zip)

23 of 44

Equijoin

  • A theta join where θ is an equality predicate

  • By far the most used variant of join in practice
  • What is the relationship with natural join?

DSC 100

23

R1 ⨝θ R2 = σθ (R1 × R2)

24 of 44

Equijoin Example

DSC 100

24

age

zip

disease

54

98125

heart

20

98120

flu

AnonPatient P

Voters V

P P.age=V.age V

name

age

zip

p1

54

98125

p2

20

98120

P.age

P.zip

P.disease

V.name

V.age

V.zip

54

98125

heart

p1

54

98125

20

98120

flu

p2

20

98120

25 of 44

Natural Join Example

DSC 100

25

age

zip

disease

54

98125

heart

20

98120

flu

AnonPatient P

Voters V

P V

name

age

zip

p1

54

98125

p2

20

98120

age

zip

disease

name

V.age

V.zip

54

98125

heart

p1

54

98125

20

98120

flu

p2

20

98120

26 of 44

Join Summary

  • Theta-join: R ⨝θ S = σθ (R × S)
    • Join of R and S with a join condition θ
    • Cross-product followed by selection θ
    • No projection
  • Equijoin: R ⨝θ S = σθ (R × S)
    • Join condition θ consists only of equalities
    • No projection
  • Natural join: R ⨝ S = πAθ (R × S))
    • Equality on all fields with same name in R and in S
    • Projection πA drops all redundant attributes

DSC 100

26

27 of 44

So Which Join Is It ?

When we write R ⨝ S we usually mean an equijoin, but we often omit the equality predicate when it is clear from the context

DSC 100

27

28 of 44

More Joins

  • Outer join
    • Include tuples with no matches in the output
    • Use NULL values for missing attributes
    • Does not eliminate duplicate columns

  • Variants
    • Left outer join
    • Right outer join
    • Full outer join

DSC 100

28

29 of 44

Outer Join Example

DSC 100

29

age

zip

disease

54

98125

heart

20

98120

flu

33

98120

lung

AnonPatient P

P J

P.age

P.zip

P.disease

J.job

J.age

J.zip

54

98125

heart

lawyer

54

98125

20

98120

flu

cashier

20

98120

33

98120

lung

null

null

null

AnnonJob J

job

age

zip

lawyer

54

98125

cashier

20

98120

30 of 44

Some Examples

Supplier(sno,sname,scity,sstate)

Part(pno,pname,psize,pcolor)

Supply(sno,pno,qty,price)

Name of supplier of parts with size greater than 10

πsname(Supplier ⨝ (Supply ⨝ (σpsize>10 (Part)))

Name of supplier of red parts or parts with size greater than 10

πsname(Supplier ⨝ (Supply ⨝ (σ psize>10 (Part) ∪ σpcolor=‘red’ (Part) ) ) )

πsname(Supplier ⨝ (Supply ⨝ (σ psize>10 pcolor=‘red’ (Part) ) ) )

Can be represented as trees as well

DSC 100

30

31 of 44

Some Examples

Supplier(sno,sname,scity,sstate)

Part(pno,pname,psize,pcolor)

Supply(sno,pno,qty,price)

Name of supplier of parts with size greater than 10

Project[sname](Supplier Join[sno=sno] � (Supply Join[pno=pno] (Select[psize>10](Part))))

Name of supplier of red parts or parts with size greater than 10

Project[sname](Supplier Join[sno=sno]

(Supply Join[pno=pno]

((Select[psize>10](Part)) Union � (Select[pcolor=‘red’](Part)))

Project[sname](Supplier Join[sno=sno] (Supply Join[pno=pno] � (Select[psize>10 OR pcolor=‘red’](Part))))

Can be represented as trees as well

31

32 of 44

Representing RA Queries as Trees

Supplier(sno,sname,scity,sstate)

Part(pno,pname,psize,pcolor)

Supply(sno,pno,qty,price)

32

Part

Supply

σpsize>10

πsname

Answer

Supplier

SELECT z.sname

FROM Part x, Supply y, Supplier z

WHERE x.psize > 10� and x.pno = y.pno� and y.sno = z.sno

πsname(Supplier ⨝ (Supply ⨝ (σpsize>10 (Part)))

33 of 44

Relational Algebra Operators

  • Union , intersection , difference -
  • Selection σ
  • Projection π
  • Cartesian product X, join
  • (Rename ρ)
  • Duplicate elimination δ
  • Grouping and aggregation ɣ
  • Sorting 𝛕

DSC 100

33

RA

Extended RA

All operators take in 1 or 2 relations as inputs �and return another relation

34 of 44

Extended RA: Operators on Bags

  • Duplicate elimination δ
  • Grouping γ
    • Takes in relation and a list of grouping operations (e.g., aggregates). Returns a new relation.
  • Sorting τ
    • Takes in a relation, a list of attributes to sort on, and an order. Returns a new relation.

DSC 100

34

35 of 44

Using Extended RA Operators

DSC 100

35

SELECT city, sum(quantity)

FROM sales

GROUP BY city

HAVING count(*) > 100

T1, T2 = temporary tables

sales(product, city, quantity)

γ city, sum(quantity)→q, count(*) → c

σ c > 100

Π city, q

Answer

T1(city,q,c)

T2(city,q,c)

36 of 44

Typical Plan for a Query (1/2)

DSC 100

36

R

S

join condition

σselection condition

πfields

join condition

SELECT-PROJECT-JOIN

Query

Answer

SELECT fields

FROM R, S, …

WHERE condition

37 of 44

Typical Plan for a Query (1/2)

DSC 100

37

πfields

ɣfields, sum/count/min/max(fields)

σhaving condition

σwhere condition

join condition

SELECT fields

FROM R, S, …

WHERE condition

GROUP BY fields

HAVING condition

38 of 44

How about Subqueries?

DSC 100

38

Supplier(sno,sname,scity,sstate)

Part(pno,pname,psize,pcolor)

Supply(sno,pno,price)

SELECT Q.sno

FROM Supplier Q

WHERE Q.sstate = ‘WA’ � and not exists

(SELECT *

FROM Supply P� WHERE P.sno = Q.sno� and P.price > 100)

39 of 44

How about Subqueries?

DSC 100

39

SELECT Q.sno

FROM Supplier Q

WHERE Q.sstate = ‘WA’ � and not exists

(SELECT *

FROM Supply P� WHERE P.sno = Q.sno� and P.price > 100)

Correlation !

Supplier(sno,sname,scity,sstate)

Part(pno,pname,psize,pcolor)

Supply(sno,pno,price)

40 of 44

How about Subqueries?

DSC 100

40

SELECT Q.sno

FROM Supplier Q

WHERE Q.sstate = ‘WA’ � and not exists

(SELECT *

FROM Supply P� WHERE P.sno = Q.sno� and P.price > 100)

De-Correlation

SELECT Q.sno

FROM Supplier Q

WHERE Q.sstate = ‘WA’� and Q.sno not in

(SELECT P.sno

FROM Supply P� WHERE P.price > 100)

Supplier(sno,sname,scity,sstate)

Part(pno,pname,psize,pcolor)

Supply(sno,pno,price)

41 of 44

How about Subqueries?

DSC 100

41

SELECT Q.sno

FROM Supplier Q

WHERE Q.sstate = ‘WA’� and Q.sno not in

(SELECT P.sno

FROM Supply P� WHERE P.price > 100)

(SELECT Q.sno

FROM Supplier Q

WHERE Q.sstate = ‘WA’)

EXCEPT� (SELECT P.sno

FROM Supply P� WHERE P.price > 100)

EXCEPT = set difference

Supplier(sno,sname,scity,sstate)

Part(pno,pname,psize,pcolor)

Supply(sno,pno,price)

Un-nesting

42 of 44

How about Subqueries?

DSC 100

42

(SELECT Q.sno

FROM Supplier Q

WHERE Q.sstate = ‘WA’)

EXCEPT� (SELECT P.sno

FROM Supply P� WHERE P.price > 100)

Supply

σsstate=‘WA’

Supplier

σPrice > 100

Finally…

πsno

πsno

Supplier(sno,sname,scity,sstate)

Part(pno,pname,psize,pcolor)

Supply(sno,pno,price)

43 of 44

Summary of RA and SQL

  • SQL = a declarative language where we say what data we want to retrieve
  • RA = an algebra where we say how we want to retrieve the data
  • Theorem: SQL and RA can express exactly the same class of queries

DSC 100

43

RDBMS translate SQL 🡪 RA, then optimize RA

44 of 44

Summary of RA and SQL

  • SQL (and RA) cannot express ALL queries that we could write in, say, Java
  • Example:
    • Parent(p,c): find all descendants of ‘Alice’
    • No RA query can compute this!
    • This is called a recursive query
  • Next lecture: Datalog is an extension that can compute recursive queries

DSC 100

44