1 of 45

Data Engineering

SQL

1

2 of 45

Next: Doing more with Select From Where

Basic Form:

SELECT attributes

FROM tables

WHERE condition about tuples in tables

2

3 of 45

Extending SELECT

SELECT can employ:

  • Renamed attributes

id

race

gender

age

warning

citation

arrest

17213

asian

F

23

False

True

False

1

white

M

45

True

False

False

2

black

M

37

False

False

False

19

hispanic

F

58

False

True

False

SELECT id AS number, arrest AS arrested FROM Stops

4 of 45

Extending SELECT

SELECT can employ:

  • Renamed attributes
  • Expressions

id

race

gender

age

warning

citation

arrest

17213

asian

F

23

False

True

False

1

white

M

45

True

False

False

2

black

M

37

False

False

False

19

hispanic

F

58

False

True

False

SELECT id, age/10 AS ageDecade, arrest

FROM Stops

5 of 45

Extending SELECT

SELECT can employ:

  • Renamed attributes
  • Expressions
  • Case statements
  • Many more functions! (String, date-time, …)

id

race

gender

age

warning

citation

arrest

17213

asian

F

23

False

True

False

1

white

M

45

True

False

False

2

black

M

37

False

False

False

19

hispanic

F

58

False

True

False

SELECT id, CASE

WHEN age > 60 THEN ‘senior’

WHEN age <=60 AND age > 20 THEN ‘adult’

ELSE ‘juvenile’

END AS age_type,

arrest FROM Stops

6 of 45

Extending WHERE

WHERE can employ:

  • Boolean connectives

id

race

gender

age

warning

citation

arrest

17213

asian

F

23

False

True

False

1

white

M

45

True

False

False

2

black

M

37

False

False

False

19

hispanic

F

58

False

True

False

SELECT * FROM Stops

WHERE gender != “M” AND age > 30

7 of 45

Extending WHERE

WHERE can employ:

  • Boolean connectives
  • String matching
    • % matches any values
    • _ matches a single char
  • Many other functions (as in FROM)!

id

race

gender

age

warning

citation

arrest

17213

asian

F

23

False

True

False

1

white

M

45

True

False

False

2

black

M

37

False

False

False

19

hispanic

F

58

False

True

False

SELECT * FROM Stops

WHERE race LIKE ‘%a%’

8 of 45

How to Read a SQL Query

  • Start with the FROM clause: lists the tables
    • (so far only one, we’ll get to more soon)
  • Apply the selection in the WHERE clause
  • Apply the projection in the SELECT clause
  • Convention: keywords caps, attribs small, rel first name caps
    • NB: keywords are case-insensitive

8

9 of 45

Recap: The Basic Form: Select From Where

Basic Form:

SELECT attributes

FROM tables

WHERE condition about tuples in tables

9

10 of 45

Recap: Multi-relation Queries

  • List them in the FROM clause
  • Like in relational algebra, refer to attributes as <relation>.<attribute> if needed
  • Can also use “AS” to rename them if needed (OK to omit as well, still allowed)

10

id

race

location

age

warning

citation

arrest

17213

asian

MacArthur

23

False

True

False

1

white

West Oakland

45

True

False

False

2

black

West Oakland

37

False

False

False

19

hispanic

Civic Center

58

False

True

False

location

zipcode

MacArthur

94621

West Oakland

94609

Civic Center

94612

 

SELECT * FROM Stops, Zips WHERE Stops.location = Zips.location

SELECT * FROM Stops AS S, Zips AS Z WHERE S.location = Z.location

SELECT * FROM Stops S, Zips Z WHERE S.location = Z.location

If only race and zipcode are desired:

SELECT race, zipcode FROM Stops AS S, Zips AS Z WHERE S.location = Z.location

11 of 45

Recap: Reading a SQL Query w/ Mult. Tables

  • Start with the FROM clause
    • Take the cross-product of tables listed
  • Apply the selection in the WHERE clause
  • Apply the projection in the SELECT clause

11

12 of 45

Recap

  • SQL queries:
    • Select From Where and its semantics
    • Create Table As, CTE, Views, MVs
    • LIKE, AS, *, %, …
    • Single and multiple relations
    • Next: subqueries

12

13 of 45

Subqueries

  • A parenthesized SQL query statement (a subquery) can be used as a value in various places
  • First version: if a subquery returns a single tuple with a single attribute value, it can be treated as a scalar in expressions
    • Runtime error if used incorrectly

13

14 of 45

Subquery as a Scalar

  • First version: if a subquery returns a single tuple with a single attribute value, it can be treated as a scalar in expressions
  • Q: First, without subqueries, collect all the stops that happened at the same location as id = 123 (including id = 123)

14

id

race

location

arrest

17213

asian

MacArthur

False

1

white

West Oakland

False

2

black

West Oakland

False

19

hispanic

Civic Center

False

15 of 45

Subquery as a Scalar

  • First version: if a subquery returns a single tuple with a single attribute value, it can be treated as a scalar in expressions
  • Q: First, without subqueries, collect all the stops that happened at the same location as id = 123 (including id = 123)

SELECT S1.id, S1.race, S1.location, S1.arrest

FROM Stops as S1, Stops as S2

WHERE S1.location = S2.location AND S2.id = 123

This is known as a self-join – two (or more) copies of the same relation

15

id

race

location

arrest

17213

asian

MacArthur

False

1

white

West Oakland

False

2

black

West Oakland

False

19

hispanic

Civic Center

False

16 of 45

Subquery as a Scalar

  • First version: if a subquery returns a single tuple with a single attribute value, it can be treated as a scalar in expressions
  • Now, with subqueries, collect all the stops that happened at the same location as id = 123 (including id = 123)

SELECT S1.id, S1.race, S1.location, S1.arrest

FROM Stops S1

WHERE S1.location = (SELECT S2.location FROM Stops S2 WHERE S2.id = 123)

16

id

race

location

arrest

17213

asian

MacArthur

False

1

white

West Oakland

False

2

black

West Oakland

False

19

hispanic

Civic Center

False

17 of 45

Subqueries

  • A parenthesized SQL query statement (a subquery) can be used as a value in various places
  • First version: if a subquery returns a single tuple with a single attribute value, it can be treated as a scalar in expressions
    • Runtime error if used incorrectly
  • Second version: checking if a query is non-empty via EXISTS

17

18 of 45

Subquery as a Set: EXISTS

  • EXISTS <relation> is true iff <relation> is not empty
    • NOT EXISTS is opposite
    • Can appear in WHERE clauses
  • Determine the locations in Stops that don’t have a corresponding zipcode in Zips:

SELECT Stops.location FROM Stops

WHERE NOT EXISTS

(SELECT * FROM Zips WHERE Zips.location = Stops.location)

  • Known as a correlated subquery because it uses variables from the outer query.
    • Scoping rules apply! Be careful – best to rename all relations to unique names.
  • Think of this as a “function call” per tuple arising as a result of the FROM clause
    • Can be expensive

18

id

race

location

arrest

17213

asian

MacArthur

False

1

white

West Oakland

False

2

black

West Oakland

False

19

hispanic

Civic Center

False

location

zipcode

MacArthur

94621

West Oakland

94609

Civic Center

94612

19 of 45

EXISTS Exercise

Determine all the Stops that are the only one in their zipcode

19

id

race

location

arrest

17213

asian

MacArthur

False

1

white

West Oakland

False

2

black

West Oakland

False

19

hispanic

Civic Center

False

location

zipcode

MacArthur

94621

West Oakland

94609

Civic Center

94612

SELECT * FROM Stops S1 NATURAL JOIN Zips Z1

WHERE NOT EXISTS

(SELECT * FROM Stops S2 NATURAL JOIN Zips Z2 WHERE Z1.zipcode = Z2.zipcode AND S1.id != S2.id)

20 of 45

Subqueries

  • A parenthesized SQL query statement (a subquery) can be used as a value in various places
  • First version: if a subquery returns a single tuple with a single attribute value, it can be treated as a scalar in expressions
    • Runtime error if used incorrectly
  • Second version: checking if a query is non-empty via EXISTS
  • Third version: comparison against a set of values via IN/ANY/ALL
    • We’ll skip this…

20

21 of 45

Recap

  • SQL queries:
    • SFW and its semantics
    • CTAS, CTE, Views, MVs
    • LIKE, AS, *, %, …
    • Single and multiple relations
    • Subqueries
    • Next: multiset operations

21

22 of 45

Multiset/bag Operations

  • Union, difference, intersection are expressed as follows:
    • (subquery) UNION ALL (subquery)
    • (subquery) EXCEPT ALL (subquery)
    • (subquery) INTERSECT ALL (subquery)
  • Find locations in Stops that are not in Zips:

(SELECT location FROM Stops)

EXCEPT ALL

(SELECT location FROM Zips)

22

23 of 45

Multiset/bag Operations

  • Union, difference, intersection are expressed as follows:
    • (subquery) UNION ALL (subquery)
    • (subquery) EXCEPT ALL (subquery)
    • (subquery) INTERSECT ALL (subquery)
  • Can also enforce set behavior:
    • (subquery) UNION (subquery)
    • (subquery) EXCEPT (subquery)
    • (subquery) INTERSECT (subquery)
  • As we saw previously, can require any SQL query to return a set via
    • SELECT DISTINCT ….

23

24 of 45

Recap

  • SQL queries:
    • SFW and its semantics
    • CTAS, CTE, Views, MVs
    • LIKE, AS, *, %, …
    • Single and multiple relations
    • Subqueries
    • Multiset operations
    • Next: aggregation

24

25 of 45

Aggregations

  • SUM, MAX, MIN, COUNT, AVG can be applied to a column in a SELECT clause to produce that aggregation
  • COUNT (*) is a special syntax to count number of tuples

  • Number of Stops:
    • SELECT COUNT(*) FROM Stops
  • Max and average ages of Stops:
    • SELECT MAX(age), AVG (age) FROM Stops

25

26 of 45

Removing Duplicates

  • Adding DISTINCT removes duplicates prior to aggregation
  • Count the number of distinct locations in Stops:
    • SELECT COUNT(DISTINCT location) FROM Stops

26

27 of 45

Detour: Null Values

  • Tuples can have “NULL” values for attributes
    • Either missing (exists but unknown)
    • or inapplicable (value doesn’t apply)
  • Need to be careful in how to deal with NULLs.

  • NULLs do not satisfy conditions. For tuple id 1
    • age > 40 evaluates to FALSE
    • age <= 40 evaluates to FALSE
    • Interpret as: if the age is unknown, you can’t for sure tell whether it is >40 or <=40
  • Leads to some funky behavior that you wouldn’t expect from Boolean logic:
    • SELECT * FROM Stops WHERE age > 40 OR age <= 40
    • Will not have tuple id 1 as part of the output

  • If we want all tuples, we need to explicitly test for NULLs
    • SELECT * FROM Stops WHERE age > 40 OR age <= 40 OR age IS NULL
  • For more on extending Boolean logic to NULLs, see three-valued logic (TRUE, FALSE, UNKNOWN)

27

id

race

location

age

arrest

17213

asian

MacArthur

23

False

1

white

West Oakland

NULL

False

2

black

NULL

37

False

19

NULL

Civic Center

58

False

28 of 45

NULLs in Aggregation

  • NULL values are not involved in aggregation
  • But if there are no non-NULL values, result will be a NULL

  • Number of Stops
    • SELECT COUNT (*) FROM Stops;
  • Number of Stops with a non-null location
    • SELECT COUNT (location) FROM Stops;
  • Average of non-null ages
    • SELECT AVG (age) FROM Stops;

28

id

race

location

age

arrest

17213

asian

MacArthur

23

False

1

white

West Oakland

NULL

False

2

black

NULL

37

False

19

NULL

Civic Center

58

False

29 of 45

Grouping

  • In some cases, we may want to compute an aggregate for each “group” of tuples as opposed to an overall COUNT, MAX or SUM

  • You do so by adding a GROUP BY clause after S-F-W
    • Example: find average and minimum ages for each location

SELECT location, AVG(age) AS avgage, MIN (age) as minage

FROM Stops GROUP BY location;

29

id

race

location

age

arrest

17213

asian

MacArthur

23

False

1

white

West Oakland

NULL

False

2

black

NULL

37

False

19

NULL

Civic Center

58

False

30 of 45

Visualizing Grouping

SELECT location, AVG(age) AS avgage, MIN (age) as minage

FROM Stops GROUP BY location;

30

id

loc

age

MacArthur

West Oakland

MacArthur

West Oakland

Civic Center

id

loc

age

MacArthur

MacArthur

id

loc

age

West Oakland

West Oakland

id

loc

age

Civic Center

loc

avgage

minage

MacArthur

West Oakland

Civic Center

31 of 45

Grouping (contd.)

  • Relations:
    • Zips (location, zipcode)
    • Stops (id, race, location, arrest)

  • Q: Count the number of stops per zipcode

31

id

race

location

arrest

17213

asian

MacArthur

False

1

white

West Oakland

False

2

black

West Oakland

False

19

hispanic

Civic Center

False

location

zipcode

MacArthur

94621

West Oakland

94609

Civic Center

94612

32 of 45

Grouping (contd.)

  • Relations:
    • Zips (location, zipcode)
    • Stops (id, race, location, arrest)

  • Q: Count the number of stops per zipcode

SELECT zipcode, COUNT (*)

FROM Stops, Zips

WHERE Stops.location = Zips.location

GROUP BY zipcode

32

id

race

location

arrest

17213

asian

MacArthur

False

1

white

West Oakland

False

2

black

West Oakland

False

19

hispanic

Civic Center

False

location

zipcode

MacArthur

94621

West Oakland

94609

Civic Center

94612

33 of 45

Restriction of SELECT list with aggregation

  • If aggregation is used, then each element of the SELECT clause must either be:
    • An aggregate, or
    • An attribute in the GROUP BY list
  • Q: Why this restriction?
    • If an attribute is not being aggregated or being grouped, then you need some way to “squish” the values down per group.

33

34 of 45

Exercise

  • Stops (id, race, age, location, arrest)
  • Q: Find the ids of the stops with the oldest individuals

34

id

race

location

age

arrest

17213

asian

MacArthur

23

False

1

white

West Oakland

NULL

False

2

black

NULL

37

False

19

NULL

Civic Center

58

False

35 of 45

Exercise

  • Stops (id, race, age, location, arrest)
  • Q: Find the ids of the stops with the oldest individuals

SELECT s.id FROM Stops

WHERE s.age >=

(SELECT MAX(s2.age) FROM Stops)

35

id

race

location

age

arrest

17213

asian

MacArthur

23

False

1

white

West Oakland

NULL

False

2

black

NULL

37

False

19

NULL

Civic Center

58

False

36 of 45

Exercise

  • For West Oakland and Rockridge individually, compute the average ages of stops across various races
    • Output schema: (race, west_oakland_avg, rockridge_avg)
    • Challenge: we can’t simply apply a WHERE clause across the board
    • Hint: use CASE!

37 of 45

Exercise

  • For West Oakland and Rockridge individually, compute the average ages of stops across various races

SELECT race,

AVG (CASE WHEN location = ‘West Oakland’ THEN age ELSE NULL END) AS west_oakland_avg,

AVG (CASE WHEN location = ‘Rockridge’ THEN age ELSE NULL END) AS rockridge_avg

FROM Stops GROUP BY Race

Look up FILTERS in Aggregates for an alternate syntax

38 of 45

Sophisticated Grouping: Percentiles

  • PERCENTILE_DISC(percentileWITHIN GROUP (ORDER BY sort_list)

  • Returns median age of stopped people
    • SELECT PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY age) FROM Stops

  • Others: standard deviation, co/variance, cumulative distribution, regression slope, correlation coefficient, …

39 of 45

HAVING Clauses

  • HAVING <condition> may follow a GROUP BY clause
  • If so, the condition is applied to each group, and groups not satisfying the condition are eliminated
  • Example: Compute the locations with at least 30 stops

SELECT location, COUNT (*)

FROM Stops

GROUP BY location

HAVING COUNT (*) > 30

39

40 of 45

Restrictions on HAVING Clauses

  • Similar to SELECT clauses: each attribute mentioned must either be part of the GROUP BY or be aggregated

40

41 of 45

General Form of GROUPING

SELECT S

FROM R1, R2, …

WHERE C1

GROUP BY A1, A2, …

HAVING C2

  • S and C2 can contain A1, A2, … or any other aggregated attributes
  • C1: any condition

41

42 of 45

General Form of GROUPING

SELECT S

FROM R1, R2, …

WHERE C1

GROUP BY A1, A2, …

HAVING C2

Order of evaluation:

    • Compute “FROM”: compute the cross product of R1, R2, …
    • For each tuple from 1, keep only those that satisfy C1
    • Group by A1, A2, …
      1. For each group, compute all aggregates needed in C2 and S
    • For each group, check if C2 is satisfied
    • If so: add to output based on S

42

First

Second

Third

Fourth

Fifth

43 of 45

Recap

  • SQL queries:
    • SFW and its semantics
    • CTAS, CTE, Views, MVs
    • LIKE, AS, *, %, …
    • Single and multiple relations
    • Subqueries
    • Multiset operations
    • Aggregation
    • Next: Ordering

43

44 of 45

Ordering: ORDER BY

  • Use an ORDER BY clause to enforce ordering of the result
  • ORDER BY <attr> ASC | DESC
  • Order Stops by descending age:

SELECT * FROM Stops

ORDER BY age DESC;

  • Order Stops by descending age: then ascending id per age

SELECT * FROM Stops

ORDER BY age DESC, id ASC;

ASC can be omitted – ascending by default

44

45 of 45

Restrict output: LIMIT & OFFSET

  • Sometimes you want to limit the result to a few tuples via LIMIT <val>
  • Order Stops by descending age, return top 15

SELECT *

FROM Stops

ORDER BY age DESC LIMIT 15;

  • And sometimes you want to start the output at a particular point via OFFSET <val>
  • Order Stops by descending age, return positions 11 to 15

SELECT *

FROM Stops

ORDER BY age DESC LIMIT 5 OFFSET 10

45