SQL Review
August 29, 2023
Data 101, Fall 2023 @ UC Berkeley
Lisa Yan https://fa23.data101.org/
1
LECTURE 02
Join at slido.com�#3737223
ⓘ
Click Present with Slido or install our Chrome extension to display joining instructions for participants while presenting.
PostgreSQL
PostgreSQL
SELECT FROM WHERE
Multi-Relation Queries
IMDb Exercise
[Bonus] IMDb Exercise 2
SFW as Relational Algebra
3
Lecture 02, Data 101 Fall 2023
Reminder: SQL: Structured Query Language (or “sequel”)
High level data transformation language
Declarative rather than imperative
The core functionality of SQL is restricted in comparison to, say, Python.
4
3737223
PostgreSQL: General Overview
PostgreSQL is a free, open-source database management system (DBMS) that is SQL-compliant (Note: SQL is an ISO standard [Oracle blog post]).
5
postgreSQL client
titles
akas
…
postgreSQL server
postgresql://jovyan@127.0.0.1:5432/postgres
episodes
ratings
3737223
A bit of PostgreSQL history
PostgreSQL is an object-relational database management system (ORDBMS) based on POSTGRES, Version 4.2, developed at the University of California at Berkeley Computer Science Department. POSTGRES pioneered many concepts that only became available in some commercial database systems much later.
6
Jean-Auguste-Dominique Ingres, Napoléon Ier sur le trône impérial. Oil on canvas. 1806. [wikipedia]
Michael Stonebraker, Professor Emeritus and Turing Award winner
Eugene Wong, Professor Emeritus, government technology advisory committee service in U.S., Hong Kong, and Taiwan
PostgreSQL has fantastic documentation that we will refer to throughout this semester:
3737223
Relation = Schema + Instance
Schema: The structure, format, or scaffolding defining a relation.
A relation is a collection of tuples (records) with a predefined collection of attributes.
Instance: Specific instantiation of the Relation
7
Schema = one type of metadata; Instance = data. More on this later!
Songs (name String, artist � String, album String,� peak Integer, …)
name | artist | album | peak |
22 | Taylor Swift | Red | 20 |
Blinding Lights | The Weeknd | After Hours | 1 |
Truly Madly Deeply | Savage Garden | Savage Garden | 1 |
Shelter | Porter Robinson | Shelter (single) | 16 |
… | … | … | … |
Attribute/ Column
Tuple/Row/
Record
3737223
Today: Exploring the Internet Movie Database (IMDb)
This is a real subset of the IMDB dataset!
Overall schema: https://www.imdb.com/interfaces/
Let’s explore this data using psql, the PostgreSQL interactive terminal program! (psql official documentation)
8
You can also test out queries in interactive mode. But all your submittable work should be in Jupyter Notebooks!
Note: If you are trying this demo by yourself, make sure to run the first few cells in the notebook to load in the imdb database into the postgres server. Then launch psql.
Demo
3737223
SELECT FROM WHERE
PostgreSQL
SELECT FROM WHERE
Multi-Relation Queries
IMDb Exercise
[Bonus] IMDb Exercise 2
SFW as Relational Algebra
9
Lecture 02, Data 101 Fall 2023
The Data 100 cheatsheet
SELECT [DISTINCT] <column expression list>
FROM <table>
[WHERE <predicate>]
[GROUP BY <column list>]
[HAVING <predicate>]
[ORDER BY <column list>]
[LIMIT <number of rows>]
[OFFSET <number of rows>];
10
Summary So Far
The Data 100 cheatsheet
SELECT [DISTINCT] <column expression list>
FROM <table>
[WHERE <predicate>]
[GROUP BY <column list>]
[HAVING <predicate>]
[ORDER BY <column list>]
[LIMIT <number of rows>]
[OFFSET <number of rows>];
11
Over the next two weeks, we will explore SQL queries in depth.
Our goal is not to reteach the keywords you learned from Data 100/SQLite3, but instead to provide you vocabulary and paradigms/abstractions that explain each query component.
Our second goal is to teach you some postgres!
Today:
(review) SQL queries,
(new) Relational Algebra Part I
Thursday:
(new) Relational Algebra Part II
Next Tuesday:
(review) Grouping,
(new) subqueries
Next Thursday:
(new) Window functions
Summary So Far
SFW Queries
A SELECT FROM WHERE (SFW) query is the most standard SQL query to extract records from a relation.
To read a SFW SQL query:
12
(We’ll refine this overview slide in a bit.)
Convention:
NB: keywords are technically case-insensitive.
SELECT *
FROM titles
WHERE premiered = 2023;
SELECT premiered, genres
FROM titles;
3737223
More on SELECT, WHERE
SELECT can employ:
WHERE
13
SELECT primary_title, type,
premiered AS release_year,
runtime_minutes,
runtime_minutes/60 AS � runtime_hours�FROM titles
WHERE premiered >= 2020 AND� premiered <= 2023;
⚠️ runtime_hours doesn’t handle fractions!
Demo
3737223
CAST and Data Types
CAST: Convert one type to another for calculation.
Postgres Data Types: Documentation Ch. 15
14
SELECT primary_title, type,
premiered AS release_year,
runtime_minutes,
CAST(runtime_minutes AS � DOUBLE PRECISION)/60 AS � runtime_hours�FROM titles
WHERE premiered >= 2020 AND� premiered <= 2023;
Demo
3737223
[Tip] Debugging
Postgres will return the tuples in an unspecified but deterministic order. Sometimes, seeing the first few rows aren’t enough to know whether your query is executing as expected.
To see 10 random rows from the result query:
SELECT <attributes>
FROM <tables>
WHERE <condition about tuples in tables>
ORDER BY RANDOM()
LIMIT 10;
15
(We’ll return to other notions of randomness in a few lectures.)
Return only the first 10 rows.
Randomly shuffle the rows. (How? See RANDOM() docs)
Demo
3737223
Let’s Kick It Up a Notch [reference]
Consider the people table.
How do we compute the age of all people? Define age as:
16
Which expression(s) should we use in the blank? Select all that may apply.
🤔
SELECT
person_id, name, � died, born,
______ AS age
FROM people;
Which expression(s) should we use in the blank? Select all that may apply.
Exercise
3737223
Which expression(s) should we use in the blank? Select all that may apply.
ⓘ
Click Present with Slido or install our Chrome extension to activate this poll while presenting.
Let’s Kick It Up a Notch Solution
Solution
Consider the people table.
How do we compute the age of all people? Define age as:
18
SELECT
person_id, name,
died, born,
CASE WHEN died IS NULL
THEN 2023 - born
ELSE died - born
END AS age
FROM people;
Exercise
3737223
Null values are not like other values
Tuples can have “NULL” values for some attributes. Common reasons:
19
3737223
Null values are not like other values
Tuples can have “NULL” values for some attributes. Common reasons:
Recall Data 100: We need to be careful about how we drop/impute NULLs!
20
3737223
Null values are not like other values
Tuples can have “NULL” values for some attributes. Common reasons:
Recall Data 100: We need to be careful about how we drop/impute NULLs!
21
(We’ll return to NULL when we talk about aggregation.)
SELECT born
FROM people;
SELECT born
FROM people�WHERE born < 2023 OR
born IS NULL;
SELECT born
FROM people�WHERE born < 2023;
equivalent
not equivalent
❌
3737223
String matching and COUNT(*)
Small primer on boolean expressions with strings:
Use LIKE or NOT LIKE:
Chapter 9.7 documentation: Pattern Matching
22
(More on string matching and aggregations next week!�Spoiler: Regular expressions return)
SELECT COUNT(*)
FROM people
WHERE name LIKE 'Chris%';
COUNT is an aggregation function. Here, it counts the number of tuples in the result.
'abc' LIKE 'abc' true
'abc' LIKE 'a%' true
'abc' LIKE '_b_' true
'abc' LIKE 'c' false
3737223
Multi-Relation Queries
PostgreSQL
SELECT FROM WHERE
Multi-Relation Queries
IMDb Exercise
[Bonus] IMDb Exercise 2
SFW as Relational Algebra
23
Lecture 02, Data 101 Fall 2023
SFW Queries with Multiple Relations
Update to the SELECT FROM WHERE (SFW) query:
To read a SFW SQL query:
24
SELECT *
FROM
akas, titles
WHERE
titles.title_id = � akas.title_id;
3737223
Omitting AS will still rename attributes
Update to the SELECT FROM WHERE (SFW) query:
To read a SFW SQL query:
25
SELECT *
FROM
akas, titles
WHERE
titles.title_id =
akas.title_id;
SELECT *
FROM
akas AS A,� titles T
WHERE
A.title_id = T.title_id;
Syntax short cut with AS: omitting the AS keyword is allow and will still rename!
3737223
FROM multiple relations + WHERE = Inner Join
Recall the equivalent inner join:
(Likely no performance difference; see this StackOverflow response.)
Cross join with WHERE expression
26
More next week! For now, see discussion handout to briefly review left join, right join, and outer join.
SELECT *
FROM
akas A,
titles T
WHERE
A.title_id = T.title_id
SELECT *
FROM akas A
INNER JOIN titles T
ON A.title_id = T.title_id
3737223
IMDb Exercise
PostgreSQL
SELECT FROM WHERE
Multi-Relation Queries
IMDb Exercise
[Bonus] IMDb Exercise 2
SFW as Relational Algebra
27
Lecture 02, Data 101 Fall 2023
Michelle Yeoh
How do we write a query that gets the�titles and IDs of Michelle Yeoh movies?
28
Michelle Yeoh (Malaysia)
3737223
Now your turn!
How do we write a query that gets the�titles and IDs of Michelle Yeoh movies?
29
🤔
Strategy point 1: Look at the schemas. Which tables have the information that we need?�(Hint: use \d meta-command in psql)
Which tables should we join together? Select all that apply.
A. akas
B. crew
C. episodes
D. people
E. ratings
F. titles
Exercise
3737223
Which tables should we join together? Select all that apply.
ⓘ
Click Present with Slido or install our Chrome extension to activate this poll while presenting.
Now your turn!
How do we write a query that gets the�titles and IDs of Michelle Yeoh movies?
31
Strategy point 1: Look at the schemas. Which tables have the information that we need? (Hint: use \d meta-command in psql)
Strategy point 2: Think about joins as modifying records:
Exercise
3737223
Strategy Point 2 in action
32
SELECT *
FROM titles
INNER JOIN crew
ON crew.title_id = � titles.title_id
INNER JOIN people
ON people.person_id =
crew.person_id;
Strategy point 2: Think about joins as modifying records:
Row per movie
Row per crew member in a movie, with movie title, etc.
Row per crew member in a movie, with movie title, person name, etc.
expand
extend
Exercise
3737223
Now your turn!
How do we write a query that gets the�titles and IDs of Michelle Yeoh movies?
33
Strategy point 1: Look at the schemas. Which tables have the information that we need? (Hint: use \d meta-command in psql)
Strategy point 2: Think about joins as modifying records:
Strategy point 3: Think about WHERE, first, then SELECT.
Exercise
3737223
Strategy Point 3 in action
34
SELECT *��
FROM titles
INNER JOIN crew
ON crew.title_id = � titles.title_id
INNER JOIN people
ON people.person_id =
crew.person_id
WHERE people.name = 'Michelle Yeoh'
AND titles.type = 'movie';
Row per crew member in a movie, with movie title, person name, etc.
Keep the rows with Michelle Yeoh as a crew member.
Exercise
3737223
Strategy Point 3 in action
35
SELECT DISTINCT titles.primary_title,
titles.title_id�
FROM titles
INNER JOIN crew
ON crew.title_id = � titles.title_id
INNER JOIN people
ON people.person_id =
crew.person_id
WHERE people.name = 'Michelle Yeoh'
AND titles.type = 'movie';
If Michelle Yeoh played multiple roles, there would be duplicates, so remove with DISTINCT.
Keep the rows with Michelle Yeoh as a crew member.
Got to here. Will cover the rest in Lecture 03
Exercise
3737223
[Bonus] IMDb Exercise 2
PostgreSQL
SELECT FROM WHERE
Multi-Relation Queries
IMDb Exercise
[Bonus] IMDb Exercise 2
SFW as Relational Algebra
36
Lecture 02, Data 101 Fall 2023
Extending the Query
How do we write a query that gets the names of Michelle Yeoh movies�that have a rating of at least 7.0?
37
CREATE VIEW yeoh_movies AS (
SELECT DISTINCT
titles.primary_title,
titles.title_id
FROM titles
INNER JOIN crew
ON crew.title_id = titles.title_id
INNER JOIN people
ON people.person_id = crew.person_id
WHERE people.name = 'Michelle Yeoh'
AND titles.type = 'movie'
);
3737223
Extending the Query
How do we write a query that gets the names of Michelle Yeoh movies�that have a rating of at least 7.0?
38
SELECT primary_title
FROM ratings
INNER JOIN yeoh_movies
ON ratings.title_id = yeoh_movies.title_id
WHERE rating >= 7.0;
3737223
Quick Peek: The Natural Join
39
SELECT *
FROM ratings
INNER JOIN yeoh_movies
ON ratings.title_id =
yeoh_movies.title_id
WHERE rating >= 7.0;
title_id | rating | votes | primary_title | title_id
-----------+--------+--------+--------------------------------+-----------
tt0190332 | 7.9 | 268227 | Crouching Tiger, Hidden Dragon | tt0190332
tt0397535 | 7.4 | 132457 | Memoirs of a Geisha | tt0397535
tt0190332 | 7.8 | 241690 | Crouching Tiger, Hidden Dragon | tt0190332
(3 rows)
SELECT *
FROM ratings
NATURAL JOIN yeoh_movies
WHERE rating >= 7.0;
title_id | rating | votes | primary_title
-----------+--------+--------+--------------------------------
tt0190332 | 7.9 | 268227 | Crouching Tiger, Hidden Dragon
tt0397535 | 7.4 | 132457 | Memoirs of a Geisha
tt0190332 | 7.8 | 241690 | Crouching Tiger, Hidden Dragon
(3 rows)
Natural Join “naturally” joins on same attributes and removes duplicates!
More next time!
3737223
[the rest is Lecture 03 material]
PostgreSQL
SELECT FROM WHERE
Multi-Relation Queries
IMDb Exercise
[Bonus] IMDb Exercise 2
SFW as Relational Algebra
40
Lecture 02, Data 101 Fall 2023
Now your turn!
How do we write a query that gets the�titles and IDs of Michelle Yeoh movies?
41
Strategy point 1: Look at the schemas. Which tables have the information that we need? (Hint: use \d meta-command in psql)
Strategy point 2: Think about joins as modifying records:
Strategy point 4: Consider what makes a query readable based on the originally posed question.
Strategy point 3: Think about WHERE, first, then SELECT.
(Lecture 03 material)
Exercise
3737223
Strategy 4 in action
How do we write a query that gets the�titles and IDs of Michelle Yeoh movies?
42
SELECT DISTINCT � titles.primary_title,
titles.title_id
FROM titles
INNER JOIN crew
ON crew.title_id = � titles.title_id
INNER JOIN people
ON people.person_id =
crew.person_id
WHERE people.name = 'Michelle Yeoh'
AND titles.type = 'movie';
Joins are commutative, so these queries are equivalent!
SELECT DISTINCT � titles.primary_title,
titles.title_id
FROM crew
INNER JOIN titles
ON crew.title_id = � titles.title_id
INNER JOIN people
ON people.person_id =
crew.person_id
WHERE people.name = 'Michelle Yeoh'
AND titles.type = 'movie';
3737223
Strategy 4 in action
How do we write a query that gets the�titles and IDs of Michelle Yeoh movies?
43
SELECT DISTINCT � titles.primary_title,
titles.title_id
FROM titles
INNER JOIN crew
ON crew.title_id = � titles.title_id
INNER JOIN people
ON people.person_id =
crew.person_id
WHERE people.name = 'Michelle Yeoh'
AND titles.type = 'movie';
However, stylistically the left is more readable.
SELECT DISTINCT � titles.primary_title,
crew.title_id
FROM crew
INNER JOIN titles
ON crew.title_id = � titles.title_id
INNER JOIN people
ON people.person_id =
crew.person_id
WHERE people.name = 'Michelle Yeoh'
AND titles.type = 'movie';
3737223
SFW as Relational Algebra
PostgreSQL
SELECT FROM WHERE
Multi-Relation Queries
IMDb Exercise
[Bonus] IMDb Exercise 2
SFW as Relational Algebra
44
Lecture 02, Data 101 Fall 2023
Preview of Next Time: Relational Data Model and Algebra
Algebra (i.e., formulaic operations on variables) is the core of many technical disciplines:
45
RA is the core of query optimization, which we will cover later in the semester.
For now, let’s see how to use RA to represent SFW queries.
Similarly, Relational Algebra (RA) is the foundation of the relational data model.�A firm grounding in RA allows us:
3737223
Relational Schema & Instance The Relational Data Model
A relation is a collection set of tuples (records) with a predefined collection set of attributes.
Schema: The structure, format, or scaffolding defining a relation.
Instance: Specific instantiation of the Relation
46
Same slide as before!, But note that we are dealing with sets (more on bags later).
Songs (name String, artist � String, album String,� peak Integer, …)
name | artist | album | peak |
22 | Taylor Swift | Red | 20 |
Blinding Lights | The Weeknd | After Hours | 1 |
Truly Madly Deeply | Savage Garden | Savage Garden | 1 |
Shelter | Porter Robinson | Shelter (single) | 16 |
… | … | … | … |
3737223
Set are Unordered: Equivalent Representations
Since a relation has a set of attributes and a set of tuples,�we can reorder both and not change the relation:
47
name | artist | album | peak |
22 | Taylor Swift | Red | 20 |
Blinding Lights | The Weeknd | After Hours | 1 |
Truly Madly Deeply | Savage Garden | Savage Garden | 1 |
Shelter | Porter Robinson | Shelter (single) | 16 |
… | … | … | … |
name | peak | artist | album |
Blinding Lights | 1 | The Weeknd | After Hours |
Truly Madly Deeply | 1 | Savage Garden | Savage Garden |
Shelter | 16 | Porter Robinson | Shelter (single) |
22 | 20 | Taylor Swift | Red |
… | … | … | … |
=
3737223
Relational Algebra
Relational Algebra (RA) is the theory of operations that help us transform relations.
Unary operations are those that transform a single relation R with schema R(B1, B2, …, Bm):
Basic unary operations:
48
No RA symbols for now. More next time!
Change metadata (schema)?
❌��
✅�
✅
3737223
SFW as Relational Algebra
To read a SFW SQL query:
A SELECT FROM WHERE (SFW) query is the most standard SQL query to extract records from a relation.
49
SELECT Bi1 AS A1,
Bi2 AS A2,
…
BiN AS AN
FROM R
WHERE C;
3737223
SFW as Relational Algebra
To read a SFW SQL query:
A SELECT FROM WHERE (SFW) query is the most standard SQL query to extract records from a relation.
50
SELECT Bi1 AS A1,
Bi2 AS A2,
…
BiN AS AN
FROM R
WHERE C;
🤔
WHERE, SELECT, AS - Which RA operators do these respectively map to?
3737223
SFW as Relational Algebra
A SELECT FROM WHERE (SFW) query is the most standard SQL query to extract records from a relation.
To read a SFW SQL query:
51
SELECT Bi1 AS A1,
Bi2 AS A2,
…
BiN AS AN
FROM R
WHERE C;
Relational Algebra steps:
For the Relation R,
1. Select tuples that match condition C.
2. Project to include�attributes Bi1, Bi2, …, Bin .
3. Rename attributes �Bi1, Bi2, …, Bin to A1, A2, …, An.
3737223
SFW as Relational Algebra
A SELECT FROM WHERE (SFW) query is the most standard SQL query to extract records from a relation.
To read a SFW SQL query:
52
Relational Algebra steps:
For the Relation R,
1. Select tuples that match condition C.
2. Project to include�attributes Bi1, Bi2, …, Bin .
3. Rename attributes �Bi1, Bi2, …, Bin to A1, A2, …, An.
SELECT Bi1 AS A1,
Bi2 AS A2,
…
BiN AS AN
FROM R
WHERE C;
⚠️SQL SELECT is Relational Algebra projection! ⚠️ This is a decades-old mismatch of terminology. More next time!
3737223