1 of 18

Writing Results Out

January 31, 2024

Data 101/Info 258, Spring 2024 @ UC Berkeley

Aditya Parameswaran https://data101.org/sp24

1

LECTURE 04

2 of 18

Warmup

What type of join is the below?

Describe what the query is doing in plain English.

2

SELECT P2.name, P2.born

FROM People AS P1,

People AS P2

WHERE P1.born > P2.born

AND P1.name = 'Tom Hanks';

🤔

3 of 18

Warmup

What type of join is the below?

Describe what the query is doing in plain English.

3

SELECT P2.name, P2.born

FROM People AS P1,

People AS P2

WHERE P1.born > P2.born

AND P1.name = 'Tom Hanks';

Names and birth dates for people who were born before Tom Hanks

This is called a self join.

4 of 18

Set/Multiset operations in SQL

Set/Multiset Operations in SQL

Creating Tables, Views, and CTEs

4

Lecture 04, Data 101/Info 258 Spring 2024

5 of 18

Bag operations and set operations

Bag (i.e., multiset) operations for union, difference, and intersection, respectively:

  • (subquery) UNION ALL (subquery)
  • (subquery) EXCEPT ALL (subquery)
  • (subquery) INTERSECT ALL (subquery)

5

Example: Find members of Crew that are not in People:

(SELECT person_id FROM Crew)

EXCEPT ALL

(SELECT person_id FROM People)

A subquery is a parenthesized SQL query. More in a bit!

6 of 18

Bag operations and set operations

Bag (i.e., multiset) operations for union, difference, and intersection, respectively:

  • (subquery) UNION ALL (subquery)
  • (subquery) EXCEPT ALL (subquery)
  • (subquery) INTERSECT ALL (subquery)

Enforce set behavior:

  • (subquery) UNION (subquery)
  • (subquery) EXCEPT (subquery)
  • (subquery) INTERSECT (subquery)

Reminder: DISTINCT keyword removes duplicates ,which effectively requires SQL query to return a set of tuples:

SELECT DISTINCT …

6

Example: Find members of Crew that are not in People:

(SELECT person_id FROM Crew)

EXCEPT ALL

(SELECT person_id FROM People)

A subquery is a parenthesized SQL query. More in a bit!

7 of 18

Today: The Stanford Open Policing Dataset

7

8 of 18

Today: The Stanford Open Policing Dataset

Partnership between Computational Journalism lab and Computational Policy lab

  • 200 million records of traffic stop and search data since 2015
  • Releasing this data in a clean form is hard! Not standardized; incomplete demographics; not always releasable

8

9 of 18

Today: The Stanford Open Policing Dataset

Partnership between Computational Journalism lab and Computational Policy lab

  • 200 million records of traffic stop and search data since 2015
  • Releasing this data in a clean form is hard! Not standardized; incomplete demographics; not always releasable

Statistical analysis on ~100 million traffic stops from sufficiently detailed data from 21 state patrol agencies, 29 municipal police departments

  • Found significant racial disparities in policing
  • Attempts to disentangle impacts due to differences in driving behavior and officer bias
    • For example, rates of successful contraband searches after the stop

For more info, read the findings/check out the tutorial!

9

We work with a subset of this data for today’s lecture.

10 of 18

Also our collaborator! (Cheryl Philips)

Collaboration between Berkeley (Data Science) + Journalists + Public Defenders

Making sense of messy police misconduct data now becoming available!

10

11 of 18

Also our collaborator! (Cheryl Philips)

Collaboration between Berkeley (Data Science) + Journalists + Public Defenders

Making sense of messy police misconduct data now becoming available!

“ …developing new methodologies in AI, databases, human-computer interaction, and visualization to enable the work. They now routinely meet with data and investigative journalists to sort through the daunting technical challenges. Some of their PhD students worked with investigative reporting students on clustering algorithms, for example, to better identify themes in the policing documents… “

11

12 of 18

Creating Tables, Views, and CTEs

Set/Multiset Operations in SQL

Creating Tables, Views, and CTEs

12

Lecture 04, Data 101/Info 258 Spring 2024

13 of 18

[Summary] How do I save the results of a query?

Sometimes we would like to save the results of a query to use in other queries.

There are four common strategies:

  1. Create a new table
  2. Create a virtual view, i.e., a view
  3. Create an inlined view, i.e., a CTE (Common Table Expression)
  4. Create a materialized view, i.e., MV

13

14 of 18

[1/4] How do I save the results of a query?

Sometimes we would like to save the results of a query to use in other queries.

There are four common strategies:

  • Create a new Table
  • Treated as a regular table after invocation (i.e., creation).
  • If base table changes, we must manually change any derived tables!

14

CREATE TABLE CitationStops AS (

SELECT race, citation

FROM Stops

WHERE citation = True

);

note the parentheses!

no semicolon here!

So, we can create a new table if we do not expect much change to the data in the table.

15 of 18

[2/4] How do I save the results of a query?

Sometimes we would like to save the results of a query to use in other queries.

There are four common strategies:

  • Create a new Table
  • Create a virtual view
  • View for short
  • Output is not stored; computed on demand as part of the query
  • Queried just like a table:� SELECT *� FROM CitationStops;
  • Sometimes defined for access control – more on that later

15

Think of a view as a variable or as a virtual relation that is more convenient to query than the base table(s).

CREATE VIEW CitationStops AS (

SELECT race, citation

FROM Stops

WHERE citation = True

);

16 of 18

[4/4] How do I save the results of a query?

  • CTE for short
  • Like virtual views, CTEs are computed on demand

Sometimes we would like to save the results of a query to use in other queries.

There are four common strategies:

  • Create a new Table
  • Create a virtual view
  • Create an inlined view, i.e., �a CTE (Common Table Expression)

16

You’ll likely use this quite a bit for assigning a subquery result to a temporary variable.

Note scoping rules apply!

WITH CitationStops AS (

SELECT race, citation

FROM Stops

WHERE citation = True

)

SELECT *

FROM CitationStops;

17 of 18

[4/4] How do I save the results of a query?

Sometimes we would like to save the results of a query to use in other queries.

There are four common strategies:

  • Create a new Table
  • Create a virtual view
  • Create an inlined view, i.e., �a CTE (Common Table Expression)
  • Create a materialized view, i.e., MV
  • Output is stored on the disk, just like a regular table, and unlike views
  • Key benefit over regular tables: Materialized Views (MVs) automatically updated as base table(s) change!
  • However, many MVs can add unnecessary overhead to any base table updates! So use judiciously.

17

Unlike SQL Server, Oracle, and BigQuery, PostgreSQL does not keep MVs up to date. Instead, the user (and not the system) must manually “refresh” MVs.

CREATE MATERIALIZED VIEW CitationStops AS (

SELECT race, citation

FROM Stops

WHERE citation = True

);

18 of 18

[Summary] How do I save the results of a query?

Sometimes we would like to save the results of a query to use in other queries.

There are four common strategies:

  • Create a new table
  • Create a virtual view, i.e., a view
  • Create an inlined view, i.e.,�a CTE (Common Table Expression)
  • Create a materialized view, i.e., MV

18

In this class, we will most commonly utilize views and CTEs. You will see some examples on Project 1.

Meta-commands:

  • views: \dv
  • tables: \dt
  • all: \d