Economics 148 - SQL Databases & Scanners
Data Science for Economics
Spring 2023 - UC Berkeley
Eric Van Dusen
1
LECTURE 9
Econ 148 - Spring 2023
Outline
Lecture 9, Econ 148, Spring 2023
Databases
Scanners / Household Data
More SQL
2
Econ 148 - Spring 2023
Announcements
WHERE - Filter Rows
HAVING - Filter after a Groupby
Project has Hidden Tests!
Public Tests that you see just are like checks that there is an answer
3
Econ 148 - Spring 2023
Data 101!
4
Econ 148 - Spring 2023
Relational Database - Schema - Entity Relationship Diagram
5
homeworld
Characters
Planets
Vehicles
Star Wars Toy Database
GDP
Debt
Country Code
Country code
International Debt Database
Starships
homeworld
Species
Econ 148 - Spring 2023
Relational Database - Schema - Entity Relationship Diagram
6
Homeworld
Species
Characters
Planets
Vehicles
Star Wars Toy Database
Starships
homeworld
Species
Movie Plot as Schema
A New Hope - Episode 4
In Schema
Characters
Luke Skywalker
Species
Human
Planets
Tatooine
Not in Schema
Vehicles
Econ 148 - Spring 2023
Relational Database - Schema - Entity Relationship Diagram
7
Dragon
Dish
Scene
Data 100 Toy Database
UPC
soda
upc
upc
Soda Database
Econ 148 - Spring 2023
Student University ER model
8
https://stackoverflow.com/questions/43537743/relational-schema-to-er-diagram
Econ 148 - Spring 2023
IMDB dataset - used in Data 100
9
https://dlwhittenbury.github.io/imdb-2-designing-a-mysql-database-and-performing-etl-for-imdb-dataset-using-python.html
Econ 148 - Spring 2023
IMDB dataset - Schema - with Keys
10
Econ 148 - Spring 2023
Databases in GUI
11
Mysql workbench
Microsoft Access
Econ 148 - Spring 2023
@holocronkeeper
Database for the star wars universe!
Characters
Episodes
Ships
Time Dimension - Continuity
Universe keeps expanding!
Licensing
12
Econ 148 - Spring 2023
Airtable
13
Econ 148 - Spring 2023
Salesforce
14
Econ 148 - Spring 2023
15
Econ 148 - Spring 2023
Database Companies - large Bay Area employers !
Oracle - 240 bn
Salesforce - 161bn
SAP - 138 bn
IBM
Microsoft
Snowflake - 50bn
Alteryx - 5bn
Airtable - private with 10bn valuation
16
Econ 148 - Spring 2023
On to UPC sales
Demo database with a million rows
17
Econ 148 - Spring 2023
A Meme from 1992 ?!?!
Econ 148 - Spring 2023
Front Page of NYT
Econ 148 - Spring 2023
Applied Economics and Bigger Data !
1990s - Scanner Data - before the internet!
A first example of Big Data
Data had previously been collected from consumer surveys
Sample vs Census
Improve measurement of consumer basket of goods
Basket Changes over time
Can we make a better CPI based on consumption behavior?
Measure choices between premium and generic -
Patent and OTC - patent expires but demand remains
Do rich and poor pay the same prices
Econ 148 - Spring 2023
Household vs Firm
Household sample - Nielsen - Nationally Representative sample of 120,000 household
Households scan everything they buy
Household level attributes eg income level, zip code
Firm Sample
Firms have product, brand data
` Firms have different locations
https://www.cpgdatainsights.com/answer-business-questions/sales-drivers-analysis-panel-data/
Econ 148 - Spring 2023
Another Meme! Trading Privacy for Pennies?
Econ 148 - Spring 2023
Nielsen - leveraging big data since 1923
Originator of the term market share
What do people watch?
Fox news is the top news source
When will ABC be on top
New episodes during ratings “sweeps week “
Now - buy the information directly from
Comcast / DirecTV / Roku …..
Leverage data from multiple markets
What do people buy ?
Coke vs Pepsi Crest vs Colgate
Econ 148 - Spring 2023
Nielsen Scanner Data
Econ 148 - Spring 2023
UC Berkeley Access
https://guides.lib.berkeley.edu/c.php?g=4395&p=5048306
Econ 148 - Spring 2023
UC Berkeley Access
Econ 148 - Spring 2023
Faber and Fally, Firm Heterogeneity in Consumption Baskets
https://www.nber.org/system/files/working_papers/w23101/w23101.pdf
Econ 148 - Spring 2023
Faber and Fally, Firm Heterogeneity in Consumption Baskets
https://www.nber.org/system/files/working_papers/w23101/w23101.pdf
Econ 148 - Spring 2023
Dominick’s - A Chicago area Supermarket Chain
https://www.chicagobooth.edu/research/kilts/datasets/dominicks
Econ 148 - Spring 2023
Domenicks Data
Econ 148 - Spring 2023
Domenicks Database - Schema - Entity Relationship Diagram
31
Store id
Store data
Store Database
UPC
UPC
UPC
Storeid
soda
Soda Database
UPC
UPC
UPC
Storeid
beer
Beer Database
Econ 148 - Spring 2023
Stores
32
Econ 148 - Spring 2023
Which Stores
33
Econ 148 - Spring 2023
34
Econ 148 - Spring 2023
UPC Code
35
Econ 148 - Spring 2023
Basic GROUP BY Operations
Lecture 18, Data 100 Spring 2022
36
Econ 148 - Spring 2023
The Dish Table
37
name TEXT, PK | type TEXT | cost INTEGER, >=0 |
ravioli | entree | 10 |
pork bun | entree | 7 |
taco | entree | 7 |
edamame | appetizer | 4 |
fries | appetizer | 4 |
potsticker | appetizer | 4 |
ice cream | dessert | 5 |
Dish
Econ 148 - Spring 2023
The Dish Table
38
name TEXT, PK | type TEXT | cost INTEGER, >=0 |
ravioli | entree | 10 |
pork bun | entree | 7 |
taco | entree | 7 |
edamame | appetizer | 4 |
fries | appetizer | 4 |
potsticker | appetizer | 4 |
ice cream | dessert | 5 |
Dish
SELECT type
FROM Dish;
Econ 148 - Spring 2023
GROUP BY (similar to groupby in Pandas)
39
name TEXT, PK | type TEXT | cost INTEGER, >=0 |
ravioli | entree | 10 |
pork bun | entree | 7 |
taco | entree | 7 |
edamame | appetizer | 4 |
fries | appetizer | 4 |
potsticker | appetizer | 4 |
ice cream | dessert | 5 |
Dish
SELECT type
FROM Dish
GROUP BY type;
Econ 148 - Spring 2023
Using GROUP BY and MAX (similar to groupby().max() in pandas)
40
name TEXT, PK | type TEXT | cost INTEGER, >=0 |
ravioli | entree | 10 |
pork bun | entree | 7 |
taco | entree | 7 |
edamame | appetizer | 4 |
fries | appetizer | 4 |
potsticker | appetizer | 4 |
ice cream | dessert | 5 |
Dish
SELECT type, MAX(cost)
FROM Dish
GROUP BY type;
Note that “type, MAX(cost)“ is called a “column expression list”.
Econ 148 - Spring 2023
Using GROUP BY and SUM (similar to groupby().sum() in pandas)
41
name TEXT, PK | type TEXT | cost INTEGER, >=0 |
ravioli | entree | 10 |
pork bun | entree | 7 |
taco | entree | 7 |
edamame | appetizer | 4 |
fries | appetizer | 4 |
potsticker | appetizer | 4 |
ice cream | dessert | 5 |
Dish
SELECT type, SUM(cost)
FROM Dish
GROUP BY type;
For more aggregation functions see https://www.sqlite.org/lang_aggfunc.html
Econ 148 - Spring 2023
Using Multiple Aggregation Functions (no simple equivalent in pandas)
42
name TEXT, PK | type TEXT | cost INTEGER, >=0 |
ravioli | entree | 10 |
pork bun | entree | 7 |
taco | entree | 7 |
edamame | appetizer | 4 |
fries | appetizer | 4 |
potsticker | appetizer | 4 |
ice cream | dessert | 5 |
Dish
SELECT type,
SUM(cost),
MIN(cost),
MAX(name)
FROM Dish
GROUP BY type;
What do you think will happen?
Econ 148 - Spring 2023
Using Multiple Aggregation Functions (no simple equivalent in pandas)
43
name TEXT, PK | type TEXT | cost INTEGER, >=0 |
ravioli | entree | 10 |
pork bun | entree | 7 |
taco | entree | 7 |
edamame | appetizer | 4 |
fries | appetizer | 4 |
potsticker | appetizer | 4 |
ice cream | dessert | 5 |
Dish
SELECT type,
SUM(cost),
MIN(cost),
MAX(name)
FROM Dish
GROUP BY type;
Econ 148 - Spring 2023
Summary
SELECT <column expression list>
FROM <table>
[WHERE <predicate>]
[GROUP BY <column>]
[ORDER BY <column list>]
[LIMIT <number of rows>]
[OFFSET <number of rows>];
44
Econ 148 - Spring 2023