1 of 44

Economics 148 - SQL Databases & Scanners

Data Science for Economics

Spring 2023 - UC Berkeley

Eric Van Dusen

1

LECTURE 9

Econ 148 - Spring 2023

2 of 44

Outline

Lecture 9, Econ 148, Spring 2023

Databases

Scanners / Household Data

More SQL

2

Econ 148 - Spring 2023

3 of 44

Announcements

WHERE - Filter Rows

HAVING - Filter after a Groupby

Project has Hidden Tests!

Public Tests that you see just are like checks that there is an answer

3

Econ 148 - Spring 2023

4 of 44

Data 101!

4

Econ 148 - Spring 2023

5 of 44

Relational Database - Schema - Entity Relationship Diagram

5

homeworld

Characters

Planets

Vehicles

Star Wars Toy Database

GDP

Debt

Country Code

Country code

International Debt Database

Starships

homeworld

Species

Econ 148 - Spring 2023

6 of 44

Relational Database - Schema - Entity Relationship Diagram

6

Homeworld

Species

Characters

Planets

Vehicles

Star Wars Toy Database

Starships

homeworld

Species

Movie Plot as Schema

A New Hope - Episode 4

In Schema

Characters

Luke Skywalker

Species

Human

Planets

Tatooine

Not in Schema

Vehicles

Econ 148 - Spring 2023

7 of 44

Relational Database - Schema - Entity Relationship Diagram

7

Dragon

Dish

Scene

Data 100 Toy Database

UPC

soda

upc

upc

Soda Database

Econ 148 - Spring 2023

8 of 44

Student University ER model

8

https://stackoverflow.com/questions/43537743/relational-schema-to-er-diagram

Econ 148 - Spring 2023

9 of 44

IMDB dataset - used in Data 100

9

https://dlwhittenbury.github.io/imdb-2-designing-a-mysql-database-and-performing-etl-for-imdb-dataset-using-python.html

Econ 148 - Spring 2023

10 of 44

IMDB dataset - Schema - with Keys

10

Econ 148 - Spring 2023

11 of 44

Databases in GUI

11

Mysql workbench

Microsoft Access

Econ 148 - Spring 2023

12 of 44

@holocronkeeper

Database for the star wars universe!

Characters

Episodes

Ships

Time Dimension - Continuity

Universe keeps expanding!

Licensing

12

Econ 148 - Spring 2023

13 of 44

Airtable

13

Econ 148 - Spring 2023

14 of 44

Salesforce

14

Econ 148 - Spring 2023

15 of 44

15

Econ 148 - Spring 2023

16 of 44

Database Companies - large Bay Area employers !

Oracle - 240 bn

Salesforce - 161bn

SAP - 138 bn

IBM

Microsoft

Snowflake - 50bn

Alteryx - 5bn

Airtable - private with 10bn valuation

16

Econ 148 - Spring 2023

17 of 44

On to UPC sales

Demo database with a million rows

17

Econ 148 - Spring 2023

18 of 44

A Meme from 1992 ?!?!

Econ 148 - Spring 2023

19 of 44

Front Page of NYT

Econ 148 - Spring 2023

20 of 44

Applied Economics and Bigger Data !

1990s - Scanner Data - before the internet!

A first example of Big Data

Data had previously been collected from consumer surveys

Sample vs Census

Improve measurement of consumer basket of goods

Basket Changes over time

Can we make a better CPI based on consumption behavior?

Measure choices between premium and generic -

Patent and OTC - patent expires but demand remains

Do rich and poor pay the same prices

Econ 148 - Spring 2023

21 of 44

Household vs Firm

Household sample - Nielsen - Nationally Representative sample of 120,000 household

Households scan everything they buy

Household level attributes eg income level, zip code

Firm Sample

Firms have product, brand data

` Firms have different locations

https://www.cpgdatainsights.com/answer-business-questions/sales-drivers-analysis-panel-data/

Econ 148 - Spring 2023

22 of 44

Another Meme! Trading Privacy for Pennies?

Econ 148 - Spring 2023

23 of 44

Nielsen - leveraging big data since 1923

Originator of the term market share

What do people watch?

Fox news is the top news source

When will ABC be on top

New episodes during ratings “sweeps week “

Now - buy the information directly from

Comcast / DirecTV / Roku …..

Leverage data from multiple markets

What do people buy ?

Coke vs Pepsi Crest vs Colgate

Econ 148 - Spring 2023

24 of 44

Nielsen Scanner Data

Econ 148 - Spring 2023

25 of 44

UC Berkeley Access

https://guides.lib.berkeley.edu/c.php?g=4395&p=5048306

Econ 148 - Spring 2023

26 of 44

UC Berkeley Access

Econ 148 - Spring 2023

27 of 44

Faber and Fally, Firm Heterogeneity in Consumption Baskets

https://www.nber.org/system/files/working_papers/w23101/w23101.pdf

Econ 148 - Spring 2023

28 of 44

Faber and Fally, Firm Heterogeneity in Consumption Baskets

https://www.nber.org/system/files/working_papers/w23101/w23101.pdf

Econ 148 - Spring 2023

29 of 44

Dominick’s - A Chicago area Supermarket Chain

https://www.chicagobooth.edu/research/kilts/datasets/dominicks

Econ 148 - Spring 2023

30 of 44

Domenicks Data

Econ 148 - Spring 2023

31 of 44

Domenicks Database - Schema - Entity Relationship Diagram

31

Store id

Store data

Store Database

UPC

UPC

UPC

Storeid

soda

Soda Database

UPC

UPC

UPC

Storeid

beer

Beer Database

Econ 148 - Spring 2023

32 of 44

Stores

32

Econ 148 - Spring 2023

33 of 44

Which Stores

33

Econ 148 - Spring 2023

34 of 44

34

Econ 148 - Spring 2023

35 of 44

UPC Code

35

Econ 148 - Spring 2023

36 of 44

Basic GROUP BY Operations

Lecture 18, Data 100 Spring 2022

  • Why Databases
  • Warmup: SQL Example
  • SQL Tables
  • Basic SQL Queries
  • Basic GROUP BY Operations
  • Trickier GROUP BY Operations
  • DISTINCT

36

Econ 148 - Spring 2023

37 of 44

The Dish Table

37

name

TEXT, PK

type

TEXT

cost

INTEGER, >=0

ravioli

entree

10

pork bun

entree

7

taco

entree

7

edamame

appetizer

4

fries

appetizer

4

potsticker

appetizer

4

ice cream

dessert

5

Dish

Econ 148 - Spring 2023

38 of 44

The Dish Table

38

name

TEXT, PK

type

TEXT

cost

INTEGER, >=0

ravioli

entree

10

pork bun

entree

7

taco

entree

7

edamame

appetizer

4

fries

appetizer

4

potsticker

appetizer

4

ice cream

dessert

5

Dish

SELECT type

FROM Dish;

Econ 148 - Spring 2023

39 of 44

GROUP BY (similar to groupby in Pandas)

39

name

TEXT, PK

type

TEXT

cost

INTEGER, >=0

ravioli

entree

10

pork bun

entree

7

taco

entree

7

edamame

appetizer

4

fries

appetizer

4

potsticker

appetizer

4

ice cream

dessert

5

Dish

SELECT type

FROM Dish

GROUP BY type;

Econ 148 - Spring 2023

40 of 44

Using GROUP BY and MAX (similar to groupby().max() in pandas)

40

name

TEXT, PK

type

TEXT

cost

INTEGER, >=0

ravioli

entree

10

pork bun

entree

7

taco

entree

7

edamame

appetizer

4

fries

appetizer

4

potsticker

appetizer

4

ice cream

dessert

5

Dish

SELECT type, MAX(cost)

FROM Dish

GROUP BY type;

Note that “type, MAX(cost)“ is called a “column expression list”.

Econ 148 - Spring 2023

41 of 44

Using GROUP BY and SUM (similar to groupby().sum() in pandas)

41

name

TEXT, PK

type

TEXT

cost

INTEGER, >=0

ravioli

entree

10

pork bun

entree

7

taco

entree

7

edamame

appetizer

4

fries

appetizer

4

potsticker

appetizer

4

ice cream

dessert

5

Dish

SELECT type, SUM(cost)

FROM Dish

GROUP BY type;

For more aggregation functions see https://www.sqlite.org/lang_aggfunc.html

Econ 148 - Spring 2023

42 of 44

Using Multiple Aggregation Functions (no simple equivalent in pandas)

42

name

TEXT, PK

type

TEXT

cost

INTEGER, >=0

ravioli

entree

10

pork bun

entree

7

taco

entree

7

edamame

appetizer

4

fries

appetizer

4

potsticker

appetizer

4

ice cream

dessert

5

Dish

SELECT type,

SUM(cost),

MIN(cost),

MAX(name)

FROM Dish

GROUP BY type;

What do you think will happen?

Econ 148 - Spring 2023

43 of 44

Using Multiple Aggregation Functions (no simple equivalent in pandas)

43

name

TEXT, PK

type

TEXT

cost

INTEGER, >=0

ravioli

entree

10

pork bun

entree

7

taco

entree

7

edamame

appetizer

4

fries

appetizer

4

potsticker

appetizer

4

ice cream

dessert

5

Dish

SELECT type,

SUM(cost),

MIN(cost),

MAX(name)

FROM Dish

GROUP BY type;

Econ 148 - Spring 2023

44 of 44

Summary

SELECT <column expression list>

FROM <table>

[WHERE <predicate>]

[GROUP BY <column>]

[ORDER BY <column list>]

[LIMIT <number of rows>]

[OFFSET <number of rows>];

44

Econ 148 - Spring 2023