1 of 52

Data Wrangling and EDA

Exploratory Data Analysis and its role in the data science lifecycle

Data 100, Summer 2022 @ UC Berkeley

Anirudhan Badrinath and Dominic Liu

1

LECTURE 4

2 of 52

Today’s Roadmap

Lecture 04, Data 100 Summer 2022

Data Wrangling and Exploratory Data Analysis: An Infinite Loop

Key Data Properties to Consider in EDA

  • Structure
    • File format
    • Variable types
    • Primary and Foreign Keys
  • Granularity, Scope, Temporality
  • Faithfulness (and Missing Values)

EDA Demo: Mauna Loa CO2


3 of 52

Previously…

4 of 52

Pandas and Jupyter Notebooks

  • Introduced DataFrame concepts
    • Series: A named column of data with an index
    • Indices: The mapping from keys to rows
    • DataFrame: A collection of Series with a common index
  • DataFrame access methods
    • Filtering on predicates and slicing
    • df.loc: location by index
    • df.iloc: location by integer address
    • groupby & pivot for aggregating data


5 of 52

Now

Congratulations!!!

You have collected or have been given a box of data.

What do you do next?


Box of Data

6 of 52

Plan for next few lectures


[Diagram: the data science lifecycle: Question & Problem Formulation -> Data Acquisition -> Exploratory Data Analysis -> Prediction and Inference -> ?]

7 of 52

Plan for this week


[Diagram: the data science lifecycle: Question & Problem Formulation -> Data Acquisition -> Exploratory Data Analysis -> Prediction and Inference -> ?, annotated with this week's topics]

Mon and Tue (Part I: Processing Data):

  • Data Wrangling and Intro to EDA (today)
  • Working with Text Data and Regular Expressions

Wed and Thur (Part II: Visualizing and Reporting Data):

  • Plots and variables, Seaborn
  • Viz principles, KDE/Transformations

8 of 52

Data Wrangling and EDA: An Infinite Loop

Lecture 04, Data 100 Summer 2022

Data Wrangling and EDA: An Infinite Loop

Key Data Properties to Consider in EDA

  • Structure
    • File format
    • Variable types
    • Primary and Foreign Keys
  • Granularity, Scope, Temporality
  • Faithfulness (and Missing Values)

EDA Demo: Mauna Loa CO2


9 of 52

The Infinite Loop of Data Science


[Diagram: an infinite loop alternating between Data Wrangling and Exploratory Data Analysis (EDA)]

10 of 52

Data Wrangling

Often addresses issues like…

  • structure / formatting
  • missing or corrupted values
  • unit conversion
  • encoding text as numbers

Sadly, data cleaning is a big part of data science…


Data Wrangling, or Data Cleaning:

The process of transforming raw data to facilitate subsequent analysis.

11 of 52

Exploratory Data Analysis (EDA)

“Getting to Know the Data”

The process of transforming, visualizing, and summarizing data to:

  • Build/confirm understanding of the data and its provenance
  • Identify and address potential issues in the data
  • Inform the subsequent analysis
  • Discover potential hypotheses … (be careful…)

EDA is an open-ended analysis.

  • Be willing to find something surprising!


Provenance: origin of data; methodology by which data were produced

12 of 52

John Tukey on EDA

John Tukey (1915-2000) was a Princeton Mathematician & Statistician and an Early Data Scientist.

Coined/Introduced:

  • Fast Fourier Transform algorithm
  • “Bit” : binary digit
  • Exploratory Data Analysis


[Data Analysis & Statistics, Tukey 1965; Image from LIFE Magazine]

EDA is like detective work:

Exploratory data analysis is an attitude, a state of flexibility, a willingness to look for those things that we believe are not there, as well as those that we believe to be there.

13 of 52

Key Data Properties to Consider in EDA

Lecture 04, Data 100 Summer 2022

Data Wrangling and Exploratory Data Analysis: An Infinite Loop

Key Data Properties to Consider in EDA

  • Structure
    • File format
    • Variable types
    • Primary and Foreign Keys
  • Granularity, Scope, Temporality
  • Faithfulness (and Missing Values)

EDA Demo: Mauna Loa CO2


14 of 52

Structure -- the “shape” of a data file

Granularity -- how fine/coarse is each datum

Scope -- how (in)complete is the data

Temporality -- how is the data situated in time

Faithfulness -- how well does the data capture “reality”


What should we look for?

Key Data Properties to Consider in EDA

15 of 52

Structure

Lecture 04, Data 100 Summer 2022

Data Wrangling and Exploratory Data Analysis: An Infinite Loop

Key Data Properties to Consider in EDA

  • Structure
    • File format
    • Variable types
    • Primary and Foreign Keys
  • Granularity, Scope, Temporality
  • Faithfulness (and Missing Values)

EDA Demo: Mauna Loa CO2


16 of 52

Structure -- the “shape” of a data file

Granularity -- how fine/coarse is each datum

Scope -- how (in)complete is the data

Temporality -- how is the data situated in time

Faithfulness -- how well does the data capture “reality”

File Format

Variable Type

Multiple files (Primary and Foreign Keys)


17 of 52

Structure -- the “shape” of a data file

Granularity -- how fine/coarse is each datum

Scope -- how (in)complete is the data

Temporality -- how is the data situated in time

Faithfulness -- how well does the data capture “reality”

File Format

Variable Type

Multiple files (Primary and Foreign Keys)


18 of 52

Rectangular Data

We prefer rectangular data for data analysis (why?)

  • Regular structures are easy to manipulate and analyze
  • A big part of data cleaning is about transforming data to be more rectangular

Two kinds of rectangular data: Tables and Matrices.


Records/Rows

Fields/Attributes/Features/Columns

Tables (a.k.a. dataframes in R/Python and relations in SQL)

  • Named columns with different types
  • Manipulated using data transformation languages (map, filter, group by, join, …)

Matrices

  • Numeric data of the same type (float, int, etc.)
  • Manipulated using linear algebra

19 of 52

How are these data files formatted?


TSV

Tab-separated values

CSV

Comma-separated values

JSON

Which is the best? It depends on your use case.

20 of 52

CSV: Comma Separated Values

San Francisco restaurant food safety scores (San Francisco Department of Public Health)

Understand high-level structure:

  1. How big is the data file?
  2. How is the data file formatted?
  3. How do we read the data into pandas?

CSV is a very common table file format:

  • Records (rows) are delimited by a newline: '\n', "\r\n"
  • Fields (columns) are delimited by commas: ','

Tabular data: pd.read_csv
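A minimal sketch of reading CSV-formatted text with pd.read_csv; io.StringIO stands in for a real file path (the column names below echo the restaurant example in this deck, but the values are illustrative):

```python
import io

import pandas as pd

# A small CSV: records delimited by newlines, fields by commas.
# io.StringIO stands in for a real file such as "restaurants.csv".
csv_data = io.StringIO(
    "business_id,business_name,score\n"
    "835,Kam Po Kitchen,88\n"
    "905,Working Girls' Cafe',72\n"
)

df = pd.read_csv(csv_data)
print(df.shape)  # (2, 3): two records, three fields
```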


Demo Slides

21 of 52

TSV: Tab Separated Values

Another common table file format.

  • Records are delimited by a newline: '\n', "\r\n"
  • Fields are delimited by '\t' (tab)

pd.read_csv: Need to specify delimiter='\t'
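The same pd.read_csv call handles TSV once the delimiter is specified; a minimal sketch with io.StringIO standing in for a real file:

```python
import io

import pandas as pd

# Same kind of records, but tab-delimited.
tsv_data = io.StringIO(
    "business_id\tbusiness_name\n"
    "835\tKam Po Kitchen\n"
    "905\tWorking Girls' Cafe'\n"
)

# delimiter='\t' (or the alias sep='\t') tells pandas the fields
# are separated by tabs rather than commas.
df = pd.read_csv(tsv_data, delimiter="\t")
```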


Issues with CSVs and TSVs:

  • Commas or tabs inside field values
  • Quoting

Demo Slides

22 of 52

JSON: JavaScript Object Notation

Berkeley covid cases by day (City of Berkeley)

A less common table file format.

  • Very similar to Python dictionaries
  • Strict formatting and quoting address some of the issues in CSV/TSV
  • Can save metadata (data about the data) along with records in the same file


Issues

  • Not rectangular
  • Each record can have different fields
  • Nesting means records can contain tables – complicated

Tabular data: Find the records using regular Python, then pd.DataFrame.
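A sketch of that workflow, with hypothetical JSON contents modeled loosely on the Berkeley covid example (the "meta" and "data" field names are assumptions, not the real file's schema):

```python
import json

import pandas as pd

# Hypothetical JSON file contents: metadata plus a list of records.
raw = json.loads("""
{
  "meta": {"source": "City of Berkeley", "updated": "2022-06-26"},
  "data": [
    {"date": "2022-06-24", "new_cases": 12},
    {"date": "2022-06-25", "new_cases": 9}
  ]
}
""")

# Find the records using regular Python, then hand them to pandas.
df = pd.DataFrame(raw["data"])
```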

Demo Slides

23 of 52

What is the following file format?

Mauna Loa Observatory CO2 levels (NOAA)

How do we load these data into Pandas?

pd.read_csv? pd.DataFrame?


Often files will have mixed file formats, incorrect extensions, or no extension at all.

You may need to explore the actual raw data file!
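A quick way to explore a raw file is to print its first few lines before choosing a parser; this sketch uses io.StringIO with made-up contents standing in for a real open file handle:

```python
import io

# io.StringIO stands in for open("data.txt") on a real, unknown file.
f = io.StringIO("# comment line\ncol_a\tcol_b\n1\t2\n3\t4\n")

# Inspect the first few raw lines before committing to a parser:
# comment markers, delimiters, and headers all show up here.
first_lines = [line.rstrip("\n") for line in f.readlines()[:3]]
for line in first_lines:
    print(line)
```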

Demo Slides

24 of 52

Other types of data formats

Log data (usually .txt)

169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04 HTTP/1.1" 301 328 "http://anson.ucdavis.edu/courses/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)"

169.237.6.168 - - [8/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)"
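A custom parser for log lines like those above can be sketched with a regular expression (regexes are covered later this week); the group structure here is one plausible choice, not the only one:

```python
import re

line = ('169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] '
        '"GET /stat141/Winter04 HTTP/1.1" 301 328')

# Groups: host, timestamp, request line, status code, response size.
pattern = r'^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d+) (\d+)'
match = re.match(pattern, line)
host, ts, request, status, size = match.groups()
```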

XML (Extensible Markup Language)

<catalog>
  <plant type='a'>
    <common>Bloodroot</common>
    <botanical>Sanguinaria canadensis</botanical>
    <zone>4</zone>
    <light>Mostly Shady</light>
    <price>2.44</price>
    <availability>03/15/2006</availability>
    <description>
      <color>white</color>
      <petals>true</petals>
    </description>
    <indoor>true</indoor>
  </plant>
</catalog>

In Data 100 we will primarily work with CSV files, but there are other types of non-tabular data out in the wild.


CSV? TSV?�JSON? XML?

None of the above?

Write your own custom parser!

Nested structure

25 of 52

Structure -- the “shape” of a data file

Granularity -- how fine/coarse is each datum

Scope -- how (in)complete is the data

Temporality -- how is the data situated in time

Faithfulness -- how well does the data capture “reality”

File Format

Variable Type

Multiple files (Primary and Foreign Keys)


26 of 52

Records and Variables/Fields

All data (regardless of format) is composed of records. Each record has a set of variables (aka fields).

  • Tabular: Records == Rows, Variables == Columns
  • Non-Tabular: Create records and wrangle them into tabular form

Variables are defined by their type (two definitions):

  • Storage type in pandas: integer, floating point, boolean, object (string-like), etc. See df[colname].dtype
  • Feature type: conceptual notion of the information. Use expert knowledge, explore the data itself, and consult the data codebook (if it exists).
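The storage type is easy to check in pandas; a minimal sketch, with values taken from the restaurant example in this deck:

```python
import pandas as pd

df = pd.DataFrame({
    "business_id": [835, 905],
    "business_name": ["Kam Po Kitchen", "Working Girls' Cafe'"],
})

# Storage type is not the same as feature type: business_id is stored
# as an integer (int64) but is conceptually a nominal variable.
print(df["business_id"].dtype)    # int64
print(df["business_name"].dtype)  # object (string-like)
```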


Records/Rows

Fields/Attributes/Features/Columns

     business_id   business_name
0    835           Kam Po Kitchen
1    905           Working Girls' Cafe'

27 of 52

Variable Feature Types


Variable

  • Quantitative: ratios and intervals have meaning.
    • Continuous: could be measured to arbitrary precision.
      Examples: price, temperature.
    • Discrete: finite possible values.
      Examples: number of siblings, years of education.
  • Qualitative (categorical):
    • Ordinal: categories with ordered levels; no consistent meaning to differences.
      Examples: preferences, level of education.
    • Nominal: categories with no specific ordering.
      Examples: political affiliation, Cal ID number, ZIP code.

Note that qualitative variables could have numeric levels; conversely, quantitative variables could be stored as strings!

28 of 52

Class Exercise


What is the feature type of each variable? 🤔

Options:
  A. Quantitative Continuous
  B. Quantitative Discrete
  C. Qualitative Ordinal
  D. Qualitative Nominal

  Q   Variable                          Feature Type
  1   CO2 level (PPM)
  2   Number of siblings
  3   GPA
  4   Income bracket (low, med, high)
  5   Race
  6   Number of years of education
  7   Yelp Rating
29 of 52

Class Exercise: Solutions


What is the feature type of each variable? 🤔

  Q   Variable                          Feature Type
  1   CO2 level (PPM)                   A. Quantitative Continuous
  2   Number of siblings                B. Quantitative Discrete
  3   GPA                               A. Quantitative Continuous
  4   Income bracket (low, med, high)   C. Qualitative Ordinal
  5   Race                              D. Qualitative Nominal
  6   Number of years of education      B. Quantitative Discrete
  7   Yelp Rating                       C. Qualitative Ordinal

Meta: For this exercise, the Feature Type variable is Qualitative Nominal.
30 of 52

Structure -- the “shape” of a data file

Granularity -- how fine/coarse is each datum

Scope -- how (in)complete is the data

Temporality -- how is the data situated in time

Faithfulness -- how well does the data capture “reality”

File Format

Variable Type

Multiple files (Primary and Foreign Keys)


31 of 52

Structure: Keys

Sometimes your data comes in multiple files:

  • Often data will reference other pieces of data.

Primary key: the column or set of columns in a table that determine the values of the remaining columns

  • Primary keys are unique
  • Examples: SSN, ProductIDs, …


Purchases.csv                    Orders.csv
OrderNum  ProdID  Quantity       OrderNum  CustID   Date
1         42      3              1         171345   8/21/2017
1         999     2              2         281139   8/30/2017
2         42      1              (Primary Key: OrderNum)

Products.csv                     Customers.csv
ProdID  Cost                     CustID   Addr
42      3.14                     171345   Harmon..
999     2.72                     281139   Main ..
(Primary Key: ProdID)
32 of 52

Structure: Keys


Sometimes your data comes in multiple files:

  • Often data will reference other pieces of data.

Primary key: the column or set of columns in a table that determine the values of the remaining columns

  • Primary keys are unique
  • Examples: SSN, ProductIDs, …

Foreign keys: the column or set of columns that reference primary keys in other tables.

You may need to join across tables! pd.merge
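A sketch of such a join with pd.merge, using the Orders and Purchases tables from this slide (values copied from the slide):

```python
import pandas as pd

orders = pd.DataFrame({"OrderNum": [1, 2],
                       "CustID": [171345, 281139],
                       "Date": ["8/21/2017", "8/30/2017"]})
purchases = pd.DataFrame({"OrderNum": [1, 1, 2],
                          "ProdID": [42, 999, 42],
                          "Quantity": [3, 2, 1]})

# Join the foreign key in Purchases to the primary key in Orders.
joined = purchases.merge(orders, on="OrderNum", how="left")
```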


Purchases.csv                    Orders.csv
OrderNum  ProdID  Quantity       OrderNum  CustID   Date
1         42      3              1         171345   8/21/2017
1         999     2              2         281139   8/30/2017
2         42      1              (Primary Key: OrderNum)
(Foreign Key: OrderNum)

Products.csv                     Customers.csv
ProdID  Cost                     CustID   Addr
42      3.14                     171345   Harmon..
999     2.72                     281139   Main ..
(Primary Key: ProdID)
33 of 52

Structure -- the “shape” of a data file

Granularity -- how fine/coarse is each datum

Scope -- how (in)complete is the data

Temporality -- how is the data situated in time

Faithfulness -- how well does the data capture “reality”

Are the data in a standard format or encoding?

  • Tabular data: CSV, TSV, Excel, SQL
  • Nested data: JSON or XML

Are the data organized in records or nested?

  • Can we define records by parsing the data?
  • Can we reasonably un-nest the data?

Does the data reference other data?

  • Can we join/merge the data?
  • Do we need to?

What are the fields in each record?

  • How are they encoded? (e.g., strings, numbers, binary, dates …)
  • What is the type of the data?


Summary

You will do the most data wrangling when analyzing the structure of your data.

34 of 52

Interlude

Assignments

Labs 1 and 2, Homework 1 due tonight! If you need more time, remember to fill out the slip day request form!

Office Hours

We have online office hours! If you are ever stuck on a homework or lab, please bring them to office hours!


35 of 52

Granularity, Scope, Temporality

Lecture 04, Data 100 Summer 2022

Data Wrangling and Exploratory Data Analysis: An Infinite Loop

Key Data Properties to Consider in EDA

  • Structure
    • File format
    • Variable types
    • Primary and Foreign Keys
  • Granularity, Scope, Temporality
  • Faithfulness (and Missing Values)

EDA Demo: Mauna Loa CO2


36 of 52

Structure -- the “shape” of a data file

Granularity -- how fine/coarse is each datum

Scope -- how (in)complete is the data

Temporality -- how is the data situated in time

Faithfulness -- how well does the data capture “reality”


[Diagram: the data science lifecycle: Question & Problem Formulation -> Data Acquisition -> Exploratory Data Analysis]

37 of 52

Granularity

What does each record represent?

  • Examples: a purchase, a person, a group of users

Do all records capture granularity at the same level?

  • Some data will include summaries (aka rollups) as records

If the data are coarse, how were the records aggregated?

  • Sampling, averaging, …


[Diagram: three fine-grained records (Rec. 1, Rec. 2, Rec. 3) aggregated into a single coarse-grained record]

38 of 52

Scope

Does my data cover my area of interest?

  • Example: I am interested in studying crime in California but I only have Berkeley crime data.

Are my data too expansive?

  • Example: I am interested in student grades for DS100 but have student grades for all statistics classes.
  • Solution: Filtering ⇒ Implications on sample?
    • If the data is a sample I may have poor coverage after filtering …

Does my data cover the right time frame?

  • More on this in Temporality…


39 of 52

Scope

Does my data cover my area of interest?

  • Example: I am interested in studying crime in California but I only have Berkeley crime data.

Are my data too expansive?

  • Example: I am interested in student grades for DS100 but have student grades for all statistics classes.
  • Solution: Filtering ⇒ Implications on sample?
    • If the data is a sample I may have poor coverage after filtering …

Does my data cover the right time frame?

  • More on this in Temporality…

(recall) The sampling frame is the population from which the data were sampled. Note that this may not be the population of interest.

How complete/incomplete is the frame (and its data)?

  • How is the frame/data situated in place?
  • How well does the frame/data capture reality?
  • How is the frame/data situated in time?


40 of 52

Temporality

Data changes: when was the data collected or last updated?

Periodicity: are there periodic patterns, e.g., diurnal (24-hour) cycles?


What is the meaning of the time and date fields? A few options:

  • When the “event” happened?
  • When the data was collected or was entered into the system?
  • Date the data was copied into a database? (look for many matching timestamps)

Are there strange null values?

  • E.g., January 1st 1970, January 1st 1900…?

Time depends on where! (time zones & daylight saving time)

  • Regions have different datestring representations: 07/08/09?
  • Learn to use the Python datetime library and pandas .dt accessors
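For instance, pandas can parse an ambiguous datestring once the format is specified, after which the .dt accessor exposes the date components; a minimal sketch:

```python
import pandas as pd

# "07/08/09" is ambiguous; an explicit format resolves it.
# Here we assert US-style month/day/two-digit-year.
s = pd.to_datetime(pd.Series(["07/08/09"]), format="%m/%d/%y")

# The .dt accessor exposes components of datetime Series.
print(s.dt.year[0], s.dt.month[0], s.dt.day[0])  # 2009 7 8
```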

41 of 52

Temporality: Unix Time / POSIX Time

Time measured in seconds since January 1st 1970

  • Minus leap seconds …

Unix time follows Coordinated Universal Time (UTC)

  • International time standard
  • Measured at 0 degrees longitude
    • Similar to Greenwich Mean Time (GMT)
  • No daylight saving time
  • Time codes

Time Zones:

  • San Francisco: UTC-8, or UTC-7 during daylight saving time


Jun 27, 2022 5:00pm PDT

1656374400
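pandas can convert a Unix timestamp (seconds since Jan 1, 1970 UTC) into a timezone-aware datetime and shift it to a local zone; a minimal sketch using the timestamp above:

```python
import pandas as pd

# Interpret the integer as seconds since the Unix epoch, in UTC.
ts = pd.to_datetime(1656374400, unit="s", utc=True)

# Shift to Pacific time (PDT is UTC-7 in summer).
pacific = ts.tz_convert("America/Los_Angeles")
print(pacific)  # 2022-06-27 17:00:00-07:00
```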

42 of 52

Faithfulness (and Missing Values)

Lecture 04, Data 100 Summer 2022

Data Wrangling and Exploratory Data Analysis: An Infinite Loop

Key Data Properties to Consider in EDA

  • Structure
    • File format
    • Variable types
    • Primary and Foreign Keys
  • Granularity, Scope, Temporality
  • Faithfulness (and Missing Values)

EDA Demo: Mauna Loa CO2


43 of 52

Structure -- the “shape” of a data file

Granularity -- how fine/coarse is each datum

Scope -- how (in)complete is the data

Temporality -- how is the data situated in time

Faithfulness -- how well does the data capture “reality”


44 of 52

Faithfulness: Do I trust this data?

Does my data contain unrealistic or “incorrect” values?

  • Dates in the future for events in the past
  • Locations that don’t exist
  • Negative counts
  • Misspellings of names
  • Large outliers

Does my data violate obvious dependencies?

  • E.g., age and birthday don’t match

Was the data entered by hand?

  • Spelling errors, fields shifted …
  • Did the form require all fields or provide default values?

Are there obvious signs of data falsification?

  • Repeated names, fake looking email addresses, repeated use of uncommon names or fields.


45 of 52

Signs that your data may not be faithful (and Solutions)

Truncated data

  • Early Microsoft Excel limits: 65536 Rows, 256 Columns
  • Soln: be aware of consequences in analysis ⇒ how did truncation affect sample?

Time Zone Inconsistencies

  • Soln 1: convert to a common timezone (e.g., UTC)
  • Soln 2: convert to the timezone of the location – useful in modeling behavior.

Duplicated Records or Fields

  • Soln: identify and eliminate (use primary key) ⇒ implications on sample?

Spelling Errors

  • Soln: Apply corrections or drop records not in a dictionary ⇒ implications on sample?

Units not specified or consistent

  • Solns: Infer units, check values are in reasonable ranges for data

Missing Data

  • See next slide


46 of 52

How to Address Missing Data/Default Values

Drop records with missing values

  • Probably most common
    • Caution: check for biases induced by dropped values
      • Missing or corrupt records might be related to something of interest

Imputation: Inferring missing values

  • Average Imputation: replace with an average value
    • Which average? Often use closest related subgroup mean.
  • Hot deck imputation: replace with a random value
    • Choose a random value from the subgroup and use it for the missing value.
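The two imputation strategies above can be sketched in pandas; the data and the "group" column here are illustrative assumptions, not from any real dataset:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"group": ["a", "a", "b", "b"],
                   "value": [1.0, np.nan, 10.0, 14.0]})

# Average imputation: fill each missing value with the mean of its
# closest related subgroup (here, the "group" column).
df["imputed"] = df.groupby("group")["value"].transform(
    lambda v: v.fillna(v.mean()))

# Hot deck imputation: fill with a random observed value drawn
# from the same subgroup.
rng = np.random.default_rng(0)
observed = df.loc[df["value"].notna() & (df["group"] == "a"), "value"]
hot_deck = rng.choice(observed)
```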

Other Suggestions (thanks Prof. Gonzalez):

  1. Drop missing values but check for induced bias (use domain knowledge)
  2. Directly model missing values during future analysis

You’ll see Suggestion 1 in this week’s homework; you’ll see Suggestion 2 in the wild.


Examples of missing/default values:

  • Empty strings: "", " "
  • Sentinels: 0, -1, 999, 12345
  • Default dates: 1970, 1900
  • NaN ("Not a Number"), Null

47 of 52

Demo: Mauna Loa CO2 EDA

Lecture 04, Data 100 Summer 2022

Data Wrangling and Exploratory Data Analysis: An Infinite Loop

Key Data Properties to Consider in EDA

  • Structure
    • File format
    • Variable types
    • Primary and Foreign Keys
  • Granularity, Scope, Temporality
  • Faithfulness (and Missing Values)

EDA Demo: Mauna Loa CO2


48 of 52

What are our Variable Feature Types?

From file description:

  • All measurement variables (average, interpolated, trend) are monthly mean CO2 mole fraction
    • Computed from daily means
  • #days: Number of daily means in a month (i.e., # days equipment worked)

What are the first three columns? How do these columns define each record?


EDA step:

Understand what each record, each feature represents

Demo Slides

49 of 52

The Search for the Missing Values

From file description:

  • -99.99: missing monthly average Avg
  • -1: missing value for # days that the equipment was in operation that month.

Which approach? Drop, NaN, Interpolate

  • All 3 are probably fine since few missing values, but we choose interpolation
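The sentinel-then-interpolate pipeline can be sketched as follows; the -99.99 sentinel comes from the file description, but the values below are illustrative, not the real Mauna Loa data:

```python
import numpy as np
import pandas as pd

# Illustrative monthly averages; -99.99 marks a missing Avg.
avg = pd.Series([315.71, -99.99, 317.45])

# Convert the sentinel to NaN, then linearly interpolate the gap.
cleaned = avg.replace(-99.99, np.nan).interpolate()
```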

Granularity of data: What do we want to report? How long is the timescale?


EDA step:

Hypothesize why these values were missing, then use that knowledge to decide whether to drop or impute missing values

Demo Slides

50 of 52

A Discussion on Data Granularity

BONUS MATERIAL

From the description:

  • Monthly measurements are averages of average day measurements.
  • The NOAA GML website has datasets for daily/hourly measurements too.

Which granularity to present?

  • You can always go from finer-grained to coarser-grained data (groupby.agg), but not vice versa.
  • Fine-grained data can be computationally expensive: 61 years of seconds is a lot of records!
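Coarsening fine-grained data with groupby.agg can be sketched like this, on illustrative daily values (not the real NOAA data):

```python
import pandas as pd

# Fine-grained (daily) records can always be coarsened to monthly
# means, but monthly means cannot recover the daily values.
daily = pd.DataFrame({"month": [3, 3, 4, 4],
                      "co2": [315.6, 315.8, 317.2, 317.6]})

monthly = daily.groupby("month").agg({"co2": "mean"})
```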


You want the granularity of your data to match your research question.

Demo Slides

51 of 52

Summary: How do you do EDA/Data Wrangling?

Examine data and metadata:

  • What is the date, size, organization, and structure of the data?

Examine each field/attribute/dimension individually

Examine pairs of related dimensions

  • Stratifying earlier analysis: break down grades by major …

Along the way:

  • Visualize/summarize the data (Wednesday)
  • Validate assumptions about data and collection process
  • Identify and address anomalies
  • Apply data transformations and corrections (Thursday)
  • Record everything you do! (why?)


52 of 52

Data Wrangling and EDA

Content credit: Joseph Gonzalez, Deborah Nolan, Joseph Hellerstein, and Lisa Yan
