Data Wrangling and EDA
Exploratory Data Analysis and its role in the data science lifecycle
Data 100, Summer 2022 @ UC Berkeley
Anirudhan Badrinath and Dominic Liu
1
LECTURE 4
Today’s Roadmap
Lecture 04, Data 100 Summer 2022
Data Wrangling and Exploratory Data Analysis: An Infinite Loop
Key Data Properties to Consider in EDA
EDA Demo: Mauna Loa CO2
2
Previously…
Pandas and Jupyter Notebooks
4
Now
Congratulations!!!
You have collected or have been given�a box of data.
What do you do next?
5
Box of Data
Plan for next few lectures
6
Question & Problem
Formulation
Data
Acquisition
Exploratory Data Analysis
Prediction and
Inference
?
Plan for this week
7
Question & Problem
Formulation
Data
Acquisition
Exploratory Data Analysis
Prediction and
Inference
?
Data Wrangling
Intro to EDA
Working with Text Data
Regular Expressions
Plots and variables
Seaborn
Viz principles
KDE/Transformations
Mon and Tue (Part I: Processing Data)
Wed and Thur (Part II: Visualizing and Reporting Data)
(today)
Data Wrangling and EDA:�An Infinite Loop
Lecture 04, Data 100 Summer 2022
Data Wrangling and EDA: An Infinite Loop
Key Data Properties to Consider in EDA
EDA Demo: Mauna Loa CO2
8
The Infinite Loop of Data Science
9
Exploratory Data
Analysis (EDA)
Data Wrangling
Data Wrangling
Often addresses issues like…
Sadly, data cleaning is a big part�of data science…
10
Data Wrangling, or Data Cleaning:
The process of transforming raw data� to facilitate subsequent analysis.
Exploratory Data Analysis (EDA)
“Getting to Know the Data”
The process of transforming, visualizing, and summarizing data to:
EDA is an open-ended analysis.
11
Provenance: origin of data; methodology by which data were produced
John Tukey on EDA
John Tukey (1915-2000) was a Princeton Mathematician & Statistician�and an Early Data Scientist.
Coined/Introduced:
12
[Data Analysis & Statistics, Tukey 1965; Image from LIFE Magazine]
EDA is like detective work:
Exploratory data analysis is an attitude, a state of�flexibility, a willingness to look for those things that�we believe are not there, as well as those that�we believe to be there.
Key Data Properties to Consider in EDA
Lecture 04, Data 100 Summer 2022
Data Wrangling and Exploratory Data Analysis: An Infinite Loop
Key Data Properties to Consider in EDA
EDA Demo: Mauna Loa CO2
13
Structure -- the “shape” of a data file
Granularity -- how fine/coarse is each datum
Scope -- how (in)complete is the data
Temporality -- how is the data situated in time
Faithfulness -- how well does the data capture “reality”
14
What should�we look for?
Key Data Properties�to Consider in EDA
Structure
Lecture 04, Data 100 Summer 2022
Data Wrangling and Exploratory Data Analysis: An Infinite Loop
Key Data Properties to Consider in EDA
EDA Demo: Mauna Loa CO2
15
Structure -- the “shape” of a data file
Granularity -- how fine/coarse is each datum
Scope -- how (in)complete is the data
Temporality -- how is the data situated in time
Faithfulness -- how well does the data capture “reality”
File Format
Variable Type
Multiple files�(Primary and Foreign Keys)
16
Structure -- the “shape” of a data file
Granularity -- how fine/coarse is each datum
Scope -- how (in)complete is the data
Temporality -- how is the data situated in time
Faithfulness -- how well does the data capture “reality”
File Format
Variable Type
Multiple files�(Primary and Foreign Keys)
17
Rectangular Data
We prefer rectangular data for data analysis (why?)
Two kinds of rectangular data: Tables and Matrices.
18
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | | | | | |
Records/Rows
Fields/Attributes/�Features/Columns
Tables (a.k.a. dataframes in R/Python and relations in SQL)
Matrices
How are these data files formatted?
19
TSV
Tab separated values
CSV
Comma separated �values
JSON
Which is the best? It depends on your use case.
CSV: Comma Separated Values
San Francisco restaurant food safety scores, (San Francisco Department of Public Health)
Understand high-level structure:
CSV is a very common table file format:
Tabular data: pd.read_csv
20
Demo Slides
TSV: Tab Separated Values
Another common table file format.
pd.read_csv: Need to specify� delimiter='\t'
21
Issues with CSVs and TSVs:
Demo Slides
JSON: JavaScript Object Notation
Berkeley covid cases by day (City of Berkeley)
A less common table file format.
22
Issues
Tabular data: Find the records using regular Python, then pd.DataFrame.
Demo Slides
What is the following file format?
23
Often files will have mixed file formats, incorrect extensions or no extension at all.
You may need to explore the�actual raw data file!
Demo Slides
Other types of data formats
Log data (usually .txt)
169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04 HTTP/1.1" 301 328 "http://anson.ucdavis.edu/courses/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)”
169.237.6.168 - - [8/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)"
XML (Extensible Markup Language)
<catalog>
<plant type='a'>
<common>Bloodroot</common>
<botanical>Sanguinaria canadensis</botanical>
<zone>4</zone>
<light>Mostly Shady</light>
<price>2.44</price>
<availability>03/15/2006</availability>
<description>
<color>white</color>
<petals>true</petals>
</description>
<indoor>true</indoor>
</plant>
…
</catalog>
In Data 100 we will primarily work with CSV files, but there are other types of non-tabular data out in the wild.
24
CSV? TSV?�JSON? XML?
None of the above?
Make your custom parser!
Nested structure
Structure -- the “shape” of a data file
Granularity -- how fine/coarse is each datum
Scope -- how (in)complete is the data
Temporality -- how is the data situated in time
Faithfulness -- how well does the data capture “reality”
File Format
Variable Type
Multiple files�(Primary and Foreign Keys)
25
Records and Variables/Fields
All data (regardless of format) is composed of records.�Each record has a set of variables (aka fields).
Variables are defined by their type (2 defs):
26
Records/Rows
Fields/Attributes/Features/Columns
| business_id | business_name |
0 | 835 | Kam Po Kitchen |
1 | 905 | Working Girls' Cafe' |
Variable Feature Types
27
Quantitative
Qualitative �(categorical)
Variable
Examples:
Could be measured to arbitrary precision.
Examples:
Examples:
Categories w/ordered levels; no consistent meaning to difference
Categories w/ no
specific ordering.
Ordinal
Nominal
Continuous
Discrete
Examples:
Ratios and intervals have�meaning.
Finite possible values
Note that qualitative variables could have numeric levels; conversely, quantitative variables could be stored as strings!
Class Exercise
28
What is the feature type of each variable?
🤔
Q | Variable | Feature Type |
1 | CO2 level (PPM) | |
2 | Number of siblings | |
3 | GPA | |
4 | Income bracket (low, med, high) | |
5 | Race | |
6 | Number of years of education | |
7 | Yelp Rating | |
Quantitative
Qualitative
Ord-�inal
Nom-�inal
Cont-�inuous
Dis-�crete
Variable
A
B
C
D
Class Exercise: Solutions
29
What is the feature type of each variable?
🤔
Q | Variable | Feature Type |
1 | CO2 level (PPM) | |
2 | Number of siblings | |
3 | GPA | |
4 | Income bracket (low, med, high) | |
5 | Race | |
6 | Number of years of education | |
7 | Yelp Rating | |
Meta: For this exercise,�The Feature Type variable is Qualitative Nominal.
A. Quantitative Cont.
B. Quantitative Discrete
A. Quantitative Cont.
C. Qualitative Ordinal�
D. Qualitative Nominal
B. Quantitative Discrete�
C. Qualitative Ordinal
Quantitative
Qualitative
Ord-�inal
Nom-�inal
Cont-�inuous
Dis-�crete
Variable
A
B
C
D
Structure -- the “shape” of a data file
Granularity -- how fine/coarse is each datum
Scope -- how (in)complete is the data
Temporality -- how is the data situated in time
Faithfulness -- how well does the data capture “reality”
File Format
Variable Type
Multiple files�(Primary and Foreign Keys)
30
Structure: Keys
Sometimes your data comes in multiple files:
Primary key: the column or set of columns in a table that determine the values of the remaining columns
31
OrderNum | ProdID | Quantity |
1 | 42 | 3 |
1 | 999 | 2 |
2 | 42 | 1 |
OrderNum | CustID | Date |
1 | 171345 | 8/21/2017 |
2 | 281139 | 8/30/2017 |
ProdID | Cost |
42 | 3.14 |
999 | 2.72 |
Products.csv
Orders.csv
CustID | Addr |
171345 | Harmon.. |
281139 | Main .. |
Customers.csv
Primary Key
Primary Key
Purchases.csv
Structure: Keys
Structure: Keys
Sometimes your data comes in multiple files:
Primary key: the column or set of columns in a table that determine the values of the remaining columns
Foreign keys: the column or sets of columns that reference primary keys in other tables.
You may need to join across tables!� pd.merge
32
OrderNum | ProdID | Quantity |
1 | 42 | 3 |
1 | 999 | 2 |
2 | 42 | 1 |
OrderNum | CustID | Date |
1 | 171345 | 8/21/2017 |
2 | 281139 | 8/30/2017 |
ProdID | Cost |
42 | 3.14 |
999 | 2.72 |
Products.csv
Orders.csv
CustID | Addr |
171345 | Harmon.. |
281139 | Main .. |
Customers.csv
Primary Key
Primary Key
Purchases.csv
Foreign Key
Structure -- the “shape” of a data file
Granularity -- how fine/coarse is each datum
Scope -- how (in)complete is the data
Temporality -- how is the data situated in time
Faithfulness -- how well does the data capture “reality”
Are the data in a standard format or encoding?
Are the data organized in records or nested?
Does the data reference other data?
What are the fields in each record?
33
Summary
You will do the most data wrangling when analyzing the structure of your data.
Interlude
Assignments
Labs 1 and 2, Homework 1 due tonight! If you need more time, remember to fill out the slip day request form!
Office Hours
We have online office hours! If you are ever stuck on a homework or lab, please bring them to office hours!
34
Granularity, Scope, Temporality
Lecture 04, Data 100 Summer 2022
Data Wrangling and Exploratory Data Analysis: An Infinite Loop
Key Data Properties to Consider in EDA
EDA Demo: Mauna Loa CO2
35
Structure -- the “shape” of a data file
Granularity -- how fine/coarse is each datum
Scope -- how (in)complete is the data
Temporality -- how is the data situated in time
Faithfulness -- how well does the data capture “reality”
36
Question & Problem
Formulation
Data
Acquisition
Exploratory Data Analysis
Granularity
What does each record represent?
Do all records capture granularity at the same level?
If the data are coarse, how were the records aggregated?
37
Fine �Grained
Coarse
Grained
Rec. 1
Rec. 2
Rec. 3
Rec. 1
Rec. 2
Rec. 3
Rec. 1
Scope
Does my data cover my area of interest?
Are my data too expansive?
Does my data cover the right time frame?
38
Scope
Does my data cover my area of interest?
Are my data too expansive?
Does my data cover the right time frame?
(recall) The sampling frame is the population from which the data were sampled. Note that this may not be the population of interest.
How complete/incomplete is the frame (and its data)?
39
Temporality
Data changes – when was the data collected/last updated?
Periodicity — Is there periodicity? Diurnal (24-hr) patterns?
40
What is the meaning of the time and date fields? A few options:
Are there strange null values?
Time depends on where! (time zones & daylight savings)
Temporality: Unix Time / POSIX Time
Time measured in seconds since January 1st 1970
Unix time follows Coordinated Universal Time (UTC)
Time Zones:
41
Jun 26, 2022 6:00pm PDT
1656374400
Faithfulness (and Missing Values)
Lecture 04, Data 100 Summer 2022
Data Wrangling and Exploratory Data Analysis: An Infinite Loop
Key Data Properties to Consider in EDA
EDA Demo: Mauna Loa CO2
42
Structure -- the “shape” of a data file
Granularity -- how fine/coarse is each datum
Scope -- how (in)complete is the data
Temporality -- how is the data situated in time
Faithfulness -- how well does the data capture “reality”
43
Faithfulness: Do I trust this data?
Does my data contain unrealistic or “incorrect” values?
Does my data violate obvious dependencies?
Was the data entered by hand?
Are there obvious signs of data falsification?
44
Signs that your data may not be faithful (and Solutions)
Truncated data
Time Zone Inconsistencies
Duplicated Records or Fields
Spelling Errors
Units not specified or consistent
Missing Data
45
How to Address Missing Data/Default Values
Drop records with missing values
Imputation: Inferring missing values
Other Suggestions (thanks Prof. Gonzalez):
You’ll see Suggestion 1 in this week’s homework; you’ll see Suggestion 2 in the wild.
46
Examples
" “
0, -1�999, 12345
1970, 1900
NaN
Null
NaN: “Not a Number”
Demo: Mauna Loa CO2 EDA
Lecture 04, Data 100 Summer 2022
Data Wrangling and Exploratory Data Analysis: An Infinite Loop
Key Data Properties to Consider in EDA
EDA Demo: Mauna Loa CO2
47
What are our Variable Feature Types?
From file description:
What are the first three columns? How do these columns define each record?
48
EDA step:
Understand what each record, each feature represents
Demo Slides
The Search for the Missing Values
From file description:
Which approach? Drop, NaN, Interpolate
Granularity of data: What do we want to report? How long is the timescale?
49
EDA step:
Hypothesize why these values were missing, then use that knowledge to decide whether to drop or impute missing values
Demo Slides
A Discussion on Data Granularity
BONUS MATERIAL
From the description:
Which granularity to present?
50
You want the granularity of your data to match your research question.
Demo Slides
Summary: How do you do EDA/Data Wrangling?
Examine data and metadata:
Examine each field/attribute/dimension individually
Examine pairs of related dimensions
Along the way:
51
Data Wrangling and EDA
Content credit: Joseph Gonzalez, Deborah Nolan, Joseph Hellerstein, and Lisa Yan
52
LECTURE 4