1 of 52

Data Wrangling and EDA

Exploratory Data Analysis and its role in the data science lifecycle

Data 100, Summer 2022 @ UC Berkeley

Anirudhan Badrinath and Dominic Liu

1

LECTURE 4

2 of 52

Today’s Roadmap

Lecture 04, Data 100 Summer 2022

Data Wrangling and Exploratory Data Analysis: An Infinite Loop

Key Data Properties to Consider in EDA

  • Structure
    • File format
    • Variable types
    • Primary and Foreign Keys
  • Granularity, Scope, Temporality
  • Faithfulness (and Missing Values)

EDA Demo: Mauna Loa CO2


3 of 52

Previously…

4 of 52

Pandas and Jupyter Notebooks

  • Introduced DataFrame concepts
    • Series: A named column of data with an index
    • Indices: The mapping from keys to rows
    • DataFrame: A collection of Series with a common index
  • DataFrame access methods
    • Filtering on predicates and slicing
    • df.loc: location by index
    • df.iloc: location by integer address
    • groupby & pivot for aggregating data


5 of 52

Now

Congratulations!!!

You have collected or have been given a box of data.

What do you do next?


Box of Data

6 of 52

Plan for next few lectures


[Diagram: the data science lifecycle: Question & Problem Formulation -> Data Acquisition -> Exploratory Data Analysis -> Prediction and Inference -> ?]

7 of 52

Plan for this week


[Diagram: the data science lifecycle: Question & Problem Formulation -> Data Acquisition -> Exploratory Data Analysis -> Prediction and Inference -> ?, annotated with this week's topics]

Mon and Tue (Part I: Processing Data):

  • Data Wrangling and Intro to EDA (today)
  • Working with Text Data and Regular Expressions

Wed and Thur (Part II: Visualizing and Reporting Data):

  • Plots and variables, Seaborn
  • Viz principles, KDE/Transformations

8 of 52

Data Wrangling and EDA: An Infinite Loop

Lecture 04, Data 100 Summer 2022

Data Wrangling and EDA: An Infinite Loop

Key Data Properties to Consider in EDA

  • Structure
    • File format
    • Variable types
    • Primary and Foreign Keys
  • Granularity, Scope, Temporality
  • Faithfulness (and Missing Values)

EDA Demo: Mauna Loa CO2


9 of 52

The Infinite Loop of Data Science


[Diagram: an infinite loop alternating between Data Wrangling and Exploratory Data Analysis (EDA)]

10 of 52

Data Wrangling

Often addresses issues like…

  • structure / formatting
  • missing or corrupted values
  • unit conversion
  • encoding text as numbers

Sadly, data cleaning is a big part of data science…


Data Wrangling, or Data Cleaning:

The process of transforming raw data to facilitate subsequent analysis.

11 of 52

Exploratory Data Analysis (EDA)

“Getting to Know the Data”

The process of transforming, visualizing, and summarizing data to:

  • Build/confirm understanding of the data and its provenance
  • Identify and address potential issues in the data
  • Inform the subsequent analysis
  • Discover potential hypotheses … (be careful…)

EDA is an open-ended analysis.

  • Be willing to find something surprising!


Provenance: origin of data; methodology by which data were produced

12 of 52

John Tukey on EDA

John Tukey (1915-2000) was a Princeton Mathematician & Statistician and an Early Data Scientist.

Coined/Introduced:

  • Fast Fourier Transform algorithm
  • “Bit” : binary digit
  • Exploratory Data Analysis


[Data Analysis & Statistics, Tukey 1965; Image from LIFE Magazine]

EDA is like detective work:

Exploratory data analysis is an attitude, a state of flexibility, a willingness to look for those things that we believe are not there, as well as those that we believe to be there.

13 of 52

Key Data Properties to Consider in EDA

Lecture 04, Data 100 Summer 2022

Data Wrangling and Exploratory Data Analysis: An Infinite Loop

Key Data Properties to Consider in EDA

  • Structure
    • File format
    • Variable types
    • Primary and Foreign Keys
  • Granularity, Scope, Temporality
  • Faithfulness (and Missing Values)

EDA Demo: Mauna Loa CO2


14 of 52

Structure -- the “shape” of a data file

Granularity -- how fine/coarse is each datum

Scope -- how (in)complete is the data

Temporality -- how is the data situated in time

Faithfulness -- how well does the data capture “reality”


What should we look for?

Key Data Properties to Consider in EDA

15 of 52

Structure

Lecture 04, Data 100 Summer 2022

Data Wrangling and Exploratory Data Analysis: An Infinite Loop

Key Data Properties to Consider in EDA

  • Structure
    • File format
    • Variable types
    • Primary and Foreign Keys
  • Granularity, Scope, Temporality
  • Faithfulness (and Missing Values)

EDA Demo: Mauna Loa CO2


16 of 52

Structure -- the “shape” of a data file

Granularity -- how fine/coarse is each datum

Scope -- how (in)complete is the data

Temporality -- how is the data situated in time

Faithfulness -- how well does the data capture “reality”

File Format

Variable Type

Multiple files (Primary and Foreign Keys)


17 of 52

Structure -- the “shape” of a data file

Granularity -- how fine/coarse is each datum

Scope -- how (in)complete is the data

Temporality -- how is the data situated in time

Faithfulness -- how well does the data capture “reality”

File Format

Variable Type

Multiple files (Primary and Foreign Keys)


18 of 52

Rectangular Data

We prefer rectangular data for data analysis (why?)

  • Regular structures are easy to manipulate and analyze
  • A big part of data cleaning is about transforming data to be more rectangular

Two kinds of rectangular data: Tables and Matrices.


Records/Rows

Fields/Attributes/Features/Columns

Tables (a.k.a. dataframes in R/Python and relations in SQL)

  • Named columns with different types
  • Manipulated using data transformation languages (map, filter, group by, join, …)

Matrices

  • Numeric data of the same type (float, int, etc.)
  • Manipulated using linear algebra

19 of 52

How are these data files formatted?


TSV

Tab-separated values

CSV

Comma-separated values

JSON

Which is the best? It depends on your use case.

20 of 52

CSV: Comma Separated Values

San Francisco restaurant food safety scores (San Francisco Department of Public Health)

Understand high-level structure:

  1. How big is the data file?
  2. How is the data file formatted?
  3. How do we read the data into pandas?

CSV is a very common table file format:

  • Records (rows) are delimited by a newline: '\n', "\r\n"
  • Fields (columns) are delimited by commas: ','

Tabular data: pd.read_csv
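A minimal sketch of reading CSV-formatted text with pd.read_csv; io.StringIO stands in for a real file path (the column names below echo the restaurant example in this deck, but the values are illustrative):

```python
import io

import pandas as pd

# A small CSV: records delimited by newlines, fields by commas.
# io.StringIO stands in for a real file such as "restaurants.csv".
csv_data = io.StringIO(
    "business_id,business_name,score\n"
    "835,Kam Po Kitchen,88\n"
    "905,Working Girls' Cafe',72\n"
)

df = pd.read_csv(csv_data)
print(df.shape)  # (2, 3): two records, three fields
```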


Demo Slides

21 of 52

TSV: Tab Separated Values

Another common table file format.

  • Records are delimited by a newline: '\n', "\r\n"
  • Fields are delimited by '\t' (tab)

pd.read_csv: Need to specify delimiter='\t'
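The same pd.read_csv call handles TSV once the delimiter is specified; a minimal sketch with io.StringIO standing in for a real file:

```python
import io

import pandas as pd

# Same kind of records, but tab-delimited.
tsv_data = io.StringIO(
    "business_id\tbusiness_name\n"
    "835\tKam Po Kitchen\n"
    "905\tWorking Girls' Cafe'\n"
)

# delimiter='\t' (or the alias sep='\t') tells pandas the fields
# are separated by tabs rather than commas.
df = pd.read_csv(tsv_data, delimiter="\t")
```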


Issues with CSVs and TSVs:

  • Commas or tabs inside field values
  • Quoting

Demo Slides

22 of 52

JSON: JavaScript Object Notation

Berkeley covid cases by day (City of Berkeley)

A less common table file format.

  • Very similar to Python dictionaries
  • Strict formatting and quoting address some of the issues in CSV/TSV
  • Can save metadata (data about the data) along with records in the same file


Issues

  • Not rectangular
  • Each record can have different fields
  • Nesting means records can contain tables – complicated

Tabular data: Find the records using regular Python, then pd.DataFrame.
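A sketch of that workflow, with hypothetical JSON contents modeled loosely on the Berkeley covid example (the "meta" and "data" field names are assumptions, not the real file's schema):

```python
import json

import pandas as pd

# Hypothetical JSON file contents: metadata plus a list of records.
raw = json.loads("""
{
  "meta": {"source": "City of Berkeley", "updated": "2022-06-26"},
  "data": [
    {"date": "2022-06-24", "new_cases": 12},
    {"date": "2022-06-25", "new_cases": 9}
  ]
}
""")

# Find the records using regular Python, then hand them to pandas.
df = pd.DataFrame(raw["data"])
```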

Demo Slides

23 of 52

What is the following file format?

Mauna Loa Observatory CO2 levels (NOAA)

How do we load these data into Pandas?

pd.read_csv? pd.DataFrame?


Often files will have mixed file formats, incorrect extensions, or no extension at all.

You may need to explore the actual raw data file!
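A quick way to explore a raw file is to print its first few lines before choosing a parser; this sketch uses io.StringIO with made-up contents standing in for a real open file handle:

```python
import io

# io.StringIO stands in for open("data.txt") on a real, unknown file.
f = io.StringIO("# comment line\ncol_a\tcol_b\n1\t2\n3\t4\n")

# Inspect the first few raw lines before committing to a parser:
# comment markers, delimiters, and headers all show up here.
first_lines = [line.rstrip("\n") for line in f.readlines()[:3]]
for line in first_lines:
    print(line)
```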

Demo Slides

24 of 52

Other types of data formats

Log data (usually .txt)

169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04 HTTP/1.1" 301 328 "http://anson.ucdavis.edu/courses/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)"

169.237.6.168 - - [8/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)"
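A custom parser for log lines like those above can be sketched with a regular expression (regexes are covered later this week); the group structure here is one plausible choice, not the only one:

```python
import re

line = ('169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] '
        '"GET /stat141/Winter04 HTTP/1.1" 301 328')

# Groups: host, timestamp, request line, status code, response size.
pattern = r'^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d+) (\d+)'
match = re.match(pattern, line)
host, ts, request, status, size = match.groups()
```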

XML (Extensible Markup Language)

<catalog>
  <plant type='a'>
    <common>Bloodroot</common>
    <botanical>Sanguinaria canadensis</botanical>
    <zone>4</zone>
    <light>Mostly Shady</light>
    <price>2.44</price>
    <availability>03/15/2006</availability>
    <description>
      <color>white</color>
      <petals>true</petals>
    </description>
    <indoor>true</indoor>
  </plant>
</catalog>

In Data 100 we will primarily work with CSV files, but there are other types of non-tabular data out in the wild.


CSV? TSV?�JSON? XML?

None of the above?

Write your own custom parser!

Nested structure

25 of 52

Structure -- the “shape” of a data file

Granularity -- how fine/coarse is each datum

Scope -- how (in)complete is the data

Temporality -- how is the data situated in time

Faithfulness -- how well does the data capture “reality”

File Format

Variable Type

Multiple files (Primary and Foreign Keys)


26 of 52

Records and Variables/Fields

All data (regardless of format) is composed of records. Each record has a set of variables (aka fields).

  • Tabular: Records == Rows, Variables == Columns
  • Non-Tabular: Create records and wrangle them into tabular form

Variables are defined by their type (two definitions):

  • Storage type in pandas: integer, floating point, boolean, object (string-like), etc. See df[colname].dtype
  • Feature type: conceptual notion of the information. Use expert knowledge, explore the data itself, and consult the data codebook (if it exists).
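The storage type is easy to check in pandas; a minimal sketch, with values taken from the restaurant example in this deck:

```python
import pandas as pd

df = pd.DataFrame({
    "business_id": [835, 905],
    "business_name": ["Kam Po Kitchen", "Working Girls' Cafe'"],
})

# Storage type is not the same as feature type: business_id is stored
# as an integer (int64) but is conceptually a nominal variable.
print(df["business_id"].dtype)    # int64
print(df["business_name"].dtype)  # object (string-like)
```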


Records/Rows

Fields/Attributes/Features/Columns

     business_id   business_name
0    835           Kam Po Kitchen
1    905           Working Girls' Cafe'

27 of 52

Variable Feature Types


Variable

  • Quantitative: ratios and intervals have meaning.
    • Continuous: could be measured to arbitrary precision.
      Examples: price, temperature.
    • Discrete: finite possible values.
      Examples: number of siblings, years of education.
  • Qualitative (categorical):
    • Ordinal: categories with ordered levels; no consistent meaning to differences.
      Examples: preferences, level of education.
    • Nominal: categories with no specific ordering.
      Examples: political affiliation, Cal ID number, ZIP code.

Note that qualitative variables could have numeric levels; conversely, quantitative variables could be stored as strings!

28 of 52

Class Exercise


What is the feature type of each variable? 🤔

Options:
  A. Quantitative Continuous
  B. Quantitative Discrete
  C. Qualitative Ordinal
  D. Qualitative Nominal

  Q   Variable                          Feature Type
  1   CO2 level (PPM)
  2   Number of siblings
  3   GPA
  4   Income bracket (low, med, high)
  5   Race
  6   Number of years of education
  7   Yelp Rating
29 of 52

Class Exercise: Solutions


What is the feature type of each variable? 🤔

  Q   Variable                          Feature Type
  1   CO2 level (PPM)                   A. Quantitative Continuous
  2   Number of siblings                B. Quantitative Discrete
  3   GPA                               A. Quantitative Continuous
  4   Income bracket (low, med, high)   C. Qualitative Ordinal
  5   Race                              D. Qualitative Nominal
  6   Number of years of education      B. Quantitative Discrete
  7   Yelp Rating                       C. Qualitative Ordinal

Meta: For this exercise, the Feature Type variable is Qualitative Nominal.
30 of 52

Structure -- the “shape” of a data file

Granularity -- how fine/coarse is each datum

Scope -- how (in)complete is the data

Temporality -- how is the data situated in time

Faithfulness -- how well does the data capture “reality”

File Format

Variable Type

Multiple files (Primary and Foreign Keys)


31 of 52

Structure: Keys

Sometimes your data comes in multiple files:

  • Often data will reference other pieces of data.

Primary key: the column or set of columns in a table that determine the values of the remaining columns

  • Primary keys are unique
  • Examples: SSN, ProductIDs, …


Purchases.csv                    Orders.csv
OrderNum  ProdID  Quantity       OrderNum  CustID   Date
1         42      3              1         171345   8/21/2017
1         999     2              2         281139   8/30/2017
2         42      1              (Primary Key: OrderNum)

Products.csv                     Customers.csv
ProdID  Cost                     CustID   Addr
42      3.14                     171345   Harmon..
999     2.72                     281139   Main ..
(Primary Key: ProdID)
32 of 52

Structure: Keys


Sometimes your data comes in multiple files:

  • Often data will reference other pieces of data.

Primary key: the column or set of columns in a table that determine the values of the remaining columns

  • Primary keys are unique
  • Examples: SSN, ProductIDs, …

Foreign keys: the column or set of columns that reference primary keys in other tables.

You may need to join across tables! pd.merge
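A sketch of such a join with pd.merge, using the Orders and Purchases tables from this slide (values copied from the slide):

```python
import pandas as pd

orders = pd.DataFrame({"OrderNum": [1, 2],
                       "CustID": [171345, 281139],
                       "Date": ["8/21/2017", "8/30/2017"]})
purchases = pd.DataFrame({"OrderNum": [1, 1, 2],
                          "ProdID": [42, 999, 42],
                          "Quantity": [3, 2, 1]})

# Join the foreign key in Purchases to the primary key in Orders.
joined = purchases.merge(orders, on="OrderNum", how="left")
```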


Purchases.csv                    Orders.csv
OrderNum  ProdID  Quantity       OrderNum  CustID   Date
1         42      3              1         171345   8/21/2017
1         999     2              2         281139   8/30/2017
2         42      1              (Primary Key: OrderNum)
(Foreign Key: OrderNum)

Products.csv                     Customers.csv
ProdID  Cost                     CustID   Addr
42      3.14                     171345   Harmon..
999     2.72                     281139   Main ..
(Primary Key: ProdID)
33 of 52

Structure -- the “shape” of a data file

Granularity -- how fine/coarse is each datum

Scope -- how (in)complete is the data

Temporality -- how is the data situated in time

Faithfulness -- how well does the data capture “reality”

Are the data in a standard format or encoding?

  • Tabular data: CSV, TSV, Excel, SQL
  • Nested data: JSON or XML

Are the data organized in records or nested?

  • Can we define records by parsing the data?
  • Can we reasonably un-nest the data?

Does the data reference other data?

  • Can we join/merge the data?
  • Do we need to?

What are the fields in each record?

  • How are they encoded? (e.g., strings, numbers, binary, dates …)
  • What is the type of the data?


Summary

You will do the most data wrangling when analyzing the structure of your data.

34 of 52

Interlude

Assignments

Labs 1 and 2, Homework 1 due tonight! If you need more time, remember to fill out the slip day request form!

Office Hours

We have online office hours! If you are ever stuck on a homework or lab, please bring them to office hours!


35 of 52

Granularity, Scope, Temporality

Lecture 04, Data 100 Summer 2022

Data Wrangling and Exploratory Data Analysis: An Infinite Loop

Key Data Properties to Consider in EDA

  • Structure
    • File format
    • Variable types
    • Primary and Foreign Keys
  • Granularity, Scope, Temporality
  • Faithfulness (and Missing Values)

EDA Demo: Mauna Loa CO2


36 of 52

Structure -- the “shape” of a data file

Granularity -- how fine/coarse is each datum

Scope -- how (in)complete is the data

Temporality -- how is the data situated in time

Faithfulness -- how well does the data capture “reality”


[Diagram: the data science lifecycle: Question & Problem Formulation -> Data Acquisition -> Exploratory Data Analysis]

37 of 52

Granularity

What does each record represent?

  • Examples: a purchase, a person, a group of users

Do all records capture granularity at the same level?

  • Some data will include summaries (aka rollups) as records

If the data are coarse, how were the records aggregated?

  • Sampling, averaging, …


[Diagram: three fine-grained records (Rec. 1, Rec. 2, Rec. 3) aggregated into a single coarse-grained record]

38 of 52

Scope

Does my data cover my area of interest?

  • Example: I am interested in studying crime in California but I only have Berkeley crime data.

Are my data too expansive?

  • Example: I am interested in student grades for DS100 but have student grades for all statistics classes.
  • Solution: Filtering ⇒ Implications on sample?
    • If the data is a sample I may have poor coverage after filtering …

Does my data cover the right time frame?

  • More on this in Temporality…


39 of 52

Scope

Does my data cover my area of interest?

  • Example: I am interested in studying crime in California but I only have Berkeley crime data.

Are my data too expansive?

  • Example: I am interested in student grades for DS100 but have student grades for all statistics classes.
  • Solution: Filtering ⇒ Implications on sample?
    • If the data is a sample I may have poor coverage after filtering …

Does my data cover the right time frame?

  • More on this in Temporality…

(recall) The sampling frame is the population from which the data were sampled. Note that this may not be the population of interest.

How complete/incomplete is the frame (and its data)?

  • How is the frame/data situated in place?
  • How well does the frame/data capture reality?
  • How is the frame/data situated in time?


40 of 52

Temporality

Data changes: when was the data collected or last updated?

Periodicity: are there periodic patterns, e.g., diurnal (24-hour) cycles?


What is the meaning of the time and date fields? A few options:

  • When the “event” happened?
  • When the data was collected or was entered into the system?
  • Date the data was copied into a database? (look for many matching timestamps)

Are there strange null values?

  • E.g., January 1st 1970, January 1st 1900…?

Time depends on where! (time zones & daylight saving time)

  • Regions have different datestring representations: 07/08/09?
  • Learn to use the Python datetime library and pandas .dt accessors
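For instance, pandas can parse an ambiguous datestring once the format is specified, after which the .dt accessor exposes the date components; a minimal sketch:

```python
import pandas as pd

# "07/08/09" is ambiguous; an explicit format resolves it.
# Here we assert US-style month/day/two-digit-year.
s = pd.to_datetime(pd.Series(["07/08/09"]), format="%m/%d/%y")

# The .dt accessor exposes components of datetime Series.
print(s.dt.year[0], s.dt.month[0], s.dt.day[0])  # 2009 7 8
```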

41 of 52

Temporality: Unix Time / POSIX Time

Time measured in seconds since January 1st 1970

  • Minus leap seconds …

Unix time follows Coordinated Universal Time (UTC)

  • International time standard
  • Measured at 0 degrees longitude
    • Similar to Greenwich Mean Time (GMT)
  • No daylight saving time
  • Time codes

Time Zones:

  • San Francisco: UTC-8, or UTC-7 during daylight saving time


Jun 27, 2022 5:00pm PDT

1656374400
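pandas can convert a Unix timestamp (seconds since Jan 1, 1970 UTC) into a timezone-aware datetime and shift it to a local zone; a minimal sketch using the timestamp above:

```python
import pandas as pd

# Interpret the integer as seconds since the Unix epoch, in UTC.
ts = pd.to_datetime(1656374400, unit="s", utc=True)

# Shift to Pacific time (PDT is UTC-7 in summer).
pacific = ts.tz_convert("America/Los_Angeles")
print(pacific)  # 2022-06-27 17:00:00-07:00
```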

42 of 52

Faithfulness (and Missing Values)

Lecture 04, Data 100 Summer 2022

Data Wrangling and Exploratory Data Analysis: An Infinite Loop

Key Data Properties to Consider in EDA

  • Structure
    • File format
    • Variable types
    • Primary and Foreign Keys
  • Granularity, Scope, Temporality
  • Faithfulness (and Missing Values)

EDA Demo: Mauna Loa CO2


43 of 52

Structure -- the “shape” of a data file

Granularity -- how fine/coarse is each datum

Scope -- how (in)complete is the data

Temporality -- how is the data situated in time

Faithfulness -- how well does the data capture “reality”


44 of 52

Faithfulness: Do I trust this data?

Does my data contain unrealistic or “incorrect” values?

  • Dates in the future for events in the past
  • Locations that don’t exist
  • Negative counts
  • Misspellings of names
  • Large outliers

Does my data violate obvious dependencies?

  • E.g., age and birthday don’t match

Was the data entered by hand?

  • Spelling errors, fields shifted …
  • Did the form require all fields or provide default values?

Are there obvious signs of data falsification?

  • Repeated names, fake looking email addresses, repeated use of uncommon names or fields.


45 of 52

Signs that your data may not be faithful (and Solutions)

Truncated data

  • Early Microsoft Excel limits: 65536 Rows, 256 Columns
  • Soln: be aware of consequences in analysis ⇒ how did truncation affect sample?

Time Zone Inconsistencies

  • Soln 1: convert to a common timezone (e.g., UTC)
  • Soln 2: convert to the timezone of the location – useful in modeling behavior.

Duplicated Records or Fields

  • Soln: identify and eliminate (use primary key) ⇒ implications on sample?

Spelling Errors

  • Soln: Apply corrections or drop records not in a dictionary ⇒ implications on sample?

Units not specified or consistent

  • Solns: Infer units, check values are in reasonable ranges for data

Missing Data

  • See next slide


46 of 52

How to Address Missing Data/Default Values

Drop records with missing values

  • Probably most common
    • Caution: check for biases induced by dropped values
      • Missing or corrupt records might be related to something of interest

Imputation: Inferring missing values

  • Average Imputation: replace with an average value
    • Which average? Often use closest related subgroup mean.
  • Hot deck imputation: replace with a random value
    • Choose a random value from the subgroup and use it for the missing value.
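The two imputation strategies above can be sketched in pandas; the data and the "group" column here are illustrative assumptions, not from any real dataset:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"group": ["a", "a", "b", "b"],
                   "value": [1.0, np.nan, 10.0, 14.0]})

# Average imputation: fill each missing value with the mean of its
# closest related subgroup (here, the "group" column).
df["imputed"] = df.groupby("group")["value"].transform(
    lambda v: v.fillna(v.mean()))

# Hot deck imputation: fill with a random observed value drawn
# from the same subgroup.
rng = np.random.default_rng(0)
observed = df.loc[df["value"].notna() & (df["group"] == "a"), "value"]
hot_deck = rng.choice(observed)
```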

Other Suggestions (thanks Prof. Gonzalez):

  1. Drop missing values but check for induced bias (use domain knowledge)
  2. Directly model missing values during future analysis

You’ll see Suggestion 1 in this week’s homework; you’ll see Suggestion 2 in the wild.


Examples of missing/default values:

  • Empty strings: "", " "
  • Sentinels: 0, -1, 999, 12345
  • Default dates: 1970, 1900
  • NaN ("Not a Number"), Null

47 of 52

Demo: Mauna Loa CO2 EDA

Lecture 04, Data 100 Summer 2022

Data Wrangling and Exploratory Data Analysis: An Infinite Loop

Key Data Properties to Consider in EDA

  • Structure
    • File format
    • Variable types
    • Primary and Foreign Keys
  • Granularity, Scope, Temporality
  • Faithfulness (and Missing Values)

EDA Demo: Mauna Loa CO2


48 of 52

What are our Variable Feature Types?

From file description:

  • All measurement variables (average, interpolated, trend) are monthly mean CO2 mole fraction
    • Computed from daily means
  • #days: Number of daily means in a month (i.e., # days equipment worked)

What are the first three columns? How do these columns define each record?


EDA step:

Understand what each record, each feature represents

Demo Slides

49 of 52

The Search for the Missing Values

From file description:

  • -99.99: missing monthly average Avg
  • -1: missing value for # days that the equipment was in operation that month.

Which approach? Drop, NaN, Interpolate

  • All 3 are probably fine since few missing values, but we choose interpolation
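The sentinel-then-interpolate pipeline can be sketched as follows; the -99.99 sentinel comes from the file description, but the values below are illustrative, not the real Mauna Loa data:

```python
import numpy as np
import pandas as pd

# Illustrative monthly averages; -99.99 marks a missing Avg.
avg = pd.Series([315.71, -99.99, 317.45])

# Convert the sentinel to NaN, then linearly interpolate the gap.
cleaned = avg.replace(-99.99, np.nan).interpolate()
```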

Granularity of data: What do we want to report? How long is the timescale?


EDA step:

Hypothesize why these values were missing, then use that knowledge to decide whether to drop or impute missing values

Demo Slides

50 of 52

A Discussion on Data Granularity

BONUS MATERIAL

From the description:

  • Monthly measurements are averages of average day measurements.
  • The NOAA GML website has datasets for daily/hourly measurements too.

Which granularity to present?

  • You can always go from finer-grained to coarser-grained data (groupby.agg), but not vice versa.
  • Fine-grained data can be computationally expensive: 61 years of seconds is a lot of records!
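Coarsening fine-grained data with groupby.agg can be sketched like this, on illustrative daily values (not the real NOAA data):

```python
import pandas as pd

# Fine-grained (daily) records can always be coarsened to monthly
# means, but monthly means cannot recover the daily values.
daily = pd.DataFrame({"month": [3, 3, 4, 4],
                      "co2": [315.6, 315.8, 317.2, 317.6]})

monthly = daily.groupby("month").agg({"co2": "mean"})
```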


You want the granularity of your data to match your research question.

Demo Slides

51 of 52

Summary: How do you do EDA/Data Wrangling?

Examine data and metadata:

  • What is the date, size, organization, and structure of the data?

Examine each field/attribute/dimension individually

Examine pairs of related dimensions

  • Stratifying earlier analysis: break down grades by major …

Along the way:

  • Visualize/summarize the data (Wednesday)
  • Validate assumptions about data and collection process
  • Identify and address anomalies
  • Apply data transformations and corrections (Thursday)
  • Record everything you do! (why?)


52 of 52

Data Wrangling and EDA

Content credit: Joseph Gonzalez, Deborah Nolan, Joseph Hellerstein, and Lisa Yan
