1 of 44

Welcome to Class!

2 of 44

Data 198: Introduction to Real World Data Science

3/3/2025: Intro to Pandas (II) and Data Cleaning

🐼

3 of 44

Agenda

  • Lightning Talk with Jonathan Ferrari
  • Lecture
    • Be respectful to the presenters!
    • Stay away from devices unless filling out attendance / referencing lecture material
    • Pandas Review (In Depth)
    • Data Cleaning
  • Live Deepnote Demo
  • Blooket
  • Group work time


4 of 44

🧊Ice Breaker!🧊

Would you rather:

Get an A+ in an important class, but honestly not learn much

OR

Fail an important class, but learn everything important for whatever job you need in the future

5 of 44

Link to Slack Attendance Form!!

6 of 44

Lightning Talk

With Jonathan Ferrari

7 of 44


Today’s Instructors

Wayland La

2nd Year

Data Science

Domain Emphasis: Machine Learning, NLP

Interests: Sports, Education

Iana Mae Peralta

3rd Year

Data Science

Domain Emphasis: Business, AI/ML, Education

Interests: Cooking/baking, Matcha, Food

8 of 44

🐼 Pandas 🐼

9 of 44

Review: What is Pandas?

  • Python library used for data manipulation and analysis
  • The main way to store data is in DataFrames, which are data structures that store tabular data with labeled rows and columns


10 of 44

Aliases


import pandas as pd

import numpy as np

An easier way to call functions in these packages!

numpy.array() → np.array()
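
A minimal sketch of the aliases in action (the values here are made up for illustration):

import numpy as np
import pandas as pd

arr = np.array([1, 2, 3])           # instead of numpy.array([1, 2, 3])
df = pd.DataFrame({"score": arr})   # instead of pandas.DataFrame(...)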

11 of 44

DataFrame, Series, Index Review

  • DataFrame - 2D data (rows and columns)
  • Series - 1D data (of a single column)
  • Index - a sequence of row labels (numeric or non-numeric)
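
For example, a minimal sketch (the data is made up for illustration):

import pandas as pd

df = pd.DataFrame({"name": ["Ada", "Bo"], "score": [90, 85]})   # DataFrame: 2D rows and columns
s = df["score"]    # Series: 1D data from a single column
idx = df.index     # Index: the row labels (0, 1 by default)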


12 of 44

Selecting Columns


DF["column"] → Series

  • Imagine a Series as just a sequence of objects
  • A Series can be used for array math (np.mean, np.sum); see the sketch below
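
A minimal sketch, assuming a hypothetical df with a "score" column:

import numpy as np
import pandas as pd

df = pd.DataFrame({"score": [90, 85, 70]})
s = df["score"]   # single brackets → Series
np.mean(s)        # array math works on a Series (here: about 81.67)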

13 of 44

Selecting Columns (cont.)*


DF[["column"]] → DataFrame

*Alternative: Series.to_frame()

DF["column"].to_frame() → DataFrame
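
A minimal sketch of both approaches, assuming a hypothetical df with a "score" column:

df[["score"]]            # double brackets → one-column DataFrame
df["score"].to_frame()   # same result via Series.to_frame()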

14 of 44

Selecting Columns (cont.)


DF[["column_1", "column_2"]] → DataFrame of column_1 and column_2

DataFrame usage:

  • data manipulation and analysis
  • Make sure to double-check your DataFrame as you go, to confirm you are manipulating it correctly! (see the sketch below)
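
A minimal sketch, assuming a hypothetical df with "name" and "score" columns:

subset = df[["name", "score"]]   # list of labels → DataFrame with both columns
subset.head()                    # peek at the result to confirm the selection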

15 of 44

Conditional Selection


Selecting rows based on a condition (e.g., >, ==, True/False checks, etc.)

Tip: Imagine conditional selection like comparisons on arrays!

np.array([3, 1]) < 2

# returns array([False, True])

16 of 44

Conditional Selection (cont.)


DF["name"] == "things"

returns a Series of True/False values (True wherever the "name" column equals "things")

DF[DF["name"] == "things"]

returns a DataFrame with only the rows where the condition is True (see the sketch below)
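
A minimal sketch, assuming a hypothetical df with a "name" column:

mask = df["name"] == "things"   # Boolean Series: True where name is "things"
df[mask]                        # DataFrame containing only the matching rows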

17 of 44

Explain this code

nba[(nba["lastName"] == "Green") & (nba["Reb"] > 5)]
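
One way to break it down (a sketch; the nba DataFrame and its columns come from this example):

green = nba["lastName"] == "Green"   # Boolean Series: True where lastName is "Green"
many_rebounds = nba["Reb"] > 5       # Boolean Series: True where rebounds exceed 5
nba[green & many_rebounds]           # rows where BOTH are True (note & rather than "and")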


18 of 44

Adding / Replacing with New Columns

1. Make a new Series (the new values):

new_series = DF["name"] == "things"

2. Put the Series into the DataFrame:

DF["new_col"] = new_series

(A full sketch follows below.)
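
A minimal sketch, assuming a hypothetical df with a "name" column:

import pandas as pd

df = pd.DataFrame({"name": ["things", "stuff", "things"]})
new_series = df["name"] == "things"   # Step 1: build the new values as a Series
df["new_col"] = new_series            # Step 2: assign them to a new column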

19 of 44

Before

[Diagram: the DataFrame before the new column is added]

20 of 44

After

[Diagram: the DataFrame after adding new_col, which now holds the values of new_series]

21 of 44

Adding new columns to your dataframe
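
A minimal sketch of adding a computed column (the "name_lengths" column name is borrowed from the drop example on the next slide):

import pandas as pd

df = pd.DataFrame({"name": ["Ada", "Bo"]})
df["name_lengths"] = df["name"].str.len()   # new column computed from an existing one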

22 of 44


Dropping columns from your dataframe

df.drop(columns=["name_lengths", "Count"])

Renaming columns

df.rename(columns={"old1": "new1", "old2": "new2"})

(Both return a new DataFrame; assign the result back to keep the change.)

23 of 44

GOOGLE, GOOGLE, GOOGLE! (or ChatGPT)

:D

24 of 44

🏋️‍♀️Data Cleaning🧻

25 of 44

What is Data Cleaning?

  • Textbook Definition: “The process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset”
  • General Definition: Making data more readable and user-friendly

26 of 44

Why Do We Need to Clean Data?

  • You will only get good data analysis out if you put good, structured data in.
  • “Garbage In, Garbage Out”

27 of 44

Data Science Life Cycle

  • Part of “understand the data”
  • Transform the data to fit what helps you in your analysis
  • It’s important to know what types of data you are working with before jumping in

28 of 44

Inspection and Understanding Data

  • It’s important to look at the data before cleaning
    • Summary statistics like mean, median, maximum, and minimum
  • Questions you can ask:
    • Is the data in a column recorded as a string or a number?
    • How many values are missing?
    • How many unique values are in a column, and what is their distribution?
    • Is this dataset linked to, or does it have a relationship with, another dataset?

29 of 44

Removing Irrelevant Observations

  • It’s important to drop irrelevant information/columns from our data so we can keep it clean and structured
  • We can use pandas to drop columns in our data frame

df.drop(columns=["three"])

30 of 44

Example

If we were to predict TV Rating, what columns would we consider dropping?


31 of 44

Removing Irrelevant Observations

In general:

  • Pandas - df.loc[df["column"] CONDITIONAL STATEMENT]
  • Data 8 - tbl.where("column", are.CONDITION())

Example (equality):

  • df.loc[df["column"] == x]
  • tbl.where("column", are.equal_to(x))
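
A minimal concrete sketch of the pandas version, with a made-up column name and value:

df.loc[df["team"] == "Warriors"]   # keep only rows where the team column equals "Warriors"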

32 of 44

Removing Irrelevant Observations

  • Many times there are duplicate entries that should be deleted if they are not intentional
    • Accidentally repeated data can skew your results and make them inaccurate
  • We can use pandas to go through a data frame and drop duplicate rows (see the sketch below)

df.drop_duplicates()
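
A minimal sketch; the subset= parameter limits which columns are compared (the "name" column here is a made-up example):

df = df.drop_duplicates()                  # drop rows that are exact duplicates
df = df.drop_duplicates(subset=["name"])   # or: treat rows with the same name as duplicates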

33 of 44

Fix Data Types

  • Make sure numbers are stored as numerical data types
  • Syntax, readability, and misspellings are things to watch out for
    • "Men" == "men" evaluates to False
  • Watch out for values like "0", "Not Applicable", "NA", "None", "Null", or "INF"; they might all mean the same thing:
    • The value is missing.
  • You can check a Python object's type with the type() function; for DataFrame columns, check df.dtypes
  • Change data types using .astype() (see the sketch below)
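
A minimal sketch, assuming a hypothetical "count" column that was read in as strings:

import pandas as pd

df = pd.DataFrame({"count": ["1", "2", "3"]})
df.dtypes                               # "count" shows up as object (strings)
df["count"] = df["count"].astype(int)   # convert the column to integers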

34 of 44

Filter Unwanted Outliers

  • ONLY remove an outlier if you have a legitimate reason
  • Just because an outlier exists does not mean it is incorrect
  • If an outlier proves to be irrelevant for the analysis or is a mistake, consider removing it (see the sketch below)
    • Remember… IQR = Interquartile Range = Q3 - Q1, where Q1 and Q3 are the 25th and 75th percentiles of the dataset
    • Any value over Q3 + 1.5 * IQR is an outlier
    • Any value under Q1 - 1.5 * IQR is an outlier
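
A minimal sketch of the IQR rule on a hypothetical "score" column:

q1 = df["score"].quantile(0.25)
q3 = df["score"].quantile(0.75)
iqr = q3 - q1
within_fences = (df["score"] >= q1 - 1.5 * iqr) & (df["score"] <= q3 + 1.5 * iqr)
df = df[within_fences]   # keep only the rows inside the IQR fences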

35 of 44

Handle Missing Data

  • If the dataset is large:
    • Drop the observations with missing values
    • Quick and easy
    • df.dropna()
  • If the dataset is not large:
    • Impute values based on other observations (mean, median, linear regression)
    • df.fillna(x) (see the sketch below)
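
A minimal sketch of both options, assuming a hypothetical "score" column:

df = df.dropna()                                        # option 1: drop rows with missing values
df["score"] = df["score"].fillna(df["score"].mean())   # option 2: fill missing scores with the mean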

36 of 44

Steps of Data Cleaning

  1. Remove duplicate values and irrelevant data/observations
  2. Fix Data Types
  3. Filter any unwanted outliers
  4. Handle all missing data

37 of 44

Final word: How to solve Errors


  1. Think about what you want your dataset to look like at the end
  2. Break down the steps for getting there, one at a time, in plain English
  3. Jot down the functions you might use for each step (adding comments if helpful)
  4. After each individual step, look at the dataset to double-check that the result is what you intended

38 of 44

Validate Data

  • Make sure data makes sense
  • Structured for analysis
  • Follows rules and regulations you put in place
  • Make sure the data works for what you want to accomplish

You guys got this!

39 of 44

Resources!

  • Pandas Tutor
    • Helps visualize how Python + pandas analyze DataFrames!
    • Link

  • Data 100 lecture slides
    • Weeks 1-4 usually
    • Link

  • Pandas API Reference
    • Good reference for functions
    • Link

40 of 44

Deepnote!


41 of 44

Group Up!

42 of 44

Blooket Link!

play.blooket.com

43 of 44

Thank You!

Time to work together in Groups!
