Welcome to Class!
Data 198: Introduction to Real World Data Science
3/3/2025: Intro to Pandas (II) and Data Cleaning
🐼
Agenda
🧊Ice Breaker!🧊
Would you rather:
Get an A+ in an important class without really learning much
OR
Fail an important class but learn everything you will need for your future job
Link to Slack Attendance Form!!
Lightning Talk
With Jonathan Ferrari
Today’s Instructors
Wayland La
2nd year
Data Science
Domain Emphasis: Machine Learning, NLP
Interests: Sports, Education
Iana Mae Peralta
3rd Year
Data Science
Domain Emphasis: Business, AI/ML, Education
Interests: Cooking/baking, Matcha, Food
🐼 Pandas 🐼
Review: What is Pandas?
Aliases
import pandas as pd
import numpy as np
Easier way to use functions in these packages!
numpy.array() → np.array()
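A minimal sketch of what the alias buys you: both names point at the same module, so np.array is just a shorter spelling of numpy.array. The variable names below are made up for illustration.

```python
import numpy
import numpy as np  # same module, shorter name

# The alias refers to the very same module object.
same_module = np is numpy

# So both spellings build the same array.
long_form = numpy.array([1, 2, 3])
short_form = np.array([1, 2, 3])
```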
DataFrame, Series, Index Review
Selecting Columns
Column
Row
Data Frame
DF["column"] → Series
Selecting Columns (cont.)*
DF[["column"]] → DataFrame
*Alternative: Series.to_frame()
DF["column"].to_frame() → DataFrame
Selecting Columns (cont.)
DF[["column_1", "column_2"]] → DataFrame of column_1 and column_2
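To see the single-bracket vs. double-bracket difference in action, here is a small sketch. The toy DataFrame and its column names are hypothetical, not from the class dataset.

```python
import pandas as pd

# Hypothetical toy data; the column names are made up for illustration.
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Cy"],
    "age": [20, 21, 19],
    "year": [2, 3, 1],
})

one_col_series = df["name"]          # single brackets -> Series
one_col_frame = df[["name"]]         # double brackets -> one-column DataFrame
also_frame = df["name"].to_frame()   # Series.to_frame() -> DataFrame
two_cols = df[["name", "age"]]       # list of names -> DataFrame with those columns
```

Checking type(...) on each result is a quick way to confirm which one you got.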
Dataframe Usage:
Conditional Selection
Selecting rows based on a condition (e.g., True/False, >, ==)
Tip: Think of selection like comparing arrays!
np.array([3, 1]) < 2
# returns array([False, True])
Conditional Selection (cont.)
DF["name"] == "things"
returns a Boolean Series
DF[DF["name"] == "things"]
returns a DataFrame
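The two steps can be sketched separately: first build the True/False Series (often called a mask), then use it to pick rows. The data below is hypothetical.

```python
import pandas as pd

# Hypothetical toy data for illustration.
df = pd.DataFrame({"name": ["things", "stuff", "things"], "count": [1, 2, 3]})

mask = df["name"] == "things"   # Boolean Series, one True/False per row
filtered = df[mask]             # DataFrame with only the rows where mask is True
```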
Explain this code
nba[(nba["lastName"] == "Green") & (nba["Reb"] > 5)]
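One way to read the line above is to split it into named pieces. This is a sketch on a hypothetical stand-in table, since the real nba dataset is not shown here; only the column names lastName and Reb come from the slide.

```python
import pandas as pd

# Hypothetical stand-in for the nba table.
nba = pd.DataFrame({
    "lastName": ["Green", "Green", "Curry", "Looney"],
    "Reb": [7, 3, 5, 9],
})

is_green = nba["lastName"] == "Green"   # True where the last name is Green
many_rebounds = nba["Reb"] > 5          # True where rebounds exceed 5

# & combines the two masks element-wise. In the one-line version the
# parentheses are required because & binds more tightly than == and >.
result = nba[is_green & many_rebounds]
```

The result keeps only the rows where both conditions are True.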
Adding/Replacing Columns
1. Make a new Series (the new stuff)
new_series = DF["name"] == "things"
2. Put the Series into DF!
DF["new_col"] = new_series
[Figure: DataFrame before and after adding new_col from new_series]
Adding new columns to your dataframe
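A minimal sketch of adding and replacing columns. The data is hypothetical, and name_lengths is assumed here to mean the length of each name.

```python
import pandas as pd

# Hypothetical toy data for illustration.
df = pd.DataFrame({"name": ["things", "stuff"], "Count": [4, 10]})

# Step 1: build the new Series (here, the length of each name).
name_lengths = df["name"].str.len()

# Step 2: assign it to a column label. A new label adds a column;
# an existing label overwrites that column.
df["name_lengths"] = name_lengths
df["Count"] = df["Count"] * 2   # replaces the existing Count column
```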
Dropping columns from your dataframe
df.drop(columns=["name_lengths", "Count"])
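One thing worth seeing once: drop returns a new DataFrame and leaves the original alone. A short sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical toy data for illustration.
df = pd.DataFrame({"name": ["a"], "name_lengths": [1], "Count": [5]})

dropped = df.drop(columns=["name_lengths", "Count"])  # returns a new DataFrame

# df itself is unchanged unless you assign the result back, e.g.
# df = df.drop(columns=["name_lengths", "Count"])
```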
Renaming columns
df.rename(columns={"old1": "new1", "old2": "new2"})
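A quick sketch of rename on hypothetical data. Columns not listed in the dictionary keep their names, and the original DataFrame is not modified.

```python
import pandas as pd

# Hypothetical toy data; "other" is an extra column that is not renamed.
df = pd.DataFrame({"old1": [1, 2], "old2": [3, 4], "other": [5, 6]})

renamed = df.rename(columns={"old1": "new1", "old2": "new2"})
```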
GOOGLE, GOOGLE, GOOGLE! (or ChatGPT)
:D
🏋️♀️Data Cleaning🧻
What is Data Cleaning?
Why Do We Need to Clean Data?
Data Science Life Cycle
Inspection and Understanding Data
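A few common first looks at a DataFrame, sketched on hypothetical data. These are standard pandas calls, though not necessarily the ones shown in class.

```python
import pandas as pd

# Hypothetical toy data for illustration.
df = pd.DataFrame({"name": ["a", "b", "c"], "score": [1, 2, 3]})

first_rows = df.head(2)   # first 2 rows
shape = df.shape          # (number of rows, number of columns)
dtypes = df.dtypes        # data type of each column
summary = df.describe()   # summary statistics for numeric columns
```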
Removing Irrelevant Observations
df.drop(columns=["three"])
Example
If we were to predict TV Rating, what columns would we consider dropping?
Removing Irrelevant Observations
In general:
Example (equality):
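The original code for this slide is not shown here, so the following is only a sketch of the usual pattern: keep the rows you want with a Boolean condition. The table and column names are hypothetical.

```python
import pandas as pd

# Hypothetical data; column names are illustrative only.
shows = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genre": ["drama", "comedy", "drama"],
})

# General pattern: df[<Boolean condition on rows>]
# Equality example: keep only the rows whose genre is "drama".
dramas = shows[shows["genre"] == "drama"]
```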
Removing Irrelevant Observations
df.drop_duplicates()
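A small sketch of drop_duplicates on hypothetical data. By default it compares whole rows and keeps the first copy.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 are identical.
df = pd.DataFrame({"name": ["a", "b", "a"], "score": [1, 2, 1]})

deduped = df.drop_duplicates()   # keeps the first of each set of identical rows
```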
Fix Data Types
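One possible way to fix types, assuming numbers and dates were read in as strings. The data and column names are hypothetical.

```python
import pandas as pd

# Hypothetical raw data where everything arrived as text.
raw = pd.DataFrame({
    "price": ["10", "12", "oops"],
    "date": ["2025-03-03", "2025-03-04", "2025-03-05"],
})

# errors="coerce" turns unparseable values into NaN instead of raising an error.
raw["price"] = pd.to_numeric(raw["price"], errors="coerce")
raw["date"] = pd.to_datetime(raw["date"])
```

Any NaN created by the coercion then becomes a missing value to handle in a later step.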
Filter Unwanted Outliers
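There are many ways to define an outlier. As an assumption for this sketch, the common 1.5 × IQR rule is used on hypothetical data; the right rule depends on your dataset.

```python
import pandas as pd

# Hypothetical toy data with one obviously extreme value.
df = pd.DataFrame({"value": [10, 12, 11, 13, 12, 500]})

q1 = df["value"].quantile(0.25)
q3 = df["value"].quantile(0.75)
iqr = q3 - q1   # interquartile range

# Keep rows within 1.5 * IQR of the quartiles (an assumed, common rule of thumb).
in_range = df["value"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
no_outliers = df[in_range]
```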
Handle Missing Data
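Two common options for missing values, sketched on hypothetical data: drop the affected rows, or fill the gaps. Filling with the column mean is just one assumed choice here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing score.
df = pd.DataFrame({"name": ["a", "b", "c"], "score": [1.0, np.nan, 3.0]})

missing_per_column = df.isna().sum()                 # count missing values per column
dropped = df.dropna()                                # option 1: drop rows with any missing value
filled = df.fillna({"score": df["score"].mean()})    # option 2: fill with the column mean
```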
Steps of Data Cleaning
Final word: How to solve Errors
Validate Data
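A rough sketch of what validating could look like after cleaning: a few yes/no checks on the result. The checks and data below are hypothetical examples, not a complete list.

```python
import pandas as pd

# Hypothetical cleaned data for illustration.
df = pd.DataFrame({"name": ["a", "b"], "score": [1.0, 3.0]})

checks = {
    "no_missing": not df.isna().any().any(),        # no NaN anywhere
    "no_duplicates": not df.duplicated().any(),     # no repeated rows
    "score_is_numeric": pd.api.types.is_numeric_dtype(df["score"]),
}
```

If any value in checks is False, that is a hint to go back to the matching cleaning step.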
You guys got this!
Resources!
Deepnote!
Group Up!
Blooket Link!
play.blooket.com
Thank You!
Time to work together in Groups!