1 of 44

Welcome to Class!

2 of 44

Data 198: Introduction to Real World Data Science

3/3/2025: Intro to Pandas (II) and Data Cleaning

🐼

3 of 44

Agenda

  • Lightning Talk with Jonathan Ferrari
  • Lecture
    • Be respectful to the presenters!
    • Stay away from devices unless filling out attendance / referencing lecture material
    • Pandas Review (In Depth)
    • Data Cleaning
  • Live Deepnote Demo
  • Blooket
  • Group work time


4 of 44

🧊Ice Breaker!🧊

Would you rather:

Get an A+ in an important class, but honestly not learn much

OR

Fail an important class, but learn everything important for whatever job you need in the future

5 of 44

Link to Slack Attendance Form!!

6 of 44

Lightning Talk

With Jonathan Ferrari

7 of 44


Today’s Instructors

Wayland La

2nd Year

Data Science

Domain Emphasis: Machine Learning, NLP

Interests: Sports, Education

Iana Mae Peralta

3rd Year

Data Science

Domain Emphasis: Business, AI/ML, Education

Interests: Cooking/baking, Matcha, Food

8 of 44

🐼 Pandas 🐼

9 of 44

Review: What is Pandas?

  • Python library used for data manipulation and analysis
  • The main way to store data is in DataFrames, which are data structures that store tabular data with labeled rows and columns


10 of 44

Aliases


import pandas as pd

import numpy as np

An easier way to call functions in these packages!

numpy.array() → np.array()
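
A minimal sketch of the aliases in action (the values here are made up for illustration):

import numpy as np
import pandas as pd

arr = np.array([1, 2, 3])           # instead of numpy.array([1, 2, 3])
df = pd.DataFrame({"score": arr})   # instead of pandas.DataFrame(...)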

11 of 44

DataFrame, Series, Index Review

  • DataFrame - 2D data (rows and columns)
  • Series - 1D data (of a single column)
  • Index - a sequence of row labels (numeric or non-numeric)
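
For example, a minimal sketch (the data is made up for illustration):

import pandas as pd

df = pd.DataFrame({"name": ["Ada", "Bo"], "score": [90, 85]})   # DataFrame: 2D rows and columns
s = df["score"]    # Series: 1D data from a single column
idx = df.index     # Index: the row labels (0, 1 by default)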


12 of 44

Selecting Columns


DF["column"] → Series

  • Imagine a Series as just a sequence of objects
  • A Series can be used for array math (np.mean, np.sum); see the sketch below
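
A minimal sketch, assuming a hypothetical df with a "score" column:

import numpy as np
import pandas as pd

df = pd.DataFrame({"score": [90, 85, 70]})
s = df["score"]   # single brackets → Series
np.mean(s)        # array math works on a Series (here: about 81.67)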

13 of 44

Selecting Columns (cont.)*


DF[["column"]] → DataFrame

*Alternative: Series.to_frame()

DF["column"].to_frame() → DataFrame
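
A minimal sketch of both approaches, assuming a hypothetical df with a "score" column:

df[["score"]]            # double brackets → one-column DataFrame
df["score"].to_frame()   # same result via Series.to_frame()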

14 of 44

Selecting Columns (cont.)


DF[["column_1", "column_2"]] → DataFrame of column_1 and column_2

DataFrame usage:

  • data manipulation and analysis
  • Make sure to double-check your DataFrame as you go, to confirm you are manipulating it correctly! (see the sketch below)
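
A minimal sketch, assuming a hypothetical df with "name" and "score" columns:

subset = df[["name", "score"]]   # list of labels → DataFrame with both columns
subset.head()                    # peek at the result to confirm the selection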

15 of 44

Conditional Selection


Selecting rows based on a condition (e.g., >, ==, True/False checks, etc.)

Tip: Imagine conditional selection like comparisons on arrays!

np.array([3, 1]) < 2

# returns array([False, True])

16 of 44

Conditional Selection (cont.)


DF["name"] == "things"

returns a Series of True/False values (True wherever the "name" column equals "things")

DF[DF["name"] == "things"]

returns a DataFrame with only the rows where the condition is True (see the sketch below)
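
A minimal sketch, assuming a hypothetical df with a "name" column:

mask = df["name"] == "things"   # Boolean Series: True where name is "things"
df[mask]                        # DataFrame containing only the matching rows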

17 of 44

Explain this code

nba[(nba["lastName"] == "Green") & (nba["Reb"] > 5)]
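
One way to break it down (a sketch; the nba DataFrame and its columns come from this example):

green = nba["lastName"] == "Green"   # Boolean Series: True where lastName is "Green"
many_rebounds = nba["Reb"] > 5       # Boolean Series: True where rebounds exceed 5
nba[green & many_rebounds]           # rows where BOTH are True (note & rather than "and")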


18 of 44

Adding / Replacing with New Columns

1. Make a new Series (the new values):

new_series = DF["name"] == "things"

2. Put the Series into the DataFrame:

DF["new_col"] = new_series

(A full sketch follows below.)
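
A minimal sketch, assuming a hypothetical df with a "name" column:

import pandas as pd

df = pd.DataFrame({"name": ["things", "stuff", "things"]})
new_series = df["name"] == "things"   # Step 1: build the new values as a Series
df["new_col"] = new_series            # Step 2: assign them to a new column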

19 of 44

Before

[Diagram: the DataFrame before the new column is added]

20 of 44

After

[Diagram: the DataFrame after adding new_col, which now holds the values of new_series]

21 of 44

Adding new columns to your dataframe
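
A minimal sketch of adding a computed column (the "name_lengths" column name is borrowed from the drop example on the next slide):

import pandas as pd

df = pd.DataFrame({"name": ["Ada", "Bo"]})
df["name_lengths"] = df["name"].str.len()   # new column computed from an existing one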

22 of 44


Dropping columns from your dataframe

df.drop(columns=["name_lengths", "Count"])

Renaming columns

df.rename(columns={"old1": "new1", "old2": "new2"})

(Both return a new DataFrame; assign the result back to keep the change.)

23 of 44

GOOGLE, GOOGLE, GOOGLE! (or ChatGPT)

:D

24 of 44

🏋️‍♀️Data Cleaning🧻

25 of 44

What is Data Cleaning?

  • Textbook Definition: “The process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset”
  • General Definition: Making data more readable and user-friendly

26 of 44

Why Do We Need to Clean Data?

  • You will only get good data analysis out if you put good, structured data in.
  • “Garbage In, Garbage Out”

27 of 44

Data Science Life Cycle

  • Part of “understand the data”
  • Transform the data to fit what helps you in your analysis
  • It’s important to know what types of data you are working with before jumping in

28 of 44

Inspection and Understanding Data

  • It’s important to look at the data before cleaning
    • Summary statistics like mean, median, maximum, and minimum
  • Questions you can ask:
    • Is the data in a column recorded as a string or a number?
    • How many values are missing?
    • How many unique values are in a column, and what is their distribution?
    • Is this dataset linked to, or does it have a relationship with, another dataset?

29 of 44

Removing Irrelevant Observations

  • It’s important to drop irrelevant information/columns from our data so we can keep it clean and structured
  • We can use pandas to drop columns in our data frame

df.drop(columns=["three"])

30 of 44

Example

If we were to predict TV Rating, what columns would we consider dropping?


31 of 44

Removing Irrelevant Observations

In general:

  • Pandas - df.loc[df["column"] CONDITIONAL STATEMENT]
  • Data 8 - tbl.where("column", are.CONDITION())

Example (equality):

  • df.loc[df["column"] == x]
  • tbl.where("column", are.equal_to(x))
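
A minimal concrete sketch of the pandas version, with a made-up column name and value:

df.loc[df["team"] == "Warriors"]   # keep only rows where the team column equals "Warriors"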

32 of 44

Removing Irrelevant Observations

  • Many times there are duplicate entries that should be deleted if they are not intentional
    • Accidentally repeated data can skew your results and make them inaccurate
  • We can use pandas to go through a data frame and drop duplicate rows (see the sketch below)

df.drop_duplicates()
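
A minimal sketch; the subset= parameter limits which columns are compared (the "name" column here is a made-up example):

df = df.drop_duplicates()                  # drop rows that are exact duplicates
df = df.drop_duplicates(subset=["name"])   # or: treat rows with the same name as duplicates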

33 of 44

Fix Data Types

  • Make sure numbers are stored as numerical data types
  • Syntax, readability, and misspellings are things to watch out for
    • "Men" == "men" evaluates to False
  • Watch out for values like "0", "Not Applicable", "NA", "None", "Null", or "INF"; they might all mean the same thing:
    • The value is missing.
  • You can check a Python object's type with the type() function; for DataFrame columns, check df.dtypes
  • Change data types using .astype() (see the sketch below)
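
A minimal sketch, assuming a hypothetical "count" column that was read in as strings:

import pandas as pd

df = pd.DataFrame({"count": ["1", "2", "3"]})
df.dtypes                               # "count" shows up as object (strings)
df["count"] = df["count"].astype(int)   # convert the column to integers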

34 of 44

Filter Unwanted Outliers

  • ONLY remove an outlier if you have a legitimate reason
  • Just because an outlier exists does not mean it is incorrect
  • If an outlier proves to be irrelevant for the analysis or is a mistake, consider removing it (see the sketch below)
    • Remember… IQR = Interquartile Range = Q3 - Q1, where Q1 and Q3 are the 25th and 75th percentiles of the dataset
    • Any value over Q3 + 1.5 * IQR is an outlier
    • Any value under Q1 - 1.5 * IQR is an outlier
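
A minimal sketch of the IQR rule on a hypothetical "score" column:

q1 = df["score"].quantile(0.25)
q3 = df["score"].quantile(0.75)
iqr = q3 - q1
within_fences = (df["score"] >= q1 - 1.5 * iqr) & (df["score"] <= q3 + 1.5 * iqr)
df = df[within_fences]   # keep only the rows inside the IQR fences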

35 of 44

Handle Missing Data

  • If the dataset is large:
    • Drop the observations with missing values
    • Quick and easy
    • df.dropna()
  • If the dataset is not large:
    • Impute values based on other observations (mean, median, linear regression)
    • df.fillna(x) (see the sketch below)
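
A minimal sketch of both options, assuming a hypothetical "score" column:

df = df.dropna()                                        # option 1: drop rows with missing values
df["score"] = df["score"].fillna(df["score"].mean())   # option 2: fill missing scores with the mean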

36 of 44

Steps of Data Cleaning

  1. Remove duplicate values and irrelevant data/observations
  2. Fix Data Types
  3. Filter any unwanted outliers
  4. Handle all missing data

37 of 44

Final word: How to solve Errors


  1. Think about what you want your dataset to look like at the end
  2. Break down the steps for getting there, one at a time, in plain English
  3. Jot down the functions you might use for each step (adding comments if helpful)
  4. After each individual step, look at the dataset to double-check that the result is what you intended

38 of 44

Validate Data

  • Make sure data makes sense
  • Structured for analysis
  • Follows rules and regulations you put in place
  • Make sure the data works for what you want to accomplish

You guys got this!

39 of 44

Resources!

  • Pandas Tutor
    • Helps visualize how Python + pandas analyze DataFrames!
    • Link

  • Data 100 lecture slides
    • Weeks 1-4 usually
    • Link

  • Pandas API Reference
    • Good reference for functions
    • Link

40 of 44

Deepnote!


41 of 44

Group Up!

42 of 44

Blooket Link!

play.blooket.com

43 of 44

Thank You!

Time to work together in Groups!
