ANNOUNCEMENTS
Pandas II
HODP Spring 2025 Bootcamp
End Goal
Your Article Here!
THE PLAN
Review
Restructuring Dataframes - Sort
Restructuring Dataframes - Func
Start with:
# start
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'Salary': [50000, 60000, 70000]}
df = pd.DataFrame(data)
Original (top) vs post-restructure (bottom)
# restructuring
# new column: Bonus is 10% of Salary
df['Bonus'] = df['Salary'] * 0.10
# apply add_ten (a user-defined function) to every Age value
df['Age'] = df['Age'].apply(add_ten)
# end
What might add_ten look like?
  | Name    | Age | Salary |
0 | Alice   | 25  | 50000  |
1 | Bob     | 30  | 60000  |
2 | Charlie | 35  | 70000  |

  | Name    | Age | Salary | Bonus  |
0 | Alice   | 35  | 50000  | 5000.0 |
1 | Bob     | 40  | 60000  | 6000.0 |
2 | Charlie | 45  | 70000  | 7000.0 |
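One possible answer to the add_ten question, as a minimal sketch: the tables show every Age going up by exactly 10, so add_ten is assumed here to take a single value and return it plus ten. The version used in the session is not shown on the slides, so treat this as an illustration only.

```python
import pandas as pd

def add_ten(age):
    """Assumed helper: return the given value plus ten."""
    return age + 10

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35],
                   'Salary': [50000, 60000, 70000]})

# New column computed from an existing one
df['Bonus'] = df['Salary'] * 0.10
# .apply calls add_ten once per value in the Age column
df['Age'] = df['Age'].apply(add_ten)

print(df)
```

For a simple shift like this, df['Age'] + 10 gives the same result without a helper; .apply is more useful when the per-value logic is too involved for a single arithmetic expression.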
Restructuring Dataframes - Drop
Start with:
# start
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'Salary': [50000, 60000, 70000],
        'Bonus': [5000.0, 6000.0, 7000.0]}
df = pd.DataFrame(data)
Original (top) vs post-restructure (bottom)
df.drop('Bonus', axis=1, inplace=True)
# axis=1 means column
Row drop: df.drop(1, axis=0, inplace=True)
# axis=0 means row; 1 is the index label of the row to drop
  | Name    | Age | Salary | Bonus  |
0 | Alice   | 25  | 50000  | 5000.0 |
1 | Bob     | 30  | 60000  | 6000.0 |
2 | Charlie | 35  | 70000  | 7000.0 |

  | Name    | Age | Salary |
0 | Alice   | 25  | 50000  |
1 | Bob     | 30  | 60000  |
2 | Charlie | 35  | 70000  |
Common Dangers
Exporting
import os

cleaned_df_path = '/path/to/your/file/desired_file_name.csv'
# optional - only writes the file if the desired file doesn't exist yet
if not os.path.exists(cleaned_df_path):
    df.to_csv(cleaned_df_path, index=False)
Cleaning Intro
Why bother cleaning?
Introduction
Cleaning Commands
What does merging look like?
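# inner: keep only rows whose ID appears in both df1 and df2
merged_df = pd.merge(df1, df2, on='ID', how='inner')
# left: keep every row of df1; df2 columns are NaN where no ID matches
merged_df = pd.merge(df1, df2, on='ID', how='left')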
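To make the difference between the two merges concrete, here is a small sketch on made-up data. The contents of df1 and df2 (and every column name other than 'ID') are assumptions for illustration; the slides do not show the actual tables.

```python
import pandas as pd

# Hypothetical toy tables sharing an 'ID' column
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Salary': [60000, 70000, 80000]})

# inner: only IDs found in both tables survive (2 and 3)
inner_df = pd.merge(df1, df2, on='ID', how='inner')

# left: every row of df1 survives; ID 1 has no match, so its Salary is NaN
left_df = pd.merge(df1, df2, on='ID', how='left')

print(inner_df)
print(left_df)
```

Note that ID 4 appears in neither result, because it exists only in df2.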
Exercise 1
Exercise 1 Solution
print(df.isnull().sum())
df.drop('rating', axis=1, inplace=True)
df
Simple Functions
Missing Values
Descriptive Statistics
Numbers that aim to summarize a dataset.
Measures of Central Tendency
Mean: average of a dataset
Median: the middle value of the dataset when ordered from least to greatest
Measures of Dispersion
Variance: the average squared distance of the data points from the mean
Standard Deviation: square root of the variance
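As a quick illustration of these definitions, the sketch below computes each statistic with pandas on the three salaries from the earlier example. One detail worth knowing: pandas' var() and std() use the sample formula (dividing by n - 1) by default, so ddof=0 is passed explicitly to show the population version.

```python
import pandas as pd

salaries = pd.Series([50000, 60000, 70000])

mean = salaries.mean()      # 60000.0
median = salaries.median()  # 60000.0

# Sample variance (pandas default, ddof=1): squared deviations divided by n - 1
sample_var = salaries.var()     # 100000000.0
sample_std = salaries.std()     # 10000.0

# Population variance (ddof=0): squared deviations divided by n
population_var = salaries.var(ddof=0)

print(mean, median, sample_var, sample_std, population_var)
```

salaries.describe() prints several of these at once (count, mean, std, min, quartiles, max) and can be a convenient first look at a numeric column.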
Imputation
Exercise 2
As an extension of Exercise 1, can you impute missing values in the column you identified using the mean?
Exercise 2 Solution
import pandas as pd
cereal_df = pd.read_csv('cereal_modified.csv')
mean_rating = cereal_df['rating'].mean()
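# assign the result back; inplace=True on a single column triggers a warning
# in recent pandas versions and may not update cereal_df
cereal_df['rating'] = cereal_df['rating'].fillna(mean_rating)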
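The same mean-imputation pattern can be tried without the CSV. The sketch below uses a small made-up table (toy_df and its values are assumptions, not the cereal data) and checks the missing-value count before and after.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing rating
toy_df = pd.DataFrame({'name': ['A', 'B', 'C', 'D'],
                       'rating': [40.0, np.nan, 60.0, 50.0]})

missing_before = toy_df['rating'].isnull().sum()

# .mean() skips NaN by default, so this is (40 + 60 + 50) / 3 = 50.0
mean_rating = toy_df['rating'].mean()

# Fill the gap and assign the result back to the column
toy_df['rating'] = toy_df['rating'].fillna(mean_rating)

missing_after = toy_df['rating'].isnull().sum()
print(missing_before, missing_after)
print(toy_df)
```

Mean imputation keeps the column mean unchanged but shrinks its variance, since every filled value sits exactly at the mean.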
Filtering/sorting
Filtering/Sorting Commands
Filtering
Start with:
# start
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'Salary': [50000, 60000, 70000],
        'Bonus': [5000.0, 6000.0, 7000.0]}
df = pd.DataFrame(data)
Original (top) vs filtered (bottom)
# keep only the rows where Age is greater than 28
filtered_df = df[df['Age'] > 28]
  | Name    | Age | Salary | Bonus  |
0 | Alice   | 25  | 50000  | 5000.0 |
1 | Bob     | 30  | 60000  | 6000.0 |
2 | Charlie | 35  | 70000  | 7000.0 |

  | Name    | Age | Salary | Bonus  |
1 | Bob     | 30  | 60000  | 6000.0 |
2 | Charlie | 35  | 70000  | 7000.0 |
Exercise 3
Let's load back in the unmodified CSV, cereal.csv.
Exercise 3 Solution
df[df['rating'] > 50].sort_values(by=['calories', 'sugars'], ascending=False)
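To see what this filter-then-sort chain does without the CSV, here is a sketch on a made-up table. The column names mirror the solution, but toy_df and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical rows standing in for cereal.csv
toy_df = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'rating': [65.0, 45.0, 70.0, 55.0],
    'calories': [100, 120, 100, 90],
    'sugars': [3, 10, 6, 2],
})

# Step 1: boolean mask keeps rows with rating above 50 (drops B)
high_rated = toy_df[toy_df['rating'] > 50]

# Step 2: sort by calories, breaking ties with sugars, both descending
result = high_rated.sort_values(by=['calories', 'sugars'], ascending=False)

print(result)
```

A and C tie on calories, so sugars decides their order and C comes first. Passing a list such as ascending=[False, True] would sort each key in a different direction.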
Categoricals
Categorical Variables
Example
Project Time
Attendance Code: clean
https://tinyurl.com/hodp-spring25-project
Use the link on the first page of the form to find people and their interests. As groups form, we will try to update that spreadsheet, which is also here:
https://docs.google.com/spreadsheets/d/1QpwyljIJ8NnM-AN6ggfeG4PFC4bRGS8kgp8HMtTc8yQ/edit?gid=0#gid=0
Goal: Have data by end of next session