1 of 36

ANNOUNCEMENTS

  • Attendance Policy!
    • One free unexcused absence

2 of 36

Pandas II

HODP Spring 2025 Bootcamp

3 of 36

End Goal

Your Article Here!

4 of 36

THE PLAN

  • Review - End of Pandas I
  • New - Data Cleaning
  • Next Time - Data Visualizations!!!

5 of 36

Review

6 of 36

Restructuring Dataframes - Sort

  • Suppose our running example is all messed up, e.g. the rows are out of order:
  • We can fix this in many ways:
    • sorted_df = df.sort_values(by='Age')
      • This sorts by age in ascending order
      • Order is: A, B, C, D, E
    • Can also sort in the opposite order:
    • new_sorted_df = df.sort_values(by='Name', ascending=False)
      • What might the new order be?
  • We can also do restructurings in place
    • Use df.sort_values(by='Age', inplace=True)
  • Can also sort with a hierarchy
    • sorted_df_multi = df.sort_values(by=['Age', 'Name'])
    • This sorts by age first, then breaks ties by name
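A minimal sketch of these sorts (assuming a small, deliberately shuffled DataFrame; the column names mirror the running example on the next slides):

import pandas as pd

# hypothetical running example, shuffled on purpose
df = pd.DataFrame({'Name': ['Charlie', 'Alice', 'Bob'],
                   'Age': [35, 25, 30]})

sorted_df = df.sort_values(by='Age')                        # ascending age: Alice, Bob, Charlie
new_sorted_df = df.sort_values(by='Name', ascending=False)  # reverse alphabetical: Charlie, Bob, Alice
df.sort_values(by=['Age', 'Name'], inplace=True)            # age first, ties broken by name; modifies df itself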

7 of 36

Restructuring Dataframes - Func

Start with:

# start
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'Salary': [50000, 60000, 70000]}
df = pd.DataFrame(data)

Original (top) vs post-restructure (bottom)

# restructuring
df['Bonus'] = df['Salary'] * 0.10
df['Age'] = df['Age'].apply(add_ten)
# end

What might add_ten look like?

Original:

      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   70000

Post-restructure:

      Name  Age  Salary   Bonus
0    Alice   35   50000  5000.0
1      Bob   40   60000  6000.0
2  Charlie   45   70000  7000.0
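One possible add_ten, consistent with the output above (a minimal sketch; any function that takes and returns a single value works with .apply):

def add_ten(age):
    # called once per value in the 'Age' column by df['Age'].apply(add_ten)
    return age + 10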

8 of 36

Restructuring Dataframes - Drop

Start with:

# start
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'Salary': [50000, 60000, 70000],
        'Bonus': [5000.0, 6000.0, 7000.0]}
df = pd.DataFrame(data)

Original (top) vs post-restructure (bottom)

df.drop('Bonus', axis=1, inplace=True)  # axis=1 means column

Row drop: df.drop(1, axis=0, inplace=True)

Original:

      Name  Age  Salary   Bonus
0    Alice   25   50000  5000.0
1      Bob   30   60000  6000.0
2  Charlie   35   70000  7000.0

Post-restructure (Bonus column dropped):

      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   70000

9 of 36

Common Dangers

  • Dropping rows too early
    • If you drop a row and later try to access it, even from a cell that appears earlier in the notebook, you can't
    • You have to rerun the earlier cells to restore the DataFrame to its previous state
  • Notebook runs sequentially
    • This means you might have to rerun the entire notebook and restart the runtime
    • Use Shift + Enter to run through cells quickly
  • reset_index function:
    • After an operation like sort_values, the original index labels are preserved, so the index ends up out of order
    • To account for this, run df.reset_index(drop=True, inplace=True)
    • This also helps when merging datasets
      • Some combining operations align rows by index, so if one dataset's indices are out of order, values can end up matched to the wrong rows
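A quick sketch of the reset_index pattern (using a small hypothetical DataFrame like the one on the earlier slides):

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]})

df.sort_values(by='Age', ascending=False, inplace=True)  # index is now 2, 1, 0
df.reset_index(drop=True, inplace=True)                  # index is back to 0, 1, 2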

10 of 36

Exporting

  • Saving DataFrames: Suppose you've done all your hard work and want to export some data, either to preserve it outside the Python notebook or to send it to other people.

import os

cleaned_df_path = '/path/to/your/file/desired_file_name.csv'

# optional - only writes the file if it doesn't exist yet
if not os.path.exists(cleaned_df_path):
    df.to_csv(cleaned_df_path, index=False)

11 of 36

Cleaning Intro

12 of 36

Why bother cleaning?

  • Accuracy: Cleaning data ensures accuracy in analysis, leading to more reliable conclusions.
  • Performance: Clean data reduces processing time and enhances the performance of data models.
  • Consistency: Standardizes data formats, making it easier to integrate and analyze information from different sources.
  • Error Reduction: Identifies and corrects errors or outliers that could skew results significantly.
  • Data Integrity: Maintains the integrity of the data throughout its lifecycle, ensuring that the data remains correct, consistent, and usable.

13 of 36

Introduction

  • Inspecting DataFrames:
    • Use df.info(): Provides a concise summary of the DataFrame including the number of non-null entries in each column and the datatype.
    • Use df.dtypes: Lists the datatype of each column in the DataFrame, helping to quickly identify columns with unexpected datatypes.
  • Common Data Errors:
    • Numeric as Strings: Often due to mixed datatypes in raw data. Can impede calculations and statistical analysis.
    • Capitalization Variance: Can lead to categorization errors; 'Apple' and 'apple' would be treated as different categories.
    • Spelling Mistakes: Can introduce inconsistencies in categorical data, affecting data aggregation and analysis.
  • Addressing Errors:
    • Convert DataTypes: Use pd.to_numeric() or astype() to correct numeric strings.
    • Normalize Text: Apply str.lower() or str.capitalize() to standardize text entries.
    • Automated Spell Check: Implement or use existing libraries to correct spelling inconsistencies.
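A minimal sketch of the first two fixes above, assuming a hypothetical df where 'Salary' was read in as strings and 'Fruit' has inconsistent capitalization:

df['Salary'] = pd.to_numeric(df['Salary'], errors='coerce')  # non-numeric strings become NaN instead of raising
df['Fruit'] = df['Fruit'].str.lower()                        # 'Apple' and 'apple' now match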

14 of 36

Cleaning Commands

  • Remove Columns and Rows:
    • df.drop('colName', axis=1, inplace=True) to drop a column.
    • df.drop(rowIndex, axis=0, inplace=True) to drop a row.
  • Apply Functions:
    • Apply a specific function across the DataFrame using df.apply(func) or to a column like df['Age'].apply(add_ten).
  • Unique Values:
    • Retrieve unique values from a column: df.colName.unique().
    • Check for missing values with df.isnull().sum().
    • Identify duplicate rows using df.duplicated().sum().
  • Merging Data:
    • Combine DataFrames using various merge strategies, e.g., df1.merge(df2, on='key').

15 of 36

What does merging look like?

16 of 36

What does merging look like?

merged_df = pd.merge(df1, df2, on='ID', how='inner')

merged_df = pd.merge(df1, df2, on='ID', how='left')
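A minimal sketch of the difference, assuming two small hypothetical DataFrames that share an 'ID' column:

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Score': [90, 85, 70]})

pd.merge(df1, df2, on='ID', how='inner')  # keeps only IDs 2 and 3 (present in both)
pd.merge(df1, df2, on='ID', how='left')   # keeps IDs 1, 2, 3; Alice gets NaN for Score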

17 of 36

Exercise 1

  1. Load in the dataset cereal_modified.csv into Pandas
  2. Find out which columns have missing values.
  3. Drop that column.

18 of 36

Exercise 1 Solution

df = pd.read_csv('cereal_modified.csv')

print(df.isnull().sum())

df.drop('rating', axis=1, inplace=True)

df

19 of 36

Simple Functions

  • String Methods:
    • str.lower(): Converts a string to lowercase.
    • str.upper(): Converts a string to uppercase.
    • Comparison operators (<, ==, >): Compare two strings lexicographically (this is the ordering that sorting uses).
  • Dictionary Mapping:
    • Used to replace or map values in a data structure using a dictionary.
    • Example: data.map({'old_value': 'new_value'}).
  • Basic Math Functions:
    • sum(): Calculates the sum of a list or array.
    • min(), max(): Find the minimum or maximum value.
    • round(number, digits): Rounds a number to a specified number of decimal places.
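A minimal sketch of these helpers on a small hypothetical Series:

s = pd.Series(['Apple', 'banana', 'APPLE'])

s.str.lower()                                 # 'apple', 'banana', 'apple'
s.str.lower().map({'apple': 1, 'banana': 2})  # dictionary mapping: 1, 2, 1
round(sum([1.234, 2.345]), 2)                 # basic math: 3.58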

20 of 36

Missing Values

21 of 36

Missing Values

  • Understanding Missing Data:
    • df.isna().sum() to detect missing entries in each column.
    • df[df.isna().sum(axis=1) > 1] to find rows with more than one missing value.
  • Options for Handling Missing Data:
    • Drop: Remove rows/columns with missing data.
      • df.dropna(subset=['colName']) drops rows where 'colName' is missing.
    • Impute:
      • What is imputation?
      • Replace missing values with statistical estimates (mean, median, mode).
  • Imputation Application:
    • Apply chosen imputation method to specific columns or the entire DataFrame based on the analysis needs.
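A minimal sketch of the two options, assuming a hypothetical df with missing values in an 'Age' column:

df.isna().sum()                                 # count missing entries per column

dropped = df.dropna(subset=['Age'])             # option 1: drop rows where 'Age' is missing
df['Age'] = df['Age'].fillna(df['Age'].mean())  # option 2: impute with the column mean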

22 of 36

Descriptive Statistics

Descriptive statistics are numbers that aim to summarize a dataset.

  • Measures of Central Tendency
    • Mean: the average of a dataset
    • Median: the middle value of the dataset when ordered from least to greatest
  • Measures of Dispersion
    • Variance: the average squared distance of data points from the mean
    • Standard Deviation: the square root of the variance
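In pandas, each of these is a one-liner on a numeric column (a sketch, assuming a hypothetical 'Age' column):

df['Age'].mean()    # mean
df['Age'].median()  # median
df['Age'].var()     # variance (sample variance by default)
df['Age'].std()     # standard deviation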

23 of 36

Imputation

  • How should we impute the following missing data given the circumstances?
    • Missing ~10% of values for latitude/longitude in city housing
    • Missing ~15% of values for Pokémon types
    • Missing ~70% of values for HUDS meal options
    • Missing ~30% of values in product ratings

24 of 36

Exercise 2

As an extension of Exercise 1, can you impute missing values in the column you identified using the mean?

25 of 36

Exercise 2 Solution

import pandas as pd

cereal_df = pd.read_csv('cereal_modified.csv')

mean_rating = cereal_df['rating'].mean()

cereal_df['rating'] = cereal_df['rating'].fillna(mean_rating)  # assign back rather than using inplace on a single column

26 of 36

Filtering/sorting

27 of 36

Filtering/Sorting Commands

  • Indexing:
    • Boolean Indexing: df[boolean_condition] for filtering rows.
    • df.iloc[start:end]: Position-based indexing, excludes the endpoint.
    • df.loc[start:end]: Label-based indexing, includes the endpoint.
  • Sorting:
    • df.sort_values(by='var', ascending=True, inplace=False): Sorts DataFrame by column.
  • Unique Values and Counts:
    • df['colName'].value_counts(): Lists unique values and their counts.
  • Replacing Values:
    • Simple replace: df['colName'].replace('old_value', 'new_value')
  • Exploding Lists:
    • df.explode('colName'): Splits lists in a column into separate rows.
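A minimal sketch of the last three commands, on a hypothetical df with 'Type' and 'Ingredients' columns:

df['Type'].value_counts()                        # counts of each unique value in 'Type'
df['Type'] = df['Type'].replace('corn', 'Corn')  # replace returns a new Series, so assign it back
df.explode('Ingredients')                        # one row per element of each list in 'Ingredients'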

28 of 36

Filtering

Start with:

# start
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'Salary': [50000, 60000, 70000],
        'Bonus': [5000.0, 6000.0, 7000.0]}
df = pd.DataFrame(data)

Original (top) vs post-filter (bottom)

filtered_df = df[df['Age'] > 28]

Original:

      Name  Age  Salary   Bonus
0    Alice   25   50000  5000.0
1      Bob   30   60000  6000.0
2  Charlie   35   70000  7000.0

Post-filter (Age > 28):

      Name  Age  Salary   Bonus
1      Bob   30   60000  6000.0
2  Charlie   35   70000  7000.0

29 of 36

Exercise 3

Let's load back in the unmodified CSV, cereal.csv.

  1. Can you first filter the dataset to only include cereals with a rating above 50?
  2. For this filtered dataset, can you sort the dataset by calories and protein in descending order?

30 of 36

Exercise 3 Solution

df[df['rating'] > 50].sort_values(by=['calories', 'protein'], ascending=False)

31 of 36

Categoricals

32 of 36

Categorical Variables

  • What are the dangers of including categorical variables unmodified?
    • Model Incompatibility: Many algorithms require numerical input.
    • Loss of Information: Misleading ordinal implications for nominal data.
    • Increased Complexity: Introduces noise, complicating the model.
    • Overfitting: High-dimensional space increases complexity and risk of overfitting.
  • Solution?
    • One Hot Encoding
    • pd.get_dummies(df): Converts categorical variable(s) into dummy/indicator variables.
    • Q: If we have n categorical variables, we should make ___ dummy variables?

33 of 36

Categorical Variables

  • What are the dangers of including categorical variables unmodified?
    • Model Incompatibility: Many algorithms require numerical input.
    • Loss of Information: Misleading ordinal implications for nominal data.
    • Increased Complexity: Introduces noise, complicating the model.
    • Overfitting: High-dimensional space increases complexity and risk of overfitting.
  • Solution?
    • One Hot Encoding
    • pd.get_dummies(df): Converts categorical variable(s) into dummy/indicator variables.
    • Q: If we have n categorical variables, we should make n-1 dummy variables?
      • Why?

34 of 36

Categorical Variables

  • What are the dangers of including categorical variables unmodified?
    • Model Incompatibility: Many algorithms require numerical input.
    • Loss of Information: Misleading ordinal implications for nominal data.
    • Increased Complexity: Introduces noise, complicating the model.
    • Overfitting: High-dimensional space increases complexity and risk of overfitting.
  • Solution?
    • One Hot Encoding
    • pd.get_dummies(df): Converts categorical variable(s) into dummy/indicator variables.
    • Q: If we have n categorical variables, we should make n-1 dummy variables?
      • Why?
      • Multicollinearity
        • Helps isolate individual variables’ impact
        • Imagine we care about 3 colors: red, blue, green.
        • If we know something isn’t red or blue, what does that imply?
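A minimal sketch of one-hot encoding with the n-1 trick, assuming a hypothetical 'Color' column:

df = pd.DataFrame({'Color': ['red', 'blue', 'green']})

pd.get_dummies(df, columns=['Color'], drop_first=True)
# keeps only Color_green and Color_red; a row where both are False must be blue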

35 of 36

Example

36 of 36

Project Time

Attendance Code: clean

https://tinyurl.com/hodp-spring25-project

Use the link on the first page of the form to find people and their interests. As groups form, we will try to update that spreadsheet (which is also here)

https://docs.google.com/spreadsheets/d/1QpwyljIJ8NnM-AN6ggfeG4PFC4bRGS8kgp8HMtTc8yQ/edit?gid=0#gid=0

Goal: Have data by end of next session