1 of 36

ANNOUNCEMENTS

  • Attendance Policy!
    • One free unexcused absence

2 of 36

Pandas II

HODP Spring 2025 Bootcamp

3 of 36

End Goal

Your Article Here!

4 of 36

THE PLAN

  • Review - End of Pandas I
  • New - Data Cleaning
  • Next Time - Data Visualizations!!!

5 of 36

Review

6 of 36

Restructuring Dataframes - Sort

  • Suppose our running example is all messed up, e.g. the rows are out of order:
  • We can fix this in many ways:
    • sorted_df = df.sort_values(by='Age')
      • This sorts by age in ascending order
      • Order is: A, B, C, D, E
    • Can also sort in the opposite order:
    • new_sorted_df = df.sort_values(by='Name', ascending=False)
      • What might the new order be?
  • We can also do restructurings in place
    • Use df.sort_values(by='Age', inplace=True)
  • Can also sort with a hierarchy
    • sorted_df_multi = df.sort_values(by=['Age', 'Name'])
    • This sorts by age first, then breaks ties by name
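A minimal sketch of these sorts (assuming a small, deliberately shuffled DataFrame; the column names mirror the running example on the next slides):

import pandas as pd

# hypothetical running example, shuffled on purpose
df = pd.DataFrame({'Name': ['Charlie', 'Alice', 'Bob'],
                   'Age': [35, 25, 30]})

sorted_df = df.sort_values(by='Age')                        # ascending age: Alice, Bob, Charlie
new_sorted_df = df.sort_values(by='Name', ascending=False)  # reverse alphabetical: Charlie, Bob, Alice
df.sort_values(by=['Age', 'Name'], inplace=True)            # age first, ties broken by name; modifies df itself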

7 of 36

Restructuring Dataframes - Func

Start with:

# start
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'Salary': [50000, 60000, 70000]}
df = pd.DataFrame(data)

Original (top) vs post-restructure (bottom)

# restructuring
df['Bonus'] = df['Salary'] * 0.10
df['Age'] = df['Age'].apply(add_ten)
# end

What might add_ten look like?

Original:

      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   70000

Post-restructure:

      Name  Age  Salary   Bonus
0    Alice   35   50000  5000.0
1      Bob   40   60000  6000.0
2  Charlie   45   70000  7000.0
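One possible add_ten, consistent with the output above (a minimal sketch; any function that takes and returns a single value works with .apply):

def add_ten(age):
    # called once per value in the 'Age' column by df['Age'].apply(add_ten)
    return age + 10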

8 of 36

Restructuring Dataframes - Drop

Start with:

# start
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'Salary': [50000, 60000, 70000],
        'Bonus': [5000.0, 6000.0, 7000.0]}
df = pd.DataFrame(data)

Original (top) vs post-restructure (bottom)

df.drop('Bonus', axis=1, inplace=True)  # axis=1 means column

Row drop: df.drop(1, axis=0, inplace=True)

Original:

      Name  Age  Salary   Bonus
0    Alice   25   50000  5000.0
1      Bob   30   60000  6000.0
2  Charlie   35   70000  7000.0

Post-restructure (Bonus column dropped):

      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   70000

9 of 36

Common Dangers

  • Dropping rows too early
    • If you drop a row and later try to access it, even from a cell that appears earlier in the notebook, you can't
    • You have to rerun the earlier cells to restore the DataFrame to its previous state
  • Notebook runs sequentially
    • This means you might have to rerun the entire notebook and restart the runtime
    • Use Shift + Enter to run through cells quickly
  • reset_index function:
    • After an operation like sort_values, the original index labels are preserved, so the index ends up out of order
    • To account for this, run df.reset_index(drop=True, inplace=True)
    • This also helps when merging datasets
      • Some combining operations align rows by index, so if one dataset's indices are out of order, values can end up matched to the wrong rows
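A quick sketch of the reset_index pattern (using a small hypothetical DataFrame like the one on the earlier slides):

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]})

df.sort_values(by='Age', ascending=False, inplace=True)  # index is now 2, 1, 0
df.reset_index(drop=True, inplace=True)                  # index is back to 0, 1, 2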

10 of 36

Exporting

  • Saving DataFrames: Suppose you've done all your hard work and want to export some data, either to preserve it outside the Python notebook or to send it to other people.

import os

cleaned_df_path = '/path/to/your/file/desired_file_name.csv'

# optional - only writes the file if it doesn't exist yet
if not os.path.exists(cleaned_df_path):
    df.to_csv(cleaned_df_path, index=False)

11 of 36

Cleaning Intro

12 of 36

Why bother cleaning?

  • Accuracy: Cleaning data ensures accuracy in analysis, leading to more reliable conclusions.
  • Performance: Clean data reduces processing time and enhances the performance of data models.
  • Consistency: Standardizes data formats, making it easier to integrate and analyze information from different sources.
  • Error Reduction: Identifies and corrects errors or outliers that could skew results significantly.
  • Data Integrity: Maintains the integrity of the data throughout its lifecycle, ensuring that the data remains correct, consistent, and usable.

13 of 36

Introduction

  • Inspecting DataFrames:
    • Use df.info(): Provides a concise summary of the DataFrame including the number of non-null entries in each column and the datatype.
    • Use df.dtypes: Lists the datatype of each column in the DataFrame, helping to quickly identify columns with unexpected datatypes.
  • Common Data Errors:
    • Numeric as Strings: Often due to mixed datatypes in raw data. Can impede calculations and statistical analysis.
    • Capitalization Variance: Can lead to categorization errors; 'Apple' and 'apple' would be treated as different categories.
    • Spelling Mistakes: Can introduce inconsistencies in categorical data, affecting data aggregation and analysis.
  • Addressing Errors:
    • Convert DataTypes: Use pd.to_numeric() or astype() to correct numeric strings.
    • Normalize Text: Apply str.lower() or str.capitalize() to standardize text entries.
    • Automated Spell Check: Implement or use existing libraries to correct spelling inconsistencies.
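A minimal sketch of the first two fixes above, assuming a hypothetical df where 'Salary' was read in as strings and 'Fruit' has inconsistent capitalization:

df['Salary'] = pd.to_numeric(df['Salary'], errors='coerce')  # non-numeric strings become NaN instead of raising
df['Fruit'] = df['Fruit'].str.lower()                        # 'Apple' and 'apple' now match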

14 of 36

Cleaning Commands

  • Remove Columns and Rows:
    • df.drop('colName', axis=1, inplace=True) to drop a column.
    • df.drop(rowIndex, axis=0, inplace=True) to drop a row.
  • Apply Functions:
    • Apply a specific function across the DataFrame using df.apply(func) or to a column like df['Age'].apply(add_ten).
  • Unique Values:
    • Retrieve unique values from a column: df.colName.unique().
    • Check for missing values with df.isnull().sum().
    • Identify duplicate rows using df.duplicated().sum().
  • Merging Data:
    • Combine DataFrames using various merge strategies, e.g., df1.merge(df2, on='key').

15 of 36

What does merging look like?

16 of 36

What does merging look like?

merged_df = pd.merge(df1, df2, on='ID', how='inner')

merged_df = pd.merge(df1, df2, on='ID', how='left')
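A minimal sketch of the difference, assuming two small hypothetical DataFrames that share an 'ID' column:

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Score': [90, 85, 70]})

pd.merge(df1, df2, on='ID', how='inner')  # keeps only IDs 2 and 3 (present in both)
pd.merge(df1, df2, on='ID', how='left')   # keeps IDs 1, 2, 3; Alice gets NaN for Score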

17 of 36

Exercise 1

  1. Load in the dataset cereal_modified.csv into Pandas
  2. Find out which columns have missing values.
  3. Drop that column.

18 of 36

Exercise 1 Solution

df = pd.read_csv('cereal_modified.csv')

print(df.isnull().sum())

df.drop('rating', axis=1, inplace=True)

df

19 of 36

Simple Functions

  • String Methods:
    • str.lower(): Converts a string to lowercase.
    • str.upper(): Converts a string to uppercase.
    • Comparison operators (<, ==, >): Compare two strings lexicographically (this is the ordering that sorting uses).
  • Dictionary Mapping:
    • Used to replace or map values in a data structure using a dictionary.
    • Example: data.map({'old_value': 'new_value'}).
  • Basic Math Functions:
    • sum(): Calculates the sum of a list or array.
    • min(), max(): Find the minimum or maximum value.
    • round(number, digits): Rounds a number to a specified number of decimal places.
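A minimal sketch of these helpers on a small hypothetical Series:

s = pd.Series(['Apple', 'banana', 'APPLE'])

s.str.lower()                                 # 'apple', 'banana', 'apple'
s.str.lower().map({'apple': 1, 'banana': 2})  # dictionary mapping: 1, 2, 1
round(sum([1.234, 2.345]), 2)                 # basic math: 3.58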

20 of 36

Missing Values

21 of 36

Missing Values

  • Understanding Missing Data:
    • df.isna().sum() to detect missing entries in each column.
    • df[df.isna().sum(axis=1) > 1] to find rows with more than one missing value.
  • Options for Handling Missing Data:
    • Drop: Remove rows/columns with missing data.
      • df.dropna(subset=['colName']) drops rows where 'colName' is missing.
    • Impute:
      • What is imputation?
      • Replace missing values with statistical estimates (mean, median, mode).
  • Imputation Application:
    • Apply chosen imputation method to specific columns or the entire DataFrame based on the analysis needs.
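A minimal sketch of the two options, assuming a hypothetical df with missing values in an 'Age' column:

df.isna().sum()                                 # count missing entries per column

dropped = df.dropna(subset=['Age'])             # option 1: drop rows where 'Age' is missing
df['Age'] = df['Age'].fillna(df['Age'].mean())  # option 2: impute with the column mean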

22 of 36

Descriptive Statistics

Descriptive statistics are numbers that aim to summarize a dataset.

  • Measures of Central Tendency
    • Mean: the average of a dataset
    • Median: the middle value of the dataset when ordered from least to greatest
  • Measures of Dispersion
    • Variance: the average squared distance of data points from the mean
    • Standard Deviation: the square root of the variance
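In pandas, each of these is a one-liner on a numeric column (a sketch, assuming a hypothetical 'Age' column):

df['Age'].mean()    # mean
df['Age'].median()  # median
df['Age'].var()     # variance (sample variance by default)
df['Age'].std()     # standard deviation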

23 of 36

Imputation

  • How should we impute the following missing data given the circumstances?
    • Missing ~10% of values for latitude/longitude in city housing
    • Missing ~15% of values for Pokémon types
    • Missing ~70% of values for HUDS meal options
    • Missing ~30% of values in product ratings

24 of 36

Exercise 2

As an extension of Exercise 1, can you impute missing values in the column you identified using the mean?

25 of 36

Exercise 2 Solution

import pandas as pd

cereal_df = pd.read_csv('cereal_modified.csv')

mean_rating = cereal_df['rating'].mean()

cereal_df['rating'] = cereal_df['rating'].fillna(mean_rating)  # assign back rather than using inplace on a single column

26 of 36

Filtering/sorting

27 of 36

Filtering/Sorting Commands

  • Indexing:
    • Boolean Indexing: df[boolean_condition] for filtering rows.
    • df.iloc[start:end]: Position-based indexing, excludes the endpoint.
    • df.loc[start:end]: Label-based indexing, includes the endpoint.
  • Sorting:
    • df.sort_values(by='var', ascending=True, inplace=False): Sorts DataFrame by column.
  • Unique Values and Counts:
    • df['colName'].value_counts(): Lists unique values and their counts.
  • Replacing Values:
    • Simple replace: df['colName'].replace('old_value', 'new_value')
  • Exploding Lists:
    • df.explode('colName'): Splits lists in a column into separate rows.
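A minimal sketch of the last three commands, on a hypothetical df with 'Type' and 'Ingredients' columns:

df['Type'].value_counts()                        # counts of each unique value in 'Type'
df['Type'] = df['Type'].replace('corn', 'Corn')  # replace returns a new Series, so assign it back
df.explode('Ingredients')                        # one row per element of each list in 'Ingredients'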

28 of 36

Filtering

Start with:

# start
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'Salary': [50000, 60000, 70000],
        'Bonus': [5000.0, 6000.0, 7000.0]}
df = pd.DataFrame(data)

Original (top) vs post-filter (bottom)

filtered_df = df[df['Age'] > 28]

Original:

      Name  Age  Salary   Bonus
0    Alice   25   50000  5000.0
1      Bob   30   60000  6000.0
2  Charlie   35   70000  7000.0

Post-filter (Age > 28):

      Name  Age  Salary   Bonus
1      Bob   30   60000  6000.0
2  Charlie   35   70000  7000.0

29 of 36

Exercise 3

Let's load back in the unmodified CSV, cereal.csv.

  1. Can you first filter the dataset to only include cereals with a rating above 50?
  2. For this filtered dataset, can you sort the dataset by calories and protein in descending order?

30 of 36

Exercise 3 Solution

df[df['rating'] > 50].sort_values(by=['calories', 'protein'], ascending=False)

31 of 36

Categoricals

32 of 36

Categorical Variables

  • What are the dangers of including categorical variables unmodified?
    • Model Incompatibility: Many algorithms require numerical input.
    • Loss of Information: Misleading ordinal implications for nominal data.
    • Increased Complexity: Introduces noise, complicating the model.
    • Overfitting: High-dimensional space increases complexity and risk of overfitting.
  • Solution?
    • One Hot Encoding
    • pd.get_dummies(df): Converts categorical variable(s) into dummy/indicator variables.
    • Q: If we have n categorical variables, we should make ___ dummy variables?

33 of 36

Categorical Variables

  • What are the dangers of including categorical variables unmodified?
    • Model Incompatibility: Many algorithms require numerical input.
    • Loss of Information: Misleading ordinal implications for nominal data.
    • Increased Complexity: Introduces noise, complicating the model.
    • Overfitting: High-dimensional space increases complexity and risk of overfitting.
  • Solution?
    • One Hot Encoding
    • pd.get_dummies(df): Converts categorical variable(s) into dummy/indicator variables.
    • Q: If we have n categorical variables, we should make n-1 dummy variables?
      • Why?

34 of 36

Categorical Variables

  • What are the dangers of including categorical variables unmodified?
    • Model Incompatibility: Many algorithms require numerical input.
    • Loss of Information: Misleading ordinal implications for nominal data.
    • Increased Complexity: Introduces noise, complicating the model.
    • Overfitting: High-dimensional space increases complexity and risk of overfitting.
  • Solution?
    • One Hot Encoding
    • pd.get_dummies(df): Converts categorical variable(s) into dummy/indicator variables.
    • Q: If we have n categorical variables, we should make n-1 dummy variables?
      • Why?
      • Multicollinearity
        • Helps isolate individual variables’ impact
        • Imagine we care about 3 colors: red, blue, green.
        • If we know something isn’t red or blue, what does that imply?
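A minimal sketch of one-hot encoding with the n-1 trick, assuming a hypothetical 'Color' column:

df = pd.DataFrame({'Color': ['red', 'blue', 'green']})

pd.get_dummies(df, columns=['Color'], drop_first=True)
# keeps only Color_green and Color_red; a row where both are False must be blue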

35 of 36

Example

36 of 36

Project Time

Attendance Code: clean

https://tinyurl.com/hodp-spring25-project

Use the link on the first page of the form to find people and their interests. As groups form, we will try to update that spreadsheet (which is also here)

https://docs.google.com/spreadsheets/d/1QpwyljIJ8NnM-AN6ggfeG4PFC4bRGS8kgp8HMtTc8yQ/edit?gid=0#gid=0

Goal: Have data by end of next session