JavaScript isn't enabled in your browser, so this file can't be opened. Enable and reload.

1 of 76

Pandas, Part II

More Advanced Pandas (Grouping, Aggregation, Pivot Tables Merging)

Data 100/Data 200, Spring 2022 @ UC Berkeley

Josh Hug and Lisa Yan

LECTURE 4

2 of 76

New Syntax / Concept Summary

Today we’ll cover:

Sorting with a custom key (by googling how to do this).
Creating and dropping columns.
Groupby: Output of .groupby(“Name”) is a DataFrameGroupBy object. Condense back into a DataFrame or Series with:

groupby.agg
groupby.size
groupby.filter
and more...

Pivot tables: An alternate way to group by exactly two columns.
Joining tables using pd.merge.

3 of 76

Googling Custom Sorts

Lecture 04, Data 100 Spring 2022

Googling Custom Sorts
Adding, Modifying, and Removing Columns
Groupby.agg
Some groupby.agg Puzzles
One more groupby Puzzle
Other DataFrameGroupBy Features
Groupby and PivotTables
A Quick Look at Joining Tables

4 of 76

Manipulating String Data

Last time, we saw how we could find, for example, the most popular male names in California in the year 2020:

babynames.query('Sex == "M" and Year == 2020')

.sort_values("Count", ascending = False)

5 of 76

Manipulating String Data

What if we wanted to find the longest names in California?

Just sorting by name won’t work!

Before summer 2020, this would not have been straightforward.

But these days it is! Let’s figure it out by Googling.

babynames.query('Sex == "M" and Year == 2020')

.sort_values("Name", ascending = False)

6 of 76

Manipulating String Data

What if we wanted to find the longest names in California?

babynames.query('Sex == "M" and Year == 2020')

.sort_values("Name", key = lambda x: x.str.len(),

ascending = False)

7 of 76

Adding, Modifying, and Removing Columns

Lecture 04, Data 100 Spring 2022

Googling Custom Sorts
Adding, Modifying, and Removing Columns
Groupby.agg
Some groupby.agg Puzzles
One more groupby Puzzle
Other DataFrameGroupBy Features
Groupby and PivotTables
A Quick Look at Joining Tables

8 of 76

Sorting By Length

As motivation, let’s try to solve the sorting problem using a pre-2020 technique:

We will create a temporary column, then sort on it.

9 of 76

Approach 1: Create a Temporary Column

Intuition: Create a column equal to the length. Sort by that column.

10 of 76

Syntax for Column Addition

Adding a column is easy:

Can also do both steps on one line of code

#create a new series of only the lengths

babyname_lengths = babynames["Name"].str.len()

#add that series to the dataframe as a column

babynames["name_lengths"] = babyname_lengths

11 of 76

Syntax for Column Addition

Sorting a table is as usual:

babynames = babynames.sort_values(by = "name_lengths", ascending=False)

12 of 76

Syntax for Dropping a Column (or Row)

After sorting, we can drop the temporary column.

The Drop method assumes you’re dropping a row by default. Use axis = ‘columns’ to drop a column instead.

babynames = babynames.drop("name_lengths", axis = 'columns')

13 of 76

Sorting by Arbitrary Functions

Suppose we want to sort by the number of occurrences of “dr” + number of occurrences of “ea”.

Use the Series .map method.

def dr_ea_count(string):

return string.count('dr') + string.count('ea')

babynames["dr_ea_count"] = babynames["Name"].map(dr_ea_count)

babynames = babynames.sort_values(by = "dr_ea_count", ascending=False)

14 of 76

Groupby.agg

Lecture 04, Data 100 Spring 2022

Googling Custom Sorts
Adding, Modifying, and Removing Columns
Groupby.agg
Some groupby.agg Puzzles
One more groupby Puzzle
Other DataFrameGroupBy Features
Groupby and PivotTables
A Quick Look at Joining Tables

15 of 76

Goal

Goal: Find the female baby name whose popularity has fallen the most.

Number of Jennifers Born in California Per Year

16 of 76

Goal

Goal: Find the female baby name whose popularity has fallen the most.

Let’s start by defining what we mean by changed popularity.

In lecture, let’s define the “ratio to peak” or RTP as the ratio of Jennifers born today to the maximum number born in a single year.

Example for “Jennifer”:

In 1972, we hit peak Jennifer. 6,064 Jennifers were born.
In 2020, there were only 141 Jennifers.
RTP is 141 / 6064 = 0.0233.

Let’s spend some time in our notebook. The following N slides are for reference only and will be skipped during live lecture.

17 of 76

Calculating RTP

max_jennifers = max(babynames.query("Name == 'Jennifer' and Sex == 'F'")["Count"])

6064

current_jennifers = babynames.query("Name == 'Jennifer' and Sex == 'F'")["Count"].iloc[-1]

141

rtp = current_jennifers / max_jennifers

0.023251978891820582

def ratio_to_peak(series):

return series.iloc[-1] / max(series)

jennifer_counts_series = babynames.query("Name == 'Jennifer' and Sex == 'F'")["Count"]

ratio_to_peak(jennifer_counts_series)

0.02325197889182058

18 of 76

Approach 1: Getting RTP for Every Name The Hard Way

Approach 1: Hack something together using our existing Python knowledge.

Challenge: Try to fill in the code above by filling in the ??.

#build dictionary where each entry is the rtp for a given name

#e.g. rtps["jennifer"] should be 0.0231

rtps = {}

for name in ??:

counts_of_current_name = female_babynames[??]["Count"]

rtps[name] = ratio_to_peak(counts_of_current_name)

#convert to series

rtps = pd.Series(rtps)

19 of 76

Approach 1: Getting RTP for Every Name The Hard Way

Approach 1: Hack something together using our existing Python knowledge.

The code above is extremely slow, and also way more complicated than the better approach coming next.

#build dictionary where each entry is the rtp for a given name

#e.g. rtps["jennifer"] should be 0.0231

rtps = {}

for name in babynames["Name"].unique():

counts_of_current_name = female_babynames[female_babynames["Name"] == name]["Count"]

rtps[name] = ratio_to_peak(counts_of_current_name)

#convert to series

rtps = pd.Series(rtps)

20 of 76

Approach 2: Using Groupby and Agg

The code below is the more idiomatic way of computing what we want.

Much simpler, much faster, much more versatile.

female_babynames.groupby("Name").agg(ratio_to_peak)

female_babynames.group("Name", ratio_to_peak)

Nothing new here! This is just the group function that you saw in data8, where collect = ratio_to_peak.

21 of 76

Comparing the Two Approaches

As a reminder you should almost never be writing code in this class that includes loops or list comprehensions on Pandas series.

Use the pandas API as intended!

female_babynames.groupby("Name").agg(ratio_to_peak)

Approach 1: [BAD!!]

Approach 2:

22 of 76

Visual Review of Data 8: Grouping and Collection

groupby

Can think of as temporary 3 sub-dataframes

.agg(f), where f = ammd

Condenses every sub-dataframe back into a single row.

23 of 76

Attendance Question: Check Your groupBy Understanding

Approach 2 generated two columns, Year and Count. www.yellkey.com/perhaps

In the five rows shown, note the Year is 1.0 for every value.

Are there any rows for which Year is not 1.0?

Yes, names that appeared for the first time in 2020.
Yes, names that did not appear in 2020.
Yes, names whose peak Count was in 2020.
No, every row has a Year value of 1.0.

fbn.groupby("Name").agg(ratio_to_peak)

24 of 76

Attendance Question: Check Your groupBy Understanding

Approach 2 generated two columns, Year and Count. www.yellkey.com/perhaps

In the five rows shown, note the Year is 1.0 for every value.

Are there any rows for which Year is not 1.0?

Yes, names that appeared for the first time in 2020.
Yes, names that did not appear in 2020.
Yes, names whose peak Count was in 2020.
No, every row has a Year value of 1.0.

fbn.groupby("Name").agg(ratio_to_peak)

Note: This is a hard question! I originally tricked myself and thought the answer was B.

25 of 76

Note on Nuisance Columns

At least as of the time of this slide creation (January 2022), executing our agg call results in:

female_babynames.groupby("Name").agg(ratio_to_peak)

For more details, see Pandas 1.3 release notes.

26 of 76

Note on Nuisance Columns

At least as of the time of this slide creation (January 2022), executing our agg call results in:

\\\\\

At some point in the future (maybe when you try running the notebook sometime in late 2022 or later), this code will simply crash!

Presumably, the designers of pandas felt like automatically dropping nuisance columns leads to bad coding practices.
And in line with the Zen of Python: “Explicit is better than implicit.”

female_babynames.groupby("Name").agg(ratio_to_peak)

27 of 76

Note on Nuisance Columns

Below, we explicitly select the columns BEFORE calling agg to avoid the warning.

rtp_table = female_babynames.groupby("Name")[["Count"]].agg(ratio_to_peak)

28 of 76

Renaming Columns

The code below renames the Count column to “Count RTP”.

rtp_table = female_babynames.groupby("Name")[["Count"]].agg(ratio_to_peak)

rtp_table = rtp_table.rename(columns = {"Count": "Count RTP"})

This syntax will be covered in lab 2 in slightly more detail.

29 of 76

Some Data Science Payoff

By sorting rtp_table we can see the names whose popularity has decreased the most.

rtp_table.sort_values("Count RTP")

30 of 76

Some Data Science Payoff

By sorting rtp_table we can see the names whose popularity has decreased the most.

rtp_table.sort_values("Count RTP")

px.line(babynames.query("Name == 'Debra' and Sex == 'F'"),

x = "Year", y = "Count")

31 of 76

Some Data Science Payoff

With some fancier code we can plot the ten female names with the lowest Count RTP.

rtp_table.sort_values("Count RTP")

px.line(babynames.query("Name == 'Debra' and Sex == 'F'"),

x = "Year", y = "Count")

32 of 76

We can get the list of the top 10 names and then plot popularity with::

top10 = rtp_table.sort_values("Count RTP").head(10).index

px.line(babynames.query("Name in @top10 and Sex == 'F'"),

x = "Year", y = "Count", color = "Name")

33 of 76

Some groupby.agg Puzzles

Lecture 04, Data 100 Spring 2022

Googling Custom Sorts
Adding, Modifying, and Removing Columns
Groupby.agg
Some groupby.agg Puzzles
One more groupby Puzzle
Other DataFrameGroupBy Features
Groupby and PivotTables
A Quick Look at Joining Tables

34 of 76

Groupby Puzzle #1

Before we saw that the code below generates the Count RTP for all female names.

female_babynames.groupby("Name")[["Count"]].agg(ratio_to_peak)

35 of 76

Groupby Puzzle #1

Before we saw that the code below generates the Count RTP for all female names.

Write a groupby.agg call that returns the total number of babies with each name.

female_babynames.groupby("Name")[["Count"]].agg(ratio_to_peak)

36 of 76

Groupby Puzzle #2

Before we saw that the code below generates the Count RTP for all female names.

Write a groupby.agg call that returns the total number of babies with each name.

female_babynames.groupby("Name")[["Count"]].agg(ratio_to_peak)

female_babynames.groupby("Name")[["Count"]].agg(sum)

37 of 76

Groupby Puzzle #2

Before we saw that the code below generates the total number of babies with each name.

Write a groupby.agg call that returns the total babies born in every year:

female_babynames.groupby("Name")[["Count"]].agg(sum)

38 of 76

Groupby Puzzle #2

Before we saw that the code below generates the total number of babies with each name.

Write a groupby.agg call that returns the total babies born in every year:

female_babynames.groupby("Name")[["Count"]].agg(sum)

female_babynames.groupby("Year")[["Count"]].agg(sum)

39 of 76

Shorthand groupby Methods

Pandas also provides a number of shorthand functions that you can use in place of agg.

Instead, we could have simply written:

female_babynames.groupby("Name")[["Count"]].agg(sum)

female_babynames.groupby("Name")[["Count"]].sum()

For more examples (first, last, mean, median, etc.) see the left sidebar on: https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.sum.html

40 of 76

Plotting Birth Counts

Plotting the DataFrame we just generated tells an interesting story.

puzzle2 = female_babynames.groupby("Year")[["Count"]].agg(sum)

px.line(puzzle2, y = "Count")

41 of 76

A Word of Warning!

We made an enormous assumption when we decided to use this dataset to estimate the birth rate.

According to https://lao.ca.gov/LAOEconTax/Article/Detail/691, the true number of babies born in California in 2020 was more than 400,000.
What happened?

42 of 76

From Lecture 1: Exploratory Data Analysis and Visualization

How is our data organized and what does it contain?
Do we already have relevant data?
What are the biases, anomalies, or other issues with the data?
How do we transform the data to enable effective analysis?

Bottom line: Blindly using tools is dangerous!

Lisa will cover EDA next week.

Ask a Question

Obtain Data

Understand the Data

Understand the World

Reports, Decisions, and Solutions

43 of 76

From Lecture 1: Exploratory Data Analysis and Visualization

What are the biases, anomalies, or other issues with the data?

We only used names for babies who are female at birth.
Not all babies register for social security.
The database does not include names of popularity less than 5 per year (for example both of my kids).

Ask a Question

Obtain Data

Understand the Data

Understand the World

Reports, Decisions, and Solutions

44 of 76

One more groupby Puzzle

Lecture 04, Data 100 Spring 2022

Googling Custom Sorts
Adding, Modifying, and Removing Columns
Groupby.agg
Some groupby.agg Puzzles
One more groupby Puzzle
Other DataFrameGroupBy Features
Groupby and PivotTables
A Quick Look at Joining Tables

45 of 76

Groupby Puzzle #4

Why does the table seem to claim that Woodrow Wilson won the presidency in 2020?

elections.groupby("Party").agg(max).head(10)

46 of 76

groupby Puzzle #4

Why does the table seem to claim that Woodrow Wilson won the presidency in 2020?

Every column is calculated independently! Among Democrats:

Last year they ran: 2020
Alphabetically latest candidate name: Woodrow Wilson
Highest % of vote: 61.34

elections.groupby("Party").agg(max).head(10)

47 of 76

Quick Subpuzzle

groupby

.agg(f), where f = max

What will go in the ??

48 of 76

Quick Subpuzzle

groupby

.agg(f), where f = max

49 of 76

Puzzle #4

Very hard puzzle: Try to write code that returns the table below.

Each row shows the best result (in %) by each party.

For example: Best Democratic result ever was Johnson’s 1964 win.

50 of 76

Puzzle #4

Very hard puzzle: Try to write code that returns the table below.

Hint, first do:
Each row shows the best result (in %) by each party.

elections_sorted_by_percent = elections.sort_values("%", ascending=False)

51 of 76

Puzzle #4

Very hard puzzle: Try to write code that returns the table below.

First sort the DataFrame so that rows are in ascending order of %.
Then group by Party and take the first item of each series.
Note: Lab will give you a chance to try this out if you didn’t quite follow during lecture.

elections_sorted_by_percent = elections.sort_values("%", ascending=False)

elections_sorted_by_percent.groupby("Party").agg(lambda x : x.iloc[0])

elections_sorted_by_percent

52 of 76

There’s More Than One Way to Find the Best Result by Party

In Pandas, there’s more than one way to get to the same answer.

Each approach has different tradeoffs in terms of readability, performance, memory consumption, complexity, etc.
Takes a very long time to understand these tradeoffs!
If you find your current solution to be particularly convoluted or hard to read, maybe try finding another way!

53 of 76

Alternate Approaches

Some examples that use syntax we haven’t discussed in class:

See today’s lecture notebook if you want to explore idxmax and drop_duplicates.

We won’t cover these formally in the course.

best_per_party = elections.loc[elections.groupby('Party')['%'].idxmax()]

elections_sorted_by_percent = elections.sort_values("%", ascending=False)

elections_sorted_by_percent.groupby("Party").agg(lambda x : x.iloc[0])

best_per_party2 = elections.sort_values('%').drop_duplicates(['Party'], keep='last')

54 of 76

Other DataFrameGroupBy Features

Lecture 04, Data 100 Spring 2022

Googling Custom Sorts
Adding, Modifying, and Removing Columns
Groupby.agg
Some groupby.agg Puzzles
One more groupby Puzzle
Other DataFrameGroupBy Features
Groupby and PivotTables
A Quick Look at Joining Tables

55 of 76

Revisiting groupby.agg

So far, we’ve seen that df.groupby("year").agg(sum):

Organizes all rows with the same year into a subframe for that year.
Creates a new dataframe with one row representing each subframe year.

All rows in each subframe are combined using the sum function.

56 of 76

Raw groupby Objects

The result of a groupby operation applied to a DataFrame is a DataFrameGroupBy object.

It is not a DataFrame!

Given a DataFrameGroupBy object, can use various functions to generate DataFrames (or Series). agg is only one choice:

agg: Creates a new DataFrame with one aggregated row per subframe.
max: Creates a new DataFrame aggregated using the max function.
size: Creates a new Series with the size of each subframe.
filter: Creates a copy of the original DataFrame, but keeping only rows from subframes that obey the provided condition.

See https://pandas.pydata.org/docs/reference/groupby.html for a list of DataFrameGroupBy methods.

grouped_by_year = babynames.groupby("Year")

type(grouped_by_year)

pandas.core.groupby.generic.DataFrameGroupBy

57 of 76

groupby.size()

1992

1996

2000

1996

1992

2000

1992

1996

2000

1992

1996

2000

groupby(year)

.size()

2000

58 of 76

Filtering by Group

Another common use for groups is to filter data.

groupby.filter takes an argument f.
f is a function that:

Takes a DataFrame as input.
Returns either true or false.

For each group g, f is applied to the subframe comprised of the rows from the original dataframe corresponding to that group.

59 of 76

groupby.filter

groupby

.filter(f), where

f = lambda sf: sf[“num”].sum() > 10

60 of 76

Groupby and PivotTables

Lecture 04, Data 100 Spring 2022

Googling Custom Sorts
Adding, Modifying, and Removing Columns
Groupby.agg
Some groupby.agg Puzzles
One more groupby Puzzle
Other DataFrameGroupBy Features
Groupby and PivotTables
A Quick Look at Joining Tables

61 of 76

Grouping by Multiple Columns

Suppose we want to build a table showing the total number of babies born of each sex in each year. One way is to groupby using both columns of interest:

Example:

Note: Resulting DataFrame is multi-indexed. That is, its index has multiple dimensions. Will explore in a later lecture.

babynames.groupby(["Year", "Sex"]).agg(sum).head(6)

62 of 76

Pivot Tables

A more natural approach is to use our Data 8 brains and create a pivot table.

babynames_pivot = babynames.pivot_table(

index='Year', # rows (turned into index)

columns='Sex', # column values

values=['Count'], # field(s) to process in each group

aggfunc=np.sum, # group operation

)

babynames_pivot.head(6)

63 of 76

groupby([“Year”, “Sex”]) vs. pivot_table

The pivot table more naturally represents our data.

64 of 76

Pivot Table Mechanics

NaN

...

group

f = sum

65 of 76

Pivot Tables

We can include multiple values in our pivot tables.

babynames_pivot = babynames.pivot_table(

index='Year', # rows (turned into index)

columns='Sex', # column values

values=['Count', 'Name'],

aggfunc=np.max, # group operation

)

babynames_pivot.head(6)

66 of 76

A Quick Look at Joining Tables

Lecture 04, Data 100 Spring 2022

Googling Custom Sorts
Adding, Modifying, and Removing Columns
Groupby.agg
Some groupby.agg Puzzles
One more groupby Puzzle
Other DataFrameGroupBy Features
Groupby and PivotTables
A Quick Look at Joining Tables

67 of 76

Joining Tables

Suppose want to know the 2020 male popularity of presidential candidate’s names.

Example: Dwight Eisenhower’s name Dwight is not popular today, with only 5 babies born with this name in California in 2020.

To solve this problem, we’ll have to join tables.

This will be almost exactly like Table.join from data 8 (http://data8.org/datascience/_autosummary/datascience.tables.Table.join.html)

68 of 76

Creating Table 1: Male Babynames

Let’s set aside only male names from 2020 first:

male_2020_babynames = babynames.query('Sex == "M" and Year == 2020')

male_2020_babynames

69 of 76

Creating Table 2: Presidents with First Names

To join our table, we’ll also need to set aside the first names of each candidate.

You’ll have a chance to write this code again on lab, so don’t worry about the details too much.

elections["First Name"] = elections["Candidate"].str.split().str[0]

70 of 76

Joining Our Tables

merged = pd.merge(left = elections, right = male_2020_babynames,

left_on = "First Name", right_on = "Name")

71 of 76

Lab

We’ll talk more about joining tables in a future lecture.

Note: Your code in the lab will also be capable of finding the popularity of female presidential candidate names!

72 of 76

New Syntax / Concept Summary

Today we covered:

Sorting with a custom key.
Creating and dropping columns.
Groupby: Output of .groupby(“Name”) is a DataFrameGroupBy object. Condense back into a DataFrame or Series with:

groupby.agg
groupby.size
groupby.filter
and more...

Pivot tables: An alternate way to group by exactly two columns.
Joining tables using pd.merge.

73 of 76

Groupby Puzzle #4 (Tricky!)

You might expect that the code below gives the relative birth rate for each year.

However, this code actually yields the result shown. What is this?

female_babynames.groupby("Year")[["Count"]].agg(ratio_to_peak)

74 of 76

Groupby Puzzle #3 (Tricky!)

You might expect that the code below gives the ratio_to_peak for the birth rate for each year.

However, this code actually yields the result shown. What is this?

female_babynames.groupby("Year")[["Count"]].agg(ratio_to_peak)

Consider the 1914 names:

def ratio_to_peak(series):

return series.iloc[-1] / max(series)

max(series)

series.iloc[-1]

5 / 773 = 0.006468

75 of 76

Groupby Puzzle #3 (Tricky!)

You might expect that the code below gives the ratio_to_peak for the birth rate for each year.

Instead it plots 5 / max(count) for each year.

female_babynames.groupby("Year")[["Count"]].agg(ratio_to_peak)

76 of 76

Computing Relative Birth Counts

In puzzle two we wrote:

To compute relative birth counts, we’d just divide this DataFrame by its max value.

puzzle2 = female_babynames.groupby("Year")[["Count"]].agg(sum)

puzzle2 / max(puzzle2["Count"])