Pandas, Part II
More Advanced Pandas (Grouping, Aggregation, Pivot Tables Merging)
Data 100/Data 200, Spring 2022 @ UC Berkeley
Josh Hug and Lisa Yan
1
LECTURE 4
New Syntax / Concept Summary
Today we’ll cover:
2
Googling Custom Sorts
Lecture 04, Data 100 Spring 2022
3
Manipulating String Data
Last time, we saw how we could find, for example, the most popular male names in California in the year 2020:
4
babynames.query('Sex == "M" and Year == 2020')
.sort_values("Count", ascending = False)
Manipulating String Data
What if we wanted to find the longest names in California?
Before summer 2020, this would not have been straightforward.
5
babynames.query('Sex == "M" and Year == 2020')
.sort_values("Name", ascending = False)
Manipulating String Data
What if we wanted to find the longest names in California?
6
babynames.query('Sex == "M" and Year == 2020')
.sort_values("Name", key = lambda x: x.str.len(),
ascending = False)
Adding, Modifying, and Removing Columns
Lecture 04, Data 100 Spring 2022
7
Sorting By Length
As motivation, let’s try to solve the sorting problem using a pre-2020 technique:
8
Approach 1: Create a Temporary Column
Intuition: Create a column equal to the length. Sort by that column.
9
Syntax for Column Addition
Adding a column is easy:
10
Can also do both steps on one line of code
#create a new series of only the lengths
babyname_lengths = babynames["Name"].str.len()
#add that series to the dataframe as a column
babynames["name_lengths"] = babyname_lengths
Syntax for Column Addition
Sorting a table is as usual:
11
babynames = babynames.sort_values(by = "name_lengths", ascending=False)
Syntax for Dropping a Column (or Row)
After sorting, we can drop the temporary column.
12
babynames = babynames.drop("name_lengths", axis = 'columns')
Sorting by Arbitrary Functions
Suppose we want to sort by the number of occurrences of “dr” + number of occurrences of “ea”.
13
def dr_ea_count(string):
return string.count('dr') + string.count('ea')
babynames["dr_ea_count"] = babynames["Name"].map(dr_ea_count)
babynames = babynames.sort_values(by = "dr_ea_count", ascending=False)
Groupby.agg
Lecture 04, Data 100 Spring 2022
14
Goal
Goal: Find the female baby name whose popularity has fallen the most.
15
Number of Jennifers Born in California Per Year
Goal
Goal: Find the female baby name whose popularity has fallen the most.
Let’s start by defining what we mean by changed popularity.
Example for “Jennifer”:
Let’s spend some time in our notebook. The following N slides are for reference only and will be skipped during live lecture.
16
Calculating RTP
17
max_jennifers = max(babynames.query("Name == 'Jennifer' and Sex == 'F'")["Count"])
6064
current_jennifers = babynames.query("Name == 'Jennifer' and Sex == 'F'")["Count"].iloc[-1]
141
rtp = current_jennifers / max_jennifers
0.023251978891820582
def ratio_to_peak(series):
return series.iloc[-1] / max(series)
jennifer_counts_series = babynames.query("Name == 'Jennifer' and Sex == 'F'")["Count"]
ratio_to_peak(jennifer_counts_series)
0.02325197889182058
Approach 1: Getting RTP for Every Name The Hard Way
Approach 1: Hack something together using our existing Python knowledge.
Challenge: Try to fill in the code above by filling in the ??.
18
#build dictionary where each entry is the rtp for a given name
#e.g. rtps["jennifer"] should be 0.0231
rtps = {}
for name in ??:
counts_of_current_name = female_babynames[??]["Count"]
rtps[name] = ratio_to_peak(counts_of_current_name)
#convert to series
rtps = pd.Series(rtps)
Approach 1: Getting RTP for Every Name The Hard Way
Approach 1: Hack something together using our existing Python knowledge.
The code above is extremely slow, and also way more complicated than the better approach coming next.
19
#build dictionary where each entry is the rtp for a given name
#e.g. rtps["jennifer"] should be 0.0231
rtps = {}
for name in babynames["Name"].unique():
counts_of_current_name = female_babynames[female_babynames["Name"] == name]["Count"]
rtps[name] = ratio_to_peak(counts_of_current_name)
#convert to series
rtps = pd.Series(rtps)
Approach 2: Using Groupby and Agg
The code below is the more idiomatic way of computing what we want.
20
female_babynames.groupby("Name").agg(ratio_to_peak)
female_babynames.group("Name", ratio_to_peak)
Nothing new here! This is just the group function that you saw in data8, where collect = ratio_to_peak.
Comparing the Two Approaches
As a reminder you should almost never be writing code in this class that includes loops or list comprehensions on Pandas series.
21
female_babynames.groupby("Name").agg(ratio_to_peak)
Approach 1: [BAD!!]
Approach 2:
Visual Review of Data 8: Grouping and Collection
22
A
3
B
1
C
4
A
1
B
5
C
9
A
2
C
5
B
6
12
7
3
2
7
3
8
6
7
A
3
A
1
A
2
B
1
B
5
B
6
C
4
C
9
C
5
groupby
12
2
8
7
7
7
3
3
6
A
2
10
B
5
0
C
5
3
Can think of as temporary 3 sub-dataframes
.agg(f), where f = ammd
Condenses every sub-dataframe back into a single row.
Attendance Question: Check Your groupBy Understanding
In the five rows shown, note the Year is 1.0 for every value.
Are there any rows for which Year is not 1.0?
23
fbn.groupby("Name").agg(ratio_to_peak)
Attendance Question: Check Your groupBy Understanding
In the five rows shown, note the Year is 1.0 for every value.
Are there any rows for which Year is not 1.0?
24
fbn.groupby("Name").agg(ratio_to_peak)
Note: This is a hard question! I originally tricked myself and thought the answer was B.
Note on Nuisance Columns
At least as of the time of this slide creation (January 2022), executing our agg call results in:
25
female_babynames.groupby("Name").agg(ratio_to_peak)
For more details, see Pandas 1.3 release notes.
Note on Nuisance Columns
At least as of the time of this slide creation (January 2022), executing our agg call results in:
\\\\\
At some point in the future (maybe when you try running the notebook sometime in late 2022 or later), this code will simply crash!
26
female_babynames.groupby("Name").agg(ratio_to_peak)
Note on Nuisance Columns
Below, we explicitly select the columns BEFORE calling agg to avoid the warning.
27
rtp_table = female_babynames.groupby("Name")[["Count"]].agg(ratio_to_peak)
Renaming Columns
The code below renames the Count column to “Count RTP”.
28
rtp_table = female_babynames.groupby("Name")[["Count"]].agg(ratio_to_peak)
rtp_table = rtp_table.rename(columns = {"Count": "Count RTP"})
This syntax will be covered in lab 2 in slightly more detail.
Some Data Science Payoff
By sorting rtp_table we can see the names whose popularity has decreased the most.
29
rtp_table.sort_values("Count RTP")
Some Data Science Payoff
By sorting rtp_table we can see the names whose popularity has decreased the most.
30
rtp_table.sort_values("Count RTP")
px.line(babynames.query("Name == 'Debra' and Sex == 'F'"),
x = "Year", y = "Count")
Some Data Science Payoff
With some fancier code we can plot the ten female names with the lowest Count RTP.
31
rtp_table.sort_values("Count RTP")
px.line(babynames.query("Name == 'Debra' and Sex == 'F'"),
x = "Year", y = "Count")
We can get the list of the top 10 names and then plot popularity with::
32
top10 = rtp_table.sort_values("Count RTP").head(10).index
px.line(babynames.query("Name in @top10 and Sex == 'F'"),
x = "Year", y = "Count", color = "Name")
Some groupby.agg Puzzles
Lecture 04, Data 100 Spring 2022
33
Groupby Puzzle #1
Before we saw that the code below generates the Count RTP for all female names.
34
female_babynames.groupby("Name")[["Count"]].agg(ratio_to_peak)
Groupby Puzzle #1
Before we saw that the code below generates the Count RTP for all female names.
Write a groupby.agg call that returns the total number of babies with each name.
35
female_babynames.groupby("Name")[["Count"]].agg(ratio_to_peak)
Groupby Puzzle #2
Before we saw that the code below generates the Count RTP for all female names.
Write a groupby.agg call that returns the total number of babies with each name.
36
female_babynames.groupby("Name")[["Count"]].agg(ratio_to_peak)
female_babynames.groupby("Name")[["Count"]].agg(sum)
Groupby Puzzle #2
Before we saw that the code below generates the total number of babies with each name.
Write a groupby.agg call that returns the total babies born in every year:
37
female_babynames.groupby("Name")[["Count"]].agg(sum)
Groupby Puzzle #2
Before we saw that the code below generates the total number of babies with each name.
Write a groupby.agg call that returns the total babies born in every year:
38
female_babynames.groupby("Name")[["Count"]].agg(sum)
female_babynames.groupby("Year")[["Count"]].agg(sum)
Shorthand groupby Methods
Pandas also provides a number of shorthand functions that you can use in place of agg.
Instead, we could have simply written:
39
female_babynames.groupby("Name")[["Count"]].agg(sum)
female_babynames.groupby("Name")[["Count"]].sum()
For more examples (first, last, mean, median, etc.) see the left sidebar on: https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.sum.html
Plotting Birth Counts
Plotting the DataFrame we just generated tells an interesting story.
40
puzzle2 = female_babynames.groupby("Year")[["Count"]].agg(sum)
px.line(puzzle2, y = "Count")
A Word of Warning!
We made an enormous assumption when we decided to use this dataset to estimate the birth rate.
41
From Lecture 1: Exploratory Data Analysis and Visualization
Bottom line: Blindly using tools is dangerous!
Lisa will cover EDA next week.
42
Ask a Question
Obtain Data
Understand the Data
Understand the World
Reports, Decisions, and Solutions
From Lecture 1: Exploratory Data Analysis and Visualization
What are the biases, anomalies, or other issues with the data?
43
Ask a Question
Obtain Data
Understand the Data
Understand the World
Reports, Decisions, and Solutions
One more groupby Puzzle
Lecture 04, Data 100 Spring 2022
44
Groupby Puzzle #4
Why does the table seem to claim that Woodrow Wilson won the presidency in 2020?
45
elections.groupby("Party").agg(max).head(10)
groupby Puzzle #4
Why does the table seem to claim that Woodrow Wilson won the presidency in 2020?
Every column is calculated independently! Among Democrats:
46
elections.groupby("Party").agg(max).head(10)
Quick Subpuzzle
47
A
3
B
1
C
4
A
1
B
5
C
9
A
2
C
5
B
6
A
3
A
1
A
2
B
1
B
5
B
6
C
4
C
9
C
5
A
3
B
6
C
9
groupby
.agg(f), where f = max
ak
tx
fl
hi
mi
ak
ca
sd
nc
ak
hi
ca
tx
mi
nc
fl
ak
sd
??
??
??
What will go in the ??
Quick Subpuzzle
48
A
3
B
1
C
4
A
1
B
5
C
9
A
2
C
5
B
6
A
3
A
1
A
2
B
1
B
5
B
6
C
4
C
9
C
5
A
3
B
6
C
9
groupby
.agg(f), where f = max
ak
tx
fl
hi
mi
ak
ca
sd
nc
ak
hi
ca
tx
mi
nc
fl
ak
sd
hi
tx
sd
Puzzle #4
Very hard puzzle: Try to write code that returns the table below.
49
Puzzle #4
Very hard puzzle: Try to write code that returns the table below.
50
elections_sorted_by_percent = elections.sort_values("%", ascending=False)
Puzzle #4
Very hard puzzle: Try to write code that returns the table below.
51
elections_sorted_by_percent = elections.sort_values("%", ascending=False)
elections_sorted_by_percent.groupby("Party").agg(lambda x : x.iloc[0])
elections_sorted_by_percent
There’s More Than One Way to Find the Best Result by Party
In Pandas, there’s more than one way to get to the same answer.
52
Alternate Approaches
Some examples that use syntax we haven’t discussed in class:
See today’s lecture notebook if you want to explore idxmax and drop_duplicates.
53
best_per_party = elections.loc[elections.groupby('Party')['%'].idxmax()]
elections_sorted_by_percent = elections.sort_values("%", ascending=False)
elections_sorted_by_percent.groupby("Party").agg(lambda x : x.iloc[0])
best_per_party2 = elections.sort_values('%').drop_duplicates(['Party'], keep='last')
Other DataFrameGroupBy Features
Lecture 04, Data 100 Spring 2022
54
Revisiting groupby.agg
So far, we’ve seen that df.groupby("year").agg(sum):
55
Raw groupby Objects
The result of a groupby operation applied to a DataFrame is a DataFrameGroupBy object.
Given a DataFrameGroupBy object, can use various functions to generate DataFrames (or Series). agg is only one choice:
56
See https://pandas.pydata.org/docs/reference/groupby.html for a list of DataFrameGroupBy methods.
grouped_by_year = babynames.groupby("Year")
type(grouped_by_year)
pandas.core.groupby.generic.DataFrameGroupBy
groupby.size()
57
1992
3
1996
1
2000
4
1996
1
1992
5
2000
9
2000
2
2000
6
1992
3
1992
5
1996
1
1996
1
2000
4
2000
9
2000
6
1992
2
1996
2
2000
4
groupby(year)
.size()
ak
tx
fl
hi
mi
ak
ca
sd
ak
hi
tx
mi
fl
ak
sd
2000
2
ca
Filtering by Group
Another common use for groups is to filter data.
58
groupby.filter
59
A
3
B
1
C
4
A
1
B
5
C
9
A
2
D
5
B
6
A
3
A
1
A
2
B
1
B
5
B
6
C
4
C
9
D
5
groupby
.filter(f), where
f = lambda sf: sf[“num”].sum() > 10
B
1
C
4
B
5
C
9
B
6
12
13
6
5
Groupby and PivotTables
Lecture 04, Data 100 Spring 2022
60
Grouping by Multiple Columns
Suppose we want to build a table showing the total number of babies born of each sex in each year. One way is to groupby using both columns of interest:
Example:
61
Note: Resulting DataFrame is multi-indexed. That is, its index has multiple dimensions. Will explore in a later lecture.
babynames.groupby(["Year", "Sex"]).agg(sum).head(6)
Pivot Tables
A more natural approach is to use our Data 8 brains and create a pivot table.
62
babynames_pivot = babynames.pivot_table(
index='Year', # rows (turned into index)
columns='Sex', # column values
values=['Count'], # field(s) to process in each group
aggfunc=np.sum, # group operation
)
babynames_pivot.head(6)
groupby([“Year”, “Sex”]) vs. pivot_table
The pivot table more naturally represents our data.
63
Pivot Table Mechanics
64
A
3
B
1
C
4
A
1
B
5
C
9
A
2
D
5
B
6
F
M
F
M
F
M
F
F
M
A
3
A
2
F
F
A
1
M
B
5
F
B
1
B
6
M
M
C
4
F
C
9
M
D
5
F
A
5
F
A
1
M
B
5
F
B
7
M
C
4
F
C
9
M
D
5
F
5
NaN
A
B
C
D
F
M
1
5
4
5
7
9
...
R
C
group
f
f
f
f
f
f
f
f = sum
Pivot Tables
We can include multiple values in our pivot tables.
65
babynames_pivot = babynames.pivot_table(
index='Year', # rows (turned into index)
columns='Sex', # column values
values=['Count', 'Name'],
aggfunc=np.max, # group operation
)
babynames_pivot.head(6)
A Quick Look at Joining Tables
Lecture 04, Data 100 Spring 2022
66
Joining Tables
Suppose want to know the 2020 male popularity of presidential candidate’s names.
To solve this problem, we’ll have to join tables.
67
Creating Table 1: Male Babynames
Let’s set aside only male names from 2020 first:
68
male_2020_babynames = babynames.query('Sex == "M" and Year == 2020')
male_2020_babynames
Creating Table 2: Presidents with First Names
To join our table, we’ll also need to set aside the first names of each candidate.
69
elections["First Name"] = elections["Candidate"].str.split().str[0]
Joining Our Tables
70
merged = pd.merge(left = elections, right = male_2020_babynames,
left_on = "First Name", right_on = "Name")
Lab
We’ll talk more about joining tables in a future lecture.
71
New Syntax / Concept Summary
Today we covered:
72
Groupby Puzzle #4 (Tricky!)
You might expect that the code below gives the relative birth rate for each year.
However, this code actually yields the result shown. What is this?
73
female_babynames.groupby("Year")[["Count"]].agg(ratio_to_peak)
Groupby Puzzle #3 (Tricky!)
You might expect that the code below gives the ratio_to_peak for the birth rate for each year.
However, this code actually yields the result shown. What is this?
74
female_babynames.groupby("Year")[["Count"]].agg(ratio_to_peak)
Consider the 1914 names:
def ratio_to_peak(series):
return series.iloc[-1] / max(series)
max(series)
series.iloc[-1]
5 / 773 = 0.006468
Groupby Puzzle #3 (Tricky!)
You might expect that the code below gives the ratio_to_peak for the birth rate for each year.
Instead it plots 5 / max(count) for each year.
75
female_babynames.groupby("Year")[["Count"]].agg(ratio_to_peak)
Computing Relative Birth Counts
In puzzle two we wrote:
To compute relative birth counts, we’d just divide this DataFrame by its max value.
76
puzzle2 = female_babynames.groupby("Year")[["Count"]].agg(sum)
puzzle2 / max(puzzle2["Count"])