Pandas, Part II
More on Pandas (Utility Functions, Grouping, Aggregation)
Data 100/Data 200, Spring 2023 @ UC Berkeley
Narges Norouzi and Lisa Yan
Content credit: Narges Norouzi, Lisa Yan, Josh Hug
1
LECTURE 3
Join at slido.com�#2647893
ⓘ Start presenting to display the joining instructions on this slide.
New Syntax / Concept Summary
Today we’ll cover:
3
More on Conditional Selection
Lecture 03, Data 100 Spring 2023
4
Boolean Array Input
Yet another input type supported by loc and [] is the boolean array.
5
babynames_first_10_rows = babynames.loc[:9, :]
babynames_first_10_rows[[True, False, True, False, True, False, True, False, True, False]]
Boolean Array Input
We can perform the same operation using loc.
6
babynames_first_10_rows = babynames.loc[:9, :]
babynames_first_10_rows.loc[[True, False, True, False, True, False, True, False, True, False], :]
Boolean Array Input
Useful because boolean arrays can be generated by using logical operators on Series.
7
Length 400761 Series where every entry is either “True” or “False”, where “True” occurs for every babyname with "Sex" = "F".
logical_operator = (babynames["Sex"] == "F")
True in rows 0, 1, 2, …
Boolean Array Input
Useful because boolean arrays can be generated by using logical operators on Series.
8
Length 400761 Series where every entry is either “True” or “False”, where “True” occurs for every babyname with "Sex" = "F".
babynames[babynames["Sex"] == "F"]
Length 235791 Series where every entry belongs to a babyname with "Sex" = "F"
Boolean Array Input
Can also use .loc.
9
9
Length 400761 Series where every entry is either “True” or “False”, where “True” occurs for every babyname with "Sex" = "F".
babynames.loc[babynames["Sex"] == "F"]
Length 235791 Series where every entry belongs to a babyname with "Sex" = "F"
Boolean Array Input
Boolean Series can be combined using various operators, allowing filtering of results by multiple criteria.
10
babynames[(babynames["Sex"] == "F") & (babynames["Year"] < 2000)]
Which of the following pandas statements returns a DataFrame of the first 3 baby names with Count > 250.
ⓘ Start presenting to display the poll results on this slide.
Answer
Which of the following pandas statements returns a DataFrame of the first 3 baby names with Count > 250.
babynames.iloc[[0, 233, 484], [3, 4]]
babynames.loc[[0, 233, 484]]
babynames.loc[babynames["Count"] > 250, ["Name", "Count"]].head(3)
babynames.loc[babynames["Count"] > 250, ["Name", "Count"]].iloc[0:2, :]
12
Note on Exam Problems
Q: Are you going to put horrible problems like these on the exam?
�A: Technically such problems would be in scope, but it’s very unlikely they’ll be this nitpicky.
13
babynames.loc[babynames["Count"] > 250, ["Name", "Count"]].iloc[0:2, :]
Alternatives to Boolean Array Selection
Boolean array selection is a useful tool, but can lead to overly verbose code for complex conditions.
Pandas provides many alternatives, for example:
14
babynames[(babynames["Name"] == "Bella") |
(babynames["Name"] == "Alex") |
(babynames["Name"] == "Ani") |
(babynames["Name"] == "Lisa")]
Alternatives to Boolean Array Selection
Pandas provides many alternatives, for example:
15
names = ["Bella", "Alex", "Ani", "Lisa"]
babynames[babynames["Name"].isin(names)]
Alternatives to Boolean Array Selection
Pandas provides many alternatives, for example:
16
babynames[babynames["Name"].str.startswith("N")]
Handy Utility Functions
Lecture 03, Data 100 Spring 2023
17
Numpy
Pandas Series and DataFrames support a large number of operations, including mathematical operations, so long as the data is numerical.
18
bella_count = babynames[babynames["Name"] == "Bella"]["Count"]
np.mean(bella_counts)
270.1860465116279
max(bella_counts)
902
Pandas
In addition to its rich syntax for indexing and support for other libraries (numpy, built-in functions), Pandas provides an enormous number of useful utility functions. Today, we’ll discuss:
19
shape/size
20
babynames.size
2003810
babynames.shape
(400762, 5)
describe()
21
babynames.describe()
describe()
22
babynames["Sex"].describe()
sample()
If you want a DataFrame with a random selection of rows, you can use the sample() method.
23
babynames.sample(5).iloc[:, 2:]
babynames[babynames["Year"] == 2000]
.sample(4, replace=True)
.iloc[:, 2:]
babynames.sample()
value_counts()
The Series.value_counts method counts the number of occurrences of a each unique value in a Series.
24
babyname["Name"].value_counts()
unique()
The Series.unique method returns an array of every unique value in a Series.
25
babynames["Name"].unique()
sort_values()
The DataFrame.sort_values and Series.sort_values methods sort a DataFrame (or Series).
26
babynames["Name"].sort_values()
sort_values
The DataFrame.sort_values and Series.sort_values methods sort a DataFrame (or Series).
27
babynames.sort_values(by = "Count", ascending=False)
Custom Sorts
Lecture 03, Data 100 Spring 2023
28
Manipulating String Data
How we could find, for example, the top 5 most popular names in California in the year 2021?
29
babynames[babynames["Year"] == 2021]
.sort_values("Count", ascending=False)
Manipulating String Data
What if we wanted to find the longest names in California?
30
babynames.sort_values("Name", ascending=False)
Manipulating String Data
What if we wanted to find the longest names in California?
31
babynames.sort_values("Name", key=lambda x: x.str.len(), ascending=False)
.head()
Adding, Modifying, and Removing Columns
Lecture 03, Data 100 Spring 2023
32
Sorting By Length
Let’s try to solve the sorting problem with different approaches:
33
Approach 1: Create a Temporary Column
Intuition: Create a column equal to the length. Sort by that column.
34
Syntax for Column Addition
Adding a column is easy:
35
Can also do both steps on one line of code
# Create a Series of the length of each name
babyname_lengths = babynames["Name"].str.len()
# Add a column named "name_lengths" that includes the length of each name
babynames["name_lengths"] = babyname_lengths
Syntax for Column Addition
Sorting a table is as usual:
36
babynames = babynames.sort_values(by = "name_lengths", ascending=False)
Syntax for Dropping a Column (or Row)
After sorting, we can drop the temporary column.
37
babynames = babynames.drop("name_lengths", axis = "columns")
Sorting by Arbitrary Functions
Suppose we want to sort by the number of occurrences of “dr” + number of occurrences of “ea”.
38
def dr_ea_count(string):
return string.count('dr') + string.count('ea')
# Use `map` to apply `dr_ea_count` to each name in the "Name" column
babynames["dr_ea_count"] = babynames["Name"].map(dr_ea_count)
babynames = babynames.sort_values(by = "dr_ea_count", ascending=False)
Groupby.agg
Lecture 03, Data 100 Spring 2023
39
Visual Review of Data 8: Grouping and Collection
A
3
B
1
C
4
A
1
B
5
C
9
A
2
C
5
B
6
12
7
3
2
7
3
8
6
7
A
3
A
1
A
2
B
1
B
5
B
6
C
4
C
9
C
5
groupby
12
2
8
7
7
7
3
3
6
A
2
10
B
5
0
C
5
3
Can think of as temporary 3 sub-dataframes
.agg(f), where f = ammd
Condenses every sub-dataframe back into a single row.
groupby()
A groupby operation involves some combination of splitting the object, applying a function, and combining the results.
41
groupby.agg
A groupby operation involves some combination of splitting the object, applying a function, and combining the results.
42
Goal
Goal: Find the female baby name whose popularity has fallen the most.
43
female_babynames = babynames[babynames["Sex"] == "F"]
female_babynames = female_babynames.sort_values(["Year", "Count"])
jenn_counts_ser = female_babynames[female_babynames["Name"] == "Jennifer"]["Count"]
Number of Jennifers Born in California Per Year
Goal
Goal: Find the female baby name whose popularity has fallen the most.
Let’s start by defining what we mean by changed popularity.
Example for "Jennifer":
44
Calculating RTP
45
max_jenn = max(female_babynames[female_babynames["Name"] == "Jennifer"]["Count"])
6065
curr_jenn = female_babynames[female_babynames["Name"] == "Jennifer"]["Count"].iloc[-1]
91
rtp = curr_jenn / max_jenn
0.015004122011541632
def ratio_to_peak(series):
return series.iloc[-1] / max(series)
jenn_counts_ser = female_babynames[female_babynames["Name"] == "Jennifer"]["Count"]
ratio_to_peak(jenn_counts_ser)
0.015004122011541632
Approach 1: Getting RTP for Every Name The Hard Way
Approach 1: Hack something together using our existing Python knowledge.
The code above is extremely slow, and also way more complicated than the better approach coming next.
46
# Build dictionary where each entry is the rtp for a given name
# e.g. rtps["jennifer"] should be 0.015004122011541632
rtps = {}
for name in female_babynames["Name"].unique():
counts_of_current_name = female_babynames[female_babynames["Name"] == name]["Count"]
rtps[name] = ratio_to_peak(counts_of_current_name)
# Convert to series
rtps = pd.Series(rtps)
Approach 2: Using Groupby and Agg
The code below is the more idiomatic way of computing what we want.
47
rtp_table = female_babynames.groupby("Name").agg(ratio_to_peak)
# data8 Tables code, not pandas
female_babynames.group("Name", ratio_to_peak)
Nothing new here! This is just the group function that you saw in data8, where collect = ratio_to_peak.
Comparing the Two Approaches
As a reminder you should almost never be writing code in this class that includes loops or list comprehensions on Pandas series.
48
female_babynames.groupby("Name").agg(ratio_to_peak)
Approach 1: [BAD!!]
Approach 2:
Are there any rows for which Year is not 1.0?
ⓘ Start presenting to display the poll results on this slide.
Answer
Approach 2 generated two columns, Year and Count.
In the five rows shown, note the Year is 1.0 for every value.
Are there any rows for which Year is not 1.0?
50
rtp_table = (
female_babynames
.groupby("Name")
.agg(ratio_to_peak)
)
Note: This is a hard question! I originally tricked myself and thought the answer was B.
Note on Nuisance Columns
At least as of the time of this slide creation (January 2023), executing our agg call results in:
51
female_babynames.groupby("Name").agg(ratio_to_peak)
For more details, see Pandas 1.3 release notes.
Note on Nuisance Columns
At least as of the time of this slide creation (January 2023), executing our agg call results in:
\\\\\
At some point in the future, this code will simply crash!
52
female_babynames.groupby("Name").agg(ratio_to_peak)
Note on Nuisance Columns
Below, we explicitly select the columns BEFORE calling agg to avoid the warning.
53
rtp_table = female_babynames.groupby("Name")[["Count"]].agg(ratio_to_peak)
Renaming Columns
The code below renames the Count column to "Count RTP".
54
rtp_table = female_babynames.groupby("Name")[["Count"]].agg(ratio_to_peak)
rtp_table = rtp_table.rename(columns = {"Count": "Count RTP"})
This syntax will be covered in lab 2 in slightly more detail.
Some Data Science Payoff
By sorting rtp_table we can see the names whose popularity has decreased the most.
55
rtp_table.sort_values("Count RTP")
Some Data Science Payoff
By sorting rtp_table we can see the names whose popularity has decreased the most.
56
rtp_table.sort_values("Count RTP")
px.line(female_babynames[female_babyname["Name"] == "Debra")],
x = "Year", y = "Count")
Some Data Science Payoff
We can get the list of the top 10 names and then plot popularity with::
57
top10 = rtp_table.sort_values("Count RTP").head(10).index
px.line(female_babynames[female_babyname["Name"].isin(top10)],
x = "Year", y = "Count", color = "Name")
Some groupby.agg Puzzles
Lecture 03, Data 100 Spring 2023
58
Groupby Puzzle #1
Before we saw that the code below generates the Count RTP for all female names.
59
female_babynames.groupby("Name")[["Count"]].agg(ratio_to_peak)
Write a groupby.agg call that returns the total number of babies with each name. The image here shows the example of calculating ratio_to_peak.
ⓘ Start presenting to display the poll results on this slide.
Answer
Before we saw that the code below generates the Count RTP for all female names.
Write a groupby.agg call that returns the total number of babies with each name.
61
female_babynames.groupby("Name")[["Count"]].agg(ratio_to_peak)
female_babynames.groupby("Name")[["Count"]].agg(sum)
Write a groupby.agg call that returns the total number of babies born every year. The image here shows the example of calculating ratio_to_peak.
ⓘ Start presenting to display the poll results on this slide.
Answer
Before we saw that the code below generates the total number of babies with each name.
Write a groupby.agg call that returns the total babies born in every year:
63
female_babynames.groupby("Name")[["Count"]].agg(sum)
female_babynames.groupby("Year")[["Count"]].agg(sum)
Shorthand groupby Methods
Pandas also provides a number of shorthand functions that you can use in place of agg.
Instead, we could have simply written:
64
female_babynames.groupby("Name")[["Count"]].agg(sum)
female_babynames.groupby("Name")[["Count"]].sum()
For more examples (first, last, mean, median, etc.) see the left sidebar on: https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.sum.html
Plotting Birth Counts
Plotting the DataFrame we just generated tells an interesting story.
65
puzzle2 = female_babynames.groupby("Year")[["Count"]].agg(sum)
px.line(puzzle2, y = "Count")
A Word of Warning!
We made an enormous assumption when we decided to use this dataset to estimate the birth rate.
66
From Lecture 1: Exploratory Data Analysis and Visualization
Bottom line: Blindly using tools is dangerous!
67
Ask a Question
Obtain Data
Understand the Data
Understand the World
Reports, Decisions, and Solutions
From Lecture 1: Exploratory Data Analysis and Visualization
What are the biases, anomalies, or other issues with the data?
68
Ask a Question
Obtain Data
Understand the Data
Understand the World
Reports, Decisions, and Solutions
A fun little data science personal project?
69
Pandas, Part II
Content credit: Narges Norouzi, Lisa Yan, Josh Hug
70
LECTURE 3