1 of 70

Pandas, Part II

More on Pandas (Utility Functions, Grouping, Aggregation)

Data 100/Data 200, Spring 2023 @ UC Berkeley

Narges Norouzi and Lisa Yan

Content credit: Narges Norouzi, Lisa Yan, Josh Hug

1

LECTURE 3

2 of 70

Join at slido.com�#2647893

Start presenting to display the joining instructions on this slide.

3 of 70

New Syntax / Concept Summary

Today we’ll cover:

  • Conditional selection.
  • Handy utility functions.
  • Sorting with a custom key.
  • Creating and dropping columns.
  • Groupby: Output of .groupby(“Name”) is a DataFrameGroupBy object. Condense back into a DataFrame or Series with:
    • groupby.agg

3

4 of 70

More on Conditional Selection

Lecture 03, Data 100 Spring 2023

  • Conditional Selection
  • Handy Utility Functions
  • Custom Sorts
  • Adding, Modifying, and Removing Columns
  • Groupby.agg
  • Some groupby.agg Puzzles

4

5 of 70

Boolean Array Input

Yet another input type supported by loc and [] is the boolean array.

5

babynames_first_10_rows = babynames.loc[:9, :]

babynames_first_10_rows[[True, False, True, False, True, False, True, False, True, False]]

6 of 70

Boolean Array Input

We can perform the same operation using loc.

6

babynames_first_10_rows = babynames.loc[:9, :]

babynames_first_10_rows.loc[[True, False, True, False, True, False, True, False, True, False], :]

7 of 70

Boolean Array Input

Useful because boolean arrays can be generated by using logical operators on Series.

7

Length 400761 Series where every entry is either “True” or “False”, where “True” occurs for every babyname with "Sex" = "F".

logical_operator = (babynames["Sex"] == "F")

True in rows 0, 1, 2, …

8 of 70

Boolean Array Input

Useful because boolean arrays can be generated by using logical operators on Series.

8

Length 400761 Series where every entry is either “True” or “False”, where “True” occurs for every babyname with "Sex" = "F".

babynames[babynames["Sex"] == "F"]

Length 235791 Series where every entry belongs to a babyname with "Sex" = "F"

9 of 70

Boolean Array Input

Can also use .loc.

9

9

Length 400761 Series where every entry is either “True” or “False”, where “True” occurs for every babyname with "Sex" = "F".

babynames.loc[babynames["Sex"] == "F"]

Length 235791 Series where every entry belongs to a babyname with "Sex" = "F"

10 of 70

Boolean Array Input

Boolean Series can be combined using various operators, allowing filtering of results by multiple criteria.

  • Example: The & operator.
  • Lab 2 covers more such operators.

10

babynames[(babynames["Sex"] == "F") & (babynames["Year"] < 2000)]

11 of 70

Which of the following pandas statements returns a DataFrame of the first 3 baby names with Count > 250.

Start presenting to display the poll results on this slide.

12 of 70

Answer

Which of the following pandas statements returns a DataFrame of the first 3 baby names with Count > 250.

babynames.iloc[[0, 233, 484], [3, 4]]

babynames.loc[[0, 233, 484]]

babynames.loc[babynames["Count"] > 250, ["Name", "Count"]].head(3)

babynames.loc[babynames["Count"] > 250, ["Name", "Count"]].iloc[0:2, :]

12

13 of 70

Note on Exam Problems

Q: Are you going to put horrible problems like these on the exam?

�A: Technically such problems would be in scope, but it’s very unlikely they’ll be this nitpicky.

13

babynames.loc[babynames["Count"] > 250, ["Name", "Count"]].iloc[0:2, :]

14 of 70

Alternatives to Boolean Array Selection

Boolean array selection is a useful tool, but can lead to overly verbose code for complex conditions.

Pandas provides many alternatives, for example:

  • .isin
  • .str.startswith
  • .groupby.filter (see lecture 4)

14

babynames[(babynames["Name"] == "Bella") |

(babynames["Name"] == "Alex") |

(babynames["Name"] == "Ani") |

(babynames["Name"] == "Lisa")]

15 of 70

Alternatives to Boolean Array Selection

Pandas provides many alternatives, for example:

  • .isin
  • .str.startswith
  • .groupby.filter (see lecture 4)

15

names = ["Bella", "Alex", "Ani", "Lisa"]

babynames[babynames["Name"].isin(names)]

16 of 70

Alternatives to Boolean Array Selection

Pandas provides many alternatives, for example:

  • .isin
  • .str.startswith
  • .groupby.filter (see lecture 4)

16

babynames[babynames["Name"].str.startswith("N")]

17 of 70

Handy Utility Functions

Lecture 03, Data 100 Spring 2023

  • More on Conditional Selection
  • Handy Utility Functions
  • Custom Sorts
  • Adding, Modifying, and Removing Columns
  • Groupby.agg
  • Some groupby.agg Puzzles

17

18 of 70

Numpy

Pandas Series and DataFrames support a large number of operations, including mathematical operations, so long as the data is numerical.

18

bella_count = babynames[babynames["Name"] == "Bella"]["Count"]

np.mean(bella_counts)

270.1860465116279

max(bella_counts)

902

19 of 70

Pandas

In addition to its rich syntax for indexing and support for other libraries (numpy, built-in functions), Pandas provides an enormous number of useful utility functions. Today, we’ll discuss:

  • size/shape
  • describe
  • sample
  • value_counts
  • uniques
  • sort_values

19

20 of 70

shape/size

20

babynames.size

2003810

babynames.shape

(400762, 5)

21 of 70

describe()

21

babynames.describe()

22 of 70

describe()

  • A different set of statistics will be reported if .describe() is called on a Series.

22

babynames["Sex"].describe()

23 of 70

sample()

If you want a DataFrame with a random selection of rows, you can use the sample() method.

  • By default, it is without replacement. Use replace=True for replacement.
  • Naturally, can be chained with other methods and operators (iloc, etc).

23

babynames.sample(5).iloc[:, 2:]

babynames[babynames["Year"] == 2000]

.sample(4, replace=True)

.iloc[:, 2:]

babynames.sample()

24 of 70

value_counts()

The Series.value_counts method counts the number of occurrences of a each unique value in a Series.

  • Return value is also a Series.

24

babyname["Name"].value_counts()

25 of 70

unique()

The Series.unique method returns an array of every unique value in a Series.

25

babynames["Name"].unique()

26 of 70

sort_values()

The DataFrame.sort_values and Series.sort_values methods sort a DataFrame (or Series).

26

babynames["Name"].sort_values()

27 of 70

sort_values

The DataFrame.sort_values and Series.sort_values methods sort a DataFrame (or Series).

  • The DataFrame version requires an argument specifying the column on which to sort.

27

babynames.sort_values(by = "Count", ascending=False)

28 of 70

Custom Sorts

Lecture 03, Data 100 Spring 2023

  • More on Conditional Selection
  • Handy Utility Functions
  • Custom Sorts
  • Adding, Modifying, and Removing Columns
  • Groupby.agg
  • Some groupby.agg Puzzles

28

29 of 70

Manipulating String Data

How we could find, for example, the top 5 most popular names in California in the year 2021?

29

babynames[babynames["Year"] == 2021]

.sort_values("Count", ascending=False)

30 of 70

Manipulating String Data

What if we wanted to find the longest names in California?

  • Just sorting by name won’t work!

30

babynames.sort_values("Name", ascending=False)

31 of 70

Manipulating String Data

What if we wanted to find the longest names in California?

31

babynames.sort_values("Name", key=lambda x: x.str.len(), ascending=False)

.head()

32 of 70

Adding, Modifying, and Removing Columns

Lecture 03, Data 100 Spring 2023

  • More on Conditional Selection
  • Handy Utility Functions
  • Custom Sorts
  • Adding, Modifying, and Removing Columns
  • Groupby.agg
  • Some groupby.agg Puzzles

32

33 of 70

Sorting By Length

Let’s try to solve the sorting problem with different approaches:

  • We will create a temporary column, then sort on it.

33

34 of 70

Approach 1: Create a Temporary Column

Intuition: Create a column equal to the length. Sort by that column.

34

35 of 70

Syntax for Column Addition

Adding a column is easy:

35

Can also do both steps on one line of code

# Create a Series of the length of each name

babyname_lengths = babynames["Name"].str.len()

# Add a column named "name_lengths" that includes the length of each name

babynames["name_lengths"] = babyname_lengths

36 of 70

Syntax for Column Addition

Sorting a table is as usual:

36

babynames = babynames.sort_values(by = "name_lengths", ascending=False)

37 of 70

Syntax for Dropping a Column (or Row)

After sorting, we can drop the temporary column.

  • The drop() method assumes you’re dropping a row by default. Use axis = "columns" to drop a column instead.

37

babynames = babynames.drop("name_lengths", axis = "columns")

38 of 70

Sorting by Arbitrary Functions

Suppose we want to sort by the number of occurrences of “dr” + number of occurrences of “ea”.

  • Use the Series.map method.

38

def dr_ea_count(string):

return string.count('dr') + string.count('ea')

# Use `map` to apply `dr_ea_count` to each name in the "Name" column

babynames["dr_ea_count"] = babynames["Name"].map(dr_ea_count)

babynames = babynames.sort_values(by = "dr_ea_count", ascending=False)

39 of 70

Groupby.agg

Lecture 03, Data 100 Spring 2023

  • More on Conditional Selection
  • Handy Utility Functions
  • Custom Sorts
  • Groupby.agg
  • Some groupby.agg Puzzles

39

40 of 70

Visual Review of Data 8: Grouping and Collection

A

3

B

1

C

4

A

1

B

5

C

9

A

2

C

5

B

6

12

7

3

2

7

3

8

6

7

A

3

A

1

A

2

B

1

B

5

B

6

C

4

C

9

C

5

groupby

12

2

8

7

7

7

3

3

6

A

2

10

B

5

0

C

5

3

Can think of as temporary 3 sub-dataframes

.agg(f), where f = ammd

Condenses every sub-dataframe back into a single row.

41 of 70

groupby()

A groupby operation involves some combination of splitting the object, applying a function, and combining the results.

  • Calling .groupby() generates DataFrameGroupBy objects → "mini" sub-DataFrames
  • Each subframe contains all rows that correspond to a particular year

41

42 of 70

groupby.agg

A groupby operation involves some combination of splitting the object, applying a function, and combining the results.

  • Calling .groupby() generates DataFrameGroupBy objects → "mini" sub-DataFrames
  • Each subframe contains all rows that correspond to a particular year
  • Since we can't work directly with DataFrameGroupBy objects, we will use aggregation methods to summarize each DataFrameGroupBy object into one aggregated row per subframe.

42

43 of 70

Goal

Goal: Find the female baby name whose popularity has fallen the most.

43

female_babynames = babynames[babynames["Sex"] == "F"]

female_babynames = female_babynames.sort_values(["Year", "Count"])

jenn_counts_ser = female_babynames[female_babynames["Name"] == "Jennifer"]["Count"]

Number of Jennifers Born in California Per Year

44 of 70

Goal

Goal: Find the female baby name whose popularity has fallen the most.

Let’s start by defining what we mean by changed popularity.

  • In lecture, let’s define the "ratio to peak" or RTP as the ratio of babies born with a given name today to the maximum number of the name born in a single year.

Example for "Jennifer":

  • In 1972, we hit peak Jennifer. 6,065 Jennifers were born.
  • In 2021, there were only 91 Jennifers.
  • RTP is 91 / 6065 = 0.015004.

44

45 of 70

Calculating RTP

45

max_jenn = max(female_babynames[female_babynames["Name"] == "Jennifer"]["Count"])

6065

curr_jenn = female_babynames[female_babynames["Name"] == "Jennifer"]["Count"].iloc[-1]

91

rtp = curr_jenn / max_jenn

0.015004122011541632

def ratio_to_peak(series):

return series.iloc[-1] / max(series)

jenn_counts_ser = female_babynames[female_babynames["Name"] == "Jennifer"]["Count"]

ratio_to_peak(jenn_counts_ser)

0.015004122011541632

46 of 70

Approach 1: Getting RTP for Every Name The Hard Way

Approach 1: Hack something together using our existing Python knowledge.

The code above is extremely slow, and also way more complicated than the better approach coming next.

46

# Build dictionary where each entry is the rtp for a given name

# e.g. rtps["jennifer"] should be 0.015004122011541632

rtps = {}

for name in female_babynames["Name"].unique():

counts_of_current_name = female_babynames[female_babynames["Name"] == name]["Count"]

rtps[name] = ratio_to_peak(counts_of_current_name)

# Convert to series

rtps = pd.Series(rtps)

47 of 70

Approach 2: Using Groupby and Agg

The code below is the more idiomatic way of computing what we want.

  • Much simpler, much faster, much more versatile.

47

rtp_table = female_babynames.groupby("Name").agg(ratio_to_peak)

# data8 Tables code, not pandas

female_babynames.group("Name", ratio_to_peak)

Nothing new here! This is just the group function that you saw in data8, where collect = ratio_to_peak.

48 of 70

Comparing the Two Approaches

As a reminder you should almost never be writing code in this class that includes loops or list comprehensions on Pandas series.

  • Use the pandas API as intended!

48

female_babynames.groupby("Name").agg(ratio_to_peak)

Approach 1: [BAD!!]

Approach 2:

49 of 70

Are there any rows for which Year is not 1.0?

Start presenting to display the poll results on this slide.

50 of 70

Answer

Approach 2 generated two columns, Year and Count.

In the five rows shown, note the Year is 1.0 for every value.

Are there any rows for which Year is not 1.0?

  1. Yes, names that appeared for the first time in 2021.
  2. Yes, names that did not appear in 2021.
  3. Yes, names whose peak Count was in 2021.
  4. No, every row has a Year value of 1.0.

50

rtp_table = (

female_babynames

.groupby("Name")

.agg(ratio_to_peak)

)

Note: This is a hard question! I originally tricked myself and thought the answer was B.

51 of 70

Note on Nuisance Columns

At least as of the time of this slide creation (January 2023), executing our agg call results in:

51

female_babynames.groupby("Name").agg(ratio_to_peak)

For more details, see Pandas 1.3 release notes.

52 of 70

Note on Nuisance Columns

At least as of the time of this slide creation (January 2023), executing our agg call results in:

\\\\\

At some point in the future, this code will simply crash!

  • Presumably, the designers of pandas felt like automatically dropping nuisance columns leads to bad coding practices.
  • And in line with the Zen of Python: "Explicit is better than implicit."

52

female_babynames.groupby("Name").agg(ratio_to_peak)

53 of 70

Note on Nuisance Columns

Below, we explicitly select the columns BEFORE calling agg to avoid the warning.

53

rtp_table = female_babynames.groupby("Name")[["Count"]].agg(ratio_to_peak)

54 of 70

Renaming Columns

The code below renames the Count column to "Count RTP".

54

rtp_table = female_babynames.groupby("Name")[["Count"]].agg(ratio_to_peak)

rtp_table = rtp_table.rename(columns = {"Count": "Count RTP"})

This syntax will be covered in lab 2 in slightly more detail.

55 of 70

Some Data Science Payoff

By sorting rtp_table we can see the names whose popularity has decreased the most.

55

rtp_table.sort_values("Count RTP")

56 of 70

Some Data Science Payoff

By sorting rtp_table we can see the names whose popularity has decreased the most.

56

rtp_table.sort_values("Count RTP")

px.line(female_babynames[female_babyname["Name"] == "Debra")],

x = "Year", y = "Count")

57 of 70

Some Data Science Payoff

We can get the list of the top 10 names and then plot popularity with::

57

top10 = rtp_table.sort_values("Count RTP").head(10).index

px.line(female_babynames[female_babyname["Name"].isin(top10)],

x = "Year", y = "Count", color = "Name")

58 of 70

Some groupby.agg Puzzles

Lecture 03, Data 100 Spring 2023

  • More on Conditional Selection
  • Handy Utility Functions
  • Custom Sorts
  • Groupby.agg
  • Some groupby.agg Puzzles

58

59 of 70

Groupby Puzzle #1

Before we saw that the code below generates the Count RTP for all female names.

59

female_babynames.groupby("Name")[["Count"]].agg(ratio_to_peak)

60 of 70

Write a groupby.agg call that returns the total number of babies with each name. The image here shows the example of calculating ratio_to_peak.

Start presenting to display the poll results on this slide.

61 of 70

Answer

Before we saw that the code below generates the Count RTP for all female names.

Write a groupby.agg call that returns the total number of babies with each name.

61

female_babynames.groupby("Name")[["Count"]].agg(ratio_to_peak)

female_babynames.groupby("Name")[["Count"]].agg(sum)

62 of 70

Write a groupby.agg call that returns the total number of babies born every year. The image here shows the example of calculating ratio_to_peak.

Start presenting to display the poll results on this slide.

63 of 70

Answer

Before we saw that the code below generates the total number of babies with each name.

Write a groupby.agg call that returns the total babies born in every year:

63

female_babynames.groupby("Name")[["Count"]].agg(sum)

female_babynames.groupby("Year")[["Count"]].agg(sum)

64 of 70

Shorthand groupby Methods

Pandas also provides a number of shorthand functions that you can use in place of agg.

Instead, we could have simply written:

64

female_babynames.groupby("Name")[["Count"]].agg(sum)

female_babynames.groupby("Name")[["Count"]].sum()

For more examples (first, last, mean, median, etc.) see the left sidebar on: https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.sum.html

65 of 70

Plotting Birth Counts

Plotting the DataFrame we just generated tells an interesting story.

65

puzzle2 = female_babynames.groupby("Year")[["Count"]].agg(sum)

px.line(puzzle2, y = "Count")

66 of 70

A Word of Warning!

We made an enormous assumption when we decided to use this dataset to estimate the birth rate.

  • According to https://lao.ca.gov/LAOEconTax/Article/Detail/691, the true number of babies born in California in 2020 was 421,275.
  • What happened?

66

67 of 70

From Lecture 1: Exploratory Data Analysis and Visualization

  • How is our data organized and what does it contain?
  • Do we already have relevant data?
  • What are the biases, anomalies, or other issues with the data?
  • How do we transform the data to enable effective analysis?

Bottom line: Blindly using tools is dangerous!

67

Ask a Question

Obtain Data

Understand the Data

Understand the World

Reports, Decisions, and Solutions

68 of 70

From Lecture 1: Exploratory Data Analysis and Visualization

What are the biases, anomalies, or other issues with the data?

  • We only used names for babies who are female at birth.
  • Not all babies register for social security.
  • The database does not include names of popularity less than 5 per year (for example both of my kids).

68

Ask a Question

Obtain Data

Understand the Data

Understand the World

Reports, Decisions, and Solutions

69 of 70

A fun little data science personal project?

  • Are there enough unique baby names in recent years to skew these data significantly?
  • How would you test that?

69

70 of 70

Pandas, Part II

Content credit: Narges Norouzi, Lisa Yan, Josh Hug

70

LECTURE 3