1 of 53

Join at slido.com�#1227261

Click Present with Slido or install our Chrome extension to display joining instructions for participants while presenting.

1227261

2 of 53

Pandas, Part II

More on pandas (Inspection, Modification, and Sorting)

Data 100/Data 200, Spring 2026 @ UC Berkeley

Josh Grossman, Ramesh Sridharan

Content credit: Acknowledgments

2

LECTURE 3

1227261

3 of 53

📣 Announcements

Lab 1 due tonight! (Remember slip days cannot be used for lab assignment)

Homework 1 (coding and math portion) due this Friday (Jan. 30th)

Discussions and Office Hours start this week

Clarification: Mega Section is NOT the same as Data 8

  • Mega Section is just a lecture-style discussion → Same material as other discussions!
  • If you're assigned to Mega Section → Must attend Mega Section for attendance credit.
  • If you’re assigned to a different section → Can substitute Mega Section for attendance credit.
  • Ungraded discussion → You can optionally attend Mega Section just to learn.

�Ramesh's OH: Mondays @ 10am-12pm in Evans (room TBD)

  • Makeup OH this week only: Thursday 2-3:30 in Evans 455

Josh’s OH: Wednesdays @ 1:30pm-3:30pm in Evans 422

DSP Students: Make sure your accommodation letters are submitted via the DSP portal

3

1227261

4 of 53

Recap: extraction with .loc[], .iloc[], and []

We learned three ways of extracting data: .loc[], .iloc[], and indexing with []

df.iloc[my_row_numbers, my_col_numbers]

df.loc[my_row_labels, my_col_labels]

df[my_single_label]

df[my_label_list]

df[my_slice]

df[my_boolean_mask]

4

Indexing with [] only takes one argument, and acts differently depending on what it is:

Each can be either: a single label, a list of labels, a slice*, or a boolean mask.

Each can be either: a single integer, a list of integers, or a slice.

Treated as column labels

Treated as row labels

1227261

5 of 53

Context-dependent extraction with [ ]: Which of the following statements correctly returns the value "blue fish" from the "weird" DataFrame?

Presenting with animations, GIFs or speaker notes? Enable our Chrome extension

1227261

6 of 53

Slido solution

Slido solution is in the lecture demo code!

6

1227261

7 of 53

Recap: Boolean Masks with .loc and [ ]

How do we extract rows that satisfy a condition (e.g., rows where column X > 5)?

  • .loc and [ ] accept boolean Series (and arrays) as input.
  • Rows corresponding to True are extracted; rows corresponding to False are not.
  • We can combine multiple conditions using & and |, but be careful with parentheses!

7

elections[(

((elections["Year"] > 2000) |

(elections["Year"] < 1861)) &

(elections["Result"] == "win") &

(elections["%"] >= 55)

)]

Length 187 Series where every entry is either "True" or "False", where "True" occurs for every candidate who won more than 55% of the popular vote in a year

1227261

8 of 53

Which of the following pandas statements returns a DataFrame of the first 3 baby names from California in 2006? (Talk and work with your neighbor!)

Presenting with animations, GIFs or speaker notes? Enable our Chrome extension

1227261

9 of 53

Slido solution

Slido solution is in the lecture demo code!

9

1227261

10 of 53

Boolean Array Input

We can also use .loc.

10

Length 407428 Series where every entry is either "True" or "False", where "True" occurs for every babyname with "Sex" == "F".

babynames.loc[babynames["Sex"] == "F", :]

Length 239537 DataFrame where every entry belongs to a babyname with "Sex" == "F".

1227261

11 of 53

Boolean Array Input

Boolean arrays can be generated by using logical operators on Series.

11

(babynames["Sex"] == "F")

Length 407428 Series where every entry is either "True" or "False", where "True" occurs for every babyname with "Sex" == "F".

babynames["Sex"]

1227261

12 of 53

Boolean Array Input

We can use our new boolean array to filter to desired rows. Boolean mask! 🤿

12

Length 407428 Series where every entry is either "True" or "False", where "True" occurs for every babyname with "Sex" == "F".

babynames[(babynames["Sex"] == "F")]

Length 239537 DataFrame where every entry belongs to a babyname with "Sex" == "F".

1227261

13 of 53

Boolean Array Input

Boolean Series can be combined element-wise using boolean operators.

  • The & operator applies boolean_series_1 AND boolean_series_2
  • The | operator applies boolean_series_1 OR boolean_series_2

13

babynames[(babynames["Sex"] == "F") | (babynames["Year"] < 2000)]

Rows with Sex=="F" OR before year 2000 (or both!)

1227261

14 of 53

Bitwise Operators

& and | are examples of bitwise operators. They allow us to apply multiple logical conditions.

If p and q are boolean arrays or Series:

14

Symbol

Usage

Meaning

~

~p

NOT p

|

p | q

p OR q

&

p & q

p AND q

^

p ^ q

p XOR q

p XOR q: Either p, or q, but not both. Exclusive OR!

1227261

15 of 53

More Sophisticated Direct Boolean Array Selection

Boolean array selection is useful, but can lead to overly verbose code.

15

babynames[(babynames["Name"] == "Bella") |

(babynames["Name"] == "Alex") |

(babynames["Name"] == "Narges") |

(babynames["Name"] == "Lisa")]

pandas provides many alternatives, for example:

  • .isin
  • .str.startswith
  • .groupby.filter (we’ll see this in Lecture 4)

1227261

16 of 53

More Sophisticated Boolean Array Selection

pandas provides many useful functions for making boolean masks, for example:

  • .isin
  • .str.startswith
  • .groupby.filter (see Lecture 4)

16

names = ["Bella", "Alex", "Narges", "Lisa"]

babynames[babynames["Name"].isin(names)]

Returns a Boolean Series that is True when the corresponding name in babynames is Bella, Alex, Narges, or Lisa.

1227261

17 of 53

What's the most interesting question you want to answer using the babynames table?

Presenting with animations, GIFs or speaker notes? Enable our Chrome extension

1227261

18 of 53

How Common are the Most Common Names?

18

1227261

19 of 53

Election Data: Your Questions (A Representative Sampling)

19

Most common questions:

  • Which party wins the most often?
  • Connecting popular vote % with win/loss (e.g., winning popular vote + losing election)

Other interesting themes

  • Winning/losing streaks
  • Candidates changing parties
  • Role and strength of third parties over time
  • Candidates running multiple times

Interesting questions requiring outside data/information

  • Connection with voting rights legislation
  • Candidate level of education
  • How parties develop, separate, etc.

1227261

20 of 53

Agenda

Lecture 03, Data 100 Spring 2026

  • DataFrame inspection
  • Adding, removing, and modifying columns
  • Sorting

20

1227261

21 of 53

DataFrame Inspection

lec03.ipynb

21

Demo Slides

1227261

22 of 53

Retrieving the Index, Columns, and shape

To extract row labels, use DataFrame.index:

To extract column labels, use DataFrame.columns:

For dimensions use DataFrame.shape:

22

elections.columns

elections.shape

elections = pd.read_csv("data/elections.csv", index_col = "Candidate")

elections.index

1227261

23 of 53

.shape and .size

23

  • The .shape attribute provides the dimensions as a tuple (# rows, # columns).
  • The .size attribute provides the total # of entries → # rows multiplied by # columns.

babynames

babynames.shape

(407428, 5)

babynames.size

2037140

407,428 * 5 = 2,037,140

1227261

24 of 53

NumPy

Pandas Series and DataFrames support many operations, including NumPy operations, so long as the data is numeric. Data 8 NumPy reference.

24

yash_count = babynames.loc[babynames["Name"]=="Yash", "Count"]

np.mean(yash_count)

17.142857142857142

np.max(yash_count)

29

Pro Tip: Just because Python shows you lots of decimal places, that doesn’t mean you should report all of them! Only use as much precision as needed. In this example, you should probably use 17 or 17.1.

1227261

25 of 53

Built-In pandas Methods

pandas also provides an enormous number of useful utility functions, including:

  • size/shape
  • describe
  • sample
  • value_counts
  • unique
  • sort_values
  • Among many others!

If you want to manipulate data in some way, there is probably a pandas function that does what you want. Explore the documentation, Google, or ask a large language model (LLM)!

25

1227261

26 of 53

.describe()

26

  • The .describe() method returns summary statistics of a DataFrame or Series.

babynames

babynames.describe()

Note: Only the numeric columns are summarized!

1227261

27 of 53

.describe()

  • A different set of statistics is reported if .describe() is called on a Series.

27

babynames["Sex"].describe()

1227261

28 of 53

.value_counts()

The Series.value_counts() method returns the # of times each unique value appears.

  • Return value is a sorted Series.

28

babyname["Name"].value_counts()

1227261

29 of 53

.sample()

To randomly sample rows from a DataFrame, use the .sample() method.

  • By default, pandas samples without replacement. Use replace=True for replacement.
    • Recall that "sampling with replacement" means each row can be sampled more than one time.

29

Pro Tip: The top of a DataFrame is often not representative of the entire DataFrame. For this reason, I tend to prefer df.sample() over df.head() when exploring data.

Recall that "sampling with replacement" means that each row can be sampled more than one time.

babynames.sample()

n=1 by default

1227261

30 of 53

Chaining methods

pandas methods can be chained together:

30

babynames.sample(5).iloc[:, 2:]

result = (

babynames[babynames["Year"]==2000]

.sample(4, replace=True)

.iloc[:, 2:]

)

Multi-line chaining requires parentheses.

1227261

31 of 53

.unique()

The Series.unique method returns an array of every unique value in a Series.

31

babynames["Name"].unique()

1227261

32 of 53

Which of the following are valid interpretations of the pandas output here (right)?

Presenting with animations, GIFs or speaker notes? Enable our Chrome extension

1227261

33 of 53

Interlude

2-min stretch break!

33

1227261

34 of 53

Adding, Removing, and Modifying Columns

  • DataFrame inspection
  • Adding, removing, and modifying columns
  • Sorting

34

Lecture 03, Data 100 Spring 2026

1227261

35 of 53

Adding, Removing, & Modifying Columns

lec03.ipynb

35

Demo Slides

1227261

36 of 53

Syntax for Adding a Column

36

babynames["name_lengths"] = babynames["Name"].str.len()

1. Series with the length of each name

2. Assign Series to a new column called "name_lengths"

Making a new column with the number of characters in each name:

1227261

37 of 53

Syntax for Modifying a Column

37

babynames["name_lengths"] = babynames["name_lengths"] - 1

Modify the name_lengths column to be one less than its original value:

Pro tip: if you only want to modify some rows (not all) of a column, make sure you use .loc on the LHS!

babynames.loc[:, "name_lengths"] = babynames["name_lengths"] - 1

1227261

38 of 53

Syntax for Renaming a Column

Rename a column using the .rename() method.

38

# Rename "name_lengths" to "Length"

babynames = babynames.rename(columns={"name_lengths":"Length"})

Common typo: Flipping the keys and values in the dictionary! Forgive yourself ♥️

1227261

39 of 53

Syntax for Dropping a Column (or Row)

Remove columns using the .drop method.

  • The .drop() method drops rows by default. Use axis="columns" to drop columns.

39

babynames = babynames.drop("Length", axis="columns")

👋

1227261

40 of 53

An Important Note: DataFrame Copies

On the previous slide, we re-assigned babynames to an updated DataFrame.

40

babynames = babynames.drop("Length", axis="columns")

By default, pandas creates a new copy of the DataFrame, without changing the original DataFrame.

babynames.drop("Length", axis="columns")

babynames

Our change was not applied!

1227261

41 of 53

Which of the following returns 'b'? (Talk and work w/ your neighbor!)

Presenting with animations, GIFs or speaker notes? Enable our Chrome extension

1227261

42 of 53

Slido solution

Slido solution is in the lecture demo code!

42

1227261

43 of 53

Agenda

Lecture 03, Data 100 Spring 2026

  • DataFrame inspection
  • Adding, removing, and modifying columns
  • Sorting

43

1227261

44 of 53

Sorting

lec03.ipynb

44

Demo Slides

1227261

45 of 53

.sort_values()

The.sort_values() method sorts a DataFrame or Series.

  • DataFrame.sort_values(column_name) → Must specify column for sorting!

45

babynames.sort_values(by="Count", ascending=False)

By default, rows are sorted in ascending order.

Common typo: Forgetting if ascending/descending is the default. Solution? Always be explicit!

babynames["Name"].sort_values()

Notice the index changes order too!

1227261

46 of 53

Which of the following extracts the rows of a DataFrame df where the values in column A are at least as big as the smallest value in column B? (Talk and work w/ your neighbor!)

Presenting with animations, GIFs or speaker notes? Enable our Chrome extension

1227261

47 of 53

Slido solution

Slido solution is in the lecture demo code!

47

1227261

48 of 53

Sorting By Length

Suppose we want to sort entries based on the length of each name.

  • Familiar approach:
    • Create a new column with the length of each name and then sort on it.

48

1227261

49 of 53

Approach 1: Create a New Column and Sort Based on the New Column

49

babynames["name_lengths"] = babynames["Name"].str.len()

babynames = babynames.sort_values(by="name_lengths", ascending=False)

babynames.head(5)

1227261

50 of 53

Approach 2: Sorting Using the key Argument

50

babynames.sort_values("Name", key=lambda x: x.str.len(), ascending=False).head()

Create a temporary column with the length of each name, sort on it, then drop it.

Recall that lambda x: x.str.len() is an anonymous function (i.e., not given a name, temporary, one-time use). This anonymous function is applied to the sorting column(s).

1227261

51 of 53

Approach 3: Sorting Using the Series.map method

Using the Series.map method to sort by the number of occurrences of "dr" and "ea"s:

51

# Returns number of times 'dr' and 'ea' appear in `string`

def dr_ea_count(string):

return string.count('dr') + string.count('ea')

# Apply dr_ea_count to each name in the "Name" column

babynames["dr_ea_count"] = babynames["Name"].map(dr_ea_count)

babynames.sort_values(by="dr_ea_count", ascending=False).head()

1227261

52 of 53

A taste of next lecture! Grouping diagram from Data 8

1227261

53 of 53

Pandas, Part II

Data 100/Data 200, Spring 2026 @ UC Berkeley

Josh Grossman, Ramesh Sridharan

Content credit: Acknowledgments

53

LECTURE 3

1227261