Join at slido.com�#1227261
ⓘ
Click Present with Slido or install our Chrome extension to display joining instructions for participants while presenting.
1227261
Pandas, Part II
More on pandas (Inspection, Modification, and Sorting)
Data 100/Data 200, Spring 2026 @ UC Berkeley
Josh Grossman, Ramesh Sridharan
Content credit: Acknowledgments
2
LECTURE 3
1227261
📣 Announcements
Lab 1 due tonight! (Remember slip days cannot be used for lab assignment)
Homework 1 (coding and math portion) due this Friday (Jan. 30th)
Discussions and Office Hours start this week
Clarification: Mega Section is NOT the same as Data 8
�Ramesh's OH: Mondays @ 10am-12pm in Evans (room TBD)
Josh’s OH: Wednesdays @ 1:30pm-3:30pm in Evans 422
DSP Students: Make sure your accommodation letters are submitted via the DSP portal
3
1227261
Recap: extraction with .loc[], .iloc[], and []
We learned three ways of extracting data: .loc[], .iloc[], and indexing with []
df.iloc[my_row_numbers, my_col_numbers]
df.loc[my_row_labels, my_col_labels]
df[my_single_label]
df[my_label_list]
df[my_slice]
df[my_boolean_mask]
4
Indexing with [] only takes one argument, and acts differently depending on what it is:
Each can be either: a single label, a list of labels, a slice*, or a boolean mask.
Each can be either: a single integer, a list of integers, or a slice.
Treated as column labels
Treated as row labels
1227261
Do not edit
Context-dependent extraction with [ ]: Which of the following statements correctly returns the value "blue fish" from the "weird" DataFrame?
Presenting with animations, GIFs or speaker notes? Enable our Chrome extension
1227261
Slido solution
Slido solution is in the lecture demo code!
6
1227261
Recap: Boolean Masks with .loc and [ ]
How do we extract rows that satisfy a condition (e.g., rows where column X > 5)?
7
elections[(
((elections["Year"] > 2000) |
(elections["Year"] < 1861)) &
(elections["Result"] == "win") &
(elections["%"] >= 55)
)]
Length 187 Series where every entry is either "True" or "False", where "True" occurs for every candidate who won more than 55% of the popular vote in a year
1227261
Do not edit
Which of the following pandas statements returns a DataFrame of the first 3 baby names from California in 2006? (Talk and work with your neighbor!)
Presenting with animations, GIFs or speaker notes? Enable our Chrome extension
1227261
Slido solution
Slido solution is in the lecture demo code!
9
1227261
Boolean Array Input
We can also use .loc.
10
Length 407428 Series where every entry is either "True" or "False", where "True" occurs for every babyname with "Sex" == "F".
babynames.loc[babynames["Sex"] == "F", :]
Length 239537 DataFrame where every entry belongs to a babyname with "Sex" == "F".
1227261
Boolean Array Input
Boolean arrays can be generated by using logical operators on Series.
11
(babynames["Sex"] == "F")
Length 407428 Series where every entry is either "True" or "False", where "True" occurs for every babyname with "Sex" == "F".
babynames["Sex"]
1227261
Boolean Array Input
We can use our new boolean array to filter to desired rows. Boolean mask! 🤿
12
Length 407428 Series where every entry is either "True" or "False", where "True" occurs for every babyname with "Sex" == "F".
babynames[(babynames["Sex"] == "F")]
Length 239537 DataFrame where every entry belongs to a babyname with "Sex" == "F".
1227261
Boolean Array Input
Boolean Series can be combined element-wise using boolean operators.
13
babynames[(babynames["Sex"] == "F") | (babynames["Year"] < 2000)]
Rows with Sex=="F" OR before year 2000 (or both!)
1227261
Bitwise Operators
& and | are examples of bitwise operators. They allow us to apply multiple logical conditions.
If p and q are boolean arrays or Series:
14
Symbol | Usage | Meaning |
~ | ~p | NOT p |
| | p | q | p OR q |
& | p & q | p AND q |
^ | p ^ q | p XOR q |
p XOR q: Either p, or q, but not both. Exclusive OR!
1227261
More Sophisticated Direct Boolean Array Selection
Boolean array selection is useful, but can lead to overly verbose code.
15
babynames[(babynames["Name"] == "Bella") |
(babynames["Name"] == "Alex") |
(babynames["Name"] == "Narges") |
(babynames["Name"] == "Lisa")]
pandas provides many alternatives, for example:
1227261
More Sophisticated Boolean Array Selection
pandas provides many useful functions for making boolean masks, for example:
16
names = ["Bella", "Alex", "Narges", "Lisa"]
babynames[babynames["Name"].isin(names)]
Returns a Boolean Series that is True when the corresponding name in babynames is Bella, Alex, Narges, or Lisa.
On the Data 100 Reference Sheet!
1227261
Do not edit
What's the most interesting question you want to answer using the babynames table?
Presenting with animations, GIFs or speaker notes? Enable our Chrome extension
1227261
How Common are the Most Common Names?
18
1227261
Election Data: Your Questions (A Representative Sampling)
19
Most common questions:
Other interesting themes
Interesting questions requiring outside data/information
1227261
Agenda
Lecture 03, Data 100 Spring 2026
20
1227261
DataFrame Inspection
lec03.ipynb
21
Demo Slides
1227261
Retrieving the Index, Columns, and shape
To extract row labels, use DataFrame.index:
To extract column labels, use DataFrame.columns:
For dimensions use DataFrame.shape:
22
elections.columns
elections.shape
elections = pd.read_csv("data/elections.csv", index_col = "Candidate")
elections.index
On the Data 100 Reference Sheet!
1227261
.shape and .size
23
babynames
babynames.shape
(407428, 5)
babynames.size
2037140
407,428 * 5 = 2,037,140
1227261
NumPy
Pandas Series and DataFrames support many operations, including NumPy operations, so long as the data is numeric. Data 8 NumPy reference.
24
yash_count = babynames.loc[babynames["Name"]=="Yash", "Count"]
np.mean(yash_count)
17.142857142857142
np.max(yash_count)
29
Pro Tip: Just because Python shows you lots of decimal places, that doesn’t mean you should report all of them! Only use as much precision as needed. In this example, you should probably use 17 or 17.1.
1227261
Built-In pandas Methods
pandas also provides an enormous number of useful utility functions, including:
If you want to manipulate data in some way, there is probably a pandas function that does what you want. Explore the documentation, Google, or ask a large language model (LLM)!
25
1227261
.describe()
26
babynames
babynames.describe()
Note: Only the numeric columns are summarized!
1227261
.describe()
27
babynames["Sex"].describe()
1227261
.value_counts()
The Series.value_counts() method returns the # of times each unique value appears.
28
babyname["Name"].value_counts()
1227261
.sample()
To randomly sample rows from a DataFrame, use the .sample() method.
29
Pro Tip: The top of a DataFrame is often not representative of the entire DataFrame. For this reason, I tend to prefer df.sample() over df.head() when exploring data.
Recall that "sampling with replacement" means that each row can be sampled more than one time.
babynames.sample()
n=1 by default
1227261
Chaining methods
pandas methods can be chained together:
30
babynames.sample(5).iloc[:, 2:]
result = (
babynames[babynames["Year"]==2000]
.sample(4, replace=True)
.iloc[:, 2:]
)
Multi-line chaining requires parentheses.
1227261
.unique()
The Series.unique method returns an array of every unique value in a Series.
31
babynames["Name"].unique()
1227261
Do not edit
Which of the following are valid interpretations of the pandas output here (right)?
Presenting with animations, GIFs or speaker notes? Enable our Chrome extension
1227261
Interlude
2-min stretch break!
33
1227261
Adding, Removing, and Modifying Columns
34
Lecture 03, Data 100 Spring 2026
1227261
Adding, Removing, & Modifying Columns
lec03.ipynb
35
Demo Slides
1227261
Syntax for Adding a Column
36
babynames["name_lengths"] = babynames["Name"].str.len()
1. Series with the length of each name
2. Assign Series to a new column called "name_lengths"
Making a new column with the number of characters in each name:
1227261
Syntax for Modifying a Column
37
babynames["name_lengths"] = babynames["name_lengths"] - 1
Modify the name_lengths column to be one less than its original value:
Pro tip: if you only want to modify some rows (not all) of a column, make sure you use .loc on the LHS!
babynames.loc[:, "name_lengths"] = babynames["name_lengths"] - 1
1227261
Syntax for Renaming a Column
Rename a column using the .rename() method.
38
# Rename "name_lengths" to "Length"
babynames = babynames.rename(columns={"name_lengths":"Length"})
Common typo: Flipping the keys and values in the dictionary! Forgive yourself ♥️
1227261
Syntax for Dropping a Column (or Row)
Remove columns using the .drop method.
39
babynames = babynames.drop("Length", axis="columns")
👋
1227261
An Important Note: DataFrame Copies
On the previous slide, we re-assigned babynames to an updated DataFrame.
40
babynames = babynames.drop("Length", axis="columns")
By default, pandas creates a new copy of the DataFrame, without changing the original DataFrame.
babynames.drop("Length", axis="columns")
babynames
Our change was not applied!
1227261
Do not edit
Which of the following returns 'b'? (Talk and work w/ your neighbor!)
Presenting with animations, GIFs or speaker notes? Enable our Chrome extension
1227261
Slido solution
Slido solution is in the lecture demo code!
42
1227261
Agenda
Lecture 03, Data 100 Spring 2026
43
1227261
Sorting
lec03.ipynb
44
Demo Slides
1227261
.sort_values()
The.sort_values() method sorts a DataFrame or Series.
45
babynames.sort_values(by="Count", ascending=False)
By default, rows are sorted in ascending order.
Common typo: Forgetting if ascending/descending is the default. Solution? Always be explicit!
babynames["Name"].sort_values()
Notice the index changes order too!
1227261
Do not edit
Which of the following extracts the rows of a DataFrame df where the values in column A are at least as big as the smallest value in column B? (Talk and work w/ your neighbor!)
Presenting with animations, GIFs or speaker notes? Enable our Chrome extension
1227261
Slido solution
Slido solution is in the lecture demo code!
47
1227261
Sorting By Length
Suppose we want to sort entries based on the length of each name.
48
1227261
Approach 1: Create a New Column and Sort Based on the New Column
49
babynames["name_lengths"] = babynames["Name"].str.len()
babynames = babynames.sort_values(by="name_lengths", ascending=False)
babynames.head(5)
1227261
Approach 2: Sorting Using the key Argument
50
babynames.sort_values("Name", key=lambda x: x.str.len(), ascending=False).head()
Create a temporary column with the length of each name, sort on it, then drop it.
Recall that lambda x: x.str.len() is an anonymous function (i.e., not given a name, temporary, one-time use). This anonymous function is applied to the sorting column(s).
1227261
Approach 3: Sorting Using the Series.map method
Using the Series.map method to sort by the number of occurrences of "dr" and "ea"s:
51
# Returns number of times 'dr' and 'ea' appear in `string`
def dr_ea_count(string):
return string.count('dr') + string.count('ea')
# Apply dr_ea_count to each name in the "Name" column
babynames["dr_ea_count"] = babynames["Name"].map(dr_ea_count)
babynames.sort_values(by="dr_ea_count", ascending=False).head()
1227261
A taste of next lecture! Grouping diagram from Data 8
1227261
Pandas, Part II
Data 100/Data 200, Spring 2026 @ UC Berkeley
Josh Grossman, Ramesh Sridharan
Content credit: Acknowledgments
53
LECTURE 3
1227261