Join at slido.com�#2130791
ⓘ
Click Present with Slido or install our Chrome extension to display joining instructions for participants while presenting.
2130791
Pandas, Part II
More on pandas (Selections and Utility Functions)
Data 100/Data 200, Spring 2025 @ UC Berkeley
Narges Norouzi and Josh Grossman
2
LECTURE 3
2130791
Goals for this Lecture
Lecture 03, Data 100 Spring 2025
Continue our tour of pandas
Last lecture: introducing tools
Today: "doing things"
3
2130791
Agenda
Lecture 03, Data 100 Spring 2025
4
2130791
Data Extraction with iloc and []
Lecture 03, Data 100 Spring 2025
5
2130791
Integer-based Extraction: .iloc
A different scenario: We want to extract data according to its position.
6
df.iloc[row_integers, column_integers]
0
1
2
3
4
0 1 2 3 4 5
Column integers
Row integers
The .iloc accessor allows us to specify the integers of rows and columns we wish to extract.
2130791
Integer-based Extraction: .iloc
Arguments to .iloc can be:
7
2130791
Integer-based Extraction: .iloc
Arguments to .iloc can be:
8
elections.iloc[[1, 2, 3], [0, 1, 2]]
Select the rows at positions 1, 2, and 3.
Select the columns at positions 0, 1, and 2.
2130791
Integer-based Extraction: .iloc
Arguments to .iloc can be:
9
elections.iloc[[1, 2, 3], 0:3]
Select the rows at positions 1, 2, and 3.
Select all columns from integer 0 to integer 2.
Remember: integer-based slicing is right-end exclusive!
2130791
Integer-based Extraction: .iloc
Just like .loc, we can use a colon with .iloc to extract all rows or all columns.
10
elections.iloc[:, 0:3]
Grab all rows of the columns at integers 0 to 2.
2130791
Integer-based Extraction: .iloc
Arguments to .iloc can be:
11
elections.iloc[[1, 2, 3], 1]
As before, the result for a single value argument is a Series.
We have extracted row integers 1, 2, and 3 from the column at position 1.
elections.iloc[0, 1]
We've extracted the string value with row position 0 and column position 1.
2130791
.loc vs .iloc
Remember:
When choosing between .loc and .iloc, you'll usually choose .loc.
.iloc can still be useful.
12
2130791
… Just When It Was All Making Sense
13
[]
loc
iloc
2130791
Context-dependent Extraction: [ ]
Selection operators:
14
That is, [] is context sensitive.
Let’s see some examples.
2130791
Context-dependent Extraction: [ ]
[] only takes one argument, which may be:
15
elections[3:7]
2130791
Context-dependent Extraction: [ ]
[] only takes one argument, which may be:
16
elections[["Year", "Candidate", "Result"]]
2130791
Context-dependent Extraction: [ ]
[] only takes one argument, which may be:
17
elections["Candidate"]
Extract the "Candidate" column as a Series.
2130791
Why Use []?
In short: [ ] can be much more concise than .loc or .iloc
In practice, [ ] is often used over .iloc and .loc in data science work. Typing time adds up!
18
2130791
Which of the following statements correctly return the value "blue fish" from the "weird" DataFrame?
ⓘ
Click Present with Slido or install our Chrome extension to activate this poll while presenting.
2130791
Conditional Selection
Lecture 03, Data 100 Spring 2025
20
2130791
Boolean Array Input for .loc and [ ]
We learned to extract data according to its integer position (.iloc) or its label (.loc)
What if we want to extract rows that satisfy a given condition?
21
babynames_first_10_rows = babynames.loc[:9, :]
2130791
Boolean Array Input for .loc and [ ]
22
babynames_first_10_rows[[True, False, True, False, True, False, True, False, True, False]]
2130791
Boolean Array Input
We can perform the same operation using .loc.
23
babynames_first_10_rows.loc[[True, False, True, False, True, False, True, False, True, False], :]
2130791
Boolean Array Input
Useful because boolean arrays can be generated by using logical operators on Series.
24
Length 407428 Series where every entry is either "True" or "False", where "True" occurs for every babyname with "Sex" = "F".
logical_operator = (babynames["Sex"] == "F")
True in rows 0, 1, 2, …
2130791
Boolean Array Input
Useful because boolean arrays can be generated by using logical operators on Series.
25
Length 407428 Series where every entry is either "True" or "False", where "True" occurs for every babyname with "Sex" = "F".
babynames[(babynames["Sex"] == "F")]
Length 239537 DataFrame where every entry belongs to a babyname with "Sex" = "F".
2130791
Boolean Array Input
Can also use .loc.
26
Length 407428 Series where every entry is either "True" or "False", where "True" occurs for every babyname with "Sex" = "F".
babynames.loc[babynames["Sex"] == "F", :]
Length 239537 DataFrame where every entry belongs to a babyname with "Sex" = "F".
2130791
Boolean Array Input
Boolean Series can be combined using various operators, allowing filtering of results by multiple criteria.
27
babynames[(babynames["Sex"] == "F") | (babynames["Year"] < 2000)]
Rows that have a Sex of "F" or are earlier than the year 2000 (or both!)
2130791
Bitwise Operators
& and | are examples of bitwise operators. They allow us to apply multiple logical conditions.
If p and q are boolean arrays or Series:
28
Symbol | Usage | Meaning |
~ | ~p | Negation of p |
| | p | q | p OR q |
& | p & q | p AND q |
^ | p ^ q | p XOR q (exclusive or) |
2130791
Which of the following pandas statements returns a DataFrame of the first 3 baby names with Count > 250.
ⓘ
Click Present with Slido or install our Chrome extension to activate this poll while presenting.
2130791
Alternatives to Direct Boolean Array Selection
Boolean array selection is a useful tool, but can lead to overly verbose code for complex conditions.
30
babynames[(babynames["Name"] == "Bella") |
(babynames["Name"] == "Alex") |
(babynames["Name"] == "Narges") |
(babynames["Name"] == "Lisa")]
pandas provides many alternatives, for example:
2130791
Alternatives to Direct Boolean Array Selection
pandas provides many alternatives, for example:
31
names = ["Bella", "Alex", "Narges", "Lisa"]
babynames[babynames["Name"].isin(names)]
Returns a Boolean Series that is True when the corresponding name in babynames is Bella, Alex, Narges, or Lisa.
2130791
Alternatives to Boolean Array Selection
pandas provides many alternatives, for example:
32
babynames[babynames["Name"].str.startswith("N")]
Returns a Boolean Series that is True when the corresponding name in babynames starts with "N".
2130791
Interlude
33
2130791
2130791
Adding, Removing, and Modifying Columns
34
Lecture 03, Data 100 Spring 2025
2130791
Syntax for Adding a Column
Adding a column is easy:
35
# Create a Series of the length of each name
babyname_lengths = babynames["Name"].str.len()
# Add a column named "name_lengths" that
# includes the length of each name
babynames["name_lengths"] = babyname_lengths
2130791
Syntax for Modifying a Column
Modifying a column is very similar to adding a column.
36
# Modify the "name_lengths" column to be one less than its original value
babynames["name_lengths"] = babynames["name_lengths"]-1
2130791
Syntax for Renaming a Column
Rename a column using the (creatively named) .rename() method.
37
# Rename "name_lengths" to "Length"
babynames = babynames.rename(columns={"name_lengths":"Length"})
2130791
Syntax for Dropping a Column (or Row)
Remove columns using the (also creatively named) .drop method.
38
babynames = babynames.drop("Length", axis="columns")
2130791
An Important Note: DataFrame Copies
Notice that we re-assigned babynames to an updated value on the previous slide.
39
babynames = babynames.drop("Length", axis="columns")
By default, pandas methods create a copy of the DataFrame, without changing the original DataFrame at all. To apply our changes, we must update our DataFrame to this new, modified copy.
babynames.drop("Length", axis="columns")
babynames
Our change was not applied!
2130791
Useful Utility Functions
Lecture 03, Data 100 Spring 2025
40
2130791
NumPy
Pandas Series and DataFrames support a large number of operations, including mathematical operations, so long as the data is numerical. Data 8 NumPy reference.
41
yash_count = babynames[babynames["Name"]=="Yash"]["Count"]
np.mean(yash_count)
17.142857142857142
np.max(yash_count)
29
2130791
Built-In pandas Methods
In addition to its rich syntax for indexing and support for other libraries (NumPy, native Python functions), pandas provides an enormous number of useful utility functions. Today, we’ll discuss just a few:
The pandas library is rich in utility functions (we could spend the entire summer talking about them)! We encourage you to explore as you complete your assignments by Googling and reading documentation, just as data scientists do.
42
2130791
.shape and .size
43
babynames.size
2037140
babynames.shape
(407428, 5)
babynames
2130791
.describe()
44
babynames.describe()
babynames
2130791
.describe()
45
babynames["Sex"].describe()
2130791
.sample()
To sample a random selection of rows from a DataFrame, we use the .sample() method.
46
babynames.sample(5).iloc[:, 2:]
babynames[babynames["Year"]==2000]
.sample(4, replace=True)
.iloc[:, 2:]
babynames.sample()
2130791
.value_counts()
The Series.value_counts method counts the number of occurrences of each unique value in a Series (it counts the number of times each value appears).
47
babyname["Name"].value_counts()
2130791
.unique()
The Series.unique method returns an array of every unique value in a Series.
48
babynames["Name"].unique()
2130791
.sort_values()
The DataFrame.sort_values and Series.sort_values methods sort a DataFrame (or Series).
49
babynames["Name"].sort_values()
2130791
.sort_values()
The DataFrame.sort_values and Series.sort_values methods sort a DataFrame (or Series).
50
babynames.sort_values(by="Count", ascending=False)
By default, rows are sorted in ascending order.
2130791
Lecture 3 ended here!
We will cover the rest in lecture 4
51
2130791
Custom Sorts
Lecture 03, Data 100 Spring 2025
52
2130791
Sorting By Length
Let’s try to solve the sorting problem with different approaches.
53
2130791
Approach 1: Create a Temporary Column and Sort Based on the New Column
Sorting the DataFrame as usual:
54
# Create a Series of the length of each name
babyname_lengths = babynames["Name"].str.len()
# Add a column named "name_lengths" that includes the length of each name
babynames["name_lengths"] = babyname_lengths
babynames = babynames.sort_values(by="name_lengths", ascending=False)
babynames.head(5)
2130791
Approach 2: Sorting Using the key Argument
55
babynames.sort_values("Name", key=lambda x: x.str.len(), ascending=False)
.head()
2130791
Approach 3: Sorting Using the map Function
Suppose we want to sort by the number of occurrences of "dr" and "ea"s.
56
def dr_ea_count(string):
return string.count('dr') + string.count('ea')
# Use map to apply dr_ea_count to each name in the "Name" column
babynames["dr_ea_count"] = babynames["Name"].map(dr_ea_count)
babynames = babynames.sort_values(by="dr_ea_count", ascending=False)
babynames.head()
2130791
Pandas, Part II
Content credit: Acknowledgments
57
LECTURE 3
2130791