1 of 57

Join at slido.com�#2130791

ⓘ

Click Present with Slido or install our Chrome extension to display joining instructions for participants while presenting.

2130791

2 of 57

Pandas, Part II

More on pandas (Selections and Utility Functions)

Data 100/Data 200, Spring 2025 @ UC Berkeley

Narges Norouzi and Josh Grossman

LECTURE 3

2130791

3 of 57

Goals for this Lecture

Lecture 03, Data 100 Spring 2025

Continue our tour of pandas

Extracting data using .iloc and []
Extract data according to a condition
Modify columns in a DataFrame
Aggregate data

Last lecture: introducing tools

Today: "doing things"

2130791

4 of 57

Agenda

Lecture 03, Data 100 Spring 2025

Data extraction with iloc, and []
Conditional selection
Adding, removing, and modifying columns
Useful utility functions
Custom sorts

2130791

5 of 57

Data Extraction with iloc and []

Lecture 03, Data 100 Spring 2025

Data extraction with iloc, and []
Conditional selection
Adding, removing, and modifying columns
Useful utility functions
Custom sorts

2130791

6 of 57

Integer-based Extraction: .iloc

A different scenario: We want to extract data according to its position.

Example: Grab the 1st, 2nd, and 3rd columns of the DataFrame.

df.iloc[row_integers, column_integers]

0 1 2 3 4 5

Column integers

Row integers

The .iloc accessor allows us to specify the integers of rows and columns we wish to extract.

Python convention: The first position has integer index 0.

2130791

7 of 57

Integer-based Extraction: .iloc

Arguments to .iloc can be:

A list.
A slice (syntax is exclusive of the right hand side of the slice).
A single value.

2130791

8 of 57

Integer-based Extraction: .iloc

Arguments to .iloc can be:

A list.
A slice (syntax is exclusive of the right hand side of the slice).
A single value.

elections.iloc[[1, 2, 3], [0, 1, 2]]

Select the rows at positions 1, 2, and 3.

Select the columns at positions 0, 1, and 2.

2130791

9 of 57

Integer-based Extraction: .iloc

Arguments to .iloc can be:

A list.
A slice (syntax is exclusive of the right hand side of the slice).
A single value.

elections.iloc[[1, 2, 3], 0:3]

Select the rows at positions 1, 2, and 3.

Select all columns from integer 0 to integer 2.

Remember: integer-based slicing is right-end exclusive!

2130791

10 of 57

Integer-based Extraction: .iloc

Just like .loc, we can use a colon with .iloc to extract all rows or all columns.

elections.iloc[:, 0:3]

Grab all rows of the columns at integers 0 to 2.

2130791

11 of 57

Integer-based Extraction: .iloc

Arguments to .iloc can be:

A list.
A slice (syntax is exclusive of the right hand side of the slice).
A single value.

elections.iloc[[1, 2, 3], 1]

As before, the result for a single value argument is a Series.

We have extracted row integers 1, 2, and 3 from the column at position 1.

elections.iloc[0, 1]

We've extracted the string value with row position 0 and column position 1.

2130791

12 of 57

.loc vs .iloc

Remember:

.loc performs label-based extraction
.iloc performs integer-based extraction

When choosing between .loc and .iloc, you'll usually choose .loc.

Safer: If the order of data gets shuffled in a public database, your code still works.
Readable: Easier to understand what elections.loc[:, ["Year", "Candidate", "Result"]] means than elections.iloc[:, [0, 1, 4]]

.iloc can still be useful.

Example: If you have a DataFrame of movie earnings sorted by earnings, can use .iloc to get the median earnings for a given year (index into the middle).

2130791

13 of 57

… Just When It Was All Making Sense

[]

loc

iloc

2130791

14 of 57

Context-dependent Extraction: [ ]

Selection operators:

.loc selects items by label. First argument is rows, second argument is columns.
.iloc selects items by integer. First argument is rows, second argument is columns.

[] only takes one argument, which may be:

A slice of row numbers.
A list of column labels.
A single column label.

That is, [] is context sensitive.

Let’s see some examples.

2130791

15 of 57

Context-dependent Extraction: [ ]

[] only takes one argument, which may be:

A slice of row integers.
A list of column labels.
A single column label.

elections[3:7]

2130791

16 of 57

Context-dependent Extraction: [ ]

[] only takes one argument, which may be:

A slice of row numbers.
A list of column labels.
A single column label.

elections[["Year", "Candidate", "Result"]]

2130791

17 of 57

Context-dependent Extraction: [ ]

[] only takes one argument, which may be:

A slice of row numbers.
A list of column labels.
A single column label.

elections["Candidate"]

Extract the "Candidate" column as a Series.

2130791

18 of 57

Why Use []?

In short: [ ] can be much more concise than .loc or .iloc

Consider the case where we wish to extract the "Candidate" column. It is far simpler to write elections["Candidate"] than it is to write elections.loc[:, "Candidate"]

In practice, [ ] is often used over .iloc and .loc in data science work. Typing time adds up!

2130791

19 of 57

Which of the following statements correctly return the value "blue fish" from the "weird" DataFrame?

ⓘ

Click Present with Slido or install our Chrome extension to activate this poll while presenting.

2130791

20 of 57

Conditional Selection

Lecture 03, Data 100 Spring 2025

Data extraction with iloc, and []
Conditional selection
Adding, removing, and modifying columns
Useful utility functions
Custom sorts

2130791

21 of 57

Boolean Array Input for .loc and [ ]

We learned to extract data according to its integer position (.iloc) or its label (.loc)

What if we want to extract rows that satisfy a given condition?

babynames_first_10_rows = babynames.loc[:9, :]

.loc and [ ] also accept boolean arrays as input.
Rows corresponding to True are extracted; rows corresponding to False are not.

2130791

22 of 57

Boolean Array Input for .loc and [ ]

.loc and [ ] also accept boolean arrays as input.
Rows corresponding to True are extracted; rows corresponding to False are not.

babynames_first_10_rows[[True, False, True, False, True, False, True, False, True, False]]

2130791

23 of 57

Boolean Array Input

We can perform the same operation using .loc.

babynames_first_10_rows.loc[[True, False, True, False, True, False, True, False, True, False], :]

2130791

24 of 57

Boolean Array Input

Useful because boolean arrays can be generated by using logical operators on Series.

Length 407428 Series where every entry is either "True" or "False", where "True" occurs for every babyname with "Sex" = "F".

logical_operator = (babynames["Sex"] == "F")

True in rows 0, 1, 2, …

2130791

25 of 57

Boolean Array Input

Useful because boolean arrays can be generated by using logical operators on Series.

Length 407428 Series where every entry is either "True" or "False", where "True" occurs for every babyname with "Sex" = "F".

babynames[(babynames["Sex"] == "F")]

Length 239537 DataFrame where every entry belongs to a babyname with "Sex" = "F".

2130791

26 of 57

Boolean Array Input

Can also use .loc.

Length 407428 Series where every entry is either "True" or "False", where "True" occurs for every babyname with "Sex" = "F".

babynames.loc[babynames["Sex"] == "F", :]

Length 239537 DataFrame where every entry belongs to a babyname with "Sex" = "F".

2130791

27 of 57

Boolean Array Input

Boolean Series can be combined using various operators, allowing filtering of results by multiple criteria.

The & operator allows us to apply logical_operator_1 and logical_operator_2
The | operator allows us to apply logical_operator_1 or logical_operator_2

babynames[(babynames["Sex"] == "F") | (babynames["Year"] < 2000)]

Rows that have a Sex of "F" or are earlier than the year 2000 (or both!)

2130791

28 of 57

Bitwise Operators

& and | are examples of bitwise operators. They allow us to apply multiple logical conditions.

If p and q are boolean arrays or Series:

Symbol	Usage	Meaning
~	~p	Negation of p
\|	p \| q	p OR q
&	p & q	p AND q
^	p ^ q	p XOR q (exclusive or)

2130791

29 of 57

Which of the following pandas statements returns a DataFrame of the first 3 baby names with Count > 250.

ⓘ

Click Present with Slido or install our Chrome extension to activate this poll while presenting.

2130791

30 of 57

Alternatives to Direct Boolean Array Selection

Boolean array selection is a useful tool, but can lead to overly verbose code for complex conditions.

babynames[(babynames["Name"] == "Bella") |

(babynames["Name"] == "Alex") |

(babynames["Name"] == "Narges") |

(babynames["Name"] == "Lisa")]

pandas provides many alternatives, for example:

.isin
.str.startswith
.groupby.filter (we’ll see this in Lecture 4)

2130791

31 of 57

Alternatives to Direct Boolean Array Selection

pandas provides many alternatives, for example:

.isin
.str.startswith
.groupby.filter (see lecture 4)

names = ["Bella", "Alex", "Narges", "Lisa"]

babynames[babynames["Name"].isin(names)]

Returns a Boolean Series that is True when the corresponding name in babynames is Bella, Alex, Narges, or Lisa.

2130791

32 of 57

Alternatives to Boolean Array Selection

pandas provides many alternatives, for example:

.isin
.str.startswith
.groupby.filter (see lecture 4)

babynames[babynames["Name"].str.startswith("N")]

Returns a Boolean Series that is True when the corresponding name in babynames starts with "N".

2130791

33 of 57

Interlude

2130791

34 of 57

Adding, Removing, and Modifying Columns

Data extraction with iloc, and []
Conditional selection
Adding, removing, and modifying columns
Useful utility functions
Custom sorts

Lecture 03, Data 100 Spring 2025

2130791

35 of 57

Syntax for Adding a Column

Adding a column is easy:

Use [ ] to reference the desired new column.
Assign this column to a Series or array of the appropriate length.

# Create a Series of the length of each name

babyname_lengths = babynames["Name"].str.len()

# Add a column named "name_lengths" that

# includes the length of each name

babynames["name_lengths"] = babyname_lengths

2130791

36 of 57

Syntax for Modifying a Column

Modifying a column is very similar to adding a column.

Use [ ] to reference the existing column.
Assign this column to a new Series or array of the appropriate length.

# Modify the "name_lengths" column to be one less than its original value

babynames["name_lengths"] = babynames["name_lengths"]-1

2130791

37 of 57

Syntax for Renaming a Column

Rename a column using the (creatively named) .rename() method.

.rename() takes in a dictionary that maps old column names to their new ones.

# Rename "name_lengths" to "Length"

babynames = babynames.rename(columns={"name_lengths":"Length"})

2130791

38 of 57

Syntax for Dropping a Column (or Row)

Remove columns using the (also creatively named) .drop method.

The .drop() method assumes you're dropping a row by default. Use axis="columns" to drop a column instead.

babynames = babynames.drop("Length", axis="columns")

2130791

39 of 57

An Important Note: DataFrame Copies

Notice that we re-assigned babynames to an updated value on the previous slide.

babynames = babynames.drop("Length", axis="columns")

By default, pandas methods create a copy of the DataFrame, without changing the original DataFrame at all. To apply our changes, we must update our DataFrame to this new, modified copy.

babynames.drop("Length", axis="columns")

babynames

Our change was not applied!

2130791

40 of 57

Useful Utility Functions

Lecture 03, Data 100 Spring 2025

Data extraction with iloc, and []
Conditional selection
Adding, removing, and modifying columns
Useful utility functions
Custom sorts

2130791

41 of 57

NumPy

Pandas Series and DataFrames support a large number of operations, including mathematical operations, so long as the data is numerical. Data 8 NumPy reference.

yash_count = babynames[babynames["Name"]=="Yash"]["Count"]

np.mean(yash_count)

17.142857142857142

np.max(yash_count)

2130791

42 of 57

Built-In pandas Methods

In addition to its rich syntax for indexing and support for other libraries (NumPy, native Python functions), pandas provides an enormous number of useful utility functions. Today, we’ll discuss just a few:

size/shape
describe
sample
value_counts
unique
sort_values

The pandas library is rich in utility functions (we could spend the entire summer talking about them)! We encourage you to explore as you complete your assignments by Googling and reading documentation, just as data scientists do.

2130791

43 of 57

.shape and .size

babynames.size

2037140

babynames.shape

(407428, 5)

.shape returns the shape of a DataFrame or Series in the form (number of rows, number of columns.
.size returns the total number of entries in a DataFrame or Series (number of rows times number of columns).

babynames

2130791

44 of 57

.describe()

babynames.describe()

.describe() returns a "description" of a DataFrame or Series that lists summary statistics of the data.

babynames

2130791

45 of 57

.describe()

A different set of statistics will be reported if .describe() is called on a Series.

babynames["Sex"].describe()

2130791

46 of 57

.sample()

To sample a random selection of rows from a DataFrame, we use the .sample() method.

By default, it is without replacement. Use replace=True for replacement.
Naturally, can be chained with other methods and operators (iloc, etc).

babynames.sample(5).iloc[:, 2:]

babynames[babynames["Year"]==2000]

.sample(4, replace=True)

.iloc[:, 2:]

babynames.sample()

2130791

47 of 57

.value_counts()

The Series.value_counts method counts the number of occurrences of each unique value in a Series (it counts the number of times each value appears).

Return value is also a Series.

babyname["Name"].value_counts()

2130791

48 of 57

.unique()

The Series.unique method returns an array of every unique value in a Series.

babynames["Name"].unique()

2130791

49 of 57

.sort_values()

The DataFrame.sort_values and Series.sort_values methods sort a DataFrame (or Series).

Series.sort_values( ) will automatically sort all values in the Series.
DataFrame.sort_values(column_name) must specify the name of the column to be used for sorting.

babynames["Name"].sort_values()

2130791

50 of 57

.sort_values()

The DataFrame.sort_values and Series.sort_values methods sort a DataFrame (or Series).

Series.sort_values( ) will automatically sort all values in the Series.
DataFrame.sort_values(column_name) must specify the name of the column to be used for sorting.

babynames.sort_values(by="Count", ascending=False)

By default, rows are sorted in ascending order.

2130791

51 of 57

Lecture 3 ended here!

We will cover the rest in lecture 4

2130791

52 of 57

Custom Sorts

Lecture 03, Data 100 Spring 2025

Data extraction with iloc, and []
Conditional selection
Adding, removing, and modifying columns
Useful utility functions
Custom sorts

2130791

53 of 57

Sorting By Length

Let’s try to solve the sorting problem with different approaches.

Assume that we want to sort entries based on the length of the name.
Approach 1:

We will create a temporary column which holds the length of each name and then will sort on it.

2130791

54 of 57

Approach 1: Create a Temporary Column and Sort Based on the New Column

Sorting the DataFrame as usual:

# Create a Series of the length of each name

babyname_lengths = babynames["Name"].str.len()

# Add a column named "name_lengths" that includes the length of each name

babynames["name_lengths"] = babyname_lengths

babynames = babynames.sort_values(by="name_lengths", ascending=False)

babynames.head(5)

2130791

55 of 57

Approach 2: Sorting Using the key Argument

babynames.sort_values("Name", key=lambda x: x.str.len(), ascending=False)

.head()

2130791

56 of 57

Approach 3: Sorting Using the map Function

Suppose we want to sort by the number of occurrences of "dr" and "ea"s.

Use the Series.map method.

def dr_ea_count(string):

return string.count('dr') + string.count('ea')

# Use map to apply dr_ea_count to each name in the "Name" column

babynames["dr_ea_count"] = babynames["Name"].map(dr_ea_count)

babynames = babynames.sort_values(by="dr_ea_count", ascending=False)

babynames.head()

2130791

57 of 57

Pandas, Part II

Content credit: Acknowledgments

LECTURE 3

2130791