Pre-Announcement
If you want to do your own pre-announcement, email me.
Announcements
For consistency with 61A and 61B, I will start using the term “method” everywhere to refer to a function that belongs to a class.
There is a live lecture Piazza thread:
Attendance will be taken today!
DS100: Fall 2018
Lecture 3 (Josh Hug): Intro to Pandas, Part II
Goals For Today
Groupby
(and isin)
groupby
Often we want to perform aggregate analysis across data points that share some feature, for example:
groupby is an incredibly powerful tool for these sorts of questions.
groupby Demo
See 03_groupby_basics.ipynb
groupby Key Concepts
If we call groupby on a Series:
SeriesGroupBy objects can then be aggregated back into a Series using an aggregation method.
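As a minimal sketch of the Series case (the data and the names `nums` and `keys` are made up for illustration), grouping a Series produces a SeriesGroupBy object, and an aggregation method such as sum turns it back into a Series:

```python
import pandas as pd

# Hypothetical toy data: one Series of numbers, one Series of group labels.
nums = pd.Series([3, 1, 4, 1])
keys = pd.Series(["A", "B", "A", "B"])

grouped = nums.groupby(keys)    # no computation yet, just a SeriesGroupBy
totals = grouped.sum()          # aggregate back into a Series indexed by label

print(type(grouped).__name__)   # SeriesGroupBy
print(type(totals).__name__)    # Series
print(totals)
```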
groupby Key Concepts
If we call groupby on a DataFrame:
DataFrameGroupBy objects can then be aggregated back into a DataFrame or a Series using an aggregation method.
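A short sketch of the DataFrame case, assuming a made-up frame with hypothetical columns `key`, `num`, and `other`. Whether the result is a DataFrame or a Series depends on the aggregation used:

```python
import pandas as pd

# Hypothetical toy data; the column names are assumptions for illustration.
df = pd.DataFrame({
    "key": ["A", "B", "C", "A"],
    "num": [3, 1, 4, 1],
    "other": [10, 20, 30, 40],
})

grouped = df.groupby("key")        # DataFrameGroupBy
as_frame = grouped.sum()           # DataFrame: one row per key, one column per remaining column
sizes = grouped.size()             # Series: number of rows in each group

print(type(grouped).__name__)      # DataFrameGroupBy
print(as_frame)
print(sizes)
```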
groupby and agg
Most of the built-in handy aggregation methods are just shorthand for a universal aggregation method called agg.
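To illustrate, here is a small sketch on made-up data (the column names `key` and `num` are assumptions). The shorthand `.max()` and `.agg("max")` give the same result, and `agg` also accepts an arbitrary function of each group:

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({"key": ["A", "B", "A", "B"], "num": [3, 1, 1, 5]})
grouped = df.groupby("key")["num"]

shorthand = grouped.max()        # built-in aggregation method
via_agg = grouped.agg("max")     # same aggregation, requested through agg

# agg also takes any function that maps a group's values to one value.
spread = grouped.agg(lambda s: s.max() - s.min())

print(shorthand.equals(via_agg))
print(spread)
```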
Series groupby/agg Summary
Original (key, value) pairs, in order:
A 3
B 1
C 4
A 1
B 5
C 9
A 2
D 5
B 6
After groupby (values collected by key):
A: 3, 1, 2
B: 1, 5, 6
C: 4, 9
D: 5
After .agg(f), where f = sum:
A 6
B 12
C 13
D 5
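The Series groupby/agg summary above can be reproduced with a few lines of pandas. This is a sketch: the column names `key` and `num` are assumptions, since the slide does not label them. The string `"sum"` is passed instead of Python's built-in `sum` because recent pandas versions warn when given the built-in.

```python
import pandas as pd

# The nine (key, value) pairs from the summary; column names are hypothetical.
df = pd.DataFrame({
    "key": ["A", "B", "C", "A", "B", "C", "A", "D", "B"],
    "num": [3, 1, 4, 1, 5, 9, 2, 5, 6],
})

# Group the num column by key, then collapse each group with sum.
totals = df.groupby("key")["num"].agg("sum")
print(totals)
```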
DataFrame groupby/agg Summary
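Original rows (key, number, string), in order:
A 3 ak
B 1 tx
C 4 fl
A 1 hi
B 5 mi
C 9 ak
A 2 ca
C 5 sd
B 6 nc
After groupby (rows collected by key):
A: (3, ak), (1, hi), (2, ca)
B: (1, tx), (5, mi), (6, nc)
C: (4, fl), (9, ak), (5, sd)
After .agg(f), where f = max:
A 3 hi
B 6 tx
C 9 sd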
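A sketch reproducing the DataFrame groupby/agg summary above, with hypothetical column names `key`, `num`, and `state`. Note that max is computed for each column separately, so an output row need not match any single input row: group A yields 3 and "hi", which never appear together in the original data.

```python
import pandas as pd

# The nine rows from the summary; column names are assumptions.
df = pd.DataFrame({
    "key": ["A", "B", "C", "A", "B", "C", "A", "C", "B"],
    "num": [3, 1, 4, 1, 5, 9, 2, 5, 6],
    "state": ["ak", "tx", "fl", "hi", "mi", "ak", "ca", "sd", "nc"],
})

# max is applied independently to num and to state within each group.
result = df.groupby("key").agg("max")
print(result)

# Check whether the pair (3, "hi") exists as an actual row of df.
pair_exists = ((df["num"] == 3) & (df["state"] == "hi")).any()
print(pair_exists)
```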
The MultiIndex
If we group a Series (or DataFrame) by multiple Series and then perform an aggregation operation, the resulting Series (or DataFrame) will have a MultiIndex.
The resulting DataFrame has:
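As a minimal sketch of how a MultiIndex arises (toy data; the column names `key`, `flag`, and `num` are assumptions), grouping by two columns and aggregating gives an index with two levels. Rows are then looked up with a tuple, and `reset_index` turns the levels back into ordinary columns:

```python
import pandas as pd

# Hypothetical toy data with two grouping columns.
df = pd.DataFrame({
    "key": ["A", "A", "B", "B", "A"],
    "flag": ["U", "V", "U", "V", "U"],
    "num": [3, 1, 5, 6, 2],
})

result = df.groupby(["key", "flag"]).sum()

print(result.index.nlevels)              # number of index levels
print(result.loc[("A", "U"), "num"])     # look up one row with a tuple

# Move the index levels back into regular columns.
flat = result.reset_index()
print(flat)
```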
Filtering by Group
Another common use for groups is to filter data.
Series groupby/filter Summary
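Original (key, value) pairs, in order:
A 3
B 1
C 4
A 1
B 5
C 9
A 2
D 5
B 6
After groupby (values collected by key, with each group's sum):
A: 3, 1, 2 (sum 6)
B: 1, 5, 6 (sum 12)
C: 4, 9 (sum 13)
D: 5 (sum 5)
After .filter(f), where f = lambda sf: sf["num"].sum() > 10:
B 1
C 4
B 5
C 9
B 6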
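The groupby/filter summary above can be sketched as follows, assuming hypothetical column names `key` and `num`. Unlike agg, filter returns the original rows of every group that passes the test, in their original order:

```python
import pandas as pd

# The nine (key, value) pairs from the summary; column names are assumptions.
df = pd.DataFrame({
    "key": ["A", "B", "C", "A", "B", "C", "A", "D", "B"],
    "num": [3, 1, 4, 1, 5, 9, 2, 5, 6],
})

# Keep only the groups whose num values add up to more than 10.
kept = df.groupby("key").filter(lambda sf: sf["num"].sum() > 10)
print(kept)
```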
isin
We saw last time how to build boolean arrays for filtering, e.g.
If we have a list of valid items, e.g. “Republican” or “Democratic”, we could use the | operator (| means or), but a better way is to use isin.
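A brief sketch comparing the two approaches on a small illustrative table (the DataFrame name `elections` and its column names are assumptions). Both selections return the same rows, but isin scales to longer lists of valid items without chaining more `|` terms:

```python
import pandas as pd

# Small illustrative table; names of the frame and columns are hypothetical.
elections = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson", "Clinton", "Perot"],
    "Party": ["Republican", "Democratic", "Independent",
              "Democratic", "Independent"],
})

# Option 1: combine boolean arrays with | (each comparison needs parentheses).
with_or = elections[(elections["Party"] == "Republican")
                    | (elections["Party"] == "Democratic")]

# Option 2: test membership in a list of valid items.
with_isin = elections[elections["Party"].isin(["Republican", "Democratic"])]

print(with_isin)
```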
Baby Names Case Study Q2
Baby Names
Let’s try solving another real world problem using the baby names dataset: What was the most popular name in every state in every year and for every labeled gender?
Head to 03-case-study.ipynb.
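One possible approach to this question, not necessarily the one taken in the notebook, is to sort by count and then take the first row of each (state, year, sex) group. The sketch below uses a tiny invented table; the column names `State`, `Sex`, `Year`, `Name`, and `Count` and all of the counts are assumptions for illustration only.

```python
import pandas as pd

# Invented toy rows; real column names and counts may differ.
babynames = pd.DataFrame({
    "State": ["CA", "CA", "CA", "CA", "TX", "TX"],
    "Sex": ["F", "F", "M", "M", "F", "F"],
    "Year": [2000, 2000, 2000, 2000, 2000, 2000],
    "Name": ["Emily", "Hannah", "Daniel", "Jacob", "Emily", "Hannah"],
    "Count": [30, 20, 25, 15, 10, 12],
})

# Sort so the most common name comes first within each group,
# then keep the first row of every (State, Year, Sex) group.
top = (babynames.sort_values("Count", ascending=False)
       .groupby(["State", "Year", "Sex"])
       .first())
print(top)
```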
Practice Exercises on Enrollment Data
(see 03-enrollment)
Spring Enrollment Data
Suppose we have a DataFrame called df that contains all Spring offerings of courses in several departments offered at Berkeley between 2012 and 2018.
A quick look at pivot
Pivot Tables
You’ve already seen pivot tables in Data 8.
Let’s talk about how to do this basic case in pandas. We may discuss more advanced uses of pivot tables later.
Pivot Tables
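Original rows (R, value, C), in order:
A 3 U
B 1 V
C 4 U
A 1 V
B 5 U
C 9 V
A 2 U
D 5 U
B 6 V
After grouping by (R, C):
(A, U): 3, 2
(A, V): 1
(B, U): 5
(B, V): 1, 6
(C, U): 4
(C, V): 9
(D, U): 5
After applying f to each group:
A U 5
A V 1
B U 5
B V 7
C U 4
C V 9
D U 5
Resulting pivot table (rows from R, columns from C):
    U    V
A   5    1
B   5    7
C   4    9
D   5    NaN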
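The pivot table illustrated above can be sketched with `pivot_table`. The column names `R`, `C`, and `num` are assumptions (R and C follow the figure's row and column labels), and f is taken to be sum because that matches the numbers in the figure. The (D, V) cell is NaN because no row has that combination.

```python
import pandas as pd

# The nine rows from the illustration; column names are hypothetical.
df = pd.DataFrame({
    "R": ["A", "B", "C", "A", "B", "C", "A", "D", "B"],
    "num": [3, 1, 4, 1, 5, 9, 2, 5, 6],
    "C": ["U", "V", "U", "V", "U", "V", "U", "U", "V"],
})

# Rows come from R, columns from C, and each cell is the sum of num
# over the rows sharing that (R, C) pair.
table = df.pivot_table(index="R", columns="C", values="num", aggfunc="sum")
print(table)
```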
Baby Names Case Study Q3
Baby Names
Let’s try solving another real world problem using the baby names dataset: Can we deduce a person’s birth sex from the last letter of their name?
Attendance question link: www.yellkey.com/every
Head to 03-case-study.ipynb.
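One way to start on this question, offered as a sketch rather than the notebook's solution, is to extract each name's last letter and pivot the counts by letter and sex. The table below is invented toy data, and the column names `Name`, `Sex`, and `Count` are assumptions.

```python
import pandas as pd

# Invented toy rows; real column names and counts may differ.
babynames = pd.DataFrame({
    "Name": ["Emma", "Olivia", "Noah", "Liam", "Ava", "Ethan"],
    "Sex": ["F", "F", "M", "M", "F", "M"],
    "Count": [100, 90, 95, 85, 80, 70],
})

# Last letter of each name via the .str accessor.
babynames["Last"] = babynames["Name"].str[-1]

# Total count for each (last letter, sex) pair; missing pairs become 0.
by_letter = babynames.pivot_table(index="Last", columns="Sex",
                                  values="Count", aggfunc="sum",
                                  fill_value=0)

# Fraction of babies with each last letter who were labeled F.
frac_female = by_letter["F"] / (by_letter["F"] + by_letter["M"])
print(frac_female)
```

A fraction near 0 or 1 would suggest that the letter is informative, while a fraction near 0.5 would suggest that it is not.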