JavaScript isn't enabled in your browser, so this file can't be opened. Enable and reload.

1 of 29

Pre-Announcement

If you want to do your own pre-announcement, email me.

Alan here from ULAB is here.
Organization that bring together people with skills with people who don’t to help freshman and sophomores get involved with doing cool STEM things.
Looking for mentors who have taken Data 8 or upper division CS units.
Can get upper div CS units, up to 4.
ulab.berkeley.edu

2 of 29

Announcements

For consistency with 61A and 61B, I will start using the term “method” everywhere to return to a function that belongs to a class.

Example: dog.bark() indicates bark is a method of the dog class.
Also correct to say dog.bark() indicates bark is a function of the dog class, but I’ll be using “method” instead.

There is a live lecture Piazza thread:

Use it to ask any questions you don’t want to ask me.
It’s post #196.

Attendance will be taken today!

3 of 29

DS100: Fall 2018

Lecture 3 (Josh Hug): Intro to Pandas, Part II

Groupby for Aggregation
Groupby (and isin) for Filtering
Pivot Tables
More Baby Names Case Studies

datastructur.es

4 of 29

Goals For Today

Discuss aggregation operations:

Groupby
Pivot

Identifying/dealing with ugly data.

Operations for identifying with null or strange values.
Operations for modifying data frames.

More case study.

5 of 29

Groupby

(and isin)

6 of 29

groupby

Often we want to perform aggregate analysis across data points that share some feature, for example:

What was the average share of the vote across all U.S. elections for each political party?
What was the size of the average class in each department at Berkeley in each term?
Which instructors taught the largest classes at Berkeley and how large were they?

groupby is an incredibly powerful tool for these sorts of questions.

�

7 of 29

groupby Demo

See 03_groupby_basics.ipynb

8 of 29

groupby Key Concepts

If we call groupby on a Series:

The resulting output is a SeriesGroupBy object.
The Series that are passed as arguments to groupby must share an index with the calling Series.

SeriesGroupBy objects can then be aggregated back into a Series using an aggregation method.

9 of 29

groupby Key Concepts

If we call groupby on a DataFrame:

The resulting output is a DataFrameGroupBy object.

DataFrameGroupBy objects can then be aggregated back into a DataFrame or a Series using an aggregation method.

10 of 29

groupby and agg

Most of the built-in handy aggregation methods are just shorthand for a universal aggregation method called agg.

Example, .mean() is just .agg(np.mean).

11 of 29

Series groupby/agg Summary

groupby

.agg(f), where f = sum

12 of 29

DataFrame groupby/agg Summary

groupby

.agg(f), where f = max

13 of 29

The MultiIndex

If we group a Series (or DataFrame) by multiple Series and then perform an aggregation operation, the resulting Series (or Dataframe) will have a MultiIndex.

The resulting DataFrame has:

Two columns “%” and “Year”
A MultiIndex, where results of aggregate function are indexed by Party first, then Result.

14 of 29

Filtering by Group

Another common use for groups is to filter data.

filter takes an argument f.
f is a function that:

Takes a DataFrame as input.
Returns either true or false.

For each group g, f is applied to the subframe comprised of the rows from the original dataframe corresponding to that group.

15 of 29

Series groupby/filter Summary

groupby

.filter(f), where

f = lambda sf: sf[“num”].sum() > 10

16 of 29

isin

We saw last time how to build boolean arrays for filtering, e.g.

If we have a list of valid items, e.g. “Republican” or “Democratic”, we could use the | operator (| means or), but a better way is to use isin.

Ugly:
Better:

17 of 29

Baby Names Case Study Q2

18 of 29

Baby Names

Let’s try solving another real world problem using the baby names dataset: What was the most popular name in every state in every year and for every labeled gender?

Spoiler alert, we will build a MultiIndexed DataFrame where the data column is “count”.
MultiIndex will be by state, year, and gender.

Head to 03-case-study.ipynb.

As with the previous case study, we’ll use some stuff that we haven’t formally learned (e.g. combining multiple dataframes that are stored across multiple files, %%time)!

19 of 29

Practice Exercises on Enrollment Data

(see 03-enrollment)

20 of 29

Spring Enrollment Data