1 of 29

Pre-Announcement

If you want to do your own pre-announcement, email me.

  • Alan here from ULAB is here.
  • Organization that bring together people with skills with people who don’t to help freshman and sophomores get involved with doing cool STEM things.
  • Looking for mentors who have taken Data 8 or upper division CS units.
  • Can get upper div CS units, up to 4.
  • ulab.berkeley.edu

2 of 29

Announcements

For consistency with 61A and 61B, I will start using the term “method” everywhere to return to a function that belongs to a class.

  • Example: dog.bark() indicates bark is a method of the dog class.
  • Also correct to say dog.bark() indicates bark is a function of the dog class, but I’ll be using “method” instead.

There is a live lecture Piazza thread:

  • Use it to ask any questions you don’t want to ask me.
  • It’s post #196.

Attendance will be taken today!

3 of 29

DS100: Fall 2018

Lecture 3 (Josh Hug): Intro to Pandas, Part II

  • Groupby for Aggregation
  • Groupby (and isin) for Filtering
  • Pivot Tables
  • More Baby Names Case Studies

4 of 29

Goals For Today

Goals For Today

  • Discuss aggregation operations:
    • Groupby
    • Pivot
  • Identifying/dealing with ugly data.
    • Operations for identifying with null or strange values.
    • Operations for modifying data frames.
  • More case study.

5 of 29

Groupby

(and isin)

6 of 29

groupby

Often we want to perform aggregate analysis across data points that share some feature, for example:

  • What was the average share of the vote across all U.S. elections for each political party?
  • What was the size of the average class in each department at Berkeley in each term?
  • Which instructors taught the largest classes at Berkeley and how large were they?

groupby is an incredibly powerful tool for these sorts of questions.

7 of 29

groupby Demo

See 03_groupby_basics.ipynb

8 of 29

groupby Key Concepts

If we call groupby on a Series:

  • The resulting output is a SeriesGroupBy object.
  • The Series that are passed as arguments to groupby must share an index with the calling Series.

SeriesGroupBy objects can then be aggregated back into a Series using an aggregation method.

9 of 29

groupby Key Concepts

If we call groupby on a DataFrame:

  • The resulting output is a DataFrameGroupBy object.

DataFrameGroupBy objects can then be aggregated back into a DataFrame or a Series using an aggregation method.

10 of 29

groupby and agg

Most of the built-in handy aggregation methods are just shorthand for a universal aggregation method called agg.

  • Example, .mean() is just .agg(np.mean).

11 of 29

Series groupby/agg Summary

A

3

B

1

C

4

A

1

B

5

C

9

A

2

D

5

B

6

A

3

A

1

A

2

B

1

B

5

B

6

C

4

C

9

D

5

A

6

B

12

C

13

groupby

.agg(f), where f = sum

D

5

12 of 29

DataFrame groupby/agg Summary

A

3

B

1

C

4

A

1

B

5

C

9

A

2

C

5

B

6

A

3

A

1

A

2

B

1

B

5

B

6

C

4

C

9

C

5

A

3

B

6

C

9

groupby

.agg(f), where f = max

ak

tx

fl

hi

mi

ak

ca

sd

nc

ak

hi

ca

tx

mi

nc

fl

ak

sd

hi

tx

sd

13 of 29

The MultiIndex

If we group a Series (or DataFrame) by multiple Series and then perform an aggregation operation, the resulting Series (or Dataframe) will have a MultiIndex.

The resulting DataFrame has:

  • Two columns “%” and “Year”
  • A MultiIndex, where results of aggregate function are indexed by Party first, then Result.

14 of 29

Filtering by Group

Another common use for groups is to filter data.

  • filter takes an argument f.
  • f is a function that:
    • Takes a DataFrame as input.
    • Returns either true or false.
  • For each group g, f is applied to the subframe comprised of the rows from the original dataframe corresponding to that group.

15 of 29

Series groupby/filter Summary

A

3

B

1

C

4

A

1

B

5

C

9

A

2

D

5

B

6

A

3

A

1

A

2

B

1

B

5

B

6

C

4

C

9

D

5

groupby

.filter(f), where

f = lambda sf: sf[“num”].sum() > 10

B

1

C

4

B

5

C

9

B

6

12

13

6

5

16 of 29

isin

We saw last time how to build boolean arrays for filtering, e.g.

If we have a list of valid items, e.g. “Republican” or “Democratic”, we could use the | operator (| means or), but a better way is to use isin.

  • Ugly:
  • Better:

17 of 29

Baby Names Case Study Q2

18 of 29

Baby Names

Let’s try solving another real world problem using the baby names dataset: What was the most popular name in every state in every year and for every labeled gender?

  • Spoiler alert, we will build a MultiIndexed DataFrame where the data column is “count”.
  • MultiIndex will be by state, year, and gender.

Head to 03-case-study.ipynb.

  • As with the previous case study, we’ll use some stuff that we haven’t formally learned (e.g. combining multiple dataframes that are stored across multiple files, %%time)!

19 of 29

Practice Exercises on Enrollment Data

(see 03-enrollment)

20 of 29

Spring Enrollment Data

Suppose we have a DataFrame called df that contains all Spring offerings of courses in several departments offered at Berkeley between 2012 and 2018.

  • Warmup: How would you find all offerings of this class? For reference, our course number is C100.

21 of 29

Spring Enrollment Data

Suppose we have a DataFrame called df that contains all Spring offerings of courses in several departments offered at Berkeley between 2012 and 2018.

  • Warmup: How would you find all offerings of this class? For reference, our course number is C100.
  • Answer: df[df["Number"] == "C100"]

22 of 29

Spring Enrollment Data

Suppose we have a DataFrame called df that contains all Spring offerings of courses in several departments offered at Berkeley between 2012 and 2018.

  • Challenge 2: Create a series where each row correspond to one subject (e.g. English), and each value corresponds to the average number of students for courses in that subject. For example, your series might have a row saying that the average number of students in a Computer Science class is 88.
  • Answer: See 03-enrollment.ipynb

23 of 29

Spring Enrollment Data

Suppose we have a DataFrame called df that contains all Spring offerings of courses in several departments offered at Berkeley between 2012 and 2018.

  • Challenge 3: Create a multi-indexed series where each row corresponds to one subject (e.g. English) offered during one semester (e.g. Spring 2017), and each value corresponds to the maximum number of students for courses in that subject during that semester. For example, you might have a row saying that the maximum number of students in a computer science course during Spring 2012 was 575.
  • Answer: See 03-enrollment.ipynb

24 of 29

Spring Enrollment Data

Suppose we have a DataFrame called df that contains all Spring offerings of courses in several departments offered at Berkeley between 2012 and 2018.

  • Challenge 4: Try to compute the size of the largest class ever taught by each instructor. This challenge is stated more vaguely on purpose. You'll have to decide what the data structure looks like. Your result should be sorted in decreasing order of class size.
  • Answer: See 03-enrollment.ipynb

25 of 29

A quick look at pivot

26 of 29

Pivot Tables

You’ve already seen pivot tables in data 8.

  • Most basic usage: Working with data that has been classified according to two variables.

Let’s talk about how to do this basic case in pandas. We may discuss more advanced uses of pivot tables later.

27 of 29

Pivot Tables

A

3

B

1

C

4

A

1

B

5

C

9

A

2

D

5

B

6

U

V

U

V

U

V

U

U

V

A

3

A

2

U

U

A

1

V

B

5

U

B

1

B

6

V

V

C

4

U

C

9

V

D

5

U

A

5

U

A

1

V

B

5

U

B

7

V

C

4

U

C

9

V

D

5

U

A

5

B

5

C

4

D

5

1

7

9

NaN

U

V

...

R

C

group

f

f

f

f

f

f

f

28 of 29

Baby Names Case Study Q3

29 of 29

Baby Names

Let’s try solving another real world problem using the baby names dataset: Can we deduce a person’s birth sex from the last letter of their name?

Attendance question link: www.yellkey.com/every

  • What last letter do you think is most indicative of birth sex?

Head to 03-case-study.ipynb.

  • As with the previous case study, we’ll use some stuff that we haven’t formally learned (e.g. visualization)!