1 of 32

Lecture 3

Data Tables, Indexes, pandas

DS 100

Fall 2017

Slides created by Sam Lau (samlau95@berkeley.edu)

2 of 32

Announcements

3 of 32

Last Time...

4 of 32

Can Big Data Make Up for the Lack of a Simple Random Sample (SRS)?

5 of 32

Chance that 1st-born in DS100

6 of 32

5th yr MS student in CS

CS Edu with Josh Hug
Interaction + Learning

8th time as TA

2nd time for DS100

Instructor for Data 8 this past summer

7 of 32

Where we are

8 of 32

Data Science Lifecycle

Ask question(s)

Obtain data

Understand the data

Understand the world

9 of 32

Data Science Lifecycle

Ask question(s): Your brain

Obtain data: The Internet

Understand the data: pandas and EDA

Understand the world: Inference and prediction

10 of 32

Today: pandas

http://pandas.pydata.org/

11 of 32

How this lecture will work

Using the dataset of baby names, we will...

Ask questions

Break down each question into steps

Learn the pandas knowledge needed for each step

12 of 32

What you will learn

Data manipulation in pandas

Sorting, filtering, grouping, pivot tables

Data visualization in pandas and seaborn

Bar charts, histograms, scatter plots

Prior knowledge of all concepts assumed!

~3 weeks of Data 8 in 1.5 hours
Practical, not conceptual

13 of 32

You won’t remember everything, but...

14 of 32

Getting the data

(Demo)

15 of 32

Question 1:

What was the most popular name in CA last year?

16 of 32

Always have high-level steps

Read in the data for CA: Table.read_table

Keep only year 2016: Table.where

Sort rows by count: Table.sort

17 of 32

In pandas

Read in the data for CA: pd.read_csv

Keep only year 2016: Slicing

Sort rows by count: df.sort_values

(Demo)
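A minimal sketch of the three steps for Question 1, assuming columns named State, Sex, Year, Name, and Count. The rows are made-up toy values, not real counts, and an in-memory string stands in for the CA file.

```python
import io

import pandas as pd

# Made-up toy rows, NOT real counts; the column names are assumptions.
csv_text = """State,Sex,Year,Name,Count
CA,F,2015,Alice,30
CA,F,2016,Alice,25
CA,M,2016,Bob,40
CA,F,2016,Carol,10
"""

# Step 1: read in the data (a real file path would replace the StringIO).
ca = pd.read_csv(io.StringIO(csv_text))

# Step 2: keep only year 2016 by slicing with a boolean condition.
ca_2016 = ca[ca["Year"] == 2016]

# Step 3: sort rows by count, largest first.
by_count = ca_2016.sort_values("Count", ascending=False)

# The first row after sorting holds the most popular name.
most_popular = by_count.iloc[0]["Name"]
print(most_popular)
```

With real data, only the argument to pd.read_csv should need to change, provided the file uses the same column names.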

18 of 32

Recap

pd.read_csv(...) => DataFrame

DataFrame is like the Data 8 Table
Series is like a NumPy array with an index (one column of a DataFrame)

Slice DFs by label or by position

df.loc and df.iloc
DF index is a label for each row, used for slicing

df.sort_values(...) like Table.sort
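A small sketch of the difference between slicing by label and slicing by position. The DataFrame, its index labels, and its values are made up for illustration.

```python
import pandas as pd

# Toy data with made-up counts; names and values are illustrative only.
df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Carol"], "Count": [25, 40, 10]},
    index=["a", "b", "c"],  # the index gives each row a label
)

# df.loc slices by label: the row labeled "b", column "Name".
by_label = df.loc["b", "Name"]

# df.iloc slices by position: second row (1), first column (0).
by_position = df.iloc[1, 0]

# Label slices include both endpoints; position slices exclude the stop.
label_slice = df.loc["a":"b"]   # rows "a" and "b"
position_slice = df.iloc[0:2]   # rows at positions 0 and 1

# Selecting a single column returns a Series that keeps the index.
counts = df["Count"]

# Sorting reorders the rows, but each row keeps its label.
sorted_df = df.sort_values("Count", ascending=False)
```

After sorting, df.loc["a"] still returns the Alice row, while df.iloc[0] now returns whichever row is first in the new order.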

19 of 32

Question 2:

What were the most popular names in each state for each year?

20 of 32

Break it down

Put all DFs together: pd.concat

Group by state and year: df.groupby

(Demo)
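One possible way to carry out the two steps for Question 2, sketched on made-up toy rows. The column names are assumptions, and sorting before taking the first row of each group is just one of several approaches that would work.

```python
import pandas as pd

# Made-up toy rows standing in for per-state data; NOT real counts.
ca = pd.DataFrame({"State": "CA", "Year": [2015, 2015, 2016, 2016],
                   "Name": ["Alice", "Bob", "Alice", "Bob"],
                   "Count": [30, 20, 25, 40]})
ny = pd.DataFrame({"State": "NY", "Year": [2015, 2015, 2016, 2016],
                   "Name": ["Alice", "Bob", "Alice", "Bob"],
                   "Count": [12, 18, 22, 9]})

# Step 1: put all DataFrames together into one.
babynames = pd.concat([ca, ny], ignore_index=True)

# Step 2: sort by count so the most popular name comes first,
# then group by state and year and take the first row of each group.
most_popular = (
    babynames.sort_values("Count", ascending=False)
    .groupby(["State", "Year"])
    .first()
)
print(most_popular)
```

The result has one row per (State, Year) pair, holding the name with the highest count in that group.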

21 of 32

Recap

glob(...)

Returns list of files matching pattern

df.groupby(...).agg(...)

Groups by one or more columns, applying an aggregate function to each group

df.groupby(...).sum() # or .max(), etc.

Shorthand for df.groupby(...).agg(np.sum)
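A sketch tying the recap together, assuming one CSV file per state. The files are written to a temporary directory first so that glob has something to match; the file names, columns, and counts are all hypothetical. The string "sum" is passed to agg in place of np.sum because some recent pandas versions warn about the latter.

```python
import glob
import os
import tempfile

import pandas as pd

# Write two tiny made-up files so that glob has something to match.
# The file names and column names here are hypothetical.
tmp_dir = tempfile.mkdtemp()
for state, count in [("CA", 25), ("NY", 22)]:
    toy = pd.DataFrame({"State": [state, state], "Year": [2016, 2016],
                        "Name": ["Alice", "Bob"],
                        "Count": [count, count + 5]})
    toy.to_csv(os.path.join(tmp_dir, f"{state}.csv"), index=False)

# glob returns a list of file paths matching the pattern.
paths = sorted(glob.glob(os.path.join(tmp_dir, "*.csv")))

# Read every file and stack the results into one DataFrame.
babynames = pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)

# Group by one column and apply an aggregate function to each group.
totals_agg = babynames.groupby("State")["Count"].agg("sum")

# The method form is shorthand for the same aggregation.
totals_sum = babynames.groupby("State")["Count"].sum()

# agg also accepts several functions at once.
summary = babynames.groupby("State")["Count"].agg(["sum", "max"])
```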

22 of 32

When do I need to group?

Do I need to count the times each value appears?

Do I need to aggregate values together?

Am I looping through a column’s unique values?
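To illustrate the last question, here is a sketch comparing a manual loop over unique values with the equivalent groupby call. The rows are made up.

```python
import pandas as pd

# Made-up toy rows.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Carol", "Bob", "Alice"],
    "Count": [5, 3, 7, 2, 4, 1],
})

# A manual loop over a column's unique values...
loop_totals = {}
for name in df["Name"].unique():
    loop_totals[name] = df.loc[df["Name"] == name, "Count"].sum()

# ...is a sign that groupby can do the same work in one line.
grouped_totals = df.groupby("Name")["Count"].sum()

# Counting how often each value appears is also a grouping operation.
appearances = df.groupby("Name").size()
```

Both versions produce the same totals; the groupby version is shorter and leaves the looping to pandas.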

23 of 32

Question 3:

Can I deduce gender from the last letter of a person’s name?

24 of 32

Survey Question

Which last letter is most indicative of a person’s gender?

bit.ly/ds100-2

g
m
t
z
e
This is a trick question, Sam!

25 of 32

Break it down

Compute last letter of each name: series.str

Group by last letter: df.groupby

Visualize distribution: df.plot

(Demo)
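A minimal sketch of the first two steps for Question 3, assuming columns named Name, Sex, and Count. The counts are made up, and the plotting step is left out here.

```python
import pandas as pd

# Toy rows with made-up counts; the column names are assumptions.
babynames = pd.DataFrame({
    "Name": ["Anna", "Emma", "Noah", "Ethan", "Luca", "Ryan"],
    "Sex": ["F", "F", "M", "M", "M", "M"],
    "Count": [10, 20, 15, 5, 3, 7],
})

# Step 1: compute the last letter of each name with the .str accessor.
babynames["last_letter"] = babynames["Name"].str[-1]

# Step 2: group by last letter and sex, adding up the counts.
letter_counts = babynames.groupby(["last_letter", "Sex"])["Count"].sum()
print(letter_counts)
```

Grouping by both last letter and sex keeps the two sexes separate, which is what makes it possible to compare them letter by letter.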

26 of 32

Recap

series.str

To use string methods
Use series.apply when you need flexibility

df.pivot_table(...)

Computes a pivot table

df.plot

To use plotting methods
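A short sketch of series.str, series.apply, and df.pivot_table on made-up toy rows. The helper vowel_count is hypothetical, included only to show a case where apply is convenient.

```python
import pandas as pd

# Toy rows with made-up counts; the column names are assumptions.
babynames = pd.DataFrame({
    "Name": ["Anna", "Emma", "Noah", "Ethan", "Luca", "Ryan"],
    "Sex": ["F", "F", "M", "M", "M", "M"],
    "Count": [10, 20, 15, 5, 3, 7],
})

# series.str exposes vectorized string methods.
babynames["last_letter"] = babynames["Name"].str[-1]


def vowel_count(name):
    """Hypothetical helper: count the vowels in one name."""
    return sum(ch in "aeiou" for ch in name.lower())


# series.apply runs an arbitrary Python function on each value,
# which helps when no built-in string method fits.
babynames["vowels"] = babynames["Name"].apply(vowel_count)

# pivot_table: one row per last letter, one column per sex,
# with counts summed and missing combinations filled with 0.
pivot = babynames.pivot_table(
    index="last_letter", columns="Sex", values="Count",
    aggfunc="sum", fill_value=0,
)
```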

27 of 32

When do I need to pivot?

Am I grouping by two columns...

And do I want the resulting table to be easier to read?

Or, am I using pandas plotting on the groups?
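A sketch of both reasons to pivot, using made-up counts. It assumes matplotlib is installed and selects the off-screen Agg backend so that no display is needed.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: render off-screen, no display needed

import pandas as pd

# Made-up toy counts in long form.
letters = pd.DataFrame({
    "last_letter": ["a", "a", "n", "n", "h"],
    "Sex": ["F", "M", "F", "M", "M"],
    "Count": [30, 3, 2, 12, 15],
})

# Grouping by two columns gives a long result with a two-level index.
grouped = letters.groupby(["last_letter", "Sex"])["Count"].sum()

# A pivot table holds the same totals in a grid that is easier to read.
pivot = letters.pivot_table(index="last_letter", columns="Sex",
                            values="Count", aggfunc="sum", fill_value=0)

# pandas plotting draws one set of bars per column of the pivot table.
ax = pivot.plot(kind="bar")
ax.set_ylabel("Count")
```

The pivoted layout puts each sex in its own column, so the bar chart shows F and M side by side for every last letter.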

28 of 32

Seaborn

http://seaborn.pydata.org/index.html

29 of 32

Seaborn

Statistical data visualization

Has common plots with some bonus features

And some fancier plots too

Works well with pandas DataFrames

sns.pairplot(df, hue="species")

30 of 32

How to Seaborn

DataFrame should ideally be in long-form (not grouped)

Most Seaborn methods work like this:
sns.barplot(x=..., y=..., hue=..., data=df)

(Demo)
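A minimal sketch of the call pattern above on a long-form DataFrame. The column names and counts are made up, and the example assumes seaborn and matplotlib are installed.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: render off-screen, no display needed

import pandas as pd
import seaborn as sns

# Long-form data: one row per observation, not grouped or pivoted.
# The counts are made up for illustration.
letters = pd.DataFrame({
    "last_letter": ["a", "a", "n", "n", "h"],
    "Sex": ["F", "M", "F", "M", "M"],
    "Count": [30, 3, 2, 12, 15],
})

# x and y name columns of the DataFrame; hue splits the bars by a third column.
ax = sns.barplot(x="last_letter", y="Count", hue="Sex", data=letters)
```

Here Seaborn does the splitting by Sex itself, which is why the data can stay in long form instead of being pivoted first.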

31 of 32

Recap

Pandas for tabular data manipulation

Slicing for row/column selection
Group with df.groupby
Pivot with df.pivot_table
Join with pd.merge (covered in lab next week)
df.plot for basic plots

Seaborn for statistical plots

Reference the docs for available methods

32 of 32

Use the docs!

And Google.