1 of 32

Lecture 3

Data Tables, Indexes, pandas

DS 100

Fall 2017

Slides created by Sam Lau (samlau95@berkeley.edu)

2 of 32

Announcements

3 of 32

Last Time...

4 of 32

Can Big Data Account for no SRS?

5 of 32

Chance that 1st-born in DS100

6 of 32

Me

  • 5th yr MS student in CS
    • CS Edu with Josh Hug
    • Interaction + Learning

  • 8th time as TA
    • 2nd time for DS100

  • Instructor for Data 8 this past summer

7 of 32

Where we are

8 of 32

Data Science Lifecycle

  • Ask question(s)

  • Obtain data

  • Understand the data

  • Understand the world

9 of 32

Data Science Lifecycle

  • Ask question(s)
    • Your brain

  • Obtain data
    • The Internet

  • Understand the data
    • pandas and EDA

  • Understand the world
    • Inference and prediction

10 of 32

Today: pandas

11 of 32

How this lecture will work

  • Using the dataset of baby names, we will...
    • Ask questions
    • Break down each question into steps
    • Learn the pandas knowledge needed for each step

12 of 32

What you will learn

  • Data manipulation in pandas
    • Sorting, filtering, grouping, pivot tables

  • Data visualization in pandas and seaborn
    • Bar charts, histograms, scatter plots

  • Prior knowledge of all concepts assumed!
    • ~3 weeks of Data 8 in 1.5 hours
    • Practical, not conceptual

13 of 32

You won’t remember everything, but...

14 of 32

Getting the data

(Demo)

15 of 32

Question 1:

What was the most popular name in CA last year?

16 of 32

Always have high-level steps

  1. Read in the data for CA
    • Table.read_table

  2. Keep only year 2016
    • Table.where

  3. Sort rows by count
    • Table.sort

17 of 32

In pandas

  • Read in the data for CA
    • pd.read_csv

  • Keep only year 2016
    • Slicing

  • Sort rows by count
    • df.sort_values

(Demo)
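The three steps above can be sketched roughly as follows. This is a minimal sketch, not the demo itself: the rows are made up, and the column names (State, Sex, Year, Name, Count) are an assumption about how the baby names file is laid out.

```python
import io

import pandas as pd

# Made-up rows standing in for the CA baby names file. The column names
# are an assumption about the file layout, not taken from the slides.
csv_text = """State,Sex,Year,Name,Count
CA,F,2015,Alice,90
CA,F,2016,Alice,100
CA,M,2016,Bob,120
CA,F,2016,Carol,80
"""

# 1. Read in the data (a real run would pass a file path instead).
ca = pd.read_csv(io.StringIO(csv_text))

# 2. Keep only year 2016 by slicing with a boolean condition.
ca_2016 = ca[ca["Year"] == 2016]

# 3. Sort rows by count, largest first.
by_count = ca_2016.sort_values("Count", ascending=False)

# The first row after sorting holds the most popular name.
most_popular = by_count.iloc[0]["Name"]
print(most_popular)
```

Sorting in descending order puts the answer in the first row, so a positional lookup with iloc is enough to read it off.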

18 of 32

Recap

  • pd.read_csv(...) => DataFrame
    • DataFrame is like the Data 8 Table
    • Series is like a NumPy array

  • Slice DFs by label or by position
    • df.loc and df.iloc
    • DF index is a label for each row, used for slicing

  • df.sort_values(...) like Table.sort
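A small sketch of the difference between label-based and position-based slicing may help here. The DataFrame and its row labels ("a", "b", "c") are hypothetical and chosen only to make the two styles easy to tell apart.

```python
import pandas as pd

# Hypothetical data with string row labels so labels and positions differ.
df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Carol"], "Count": [100, 120, 80]},
    index=["a", "b", "c"],
)

by_label = df.loc["b", "Count"]   # row labeled "b", column "Count"
by_position = df.iloc[1, 1]       # second row, second column

label_slice = df.loc["a":"b"]     # label slices include both endpoints
position_slice = df.iloc[0:2]     # position slices exclude the stop

name_column = df["Name"]          # a single column comes back as a Series

sorted_df = df.sort_values("Count")  # ascending by default
```

Note that the index labels travel with the rows when sorting, so sorted_df still has the labels "c", "a", "b" attached to the same data.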

19 of 32

Question 2:

What were the most popular names in each state for each year?

20 of 32

Break it down

  • Put all DFs together
    • pd.concat

  • Group by state and year
    • df.groupby

(Demo)
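One way these two steps might fit together is sketched below. The per-state DataFrames are built inline from made-up rows (in practice they would come from reading one file per state), and using idxmax to pick the top row of each group is one option among several, not necessarily what the demo does.

```python
import pandas as pd

# Made-up per-state tables; column names are assumptions for illustration.
ca = pd.DataFrame({
    "State": ["CA", "CA", "CA"],
    "Year": [2015, 2016, 2016],
    "Name": ["Alice", "Alice", "Bob"],
    "Count": [90, 100, 120],
})
ny = pd.DataFrame({
    "State": ["NY", "NY", "NY"],
    "Year": [2015, 2015, 2016],
    "Name": ["Carol", "Dan", "Carol"],
    "Count": [70, 60, 50],
})

# Put all DFs together; ignore_index gives the result fresh row labels 0..5.
all_states = pd.concat([ca, ny], ignore_index=True)

# For each (State, Year) group, find the row label with the largest Count.
top_rows = all_states.groupby(["State", "Year"])["Count"].idxmax()

# Look those rows up by label to get the most popular name per group.
most_popular = all_states.loc[top_rows.to_numpy()]
print(most_popular)
```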

21 of 32

Recap

  • glob(...)
    • Returns list of files matching pattern

  • df.groupby(...).agg(...)
    • Groups by one or more columns, applying an aggregate function to each group

  • df.groupby(...).sum() # or .max(), etc.
    • Shorthand for df.groupby(...).agg(np.sum)
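The equivalence between .agg(...) and the shorthand methods can be checked on a toy table. The data below is made up, and the sketch passes the string "sum" to agg rather than np.sum, since recent pandas versions appear to prefer the string form.

```python
import pandas as pd

# Made-up rows; column names are assumptions for illustration.
df = pd.DataFrame({
    "State": ["CA", "CA", "NY", "NY"],
    "Year": [2015, 2016, 2016, 2016],
    "Count": [90, 120, 70, 50],
})

# Explicit aggregate function on each group...
totals_agg = df.groupby("State")["Count"].agg("sum")

# ...and the shorthand method, which gives the same result.
totals_short = df.groupby("State")["Count"].sum()

# Grouping by more than one column works the same way.
max_by_two = df.groupby(["State", "Year"])["Count"].max()
```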

22 of 32

When do I need to group?

  • Do I need to count the times each value appears?

  • Do I need to aggregate values together?

  • Am I looping through a column’s unique values?
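To illustrate the last question, here is a sketch of a loop over a column's unique values next to the groupby call that replaces it. The data is made up for the comparison.

```python
import pandas as pd

# Made-up rows; column names are assumptions for illustration.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol", "Dan"],
    "Sex": ["F", "M", "F", "M"],
    "Count": [100, 120, 80, 50],
})

# The pattern that hints at a group: looping through unique values.
loop_totals = {}
for sex in df["Sex"].unique():
    loop_totals[sex] = df.loc[df["Sex"] == sex, "Count"].sum()

# The same aggregation written as a group.
group_totals = df.groupby("Sex")["Count"].sum()

# Counting how many times each value appears is also a grouping operation.
appearances = df["Sex"].value_counts()
```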

23 of 32

Question 3:

Can I deduce gender from the last letter of a person’s name?

24 of 32

Survey Question

Which last letter is most indicative of a person’s gender?

bit.ly/ds100-2

  1. g
  2. m
  3. t
  4. z
  5. e
  6. This is a trick question, Sam!

25 of 32

Break it down

  • Compute last letter of each name
    • series.str

  • Group by last letter
    • df.groupby

  • Visualize distribution
    • df.plot

(Demo)

26 of 32

Recap

  • series.str
    • To use string methods
    • Use series.apply when you need flexibility

  • df.pivot_table(...)
    • Computes a pivot table

  • df.plot
    • To use plotting methods
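As a rough illustration of the first point, the sketch below computes the last letter of each name with series.str and again with series.apply. The names are made up.

```python
import pandas as pd

names = pd.Series(["Alice", "Bob", "Carol"])  # made-up names

# series.str exposes string operations, including indexing, element-wise.
last_letter = names.str[-1]

# series.apply runs an arbitrary Python function on each element.
last_letter_apply = names.apply(lambda name: name[-1])

# Other string methods follow the same pattern.
upper = names.str.upper()
```

Both approaches give the same result here; apply is the fallback when no built-in string method does what is needed.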

27 of 32

When do I need to pivot?

  • Am I grouping by two columns...

  • And do I want the resulting table to be easier to read?

  • Or, am I using pandas plotting on the groups?
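A sketch of the same two-column aggregation written as a group and as a pivot table, using made-up counts by last letter and sex, shows the difference in shape.

```python
import pandas as pd

# Made-up rows; column names are assumptions for illustration.
df = pd.DataFrame({
    "Last": ["a", "a", "n", "n", "a"],
    "Sex": ["F", "M", "F", "M", "F"],
    "Count": [100, 5, 10, 60, 20],
})

# Grouping by two columns gives one long Series with a two-level index.
grouped = df.groupby(["Last", "Sex"])["Count"].sum()

# The pivot table holds the same numbers with one row per last letter
# and one column per sex, which is easier to read.
pivoted = df.pivot_table(
    index="Last", columns="Sex", values="Count", aggfunc="sum"
)

# With matplotlib installed, pivoted.plot.bar() would draw one bar
# per column for each last letter.
print(pivoted)
```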

28 of 32

Seaborn

29 of 32

Seaborn

  • Statistical data visualization

  • Has common plots with some bonus features
    • And some fancier plots too

  • Works well with pandas DataFrames

sns.pairplot(df, hue="species")

30 of 32

How to Seaborn

  • DataFrame should ideally be in long-form (not grouped)

  • Most Seaborn methods work like this:
    • sns.barplot(x=..., y=..., hue=..., data=df)

(Demo)
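A minimal sketch of the call pattern on a long-form DataFrame follows. The data is made up, and the non-interactive "Agg" backend is selected only so the example runs without a display; neither is part of the demo.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so no window is needed

import pandas as pd
import seaborn as sns

# Long-form data: one row per observation, not grouped or pivoted.
# Values and column names are made up for illustration.
df = pd.DataFrame({
    "Last": ["a", "a", "n", "n"],
    "Sex": ["F", "M", "F", "M"],
    "Count": [120, 5, 10, 60],
})

# x and y name columns of df; hue splits each x position by a third column.
ax = sns.barplot(x="Last", y="Count", hue="Sex", data=df)
ax.figure.savefig("last_letter_barplot.png")  # hypothetical output file name
```

Because the column names are passed as strings, Seaborn can label the axes and legend on its own.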

31 of 32

Recap

  • Pandas for tabular data manipulation
    • Slicing for row/column selection
    • Group with df.groupby
    • Pivot with df.pivot_table
    • Join with pd.merge (covered in lab next week)
    • df.plot for basic plots

  • Seaborn for statistical plots
    • Reference the docs for available methods

32 of 32

Use the docs!

And Google.