1 of 44

Intro to pandas, Part 1

UC Berkeley Data 100 Summer 2019

Sam Lau

(Slides adapted from Josh Hug and John DeNero)

Learning goals:

  • Introduce the DataFrame and the Series
  • Learn how to slice a DF using indexing
  • Learn commonly used pandas methods

2 of 44

Announcements

There is a live lecture Piazza thread: Leo will post soon.

Starting Wed, lecture is moved to North Gate Hall room 105!

  • Everyone also gets attendance today; GForm tomorrow

Exam Conflict form link changed: http://bit.ly/su19-alt-final

  • Due Friday at 11:59pm
  • DSP exams will have a separate form, will send later.

3 of 44

Announcements

Office hours scheduled today for HW1!

  • 11am-12pm, 2-4pm in 355 Evans
  • Room will change after this week

Small group tutoring is starting next week; more info soon

I will try to do a better job of asking for names today. Also please add your preferred pronouns.

4 of 44

Pandas Data Structures: Data Frames, Series, and Indices (Reading: Chapter 3)

Will move fast today; use lab time to let material sink in.

5 of 44

Pandas Data Structures

There are three fundamental data structures in pandas:

  • Data Frame: 2D tabular data.
  • Series: 1D data. I usually think of it as columnar data.
  • Index: A sequence of row labels.

(Figure labels: Data Frame, Series, Index.)

6 of 44

Data Frames, Series, and Indices

We can think of a Data Frame as a collection of Series that all share the same Index.

  • Candidate, Party, %, Year, and Result Series all share an index from 0 to 5.
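A minimal sketch of this idea, assuming a tiny hand-typed table in place of the elections data. The variable name `elections` and every value in it are illustrative stand-ins, not the course's file.

```python
import pandas as pd

# Hypothetical stand-in for the elections table; values are illustrative only.
elections = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson", "Reagan", "Mondale", "Bush"],
    "Party": ["Republican", "Democrat", "Independent",
              "Republican", "Democrat", "Republican"],
    "%": [50.7, 41.0, 6.6, 58.8, 37.6, 53.4],
    "Year": [1980, 1980, 1980, 1984, 1984, 1988],
    "Result": ["win", "loss", "loss", "win", "loss", "win"],
})

# Each column is a Series, and every one carries the same Index as the frame.
shares_index = True
for name in elections.columns:
    column = elections[name]
    shares_index = shares_index and column.index.equals(elections.index)

print(shares_index)              # True
print(elections.index.tolist())  # [0, 1, 2, 3, 4, 5]
```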

(Figure labels: Candidate Series, Party Series, % Series, Year Series, Result Series.)

Non-native English speaker note: The plural of “series” is “series”. Sorry.

7 of 44

Indices Are Not Necessarily Row Numbers

Indices (a.k.a. row labels) can also:

  • Be non-numeric.
  • Have a name, e.g. “State”.
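As a rough illustration, the sketch below moves a column into the index so that the row labels become named strings. The table and the name `capitals` are made up for this example.

```python
import pandas as pd

# Hypothetical table used only to illustrate a named, non-numeric index.
capitals = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Arizona"],
    "Capital": ["Montgomery", "Juneau", "Phoenix"],
})

# set_index returns a new DataFrame whose row labels come from the "State" column.
by_state = capitals.set_index("State")

print(by_state.index.name)      # State
print(by_state.index.tolist())  # ['Alabama', 'Alaska', 'Arizona']
```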

8 of 44

Indices

The row labels that constitute an index do not have to be unique.

  • Left: The index values are all unique and numeric, acting as a row number.
  • Right: The index values are named and non-unique.
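A small sketch of a non-unique index, assuming a made-up table in which one candidate appears twice. The name `results` is hypothetical.

```python
import pandas as pd

# Hypothetical data: the label "Reagan" is used for two different rows.
results = pd.DataFrame(
    {"Year": [1980, 1984, 1988], "Result": ["win", "win", "win"]},
    index=pd.Index(["Reagan", "Reagan", "Bush"], name="Candidate"),
)

print(results.index.is_unique)  # False

# A repeated label selects every matching row, so this is a 2-row DataFrame.
reagan_rows = results.loc["Reagan"]
print(reagan_rows.shape)        # (2, 2)
```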

9 of 44

Column Names Must Be Unique!

Column names in pandas should be unique! pandas does permit duplicates, but they make column selection ambiguous.

  • Example: Avoid having two columns named “Candidate”.

10 of 44

Hands On Exercise

Let’s experiment with reading csv files and playing around with indices.

See lec02-live.ipynb. (Link on course website)
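This is not the notebook's code, only a sketch of the kind of experiment described here. It reads CSV text from an in-memory string (an assumption made so the example runs without a file); with a real file you would pass its path to `pd.read_csv` instead.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk; the rows are illustrative only.
csv_text = """Candidate,Party,%,Year,Result
Reagan,Republican,50.7,1980,win
Carter,Democrat,41.0,1980,loss
Anderson,Independent,6.6,1980,loss
"""

# By default, read_csv assigns integer row labels starting at 0.
elections = pd.read_csv(io.StringIO(csv_text))
print(elections.index.tolist())  # [0, 1, 2]

# index_col promotes one of the columns to the index while reading.
by_year = pd.read_csv(io.StringIO(csv_text), index_col="Year")
print(by_year.index.name)        # Year
```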

(demo)

11 of 44

Indexing with The [] Operator

12 of 44

Indexing by Column Names Using [] Operator

Given a DataFrame, it is common to extract a Series or a collection of Series. This process is known as “column selection” or sometimes “indexing by column”.

  • Column name argument to [] yields Series.
  • List argument to [] yields a Data Frame.
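The two cases can be sketched as follows, assuming a small illustrative `elections` table (the values are stand-ins).

```python
import pandas as pd

# Hypothetical stand-in for the elections table.
elections = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson"],
    "Party": ["Republican", "Democrat", "Independent"],
    "Year": [1980, 1980, 1980],
})

single = elections["Candidate"]              # column name -> Series
several = elections[["Candidate", "Year"]]   # list of names -> DataFrame
one_column = elections[["Candidate"]]        # one-item list -> still a DataFrame

print(type(single).__name__)      # Series
print(type(several).__name__)     # DataFrame
print(type(one_column).__name__)  # DataFrame
```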

13 of 44

Indexing by Column Names Using [] Operator

Column name argument to [] yields Series.

14 of 44

Indexing by Column Names Using [] Operator

Column name argument to [] yields Series.

List argument to [] yields a Data Frame.

15 of 44

Indexing by Row Slices Using [] Operator

We can also index by row numbers using the [] operator.

  • Numeric slice argument to [] yields rows.
  • Example: [0:3] yields rows 0 to 2.
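A quick sketch of row slicing with [], again on an illustrative stand-in table.

```python
import pandas as pd

# Hypothetical stand-in for the elections table.
elections = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson", "Reagan", "Mondale"],
    "Year": [1980, 1980, 1980, 1984, 1984],
})

# The slice end is excluded, as with ordinary Python lists.
first_three = elections[0:3]

print(first_three.index.tolist())   # [0, 1, 2]
print(type(first_three).__name__)   # DataFrame
```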

16 of 44

[] Summary

  • [] with a name: returns a Series (single column selection).
  • [] with a list: returns a DataFrame (multiple column selection).
  • [] with a numeric slice: returns a DataFrame ((multiple) row selection).

17 of 44

Note: Row Selection Requires Slicing!!

elections[0] will not work unless the elections data frame has a column whose name is the numeric zero.

  • Note: It is actually possible for columns to have names that are non-string types, e.g. numeric, datetime, etc.
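To make the failure concrete, here is a sketch using an illustrative `elections` table plus a second made-up frame, `numbered`, that really does have a column named with the integer 0.

```python
import pandas as pd

# Hypothetical stand-in for the elections table: all column names are strings.
elections = pd.DataFrame({
    "Candidate": ["Reagan", "Carter"],
    "Year": [1980, 1980],
})

# A bare integer is treated as a column name, so this lookup fails.
try:
    elections[0]
    found_column_zero = True
except KeyError:
    found_column_zero = False
print(found_column_zero)  # False

# If a column is actually named 0, the same syntax selects that column.
numbered = pd.DataFrame({0: ["a", "b"], "x": [1, 2]})
print(numbered[0].tolist())  # ['a', 'b']
```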

18 of 44

Question

Try to predict the output of the following:

  • weird[1]
  • weird["1"]
  • weird[1:]

Reminder of [] behavior:

  • Name: returns a Series (single column selection).
  • List: returns a DataFrame (multiple column selection).
  • Numeric slice: returns a DataFrame ((multiple) row selection).

(demo)

19 of 44

Boolean Array Selection

20 of 44

Boolean Array Input

Yet another input type supported by [] is the boolean array.

21 of 44

Boolean Array Input

Yet another input type supported by [] is the boolean array. Useful because boolean arrays can be generated by using logical operators on Series.

Length 23 Series where every entry is “Republican”, “Democrat” or “Independent.”

Length 23 Series where every entry is either “True” or “False”, where “True” occurs for every independent candidate.
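A minimal sketch of the same pattern on a 6-row illustrative table (the lecture's Series has 23 entries; the values here are stand-ins).

```python
import pandas as pd

# Hypothetical stand-in for the elections table.
elections = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson", "Reagan", "Mondale", "Bush"],
    "Party": ["Republican", "Democrat", "Independent",
              "Republican", "Democrat", "Republican"],
})

# Comparing a Series to a value yields a boolean Series of the same length.
is_independent = elections["Party"] == "Independent"
print(is_independent.tolist())  # [False, False, True, False, False, False]

# Passing that boolean Series to [] keeps only the rows marked True.
independents = elections[is_independent]
print(independents["Candidate"].tolist())  # ['Anderson']
```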

22 of 44

Boolean Array Input

Boolean Series can be combined using the & operator, allowing filtering of results by multiple criteria.
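A sketch of combining two conditions, using illustrative values. Each comparison needs its own parentheses because & binds more tightly than == and > in Python.

```python
import pandas as pd

# Hypothetical stand-in for the elections table.
elections = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson", "Reagan", "Mondale", "Bush"],
    "Year": [1980, 1980, 1980, 1984, 1984, 1988],
    "Result": ["win", "loss", "loss", "win", "loss", "win"],
})

# Keep rows that satisfy both conditions at once.
won = elections["Result"] == "win"
after_1980 = elections["Year"] > 1980
later_wins = elections[won & after_1980]

# The same filter written inline requires parentheses around each comparison.
inline = elections[(elections["Result"] == "win") & (elections["Year"] > 1980)]

print(later_wins["Candidate"].tolist())  # ['Reagan', 'Bush']
```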

(demo)

23 of 44

Indexing with loc and iloc

24 of 44

.loc and .iloc

.loc and .iloc are alternate ways to index into a DataFrame.

  • They take a lot of getting used to! Documentation and ideas behind them are quite complex.
  • I’ll go over common usages (see docs for weirder ones).

Documentation:

25 of 44

.loc

.loc does two things:

  • Access values by labels.
  • Access values using a boolean array (a la Boolean Array Selection).

26 of 44

.loc with Lists

The most basic use of loc is to provide a list of row and column labels, which returns a DataFrame.
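For instance, a sketch on an illustrative table: the first list names row labels and the second names column labels.

```python
import pandas as pd

# Hypothetical stand-in for the elections table.
elections = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson", "Reagan"],
    "Party": ["Republican", "Democrat", "Independent", "Republican"],
    "%": [50.7, 41.0, 6.6, 58.8],
    "Year": [1980, 1980, 1980, 1984],
})

# Rows labeled 0, 1, 2 and the three named columns, returned as a DataFrame.
subset = elections.loc[[0, 1, 2], ["Candidate", "Party", "Year"]]

print(subset.shape)          # (3, 3)
print(list(subset.columns))  # ['Candidate', 'Party', 'Year']
```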

27 of 44

.loc with Slices

.loc is also commonly used with slices.

  • Slicing works with all label types, not just numeric labels.
  • Slices with loc are inclusive of the end label, unlike ordinary Python slices, which exclude the end.
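A short sketch contrasting the two slicing behaviors on an illustrative table.

```python
import pandas as pd

# Hypothetical stand-in for the elections table.
elections = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson", "Reagan"],
    "Party": ["Republican", "Democrat", "Independent", "Republican"],
    "%": [50.7, 41.0, 6.6, 58.8],
    "Year": [1980, 1980, 1980, 1984],
    "Result": ["win", "loss", "loss", "win"],
})

# .loc slices include both endpoints, for row labels and column labels alike.
with_loc = elections.loc[0:2, "Candidate":"Year"]
print(with_loc.shape)  # (3, 4)

# The plain [] row slice excludes its end, so it returns one row fewer.
with_brackets = elections[0:2]
print(len(with_brackets))  # 2
```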

28 of 44

.loc with Single Values for Column Label

If we provide only a single label as column argument, we get a Series.

29 of 44

.loc with Single Values for Column Label

As before with the [] operator, if we provide a list of only one label as an argument, we get back a dataframe.
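The difference between a bare column label and a one-item list can be sketched like this, on illustrative data.

```python
import pandas as pd

# Hypothetical stand-in for the elections table.
elections = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson", "Reagan"],
    "Year": [1980, 1980, 1980, 1984],
})

as_series = elections.loc[0:2, "Candidate"]   # single label -> Series
as_frame = elections.loc[0:2, ["Candidate"]]  # one-item list -> DataFrame

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
print(as_frame.shape)            # (3, 1)
```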

30 of 44

.loc with Single Values for Row Label

If we provide only a single row label, we get a Series.

  • Series made up of the values from the requested row, not column
  • Index is the names of the columns from the data frame.
  • Putting the single row label in a list yields a dataframe version.
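A sketch of the three points above, assuming an illustrative table.

```python
import pandas as pd

# Hypothetical stand-in for the elections table.
elections = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson"],
    "Party": ["Republican", "Democrat", "Independent"],
    "Year": [1980, 1980, 1980],
})

# A single row label returns that row as a Series indexed by column name.
row = elections.loc[0]
print(type(row).__name__)  # Series
print(row.index.tolist())  # ['Candidate', 'Party', 'Year']
print(row["Candidate"])    # Reagan

# Wrapping the label in a list keeps the result two-dimensional.
row_frame = elections.loc[[0]]
print(row_frame.shape)     # (1, 3)
```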

31 of 44

.loc Supports Boolean Arrays

.loc supports Boolean Arrays exactly as you’d expect.
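One common pattern, sketched on illustrative data, is to put a boolean Series in the row position and column labels in the column position.

```python
import pandas as pd

# Hypothetical stand-in for the elections table.
elections = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson", "Reagan", "Mondale", "Bush"],
    "Year": [1980, 1980, 1980, 1984, 1984, 1988],
    "Result": ["win", "loss", "loss", "win", "loss", "win"],
})

# Rows where the condition holds, restricted to two columns.
winners = elections.loc[elections["Result"] == "win", ["Candidate", "Year"]]

print(winners["Candidate"].tolist())  # ['Reagan', 'Reagan', 'Bush']
print(winners.index.tolist())         # [0, 3, 5]
```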

32 of 44

.iloc: Selection by Position

In contrast to loc, iloc doesn’t think about labels at all. Instead, it returns the items that appear in the numerical positions specified.
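The contrast is easiest to see once labels and positions disagree. The sketch below sorts an illustrative table first (the sort is an assumption made only to shuffle the row order) and then compares the two accessors.

```python
import pandas as pd

# Hypothetical stand-in for the elections table.
elections = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson", "Reagan"],
    "%": [50.7, 41.0, 6.6, 58.8],
    "Year": [1980, 1980, 1980, 1984],
})

# After sorting, the row labeled 3 sits in position 0.
ranked = elections.sort_values("%", ascending=False)

first_by_position = ranked.iloc[0]  # whichever row is physically first
first_by_label = ranked.loc[0]      # the row whose label is 0

print(first_by_position["%"])  # 58.8
print(first_by_label["%"])     # 50.7

# iloc slices follow Python rules: the end position is excluded.
top_left = ranked.iloc[0:2, 0:2]
print(top_left.shape)          # (2, 2)
```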

Advantages of loc:

  • Harder to make mistakes.
  • Easier to read code.
  • Not vulnerable to changes to the ordering of rows/cols in raw data files.

Nonetheless, iloc can be more convenient. Use iloc judiciously.

(demo)

33 of 44

Slicing Connections

34 of 44

5 min break

35 of 44

Handy Properties and Utility Functions for Series and DataFrames

36 of 44

head, size, shape, and describe

head: Displays only the top few rows.

size: Gives the total number of data points.

shape: Gives the size of the data in rows and columns.

describe: Provides a summary of the data.
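A sketch of all four on an illustrative 6-row, 5-column table. Note that `head` and `describe` are methods, while `size` and `shape` are attributes and take no parentheses.

```python
import pandas as pd

# Hypothetical stand-in for the elections table.
elections = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson", "Reagan", "Mondale", "Bush"],
    "Party": ["Republican", "Democrat", "Independent",
              "Republican", "Democrat", "Republican"],
    "%": [50.7, 41.0, 6.6, 58.8, 37.6, 53.4],
    "Year": [1980, 1980, 1980, 1984, 1984, 1988],
    "Result": ["win", "loss", "loss", "win", "loss", "win"],
})

top = elections.head(3)         # first 3 rows (the default is 5)
print(len(top))                 # 3
print(elections.size)           # 30, i.e. rows times columns
print(elections.shape)          # (6, 5)

# For a frame with mixed types, describe summarizes the numeric columns.
summary = elections.describe()
print(list(summary.columns))    # ['%', 'Year']
```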

37 of 44

index and columns

index: Returns the index (a.k.a. row labels).

columns: Returns the labels for the columns.
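Both are attributes rather than methods; a brief sketch on an illustrative table:

```python
import pandas as pd

# Hypothetical stand-in for the elections table.
elections = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson"],
    "Party": ["Republican", "Democrat", "Independent"],
    "Year": [1980, 1980, 1980],
})

print(elections.index.tolist())    # [0, 1, 2]
print(elections.columns.tolist())  # ['Candidate', 'Party', 'Year']
```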

38 of 44

The sort_values Method

One incredibly useful method for DataFrames is sort_values, which creates a copy of a DataFrame sorted by a specific column.
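A sketch of sorting by one column, on illustrative values. The `ascending=False` argument is an optional choice made here to put the largest share first.

```python
import pandas as pd

# Hypothetical stand-in for the elections table.
elections = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson", "Reagan", "Mondale", "Bush"],
    "%": [50.7, 41.0, 6.6, 58.8, 37.6, 53.4],
})

# Returns a sorted copy; rows keep their original labels.
by_share = elections.sort_values("%", ascending=False)
print(by_share.index.tolist())   # [3, 5, 0, 1, 4, 2]

# The original DataFrame is left untouched.
print(elections.index.tolist())  # [0, 1, 2, 3, 4, 5]
```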

39 of 44

The sort_values Method

We can also use sort_values on a Series, which returns a copy with the values in order.
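The Series version takes no column argument, since there is only one set of values to sort. A sketch on illustrative numbers:

```python
import pandas as pd

# Hypothetical vote shares; values are illustrative only.
shares = pd.Series([50.7, 41.0, 6.6, 58.8, 37.6, 53.4], name="%")

# Ascending order by default; each value keeps its original label.
ordered = shares.sort_values()

print(ordered.tolist())        # [6.6, 37.6, 41.0, 50.7, 53.4, 58.8]
print(ordered.index.tolist())  # [2, 4, 1, 0, 5, 3]
```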

40 of 44

The value_counts Method

Series also has the method value_counts, which creates a new Series showing the count of every distinct value.
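A sketch on an illustrative Party column. The result is indexed by the distinct values and ordered from most to least frequent.

```python
import pandas as pd

# Hypothetical Party column; values are illustrative only.
party = pd.Series(["Republican", "Democrat", "Independent",
                   "Republican", "Democrat", "Republican"], name="Party")

party_counts = party.value_counts()

print(party_counts["Republican"])   # 3
print(party_counts.index.tolist())  # ['Republican', 'Democrat', 'Independent']
```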

41 of 44

The unique Method

Another handy method for Series is unique, which returns all unique values as an array.
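A sketch on the same kind of illustrative column. The distinct values come back in order of first appearance, not sorted.

```python
import pandas as pd

# Hypothetical Party column; values are illustrative only.
party = pd.Series(["Republican", "Democrat", "Independent",
                   "Republican", "Democrat", "Republican"], name="Party")

parties = party.unique()

print(list(parties))  # ['Republican', 'Democrat', 'Independent']
print(len(parties))   # 3
```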

42 of 44

Baby Names Case Study Q1

43 of 44

Baby Names

Let’s try solving a real world problem using the baby names dataset: What was the most popular name in California last year (2019)?

Along the way, we’ll see some examples of what it’s like to deal with real data, and will also explore some fancy IPython features.

(demo)

44 of 44

Summary

  • pandas data structures:
    • DataFrames are 2D tables of data.
    • Series are 1D array-like columns of data.
  • Slicing:
    • .loc for slicing by label, .iloc for slicing by integer position
    • Boolean slicing to slice by condition
  • Useful methods:
    • pd.read_csv, .head, .shape, .describe, .sort_values, .value_counts, .unique.