Intro to pandas, Part 1
UC Berkeley Data 100 Summer 2019
Sam Lau
(Slides adapted from Josh Hug and John DeNero)
Learning goals:
Announcements
There is a live lecture Piazza thread: Leo will post soon.
Starting Wed, lecture is moved to North Gate Hall room 105!
Exam Conflict form link changed: http://bit.ly/su19-alt-final
Announcements
Office hours scheduled today for HW1!
Small group tutoring is starting next week; more info soon
I will try to do a better job of asking for names today. Also please add your preferred pronouns.
Pandas Data Structures: Data Frames, Series, and Indices (Reading: Chapter 3)
Will move fast today; use lab time to let material sink in.
Pandas Data Structures
There are three fundamental data structures in pandas:
Data Frame
Series
Index
Data Frames, Series, and Indices
We can think of a Data Frame as a collection of Series that all share the same Index.
Candidate Series
Party Series
% Series
Year Series
Result Series
Non-native English speaker note: The plural of “series” is “series”. Sorry.
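A minimal sketch of the "collection of Series sharing one Index" idea. The three-row table is a hand-made, illustrative stand-in for the lecture's elections data, not the real file.

```python
import pandas as pd

# Hypothetical sample; values are illustrative only.
elections = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson"],
    "Party": ["Republican", "Democrat", "Independent"],
    "%": [50.7, 41.0, 6.6],
    "Year": [1980, 1980, 1980],
    "Result": ["win", "loss", "loss"],
})

candidate = elections["Candidate"]  # each column is a Series
party = elections["Party"]

# Every column Series carries the same Index as the Data Frame itself.
print(type(candidate).__name__)
print(candidate.index.equals(elections.index))
print(party.index.equals(elections.index))
```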
Indices Are Not Necessarily Row Numbers
Indices (a.k.a. row labels) can also:
Indices
The row labels that constitute an index do not have to be unique.
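A small sketch of both points, assuming a hand-made sample table: an index can hold strings rather than row numbers, and its labels may repeat.

```python
import pandas as pd

# Hypothetical sample; values are illustrative only.
elections = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson"],
    "Party": ["Republican", "Democrat", "Independent"],
    "Year": [1980, 1980, 1980],
})

by_party = elections.set_index("Party")  # string labels, all distinct here
by_year = elections.set_index("Year")    # the label 1980 appears three times

print(by_party.index.is_unique)
print(by_year.index.is_unique)
print(by_year.loc[1980])  # a repeated label selects every matching row
```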
Column Names Must Be Unique!
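Column names in pandas are almost always unique. pandas does not strictly enforce this, but duplicate names make selection by name ambiguous, so treat uniqueness as a rule.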
Hands On Exercise
Let’s experiment with reading csv files and playing around with indices.
See lec02-live.ipynb. (Link on course website)
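If the notebook is not at hand, the same experiment can be sketched with an in-memory CSV. The text below is a made-up stand-in for the course file. `index_col` is a real `read_csv` parameter.

```python
import io
import pandas as pd

# Hypothetical CSV contents standing in for the lecture's elections file.
csv_text = """Candidate,Party,%,Year,Result
Reagan,Republican,50.7,1980,win
Carter,Democrat,41.0,1980,loss
Anderson,Independent,6.6,1980,loss
"""

# By default the rows are labeled 0, 1, 2, ...
elections = pd.read_csv(io.StringIO(csv_text))
print(elections.index)

# index_col uses one of the columns as the row labels instead.
by_candidate = pd.read_csv(io.StringIO(csv_text), index_col="Candidate")
print(by_candidate.index)
```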
(demo)
Indexing with The [] Operator
Indexing by Column Names Using [] Operator
Given a dataframe, it is common to extract a Series or a collection of Series. This process is also known as “Column Selection” or sometimes “indexing by column”.
Indexing by Column Names Using [] Operator
Column name argument to [] yields Series.
List argument to [] yields a Data Frame.
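A quick sketch of the two cases, using a hand-made sample table. Note that a one-element list still yields a Data Frame.

```python
import pandas as pd

# Hypothetical sample; values are illustrative only.
elections = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson"],
    "Party": ["Republican", "Democrat", "Independent"],
    "Year": [1980, 1980, 1980],
})

one_column = elections["Candidate"]              # name -> Series
two_columns = elections[["Candidate", "Party"]]  # list -> Data Frame
one_in_list = elections[["Candidate"]]           # list of one -> Data Frame

print(type(one_column).__name__)
print(type(two_columns).__name__)
print(type(one_in_list).__name__)
```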
Indexing by Row Slices Using [] Operator
We can also index by row numbers using the [] operator.
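For example, on a hand-made sample table, a numeric slice picks rows by position and excludes the end of the slice:

```python
import pandas as pd

# Hypothetical sample; values are illustrative only.
elections = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson"],
    "Year": [1980, 1980, 1980],
})

first_two = elections[0:2]  # rows at positions 0 and 1
print(first_two)
```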
[] Summary
[] with a Name returns a Series: Single Column Selection
[] with a List returns a DataFrame: Multiple Column Selection
[] with a Numeric Slice returns a DataFrame: (Multiple) Row Selection
Note: Row Selection Requires Slicing!!
elections[0] will not work unless the elections data frame has a column whose name is the numeric zero.
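A small sketch of the failure and the fix, on a hand-made sample table whose column names are all strings:

```python
import pandas as pd

# Hypothetical sample; values are illustrative only.
elections = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson"],
    "Year": [1980, 1980, 1980],
})

# A bare 0 is treated as a column name, and no column is named 0.
try:
    elections[0]
    raised = False
except KeyError:
    raised = True
print(raised)

# A slice is treated as a row selection, so this returns the first row.
first_row = elections[0:1]
print(first_row)
```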
Question
Try to predict the output of the following:
[] with a Name returns a Series: Single Column Selection
[] with a List returns a DataFrame: Multiple Column Selection
[] with a Numeric Slice returns a DataFrame: (Multiple) Row Selection
(demo)
Boolean Array Selection
Boolean Array Input
Yet another input type supported by [] is the boolean array. Useful because boolean arrays can be generated by using logical operators on Series.
Length 23 Series where every entry is “Republican”, “Democrat” or “Independent.”
Length 23 Series where every entry is either “True” or “False”, where “True” occurs for every independent candidate.
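The same idea on a three-row, hand-made sample (the lecture's table has 23 rows): comparing a Series to a value gives a boolean Series, which [] then uses to keep only the True rows.

```python
import pandas as pd

# Hypothetical sample; values are illustrative only.
elections = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson"],
    "Party": ["Republican", "Democrat", "Independent"],
})

is_independent = elections["Party"] == "Independent"  # boolean Series
print(is_independent)

independents = elections[is_independent]  # keeps rows where the entry is True
print(independents)
```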
Boolean Array Input
Boolean Series can be combined using the & operator, allowing filtering of results by multiple criteria.
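A sketch with two criteria, on a hand-made sample. Each comparison needs its own parentheses, because & binds more tightly than == or <.

```python
import pandas as pd

# Hypothetical sample; values are illustrative only.
elections = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Clinton", "Bush"],
    "%": [50.7, 41.0, 43.0, 37.4],
    "Year": [1980, 1980, 1992, 1992],
    "Result": ["win", "loss", "win", "loss"],
})

# Winners who received less than half of the vote.
won = elections["Result"] == "win"
under_half = elections["%"] < 50
close_wins = elections[won & under_half]

# The same filter written inline.
close_wins_inline = elections[(elections["Result"] == "win") & (elections["%"] < 50)]
print(close_wins)
```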
(demo)
Indexing with loc and iloc
.loc and .iloc
.loc and .iloc are alternate ways to index into a DataFrame.
Documentation:
.loc
.loc does two things: it accesses values by label, and it accesses values using a boolean array.
.loc with Lists
The most basic use of loc is to provide a list of row and column labels, which returns a DataFrame.
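For instance, on a hand-made sample with the default 0, 1, 2, ... row labels:

```python
import pandas as pd

# Hypothetical sample; values are illustrative only.
elections = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson"],
    "Party": ["Republican", "Democrat", "Independent"],
    "Year": [1980, 1980, 1980],
})

# First argument: list of row labels. Second: list of column labels.
subset = elections.loc[[0, 2], ["Candidate", "Year"]]
print(subset)
```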
.loc with Slices
.loc is also commonly used with slices.
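One thing worth checking by hand: slices in .loc are label-based and include the endpoint, unlike ordinary Python slices. A sketch on a hand-made sample:

```python
import pandas as pd

# Hypothetical sample; values are illustrative only.
elections = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Clinton", "Bush"],
    "Party": ["Republican", "Democrat", "Democrat", "Republican"],
    "%": [50.7, 41.0, 43.0, 37.4],
    "Year": [1980, 1980, 1992, 1992],
})

# Row labels 0 through 2 and columns "Candidate" through "%", ends included.
with_loc = elections.loc[0:2, "Candidate":"%"]
print(with_loc)

# For comparison, the [] operator excludes the end of a numeric slice.
with_brackets = elections[0:2]
print(len(with_loc), len(with_brackets))
```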
.loc with Single Values for Column Label
If we provide only a single label as column argument, we get a Series.
.loc with Single Values for Column Label
As before with the [] operator, if we provide a list of only one label as an argument, we get back a dataframe.
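The two cases side by side, on a hand-made sample:

```python
import pandas as pd

# Hypothetical sample; values are illustrative only.
elections = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson"],
    "Year": [1980, 1980, 1980],
})

as_series = elections.loc[0:1, "Candidate"]   # single label -> Series
as_frame = elections.loc[0:1, ["Candidate"]]  # list of one label -> DataFrame

print(type(as_series).__name__)
print(type(as_frame).__name__)
```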
.loc with Single Values for Row Label
If we provide only a single row label, we get a Series.
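A sketch on a hand-made sample. The resulting Series is indexed by the selected column names.

```python
import pandas as pd

# Hypothetical sample; values are illustrative only.
elections = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson"],
    "Party": ["Republican", "Democrat", "Independent"],
    "Year": [1980, 1980, 1980],
})

row = elections.loc[0, ["Candidate", "Party"]]  # single row label -> Series
print(row)
```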
.loc Supports Boolean Arrays
.loc supports Boolean Arrays exactly as you’d expect.
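For example, on a hand-made sample, a boolean Series can pick the rows while a list of labels picks the columns:

```python
import pandas as pd

# Hypothetical sample; values are illustrative only.
elections = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Clinton", "Bush"],
    "Year": [1980, 1980, 1992, 1992],
    "Result": ["win", "loss", "win", "loss"],
})

winners = elections.loc[elections["Result"] == "win", ["Candidate", "Year"]]
print(winners)
```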
.iloc: Selection by Position
In contrast to loc, iloc doesn’t think about labels at all. Instead, it returns the items that appear in the numerical positions specified.
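A sketch contrasting the two on a hand-made sample whose rows are labeled by candidate. iloc slices exclude the end position; loc slices include the end label.

```python
import pandas as pd

# Hypothetical sample; values are illustrative only.
elections = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson"],
    "Party": ["Republican", "Democrat", "Independent"],
    "%": [50.7, 41.0, 6.6],
    "Year": [1980, 1980, 1980],
}).set_index("Candidate")

# By position: rows 0-1 and columns 0-1.
by_position = elections.iloc[0:2, 0:2]

# By label: the same rows and columns, named explicitly.
by_label = elections.loc["Reagan":"Carter", "Party":"%"]

print(by_position)
print(by_position.equals(by_label))
```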
Advantages of loc:
Nonetheless, iloc can be more convenient. Use iloc judiciously.
(demo)
Slicing Connections
5 min break
Handy Properties and Utility Functions for Series and DataFrames
head, size, shape, and describe
head: Displays only the top few rows.
size: Gives the total number of data points.
shape: Gives the size of the data in rows and columns.
describe: Provides a summary of the data.
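All four on a hand-made sample. Note that head and describe are methods, while size and shape are attributes (no parentheses).

```python
import pandas as pd

# Hypothetical sample; values are illustrative only.
elections = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson"],
    "Party": ["Republican", "Democrat", "Independent"],
    "%": [50.7, 41.0, 6.6],
    "Year": [1980, 1980, 1980],
    "Result": ["win", "loss", "loss"],
})

top = elections.head(2)         # first 2 rows (head() alone gives the first 5)
n_values = elections.size       # rows times columns
dims = elections.shape          # (rows, columns)
summary = elections.describe()  # by default, summarizes the numeric columns

print(n_values, dims)
print(summary)
```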
index and columns
index: Returns the index (a.k.a. row labels).
columns: Returns the labels for the columns.
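Both are attributes. A sketch on a hand-made sample:

```python
import pandas as pd

# Hypothetical sample; values are illustrative only.
elections = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson"],
    "Party": ["Republican", "Democrat", "Independent"],
}).set_index("Candidate")

row_labels = elections.index
column_labels = elections.columns

print(list(row_labels))
print(list(column_labels))
```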
The sort_values Method
One incredibly useful method for DataFrames is sort_values, which creates a copy of a DataFrame sorted by a specific column.
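A sketch on a hand-made sample, sorting by vote share from largest to smallest. The original Data Frame is left unchanged, and the sorted copy keeps the original row labels.

```python
import pandas as pd

# Hypothetical sample; values are illustrative only.
elections = pd.DataFrame({
    "Candidate": ["Carter", "Anderson", "Reagan"],
    "%": [41.0, 6.6, 50.7],
})

by_share = elections.sort_values("%", ascending=False)

print(by_share)
print(list(elections["Candidate"]))  # original order is untouched
```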
The sort_values Method
We can also use sort_values on a Series, which returns a copy with the values in order.
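For a Series no column name is needed. A sketch on a hand-made sample:

```python
import pandas as pd

# Hypothetical sample; values are illustrative only.
shares = pd.Series([41.0, 6.6, 50.7], name="%")

ascending = shares.sort_values()  # smallest first by default

print(ascending)
```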
The value_counts Method
Series also has the method value_counts, which creates a new Series showing how many times each value occurs.
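A sketch on a hand-made sample. The result is indexed by the distinct values, with the most frequent first.

```python
import pandas as pd

# Hypothetical sample; values are illustrative only.
party = pd.Series(["Republican", "Democrat", "Independent", "Republican"])

counts = party.value_counts()

print(counts)
```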
The unique Method
Another handy method for Series is unique, which returns all unique values as an array.
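A sketch on a hand-made sample. Values come back in order of first appearance, not sorted.

```python
import pandas as pd

# Hypothetical sample; values are illustrative only.
party = pd.Series(["Republican", "Democrat", "Independent", "Republican"])

parties = party.unique()

print(parties)
```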
Baby Names Case Study Q1
Baby Names
Let’s try solving a real world problem using the baby names dataset: What was the most popular name in California last year (2019)?
Along the way, we’ll see some examples of what it’s like to deal with real data, and will also explore some fancy IPython features.
(demo)
Summary