Pandas, Part 1
Introduction to Pandas syntax and operators
Data 100, Summer 2021 @ UC Berkeley
Raguvir Kunani and Isaac Schmidt
(content by Josh Hug, Fernando Pérez)
LECTURE 4
Goals For This Lecture
If you’ve taken Data 8, you might find “Intro to Pandas if you’ve taken Data 8” useful.
Data Frames: a high-level, statistical perspective
The world, a statistician's view (I'm NOT a statistician 😀)
🌍
A (statistical) population from which we draw samples.
Each sample has certain features.
A generic DataFrame: each row is a sample and each column is a feature.
Connecting with SQL: dataframes and relational ideas
Recent Berkeley work: a theory of dataframes
Pandas Data Structures: Data Frames, Series, and Indices
Pandas Data Structures
There are three fundamental data structures in pandas:
Data Frame
Series
Index
The Relationship Between Data Frames, Series, and Indices
We can think of a Data Frame as a collection of Series that all share the same Index.
(Figure: a Data Frame with each column labeled as a Series: Candidate, Party, %, Year, and Result.)
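A minimal sketch of this relationship, assuming a small hand-typed stand-in for the lecture's elections table (the rows and values below are illustrative, not the course dataset):

```python
import pandas as pd

# Hypothetical stand-in for the lecture's elections table; values are illustrative.
elections = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson", "Reagan", "Mondale", "Bush"],
    "Party": ["Republican", "Democratic", "Independent",
              "Republican", "Democratic", "Republican"],
    "%": [50.7, 41.0, 6.6, 58.8, 37.6, 53.4],
    "Year": [1980, 1980, 1980, 1984, 1984, 1988],
    "Result": ["win", "loss", "loss", "win", "loss", "win"],
})

candidate = elections["Candidate"]  # one column, pulled out as a Series
year = elections["Year"]            # another column, also a Series

# Every column Series carries the same Index as the Data Frame it came from.
print(type(elections).__name__, type(candidate).__name__)
print(candidate.index.equals(year.index))
print(candidate.index.equals(elections.index))
```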
Non-native English speaker note: The plural of “series” is “series”. Sorry.
Indices Are Not Necessarily Row Numbers
Indices (a.k.a. row labels) can also:
Indices
The row labels that constitute an index do not have to be unique.
Column Names Are Usually Unique!
Column names in Pandas are almost always unique!
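As a rough illustration of non-unique row labels, the sketch below uses the Party column as the index of a small made-up table (the data is an assumption, chosen only so that some labels repeat):

```python
import pandas as pd

# Hypothetical stand-in for the lecture's elections table; values are illustrative.
elections = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson", "Reagan", "Mondale", "Bush"],
    "Party": ["Republican", "Democratic", "Independent",
              "Republican", "Democratic", "Republican"],
    "%": [50.7, 41.0, 6.6, 58.8, 37.6, 53.4],
    "Year": [1980, 1980, 1980, 1984, 1984, 1988],
    "Result": ["win", "loss", "loss", "win", "loss", "win"],
})

# Use the Party column as row labels; several rows now share a label.
by_party = elections.set_index("Party")

print(by_party.index.is_unique)     # row labels repeat here
print(elections.columns.is_unique)  # column names do not
```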
Summary: structure of a Series
Summary: structure of a DataFrame
Hands On Exercise
Let’s experiment with reading csv files and playing around with indices.
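One possible starting point for the exercise is sketched below. To stay self-contained it reads CSV text from an in-memory string rather than a file on disk; the three rows are illustrative and the variable names are assumptions.

```python
import io
import pandas as pd

# Illustrative CSV text standing in for a file on disk.
csv_text = """Candidate,Party,%,Year,Result
Reagan,Republican,50.7,1980,win
Carter,Democratic,41.0,1980,loss
Anderson,Independent,6.6,1980,loss
"""

# By default, read_csv labels the rows 0, 1, 2, ...
elections = pd.read_csv(io.StringIO(csv_text))
print(elections.index)

# index_col uses one of the columns as the row labels instead.
by_candidate = pd.read_csv(io.StringIO(csv_text), index_col="Candidate")
print(by_candidate.index)

# reset_index moves the labels back into an ordinary column.
restored = by_candidate.reset_index()
print(restored.columns)
```

With a real file, the first argument to read_csv would be the file's path instead of the StringIO object.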
Indexing with The [] Operator
Indexing by Column Names Using [] Operator
Given a dataframe, it is common to extract a Series or a collection of Series. This process is also known as “Column Selection” or sometimes “indexing by column”.
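A short sketch of column selection, assuming a small illustrative table in place of the lecture's elections data:

```python
import pandas as pd

# Hypothetical stand-in for the lecture's elections table; values are illustrative.
elections = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson", "Reagan", "Mondale", "Bush"],
    "Party": ["Republican", "Democratic", "Independent",
              "Republican", "Democratic", "Republican"],
    "%": [50.7, 41.0, 6.6, 58.8, 37.6, 53.4],
    "Year": [1980, 1980, 1980, 1984, 1984, 1988],
    "Result": ["win", "loss", "loss", "win", "loss", "win"],
})

one_column = elections["Candidate"]                # a name gives a Series
two_columns = elections[["Candidate", "Party"]]    # a list gives a DataFrame
one_column_frame = elections[["Candidate"]]        # a one-item list is still a DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__)
print(type(one_column_frame).__name__)
```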
Indexing by Row Slices Using [] Operator
We can also select rows by position by passing a numeric slice to the [] operator.
[] Summary
Input to []      Returns      Operation
Name             Series       Single Column Selection
List             DataFrame    Multiple Column Selection
Numeric Slice    DataFrame    (Multiple) Row Selection
Note: Row Selection Requires Slicing!!
elections[0] will not work unless the elections data frame has a column whose name is the numeric zero.
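The sketch below illustrates the note above on a small made-up table: a numeric slice selects rows, while a bare integer is looked up as a column name and fails.

```python
import pandas as pd

# Hypothetical stand-in for the lecture's elections table; values are illustrative.
elections = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson", "Reagan", "Mondale", "Bush"],
    "Party": ["Republican", "Democratic", "Independent",
              "Republican", "Democratic", "Republican"],
    "%": [50.7, 41.0, 6.6, 58.8, 37.6, 53.4],
    "Year": [1980, 1980, 1980, 1984, 1984, 1988],
    "Result": ["win", "loss", "loss", "win", "loss", "win"],
})

# A slice selects rows by position; the end of the slice is excluded.
first_three = elections[0:3]
print(first_three)

# A bare integer is treated as a column name, and no column is named 0.
try:
    elections[0]
    lookup_failed = False
except KeyError:
    lookup_failed = True
print(lookup_failed)
```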
Question
Try to predict the output of the following:
Boolean Array Selection and Querying
Boolean Array Input
Yet another input type supported by [] is the boolean array. This is useful because boolean arrays can be generated by applying logical operators to Series.
Length 23 Series where every entry is “Republican”, “Democrat” or “Independent.”
Length 23 Series where every entry is either “True” or “False”, where “True” occurs for every independent candidate.
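A minimal sketch of the two Series described above, using a six-row illustrative table rather than the 23-row lecture data:

```python
import pandas as pd

# Hypothetical stand-in for the lecture's elections table; values are illustrative.
elections = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson", "Reagan", "Mondale", "Bush"],
    "Party": ["Republican", "Democratic", "Independent",
              "Republican", "Democratic", "Republican"],
    "%": [50.7, 41.0, 6.6, 58.8, 37.6, 53.4],
    "Year": [1980, 1980, 1980, 1984, 1984, 1988],
    "Result": ["win", "loss", "loss", "win", "loss", "win"],
})

# Comparing a Series to a value gives a boolean Series of the same length.
is_independent = elections["Party"] == "Independent"
print(is_independent)

# Passing that boolean Series to [] keeps only the rows marked True.
independents = elections[is_independent]
print(independents)
```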
Boolean Array Input
Boolean Series can be combined using the & operator, allowing filtering of results by multiple criteria.
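A sketch of combining two conditions with &, on illustrative data. Each comparison is wrapped in parentheses because & binds more tightly than == and > in Python.

```python
import pandas as pd

# Hypothetical stand-in for the lecture's elections table; values are illustrative.
elections = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson", "Reagan", "Mondale", "Bush"],
    "Party": ["Republican", "Democratic", "Independent",
              "Republican", "Democratic", "Republican"],
    "%": [50.7, 41.0, 6.6, 58.8, 37.6, 53.4],
    "Year": [1980, 1980, 1980, 1984, 1984, 1988],
    "Result": ["win", "loss", "loss", "win", "loss", "win"],
})

# Keep rows that satisfy both conditions.
won = elections["Result"] == "win"
after_1980 = elections["Year"] > 1980
later_winners = elections[won & after_1980]

# The same filter written inline needs parentheses around each comparison.
inline = elections[(elections["Result"] == "win") & (elections["Year"] > 1980)]
print(later_winners)
```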
isin
The isin function makes it more convenient to find rows that match one of many possible values.
Example: Suppose we want to find “Republican” or “Democratic” candidates. We could use the | operator (| means or), or we can use isin.
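Both approaches from the example can be sketched as follows, on an illustrative table; the two results should be identical.

```python
import pandas as pd

# Hypothetical stand-in for the lecture's elections table; values are illustrative.
elections = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson", "Reagan", "Mondale", "Bush"],
    "Party": ["Republican", "Democratic", "Independent",
              "Republican", "Democratic", "Republican"],
    "%": [50.7, 41.0, 6.6, 58.8, 37.6, 53.4],
    "Year": [1980, 1980, 1980, 1984, 1984, 1988],
    "Result": ["win", "loss", "loss", "win", "loss", "win"],
})

# Option 1: combine two comparisons with | (or).
with_or = elections[(elections["Party"] == "Republican")
                    | (elections["Party"] == "Democratic")]

# Option 2: isin tests membership in a list of allowed values.
with_isin = elections[elections["Party"].isin(["Republican", "Democratic"])]

print(with_or.equals(with_isin))
```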
The Query Command
The query command provides an alternate way to combine multiple conditions.
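A small sketch of query next to the equivalent boolean-array filter, on illustrative data. It sticks to the Year and Result columns because a column named % is not a plain identifier and would need extra quoting inside the query string.

```python
import pandas as pd

# Hypothetical stand-in for the lecture's elections table; values are illustrative.
elections = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson", "Reagan", "Mondale", "Bush"],
    "Party": ["Republican", "Democratic", "Independent",
              "Republican", "Democratic", "Republican"],
    "%": [50.7, 41.0, 6.6, 58.8, 37.6, 53.4],
    "Year": [1980, 1980, 1980, 1984, 1984, 1988],
    "Result": ["win", "loss", "loss", "win", "loss", "win"],
})

# Conditions are written as one string; column names appear unquoted.
queried = elections.query("Year > 1980 and Result == 'win'")

# The same filter using boolean Series and &.
filtered = elections[(elections["Year"] > 1980) & (elections["Result"] == "win")]

print(queried)
```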
Indexing with .loc and .iloc
Sampling with .sample
Loc and iloc
Loc and iloc are alternate ways to index into a DataFrame.
Documentation:
Loc
Loc does two things: it accesses values by label, and it accesses values using a boolean array.
Loc with Lists
The most basic use of loc is to provide a list of row and column labels, which returns a DataFrame.
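A minimal sketch of loc with two lists, assuming an illustrative table whose row labels are the default 0 through 5:

```python
import pandas as pd

# Hypothetical stand-in for the lecture's elections table; values are illustrative.
elections = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson", "Reagan", "Mondale", "Bush"],
    "Party": ["Republican", "Democratic", "Independent",
              "Republican", "Democratic", "Republican"],
    "%": [50.7, 41.0, 6.6, 58.8, 37.6, 53.4],
    "Year": [1980, 1980, 1980, 1984, 1984, 1988],
    "Result": ["win", "loss", "loss", "win", "loss", "win"],
})

# First list: row labels. Second list: column labels.
subset = elections.loc[[0, 3, 5], ["Candidate", "Year"]]
print(subset)
```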
Loc with Slices
Loc is also commonly used with slices.
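A sketch of loc with slices on illustrative data. One detail worth noting: unlike ordinary Python slices, label slices in loc include the end label.

```python
import pandas as pd

# Hypothetical stand-in for the lecture's elections table; values are illustrative.
elections = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson", "Reagan", "Mondale", "Bush"],
    "Party": ["Republican", "Democratic", "Independent",
              "Republican", "Democratic", "Republican"],
    "%": [50.7, 41.0, 6.6, 58.8, 37.6, 53.4],
    "Year": [1980, 1980, 1980, 1984, 1984, 1988],
    "Result": ["win", "loss", "loss", "win", "loss", "win"],
})

# Rows labeled 0 through 3 and columns "Candidate" through "Year", both inclusive.
block = elections.loc[0:3, "Candidate":"Year"]
print(block)

# Compare with [] row slicing, which excludes the end position.
print(len(block), len(elections[0:3]))
```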
Loc with Single Values for Column Label
If we provide only a single label as column argument, we get a Series.
Loc with Single Values for Column Label
As before with the [] operator, if we provide a list of only one label as an argument, we get back a dataframe.
Loc with Single Values for Row Label
If we provide only a single row label, we get a Series.
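The single-label cases above can be sketched side by side, again on an illustrative table with default row labels:

```python
import pandas as pd

# Hypothetical stand-in for the lecture's elections table; values are illustrative.
elections = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson", "Reagan", "Mondale", "Bush"],
    "Party": ["Republican", "Democratic", "Independent",
              "Republican", "Democratic", "Republican"],
    "%": [50.7, 41.0, 6.6, 58.8, 37.6, 53.4],
    "Year": [1980, 1980, 1980, 1984, 1984, 1988],
    "Result": ["win", "loss", "loss", "win", "loss", "win"],
})

# A single column label gives a Series.
column_series = elections.loc[0:3, "Candidate"]

# A one-item list of column labels gives a DataFrame with one column.
column_frame = elections.loc[0:3, ["Candidate"]]

# A single row label gives a Series whose index is the column labels.
row_series = elections.loc[0, "Candidate":"Year"]

print(type(column_series).__name__, type(column_frame).__name__)
print(row_series)
```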
Loc Supports Boolean Arrays
Loc supports Boolean Arrays exactly as you’d expect.
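A short sketch of a boolean array as the row argument to loc, with a list of column labels as the second argument (illustrative data):

```python
import pandas as pd

# Hypothetical stand-in for the lecture's elections table; values are illustrative.
elections = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson", "Reagan", "Mondale", "Bush"],
    "Party": ["Republican", "Democratic", "Independent",
              "Republican", "Democratic", "Republican"],
    "%": [50.7, 41.0, 6.6, 58.8, 37.6, 53.4],
    "Year": [1980, 1980, 1980, 1984, 1984, 1988],
    "Result": ["win", "loss", "loss", "win", "loss", "win"],
})

# Rows are chosen by the boolean Series, columns by label.
majority = elections.loc[elections["%"] > 50, ["Candidate", "Year"]]
print(majority)
```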
iloc: Integer-Based Indexing for Selection by Position
In contrast to loc, iloc doesn’t think about labels at all. Instead, it returns the items that appear in the numerical positions specified.
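To make the contrast concrete, the sketch below sorts an illustrative table so that row labels and row positions no longer agree, then asks loc and iloc for "row 0":

```python
import pandas as pd

# Hypothetical stand-in for the lecture's elections table; values are illustrative.
elections = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson", "Reagan", "Mondale", "Bush"],
    "Party": ["Republican", "Democratic", "Independent",
              "Republican", "Democratic", "Republican"],
    "%": [50.7, 41.0, 6.6, 58.8, 37.6, 53.4],
    "Year": [1980, 1980, 1980, 1984, 1984, 1988],
    "Result": ["win", "loss", "loss", "win", "loss", "win"],
})

# Positions only: rows 0, 3, 5 and columns 0 and 3.
by_position = elections.iloc[[0, 3, 5], [0, 3]]

# iloc slices exclude the end position, like ordinary Python slices.
corner = elections.iloc[0:3, 0:2]

# After sorting, the label 0 and the position 0 refer to different rows.
by_pct = elections.sort_values("%")
print(by_pct.loc[0, "Candidate"])   # row whose label is 0
print(by_pct.iloc[0, 0])            # row in the first position
```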
Advantages of loc:
Nonetheless, iloc can be more convenient. Use iloc judiciously.
Annoying Question Challenge
Which of the following pandas statements returns a DataFrame of the first 3 Candidate names only for candidates that won with more than 50% of the vote?
elections.iloc[[0, 3, 5], [0, 3]]
elections.loc[[0, 3, 5], ["Candidate":"Year"]
elections.loc[elections["%"] > 50, ["Candidate", "Year"]].head(3)
elections.loc[elections["%"] > 50, ["Candidate", "Year"]].iloc[0:2, :]
See notebook for why!
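As a partial check that can be run without the notebook, the sketch below compares only the last two options on an illustrative table. It counts how many rows each one returns; the notebook remains the reference for the full answer.

```python
import pandas as pd

# Hypothetical stand-in for the lecture's elections table; values are illustrative.
elections = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson", "Reagan", "Mondale", "Bush"],
    "Party": ["Republican", "Democratic", "Independent",
              "Republican", "Democratic", "Republican"],
    "%": [50.7, 41.0, 6.6, 58.8, 37.6, 53.4],
    "Year": [1980, 1980, 1980, 1984, 1984, 1988],
    "Result": ["win", "loss", "loss", "win", "loss", "win"],
})

over_half = elections.loc[elections["%"] > 50, ["Candidate", "Year"]]

with_head = over_half.head(3)        # first three rows
with_iloc = over_half.iloc[0:2, :]   # positions 0 and 1 only

print(len(with_head), len(with_iloc))
```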
Note on Exam Problems
Q: Are you going to put horrible problems like these on the exam?
A: Technically such problems would be in scope, but it’s very unlikely they’ll be this nitpicky.
Sample
If you want a DataFrame consisting of a random selection of rows, you can use the sample method.
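A minimal sketch of sample on illustrative data. The random_state argument is optional; it is used here only so that repeated runs pick the same rows.

```python
import pandas as pd

# Hypothetical stand-in for the lecture's elections table; values are illustrative.
elections = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson", "Reagan", "Mondale", "Bush"],
    "Party": ["Republican", "Democratic", "Independent",
              "Republican", "Democratic", "Republican"],
    "%": [50.7, 41.0, 6.6, 58.8, 37.6, 53.4],
    "Year": [1980, 1980, 1980, 1984, 1984, 1988],
    "Result": ["win", "loss", "loss", "win", "loss", "win"],
})

# Draw three rows at random, without replacement by default.
picked = elections.sample(3, random_state=0)
print(picked)
```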
Handy Properties and Utility Functions for Series and DataFrames
Numpy Operations
Pandas Series and DataFrames support a large number of operations, including mathematical operations so long as the data is numerical.
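For instance, a numeric column can be passed to NumPy functions or used directly in arithmetic. The sketch below assumes an illustrative table and hypothetical variable names.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the lecture's elections table; values are illustrative.
elections = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson", "Reagan", "Mondale", "Bush"],
    "%": [50.7, 41.0, 6.6, 58.8, 37.6, 53.4],
    "Year": [1980, 1980, 1980, 1984, 1984, 1988],
})

pct = elections["%"]

mean_pct = np.mean(pct)   # NumPy functions accept a numeric Series
largest = pct.max()       # many operations are also available as methods
fractions = pct / 100     # arithmetic applies to every entry

print(mean_pct, largest)
print(fractions)
```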
head, size, shape, and describe
head: Displays only the top few rows.
size: Gives the total number of data points.
shape: Gives the size of the data in rows and columns.
describe: Provides a summary of the data.
index and columns
index: Returns the index (a.k.a. row labels).
columns: Returns the labels for the columns.
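The properties and methods listed above can be tried on a small illustrative table:

```python
import pandas as pd

# Hypothetical stand-in for the lecture's elections table; values are illustrative.
elections = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson", "Reagan", "Mondale", "Bush"],
    "Party": ["Republican", "Democratic", "Independent",
              "Republican", "Democratic", "Republican"],
    "%": [50.7, 41.0, 6.6, 58.8, 37.6, 53.4],
    "Year": [1980, 1980, 1980, 1984, 1984, 1988],
    "Result": ["win", "loss", "loss", "win", "loss", "win"],
})

top = elections.head(2)          # first two rows (head() alone shows five)
print(elections.size)            # rows times columns
print(elections.shape)           # (rows, columns)

summary = elections.describe()   # by default, summarizes the numeric columns
print(summary)

print(elections.index)           # the row labels
print(elections.columns)         # the column labels
```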
The sort_values Method
One incredibly useful method for DataFrames is sort_values, which creates a copy of a DataFrame sorted by a specific column.
The sort_values Method
We can also use sort_values on a Series, which returns a copy with the values in order.
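A sketch of sort_values on both a DataFrame and a Series, using illustrative data. In both cases the original object is left unchanged.

```python
import pandas as pd

# Hypothetical stand-in for the lecture's elections table; values are illustrative.
elections = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson", "Reagan", "Mondale", "Bush"],
    "Party": ["Republican", "Democratic", "Independent",
              "Republican", "Democratic", "Republican"],
    "%": [50.7, 41.0, 6.6, 58.8, 37.6, 53.4],
    "Year": [1980, 1980, 1980, 1984, 1984, 1988],
    "Result": ["win", "loss", "loss", "win", "loss", "win"],
})

# DataFrame: name the column to sort by; ascending=False puts the largest first.
by_pct = elections.sort_values("%", ascending=False)
print(by_pct)

# Series: no column name is needed.
names = elections["Candidate"].sort_values()
print(names)

# The original table keeps its original order.
print(elections.index)
```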
The value_counts Method
Series also has the method value_counts, which creates a new Series showing the counts of every value.
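A minimal sketch of value_counts on an illustrative Party column:

```python
import pandas as pd

# Hypothetical stand-in for the lecture's Party column; values are illustrative.
party = pd.Series(["Republican", "Democratic", "Independent",
                   "Republican", "Democratic", "Republican"], name="Party")

# The result is a Series: distinct values as the index, counts as the values.
counts = party.value_counts()
print(counts)
```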
The unique Method
Another handy method for Series is unique, which returns all unique values as an array.
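A minimal sketch of unique on the same kind of illustrative Party column; the values come back in order of first appearance.

```python
import pandas as pd

# Hypothetical stand-in for the lecture's Party column; values are illustrative.
party = pd.Series(["Republican", "Democratic", "Independent",
                   "Republican", "Democratic", "Republican"], name="Party")

# Each distinct value appears once in the result.
parties = party.unique()
print(parties)
```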
The Things We Just Saw
Baby Names Exploration
Wrapping Up
To wrap up today, let’s try answering some questions about a list of California baby names.
I’ll start with my own goal, and will then take suggested goals from you and try to write code to achieve your goals.