1 of 55

Pandas, Part 1

Introduction to Pandas syntax and operators

Data 100, Summer 2021 @ UC Berkeley

Raguvir Kunani and Isaac Schmidt

(content by Josh Hug, Fernando Pérez)

LECTURE 4

2 of 55

Goals For This Lecture

  • Introduce Pandas, with emphasis on:
    • A mental model of DataFrames - linking to statistics.
    • Key Data Structures (data frames, series, indices).
    • How to index into these structures.
    • How to read files to create these structures.
    • Other basic operations on these structures.
  • Will go through quite a lot of the language without full explanations.
    • We expect you to fill in the gaps on homeworks, labs, projects, and through your own experimentation.
  • Solve some very basic data science problems using Jupyter/pandas.

If you’ve taken Data 8, you might find “Intro to Pandas if you’ve taken Data 8” useful.

3 of 55

Data Frames: a high-level, statistical perspective

4 of 55

The world, a statistician's view (I'm NOT a statistician 😀)

🌍

A (statistical) population from which we draw samples.

Each sample has certain features.

A generic DataFrame

(from https://arxiv.org/abs/2001.00888)

Features

Samples

5 of 55

Connecting with SQL: dataframes and relational ideas

  • Statistical modeling ultimately involves lots of linear algebra manipulations.
  • The Relational Algebra that underlies databases (SQL) can be connected with Linear Algebra ideas.

6 of 55

Recent Berkeley work: a theory of dataframes

7 of 55

Pandas Data Structures:�Data Frames, Series, and Indices

8 of 55

Pandas Data Structures

There are three fundamental data structures in pandas:

  • Data Frame: 2D data tabular data.
  • Series: 1D data. I usually think of it as columnar data.
  • Index: A sequence of row labels.

Data Frame

Series

Index

9 of 55

The Relationship Between Data Frames, Series, and Indices

We can think of a Data Frame as a collection of Series that all share the same Index.

  • Candidate, Party, %, Year, and Result Series all share an index from 0 to 5.

Candidate Series

Party Series

% Series

Year Series

Result Series

Non-native English speaker note: The plural of “series” is “series”. Sorry.

10 of 55

Indices Are Not Necessarily Row Numbers

Indices (a.k.a. row labels) can also:

  • Be non-numeric.
  • Have a name, e.g. “State”.

11 of 55

Indices

The row labels that constitute an index do not have to be unique.

  • Left: The index values are all unique and numeric, acting as a row number.
  • Right: The index values are named and non-unique.

12 of 55

Column Names Are Usually Unique!

Column names in Pandas are almost always unique!

  • Example: Really shouldn’t have two columns named “Candidate”.

13 of 55

Summary: structure of a Series

14 of 55

Summary: structure of a DataFrame

15 of 55

Hands On Exercise

Let’s experiment with reading csv files and playing around with indices.

  • See 05-pandas-basics.ipynb.

16 of 55

Indexing with The [] Operator

17 of 55

Indexing by Column Names Using [] Operator

Given a dataframe, it is common to extract a Series or a collection of Series. This process is also known as “Column Selection” or sometimes “indexing by column”.

  • Column name argument to [] yields Series.
  • List argument to [] yields a Data Frame.

18 of 55

Indexing by Column Names Using [] Operator

Given a dataframe, it is common to extract a Series or a collection of Series. This process is also known as “Column Selection” or sometimes “indexing by column”.

  • Column name argument to [] yields Series.
  • List argument (even of one name) to [] yields a Data Frame.

19 of 55

Indexing by Row Slices Using [] Operator

We can also index by row numbers using the [] operator.

  • Numeric slice argument to [] yields rows.
  • Example: [0:3] yields rows 0 to 2.

20 of 55

[] Summary

[]

List

[]

Numeric Slice

[]

Name

DataFrame

DataFrame

Series

Single Column Selection

Multiple Column Selection

(Multiple) Row Selection

21 of 55

Note: Row Selection Requires Slicing!!

elections[0] will not work unless the elections data frame has a column whose name is the numeric zero.

  • Note: It is actually possible for columns to have names that are non-String types, e.g. numeric, datetime etc.

22 of 55

Question

Try to predict the output of the following:

  • weird[1]
  • weird[“1”]
  • weird[1:]

[]

Name

Series

Single Column Selection

[]

List

DataFrame

Multiple Column Selection

[]

Numeric Slice

DataFrame

(Multiple) Row Selection

23 of 55

Boolean Array Selection

and Querying

24 of 55

Boolean Array Input

Yet another input type supported by [] is the boolean array.

Entry number 7

25 of 55

Boolean Array Input

Yet another input type supported by [] is the boolean array. Useful because boolean arrays can be generated by using logical operators on Series.

Length 23 Series where every entry is “Republican”, “Democrat” or “Independent.”

Length 23 Series where every entry is either “True” or “False”, where “True” occurs for every independent candidate.

26 of 55

Boolean Array Input

Boolean Series can be combined using the & operator, allowing filtering of results by multiple criteria.

27 of 55

isin

The isin function makes it more convenient to find rows that match one of many possible values.

Example: Suppose we want to find “Republican” or “Democratic” candidates. Could use the | operator (| means or), or we can use isin.

  • Ugly:
  • Better:

28 of 55

The Query Command

The query command provides an alternate way to combine multiple conditions.

29 of 55

Indexing with .loc and .iloc

Sampling with .sample

30 of 55

Loc and iloc

Loc and iloc are alternate ways to index into a DataFrame.

  • They take a lot of getting used to! Documentation and ideas behind them are �quite complex.
  • I’ll go over common usages (see docs for weirder ones).

Documentation:

31 of 55

Loc

Loc does two things:

  • Access values by labels.
  • Access values using a boolean array (a la Boolean Array Selection).

32 of 55

Loc with Lists

The most basic use of loc is to provide a list of row and column labels, which returns a DataFrame.

33 of 55

Loc with Lists

The most basic use of loc is to provide a list of row and column labels, which returns a DataFrame.

34 of 55

Loc with Slices

Loc is also commonly used with slices.

  • Slicing works with all label types, not just numeric labels.
  • Slices with loc are inclusive, not exclusive.

35 of 55

Loc with Slices

Loc is also commonly used with slices.

  • Slicing works with all label types, not just numeric labels.
  • Slices with loc are inclusive, not exclusive.

36 of 55

Loc with Single Values for Column Label

If we provide only a single label as column argument, we get a Series.

37 of 55

Loc with Single Values for Column Label

As before with the [] operator, if we provide a list of only one label as an argument, we get back a dataframe.

38 of 55

Loc with Single Values for Row Label

If we provide only a single row label, we get a Series.

  • Such a series represents a ROW not a column!
  • The index of this Series is the names of the columns from the data frame.
  • Putting the single row label in a list yields a dataframe version.

39 of 55

Loc Supports Boolean Arrays

Loc supports Boolean Arrays exactly as you’d expect.

40 of 55

iloc: Integer-Based Indexing for Selection by Position

In contrast to loc, iloc doesn’t think about labels at all. Instead, it returns the items that appear in the numerical positions specified.

Advantages of loc:

  • Harder to make mistakes.
  • Easier to read code.
  • Not vulnerable to changes to the ordering of rows/cols in raw data files.

Nonetheless, iloc can be more convenient. Use iloc judiciously. �

41 of 55

Annoying Question Challenge

Which of the following pandas statements returns a DataFrame of the first 3 Candidate names only for candidates that won with more than 50% of the vote.

elections.iloc[[0, 3, 5], [0, 3]]

elections.loc[[0, 3, 5], ["Candidate":"Year"]

elections.loc[elections["%"] > 50, ["Candidate", "Year"]].head(3)

elections.loc[elections["%"] > 50, ["Candidate", "Year"]].iloc[0:2, :]

42 of 55

Annoying Question Challenge

Which of the following pandas statements returns a DataFrame of the first 3 Candidate names only for candidates that won with more than 50% of the vote.

elections.iloc[[0, 3, 5], [0, 3]]

elections.loc[[0, 3, 5], ["Candidate":"Year"]

elections.loc[elections["%"] > 50, ["Candidate", "Year"]].head(3)

elections.loc[elections["%"] > 50, ["Candidate", "Year"]].iloc[0:2, :]

See notebook for why!

43 of 55

Note on Exam Problems

Q: Are you going to put horrible problems like these on the exam?

�A: Technically such problems would be in scope, but it’s very unlikely they’ll be this nitpicky.

44 of 55

Sample

If you want a DataFrame consisting of a random selection of rows, you can use the sample method.

  • By default, it is without replacement. Use replace=true for replacement.
  • Naturally, can be chained with our selection operators [], loc, iloc.

45 of 55

Handy Properties and Utility Functions for Series and DataFrames

46 of 55

Numpy Operations

Pandas Series and DataFrames support a large number of operations, including mathematical operations so long as the data is numerical.

47 of 55

head, size, shape, and describe

head: Displays only the top few rows.

size: Gives the total number of data points.

shape: Gives the size of the data in rows and columns.

describe: Provides a summary of the data.

48 of 55

index and columns

index: Returns the index (a.k.a. row labels).

columns: Returns the labels for the columns.

49 of 55

The sort_values Method

One incredibly useful method for DataFrames is sort_values, which creates a copy of a DataFrame sorted by a specific column.

50 of 55

The sort_values Method

We can also use sort_values on a Series, which returns a copy with with the values in order.

51 of 55

The value_counts Method

Series also has the function value_counts, which creates a new Series showing the counts of every value.

52 of 55

The unique Method

Another handy method for Series is unique, which returns all unique values as an array.

53 of 55

The Things We Just Saw

  • sort_values
  • value_counts
  • unique

54 of 55

Baby Names Exploration

55 of 55

Wrapping Up

To wrap up today, let’s try answering some questions about a list of California baby names.

I’ll start with my own goal, and will then take suggested goals from you and try to write code to achieve your goals.