Data Tools
Lecture 2
Tools we need for ML – pandas and visualization
EECS 189/289, Spring 2026 @ UC Berkeley
Jennifer Listgarten and Alex Dimakis
All emails should go to: cs189-instructors@berkeley.edu
Why Start with Data?
Every model begins and ends with data. It’s the raw material for training and evaluation.
Success in ML depends on how well we process inputs and outputs from experiments.
Whether you join a lab or work in industry, strong data wrangling and visualization skills are essential.
Roadmap
pandas Data Structures
pandas
Using pandas, we can:
pandas is the standard tool across research and industry for working with tabular data.
The name "pandas" is derived from "panel data".
pandas Data Types
A DataFrame
A Series
An Index
Series
import pandas as pd
welcome_series = pd.Series(["welcome", "to", "CS 189"])
welcome_series.values
welcome_series.index
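A minimal sketch of what the two attributes above return. The second Series, with landmark names as its index, is an illustrative assumption and not part of the slide.

```python
import pandas as pd

welcome_series = pd.Series(["welcome", "to", "CS 189"])

# .values holds the data; .index holds the labels (0, 1, 2 by default).
print(list(welcome_series.values))
print(list(welcome_series.index))

# Assumed example: a custom index replaces the default integer labels.
heights = pd.Series([30, 307, 80], index=["Sather Gate", "Campanile", "Doe Library"])
print(heights["Campanile"])
```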
DataFrame
[Figure: a DataFrame whose columns are four Series: Landmark, Type, Height, and Year Built]
You can create a DataFrame from a Python dictionary or by reading a file such as a CSV.
Creating a DataFrame
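# Option 1: build a DataFrame from a dictionary that maps column names to values
data = {
    'Landmark': ['Sather Gate', 'Campanile', 'Doe Library', 'Memorial Glade', 'Sproul Plaza'],
    'Type': ['Gate', 'Tower', 'Library', 'Open Space', 'Plaza'],
    'Height': [30, 307, 80, 0, 0],
    'Year Built': [1910, 1914, 1911, None, 1962]  # None becomes a missing value (NaN)
}
df = pd.DataFrame(data)
# Option 2: read a CSV file, using its 'Year' column as the index
event_data = pd.read_csv("data/uc_berkeley_events.csv", index_col='Year')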
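A runnable sketch of both construction routes. The CSV text below is a hypothetical stand-in for data/uc_berkeley_events.csv (its real contents are not shown in these slides), read from memory so the example does not need the file.

```python
import io
import pandas as pd

data = {
    "Landmark": ["Sather Gate", "Campanile", "Doe Library", "Memorial Glade", "Sproul Plaza"],
    "Type": ["Gate", "Tower", "Library", "Open Space", "Plaza"],
    "Height": [30, 307, 80, 0, 0],
    "Year Built": [1910, 1914, 1911, None, 1962],
}
df = pd.DataFrame(data)

# The None in 'Year Built' is stored as NaN, so the column becomes float64.
print(df.dtypes)

# Hypothetical CSV contents; the event names are placeholders.
csv_text = "Year,Event\n1914,Example event A\n1962,Example event B\n"
event_data = pd.read_csv(io.StringIO(csv_text), index_col="Year")
print(event_data.index.name)
```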
Exploring DataFrames
Utility Functions for DataFrame
head() and tail()
Use head() to display the first few rows of the DataFrame.
df.head() # default is 5 rows
Use tail() to display the last few rows.
df.tail(3) # last 3 rows
shape and size
shape gets the number of rows and columns in the DataFrame (here 5 rows and 4 columns).
df.shape
(5, 4)
size gets the total number of elements in the DataFrame (rows times columns).
df.size
20
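A short sketch tying the utility functions together on the landmarks DataFrame from the earlier slide.

```python
import pandas as pd

df = pd.DataFrame({
    "Landmark": ["Sather Gate", "Campanile", "Doe Library", "Memorial Glade", "Sproul Plaza"],
    "Type": ["Gate", "Tower", "Library", "Open Space", "Plaza"],
    "Height": [30, 307, 80, 0, 0],
    "Year Built": [1910, 1914, 1911, None, 1962],
})

first_two = df.head(2)      # first 2 rows
last_three = df.tail(3)     # last 3 rows
n_rows, n_cols = df.shape   # shape is a (rows, columns) tuple

# size counts every cell, so it equals rows * columns.
print(n_rows, n_cols, df.size)
```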
sample()
To sample a random selection of rows from a DataFrame, we use the .sample() method.
.sample(n=#)
Randomly sample n rows.
.sample(frac=#)
Randomly sample frac of rows.
.sample(n=#, replace=True)
Allows the same row to appear multiple times.
.sample(n=#, random_state=42)
Using random_state for reproducibility.
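A small sketch of the four variants. The specific values of n, frac, and random_state are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Landmark": ["Sather Gate", "Campanile", "Doe Library", "Memorial Glade", "Sproul Plaza"],
    "Height": [30, 307, 80, 0, 0],
})

two_rows = df.sample(n=2, random_state=42)
same_two_rows = df.sample(n=2, random_state=42)   # same seed, same rows

fraction = df.sample(frac=0.4, random_state=0)    # 40% of 5 rows = 2 rows

# With replacement, rows can repeat, so n may exceed len(df).
bootstrap = df.sample(n=8, replace=True, random_state=0)

print(two_rows.equals(same_two_rows))
```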
value_counts() and unique()
Series.value_counts
Counts the number of occurrences of each unique value in a Series.
type_counts = df['Type'].value_counts()
Series.unique
Returns an array of every unique value in a Series.
unique_types = df['Type'].unique()
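Every Type in the landmarks table occurs exactly once, so the sketch below uses a hypothetical Series with repeats to make the difference between the two methods visible.

```python
import pandas as pd

# Hypothetical column with repeated values (not the slide's data).
types = pd.Series(["Gate", "Tower", "Library", "Plaza", "Plaza", "Library", "Plaza"])

# value_counts: a Series of counts, most frequent value first.
type_counts = types.value_counts()

# unique: an array of distinct values, in order of first appearance.
unique_types = types.unique()

print(type_counts)
print(list(unique_types))
```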
Selecting and Retrieving Data
Integer-based Extraction: iloc[]
We want to extract data according to its position.
[Figure: the landmarks DataFrame annotated with row positions 0 to 4 and column positions 0 to 3]
Python convention: The first position has integer index 0.
Arguments to .iloc can be a list, a slice, or a single value:
A list:
df.iloc[[1, 2]]              # rows at positions 1 and 2, all columns
df.iloc[[1, 2], [1, 2, 3]]   # the same rows, columns at positions 1, 2, and 3
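A slice (the end position is excluded):
df.iloc[2:4]      # rows at positions 2 and 3
df.iloc[:, 1:2]   # all rows, only the column at position 1 (still a DataFrame)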
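A single value:
df.iloc[:, 0]   # the column at position 0, as a Series
df.iloc[2]      # the row at position 2, as a Series
df.iloc[0, 1]   # the single cell at row 0, column 1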
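The three argument styles for .iloc can be checked side by side on the landmarks DataFrame; the variable names are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Landmark": ["Sather Gate", "Campanile", "Doe Library", "Memorial Glade", "Sproul Plaza"],
    "Type": ["Gate", "Tower", "Library", "Open Space", "Plaza"],
    "Height": [30, 307, 80, 0, 0],
    "Year Built": [1910, 1914, 1911, None, 1962],
})

rows_by_list = df.iloc[[1, 2]]        # list: rows at positions 1 and 2
block = df.iloc[[1, 2], [1, 2, 3]]    # lists for both rows and columns
rows_by_slice = df.iloc[2:4]          # slice: positions 2 and 3 (end excluded)
one_col_frame = df.iloc[:, 1:2]       # a width-1 slice is still a DataFrame
first_col = df.iloc[:, 0]             # single column position: a Series
third_row = df.iloc[2]                # single row position: a Series
cell = df.iloc[0, 1]                  # two single positions: a scalar

print(cell)
```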
Label-based Extraction: loc[]
We want to extract data according to its labels.
[Figure: the landmarks DataFrame annotated with its row labels (the index) and column labels (the column names)]
Arguments to .loc can be a list, a slice, or a single value:
A list:
df.loc[[1, 2]]                                     # rows labeled 1 and 2, all columns
df.loc[[1, 2], ['Type', 'Height', 'Year Built']]   # the same rows, three named columns
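A slice (unlike .iloc, both endpoints are included):
df.loc[2:3]                      # rows labeled 2 and 3
df.loc[:, 'Landmark':'Height']   # all rows, columns from 'Landmark' through 'Height'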
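A single value:
df.loc[:, 'Landmark']   # the 'Landmark' column, as a Series
df.loc[2]               # the row labeled 2, as a Series
df.loc[0, 'Type']       # the single cell at row label 0, column 'Type'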
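A sketch contrasting .loc with .iloc. With the default index the row labels happen to equal the positions, so the last part sets 'Landmark' as the index (an added illustration, not on the slide) to show labels that are not integers.

```python
import pandas as pd

df = pd.DataFrame({
    "Landmark": ["Sather Gate", "Campanile", "Doe Library", "Memorial Glade", "Sproul Plaza"],
    "Type": ["Gate", "Tower", "Library", "Open Space", "Plaza"],
    "Height": [30, 307, 80, 0, 0],
    "Year Built": [1910, 1914, 1911, None, 1962],
})

label_slice = df.loc[2:3]                  # labels 2 AND 3 (end included)
position_slice = df.iloc[2:3]              # position 2 only (end excluded)
col_range = df.loc[:, "Landmark":"Height"] # column slice by name, end included
cell = df.loc[0, "Type"]

# Assumed variation: use landmark names as row labels.
by_name = df.set_index("Landmark")
campanile_height = by_name.loc["Campanile", "Height"]

print(len(label_slice), len(position_slice), cell, campanile_height)
```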
Context-Dependent Selection
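The [] operator is context sensitive: a slice selects rows, a list of column labels selects several columns, and a single column label selects one column.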
df[1:3]   # a slice selects rows (here the rows at positions 1 and 2)
df[['Landmark', 'Year Built']]   # a list of labels selects columns, as a DataFrame
df['Year Built']   # a single label selects one column, as a Series
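The three behaviours of [] can be confirmed by inspecting the type and shape of each result, as in this sketch.

```python
import pandas as pd

df = pd.DataFrame({
    "Landmark": ["Sather Gate", "Campanile", "Doe Library", "Memorial Glade", "Sproul Plaza"],
    "Type": ["Gate", "Tower", "Library", "Open Space", "Plaza"],
    "Height": [30, 307, 80, 0, 0],
    "Year Built": [1910, 1914, 1911, None, 1962],
})

row_slice = df[1:3]                      # slice -> rows at positions 1 and 2
two_cols = df[["Landmark", "Year Built"]]  # list of labels -> DataFrame of columns
one_col = df["Year Built"]               # single label -> Series

print(type(row_slice).__name__, type(two_cols).__name__, type(one_col).__name__)
```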
Filtering Data
Boolean Array for Filtering a DataFrame
We learned to extract data according to its integer position (.iloc) or its label (.loc).
What if we want to extract rows that satisfy a given condition?
df['Height'] > 50
Produces a Boolean Series with one True/False entry per row.
df[df['Height'] > 50]
or, equivalently,
df.loc[df['Height'] > 50]
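A minimal sketch of the filtering step above, storing the Boolean Series in a variable (named mask here for illustration) before using it.

```python
import pandas as pd

df = pd.DataFrame({
    "Landmark": ["Sather Gate", "Campanile", "Doe Library", "Memorial Glade", "Sproul Plaza"],
    "Type": ["Gate", "Tower", "Library", "Open Space", "Plaza"],
    "Height": [30, 307, 80, 0, 0],
})

mask = df["Height"] > 50     # Boolean Series, aligned with df's index
tall = df[mask]              # keeps only the rows where mask is True
tall_loc = df.loc[mask]      # same result through .loc

print(mask.tolist())
print(tall["Landmark"].tolist())
```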
Combining Boolean Series for Filtering
Boolean Series can be combined using various operators, allowing filtering of results by multiple criteria.
df[(df['Height'] > 50) & (df['Type'] == 'Library')]
Bitwise Operators
Symbol | Usage | Meaning |
~ | ~p | Negation of p |
| | p | q | p OR q |
& | p & q | p AND q |
^ | p ^ q | p XOR q (exclusive or) |
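A sketch applying each operator from the table to the landmarks data. The particular conditions are illustrative. Each comparison is wrapped in parentheses because &, |, and ^ bind more tightly than > and == in Python.

```python
import pandas as pd

df = pd.DataFrame({
    "Landmark": ["Sather Gate", "Campanile", "Doe Library", "Memorial Glade", "Sproul Plaza"],
    "Type": ["Gate", "Tower", "Library", "Open Space", "Plaza"],
    "Height": [30, 307, 80, 0, 0],
})

tall = df["Height"] > 50
is_tower = df["Type"] == "Tower"

both = df[tall & (df["Type"] == "Library")]          # AND
either = df[(df["Height"] == 0) | (df["Type"] == "Gate")]  # OR
not_tall = df[~tall]                                 # negation
exactly_one = df[tall ^ is_tower]                    # XOR: tall or tower, not both

print(both["Landmark"].tolist())
```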
DataFrame Modification
Adding and Modifying Columns
Adding a column is easy:
df['Experience'] = [2, 5, 1, 8, 4]
df['Height_Increase'] = df['Height'] * 0.1
Dropping a Column
df.drop(columns=['Experience'])
print(df)
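What happened? The 'Experience' column is still in df.
Most DataFrame operations are not in-place: they return a new DataFrame and leave the original unchanged.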
df.drop(columns=['Experience'], inplace=True)
print(df)
Alternatively, we can reassign the result (use one approach or the other, not both in sequence):
df = df.drop(columns=['Experience'])
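A short sketch of the not-in-place behaviour: the first drop returns a new DataFrame and leaves df alone; only the reassignment changes what df refers to.

```python
import pandas as pd

df = pd.DataFrame({
    "Landmark": ["Sather Gate", "Campanile", "Doe Library", "Memorial Glade", "Sproul Plaza"],
    "Height": [30, 307, 80, 0, 0],
})
df["Experience"] = [2, 5, 1, 8, 4]

# drop returns a new DataFrame; df itself still has the column.
dropped = df.drop(columns=["Experience"])
still_there = "Experience" in df.columns

# Reassigning makes the change visible under the name df.
df = df.drop(columns=["Experience"])
gone_now = "Experience" not in df.columns

print(still_there, gone_now)
```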
Sorting DataFrame
df.sort_values(by='Column', ascending=True)
df.sort_values(by=['Col1', 'Col2'], ascending=[True, False])
Sorting DataFrame by One Column
df.sort_values(by='Column', ascending=True)
df = df.sort_values(by='Height')
The default value for the ascending argument is True.
Sorting DataFrame by Multiple Columns
df.sort_values(by=['Col1', 'Col2'], ascending=[True, True])
df = df.sort_values(by=['Height', 'Type'], ascending=[True, False])
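A sketch of single-column and multi-column sorting on the landmarks data. In the two-key sort, ties on Height (the two rows with height 0) are broken by Type in descending order.

```python
import pandas as pd

df = pd.DataFrame({
    "Landmark": ["Sather Gate", "Campanile", "Doe Library", "Memorial Glade", "Sproul Plaza"],
    "Type": ["Gate", "Tower", "Library", "Open Space", "Plaza"],
    "Height": [30, 307, 80, 0, 0],
})

by_height = df.sort_values(by="Height")                      # ascending by default
tallest_first = df.sort_values(by="Height", ascending=False)
by_two = df.sort_values(by=["Height", "Type"], ascending=[True, False])

# sort_values returns a new DataFrame; df keeps its original order.
print(by_two["Landmark"].tolist())
```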
Handling Missing Values
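df_missing.isnull()    # Boolean DataFrame: True wherever a value is missing
df_missing.dropna()    # returns a copy without the rows that contain missing values
df_missing.fillna(0)   # returns a copy with missing values replaced; fillna needs a fill value (0 is only an example)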
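The slides do not show how df_missing is built, so this sketch assumes it is the landmarks table, whose 'Year Built' column has one missing entry. The fill values (0 and the column median) are illustrative choices, not a recommendation.

```python
import pandas as pd

# Assumption: df_missing is the landmarks table with one missing year.
df_missing = pd.DataFrame({
    "Landmark": ["Sather Gate", "Campanile", "Doe Library", "Memorial Glade", "Sproul Plaza"],
    "Height": [30, 307, 80, 0, 0],
    "Year Built": [1910, 1914, 1911, None, 1962],
})

missing_per_column = df_missing.isnull().sum()   # count of missing values per column
complete_rows = df_missing.dropna()              # drop rows with any missing value

# Fill only the 'Year Built' column, first with 0, then with the column median.
filled_zero = df_missing.fillna({"Year Built": 0})
median_year = df_missing["Year Built"].median()  # NaN is skipped by default
filled_median = df_missing.fillna({"Year Built": median_year})

print(missing_per_column["Year Built"], len(complete_rows), median_year)
```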
Aggregation in DataFrame
Aggregating Data in pandas
Basic
sum()
mean()
median()
min()
max()
count()
nunique() (number of unique values)
prod()
Example:
df['Height'].mean()
83.4
Statistical
std()
var()
sem() (standard error of the mean)
skew()
Example:
df['Height'].std()
129.2006191935627
Logical and Index-Based
any() (True if any value is True)
all() (True if all values are True)
first() (first non-null value; used as a groupby aggregation)
last() (last non-null value; used as a groupby aggregation)
idxmin() (index of min value)
idxmax() (index of max value)
Example:
df['Height'].idxmax()
1
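A sketch that reproduces the three example outputs above and adds a few of the other listed aggregations on the same landmarks data.

```python
import pandas as pd

df = pd.DataFrame({
    "Landmark": ["Sather Gate", "Campanile", "Doe Library", "Memorial Glade", "Sproul Plaza"],
    "Height": [30, 307, 80, 0, 0],
    "Year Built": [1910, 1914, 1911, None, 1962],
})

mean_height = df["Height"].mean()       # 417 / 5
std_height = df["Height"].std()         # sample standard deviation (ddof=1)
tallest_label = df["Height"].idxmax()   # index label of the largest value

total = df["Height"].sum()
distinct = df["Height"].nunique()       # 30, 307, 80, 0 -> four distinct values
known_years = df["Year Built"].count()  # count ignores missing values
any_positive = (df["Height"] > 0).any()
all_positive = (df["Height"] > 0).all()

print(mean_height, std_height, tallest_label)
```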
Grouping Data in pandas
Our goal:
Grouping is a powerful tool to 1) perform large operations all at once and 2) summarize trends in a dataset.
A .groupby() operation involves some combination of splitting the object, applying a function, and combining the results.
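augmented_df.groupby('Type').agg('mean')
[Figure: groupby('Type') splits the rows into a DataFrameGroupBy with one group per Type (Gate, Tower, Plaza, Open Space, Library); .agg('mean') then reduces each group to a single row.]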
Aggregation Functions
What goes inside .agg( )? Any function that aggregates several values into one summary value.
Built-in Python Functions | NumPy Functions | Built-in pandas Functions |
.agg(sum) | .agg(np.sum) | .agg("sum") |
.agg(max) | .agg(np.max) | .agg("max") |
.agg(min) | .agg(np.min) | .agg("min") |
- | .agg(np.mean) | .agg("mean") |
- | - | .agg("first") |
- | - | .agg("last") |
Some commonly-used aggregation functions can even be called directly, without the explicit use of .agg( ):
df.groupby("Type").mean(numeric_only=True)   # numeric_only skips text columns such as 'Landmark'
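A sketch of split-apply-combine. The slides do not show augmented_df, so the table below is a hypothetical version: four real landmarks plus two made-up rows ("Example Library", "Example Tower") so that some groups have more than one member. Selecting the 'Height' column before aggregating keeps text columns out of the mean.

```python
import pandas as pd

# Hypothetical augmented_df; the two "Example" rows are invented for illustration.
augmented_df = pd.DataFrame({
    "Landmark": ["Sather Gate", "Campanile", "Doe Library", "Sproul Plaza",
                 "Example Library", "Example Tower"],
    "Type": ["Gate", "Tower", "Library", "Plaza", "Library", "Tower"],
    "Height": [30, 307, 80, 0, 50, 100],
})

# Split by Type, apply mean to Height, combine into one value per group.
mean_by_type = augmented_df.groupby("Type")["Height"].agg("mean")

# Several aggregations at once give one column per function.
min_max = augmented_df.groupby("Type")["Height"].agg(["min", "max"])

# Shortcut without .agg; numeric_only skips the text column.
shortcut = augmented_df.groupby("Type").mean(numeric_only=True)

print(mean_by_type)
```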
Grouping by Multiple Columns
augmented_df.groupby(['Type', 'Campus']).agg('max')[['Height']]
Grouping by Multiple Columns: pivot_table
augmented_df.groupby(['Type', 'Campus']).agg('max')[['Height']]
But the grouped result now has a two-level index (Type, Campus). A pivot table places one key on the rows and the other on the columns:
pivot_table = pd.pivot_table(
    augmented_df,
    index='Type',       # rows
    columns='Campus',   # columns
    values='Height',    # values
    aggfunc='max'
)
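A sketch comparing the two layouts. The augmented_df below, including its 'Campus' column and the "Example Campus" rows, is hypothetical since the slides do not show the real table.

```python
import pandas as pd

# Hypothetical data; 'Example Campus' rows are invented for illustration.
augmented_df = pd.DataFrame({
    "Type": ["Tower", "Tower", "Library", "Library", "Library"],
    "Campus": ["Berkeley", "Example Campus", "Berkeley", "Berkeley", "Example Campus"],
    "Height": [307, 100, 80, 60, 50],
})

# groupby on two keys: rows are indexed by (Type, Campus) pairs.
grouped = augmented_df.groupby(["Type", "Campus"])[["Height"]].agg("max")

# pivot_table: Type on the rows, Campus on the columns, max Height in the cells.
pivot = pd.pivot_table(
    augmented_df,
    index="Type",
    columns="Campus",
    values="Height",
    aggfunc="max",
)

print(grouped)
print(pivot)
```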
Joining DataFrames
Joining Events and Landmarks
Joining DataFrames allows you to combine data from different sources based on a common key or index.
[Figure: the event_data and landmarks DataFrames side by side]
Types of Join: Inner Join
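[Figure: the event_data and landmarks tables]
# An inner join keeps only the rows whose key appears in both tables.
# Option 1: join() matches a column of landmarks against the index of the other table
# (this assumes 'Year' is a regular column of event_data; skip set_index if it is already the index)
result_join_inner = landmarks.join(event_data.set_index('Year'),
                                   on='Year Built',
                                   how='inner')
# Option 2: merge() names the key on each side explicitly
result_merge_inner = landmarks.merge(event_data,
                                     how='inner',
                                     left_on='Year Built',
                                     right_on='Year')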
Types of Join: Outer Join
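[Figure: the event_data and landmarks tables; rows marked "no match" have no partner in the other table]
# An outer join keeps every row from both tables; unmatched rows get NaN in the other table's columns.
result_merge_outer = landmarks.merge(event_data,
                                     how='outer',
                                     left_on='Year Built',
                                     right_on='Year')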
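A sketch of inner versus outer joins. The event rows are hypothetical placeholders (the real event_data is not shown in the slides), and 'Year' is assumed to be a regular column. The indicator=True flag is an addition that labels where each output row came from.

```python
import pandas as pd

landmarks = pd.DataFrame({
    "Landmark": ["Sather Gate", "Campanile", "Sproul Plaza"],
    "Year Built": [1910, 1914, 1962],
})
# Hypothetical events; names and the year 2000 row are invented.
event_data = pd.DataFrame({
    "Year": [1914, 1962, 2000],
    "Event": ["Example event A", "Example event B", "Example event C"],
})

# Inner: only years present in both tables (1914 and 1962).
inner = landmarks.merge(event_data, how="inner", left_on="Year Built", right_on="Year")

# Same inner join via join(): left column matched against the right index.
joined = landmarks.join(event_data.set_index("Year"), on="Year Built", how="inner")

# Outer: every row from both sides; _merge records the origin of each row.
outer = landmarks.merge(event_data, how="outer", left_on="Year Built",
                        right_on="Year", indicator=True)

print(outer)
```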
Types of Join
What does df['Type'] return?
What does df.tail(3) do?
df['Type'].unique() returns:
Any questions you have for the first two lectures or input for us?
Visualization
Why Do We Visualize Data?
Insight: Critical to gaining deeper insights into trends in the data.
Communication: Help convey trends in data to others.
In this class we will explore both but focus more on the first.
Plotting Libraries
There is a wide range of tools for data visualization in machine learning.
In this course we will focus on Plotly and Weights and Biases.
Matplotlib
import matplotlib.pyplot as plt
Line Plot
Scatter Plot
Bar Plot
Histogram
Box Plot
Heatmap
Plotly
Why Plotly?
Many would (correctly) argue that you should learn Matplotlib.
However, Matplotlib plots are static, which makes it difficult to slice and explore the data interactively.
Three Ways to Plot using Plotly
Easiest: Using pandas built-in plotting functions
Easy + Expressive: Use Plotly Express to construct plots quickly
Advanced: Build plots from graph objects (like Matplotlib)
Using pandas built-in Plotting Functions
Configure pandas to use Plotly (done at beginning of notebook).
pd.set_option('plotting.backend', 'plotly')
Call plot directly on your DataFrame:
# Make various kinds of plots 'scatter', 'bar', 'hist'
mpg.plot(kind='scatter', x='weight', y='mpg', color='origin',
title='MPG vs. Weight by Origin',
width=800, height=600)
Notice that we define:
Using Plotly Express
Very similar to calling pandas .plot functions but with a wide range of plotting capabilities (see tutorials):
import plotly.express as px
px.scatter(mpg, x='weight', y='mpg', color='origin', size='cylinders',
           hover_data=mpg.columns,
           title='MPG vs. Weight by Origin',
           width=800, height=600)
Here we pass the DataFrame (mpg) into the desired plotting function along with how columns are mapped to visual elements.
How You Can Learn More
Skim the tutorials (optional but can be fun):
When you can’t figure out how to plot something (or just want to learn more), ask an AI agent.
Credit: Joseph E. Gonzalez and Narges Norouzi
Reference Book Chapters: This topic is not covered in the textbook.