1 of 75

Data Tools

Lecture 2

Tools we need for ML – pandas and visualization

EECS 189/289, Spring 2026 @ UC Berkeley

Jennifer Listgarten and Alex Dimakis

All emails should go to: cs189-instructors@berkeley.edu

2 of 75

Why Start with Data?

Data is the foundation of ML

Every model begins and ends with data. It’s the raw material for training and evaluation.

Inputs and Experiments

Success in ML depends on how well we process inputs and outputs from experiments.

Critical Skill for Research & Industry

Whether you join a lab or work in industry, strong data wrangling and visualization skills are essential.

7824888

3 of 75

Roadmap

pandas

pandas Data Structures
Exploring DataFrames
Selecting and Retrieving Data in DataFrames
Filtering Data
DataFrame Modification
Aggregation in DataFrame
Joining DataFrames

Visualization

Matplotlib
Plotly

4 of 75

pandas Data Structures

5 of 75

pandas

Using pandas, we can:

Arrange data in a tabular format.
Extract useful information filtered by specific conditions.
Operate on data to gain new information.
Apply numerical operations using NumPy to our data.
Perform vectorized computations to speed up our analysis.

pandas is the standard tool across research and industry for working with tabular data.

The name pandas is derived from "panel data".
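As a small illustration of the last two bullets, the sketch below applies one arithmetic expression and one NumPy function to a whole Series at once. The feet-to-meters conversion is an assumption made for the example; the slides do not state the units of the heights.

```python
import numpy as np
import pandas as pd

# Heights of the five landmarks used later in the lecture (units assumed to be feet).
heights_ft = pd.Series([30, 307, 80, 0, 0], name="Height")

# Vectorized arithmetic: one expression is applied to every element, no Python loop.
heights_m = heights_ft * 0.3048

# NumPy functions accept a Series and return a Series with the same index.
root_heights = np.sqrt(heights_ft)

# The equivalent explicit loop, shown only for comparison.
looped = pd.Series([h * 0.3048 for h in heights_ft], name="Height")
```

Both approaches give the same numbers; the vectorized form is shorter and avoids per-element Python overhead.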

6 of 75

pandas Data Types

In the language of pandas, tables are referred to as DataFrames.

A DataFrame

A Series

An Index

7 of 75

Series

A Series is a one-dimensional labeled array.
Components of a Series object:

Values: The actual data stored in the Series.
Index: The labels associated with each data point, which allow for easy access and manipulation.

import pandas as pd

welcome_series = pd.Series(["welcome", "to", "CS 189"])

welcome_series.values

welcome_series.index
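A short sketch of how the values and the index relate. The custom labels "a", "b", "c" are made up for illustration and are not part of the lecture data.

```python
import pandas as pd

# Default index: integer labels 0, 1, 2.
welcome_series = pd.Series(["welcome", "to", "CS 189"])
print(welcome_series.values)  # the underlying array of data
print(welcome_series.index)   # the labels, here a RangeIndex

# Custom index: the labels "a", "b", "c" are hypothetical.
labeled = pd.Series(["welcome", "to", "CS 189"], index=["a", "b", "c"])
first_word = labeled["a"]     # look up a value by its label
```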

8 of 75

DataFrame

Each column of a DataFrame is a Series: here, the Landmark, Type, Height, and Year Built Series.

You can create a DataFrame:

From a CSV file
Using a dictionary
Using a list and column names
From Series

9 of 75

Creating a DataFrame

data = {
    'Landmark': ['Sather Gate', 'Campanile', 'Doe Library', 'Memorial Glade', 'Sproul Plaza'],
    'Type': ['Gate', 'Tower', 'Library', 'Open Space', 'Plaza'],
    'Height': [30, 307, 80, 0, 0],
    'Year Built': [1910, 1914, 1911, None, 1962]
}

# Using a dictionary: keys become column names, values become columns
df = pd.DataFrame(data)

# From a CSV file, using the 'Year' column as the index
event_data = pd.read_csv("data/uc_berkeley_events.csv", index_col='Year')
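The other two construction routes (a list with column names, and Series) could look like the following minimal sketch. The variable names are illustrative, and only two of the landmarks are used to keep it short.

```python
import pandas as pd

# Using a list of rows plus column names.
rows = [["Sather Gate", "Gate", 30], ["Campanile", "Tower", 307]]
df_from_list = pd.DataFrame(rows, columns=["Landmark", "Type", "Height"])

# From Series: each Series becomes one column, aligned on the shared index.
landmark = pd.Series(["Sather Gate", "Campanile"], name="Landmark")
height = pd.Series([30, 307], name="Height")
df_from_series = pd.DataFrame({"Landmark": landmark, "Height": height})
```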

10 of 75

Exploring DataFrames

11 of 75

Utility Functions for DataFrame

Understanding the structure and content of a DataFrame is an essential first step in data analysis. Here are some methods that give a quick overview of a DataFrame:

head() and tail(),
info(),
describe(),
sample(),
value_counts(), and
unique().
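info() and describe() are listed above; a minimal sketch of both on the landmarks DataFrame from the previous section:

```python
import pandas as pd

df = pd.DataFrame({
    "Landmark": ["Sather Gate", "Campanile", "Doe Library", "Memorial Glade", "Sproul Plaza"],
    "Type": ["Gate", "Tower", "Library", "Open Space", "Plaza"],
    "Height": [30, 307, 80, 0, 0],
    "Year Built": [1910, 1914, 1911, None, 1962],
})

# info() prints column names, non-null counts, and dtypes.
df.info()

# describe() returns summary statistics (count, mean, std, min, quartiles, max)
# for the numeric columns only: here Height and Year Built.
summary = df.describe()
print(summary)
```

Note that the count for Year Built is 4 rather than 5, because the missing value is skipped.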

12 of 75

head() and tail()

Use head() to display the first few rows of the DataFrame.

df.head() # default is 5

Use tail() to display the last few rows.

df.tail(3) # last 3 rows

13 of 75

shape and size

shape gets the number of rows and columns in the DataFrame.

size gets the total number of elements in the DataFrame.

df.shape

(5, 4)

df.size

20

14 of 75

sample()

To sample a random selection of rows from a DataFrame, we use the .sample() method.

.sample(n=#)

Randomly sample n rows.

.sample(frac=#)

Randomly sample frac of rows.

.sample(n=#, replace=True)

Allows the same row to appear multiple times.

.sample(n=#, random_state=42)

Using random_state for reproducibility.
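A minimal sketch of these sampling options on the landmarks DataFrame. The seeds and sample sizes are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Landmark": ["Sather Gate", "Campanile", "Doe Library", "Memorial Glade", "Sproul Plaza"],
    "Type": ["Gate", "Tower", "Library", "Open Space", "Plaza"],
    "Height": [30, 307, 80, 0, 0],
    "Year Built": [1910, 1914, 1911, None, 1962],
})

# The same random_state gives the same rows on every run.
two_rows = df.sample(n=2, random_state=42)
same_two = df.sample(n=2, random_state=42)

# frac=0.4 of 5 rows is 2 rows.
forty_percent = df.sample(frac=0.4, random_state=0)

# Asking for more rows than the DataFrame has only works with replace=True.
bootstrap = df.sample(n=10, replace=True, random_state=0)
```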

15 of 75

value_counts() and unique()

Series.value_counts

Counts the number of occurrences of each unique value in a Series.

type_counts = df['Type'].value_counts()

Series.unique

Returns an array of every unique value in a Series.

unique_types = df['Type'].unique()
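In the landmarks table every Type occurs exactly once, so the difference between the two methods is easier to see on a column with repeats. The Series below is made-up data used only for this sketch.

```python
import pandas as pd

# Hypothetical column with repeated categories (not the lecture data).
types = pd.Series(["Gate", "Tower", "Library", "Tower", "Plaza", "Library", "Tower"])

# value_counts() returns a Series: unique values as the index, counts as the values,
# sorted from most to least frequent.
type_counts = types.value_counts()

# unique() returns an array of the distinct values in order of first appearance.
unique_types = types.unique()

print(type_counts)
print(unique_types)
```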

16 of 75

Selecting and Retrieving Data

17 of 75

Integer-based Extraction: iloc[]

We want to extract data according to its position.

[Figure: the DataFrame annotated with its integer row positions and its column positions 0, 1, 2, 3.]

Python convention: The first position has integer index 0.

Arguments to .iloc can be:

A list.
A slice (syntax is exclusive of the right-hand side of the slice).
A single value.

19 of 75

Integer-based Extraction: iloc[]

Using a list:

df.iloc[[1, 2]]  # rows at positions 1 and 2, all columns

df.iloc[[1, 2], [1, 2, 3]]  # rows at positions 1 and 2, columns at positions 1, 2, and 3

20 of 75

Integer-based Extraction: iloc[]

Using a slice:

df.iloc[2:4]  # rows at positions 2 and 3 (position 4 is excluded)

df.iloc[:, 1:2]  # all rows, only the column at position 1

21 of 75

Integer-based Extraction: iloc[]

Using a single value:

df.iloc[:, 0]  # all rows of the first column

df.iloc[2]  # the row at position 2

df.iloc[0, 1]  # the single value in row 0, column 1
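The kind of argument also determines what comes back. The sketch below, using the landmarks DataFrame, checks the type of each result: lists and slices give a DataFrame, a single row or column position gives a Series, and two single positions give a scalar.

```python
import pandas as pd

df = pd.DataFrame({
    "Landmark": ["Sather Gate", "Campanile", "Doe Library", "Memorial Glade", "Sproul Plaza"],
    "Type": ["Gate", "Tower", "Library", "Open Space", "Plaza"],
    "Height": [30, 307, 80, 0, 0],
    "Year Built": [1910, 1914, 1911, None, 1962],
})

rows_from_list = df.iloc[[1, 2]]        # DataFrame with 2 rows
block = df.iloc[[1, 2], [1, 2, 3]]      # DataFrame with 2 rows and 3 columns
rows_from_slice = df.iloc[2:4]          # DataFrame with rows 2 and 3
one_column_frame = df.iloc[:, 1:2]      # a slice keeps the DataFrame type
one_column_series = df.iloc[:, 0]       # a single position gives a Series
cell = df.iloc[0, 1]                    # two single positions give a scalar

print(type(one_column_frame).__name__, type(one_column_series).__name__, cell)
```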

22 of 75

Label-based Extraction: loc[]

We want to extract data according to its labels.

[Figure: the DataFrame annotated with its row labels and column labels.]

Arguments to .loc can be:

A list.
A slice (syntax is inclusive of the right-hand side of the slice).
A single value.

24 of 75

Label-based Extraction: loc[]

Using a list:

df.loc[[1, 2]]  # rows labeled 1 and 2, all columns

df.loc[[1, 2], ['Type', 'Height', 'Year Built']]  # rows labeled 1 and 2, three named columns

25 of 75

Label-based Extraction: loc[]

Using a slice:

df.loc[2:3]  # rows labeled 2 and 3 (the right-hand label is included)

df.loc[:, 'Landmark':'Height']  # all rows, columns from 'Landmark' through 'Height'

26 of 75

Label-based Extraction: loc[]

Using a single value:

df.loc[:, 'Landmark']  # all rows of the 'Landmark' column

df.loc[2]  # the row labeled 2

df.loc[0, 'Type']  # the single value in row 0, column 'Type'
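With the default integer index, labels and positions coincide, which can hide the difference between .loc and .iloc. The sketch below makes Landmark the index (a choice made only for this example) so that the inclusive label slice and the exclusive position slice can be compared directly.

```python
import pandas as pd

df = pd.DataFrame({
    "Landmark": ["Sather Gate", "Campanile", "Doe Library", "Memorial Glade", "Sproul Plaza"],
    "Type": ["Gate", "Tower", "Library", "Open Space", "Plaza"],
    "Height": [30, 307, 80, 0, 0],
    "Year Built": [1910, 1914, 1911, None, 1962],
})

# Use the landmark names as row labels.
by_name = df.set_index("Landmark")

# Label slices include both endpoints: 3 rows and 2 columns.
inclusive = by_name.loc["Campanile":"Memorial Glade", "Type":"Height"]

# Position slices exclude the right endpoint: rows at positions 1 and 2 only.
exclusive = by_name.iloc[1:3]

# A single row label and a single column label give one value.
campanile_height = by_name.loc["Campanile", "Height"]
```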

27 of 75

Context-Dependent Selection

[] only takes one argument, which may be:

A slice of row numbers.
A list of column labels.
A single column label.

That is, [] is context sensitive.

28 of 75

Context-Dependent Selection

A slice of row numbers:

df[1:3]  # rows at positions 1 and 2

29 of 75

Context-Dependent Selection

A list of column labels:

df[['Landmark', 'Year Built']]  # returns a DataFrame with two columns

30 of 75

Context-Dependent Selection

A single column label:

df['Year Built']  # returns a Series
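A compact sketch of the three cases side by side on the landmarks DataFrame, showing that the same [] operator selects rows in one case and columns in the other two:

```python
import pandas as pd

df = pd.DataFrame({
    "Landmark": ["Sather Gate", "Campanile", "Doe Library", "Memorial Glade", "Sproul Plaza"],
    "Type": ["Gate", "Tower", "Library", "Open Space", "Plaza"],
    "Height": [30, 307, 80, 0, 0],
    "Year Built": [1910, 1914, 1911, None, 1962],
})

row_slice = df[1:3]                        # a slice selects ROWS (positions 1 and 2)
two_columns = df[["Landmark", "Year Built"]]  # a list selects COLUMNS, as a DataFrame
one_column = df["Year Built"]              # a single label selects one COLUMN, as a Series
```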

31 of 75

Filtering Data

32 of 75

Boolean Array for Filtering a DataFrame

We learned to extract data according to its integer position (.iloc) or its label (.loc).

What if we want to extract rows that satisfy a given condition?

.loc and [ ] also accept Boolean arrays as input.
Rows corresponding to True are extracted; rows corresponding to False are not.

df['Height'] > 50

Produces a Boolean Series.

df[df['Height'] > 50]

or, equivalently,

df.loc[df['Height'] > 50]
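A minimal sketch of the filtering steps on the landmarks DataFrame. Storing the Boolean Series in a variable (called mask here, an illustrative name) makes it easy to inspect before using it.

```python
import pandas as pd

df = pd.DataFrame({
    "Landmark": ["Sather Gate", "Campanile", "Doe Library", "Memorial Glade", "Sproul Plaza"],
    "Type": ["Gate", "Tower", "Library", "Open Space", "Plaza"],
    "Height": [30, 307, 80, 0, 0],
    "Year Built": [1910, 1914, 1911, None, 1962],
})

# One True/False entry per row.
mask = df["Height"] > 50

# Keep only the rows where the mask is True.
tall = df[mask]

# .loc accepts the same mask and can also select columns in the same call.
tall_names = df.loc[mask, ["Landmark", "Height"]]
```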

33 of 75

Combining Boolean Series for Filtering

Boolean Series can be combined using various operators, allowing filtering of results by multiple criteria.

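The & operator keeps the rows where both operand_1 and operand_2 are True.
The | operator keeps the rows where operand_1 or operand_2 (or both) is True.
Wrap each condition in parentheses, because & and | bind more tightly than comparison operators such as > and ==.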

df[(df['Height'] > 50) & (df['Type'] == 'Library')]

34 of 75

Bitwise Operators

& and | are examples of bitwise operators. They allow us to apply multiple logical conditions.
If p and q are boolean arrays or Series:

Symbol	Usage	Meaning
~	~p	Negation of p
|	p | q	p OR q
&	p & q	p AND q
^	p ^ q	p XOR q (exclusive or)
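The four operators in the table can be tried on the landmarks DataFrame as follows. The names p and q mirror the table; the two conditions are the ones used on the previous slide.

```python
import pandas as pd

df = pd.DataFrame({
    "Landmark": ["Sather Gate", "Campanile", "Doe Library", "Memorial Glade", "Sproul Plaza"],
    "Type": ["Gate", "Tower", "Library", "Open Space", "Plaza"],
    "Height": [30, 307, 80, 0, 0],
    "Year Built": [1910, 1914, 1911, None, 1962],
})

p = df["Height"] > 50          # True for Campanile and Doe Library
q = df["Type"] == "Library"    # True for Doe Library only

both = df[p & q]       # taller than 50 AND a library
either = df[p | q]     # taller than 50 OR a library
not_tall = df[~p]      # NOT taller than 50
exactly_one = df[p ^ q]  # exactly one of the two conditions holds

# Written inline, each comparison needs its own parentheses.
inline = df[(df["Height"] > 50) & (df["Type"] == "Library")]
```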

35 of 75

DataFrame Modification

36 of 75

Adding and Modifying Columns

Adding a column is easy:

Use [ ] to reference the desired new column.
Assign this column to a Series or array of the appropriate length.

df['Experience'] = [2, 5, 1, 8, 4]

df['Height_Increase'] = df['Height'] * 0.1
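A short sketch of adding, modifying, and mis-sizing a column. Has_Height and Too_Short are hypothetical column names introduced only for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "Landmark": ["Sather Gate", "Campanile", "Doe Library", "Memorial Glade", "Sproul Plaza"],
    "Type": ["Gate", "Tower", "Library", "Open Space", "Plaza"],
    "Height": [30, 307, 80, 0, 0],
    "Year Built": [1910, 1914, 1911, None, 1962],
})

# Add a column computed from an existing one.
df["Height_Increase"] = df["Height"] * 0.1

# Add a Boolean column (hypothetical name).
df["Has_Height"] = df["Height"] > 0

# Modify an existing column by assigning to the same name again.
df["Height_Increase"] = df["Height_Increase"].round()

# A list of the wrong length is rejected: 3 values cannot fill 5 rows.
try:
    df["Too_Short"] = [1, 2, 3]
except ValueError as err:
    print("Assignment failed:", err)
```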

37 of 75

Dropping a Column

df.drop(columns=['Experience'])

print(df)

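What happened? The printed DataFrame still contains the Experience column.

Most DataFrame operations are not in-place: drop() returns a new DataFrame and leaves df unchanged.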

38 of 75

Dropping a Column

df.drop(columns=['Experience'], inplace=True)

print(df)

Alternatively, we can assign the result back to df:

df = df.drop(columns=['Experience'])
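The difference between discarding and keeping the result of drop() can be checked directly. A minimal sketch, using a shortened version of the landmarks DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "Landmark": ["Sather Gate", "Campanile", "Doe Library", "Memorial Glade", "Sproul Plaza"],
    "Height": [30, 307, 80, 0, 0],
})
df["Experience"] = [2, 5, 1, 8, 4]

# The result is computed and then thrown away; df itself is unchanged.
df.drop(columns=["Experience"])
still_there = "Experience" in df.columns

# Assigning the result back to df makes the change stick.
df = df.drop(columns=["Experience"])
gone = "Experience" not in df.columns

print(still_there, gone)
```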

39 of 75

Sorting DataFrame

Sorting organizes your data for better analysis.
We use sort_values() to sort by one or more columns in ascending or descending order.
Syntax:

Single column:

df.sort_values(by='Column', ascending=True)

Multiple columns:

df.sort_values(by=['Col1', 'Col2'], ascending=[True, False])

40 of 75

Sorting DataFrame by One Column

Single column:

df.sort_values(by='Column', ascending=True)

df = df.sort_values(by='Height')

The default value for the ascending argument is True.

41 of 75

Sorting DataFrame by Multiple Columns

Multiple columns:

df.sort_values(by=['Col1', 'Col2'], ascending=[True, True])

df = df.sort_values(by=['Height', 'Type'], ascending=[True, False])
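A sketch of both kinds of sort on the landmarks DataFrame. The two landmarks with height 0 show how the second column breaks ties.

```python
import pandas as pd

df = pd.DataFrame({
    "Landmark": ["Sather Gate", "Campanile", "Doe Library", "Memorial Glade", "Sproul Plaza"],
    "Type": ["Gate", "Tower", "Library", "Open Space", "Plaza"],
    "Height": [30, 307, 80, 0, 0],
})

# One column, tallest first.
by_height = df.sort_values(by="Height", ascending=False)

# Two columns: Height ascending, then Type descending among equal heights.
# "Plaza" sorts after "Open Space", so in descending order Sproul Plaza comes first.
by_height_then_type = df.sort_values(by=["Height", "Type"], ascending=[True, False])

print(by_height_then_type)
```

sort_values() returns a new DataFrame; the original row order in df is kept unless the result is assigned back.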

42 of 75

Handling Missing Values

Missing values are a common issue in real-world datasets.
We will explore techniques to:

Detect missing values.
Handle missing values by either removing or imputing them.

43 of 75

Handling Missing Values

Detect missing values:

df_missing.isnull()  # True wherever a value is missing

44 of 75

Handling Missing Values

Remove rows that contain missing values:

df_missing.dropna()

45 of 75

Handling Missing Values

Impute (fill in) missing values:

df_missing.fillna(0)  # for example, replace each missing value with 0; a fill value is required
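A sketch of detecting, removing, and imputing in one place. It assumes df_missing is the landmarks DataFrame, whose Year Built column has one missing entry; filling with the column median is just one possible imputation choice.

```python
import pandas as pd

df_missing = pd.DataFrame({
    "Landmark": ["Sather Gate", "Campanile", "Doe Library", "Memorial Glade", "Sproul Plaza"],
    "Height": [30, 307, 80, 0, 0],
    "Year Built": [1910, 1914, 1911, None, 1962],
})

# Detect: count the missing entries in each column.
missing_per_column = df_missing.isnull().sum()

# Remove: drop every row that has at least one missing value.
dropped = df_missing.dropna()

# Impute: fill Year Built with the median of the known years.
median_year = df_missing["Year Built"].median()
filled = df_missing.fillna({"Year Built": median_year})

print(missing_per_column)
```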

46 of 75

Aggregation in DataFrame

47 of 75

Aggregating Data in pandas

We can perform aggregations on our data, such as calculating means, sums, and other summary statistics.

Basic

sum()
mean()
median()
min()
max()
count()
nunique() – number of unique values
prod()

df['Height'].mean()

83.4

Statistical

std()
var()
sem() – standard error of the mean
skew()

df['Height'].std()

129.2006191935627

Logical and Index-Based

any() – True if any value is True
all() – True if all values are True
first() – first non-null value
last() – last non-null value
idxmin() – index of min value
idxmax() – index of max value

df['Height'].idxmax()
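A few of these aggregations applied to the Height column of the landmarks DataFrame, as a minimal sketch. Combining idxmax() with .loc to look up the tallest landmark is one common pattern, not the only one.

```python
import pandas as pd

df = pd.DataFrame({
    "Landmark": ["Sather Gate", "Campanile", "Doe Library", "Memorial Glade", "Sproul Plaza"],
    "Height": [30, 307, 80, 0, 0],
})
heights = df["Height"]

total = heights.sum()
middle = heights.median()
distinct = heights.nunique()

# idxmax() gives the index label of the largest value; .loc turns it into a name.
tallest = df.loc[heights.idxmax(), "Landmark"]

# any() and all() work on Boolean Series.
any_zero = (heights == 0).any()
all_non_negative = (heights >= 0).all()

# Several aggregations at once, returned as a Series.
summary = heights.agg(["min", "max", "mean"])
print(summary)
```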

48 of 75

Grouping Data in pandas

Our goal:

Group together rows that fall under the same category.

For example, group together all rows representing a Tower landmark.

Perform an operation that aggregates across all rows in the category.

For example, compute the average height of all Tower landmarks.

Grouping is a powerful tool to 1) perform large operations, all at once and 2) summarize trends in a dataset.

49 of 75

Grouping Data in pandas

A .groupby() operation involves some combination of splitting the object, applying a function, and combining the results.

50 of 75

Grouping Data in pandas

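augmented_df.groupby('Type').agg('mean')

[Figure: groupby('Type') splits augmented_df into a DataFrameGroupBy with one group per type (Gate, Tower, Plaza, Open Space, Library); .agg('mean') then reduces each group to a single row.]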

51 of 75

Aggregation Functions

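What goes inside of .agg( )? Any function that aggregates several values into one summary value.

Common examples:

Built-In Python Functions	NumPy Functions	Built-In pandas Functions
.agg(sum)	.agg(np.sum)	.agg("sum")
.agg(max)	.agg(np.max)	.agg("max")
.agg(min)	.agg(np.min)	.agg("min")
–	.agg(np.mean)	.agg("mean")
–	–	.agg("first")
–	–	.agg("last")

The string names are the safest choice; recent pandas versions may warn when given the built-in Python or NumPy callables.

Some commonly used aggregation functions can even be called directly, without the explicit use of .agg( ):

df.groupby("Type").mean(numeric_only=True)

Here numeric_only=True skips text columns such as Landmark, which recent pandas versions refuse to average.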
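In the landmarks table each Type appears only once, so grouping is easier to see on data with repeats. The toy DataFrame below is made up for this sketch (it is not augmented_df, whose contents are not shown in the slides).

```python
import pandas as pd

# Hypothetical data with repeated categories.
toy = pd.DataFrame({
    "Type": ["Tower", "Tower", "Library", "Library", "Gate"],
    "Campus": ["North", "South", "North", "North", "South"],
    "Height": [300, 100, 80, 60, 30],
})

# Split by Type, apply mean to Height, combine into one row per Type.
mean_by_type = toy.groupby("Type")["Height"].agg("mean")

# The direct shortcut gives the same result.
shortcut = toy.groupby("Type")["Height"].mean()

# Several aggregations at once produce one column per function.
several = toy.groupby("Type")["Height"].agg(["min", "max", "count"])

# Any function that maps a group to one value also works.
height_range = toy.groupby("Type")["Height"].agg(lambda s: s.max() - s.min())

print(several)
```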

52 of 75

Grouping by Multiple Columns

augmented_df.groupby(['Type', 'Campus']).agg('max')[['Height']]

53 of 75

Grouping by Multiple Columns: pivot_table

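Grouping by two columns leaves two index levels (a MultiIndex) in the result:

augmented_df.groupby(['Type', 'Campus']).agg('max')[['Height']]

A pivot table instead places one key along the rows, the other along the columns, and the aggregated values in the cells:

pivot_table = pd.pivot_table(
    augmented_df,
    index='Type',      # rows
    columns='Campus',  # columns
    values='Height',   # values
    aggfunc='max'
)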
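The two layouts can be compared on a small example. The toy DataFrame and its Campus values are made up for this sketch; augmented_df itself is not shown in the slides.

```python
import pandas as pd

# Hypothetical data with two grouping keys.
toy = pd.DataFrame({
    "Type": ["Tower", "Tower", "Library", "Library", "Gate"],
    "Campus": ["North", "South", "North", "North", "South"],
    "Height": [300, 100, 80, 60, 30],
})

# Long layout: one row per (Type, Campus) pair that occurs, with a two-level index.
grouped = toy.groupby(["Type", "Campus"])[["Height"]].agg("max")

# Wide layout: Type along the rows, Campus along the columns.
pivoted = pd.pivot_table(toy, index="Type", columns="Campus",
                         values="Height", aggfunc="max")

print(grouped)
print(pivoted)
```

Combinations that never occur in the data (for example a Gate on the North campus here) show up as NaN in the pivot table.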

54 of 75

Joining DataFrames

55 of 75

Joining Events and Landmarks

event_data

landmarks

Joining DataFrames allows you to combine data from different sources based on a common key or index.

56 of 75

Types of Join: Inner Join

Inner Join: Returns only the rows with matching keys in both DataFrames.

event_data

landmarks

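# join() matches the 'Year Built' column against the index of the other DataFrame
result_join_inner = landmarks.join(event_data.set_index('Year'),
                                   on='Year Built',
                                   how='inner')

# merge() matches the two key columns directly
result_merge_inner = landmarks.merge(event_data,
                                     how='inner',
                                     left_on='Year Built',
                                     right_on='Year')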

57 of 75

Types of Join: Outer Join

Outer Join: Returns all rows from both DataFrames, filling missing values with NaN where there is no match.

event_data

landmarks

result_merge_outer = landmarks.merge(event_data,
                                     how='outer',
                                     left_on='Year Built',
                                     right_on='Year')

58 of 75

Types of Join

Inner Join: Returns only the rows with matching keys in both DataFrames.
Outer Join: Returns all rows from both DataFrames, filling missing values with NaN where there is no match.
Left Join: Returns all rows from the left DataFrame and matching rows from the right DataFrame.
Right Join: Returns all rows from the right DataFrame and matching rows from the left DataFrame.
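The four join types can be compared by counting rows in each result. In the sketch below the landmarks table is a subset of the lecture data, while the events table (its years and the labels "event A" to "event C") is made up for illustration and does not reproduce the course CSV.

```python
import pandas as pd

landmarks = pd.DataFrame({
    "Landmark": ["Sather Gate", "Campanile", "Doe Library", "Sproul Plaza"],
    "Year Built": [1910, 1914, 1911, 1962],
})

# Hypothetical events: two years match a landmark, one (1868) does not.
events = pd.DataFrame({
    "Year": [1914, 1962, 1868],
    "Event": ["event A", "event B", "event C"],
})

results = {
    how: landmarks.merge(events, how=how, left_on="Year Built", right_on="Year")
    for how in ["inner", "outer", "left", "right"]
}
row_counts = {how: len(result) for how, result in results.items()}
print(row_counts)

# The same inner join written with join(): the right table is matched on its index.
joined = landmarks.join(events.set_index("Year"), on="Year Built", how="inner")
```

The inner join keeps only the two matching years, the left and right joins keep every row of their respective table, and the outer join keeps all rows from both.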

60 of 75

What does df['Type'] return?

61 of 75

What does df.tail(3) do?

62 of 75

df['Type'].unique() returns:

63 of 75

Any questions you have for the first two lectures or input for us?

64 of 75

Visualization

65 of 75

Why Do We Visualize Data?

Insight: Critical to gaining deeper insights into trends in the data.

Communication: Help convey trends in data to others.

In this class we will explore both but focus more on the first.

66 of 75

Plotting Libraries

There is a wide range of tools for data visualization in machine learning.

Weights and Biases – Commercial service used for tracking training runs and model artifacts.
Matplotlib – Python library commonly used for static plots.
Plotly – Cross-language interactive plotting library.

In this course we will focus on Plotly and Weights and Biases.

67 of 75

Matplotlib

68 of 75

Matplotlib

Matplotlib is a versatile Python library for creating static and publication-quality visualizations.

import matplotlib.pyplot as plt

Line Plot

Scatter Plot

Bar Plot

Histogram

Box Plot

Heatmap
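As one example of the plot types listed, here is a minimal bar plot of the landmark heights. The Agg backend and the output filename are choices made for this sketch so that it runs without a display; in a notebook, plt.show() would be used instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: render to a file, no window needed
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "Landmark": ["Sather Gate", "Campanile", "Doe Library", "Memorial Glade", "Sproul Plaza"],
    "Height": [30, 307, 80, 0, 0],
})

# Create a figure with one set of axes and draw one bar per landmark.
fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(df["Landmark"], df["Height"])

# Label the plot.
ax.set_xlabel("Landmark")
ax.set_ylabel("Height")
ax.set_title("Height of Campus Landmarks")
ax.tick_params(axis="x", rotation=45)

fig.tight_layout()
fig.savefig("landmark_heights.png")  # hypothetical output filename
```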

69 of 75

Plotly

70 of 75

Why Plotly?

Many would (correctly) argue you should learn Matplotlib.

Gold standard for Python visualization since ~2005.
Most paper plots are still made with Matplotlib.
If we had time we would teach both (we may use some in demos).

However, Matplotlib plots are static, making it difficult to slice the data interactively.

Weights and Biases and Plotly are designed for interaction.
Quicker to gain insights, which is a focus of this class.

71 of 75

Three Ways to Plot using Plotly

Easiest: Using pandas built-in plotting functions

Great place to start.

Easy + Expressive: Use Plotly Express to construct plots quickly

Like pandas plotting functions but with more options.

Advanced: Build plots from graph objects (like Matplotlib)

Need to learn some basic Plotly concepts.

72 of 75

Using pandas built-in Plotting Functions

Configure pandas to use Plotly (done at beginning of notebook).

pd.set_option('plotting.backend', 'plotly')

Call plot directly on your DataFrame:

# Make various kinds of plots: 'scatter', 'bar', 'hist'

mpg.plot(kind='scatter', x='weight', y='mpg', color='origin',
         title='MPG vs. Weight by Origin',
         width=800, height=600)

Notice that we define:

The kind of plot (e.g., 'scatter', 'bar')
How columns are attached to visual elements (e.g., x, y, color, size, shape …)

73 of 75

Using Plotly Express

Very similar to calling pandas .plot functions but with a wide range of plotting capabilities (see tutorials):

import plotly.express as px

px.scatter(mpg, x='weight', y='mpg', color='origin', size='cylinders',
           hover_data=mpg.columns,
           title='MPG vs. Weight by Origin',
           width=800, height=600)

Here we pass the DataFrame (mpg) into the desired plotting function along with how columns are mapped to visual elements.

74 of 75

How You Can Learn More

Skim the tutorials (optional but can be fun):

When you can’t figure out how to plot something (or just want to learn more), ask an AI agent.

Most LLMs are very good at Plotly and Matplotlib.
