Machine Learning Design
Discussion Mini Lecture 2
The Supervised Learning Design Process
CS 189/289A, Fall 2025 @ UC Berkeley
Sara Pohland
Machine Learning Lifecycle
(1) LEARNING PROBLEM (L): What do I want to predict? What data do I have?
(2) MODEL DESIGN (M): What features should I use? What model family?
(3) OPTIMIZATION (O): How do I learn the parameters of my model?
(4) PREDICT & EVALUATE (P): How do I make predictions? How do I assess performance?
Concepts Covered
Understanding & Preparing our Data
Desired Form of Data
What do we want our data to look like for training?
Dataset:
Input:
Output:
Design Matrix:
Target Vector:
May not be available
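One common way to write these objects is sketched below. The symbols are assumptions, since the slide's own notation is not shown here: n is the number of samples, d is the number of features, and K is the number of classes.

```latex
\begin{align*}
\text{Dataset:} \quad & \mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n} \\
\text{Input:} \quad & x_i \in \mathbb{R}^{d} \\
\text{Output:} \quad & y_i \in \mathbb{R} \ \text{(regression)}
  \quad \text{or} \quad y_i \in \{1, \dots, K\} \ \text{(classification)} \\
\text{Design Matrix:} \quad & X =
  \begin{bmatrix} x_1^\top \\ \vdots \\ x_n^\top \end{bmatrix}
  \in \mathbb{R}^{n \times d} \\
\text{Target Vector:} \quad & y =
  \begin{bmatrix} y_1 & \cdots & y_n \end{bmatrix}^\top
\end{align*}
```

Each row of X is one sample and each column is one feature. When the data is unlabeled, only X is available.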
Starting Form of Data
What does our data initially look like?
Common data types: images, text, tabular, audio, time series, genomic, video
Text example: "HI, It’s your boss. Im stuck in Nigeria with none money. Please wire to TRWIGB2LXXX SOON."
Tabular example:
# Bed | # Bath | … | Location
4 | 3 | … | Berkeley
1) Investigate the Data
Answer the following about your data:
If you have categorical labels:
If you have quantitative labels:
Practice investigating data: Section 2 of Discussion 2 notebook
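A minimal sketch of a first look at a tabular dataset with pandas. The table, its column names, and the specific checks (shape, missing values, class balance, label summary) are assumptions chosen for illustration; they are not taken from the slide or the notebook.

```python
import pandas as pd

# Hypothetical toy table; column names and values are made up.
df = pd.DataFrame({
    "num_bed": [4, 3, 2, 5, 3],
    "location": ["Berkeley", "Oakland", "Berkeley", "Albany", "Berkeley"],
    "price": [1.9, 1.1, 0.8, 2.4, 1.3],                    # quantitative label
    "sold_above_asking": ["no", "yes", "no", "no", "yes"],  # categorical label
})

# How many samples and columns are there, and how many values are missing?
n_samples, n_columns = df.shape
missing_per_column = df.isna().sum()

# Categorical labels: fraction of samples in each class (class balance).
class_fractions = df["sold_above_asking"].value_counts(normalize=True)

# Quantitative labels: count, mean, spread, and range of the label.
price_summary = df["price"].describe()

print(n_samples, n_columns)
print(missing_per_column)
print(class_fractions)
print(price_summary)
```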
2) Clean up the Data
For categorical features:
For numerical features:
Practice cleaning up tabular data: Sec. 2 of Discussion 2 notebook
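A rough sketch of two common cleanup steps for tabular data, using pandas. The toy table and the chosen strategies (normalizing category spelling, filling missing numbers with the median) are assumptions for illustration; the right choices depend on your data.

```python
import numpy as np
import pandas as pd

# Hypothetical raw table with inconsistent spelling and missing values.
raw = pd.DataFrame({
    "location": [" Berkeley", "berkeley", "Oakland ", None],
    "num_bath": [3.0, np.nan, 2.0, 1.0],
})

clean = raw.copy()

# Categorical feature: strip whitespace, lowercase, and give missing
# entries an explicit "unknown" category.
clean["location"] = (
    clean["location"].str.strip().str.lower().fillna("unknown")
)

# Numerical feature: fill missing entries with the column median.
clean["num_bath"] = clean["num_bath"].fillna(clean["num_bath"].median())

print(clean)
```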
For text:
For images:
For all data types:
3) Divide Data into Three Sets
Data -> Train-Test Split (80-20*) -> Train + Test
Train -> Train-Val. Split (75-25*) -> Train + Val.
Train: ~60% of data. Used to fit the model during training.
Val.: ~20% of data. Used to select hyperparameters and model choices during training.
Test: ~20% of data. Used to estimate performance post-training.
*These are commonly used splits, but you can adjust these percentages based on your data.
Practice splitting data into sets: Section 3 of Discussion 2 notebook
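The two-stage split can be sketched with scikit-learn's `train_test_split`: an 80-20 split into train and test, then a 75-25 split of the remaining training data into train and validation. The toy arrays and the `random_state` value are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples with 2 features each (values are arbitrary).
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# Step 1: hold out 20% of all data as the test set.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)

# Step 2: hold out 25% of the remaining 80% as the validation set,
# which leaves 60% of the original data for training.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0
)

print(len(X_train), len(X_val), len(X_test))
```

Since 0.75 x 0.80 = 0.60 and 0.25 x 0.80 = 0.20, this reproduces the roughly 60/20/20 proportions.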
Choosing/Learning Good Features
Generating Features
For text data:
For image data:
For tabular data:
For any data:
Practice engineering features: Section 2 of Discussion 2 notebook
Selecting a Model Family
Determining the ML Paradigm
Do I have observations associated with my data?
Yes, labels: Supervised Learning
    What type of labels?
    Categorical: Classification
    Quantitative: Regression
Yes, rewards: Reinforcement Learning
No: Unsupervised Learning
    What is my goal?
    Group data: Clustering
    Reduce features: Dimensionality Reduction
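The decision flow for picking a paradigm can be written as a small function. This is only a sketch: `choose_paradigm` and its argument names are hypothetical, invented here for illustration.

```python
def choose_paradigm(observations=None, label_type=None, goal=None):
    """Return (paradigm, task) for a problem description.

    observations: "labels", "rewards", or None (no observations)
    label_type:   "categorical" or "quantitative" (used with labels)
    goal:         "group data" or "reduce features" (used with None)
    """
    if observations == "labels":
        tasks = {"categorical": "classification", "quantitative": "regression"}
        return "supervised learning", tasks[label_type]
    if observations == "rewards":
        return "reinforcement learning", None
    if observations is None:
        tasks = {
            "group data": "clustering",
            "reduce features": "dimensionality reduction",
        }
        return "unsupervised learning", tasks[goal]
    raise ValueError(f"unknown observation type: {observations!r}")


print(choose_paradigm("labels", label_type="quantitative"))
print(choose_paradigm(None, goal="group data"))
```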
Examples of ML Problems
Classification: categorical labels
Clustering: grouping data
Reinforcement Learning: rewards
Regression: quantitative labels
Dimensionality Reduction: reducing features
Practice identifying the ML paradigm: Disc. 2 worksheet
Regression Model Families
Simple: Linear Regression (Lecture 5)
Complex: Neural Networks (Lecture 12)
Classification Model Families
Simple: Logistic Regression (Lecture 7)
Complex: Neural Networks (Lecture 12)
Interplay of Features and Model Families
Selecting Features & Model Families
Key Considerations:
Learning Model Parameters
Scikit-learn is Your Friend!
Using Scikit-learn to Fit a Model
Live, love, learn the fit method 🫶
Practice fitting a model: Section 4 of Discussion 2 notebook
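A minimal sketch of the `fit` pattern, using `LinearRegression` on synthetic data. The data-generating line y = 2x + 1 plus noise and the random seed are assumptions made up for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic training data: y is roughly 2x + 1 with a little noise.
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(50, 1))   # design matrix, shape (n, d)
y_train = 2.0 * X_train[:, 0] + 1.0 + rng.normal(0, 0.1, size=50)

# Create the model object, then learn its parameters from the data.
model = LinearRegression()
model.fit(X_train, y_train)

# Learned parameters are stored on the fitted model.
slope, intercept = model.coef_[0], model.intercept_
print(slope, intercept)
```

Most scikit-learn estimators follow this same shape: construct the model, then call `fit(X, y)` with a design matrix and target vector.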
Making Model Predictions
Using Scikit-learn to Make Predictions
With fit comes its forever +1, predict
Practice making predictions: Sec. 4 of Discussion 2 notebook
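A minimal sketch of `predict` on a fitted classifier, here `LogisticRegression` on a tiny one-feature dataset. The data values are assumptions made up for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training set: small feature values are class 0, large ones class 1.
X_train = np.array([[0.0], [1.0], [2.0], [3.0], [7.0], [8.0], [9.0], [10.0]])
y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X_train, y_train)

# New inputs must have the same number of features as the training data.
X_new = np.array([[-5.0], [15.0]])
y_pred = clf.predict(X_new)          # predicted class labels
proba = clf.predict_proba(X_new)     # per-class probabilities, rows sum to 1

print(y_pred)
print(proba)
```

`predict` returns one label per row of `X_new`. Classifiers such as `LogisticRegression` also expose `predict_proba` when class probabilities are needed.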
Contributors: Sara Pohland
Additional Resources