1 of 31

Machine Learning Design

Discussion Mini Lecture 2

The Supervised Learning Design Process

CS 189/289A, Fall 2025 @ UC Berkeley

Sara Pohland

2 of 31

Machine Learning Lifecycle

The lifecycle has four stages:

  (1) LEARNING PROBLEM: What do I want to predict? What data do I have?
  (2) MODEL DESIGN: What features should I use? What model family?
  (3) OPTIMIZATION: How do I learn the parameters of my model?
  (4) PREDICT & EVALUATE: How do I make predictions? How do I assess performance?

3 of 31

  1. Data Pre-processing
  2. Model Design
    1. Feature Engineering
    2. Model Families
    3. Design Considerations
  3. Fitting a Model
  4. Making a Prediction

Concepts Covered

4 of 31

Understanding & Preparing our Data

  1. Data Pre-processing
  2. Model Design
    1. Feature Engineering
    2. Model Families
    3. Design Considerations
  3. Fitting a Model
  4. Making a Prediction

5 of 31

Desired Form of Data

What do we want our data to look like for training?

Dataset: D = {(x_1, y_1), ..., (x_n, y_n)}, a set of n sample pairs

Input: x_i ∈ R^d, a vector of d features

Output: y_i, the label/target for sample i (may not be available)

Design Matrix: X ∈ R^{n×d}, whose i-th row is x_i^T

Target Vector: y ∈ R^n, whose i-th entry is y_i (may not be available)
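As a minimal sketch (with made-up numbers), the design matrix and target vector for a dataset of n = 3 samples with d = 2 features might look like:

```python
import numpy as np

# Design matrix X: one row per sample, one column per feature (n = 3, d = 2).
X = np.array([[4.0, 3.0],   # sample 1: e.g., # bedrooms, # bathrooms
              [2.0, 1.0],   # sample 2
              [5.0, 4.0]])  # sample 3

# Target vector y: one label/target per sample (absent in unsupervised settings).
y = np.array([1.2e6, 0.8e6, 1.5e6])

print(X.shape)  # (3, 2) -> n samples by d features
print(y.shape)  # (3,)   -> n targets
```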

 

6 of 31

Starting Form of Data

What does our data initially look like? Raw data comes in many forms: images, text, tabular data, audio, time series, genomic data, video, ...

Example (text): "HI, It’s your boss. Im stuck in Nigeria with none money. Please wire to TRWIGB2LXXX SOON."

Example (tabular):

  # Bed | # Bath | Location
  ------|--------|---------
  4     | 3      | Berkeley
7 of 31

1) Investigate the Data

Answer the following about your data:

  • How many data samples do I have?
  • What type of data am I working with?
  • What does my data look like?
  • Is my data labeled?

If you have categorical labels:

  • How many unique labels are there?
  • How many samples do you have from each class?

If you have quantitative labels:

  • What is the distribution of the target labels?

Practice investigating data: Section 2 of Discussion 2 notebook
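A quick way to answer these questions with pandas (a toy DataFrame stands in for your real dataset here):

```python
import pandas as pd

# Toy stand-in for a real dataset.
df = pd.DataFrame({
    "sqft": [1200, 1500, 900, 2000],
    "city": ["Berkeley", "Oakland", "Berkeley", "Albany"],
    "label": ["spam", "ham", "spam", "ham"],
})

n_samples = len(df)                        # how many data samples?
dtypes = df.dtypes                         # what type of data am I working with?
head = df.head()                           # what does my data look like?
class_counts = df["label"].value_counts()  # samples per class (categorical labels)

print(n_samples)  # 4
print(class_counts)
```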

8 of 31

2) Clean up the Data

For categorical features:

  • Remove data with missing values or use mode imputation
  • Use one-hot-encoding (see lecture 3)
    • This avoids the artificial ordering that comes from simply converting a categorical feature to a numeric value!

For numerical features:

  • Remove data with missing values or use median imputation
  • Normalize each feature (see lecture 3)
    • This avoids placing a disproportionate weight on features with larger magnitudes and improves numerical stability!
  • Consider removing highly correlated variables

Practice cleaning up tabular data: Sec. 2 of Discussion 2 notebook
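These cleanup steps can be sketched with pandas and scikit-learn (the column names and values are made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "bedrooms": [4, 2, None, 5],                        # numerical, one missing value
    "city": ["Berkeley", "Oakland", "Berkeley", None],  # categorical, one missing value
})

# Numerical feature: median imputation, then normalization.
df["bedrooms"] = df["bedrooms"].fillna(df["bedrooms"].median())
df[["bedrooms"]] = StandardScaler().fit_transform(df[["bedrooms"]])

# Categorical feature: mode imputation, then one-hot encoding
# (avoids imposing an artificial ordering on categories).
df["city"] = df["city"].fillna(df["city"].mode()[0])
df = pd.get_dummies(df, columns=["city"])

print(df.columns.tolist())  # bedrooms plus one indicator column per city
```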

9 of 31

2) Clean up the Data

For text:

  • Remove punctuation, special characters, stop words, etc.

For images:

  • Reshape images to a fixed size
  • Normalize pixel values

For all data types:

  • Remove “bad” data (low-quality images, anomalies, etc.)
  • Consider removing features that we do not expect/want to help us predict our label/target
    • Features with poor predictive power can introduce unnecessary variance!
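A minimal text-cleanup sketch (the stop-word list is abbreviated for illustration):

```python
import re

STOP_WORDS = {"the", "is", "of", "a", "to", "and"}  # abbreviated for illustration

def clean_text(text):
    """Lowercase, strip punctuation/special characters, and drop stop words."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return [w for w in text.split() if w not in STOP_WORDS]

tokens = clean_text("HI, It's your boss!! Please wire money SOON.")
print(tokens)  # ['hi', 'it', 's', 'your', 'boss', 'please', 'wire', 'money', 'soon']
```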

10 of 31

3) Divide Data into Three Sets

First split the data into train and test sets (commonly 80-20*), then split the train set again into train and validation sets (commonly 75-25*):

  Data → Train-Test Split → Train + Test
  Train → Train-Val. Split → Train + Val.

  Train (~60% of data): used to fit the model during training
  Val.  (~20% of data): used to select model params. during training
  Test  (~20% of data): used to estimate performance post-training

*These are commonly used splits, but you can adjust these percentages based on your data.

Practice splitting data into sets: Section 3 of Discussion 2 notebook
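Calling scikit-learn's train_test_split twice produces the ~60/20/20 split described above (random_state is fixed for reproducibility):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(100, 1)  # 100 toy samples, 1 feature
y = np.arange(100)

# First split: 80% train, 20% test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

# Second split: 75% of the remaining train data stays train, 25% becomes validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```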

11 of 31

Choosing/Learning Good Features

  1. Data Pre-processing
  2. Model Design
    1. Feature Engineering
    2. Model Families
    3. Design Considerations
  3. Fitting a Model
  4. Making a Prediction

12 of 31

Generating Features

For text data:

  • Use a Bag-of-Words model
    • i.e., Count the presence of words in a vocabulary (see lecture 3)
    • Usually stop words (e.g., the, is, of…) that contain minimal info are dropped
  • Count the use of punctuation marks
  • Consider text length as a feature
  • Look for specific patterns in text
    • e.g., Presence of URLs, typos, all caps words, etc.

13 of 31

Generating Features

For image data:

  • Generate hand-crafted features
    • e.g., Use edge detectors to count presence of edges, texture descriptors to identify patterns, or CV tools to measure image properties (e.g., brightness)
    • Not common today but could be used to design more interpretable models.
  • Find deep learning representations
    • e.g., Use (some component of) a pretrained model (ResNet, CLIP, etc.) to generate a feature vector that can be used for your ML task.
    • e.g., Use an object detection model to detect the presence of objects that might be relevant to your larger ML task.
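As a sketch of a hand-crafted image feature, we can count strong-edge pixels with a simple Sobel filter in NumPy (toy two-tone image; the threshold is made up):

```python
import numpy as np

def edge_strength(img):
    """Approximate gradient magnitude with 3x3 Sobel filters (valid region only)."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    h, w = img.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            patch = img[i:i + 3, j:j + 3]
            gx[i, j] = (patch * kx).sum()
            gy[i, j] = (patch * ky).sum()
    return np.hypot(gx, gy)

# Toy image: left half dark, right half bright -> one vertical edge.
img = np.zeros((8, 8))
img[:, 4:] = 1.0

# Hand-crafted scalar feature: number of pixels with strong gradient magnitude.
n_edges = int((edge_strength(img) > 1.0).sum())  # made-up threshold
print(n_edges)
```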

14 of 31

Generating Features

For tabular data:

  • Derive features from existing ones
    • e.g., Suppose we want to predict worker burnout for staff members who started working in different weeks. Among other things, we’re given the total number of hours each has worked and the total number of weeks each has been employed. We might then derive the average number of hours worked per week.

For any data:

  • Use domain knowledge to identify what information might be useful in predicting the thing we want.

Practice engineering features: Section 2 of Discussion 2 notebook
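The derived-feature example above might look like this (column names and values are made up):

```python
import pandas as pd

staff = pd.DataFrame({
    "total_hours": [400, 900, 150],
    "weeks_worked": [10, 20, 5],
})

# Derive a rate feature from two existing cumulative features.
staff["hours_per_week"] = staff["total_hours"] / staff["weeks_worked"]
print(staff["hours_per_week"].tolist())  # [40.0, 45.0, 30.0]
```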

15 of 31

Selecting a Model Family

  1. Data Pre-processing
  2. Model Design
    1. Feature Engineering
    2. Model Families
    3. Design Considerations
  3. Fitting a Model
  4. Making a Prediction

16 of 31

Determining the ML Paradigm

Do I have observations associated with my data?

  • Yes, labels → Supervised Learning
    • What type of labels?
      • Categorical → Classification
      • Quantitative → Regression
  • Yes, rewards → Reinforcement Learning
  • No → Unsupervised Learning
    • What is my goal?
      • Group data → Clustering
      • Reduce features → Dimensionality Reduction

17 of 31

Examples of ML Problems

  1. A botanist gives me images of plants and tells me what they are. When I see a plant in the wild, I want to label it myself.
     → Classification (categorical labels)
  2. I want my robot to fold my laundry but don’t know how to encode this. I reward it for every article of clothing it successfully folds.
     → Reinforcement Learning (rewards)
  3. Before testing medical interventions, I want to group together patients that have similar medical backgrounds.
     → Clustering (grouping data)

18 of 31

Examples of ML Problems

  4. You give me a list of ratings for all of the movies you’ve watched. I want to predict how you would rate a movie you haven’t seen.
     → Regression (quantitative labels)
  5. I have a table of data with 10,000 columns. I want to remove redundant features before training my model.
     → Dimensionality Reduction (reducing features)

Practice identifying the ML paradigm: Disc. 2 worksheet

19 of 31

Regression Model Families

Simple

Complex

Linear Regression (Lecture 5)

Neural Networks (Lecture 12)

20 of 31

Classification Model Families

Simple

Complex

Logistic Regression (Lecture 7)

Neural Networks (Lecture 12)

21 of 31

Interplay of Features and Model Families

  1. Data Pre-processing
  2. Model Design
    1. Feature Engineering
    2. Model Families
    3. Design Considerations
  3. Fitting a Model
  4. Making a Prediction

22 of 31

Selecting Features & Model Families

Key Considerations:

  • Machine learning is full of trade-offs.
    • More features provide more information to our model, but they introduce additional variance, so we’re more likely to capture spurious data patterns.
    • More complex models can capture more complex input-output relationships, but they’re more likely to overfit to our training set.
    • Larger datasets can help us learn more generalizable models (assuming our dataset is representative of our population), but they require more memory and computational resources to train with.
    • Engineering new features and increasing model complexity can improve accuracy, but this may also reduce interpretability.

23 of 31

Selecting Features & Model Families

Key Considerations:

  • Feature engineering and model selection are complementary.
    • If we have a simple model (e.g., a linear classifier), we probably want to choose/learn some new, useful features.
    • If we have a complex model (e.g., a neural network), we may want to work with our original, unmodified features.
  • Domain knowledge is critical for training/deploying ML models.
  • When in doubt, run validation to evaluate your options.
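"Run validation" can be as simple as fitting each candidate on the train set and comparing scores on the held-out validation set. A sketch on toy synthetic data (the candidate models are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Toy, (nearly) linearly separable data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(),
    "3-nearest neighbors": KNeighborsClassifier(n_neighbors=3),
}

# Fit each candidate on the train set; score on the held-out validation set.
val_scores = {name: model.fit(X_train, y_train).score(X_val, y_val)
              for name, model in candidates.items()}
best = max(val_scores, key=val_scores.get)
print(val_scores)
```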


24 of 31

Learning Model Parameters

  1. Data Pre-processing
  2. Model Design
    1. Feature Engineering
    2. Model Families
    3. Design Considerations
  3. Fitting a Model
  4. Making a Prediction

25 of 31

Scikit-learn is Your Friend!

26 of 31

Using Scikit-learn to Fit a Model

  • Scikit-learn provides dozens of built-in machine learning algorithms and models, called estimators.
  • Each estimator can be fitted to some data using its fit method.

Live, love, learn the fit method 🫶

27 of 31

Using Scikit-learn to Fit a Model

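A minimal fitting sketch (the estimator choice and training data are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data: 1 feature, binary label.
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0, 0, 1, 1])

# 1) Instantiate an estimator; 2) call its fit method on the training data.
clf = LogisticRegression()
clf.fit(X_train, y_train)

print(clf.coef_.shape)  # (1, 1): one learned weight per feature
```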

Practice fitting a model: Section 4 of Discussion 2 notebook

28 of 31

Making Model Predictions

  1. Data Pre-processing
  2. Model Design
    1. Feature Engineering
    2. Model Families
    3. Design Considerations
  3. Fitting a Model
  4. Making a Prediction

29 of 31

Using Scikit-learn to Make Predictions

  • After fitting a Scikit-learn estimator, you can use it to predict target values for new data.

With fit comes its forever +1, predict

Practice making predictions: Sec. 4 of Discussion 2 notebook
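And the matching predict sketch (training data is illustrative; the fitted estimator predicts labels for unseen inputs):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data: 1 feature, binary label.
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0, 0, 1, 1])

clf = LogisticRegression().fit(X_train, y_train)

# Predict target values for new, unseen inputs.
X_new = np.array([[0.5], [2.5]])
y_pred = clf.predict(X_new)
print(y_pred)  # predicted class label for each new input
```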

30 of 31

Machine Learning Design

Discussion Mini Lecture 2

Contributors: Sara Pohland

31 of 31

Additional Resources

  1. Fitting a model and making predictions
  2. We’ll go deeper into other things later in the semester!