Introduction to Machine Learning
Lecture 1
Welcome to CS189/289
EECS 189/289, Spring 2026 @ UC Berkeley
Jennifer Listgarten and Alex Dimakis
All emails should go to: cs189-instructors@berkeley.edu
Join at slido.com #5560345
Roadmap
Introductions
CS 189/289 Goals and Plans
Why Are We Excited to Teach CS189?
By the End of CS189, You Should:
These are ambitious goals!
Excel in Advanced ML Courses
Require understanding of basic concepts:
…
Advanced ML course:
What Is Machine Learning?
What Is Machine Learning (ML)?
Basic Recipe:
Software systems that improve (learn) through data.
Classic Example: What is Spam?
Classic Example: Face detection?
What Is Artificial Intelligence?
Artificial intelligence (AI) refers to the capability of computational systems to perform tasks typically associated with human intelligence, such as learning, reasoning, problem-solving, perception, and decision-making.
-- Wikipedia (2025)
The primary technology behind modern AI is Machine Learning (ML).
Artificial Intelligence Is the Goal; Machine Learning Is the Method
Terms are often used interchangeably (even by experts).
If you are selling
An entrepreneur’s note on AI marketing
You have (probably) already done Machine Learning
Linear Regression is Machine Learning
Basic Recipe:
Training: Use data to teach (fit) a model.
Inference: Use the model to make predictions (decisions).
Data → Model → Prediction
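To make the recipe concrete, here is a minimal sketch of the train → infer loop using linear regression in scikit-learn; the data (hours studied vs. exam score) is made up purely for illustration and is not from the lecture.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: hours studied (feature) and exam score (target).
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([52.0, 61.0, 70.0, 78.0])

model = LinearRegression()
model.fit(X_train, y_train)        # training: fit the model to data

print(model.predict([[5.0]]))      # inference: predict for an unseen input
```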
Overview of the landscape
Terms used:
Data science
Machine Learning
Artificial Intelligence
Big Data
Data Mining
Statistics
@jeremyjarvis
“A data scientist is a statistician who lives in San Francisco”
@BigDataBorat
Data science is statistics on a Mac.
@josh_wills
Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician
(anonymous)
The difference between statistics and data science is about 30k per year.
(anonymous)
It is statistics if it's done in R.
It is machine learning if it's done in Python.
It is AI if it's done in PowerPoint.
Overview of the landscape
Statistics: The very first line of the American Statistical Association’s definition of statistics is “Statistics is the science of learning from data...” Given that the words “data” and “science” appear in the definition, one might assume that data science is just a rebranding of statistics.
Statisticians are not happy that they are not getting the research funding and salary bumps that come with it.
Data science: Broad and modern term. Includes much more software engineering knowledge. Usually done in Python (as opposed to R/Stata/SAS/SPSS/Excel, etc.). Includes analytics on bigger datasets (e.g., terabytes or petabytes), using tools like Apache Spark and Hadoop MapReduce, which enable distributed processing. Includes data collection and data cleaning pipelines (data engineering / data wrangling), plus connections to database backends and web-serving front ends. Includes, to some extent, machine learning and AI as sub-areas.
Machine learning: The more mathematically complex part of data science, focused on modeling (as opposed to software). Includes supervised learning, predictive modeling, unsupervised learning (like clustering), and text and image understanding.
Artificial Intelligence: Broad and classic term that includes machine learning as a sub-discipline. Allowing computers to do things that humans do when they say they are thinking. Includes perception (image understanding, speech understanding), language translation, playing games, and statistical machine learning techniques, but also logic-based symbolic AI, reasoning, planning, and robotics.
Data Mining: Applied version of Machine learning. Includes more large-scale software and performance issues. Also intersects the database community.
Big Data: Focused on scaling data analytics to very large data sets. Part of data science that will hopefully follow "the information superhighway" and "internet surfing" into obsolete historical nomenclature.
In terms of research communities:
Statistics
Research published in stats journals. Top venues: Annals of Statistics, JASA, Journal of the Royal Statistical Society.
Data science:
Not a properly defined research community.
Machine learning:
Research published in top ML conferences: NeurIPS, ICML, also more recently ICLR. Also includes KDD (more applied, data mining).
https://iclr.cc/virtual_2020/index.html
Artificial Intelligence:
Includes ML conferences but also AAAI and IJCAI as top venues.
Data Mining: Applied version of Machine learning. Includes more large-scale software and performance issues. Also intersects the database community.
Research published in top Data Mining conferences: KDD, SDM.
Big Data:
Not a properly defined research community.
Engineers who can set up Hadoop/Spark clusters. Can work on data directly on disk and process it at massive scale.
A taxonomy for machine learning
Supervised learning: Binary classification
Beverage | Acidity (A) | Sweetness (S) | y = ‘Great taste’? |
Bev1 | 0.8 | 0.8 | 1 |
Bev2 | 0.3 | 0.25 | 0 |
Bev3 | 0.2 | 0.8 | 0 |
Bev4 | 0.3 | 0.7 | 0 |
Bev5 | 0.9 | 0.7 | 1 |
Supervised learning: Binary classification. Data-driven Taste Test example
The target variable y is now binary (good taste or not): binary classification.
y could instead have multiple levels (Poor, Mediocre, Good, Great): multi-class classification.
Or y could be a continuous number to predict (a taste score from 1 to 100): regression.
Beverage | Acidity (A) | Sweetness (S) | y = ‘Great taste’? |
Bev1 | 0.8 | 0.8 | 1 |
Bev2 | 0.3 | 0.25 | 0 |
Bev3 | 0.2 | 0.8 | 0 |
Bev4 | 0.3 | 0.7 | 0 |
Bev5 | 0.9 | 0.7 | 1 |
Supervised learning: Binary classification
Beverage | Acidity (A) | Sweetness (S) | y = ‘Great taste’? |
Bev1 | 0.8 | 0.8 | 1 |
Bev2 | 0.3 | 0.25 | 0 |
Bev3 | 0.2 | 0.8 | 0 |
Bev4 | 0.3 | 0.7 | 0 |
Bev5 | 0.9 | 0.7 | 1 |
How do we make a good prediction rule?
A first idea: Lookup Table
That won't work (see the sketch below). We need some kind of rule, ideally a simple one. We now discuss two common frameworks for this: decision trees and linear classifiers.
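As a quick illustration (a toy sketch in Python, using the feature values from the table above), a lookup table can only answer queries it has memorized:

```python
# Memorize the five training beverages as (Acidity, Sweetness) -> label.
lookup = {
    (0.8, 0.80): 1,  # Bev1
    (0.3, 0.25): 0,  # Bev2
    (0.2, 0.80): 0,  # Bev3
    (0.3, 0.70): 0,  # Bev4
    (0.9, 0.70): 1,  # Bev5
}

print(lookup[(0.8, 0.80)])       # memorized beverage: returns 1
print(lookup.get((0.7, 0.70)))   # new beverage: returns None (no rule to generalize)
```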
Binary classification with a short tree: a decision stump
Beverage | Acidity (A) | Sweetness (S) | y = ‘Great taste’? | Model 1 predicts |
Bev1 | 0.8 | 0.8 | 1 | |
Bev2 | 0.3 | 0.25 | 0 | |
Bev3 | 0.2 | 0.8 | 0 | |
Bev4 | 0.3 | 0.7 | 0 | |
Bev5 | 0.9 | 0.7 | 1 | |
If S >= 0.75: predict f(x) = 1; otherwise (o/w): predict f(x) = 0.
Binary classification with a short tree: a decision stump
Beverage | Acidity (A) | Sweetness (S) | y = ‘Great taste’? | Model 1 predicts |
Bev1 | 0.8 | 0.8 | 1 | 1 |
Bev2 | 0.3 | 0.25 | 0 | 0 |
Bev3 | 0.2 | 0.8 | 0 | 1 |
Bev4 | 0.3 | 0.7 | 0 | 0 |
Bev5 | 0.9 | 0.7 | 1 | 0 |
If S >= 0.75: predict f(x) = 1; otherwise (o/w): predict f(x) = 0.
Accuracy of this model on the training set is:
?/ 5
Binary classification with a short tree: a decision stump
Beverage | Acidity (A) | Sweetness (S) | y = ‘Great taste’? | Model 1 predicts |
Bev1 | 0.8 | 0.8 | 1 | 1 |
Bev2 | 0.3 | 0.25 | 0 | 0 |
Bev3 | 0.2 | 0.8 | 0 | 1 |
Bev4 | 0.3 | 0.7 | 0 | 0 |
Bev5 | 0.9 | 0.7 | 1 | 0 |
If S >= 0.75: predict f(x) = 1; otherwise (o/w): predict f(x) = 0.
Accuracy of this model on the training set is:
3/ 5
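A minimal sketch (plain Python, using the table above) that reproduces this 3/5 training accuracy for the stump:

```python
data = [
    # (Acidity, Sweetness, y)
    (0.8, 0.80, 1),  # Bev1
    (0.3, 0.25, 0),  # Bev2
    (0.2, 0.80, 0),  # Bev3
    (0.3, 0.70, 0),  # Bev4
    (0.9, 0.70, 1),  # Bev5
]

def stump(A, S):
    # Model 1: predict 1 when Sweetness >= 0.75, otherwise 0.
    return 1 if S >= 0.75 else 0

correct = sum(stump(A, S) == y for A, S, y in data)
print(f"{correct}/{len(data)}")   # 3/5
```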
Partitioning the feature space
Beverage | Acidity (A) | Sweetness (S) | y = ‘Great taste’? |
Bev1 | 0.8 | 0.8 | 1 |
Bev2 | 0.3 | 0.25 | 0 |
Bev3 | 0.2 | 0.8 | 0 |
Bev4 | 0.3 | 0.7 | 0 |
Bev5 | 0.9 | 0.7 | 1 |
If S >= 0.75: predict f(x) = 1; otherwise (o/w): predict f(x) = 0.
Let's position the data in this feature space (axes: Acidity and Sweetness).
Partitioning the feature space
Beverage | Acidity (A) | Sweetness (S) | y = ‘Great taste’? |
Bev1 | 0.8 | 0.8 | 1 |
Bev2 | 0.3 | 0.25 | 0 |
Bev3 | 0.2 | 0.8 | 0 |
Bev4 | 0.3 | 0.7 | 0 |
Bev5 | 0.9 | 0.7 | 1 |
If S >= 0.75: predict f(x) = 1; otherwise (o/w): predict f(x) = 0.
Each binary classifier has a decision region: how it partitions the feature space.
(Plot: the Acidity vs. Sweetness feature space split by the boundary S = 0.75.)
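If you want to see the partition yourself, here is a small matplotlib sketch (assuming Acidity on the x-axis and Sweetness on the y-axis, which is just a plotting choice):

```python
import matplotlib.pyplot as plt

A = [0.8, 0.3, 0.2, 0.3, 0.9]          # Acidity
S = [0.80, 0.25, 0.80, 0.70, 0.70]     # Sweetness
y = [1, 0, 0, 0, 1]                    # 'Great taste' labels

plt.scatter(A, S, c=["green" if label == 1 else "red" for label in y])
plt.axhline(0.75, linestyle="--")      # the stump's decision boundary S = 0.75
plt.xlabel("Acidity")
plt.ylabel("Sweetness")
plt.title("Decision region of the stump: predict 1 above S = 0.75")
plt.show()
```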
Binary classification with a depth-2 decision tree
Beverage | Acidity (A) | Sweetness (S) | y = ‘Great taste’? | Model 2 predicts |
Bev1 | 0.8 | 0.8 | 1 | 1 |
Bev2 | 0.3 | 0.25 | 0 | 0 |
Bev3 | 0.2 | 0.8 | 0 | 0 |
Bev4 | 0.3 | 0.7 | 0 | 0 |
Bev5 | 0.9 | 0.7 | 1 | 0 |
A>= 0.5
S > 0.75
1
Predict f(x)=0
Bev2=0
Bev3=0
Bev4=0
o/w
Accuracy of this model on the training set is:
1/5 = 20% -> 4/5
The loss function we use is the standard 0-1 loss; equivalently, we track accuracy.
?
2
o/w
3
Predict f(x)=1
Bev5=1
Predict f(x)=1
Bev1=1
Model 2:
This model splits first on A with threshold 0.5
and then on S with threshold 0.75.
f(x) = ŷ, which ideally equals y.
f(A = 0.8, S = 0.8) = 1
Binary classification with a depth-2 decision tree
Beverage | Acidity (A) | Sweetness (S) | y = ‘Great taste’? | Model 2 predicts |
Bev1 | 0.8 | 0.8 | 1 | 1 |
Bev2 | 0.3 | 0.25 | 0 | 1 |
Bev3 | 0.2 | 0.8 | 0 | 1 |
Bev4 | 0.3 | 0.7 | 0 | 1 |
Bev5 | 0.9 | 0.7 | 1 | 0 |
A>= 0.5
S>0.75
1
Predict f(x)=1
o/w
Accuracy of this model on the training set is:
?
2
o/w
3
Predict f(x)=0
Predict f(x)=1
Model 2:
This model splits first on A with threshold 0.5
and then on S with threshold 0.75.
Could you get better training accuracy by labeling the leaves differently? YES.
What is a general algorithm for finding the leaf labels that maximize training accuracy for a given tree?
Make one pass over the dataset, place every sample in its leaf, and label each leaf with the majority of the training labels that land there.
What is the highest training accuracy you can get?
You can get 100% on this training dataset.
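Here is a minimal sketch of that leaf-labeling procedure in Python, for the fixed splits A >= 0.5 and then S > 0.75; the leaf names are just illustrative identifiers.

```python
from collections import Counter, defaultdict

data = [
    (0.8, 0.80, 1),  # Bev1
    (0.3, 0.25, 0),  # Bev2
    (0.2, 0.80, 0),  # Bev3
    (0.3, 0.70, 0),  # Bev4
    (0.9, 0.70, 1),  # Bev5
]

def leaf_of(A, S):
    # Route a sample through the fixed depth-2 tree: split on A, then on S.
    if A >= 0.5:
        return "leaf 1" if S > 0.75 else "leaf 2"
    return "leaf 3"

# One pass over the training set: collect the labels that land in each leaf ...
labels_in_leaf = defaultdict(list)
for A, S, y in data:
    labels_in_leaf[leaf_of(A, S)].append(y)

# ... then label every leaf with the majority of its training labels.
leaf_label = {leaf: Counter(ys).most_common(1)[0][0]
              for leaf, ys in labels_in_leaf.items()}

accuracy = sum(leaf_label[leaf_of(A, S)] == y for A, S, y in data) / len(data)
print(leaf_label, accuracy)   # 100% on this training set
```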
Binary classification with a depth-2 decision tree
Beverage | Acidity (A) | Sweetness (S) | y = ‘Great taste’? | Model 2 predicts |
Bev1 | 0.8 | 0.8 | 1 | 1 |
Bev2 | 0.3 | 0.25 | 0 | 1 |
Bev3 | 0.2 | 0.8 | 0 | 1 |
Bev4 | 0.3 | 0.7 | 0 | 1 |
Bev5 | 0.9 | 0.7 | 1 | 0 |
A>= 0.5
S>0.75
leaf 1
o/w
Accuracy of this model on the training set is:
o/w
leaf 2
leaf 3
Model 2:
(Plot: the Acidity vs. Sweetness feature space partitioned into leaf 1, leaf 2, and leaf 3.)
Binary classification with a linear classifier
Beverage | Acidity (A) | Sweetness (S) | y = ‘Great taste’? | Model 3 predicts |
Bev1 | 0.8 | 0.8 | 1 | ? |
Bev2 | 0.3 | 0.25 | 0 | ? |
Bev3 | 0.2 | 0.8 | 0 | ? |
Bev4 | 0.3 | 0.7 | 0 | ? |
Bev5 | 0.9 | 0.7 | 1 | ? |
Compute the predictions of this model.
Draw the decision boundary
Model 3:
f(A,S) = 1 if A+S -1 ≥ 0
0 otherwise
(Plot: Acidity vs. Sweetness feature space for drawing the decision boundary.)
Binary classification with a linear classifier
Beverage | Acidity (A) | Sweetness (S) | y = ‘Great taste’? | Model 4 predicts |
Bev1 | 0.8 | 0.8 | 1 | ? |
Bev2 | 0.3 | 0.25 | 0 | ? |
Bev3 | 0.2 | 0.8 | 0 | ? |
Bev4 | 0.3 | 0.7 | 0 | ? |
Bev5 | 0.9 | 0.7 | 1 | ? |
Compute the predictions of this model.
Draw the decision boundary
Model 4:
f(A,S) = 1 if A+S -1.3 ≥ 0
0 otherwise
(Plot: Acidity vs. Sweetness feature space for drawing the decision boundary.)
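A small sketch (plain Python) that evaluates both linear classifiers on the training table, computing the predictions directly from the formulas above:

```python
data = [
    (0.8, 0.80, 1),  # Bev1
    (0.3, 0.25, 0),  # Bev2
    (0.2, 0.80, 0),  # Bev3
    (0.3, 0.70, 0),  # Bev4
    (0.9, 0.70, 1),  # Bev5
]

def linear_classifier(A, S, threshold):
    # Predict 1 when A + S - threshold >= 0, otherwise 0 (a linear decision boundary).
    return 1 if A + S - threshold >= 0 else 0

for name, t in [("Model 3", 1.0), ("Model 4", 1.3)]:
    preds = [linear_classifier(A, S, t) for A, S, _ in data]
    acc = sum(p == y for p, (_, _, y) in zip(preds, data)) / len(data)
    print(name, preds, "training accuracy:", acc)
```

Model 4 simply shifts the boundary from the line A + S = 1 to A + S = 1.3, which changes how the five training points are classified.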
Training and Test set
(Could have billions of combinations of features)
How do we build a prediction model that maps (Acidity, Sweetness) to whether people will like it?
How do we evaluate whether the model works well? (Model evaluation)
How do we interpret whether it works well for the right reasons? (Model interpretability)
We will spend a lot of time covering all these topics
Beverage | Acidity (A) | Sweetness (S) | y = ‘Great taste’? |
Bev1 | 0.8 | 0.8 | 1 |
Bev2 | 0.3 | 0.25 | 0 |
Bev3 | 0.2 | 0.8 | 0 |
Bev4 | 0.3 | 0.7 | 0 |
Bev5 | 0.9 | 0.7 | 1 |
Beverage | Acidity (A) | Sweetness (S) | y = ‘Great taste’? |
Bev6 | 0.7 | 0.7 | ? |
Bev7 | 0.1 | 0.1 | ? |
… | … | … | ? |
… | … | … | ? |
… | … | … | ? |
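A minimal sketch of the train/test mechanics (assuming scikit-learn is available): fit a small tree on the labeled rows and predict on the unlabeled ones. Since the true labels for Bev6 and Bev7 are not given above, we can only produce predictions; scoring test accuracy would require collecting those labels.

```python
from sklearn.tree import DecisionTreeClassifier

X_train = [[0.8, 0.80], [0.3, 0.25], [0.2, 0.80], [0.3, 0.70], [0.9, 0.70]]
y_train = [1, 0, 0, 0, 1]
X_test = [[0.7, 0.70], [0.1, 0.10]]   # Bev6, Bev7 (labels unknown)

clf = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)
print(clf.predict(X_test))            # predictions only; no test labels to compare against
```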
There is always one very bad model
Beverage | Acidity (A) | Sweetness (S) | y = ‘Great taste’? |
Bev1 | 0.8 | 0.8 | 1 |
Bev2 | 0.3 | 0.25 | 0 |
Bev3 | 0.2 | 0.8 | 0 |
Bev4 | 0.3 | 0.7 | 0 |
Bev5 | 0.9 | 0.7 | 1 |
Beverage | Acidity (A) | Sweetness (S) | y = ‘Great taste’? |
Bev6 | 0.7 | 0.7 | ? |
Bev7 | 0.1 | 0.1 | ? |
… | … | … | ? |
… | … | … | ? |
… | … | … | ? |
Machine Learning
(Diagram: Data goes through Training to produce a Model; at Inference, a Query is fed to the trained Model to produce a Decision.)
What kinds of problems are Machine Learning Problems?
When should I use Machine Learning?
Kinds of Problems
Engineering Problem: Can be solved with a direct, specifiable algorithm or a set of hand-written rules.
Machine Learning Problem: For which it is easy to demonstrate or evaluate the solution but difficult to directly implement.
A Human Problem: The problem cannot be well specified and/or human judgement is required.
🡪 Often require Engineering + ML + Humans 🤝
A Machine Learning Problem
A problem for which it is easy to demonstrate or evaluate the solution but difficult to directly implement.
Example: Determine if a text message is spam.
How do you define Spam? Spam is difficult to define and depends on the receiver.
It is easier to demonstrate examples and learn a function to detect spam.
Machine Learning Solution: The system learns the desired behavior (e.g., a prediction, representation, or policy) through demonstration or experience.
Is Chatting a Machine Learning Problem?
Example (ChatGPT): Engage a human in a productive conversation.
How do you program this?
ELIZA (1966): a rule-based conversational system. Entertaining, but it can't do your homework.
We can demonstrate good conversations.
We can judge good conversations.
Machine Learning as Learned Function Approximation
(Diagram: a Text Message (input X) goes into a Function (Model) with Model Parameters, producing the output Y: Is it Spam? No (0) / Yes (1).)
Machine learning becomes the process of “learning” the model parameters from data or interaction with the world.
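As a toy illustration of "model = function with learnable parameters", here is a hypothetical bag-of-words spam scorer; the word weights are made up and would normally be learned from labeled messages.

```python
# Learnable parameters (hand-set here only for illustration).
weights = {"free": 1.5, "winner": 2.0, "meeting": -1.0}
bias = -1.0

def is_spam(message: str) -> int:
    # Output Y: 1 = spam, 0 = not spam, from a weighted sum over the words (input X).
    score = bias + sum(weights.get(word, 0.0) for word in message.lower().split())
    return int(score >= 0)

print(is_spam("You are a WINNER claim your FREE prize"))  # 1
print(is_spam("Lunch meeting moved to noon"))             # 0
```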
Learning Settings
(The same input → model → output picture, annotated with what we observe in each setting.)
Supervised (Demonstration): observe pairs {(X, Y)}
Unsupervised: observe only {X}
Reinforcement (Reward): observe X and a reward signal, reward(.)
Supervised Learning
Trying to learn a relationship between observed {(X,Y)} pairs.
Classification (Image Labeling): X: Image, Y: {Hot Dog, …}
Regression (Stock Prediction): X: History, Y: Next Value
Next-word prediction: X: Prompt, Y: Next Word
Image and Video Generation: X: Prompt + Noise, Y: Pixel Values
Unsupervised Learning
Trying to model the data in the absence of explicit labels.
Used for visualization and as a step in other ML tasks.
Dimensionality Reduction
Clustering & Density Estimation
(Figure: an image is compressed to a low-dimensional representation and then reconstructed as an approximate image.)
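A minimal sketch of both tasks (assuming numpy and scikit-learn are available), on made-up unlabeled data with two obvious groups:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 100 unlabeled points in 5 dimensions, drawn from two well-separated blobs.
X = np.vstack([rng.normal(0.0, 0.3, (50, 5)), rng.normal(3.0, 0.3, (50, 5))])

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
X_2d = PCA(n_components=2).fit_transform(X)   # low-dimensional representation

print(clusters[:5], clusters[-5:])   # the two groups are recovered without any labels
print(X_2d.shape)                    # (100, 2): useful for visualization
```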
Reinforcement Learning
Learning from reward signals often with complex multi-step (discrete or continuous) action sequences.
Not covered in this class but a direct extension of topics in this class.
Action: next move,
Reward: Win/Lose
Action: change in joint angles
Reward: Fold quality
Action: next token
Reward: answer quality
Types of supervised learning problems (Tabular data)
Types of supervised learning problems (Image data / Computer vision)
Types of supervised learning problems (Time series data)
Types of supervised learning problems (Natural Language Processing (NLP))
A taxonomy for machine learning
Splits into:
OK, so what do we do?
Unsupervised problems: Clustering
Unsupervised problems: Dimensionality Reduction
History of Machine Learning
History of ML
1950s–60s Early Days | 1970s–80s Challenges and Advances | 1990s Rise of Statistical ML | 2000s Big Data Era | 2010s Deep Learning Revolution | Present |
Self-learning checkers program (1959) Perceptron (1957) | Decision trees, RL basics, and the rediscovery of NNs | Probabilistic models & statistical learning. Focus on math foundations. | Datasets grew & computation became cheaper Rise of Data Mining & Data Science | Deep learning (2012) | ? |
History of ML
1950s–60s Early Days | 1970s–80s Challenges and Advances | 1990s Rise of Statistical ML | 2000s Big Data Era | 2010s Deep Learning Revolution | Present |
Self-learning checkers program (1959) Perceptron (1957) | Decision trees, RL basics, and the rediscovery of NNs | Probabilistic models & statistical learning. Focus on math foundations. | Datasets grew & computation became cheaper Rise of Data Mining & Data Science | Deep learning (2012) | Generative models and Large Language/Large Vision Models |
Today – GenAI
UC Berkeley is at the center of the AI revolution:
What does it mean?
Models that can generate the data.
Why is it important?
Unlocking new advanced general AI abilities.
Will we cover it?
Yes!
…
Get Involved in
Research!
How We Teach ML Has Evolved
(Timeline of course textbooks across 1996, 2006, 2013-2022, and 2023, with emphasis shifting from Probability and Linear Algebra to Deep Learning (NN + Prob + Lin Alg.).)
Click on the books to get free PDF versions of all of them!
Teaching CS189 w/ Bishop’s Latest Book
We will follow the book's notation and concepts.
Issues with the book (for this class):
Each lecture will have a list of the textbook sections that we covered, and you're STRONGLY encouraged to read the textbook!
The ML Process (Lifecycle)
ML Lifecycle
LEARNING PROBLEM → MODEL DESIGN → OPTIMIZATION → PREDICT & EVALUATE
ML Lifecycle
LEARNING PROBLEM
This stage is about framing the real-world question into something a machine learning model can answer.
ML Lifecycle
LEARNING PROBLEM
MODEL DESIGN
Choose and design an appropriate model.
ML Lifecycle
LEARNING PROBLEM
MODEL DESIGN
OPTIMIZATION
Adjusting the model’s parameters to minimize error using optimization algorithms.
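For intuition, here is a minimal sketch of gradient descent on the mean squared error of a one-parameter model y ≈ w·x; the data is made up and the learning rate is an arbitrary choice.

```python
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 8.1]

w, lr = 0.0, 0.01                 # initial parameter and learning rate
for step in range(200):
    # Gradient of mean((w*x - y)^2) with respect to w.
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad                # adjust the parameter to reduce the error
print(w)                          # ends up near 2, the slope that fits the data
```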
ML Lifecycle
LEARNING PROBLEM
PREDICT & EVALUATE
MODEL DESIGN
OPTIMIZATION
Test how well the model performs.
Evaluate predictions using evaluation metrics.
Teach ML "Backwards"
We are going to …
Classic Machine Learning Classes
(Diagram comparing the "Classic Machine Learning Class" and "This Machine Learning Class" pipelines: Model, Algorithm, Application, with owl/cat example images.)
Teaching Machine Learning Differently
Greater Focus on Application Framing
Greater Focus on ML Engineering
You will learn to use tools for ML:
We will work in the Google Colab environment, but you will also be able to use your own tools if you prefer.
Logistics
Course Map at a Glance
1-2: Introduction and ML Mechanics
3-6: Supervised Core (k-means/EM; regression → classification; GD)
7-8: Neural Networks (NN fundamentals: backprop, non-linearity, regularization)
9: Midterm Week
10-11: Advanced Architectures (CNN, RNN, Transformer, LLM)
12-14: Advanced Topics (Generative Models, Autoencoder, Dimensionality Reduction)
15-16: Applications (Guest Lecture, more advanced applications)
Assessment Cadence
Prerequisites
CS189
Prerequisites
Course Staff
CS189
Course Platforms - Askademia
Course Platforms - Askademia
Introduction to Machine Learning
Lecture 1
Credit: Joseph E. Gonzalez, Narges Norouzi
Reference Book Chapters: Chapter 1 (Section 1.1)