Introduction to Machine Learning
Lecture 1
Welcome to CS189/289
EECS 189/289, Spring 2026 @ UC Berkeley
Jennifer Listgarten and Alex Dimakis
All emails should go to: cs189-instructors@berkeley.edu
Join at slido.com #5560345
Roadmap
Introductions
CS 189/289 Goals and Plans
Why Are We Excited to Teach CS189?
By the End of CS189, You Should:
These are ambitious goals!
Excel in Advanced ML Courses
Require understanding of basic concepts:
…
Advanced ML course:
What Is Machine Learning?
What Is Machine Learning (ML)?
Basic Recipe:
Software systems that improve (learn) through data.
Classic Example: What is Spam?
Classic Example: Face detection?
What Is Artificial Intelligence?
Artificial intelligence (AI) refers to the capability of computational systems to perform tasks typically associated with human intelligence, such as learning, reasoning, problem-solving, perception, and decision-making.
-- Wikipedia (2025)
The primary technology behind modern AI is Machine Learning (ML).
Artificial Intelligence Is the Goal; Machine Learning Is the Method
Terms are often used interchangeably (even by experts).
If you are selling
An entrepreneur’s note on AI marketing
You have (probably) already done Machine Learning
Linear Regression is Machine Learning
Basic Recipe:
Training: Use data to teach (fit) a model.
Inference: Use the model to make predictions (decisions).
Data → Model → Prediction
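To make the recipe concrete, here is a minimal sketch of the train → infer loop using linear regression in scikit-learn; the data (hours studied vs. exam score) is made up purely for illustration and is not from the lecture.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: hours studied (feature) and exam score (target).
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([52.0, 61.0, 70.0, 78.0])

model = LinearRegression()
model.fit(X_train, y_train)        # training: fit the model to data

print(model.predict([[5.0]]))      # inference: predict for an unseen input
```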
Overview of the landscape
Terms used:
Data science
Machine Learning
Artificial Intelligence
Big Data
Data Mining
Statistics
@jeremyjarvis
“A data scientist is a statistician who lives in San Francisco”
@BigDataBorat
Data science is statistics on a Mac.
@josh_wills
Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician
(anonymous)
The difference between statistics and data science is about 30k per year.
(anonymous)
It is statistics if it's done in R.
It is machine learning if it's done in Python.
It is AI if it's done in PowerPoint.
Overview of the landscape
Statistics: The very first line of the American Statistical Association’s definition of statistics is “Statistics is the science of learning from data...” Given that the words “data” and “science” appear in the definition, one might assume that data science is just a rebranding of statistics.
Statisticians are not happy that they are not getting the research funding and salary bumps that come with it.
Data science: Broad and modern term. Includes much more software engineering knowledge. Usually done in Python (as opposed to R/Stata/SAS/SPSS/Excel, etc.). Includes analytics on bigger datasets (e.g., terabytes or petabytes), using tools like Apache Spark and Hadoop MapReduce, which enable distributed processing. Includes data collection and data cleaning pipelines (data engineering / data wrangling), plus connections to database backends and web-serving front ends. Includes, to some extent, machine learning and AI as sub-areas.
Machine learning: The more mathematically complex part of data science, focused on modeling (as opposed to software). Includes supervised learning, predictive modeling, unsupervised learning (like clustering), and text and image understanding.
Artificial Intelligence: Broad and classic term that includes machine learning as a sub-discipline. Allowing computers to do things that humans do when they say they are thinking. Includes perception (image understanding, speech understanding), language translation, playing games, and statistical machine learning techniques, but also logic-based symbolic AI, reasoning, planning, and robotics.
Data Mining: Applied version of Machine learning. Includes more large-scale software and performance issues. Also intersects the database community.
Big Data: Focused on scaling data analytics to very large data sets. Part of data science that will hopefully follow "the information superhighway" and "internet surfing" into obsolete historical nomenclature.
In terms of research communities:
Statistics
Research published in stats journals. Top venues: Annals of Statistics, JASA, Journal of the Royal Statistical Society.
Data science:
Not a properly defined research community.
Machine learning:
Research published in top ML conferences: NeurIPS, ICML, also more recently ICLR. Also includes KDD (more applied, data mining).
https://iclr.cc/virtual_2020/index.html
Artificial Intelligence:
Includes ML conferences but also AAAI and IJCAI as top venues.
Data Mining: Applied version of Machine learning. Includes more large-scale software and performance issues. Also intersects the database community.
Research published in top Data Mining conferences: KDD, SDM.
Big Data:
Not a properly defined research community.
Engineers who can set up Hadoop/Spark clusters. Can work on data directly on disk and process it at massive scale.
A taxonomy for machine learning
Supervised learning: Binary classification
Beverage | Acidity (A) | Sweetness (S) | y = ‘Great taste’? |
Bev1 | 0.8 | 0.8 | 1 |
Bev2 | 0.3 | 0.25 | 0 |
Bev3 | 0.2 | 0.8 | 0 |
Bev4 | 0.3 | 0.7 | 0 |
Bev5 | 0.9 | 0.7 | 1 |
Supervised learning: Binary classification. Data-driven Taste Test example
The target variable y is now binary (good taste or not): binary classification.
y could instead have multiple levels (Poor, Mediocre, Good, Great): multi-class classification.
Or y could be a continuous number to predict (a taste score from 1 to 100): regression.
Beverage | Acidity (A) | Sweetness (S) | y = ‘Great taste’? |
Bev1 | 0.8 | 0.8 | 1 |
Bev2 | 0.3 | 0.25 | 0 |
Bev3 | 0.2 | 0.8 | 0 |
Bev4 | 0.3 | 0.7 | 0 |
Bev5 | 0.9 | 0.7 | 1 |
Supervised learning: Binary classification
Beverage | Acidity (A) | Sweetness (S) | y = ‘Great taste’? |
Bev1 | 0.8 | 0.8 | 1 |
Bev2 | 0.3 | 0.25 | 0 |
Bev3 | 0.2 | 0.8 | 0 |
Bev4 | 0.3 | 0.7 | 0 |
Bev5 | 0.9 | 0.7 | 1 |
How do we make a good prediction rule?
A first idea: Lookup Table
That won't work (see the sketch below). We need some kind of rule, ideally a simple one. We now discuss two common frameworks for this: decision trees and linear classifiers.
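As a quick illustration (a toy sketch in Python, using the feature values from the table above), a lookup table can only answer queries it has memorized:

```python
# Memorize the five training beverages as (Acidity, Sweetness) -> label.
lookup = {
    (0.8, 0.80): 1,  # Bev1
    (0.3, 0.25): 0,  # Bev2
    (0.2, 0.80): 0,  # Bev3
    (0.3, 0.70): 0,  # Bev4
    (0.9, 0.70): 1,  # Bev5
}

print(lookup[(0.8, 0.80)])       # memorized beverage: returns 1
print(lookup.get((0.7, 0.70)))   # new beverage: returns None (no rule to generalize)
```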
Binary classification with a short tree: a decision stump
Beverage | Acidity (A) | Sweetness (S) | y = ‘Great taste’? | Model 1 predicts |
Bev1 | 0.8 | 0.8 | 1 | |
Bev2 | 0.3 | 0.25 | 0 | |
Bev3 | 0.2 | 0.8 | 0 | |
Bev4 | 0.3 | 0.7 | 0 | |
Bev5 | 0.9 | 0.7 | 1 | |
If S >= 0.75: predict f(x) = 1; otherwise (o/w): predict f(x) = 0.
Binary classification with a short tree: a decision stump
Beverage | Acidity (A) | Sweetness (S) | y = ‘Great taste’? | Model 1 predicts |
Bev1 | 0.8 | 0.8 | 1 | 1 |
Bev2 | 0.3 | 0.25 | 0 | 0 |
Bev3 | 0.2 | 0.8 | 0 | 1 |
Bev4 | 0.3 | 0.7 | 0 | 0 |
Bev5 | 0.9 | 0.7 | 1 | 0 |
If S >= 0.75: predict f(x) = 1; otherwise (o/w): predict f(x) = 0.
Accuracy of this model on the training set is:
?/ 5
Binary classification with a short tree: a decision stump
Beverage | Acidity (A) | Sweetness (S) | y = ‘Great taste’? | Model 1 predicts |
Bev1 | 0.8 | 0.8 | 1 | 1 |
Bev2 | 0.3 | 0.25 | 0 | 0 |
Bev3 | 0.2 | 0.8 | 0 | 1 |
Bev4 | 0.3 | 0.7 | 0 | 0 |
Bev5 | 0.9 | 0.7 | 1 | 0 |
If S >= 0.75: predict f(x) = 1; otherwise (o/w): predict f(x) = 0.
Accuracy of this model on the training set is:
3/ 5
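A minimal sketch (plain Python, using the table above) that reproduces this 3/5 training accuracy for the stump:

```python
data = [
    # (Acidity, Sweetness, y)
    (0.8, 0.80, 1),  # Bev1
    (0.3, 0.25, 0),  # Bev2
    (0.2, 0.80, 0),  # Bev3
    (0.3, 0.70, 0),  # Bev4
    (0.9, 0.70, 1),  # Bev5
]

def stump(A, S):
    # Model 1: predict 1 when Sweetness >= 0.75, otherwise 0.
    return 1 if S >= 0.75 else 0

correct = sum(stump(A, S) == y for A, S, y in data)
print(f"{correct}/{len(data)}")   # 3/5
```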
Partitioning the feature space
Beverage | Acidity (A) | Sweetness (S) | y = ‘Great taste’? |
Bev1 | 0.8 | 0.8 | 1 |
Bev2 | 0.3 | 0.25 | 0 |
Bev3 | 0.2 | 0.8 | 0 |
Bev4 | 0.3 | 0.7 | 0 |
Bev5 | 0.9 | 0.7 | 1 |
If S >= 0.75: predict f(x) = 1; otherwise (o/w): predict f(x) = 0.
Let's position the data in this feature space (axes: Acidity and Sweetness).
Partitioning the feature space
Beverage | Acidity (A) | Sweetness (S) | y = ‘Great taste’? |
Bev1 | 0.8 | 0.8 | 1 |
Bev2 | 0.3 | 0.25 | 0 |
Bev3 | 0.2 | 0.8 | 0 |
Bev4 | 0.3 | 0.7 | 0 |
Bev5 | 0.9 | 0.7 | 1 |
If S >= 0.75: predict f(x) = 1; otherwise (o/w): predict f(x) = 0.
Each binary classifier has a decision region: how it partitions the feature space.
(Plot: the Acidity vs. Sweetness feature space split by the boundary S = 0.75.)
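If you want to see the partition yourself, here is a small matplotlib sketch (assuming Acidity on the x-axis and Sweetness on the y-axis, which is just a plotting choice):

```python
import matplotlib.pyplot as plt

A = [0.8, 0.3, 0.2, 0.3, 0.9]          # Acidity
S = [0.80, 0.25, 0.80, 0.70, 0.70]     # Sweetness
y = [1, 0, 0, 0, 1]                    # 'Great taste' labels

plt.scatter(A, S, c=["green" if label == 1 else "red" for label in y])
plt.axhline(0.75, linestyle="--")      # the stump's decision boundary S = 0.75
plt.xlabel("Acidity")
plt.ylabel("Sweetness")
plt.title("Decision region of the stump: predict 1 above S = 0.75")
plt.show()
```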
Binary classification with a depth-2 decision tree
Beverage | Acidity (A) | Sweetness (S) | y = ‘Great taste’? | Model 2 predicts |
Bev1 | 0.8 | 0.8 | 1 | 1 |
Bev2 | 0.3 | 0.25 | 0 | 0 |
Bev3 | 0.2 | 0.8 | 0 | 0 |
Bev4 | 0.3 | 0.7 | 0 | 0 |
Bev5 | 0.9 | 0.7 | 1 | 0 |
A>= 0.5
S > 0.75
1
Predict f(x)=0
Bev2=0
Bev3=0
Bev4=0
o/w
Accuracy of this model on the training set is:
1/5 = 20% -> 4/5
The loss function we use is the standard 0-1 loss; equivalently, we track accuracy.
?
2
o/w
3
Predict f(x)=1
Bev5=1
Predict f(x)=1
Bev1=1
Model 2:
This model splits first on A with threshold 0.5
and then on S with threshold 0.75.
f(x) = ŷ, which ideally equals y.
f(A = 0.8, S = 0.8) = 1
Binary classification with a depth-2 decision tree
Beverage | Acidity (A) | Sweetness (S) | y = ‘Great taste’? | Model 2 predicts |
Bev1 | 0.8 | 0.8 | 1 | 1 |
Bev2 | 0.3 | 0.25 | 0 | 1 |
Bev3 | 0.2 | 0.8 | 0 | 1 |
Bev4 | 0.3 | 0.7 | 0 | 1 |
Bev5 | 0.9 | 0.7 | 1 | 0 |
A>= 0.5
S>0.75
1
Predict f(x)=1
o/w
Accuracy of this model on the training set is:
?
2
o/w
3
Predict f(x)=0
Predict f(x)=1
Model 2:
This model splits first on A with threshold 0.5
and then on S with threshold 0.75.
Could you get better training accuracy by labeling the leaves differently? YES.
What is a general algorithm for finding the leaf labels that maximize training accuracy for a given tree?
Make one pass over the dataset, place every sample in its leaf, and label each leaf with the majority of the training labels that land there.
What is the highest training accuracy you can get?
You can get 100% on this training dataset.
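Here is a minimal sketch of that leaf-labeling procedure in Python, for the fixed splits A >= 0.5 and then S > 0.75; the leaf names are just illustrative identifiers.

```python
from collections import Counter, defaultdict

data = [
    (0.8, 0.80, 1),  # Bev1
    (0.3, 0.25, 0),  # Bev2
    (0.2, 0.80, 0),  # Bev3
    (0.3, 0.70, 0),  # Bev4
    (0.9, 0.70, 1),  # Bev5
]

def leaf_of(A, S):
    # Route a sample through the fixed depth-2 tree: split on A, then on S.
    if A >= 0.5:
        return "leaf 1" if S > 0.75 else "leaf 2"
    return "leaf 3"

# One pass over the training set: collect the labels that land in each leaf ...
labels_in_leaf = defaultdict(list)
for A, S, y in data:
    labels_in_leaf[leaf_of(A, S)].append(y)

# ... then label every leaf with the majority of its training labels.
leaf_label = {leaf: Counter(ys).most_common(1)[0][0]
              for leaf, ys in labels_in_leaf.items()}

accuracy = sum(leaf_label[leaf_of(A, S)] == y for A, S, y in data) / len(data)
print(leaf_label, accuracy)   # 100% on this training set
```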
Binary classification with a depth-2 decision tree
Beverage | Acidity (A) | Sweetness (S) | y = ‘Great taste’? | Model 2 predicts |
Bev1 | 0.8 | 0.8 | 1 | 1 |
Bev2 | 0.3 | 0.25 | 0 | 1 |
Bev3 | 0.2 | 0.8 | 0 | 1 |
Bev4 | 0.3 | 0.7 | 0 | 1 |
Bev5 | 0.9 | 0.7 | 1 | 0 |
A>= 0.5
S>0.75
leaf 1
o/w
Accuracy of this model on the training set is:
o/w
leaf 2
leaf 3
Model 2:
(Plot: the Acidity vs. Sweetness feature space partitioned into leaf 1, leaf 2, and leaf 3.)
Binary classification with a linear classifier
Beverage | Acidity (A) | Sweetness (S) | y = ‘Great taste’? | Model 3 predicts |
Bev1 | 0.8 | 0.8 | 1 | ? |
Bev2 | 0.3 | 0.25 | 0 | ? |
Bev3 | 0.2 | 0.8 | 0 | ? |
Bev4 | 0.3 | 0.7 | 0 | ? |
Bev5 | 0.9 | 0.7 | 1 | ? |
Compute the predictions of this model.
Draw the decision boundary
Model 3:
f(A,S) = 1 if A+S -1 ≥ 0
0 otherwise
(Plot: Acidity vs. Sweetness feature space for drawing the decision boundary.)
Binary classification with a linear classifier
Beverage | Acidity (A) | Sweetness (S) | y = ‘Great taste’? | Model 4 predicts |
Bev1 | 0.8 | 0.8 | 1 | ? |
Bev2 | 0.3 | 0.25 | 0 | ? |
Bev3 | 0.2 | 0.8 | 0 | ? |
Bev4 | 0.3 | 0.7 | 0 | ? |
Bev5 | 0.9 | 0.7 | 1 | ? |
Compute the predictions of this model.
Draw the decision boundary
Model 4:
f(A,S) = 1 if A+S -1.3 ≥ 0
0 otherwise
(Plot: Acidity vs. Sweetness feature space for drawing the decision boundary.)
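A small sketch (plain Python) that evaluates both linear classifiers on the training table, computing the predictions directly from the formulas above:

```python
data = [
    (0.8, 0.80, 1),  # Bev1
    (0.3, 0.25, 0),  # Bev2
    (0.2, 0.80, 0),  # Bev3
    (0.3, 0.70, 0),  # Bev4
    (0.9, 0.70, 1),  # Bev5
]

def linear_classifier(A, S, threshold):
    # Predict 1 when A + S - threshold >= 0, otherwise 0 (a linear decision boundary).
    return 1 if A + S - threshold >= 0 else 0

for name, t in [("Model 3", 1.0), ("Model 4", 1.3)]:
    preds = [linear_classifier(A, S, t) for A, S, _ in data]
    acc = sum(p == y for p, (_, _, y) in zip(preds, data)) / len(data)
    print(name, preds, "training accuracy:", acc)
```

Model 4 simply shifts the boundary from the line A + S = 1 to A + S = 1.3, which changes how the five training points are classified.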
Training and Test set
(Could have billions of combinations of features)
How do we build a prediction model that maps (Acidity, Sweetness) to whether people will like it?
How do we evaluate whether the model works well? (Model evaluation)
How do we interpret whether it works well for the right reasons? (Model interpretability)
We will spend a lot of time covering all these topics
Beverage | Acidity (A) | Sweetness (S) | y = ‘Great taste’? |
Bev1 | 0.8 | 0.8 | 1 |
Bev2 | 0.3 | 0.25 | 0 |
Bev3 | 0.2 | 0.8 | 0 |
Bev4 | 0.3 | 0.7 | 0 |
Bev5 | 0.9 | 0.7 | 1 |
Beverage | Acidity (A) | Sweetness (S) | y = ‘Great taste’? |
Bev6 | 0.7 | 0.7 | ? |
Bev7 | 0.1 | 0.1 | ? |
… | … | … | ? |
… | … | … | ? |
… | … | … | ? |
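A minimal sketch of the train/test mechanics (assuming scikit-learn is available): fit a small tree on the labeled rows and predict on the unlabeled ones. Since the true labels for Bev6 and Bev7 are not given above, we can only produce predictions; scoring test accuracy would require collecting those labels.

```python
from sklearn.tree import DecisionTreeClassifier

X_train = [[0.8, 0.80], [0.3, 0.25], [0.2, 0.80], [0.3, 0.70], [0.9, 0.70]]
y_train = [1, 0, 0, 0, 1]
X_test = [[0.7, 0.70], [0.1, 0.10]]   # Bev6, Bev7 (labels unknown)

clf = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)
print(clf.predict(X_test))            # predictions only; no test labels to compare against
```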
There is always one very bad model
Beverage | Acidity (A) | Sweetness (S) | y = ‘Great taste’? |
Bev1 | 0.8 | 0.8 | 1 |
Bev2 | 0.3 | 0.25 | 0 |
Bev3 | 0.2 | 0.8 | 0 |
Bev4 | 0.3 | 0.7 | 0 |
Bev5 | 0.9 | 0.7 | 1 |
Beverage | Acidity (A) | Sweetness (S) | y = ‘Great taste’? |
Bev6 | 0.7 | 0.7 | ? |
Bev7 | 0.1 | 0.1 | ? |
… | … | … | ? |
… | … | … | ? |
… | … | … | ? |
Machine Learning
(Diagram: Data goes through Training to produce a Model; at Inference, a Query is fed to the trained Model to produce a Decision.)
What kinds of problems are Machine Learning Problems?
When should I use Machine Learning?
Kinds of Problems
Engineering Problem: Can be solved with a direct, specifiable algorithm or a set of hand-written rules.
Machine Learning Problem: For which it is easy to demonstrate or evaluate the solution but difficult to directly implement.
A Human Problem: The problem cannot be well specified and/or human judgement is required.
🡪 Often require Engineering + ML + Humans 🤝
A Machine Learning Problem
A problem for which it is easy to demonstrate or evaluate the solution but difficult to directly implement.
Example: Determine if a text message is spam.
How do you define Spam? Spam is difficult to define and depends on the receiver.
It is easier to demonstrate examples and learn a function to detect spam.
Machine Learning Solution: The system learns the desired behavior (e.g., a prediction, representation, or policy) through demonstration or experience.
Is Chatting a Machine Learning Problem?
Example (ChatGPT): Engage a human in a productive conversation.
How do you program this?
ELIZA (1966): a rule-based conversational system. Entertaining, but it can't do your homework.
We can demonstrate good conversations.
We can judge good conversations.
Machine Learning as Learned Function Approximation
(Diagram: a Text Message (input X) goes into a Function (Model) with Model Parameters, producing the output Y: Is it Spam? No (0) / Yes (1).)
Machine learning becomes the process of “learning” the model parameters from data or interaction with the world.
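As a toy illustration of "model = function with learnable parameters", here is a hypothetical bag-of-words spam scorer; the word weights are made up and would normally be learned from labeled messages.

```python
# Learnable parameters (hand-set here only for illustration).
weights = {"free": 1.5, "winner": 2.0, "meeting": -1.0}
bias = -1.0

def is_spam(message: str) -> int:
    # Output Y: 1 = spam, 0 = not spam, from a weighted sum over the words (input X).
    score = bias + sum(weights.get(word, 0.0) for word in message.lower().split())
    return int(score >= 0)

print(is_spam("You are a WINNER claim your FREE prize"))  # 1
print(is_spam("Lunch meeting moved to noon"))             # 0
```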
Learning Settings
(The same input → model → output picture, annotated with what we observe in each setting.)
Supervised (Demonstration): observe pairs {(X, Y)}
Unsupervised: observe only {X}
Reinforcement (Reward): observe X and a reward signal, reward(.)
Supervised Learning
Trying to learn a relationship between observed {(X,Y)} pairs.
Classification (Image Labeling): X: Image, Y: {Hot Dog, …}
Regression (Stock Prediction): X: History, Y: Next Value
Next-word prediction: X: Prompt, Y: Next Word
Image and Video Generation: X: Prompt + Noise, Y: Pixel Values
Unsupervised Learning
Trying to model the data in the absence of explicit labels.
Used for visualization and as a step in other ML tasks.
Dimensionality Reduction
Clustering & Density Estimation
(Figure: an image is compressed to a low-dimensional representation and then reconstructed as an approximate image.)
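A minimal sketch of both tasks (assuming numpy and scikit-learn are available), on made-up unlabeled data with two obvious groups:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 100 unlabeled points in 5 dimensions, drawn from two well-separated blobs.
X = np.vstack([rng.normal(0.0, 0.3, (50, 5)), rng.normal(3.0, 0.3, (50, 5))])

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
X_2d = PCA(n_components=2).fit_transform(X)   # low-dimensional representation

print(clusters[:5], clusters[-5:])   # the two groups are recovered without any labels
print(X_2d.shape)                    # (100, 2): useful for visualization
```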
Reinforcement Learning
Learning from reward signals often with complex multi-step (discrete or continuous) action sequences.
Not covered in this class but a direct extension of topics in this class.
Action: next move,
Reward: Win/Lose
Action: change in joint angles
Reward: Fold quality
Action: next token
Reward: answer quality
Types of supervised learning problems (Tabular data)
Types of supervised learning problems (Image data / Computer vision)
Types of supervised learning problems (Time series data)
Types of supervised learning problems (Natural Language Processing (NLP))
A taxonomy for machine learning
Splits into:
OK, so what do we do?
Unsupervised problems: Clustering
Unsupervised problems: Dimensionality Reduction
History of Machine Learning
History of ML
1950s–60s Early Days | 1970s–80s Challenges and Advances | 1990s Rise of Statistical ML | 2000s Big Data Era | 2010s Deep Learning Revolution | Present |
Self-learning checkers program (1959) Perceptron (1957) | Decision trees, RL basics, and the rediscovery of NNs | Probabilistic models & statistical learning. Focus on math foundations. | Datasets grew & computation became cheaper Rise of Data Mining & Data Science | Deep learning (2012) | ? |
History of ML
1950s–60s Early Days | 1970s–80s Challenges and Advances | 1990s Rise of Statistical ML | 2000s Big Data Era | 2010s Deep Learning Revolution | Present |
Self-learning checkers program (1959) Perceptron (1957) | Decision trees, RL basics, and the rediscovery of NNs | Probabilistic models & statistical learning. Focus on math foundations. | Datasets grew & computation became cheaper Rise of Data Mining & Data Science | Deep learning (2012) | Generative models and Large Language/Large Vision Models |
Today – GenAI
UC Berkeley is at the center of the AI revolution:
What does it mean?
Models that can generate the data.
Why is it important?
Unlocking new advanced general AI abilities.
Will we cover it?
Yes!
…
Get Involved in
Research!
How We Teach ML Has Evolved
(Timeline of course textbooks across 1996, 2006, 2013-2022, and 2023, with emphasis shifting from Probability and Linear Algebra to Deep Learning (NN + Prob + Lin Alg.).)
Click on the books to get free PDF versions of all of them!
Teaching CS189 w/ Bishop’s Latest Book
We will follow the book's notation and concepts.
Issues with the book (for this class):
Each lecture will have a list of the textbook sections that we covered, and you're STRONGLY encouraged to read the textbook!
The ML Process (Lifecycle)
ML Lifecycle
LEARNING PROBLEM → MODEL DESIGN → OPTIMIZATION → PREDICT & EVALUATE
ML Lifecycle
LEARNING PROBLEM
This stage is about framing the real-world question into something a machine learning model can answer.
ML Lifecycle
LEARNING PROBLEM
MODEL DESIGN
Choose and design an appropriate model.
ML Lifecycle
LEARNING PROBLEM
MODEL DESIGN
OPTIMIZATION
Adjusting the model’s parameters to minimize error using optimization algorithms.
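For intuition, here is a minimal sketch of gradient descent on the mean squared error of a one-parameter model y ≈ w·x; the data is made up and the learning rate is an arbitrary choice.

```python
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 8.1]

w, lr = 0.0, 0.01                 # initial parameter and learning rate
for step in range(200):
    # Gradient of mean((w*x - y)^2) with respect to w.
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad                # adjust the parameter to reduce the error
print(w)                          # ends up near 2, the slope that fits the data
```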
ML Lifecycle
LEARNING PROBLEM
PREDICT & EVALUATE
MODEL DESIGN
OPTIMIZATION
Test how well the model performs.
Evaluate predictions using evaluation metrics.
Teach ML "Backwards"
We are going to …
Classic Machine Learning Classes
(Diagram comparing the "Classic Machine Learning Class" and "This Machine Learning Class" pipelines: Model, Algorithm, Application, with owl/cat example images.)
Teaching Machine Learning Differently
Greater Focus on Application Framing
Greater Focus on ML Engineering
You will learn to use tools for ML:
We will work in the Google Colab environment, but you will also be able to use your own tools if you prefer.
Logistics
Course Map at a Glance
1-2: Introduction and ML Mechanics
3-6: Supervised Core (k-means/EM; regression → classification; GD)
7-8: Neural Networks (NN fundamentals: backprop, non-linearity, regularization)
9: Midterm Week
10-11: Advanced Architectures (CNN, RNN, Transformer, LLM)
12-14: Advanced Topics (Generative Models, Autoencoder, Dimensionality Reduction)
15-16: Applications (Guest Lecture, more advanced applications)
Assessment Cadence
Prerequisites
CS189
Prerequisites
Course Staff
CS189
Course Platforms - Askademia
Course Platforms - Askademia
Introduction to Machine Learning
Lecture 1
Credit: Joseph E. Gonzalez, Narges Norouzi
Reference Book Chapters: Chapter 1 (Section 1.1)