1 of 67

Binary Cross-Entropy Loss

Session 7

ACM AI + ACM TeachLA

Slides Link:

https://teachla.uclaacm.com/resources

2 of 67

What is your favorite city?

3 of 67

Recap: Bayes’ Theorem

4 of 67

Bayes’ Review:

Problem:

What is the probability that Grogu is a Star Wars fan (h) given that he has watched The Mandalorian (D)?

5 of 67

h: Grogu is a Star Wars fan    D (given): Grogu has watched The Mandalorian

Data Collected:

  • 30% of the overall population has watched The Mandalorian.  →  P(D) = 0.3
  • 95% of Star Wars fans have watched The Mandalorian.  →  P(D | h) = 0.95
  • Jason estimates ⅕ of the people at this gathering are Star Wars fans.  →  P(h) = 0.2

6 of 67

h: Grogu is a Star Wars fan    D (given): Grogu has watched The Mandalorian

The probability Grogu is a Star Wars fan given that he has watched The Mandalorian (checked in the short snippet below):

P(h | D) = P(D | h) P(h) / P(D) = (0.95)(0.2) / (0.3) ≈ 0.63, or 63%
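To sanity-check the arithmetic above, here is a minimal Python sketch of the same Bayes' Theorem calculation (the variable names are ours, chosen for illustration):

```python
# Quick check of Bayes' Theorem for the Grogu example:
# P(h | D) = P(D | h) * P(h) / P(D)

p_D_given_h = 0.95  # P(D | h): fans who have watched The Mandalorian
p_h = 0.2           # P(h): prior probability of being a Star Wars fan
p_D = 0.3           # P(D): share of the overall population that has watched it

p_h_given_D = p_D_given_h * p_h / p_D
print(f"P(h | D) = {p_h_given_D:.2f}")  # ~0.63, i.e. 63%
```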

7 of 67

Bayes’ Theorem and AI/ML?

8 of 67

Connection to ML

  • In Bayesian learning, we want to find the best hypothesis to fit our data

  • Just like finding the right function to fit our data in linear/logistic regression!

9 of 67

Maximum a Posteriori Hypothesis

  • The best hypothesis for our data is the hypothesis with the maximum probability given our data

Ex: which type of ad has the highest chance of being clicked by a given user

  • We can use the function argmax to find the most probable hypothesis

10 of 67

Maximum a Posteriori Hypothesis

  • We call this value the maximum a posteriori hypothesis, or h_MAP

11 of 67

Maximum a Posteriori Hypothesis

  • We call this value the maximum a posteriori hypothesis, or h_MAP

  • Using Bayes’ Theorem: h_MAP = argmax_h P(h | D) = argmax_h [ P(D | h) P(h) / P(D) ]

12 of 67

Maximum a Posteriori Hypothesis

  • We call this value the maximum a posteriori hypothesis, or h_MAP

  • Using Bayes’ Theorem: h_MAP = argmax_h P(h | D) = argmax_h [ P(D | h) P(h) / P(D) ]

  • Since P(D) is constant with respect to h, we don’t have to account for it when we’re trying to maximize P(h | D): h_MAP = argmax_h P(D | h) P(h)

13 of 67

Why can we get rid of P(D)?

  • P(D) is constant.
  • Let’s say we have a function that picks the largest number out of a group of numbers. If we have the 5 numbers shown below, what is the largest number?

Number 1   Number 2   Number 3   Number 4   Number 5
    2          4          6          8          9

What if we multiplied each number by 5?

   10         20         30         40         45

The largest value still belongs to Number 5: multiplying everything by the same positive constant doesn’t change which number the “pick the largest” function returns.
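A tiny Python sketch of the same idea (NumPy is assumed only for convenience): scaling every candidate by the same positive constant never changes which index argmax picks.

```python
import numpy as np

numbers = np.array([2, 4, 6, 8, 9])
scaled = 5 * numbers  # multiply every number by the same positive constant

# argmax returns the index of the largest element; scaling doesn't change it
print(np.argmax(numbers), np.argmax(scaled))  # both print 4 (Number 5)
```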

14 of 67

Maximum a Posteriori Hypothesis

  • We call this value the maximum a posteriori hypothesis, or h_MAP

  • Using Bayes’ Theorem: h_MAP = argmax_h [ P(D | h) P(h) / P(D) ]

  • Since P(D) is constant with respect to h, we don’t have to account for it: h_MAP = argmax_h P(D | h) P(h)

15 of 67

What if our data is uniformly distributed?

  • In a uniformly distributed dataset, all outcomes/hypotheses are equally likely
    • Ex. flipping heads or tails
      • P(heads) = P(tails) = ½
    • Ex. picking a red ball from a bag of 2 red balls, 2 blue balls, and 2 green balls
      • P(red) = P(blue) = P(green) = ⅓
  • What can we conclude about P(h) for any given hypothesis in these examples?

16 of 67

What if our data is uniformly distributed?

  • P(h) is constant!
  • What do we do with constant values in our argmax function?

17 of 67

Maximum Likelihood Estimation

  • For uniform distributions, we can define the maximum likelihood hypothesis as h_ML = argmax_h P(D | h)
  • Solving this equation is called maximum likelihood estimation (MLE); a small coin-flip sketch follows below
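As a small, hypothetical illustration of MLE (this coin-flip setup is ours, not from the slides): given some observed flips, we pick the hypothesis h whose P(D | h) is largest.

```python
# Hypothetical MLE example: which coin best explains 8 heads out of 10 flips?
heads, tails = 8, 2

# Candidate hypotheses: P(heads) under each coin
hypotheses = {"fair coin": 0.5, "biased coin": 0.8}

def likelihood(p_heads: float) -> float:
    """P(D | h): probability of this exact set of flips given the coin bias."""
    return p_heads**heads * (1 - p_heads)**tails

# h_ML = argmax_h P(D | h)
h_ml = max(hypotheses, key=lambda h: likelihood(hypotheses[h]))
print(h_ml)  # "biased coin" makes 8 heads out of 10 far more likely
```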

18 of 67

Last week, we learned about Bayes’ Theorem.

We can use Bayes’ Theorem to measure the probability of hypothesis h occurring given data D.

h : hypothesis (an event)

D : data (background information/event)

19 of 67

Last week, we learned about Bayes’ Theorem.

REMEMBER: Bayes’ Theorem is trying to find the probability of our hypothesis given some data

Do we want our probability to be high or low?

Hypothesis: I am funny

Data: My comedy TikTok got over 100k likes

20 of 67

Let’s Go Through an Example!

Problem:

Given that Grogu has watched The Mandalorian (D), is he a Star Wars fan or not?

What are our hypotheses?

h1 = Star Wars fan

h2 = not a Star Wars fan

21 of 67

Data Collected

  • 30% of the overall population has watched The Mandalorian.  →  P(D) = 0.3
  • 95% of Star Wars fans have watched The Mandalorian.  →  P(D | h1) = 0.95
  • 10% of non-Star Wars fans have watched The Mandalorian.  →  P(D | h2) = 0.1
  • We estimate that ⅕ of the people at this gathering are Star Wars fans.  →  P(h1) = 0.2, P(h2) = 0.8

h1 = Star Wars fan

h2 = not a Star Wars fan

22 of 67

Bayes’ Review:

Problem:

What is the probability that Grogu is a Star Wars fan (h) given that he has watched The Mandalorian (D)?

23 of 67

We need 3 things…

  • P(D | h) = 0.95: the probability that Grogu watched The Mandalorian given he is a Star Wars fan
  • P(h) = 0.2: the probability that Grogu is a Star Wars fan
  • P(D) = 0.3: the probability that Grogu watched The Mandalorian

…to find P(h | D): the probability that Grogu is a Star Wars fan given that he has watched The Mandalorian

24 of 67

h: Grogu is a Star Wars fan    D (given): Grogu has watched The Mandalorian

The probability Grogu is a Star Wars fan given that he has watched The Mandalorian:

P(h | D) = P(D | h) P(h) / P(D) = (0.95)(0.2) / (0.3) ≈ 0.63, or 63%

25 of 67

Rate your understanding! Bayes’ Theorem

How much do you understand Bayes’ Theorem?

1 2 3 4 5 6 7 8 9 10

If someone mentioned Bayes’ Theorem I wouldn’t know what they were talking about

I know what Bayes’ Theorem is, but I wouldn’t be able to solve a Bayes’ Theorem problem

I know what Bayes’ Theorem is used for and I know how to solve for P(h|D)

26 of 67

Maximum a Posteriori Hypothesis(MAP)

  • Let’s say now that we have 2 hypotheses and want to find the most likely one
  • We call this value the maximum a posteriori hypothesis, or h_MAP: h_MAP = argmax_h P(D | h) P(h)

  • Since P(D) is constant with respect to h, we don’t have to account for it

27 of 67

Maximum Likelihood Estimation(MLE)

  • Let’s say our distribution is uniform, so P(h) is constant: h_ML = argmax_h P(D | h)
  • Solving this equation is called maximum likelihood estimation (MLE)

28 of 67

Review

P(D) = 0.3,  P(D | h1) = 0.95,  P(D | h2) = 0.1,  P(h1) = 0.2,  P(h2) = 0.8

P(D | h1) P(h1) = (0.95)(0.2) = 0.19

P(D | h2) P(h2) = (0.1)(0.8) = 0.08

Since 0.19 > 0.08, we predict that hypothesis 1 (Star Wars fan) is correct! (The same comparison is worked out in the short snippet below.)
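A minimal Python check of the review above, using exactly the probabilities collected earlier (P(D) is dropped, as MAP allows):

```python
# MAP: compare P(D | h) * P(h) for the two hypotheses (P(D) can be dropped).
p_D_given_h1, p_h1 = 0.95, 0.2  # h1: Star Wars fan
p_D_given_h2, p_h2 = 0.10, 0.8  # h2: not a Star Wars fan

scores = {
    "h1 (fan)": p_D_given_h1 * p_h1,        # 0.19
    "h2 (not a fan)": p_D_given_h2 * p_h2,  # 0.08
}

h_map = max(scores, key=scores.get)
print(scores, "->", h_map)  # h1 wins, so we predict Grogu is a Star Wars fan
```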

29 of 67

Rate your understanding! MAP and MLE

How much do you understand Maximum a Posteriori Hypothesis and Maximum Likelihood Estimation?

1 2 3 4 5 6 7 8 9 10

If someone mentioned MLE or MAP I wouldn’t know what they were talking about

I know what MAP and MLE are but I don’t know how to do problems with them

I know what MAP and MLE are used for and I can solve for them

30 of 67

*Short* Recap of ML Classification

31 of 67

Training Data

[Training images with labels: Frog, Frog, Rabbit, Frog, Rabbit, Rabbit]

Inputs (X) and Labels (Y): given

ML Classification Model (Logistic Regression): using the data, the model learns the relationship between Inputs & Labels

Model Predictions on Inputs (X): Rabbit, Rabbit ✗, Frog

Then we use a loss function to measure the performance of the model, and use Gradient Descent to improve our model based on the loss function

32 of 67

Steps for Machine Learning

  1. Start with data: inputs (X) and labels (Y)

  2. Have the model learn the relationship/mapping of inputs and labels (Gradient Descent)

  3. Continuously evaluate the model’s loss, and repeat step 2 to improve based on the loss function

33 of 67

Rate your understanding! ML Framework

How much do you understand Machine Learning Framework?

1 2 3 4 5 6 7 8 9 10

I’m not sure what’s going on

I get an idea of what it is, but still have questions

I completely get the gist of it

34 of 67

Binary Cross-Entropy (BCE) Loss

  • We can use MLE to derive the formula for BCE loss used in logistic regression!
  • Let’s think back to the green and red marbles!

35 of 67

Some terms to recall

  • MLE
          • Maximum Likelihood Estimation
          • Used to find the best hypothesis given some data
          • Finds the probability (or set of probabilities) that maximizes the likelihood of the observed data
  • BCE
          • Binary Cross-Entropy Loss
          • Used to calculate loss for logistic regression (classification)
  • Entropy
          • The uncertainty of a distribution (e.g., a bag of marbles)

36 of 67

Entropy

We define entropy as the measure of uncertainty (or uniqueness) associated with a given distribution.

[Figure: three example distributions labeled, in order: High entropy, Low entropy, High entropy]
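A short sketch of how entropy could be computed for a bag of marbles, using the standard Shannon entropy H = −Σ p·log₂(p); the two example bags below are illustrative, not taken from the slides:

```python
import math

def entropy(probabilities):
    """Shannon entropy H = -sum(p * log2(p)) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# A bag with an even mix of colors is very uncertain -> high entropy
print(entropy([1/3, 1/3, 1/3]))   # ~1.585 bits

# A bag that is almost all one color is very predictable -> low entropy
print(entropy([0.9, 0.05, 0.05]))  # ~0.569 bits
```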

37 of 67

A group of items that are quite dissimilar has HIGH entropy; a group with lots of similarity has low entropy.

38 of 67

Rate your understanding! Entropy

How much do you understand Entropy?

1 2 3 4 5 6 7 8 9 10

I’m not sure what’s going on

I get an idea of what it is, but still have questions

I completely get the gist of it

39 of 67

Entropy

For classification problems, we can use entropy to measure our model’s loss

Recall: in classification models, we draw a line to separate groups

40 of 67

Goal for Machine Learning (Classification)

  1. Start with data: inputs (X) and labels (Y)

  2. The classification model (e.g. Logistic Regression) draws a boundary line to separate the data based on the labels

  3. Have the model learn the relationship/mapping of inputs and labels (Gradient Descent)

  4. Evaluate the line based on the loss function, then repeat step 2 and redraw the line until the loss is low

[Plot: blue points (Y), orange points (Y), and a boundary line evaluated with Binary Cross-Entropy Loss]

41 of 67

Binary Cross Entropy (BCE)

BCE: tells us how good our model (separating line) is

BCE measures the amount of dissimilarity (uncertainty)

42 of 67

Rate your understanding! Classification and BCE

How much do you understand Classification and Binary Cross Entropy (BCE) Loss?

1 2 3 4 5 6 7 8 9 10

I’m not sure what’s going on

I get an idea of what they are, but still have questions

I completely get the gist of it

43 of 67

Binary Cross-Entropy Loss

Cross-entropy is a measure of the difference between two probability distributions.

In a binary classification problem, we can represent the cross-entropy loss as:

L(a, y) = −[ y log(a) + (1 − y) log(1 − a) ]

* Where a is our model’s output, and y is the true value.
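A minimal Python version of that formula for a single prediction (a sketch of the math above, not any library's implementation):

```python
import math

def bce_loss(a: float, y: int) -> float:
    """Binary cross-entropy for one example: -[y*log(a) + (1-y)*log(1-a)]."""
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

print(bce_loss(0.9, 1))  # confident and correct -> small loss (~0.105)
print(bce_loss(0.9, 0))  # confident and wrong   -> large loss (~2.303)
```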

44 of 67

Binary Cross-Entropy Loss

* Where a is our model’s output, and y is the true value (label).

45 of 67

Binary Cross Entropy (BCE)

BCE measures the amount of dissimilarity (uncertainty)

Cross-entropy is a measure of the difference between two probability distributions.

46 of 67

How do MLE and BCE tie together?

-- Maximum Likelihood Estimation

-- Binary Cross Entropy [Loss]

47 of 67

Binary Cross-Entropy Loss

  • Consider a cat image classifier; what is our input and output?
  • Our training examples are pairs (x, y), where y is either 1 or 0 (1 represents a cat, 0 represents not a cat)
48 of 67

Binary Cross-Entropy Loss

  • We assume all our training examples are independent from each other
    • When we say P(x,y) we mean the probability of the ordered pair existing in our dataset
    • Recall: if A and B are independent, then P(A, B) = P(A and B) = P(A) P(B)

P(x, y) means the probability of this ordered pair existing in our training examples

49 of 67

Binary Cross-Entropy Loss

  • Hmmm… how can we break down P(x, y)?
  • From conditional probability, we have two options:

    Option 1: P(x, y) = P(y | x) P(x)
    Option 2: P(x, y) = P(x | y) P(y)

y: the event that we know we are looking at an image of a cat or not a cat (the label)

x: the event that a certain image exists
50 of 67

Binary Cross-Entropy Loss

  • Hmmm… how can we break down P(x, y)?
  • From conditional probability, we have two options:

    Option 1: P(x, y) = P(y | x) P(x)
    Option 2: P(x, y) = P(x | y) P(y)

For MLE we choose option 1, not option 2

51 of 67

Binary Cross-Entropy Loss

  • Raise your hand if you think we should choose Option 1: P(x, y) = P(y | x) P(x), rather than Option 2: P(x, y) = P(x | y) P(y)

52 of 67

Binary Cross-Entropy Loss

  • Why not Option 2, P(x, y) = P(x | y) P(y)?

  • P(x | y) translates to: what is the probability of this image existing, given that we know we’re looking at an image of a cat?

53 of 67

Binary Cross-Entropy Loss

  • Computing the probability of an image existing given the label is unnatural and nearly impossible to do
    • Images vary a lot; images that differ by even 1 pixel are different
    • Can you think of any other reasons why computing P(x | y) is absurd?

    • We actually do know the probability distribution of the labels :)

54 of 67

Binary Cross-Entropy Loss

  • Why is Option 1, P(x, y) = P(y | x) P(x), better?

  • Notice that P(y | x) is exactly what our logistic regression model outputs (after the sigmoid activation, before we threshold)!

    P(y = 1 | x) = a      P(y = 0 | x) = 1 − a

55 of 67

Binary Cross-Entropy Loss

  • Wait, what?? Let’s break down the formula for P(y | x)

  • For simplicity, we assume all our training examples are identically distributed; this is just a fancy way of saying they all have the same distribution

    P(y = 1 | x) = a      (image is a cat)

    P(y = 0 | x) = 1 − a  (image is not a cat)
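As an aside (a standard step, not shown explicitly here): the two cases above can be written as one compact formula,

```latex
P(y \mid x) = a^{\,y}\,(1 - a)^{\,1 - y}
% plugging in y = 1 gives a, and y = 0 gives 1 - a,
% recovering exactly the two cases above
```

This compact form is what turns the product over training examples into the familiar log terms later in the derivation.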

56 of 67

Binary Cross-Entropy Loss

  • We want to train our model to maximize the probability of seeing all our training data

57 of 67

Binary Cross-Entropy Loss

  • We want to train our model to maximize the probability of seeing all our training data
  • Just like in Maximum Likelihood Estimation (MLE), we can get rid of the constant C (the product of the P(x^(i)) terms, written out below) because it is positive and it doesn’t change the model which maximizes the probability
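Sketching that step out under the independence assumption from earlier, with N training examples and the option-1 factorization:

```latex
\prod_{i=1}^{N} P\bigl(x^{(i)}, y^{(i)}\bigr)
  = \prod_{i=1}^{N} P\bigl(y^{(i)} \mid x^{(i)}\bigr)\, P\bigl(x^{(i)}\bigr)
  = \underbrace{\prod_{i=1}^{N} P\bigl(x^{(i)}\bigr)}_{\text{constant } C}
    \;\prod_{i=1}^{N} P\bigl(y^{(i)} \mid x^{(i)}\bigr)
```

The bracketed product of the P(x^(i)) terms is the positive constant C, since it doesn’t depend on the model we are training.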

58 of 67

Binary Cross-Entropy Loss

  • Wait, so what are we doing here? We are simply finding a mathematical formula that maximizes the probability of our model outputting the correct labels for our data points
  • Just like in Maximum Likelihood Estimation (MLE), we can get rid of the constant C because it is positive and it doesn’t change the model which maximizes the probability

(using binary cross entropy loss)

59 of 67

Binary Cross-Entropy Loss

  • Just like in Maximum Likelihood Estimation (MLE), we can get rid of the constant C because it is positive and it doesn’t change the model which maximizes the probability

60 of 67

Binary Cross-Entropy Loss

  • Ew, products are kinda gross… sums are much nicer

  • But how do we get to sums? (hint: log(AB) = log A + log B)

61 of 67

Binary Cross-Entropy Loss

  • But, why is taking the log valid? log is monotonic (i.e. strictly increasing), so it doesn’t change the model which maximizes the probability
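Concretely, taking the log (valid because log is monotonic) and using the compact form P(y | x) = a^y (1 − a)^(1−y) from earlier turns the product into a sum:

```latex
\log \prod_{i=1}^{N} P\bigl(y^{(i)} \mid x^{(i)}\bigr)
  = \sum_{i=1}^{N} \log P\bigl(y^{(i)} \mid x^{(i)}\bigr)
  = \sum_{i=1}^{N} \Bigl[\, y^{(i)} \log a^{(i)} + \bigl(1 - y^{(i)}\bigr)\log\bigl(1 - a^{(i)}\bigr) \Bigr]
```

Here a^(i) is the model’s output (the sigmoid activation) for input x^(i).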

62 of 67

Binary Cross-Entropy Loss

  • Oh my god… we’re finally (almost) done
  • We’re not huge fans of maximization though... we invented gradient descent to minimize functions

63 of 67

Binary Cross-Entropy Loss

  • We can use multiplication by -1 to turn the maximization problem into a minimization problem
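Putting the pieces together, the quantity we minimize is exactly the Binary Cross-Entropy Loss from earlier, summed over the training examples (in practice it is often also averaged over N, which again doesn’t change the minimizer):

```latex
\min \; -\sum_{i=1}^{N} \Bigl[\, y^{(i)} \log a^{(i)} + \bigl(1 - y^{(i)}\bigr)\log\bigl(1 - a^{(i)}\bigr) \Bigr]
```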

64 of 67

Binary Cross-Entropy Loss

  • So in summary…
    • We aim to maximize the probability of observing our training data, using Maximum Likelihood Estimation
    • That maximization turns into a minimization (using mathematical magic, but not really)
  • And that’s why we minimize Binary Cross-Entropy Loss to train our logistic regression models! (A small end-to-end sketch follows below.)
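As a capstone, here is a small, hedged end-to-end sketch (NumPy only, with a toy 1-D dataset invented for illustration) of training a logistic regression model by minimizing BCE loss with gradient descent; it mirrors the pipeline in these slides rather than any particular library's API:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D dataset: inputs X and binary labels Y (invented for illustration)
X = np.concatenate([rng.normal(-2, 1, 50), rng.normal(2, 1, 50)])
Y = np.concatenate([np.zeros(50), np.ones(50)])

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w, b = 0.0, 0.0  # model parameters
lr = 0.1         # learning rate

for step in range(1000):
    a = sigmoid(w * X + b)  # model outputs, interpreted as P(y=1 | x)

    # Binary cross-entropy loss, averaged over the dataset
    # (tracked for monitoring; the updates below use the analytic gradients)
    eps = 1e-12  # avoid log(0)
    loss = -np.mean(Y * np.log(a + eps) + (1 - Y) * np.log(1 - a + eps))

    # Standard logistic regression gradients of the averaged BCE loss
    grad_w = np.mean((a - Y) * X)
    grad_b = np.mean(a - Y)

    # Gradient descent update
    w -= lr * grad_w
    b -= lr * grad_b

print(f"final loss: {loss:.3f}, w: {w:.2f}, b: {b:.2f}")
print("accuracy:", np.mean((sigmoid(w * X + b) > 0.5) == Y))
```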

65 of 67

Binary Cross-Entropy Loss

  • Some benefits of using BCE loss to train logistic regression models:
    • Empirically better than Mean Squared Error (MSE) loss
    • Easier to optimize than Mean Squared Error (MSE) loss
    • The BCE loss function is convex (for logistic regression), so there exists a global minimum

66 of 67

Rate your understanding! Mathematical Derivation of BCE

How much do you understand the mathematical derivation of Binary Cross Entropy (BCE) Loss?

1 2 3 4 5 6 7 8 9 10

If someone mentioned BCE I wouldn’t know what they were talking about

I know what BCE is used for, but the math just shown confuses me

I know what BCE is, and I got the gist of what the math meant

67 of 67

Thanks!

Fill out our Feedback Form: https://tinyurl.com/aimlworld


Check out our Github

Read our Blog

ACM AI