1 of 26

Basic Probability and Distributions (Part 1)

CMSC 320 - Introduction to Data Science

Fardina Alam

2 of 26

Topics we will cover:

Chapter 6 in: https://ffalam.github.io/CMSC320TextBook/

Probability Theory:

    • Basic concepts (events, sample space, probability axioms)
    • Conditional probability
    • Bayes' theorem
    • Law of Total Probability
    • Expected Value

Probability Distribution

    • Probability distributions (discrete and continuous)
    • Common distributions (e.g., normal, binomial, Poisson)

The Central Limit Theorem (CLT)


3 of 26

Why Probability Matters in Data Science

A method for decision making in the presence of uncertainty.

Probability is a mindset, not just math

  • Data is uncertain
  • We reason about likelihoods, not certainties
  • Every model makes probabilistic assumptions
  • Probability supports decision-making under uncertainty

In Data Science, we often make predictions amidst uncertainty.

Probability helps us make clear decisions under uncertainty.

Basic probability formula: The probability P(A) of an event A happening:

P(A) = (number of favorable outcomes) / (total number of possible outcomes)

Classic Example: What is the chance of rolling a one on a fair die?

There are six equally likely outcomes (the sample space), so P(rolling a 1) = 1/6.

Probabilities are always between 0 and 1.
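As a quick sanity check, a minimal Python simulation (function name and trial count are illustrative) estimates this probability empirically:

```python
import random

def estimate_p_one(trials=100_000):
    """Estimate P(rolling a 1) on a fair six-sided die by simulation."""
    hits = sum(1 for _ in range(trials) if random.randint(1, 6) == 1)
    return hits / trials

print(estimate_p_one())  # ≈ 0.1667, i.e., close to 1/6
```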

4 of 26

Probability Theory and Data Science


Applications of Probability in Data Science

  • Predictive Modeling: Uses probability to forecast future events (e.g., predicting customer churn).
  • Machine Learning: Many algorithms rely on probabilistic models (e.g., Naive Bayes, Bayesian networks).
  • Statistical Inference: Helps make data-driven decisions by estimating parameters and testing hypotheses (e.g., A/B testing).
  • Practical Examples:
    • Dice Rolls: Modeling the probability of different outcomes (e.g., rolling a die).
    • Coin Flips: Understanding the chances of heads or tails (e.g., flipping a coin).

5 of 26

Key Concepts in Probability

  • Sample space (Ω): all possible outcomes
  • Event (A): outcomes of interest
  • Probabilities are assigned to events

Example: Coin flip → Ω = {H, T}, Event = Heads

From Events to Data Science

  • Events: what can happen (buy / not buy)
  • Random variables: numerical outcomes
  • Distributions: behavior over many observations

Focus: patterns, not single outcomes

Probability Distributions: Functions that describe the likelihood of different outcomes; they help in understanding the spread of data.

6 of 26

Example: Random Variable and Probability Distributions

A random variable represents the outcomes of a random event.


E.g., imagine rolling two dice.

Let X = sum of two dice

Possible values: 2–12

Each value has a probability

P(X=2)= 1/36

P(X=3)= 2/36 (rolling a 1 and a 2 or rolling a 2 and a 1)

P(X=4)= 3/36

P(X=5)= 4/36

P(X=6)= 5/36

P(X=7)= 6/36

P(X=8)= 5/36

P(X=9)= 4/36

P(X=10)= 3/36

P(X=11)= 2/36

P(X=12)= 1/36

Probability Distributions: Describe the likelihood of different outcomes occurring.
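The whole table above can be derived by enumerating the 36 equally likely (die 1, die 2) pairs; a minimal Python sketch:

```python
from collections import Counter
from itertools import product

# Enumerate all 36 equally likely outcomes of rolling two dice.
counts = Counter(d1 + d2 for d1, d2 in product(range(1, 7), repeat=2))

for total in range(2, 13):
    print(f"P(X={total}) = {counts[total]}/36")
```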

7 of 26

Conditional Probability

Probability of A given B: P(A | B) = P(A ∩ B) / P(B)

  • Information changes probabilities

Idea: “What is the chance of A once B is known?”

Remember: Most real-world data is conditional.

Example: A bag has 5 red and 5 blue marbles. Among the red marbles, 3 are shiny.

What is the probability of picking a shiny marble given that it is red?

  • A = the marble is shiny; B = the marble is red
  • P(A ∩ B) = probability of picking a shiny red marble = 3/10
  • P(B) = probability of picking a red marble = 5/10

P(A | B) = P(A ∩ B) / P(B) = (3/10) / (5/10) = 3/5 = 0.6 = 60%
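A tiny simulation (a sketch; the bag contents mirror the example above) confirms the 60% figure:

```python
import random

# Bag: 5 red marbles (3 shiny, 2 dull) and 5 blue marbles
# (blue finish is irrelevant here, assumed dull).
bag = [("red", "shiny")] * 3 + [("red", "dull")] * 2 + [("blue", "dull")] * 5

red_draws = shiny_red_draws = 0
for _ in range(100_000):
    color, finish = random.choice(bag)
    if color == "red":
        red_draws += 1
        shiny_red_draws += finish == "shiny"

print(shiny_red_draws / red_draws)  # ≈ 0.6 = P(shiny | red)
```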

8 of 26

Bayes' Rule

Bayes' Rule is used to update probabilities when new information is available:

P(A | B) = P(B | A) · P(A) / P(B)

  • P(A | B): Probability of event A given event B (posterior probability).
  • P(B | A): Probability of event B given event A (likelihood).
  • P(A): Prior probability of event A (prior belief).
  • P(B): Probability of event B (evidence or normalizing constant).

Example:

In spam detection:

  • A = Email is spam
  • B = Email contains the word "discount"

Bayes' Rule updates our belief on whether the email is spam based on the word "discount."
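A minimal numeric sketch of that update (all probabilities below are invented for illustration, chosen to be mutually consistent):

```python
def bayes(p_b_given_a, p_a, p_b):
    """Posterior P(A | B) = P(B | A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Hypothetical numbers: 20% of all email is spam, 60% of spam contains
# "discount", and 15% of all email contains "discount".
posterior = bayes(p_b_given_a=0.6, p_a=0.2, p_b=0.15)
print(posterior)  # 0.8 -> seeing "discount" raises P(spam) from 0.2 to 0.8
```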

9 of 26

Example: Picnic Day


Scenario: You want to find the chance of rain during the day given that the morning is cloudy. You have the following probabilities:

  1. Probability of Rain (P(Rain)) = 10% (because only 3 out of 30 days are rainy).
  2. Probability of Cloud, given that Rain happens (P(Cloud | Rain)) = 50% (because 50% of rainy days start off cloudy).
  3. Probability of a Cloudy morning (P(Cloud)) = 40% (because 40% of all days start off cloudy).

What is the chance of rain during the day?

Now, use Bayes' Theorem:

P(Rain | Cloud) = P(Cloud | Rain) · P(Rain) / P(Cloud) = (0.5 × 0.1) / 0.4 = 0.125

Or a 12.5% chance of rain. Not too bad, let's have a picnic!

10 of 26

Law of Total Probability

Sometimes events can happen in multiple ways.

  • Break a complex probability into simpler cases
  • Add probabilities across all possible scenarios

“The Law of Total Probability” computes the probability of an event by considering all possible scenarios that cover it:

P(A) = P(A | B₁)·P(B₁) + P(A | B₂)·P(B₂) + … + P(A | Bₙ)·P(Bₙ)

where B₁, B₂, …, Bₙ are mutually exclusive and exhaustive events.

Takeaway: Helps compute probabilities when causes are hidden.

Example (Dice):

  • Choose Die 1 (70%) → P(6) = 1/6
  • Choose Die 2 (30%) → P(6) = 1/2

P(6) = (1/6)(0.7) + (1/2)(0.3) ≈ 0.117 + 0.150 = 0.267

** Mutually exclusive & exhaustive: Events that don’t overlap and cover all possibilities.
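A short simulation of this mixed-dice scenario (a sketch; only P(6) matters for Die 2, so its other faces are collapsed into one) recovers the same value:

```python
import random

def roll():
    """Pick Die 1 (fair) with probability 0.7, else Die 2 with P(6) = 1/2."""
    if random.random() < 0.7:
        return random.randint(1, 6)           # Die 1: fair, P(6) = 1/6
    return 6 if random.random() < 0.5 else 1  # Die 2: P(6) = 1/2

trials = 100_000
print(sum(roll() == 6 for _ in range(trials)) / trials)  # ≈ 0.267
```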

11 of 26

Example

Example (Two Bags of Marbles):

  • Bag 1: 6 white, 4 black → P(Black∣B1)=4/10
  • Bag 2: 3 white, 7 black → P(Black∣B2)=7/10

Case 1: Equal Bag Selection

Now suppose I put the two bags in a box. If I close my eyes, grab a bag from the box, and then grab a marble from that bag, what is the probability that it is black?

SOLUTION: P(B1) = 1/2, P(B2) = 1/2

P(Black) = P(Black | B1)·P(B1) + P(Black | B2)·P(B2) = (1/2)(4/10) + (1/2)(7/10) = 11/20 = 0.55

Case 2: Unequal Bag Selection (Bag 1 twice as likely)

Now suppose the first bag is much larger than the second, so that when I reach into the box I am twice as likely to grab the first bag as the second. What is the probability of grabbing a black marble?

SOLUTION: P(B1) = 2/3, P(B2) = 1/3

P(Black) = (2/3)(4/10) + (1/3)(7/10) = 8/30 + 7/30 = 15/30 = 1/2
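Both cases reduce to the same formula; a minimal helper (function name is illustrative):

```python
def total_probability(conditionals, priors):
    """P(A) = sum over scenarios Bi of P(A | Bi) * P(Bi)."""
    return sum(c * p for c, p in zip(conditionals, priors))

p_black_given_bag = [4 / 10, 7 / 10]  # Bag 1, Bag 2

print(total_probability(p_black_given_bag, [1 / 2, 1 / 2]))  # 0.55 (Case 1)
print(total_probability(p_black_given_bag, [2 / 3, 1 / 3]))  # 0.5  (Case 2)
```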

12 of 26

Independence

Two events A and B are independent if knowing B does not change the probability of A.

  • P(A | B) = P(A), or equivalently P(A ∩ B) = P(A) · P(B)

Examples:

  • Outcomes of two coin tosses
  • Passing a class ⟂ drinking Coke before the exam
  • Relationship success ⟂ zodiac sign
  • Drawing a red card ⟂ not sleeping well

13 of 26

Conditional Independence

Two variables A and B are conditionally independent given C if knowing B adds no new information about A once C is known.

  • P(A | B, C) = P(A | C)

Equivalently:

  • P(A ∩ B | C) = P(A | C) · P(B | C)

Example: A = has a cough, B = has a fever, C = has a cold

Idea: Once we know the person has a cold, cough and fever behave independently.

Why It Matters

  • Simplifies models (e.g., Naive Bayes)
  • Removes unnecessary dependencies
  • Faster, more efficient modeling

Three probabilities:

P(Rain), P(Dog Barks), P(Cat Runs)

  • P(Dog Barks | Rain) > P(Dog Barks)
  • P(Cat Runs | Dog Barks) > P(Cat Runs)

Rain and Cat Runs are not independent.

[Diagram: causal chain Rain → Dog Barks → Cat Runs]

What if you already know the dog is barking 🐶?

  • Suppose you observe rain
  • Will P(Cat Runs) increase?
  • Why or why not?

14 of 26

Conditional Independence Example

What if you already know the dog is barking?

  • Suppose you observe rain.
  • Will P(Cat Runs) increase? No!
  • Why? P(Dog Barks)=1 → No longer influenced by Rain!

Rain and Cat are thus conditionally independent given that Dog Barks

  • P(Cat Runs | Rain, Dog Barks) = P(Cat Runs | Dog Barks)

[Diagram: Rain → Dog Barks → Cat Runs; once Dog Barks is known, Cat Runs does not depend on Rain]

Consider three variables A, B, and C. If the distribution of A given B and C does not depend on the value of B, then

P(A | B ∩ C) = P(A | C) → no B! (A is conditionally independent of B given C)
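A quick numeric check of this chain on a made-up joint distribution (all probabilities below are invented for illustration; the only structural assumption is that Cat Runs depends on Rain only through Dog Barks):

```python
from itertools import product

# Hypothetical chain: Rain -> Dog Barks -> Cat Runs.
p_rain = 0.3
p_bark_given_rain = {True: 0.9, False: 0.4}  # P(Bark | Rain)
p_run_given_bark = {True: 0.8, False: 0.1}   # P(Run | Bark); no direct Rain effect

# Build the full joint distribution P(rain, bark, run).
joint = {}
for rain, bark, run in product([True, False], repeat=3):
    p = p_rain if rain else 1 - p_rain
    p *= p_bark_given_rain[rain] if bark else 1 - p_bark_given_rain[rain]
    p *= p_run_given_bark[bark] if run else 1 - p_run_given_bark[bark]
    joint[(rain, bark, run)] = p

def p_run_given(rain, bark):
    """P(Cat Runs | Rain = rain, Dog Barks = bark), computed from the joint."""
    denom = joint[(rain, bark, True)] + joint[(rain, bark, False)]
    return joint[(rain, bark, True)] / denom

# Once Bark is known, Rain adds no information:
print(p_run_given(rain=True, bark=True))   # 0.8
print(p_run_given(rain=False, bark=True))  # 0.8 — identical
```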

15 of 26

Expected Value (Average Behavior)

Expected value summarizes long-run behavior: E[X] = Σ x · P(X = x).

  • Not a guaranteed outcome
  • The average result over many repetitions
  • Can be positive or negative

In Data Science:

  • Helps predict average outcomes like revenue, user clicks, or risk.

Remember: Expected value drives loss, reward, and optimization.

16 of 26

Someone offers you a spot on a game show. On this game show, there is:

  • A 5% chance you will be given a million dollars
  • A 95% chance they will hit you with sticks, causing $10,000 worth of medical bills (the hits are not particularly bad, your insurance just doesn’t cover check-ups)

Your feelings about being hit with sticks aside, should you go on the game show?

Expected value of the game show:

(1,000,000 × 0.05) + (−10,000 × 0.95) = 50,000 − 9,500 = 40,500

Net positive!


Expected Value: Example

You are playing a game where you spin a wheel. The wheel has the following sets of rewards:

  • 20% chance you win nothing
  • 30% chance you win a ten dollar gift card
  • 40% chance you win a twenty dollar gift card
  • 10% chance you win a ten dollar gift card

On average, you can expect to win $12 per spin over a large number of spins.

EV = (0.2 × 0) + (0.3 × 10) + (0.4 × 20) + (0.1 × 10) = 0 + 3 + 8 + 1 = 12
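Both examples with one small helper (a sketch; outcomes and probabilities are taken from the two examples above):

```python
def expected_value(outcomes, probs):
    """E[X] = sum of x * P(x) over all outcomes x."""
    return sum(x * p for x, p in zip(outcomes, probs))

# Game show: $1,000,000 with probability 0.05, -$10,000 with probability 0.95.
print(expected_value([1_000_000, -10_000], [0.05, 0.95]))  # 40500.0

# Wheel: $0, $10, $20, $10 with probabilities 0.2, 0.3, 0.4, 0.1.
print(expected_value([0, 10, 20, 10], [0.2, 0.3, 0.4, 0.1]))  # 12.0
```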

17 of 26

Probability Distributions and their types

Distributions describe how values are spread.

  • Discrete: countable values (0, 1, 2, …)
    1. Bernoulli Distribution
    2. Binomial Distribution
    3. Poisson Distribution
    4. Zero-Inflated Poisson Distribution
  • Continuous: smooth ranges of values
    • Uniform Distribution
    • Normal ( Gaussian) Distribution
    • Many more..

Shape tells us about the data-generating process.

Remember: Distribution choice comes before modeling.

Understanding how your data is distributed can tell you a lot about the process generating the data.

  • The nature of your distribution affects which statistical tools you use

18 of 26

1(a,b). Bernoulli and Binomial Distribution

Binary outcome: Success (1) or Failure (0)

Bernoulli Distribution

  • One trial
  • Success probability = p

P(X = 1) = p

P(X = 0) = 1 − p

Example: one coin flip, spam vs not spam

Binomial Distribution: Number of successes (k) in n independent Bernoulli trials.

X ~ Binomial(n, p)

Example: heads in 10 flips, correct answers on a quiz

Key idea: Binomial = sum of independent Bernoulli trials

Example: Tossing a coin 10 times:

  • n = 10, p = 0.5
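A minimal sketch of the 10-flip example (assuming SciPy is available):

```python
from scipy.stats import binom

n, p = 10, 0.5  # 10 independent flips of a fair coin

print(binom.pmf(5, n, p))  # P(exactly 5 heads) ≈ 0.246
print(binom.cdf(3, n, p))  # P(at most 3 heads)  ≈ 0.172
```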

19 of 26

1c. Poisson Distribution

What it models: The number of times an event occurs in a fixed time or space, when events happen randomly at a constant average rate.

Assumptions

  • Events occur independently
  • Average rate λ is constant
  • Two events cannot occur at the exact same instant

P(X = k) = (λᵏ · e^(−λ)) / k!

where:

  • P(X = k) = the probability of observing exactly k (0, 1, 2, 3, …) events
  • λ = the average number of events per interval (the rate parameter)
  • e = Euler's number ≈ 2.71828
  • k! = the factorial of k

We say X follows a Poisson distribution with parameter λ: X ~ Poisson(λ).

Example

Emails arrive at an average rate of 3 per hour. What is the probability of receiving exactly 2 emails in one hour?

  • λ (lambda) = 3 = average number of events in the given time interval
  • Time interval = 1 hour
  • k = 2 (exactly 2 emails)

Shape: Skewed right for small λ, becomes more symmetric as λ increases.
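Working the email example in code (a sketch, assuming SciPy is available):

```python
from scipy.stats import poisson

lam = 3  # average of 3 emails per hour

# P(X = 2) = (3**2 * e**-3) / 2!
print(poisson.pmf(2, lam))  # ≈ 0.224
```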

20 of 26

Zero-Inflated Poisson (ZIP) Distribution

Used for count data with more zeros than expected.

Components

  • Poisson component: models the count process
  • Inflation component: models excess zeros (e.g., Bernoulli)

Why ZIP?

  • Standard Poisson underestimates zeros
  • ZIP captures structural zeros + counts

Examples

  • Insurance claims
  • Hospital visits
  • Product defects

Oftentimes, count data that otherwise looks Poisson has a spike at zero.
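A minimal ZIP sampling sketch (parameter names pi and lam are illustrative, assuming NumPy is available): with probability pi we emit a structural zero, otherwise we draw from a Poisson.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_zip(pi, lam, size):
    """Zero-Inflated Poisson: structural zero with prob pi, else Poisson(lam)."""
    structural_zero = rng.random(size) < pi
    counts = rng.poisson(lam, size)
    return np.where(structural_zero, 0, counts)

samples = sample_zip(pi=0.3, lam=4, size=100_000)
print((samples == 0).mean())  # ≈ 0.313 = 0.3 + 0.7 * e**-4 (the zero spike)
print(np.exp(-4.0))           # ≈ 0.018, the zero rate a plain Poisson(4) predicts
```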

21 of 26

2a. The Uniform Distribution

Uniform Distribution

  • All outcomes are equally likely
  • Values occur within a fixed interval

Example: Rolling a fair six-sided die


22 of 26

2b. Normal (Gaussian) Distribution

Bell-shaped and symmetric

  • Defined by mean (μ) and standard deviation (σ)
  • Mean = Median = Mode = μ
  • Rule: 68%–95%–99.7% of values fall within 1σ, 2σ, 3σ of the mean

Data is symmetrically distributed with no skew

It is symmetric around its mean and is defined by two parameters: the mean (μ) and the standard deviation (σ).

Example: Heights of people, measurement errors.

Averages are common, extremes are rare.
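The 68%–95%–99.7% rule can be checked against the exact normal CDF (a sketch, assuming SciPy is available):

```python
from scipy.stats import norm

mu, sigma = 0, 1  # standard normal; any mu, sigma gives the same fractions
for k in (1, 2, 3):
    inside = norm.cdf(mu + k * sigma, mu, sigma) - norm.cdf(mu - k * sigma, mu, sigma)
    print(f"within {k} sigma: {inside:.4f}")
# within 1 sigma: 0.6827
# within 2 sigma: 0.9545
# within 3 sigma: 0.9973
```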

23 of 26

Parameters of the Normal Distribution: μ & σ

Mean (μ):

  • Shifts the curve left or right
  • Controls the center of the distribution�

Standard Deviation (σ):

  • Controls the spread of the curve
  • Smaller σ → narrower, taller
  • Larger σ → wider, shorter

24 of 26

Problem Solving: Finding percentages

Example 1: The time taken to travel between two regional cities is approximately normally distributed with a mean of 70 minutes and a standard deviation of 2 minutes.

Q: What is the percentage of travel times that are between 66 minutes and 72 minutes?

[Figure: normal curve centered at 70, tick marks at 64, 66, 68, 70, 72, 74, 76 on the time (min) axis]

  • Mean = 70, σ = 2
  • 66 minutes is 2σ below the mean; 72 minutes is 1σ above the mean

% = 13.5 + 34 + 34 = 81.5% (13.5% from 66 to 68, 34% from 68 to 70, and 34% from 70 to 72)
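The 81.5% comes from the empirical rule; the exact normal CDF gives almost the same answer (a sketch, assuming SciPy is available):

```python
from scipy.stats import norm

mu, sigma = 70, 2
p = norm.cdf(72, mu, sigma) - norm.cdf(66, mu, sigma)
print(f"{p:.4f}")  # 0.8186 — close to the 81.5% rule-of-thumb answer
```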

TRY YOURSELF: The volume of a cup of soup served by a machine is normally distributed with a mean of 240 mL and a standard deviation of 5 mL. A fast-food store uses this machine to serve 160 cups of soup.

Q: How many of these cups of soup are expected to contain less than 230 mL?

25 of 26

Standard Normal (Z) Distribution

A special normal distribution with:

  • Mean μ = 0
  • Standard deviation σ = 1

Used to compare values from different distributions by converting them to z-scores.

Z-Scores: Any normal distribution can be transformed into a standard normal distribution using z = (x − μ) / σ.

Interpretation

  • Z = 0 → at the mean
  • Z > 0 → above the mean
  • Z < 0 → below the mean

Example: Given mean μ = 100, standard deviation σ = 15, and a student score X = 115:

Z = (115 − 100) / 15 = 1

Meaning: The score is 1 standard deviation above the mean, indicating better-than-average performance.

26 of 26

The Central Limit Theorem (CLT):

For a sufficiently large sample size n, the distribution of the sample mean is approximately normal, regardless of the population’s original distribution.

If we sample a distribution a bunch of times, the set of sample means is normally distributed.

Key ideas: Applies to sample means, not individual values

  • Larger n → better normal approximation
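A minimal simulation sketch of the CLT (the population and sizes are illustrative): draw many samples from a heavily skewed exponential population and look at the distribution of their means.

```python
import numpy as np

rng = np.random.default_rng(0)

n, num_samples = 50, 10_000  # sample size, number of repeated samples

# Population: exponential with mean 1 (right-skewed, decidedly non-normal).
sample_means = rng.exponential(scale=1.0, size=(num_samples, n)).mean(axis=1)

# CLT: the sample means cluster normally around the population mean (1.0)
# with spread ≈ population std / sqrt(n) = 1 / sqrt(50) ≈ 0.141.
print(sample_means.mean())  # ≈ 1.0
print(sample_means.std())   # ≈ 0.141
```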