Intro to Statistics for Data Science/Machine Learning
Waterloo Data Science Club
Sept 26, 2019
Frieda Rong
What is Statistics?
Statistics in data science
A/B testing
Missing data imputation
Model selection
Interpretation
Prediction
Original source: https://twitter.com/hadleywickham/status/784780387180425217
Overview
Probability
Basics
a discrete/categorical random variable X takes the value x with some probability P(X=x)
Examples
probability mass function (pmf, discrete), probability density function (pdf, continuous), cumulative distribution function P(X <= x)
for a continuous random variable, probabilities come from integrating the density f(x) over an interval rather than from P(X = x) directly
“the event X = x occurs”
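As a quick illustration (a minimal sketch using scipy.stats; the distributions chosen below are just examples, not from the slides):

from scipy import stats

# Discrete example: number of heads in 10 fair coin flips ~ Binomial(10, 0.5)
coin = stats.binom(n=10, p=0.5)
print(coin.pmf(3))    # pmf: P(X = 3)
print(coin.cdf(3))    # cdf: P(X <= 3)

# Continuous example: standard normal distribution
z = stats.norm(loc=0, scale=1)
print(z.pdf(0.0))     # pdf: density at 0 (not itself a probability)
print(z.cdf(1.96))    # cdf: P(Z <= 1.96), roughly 0.975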
Essential concepts in probability
The expectation of some random variable X is E[X] = Σ_x x · P(X = x) for discrete X (and ∫ x f(x) dx for continuous X)
Statistical quantities such as the mean E[X] and the variance Var(X) = E[(X − E[X])²] = E[X²] − E[X]² are defined through expectations
Expectations in general: E[g(X)] = Σ_x g(x) · P(X = x) (the sum becomes an integral in the continuous case)
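A quick numeric check of these definitions (a hedged numpy sketch; the pmf below is made up purely for illustration):

import numpy as np

# Made-up pmf for a discrete random variable X taking values 0..3
x = np.array([0, 1, 2, 3])
p = np.array([0.1, 0.2, 0.3, 0.4])   # probabilities sum to 1

e_x = np.sum(x * p)             # E[X]
e_x2 = np.sum(x**2 * p)         # E[g(X)] with g(x) = x^2
var_x = e_x2 - e_x**2           # Var(X) = E[X^2] - E[X]^2
print(e_x, e_x2, var_x)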
Essential concepts in probability
Two or more random variables: joint probability distribution
Example: joint probability table P(X=x,Y=y)
Summing the joint distribution over a given random variable = “marginalizing”: P(X = x) = Σ_y P(X = x, Y = y)
Covariance between two random variables: Cov(X, Y) = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X] E[Y]
Correlation: Corr(X, Y) = Cov(X, Y) / (𝜎_X 𝜎_Y), which always lies between −1 and 1
correlation does not imply causation!
Advanced linear algebra fast fact: the covariance matrix of a random vector is symmetric and positive semi-definite
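To make the joint-table ideas concrete, a small numpy sketch (the 2x2 joint table below is made up for illustration):

import numpy as np

# Made-up joint table P(X = x, Y = y); rows index x in {0, 1}, columns index y in {0, 1}
joint = np.array([[0.3, 0.2],
                  [0.1, 0.4]])

p_x = joint.sum(axis=1)    # marginalize over y to get P(X = x)
p_y = joint.sum(axis=0)    # marginalize over x to get P(Y = y)

x_vals = np.array([0, 1])
y_vals = np.array([0, 1])
e_x = (x_vals * p_x).sum()
e_y = (y_vals * p_y).sum()
e_xy = (joint * np.outer(x_vals, y_vals)).sum()

cov = e_xy - e_x * e_y                               # Cov(X, Y) = E[XY] - E[X]E[Y]
var_x = (x_vals**2 * p_x).sum() - e_x**2
var_y = (y_vals**2 * p_y).sum() - e_y**2
corr = cov / np.sqrt(var_x * var_y)                  # Corr(X, Y), always in [-1, 1]
print(p_x, p_y, cov, corr)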
Bayes’ Theorem
Conditional probability: P(A | B) = P(A and B) / P(B), which rearranges into Bayes’ Theorem: P(A | B) = P(B | A) · P(A) / P(B)
Example
A patient goes to see a doctor. The doctor performs a test with 99 percent reliability--that is, 99 percent of people who are sick test positive and 99 percent of the healthy people test negative. The doctor knows that only 1 percent of the people in the country are sick. Now the question is: if the patient tests positive, what are the chances the patient is sick?
Intuitive “answer”: 99%. Correct answer: 50%. Try calculating it yourself! http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_Probability/BS704_Probability6.html
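A worked check of the 50% answer (a minimal sketch; all numbers come straight from the example above):

# P(sick) = 0.01, P(positive | sick) = 0.99, P(positive | healthy) = 0.01
p_sick = 0.01
p_pos_given_sick = 0.99
p_pos_given_healthy = 0.01

# Law of total probability for the denominator, then Bayes' Theorem
p_pos = p_pos_given_sick * p_sick + p_pos_given_healthy * (1 - p_sick)
p_sick_given_pos = p_pos_given_sick * p_sick / p_pos
print(p_sick_given_pos)   # 0.5: only half of the people who test positive are actually sick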
Essential probability distributions
The Normal distribution
Fast facts: the pdf is (1 / √(2π𝜎²)) exp(−(x − 𝜇)² / (2𝜎²)), fully determined by the mean 𝜇 and variance 𝜎²; roughly 68% / 95% / 99.7% of the probability mass lies within 1 / 2 / 3 standard deviations of the mean; sums of many independent random variables are approximately normal (central limit theorem)
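The 68/95/99.7 rule is easy to check by sampling (a hedged numpy sketch; the sample size and seed are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0
samples = rng.normal(mu, sigma, size=1_000_000)

for k in (1, 2, 3):
    frac = np.mean(np.abs(samples - mu) < k * sigma)
    print(k, round(frac, 3))   # roughly 0.683, 0.954, 0.997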
Statistics
What we talk about when we talk about statistics
Frequentist statistics
the probability of an event is the limit of the relative frequency with which it occurs as the number of trials n goes to infinity
Classical hypothesis testing: p-values
Bayesian statistics
the probability of an event reflects our subjective belief in its likelihood of occurring
bring your own prior distribution
update your prior based on samples to obtain the posterior distribution
One internet person’s take:
(Another internet person’s take:)
Geoff Hinton Facts
In Jan 2012, I [EN: Yann LeCun] made a post on Google+ reproducing a dialog that occurred between Geoff Hinton and Radford Neal, while Radford was giving a talk at a CIFAR workshop in 2004:
- Radford Neal: I don't necessarily think that the Bayesian method is the best thing to do in all cases...
- Geoff Hinton: Sorry Radford, my prior probability for you saying this is zero, so I couldn't hear what you said.
Source: Yann LeCun, http://yann.lecun.com/ex/fun/
Hypothesis Testing
Example: suppose I flip a coin 1000 times and it shows up heads 300 times. Is the coin fair?
Null hypothesis: the coin is fair (p = 0.5)
Alternative hypothesis: the coin is biased
Compute a test statistic from the observed and expected counts
Determine the p-value: the probability of obtaining a test statistic at least as extreme as the one observed, assuming the null hypothesis is true
Reject (or fail to reject) the null hypothesis at some statistical significance level (for example, reject if p < 0.05)
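For the coin example above, one way to carry out these steps (a sketch using scipy; the slides don't specify which test statistic is meant, so this shows a chi-squared goodness-of-fit test alongside an exact binomial test):

from scipy import stats

observed = [300, 700]    # heads, tails observed in 1000 flips
expected = [500, 500]    # counts a fair coin (p = 0.5) would predict

chi2_result = stats.chisquare(f_obs=observed, f_exp=expected)
binom_result = stats.binomtest(k=300, n=1000, p=0.5, alternative='two-sided')  # needs scipy >= 1.7
print(chi2_result.pvalue, binom_result.pvalue)

# Both p-values are far below 0.05, so we reject the null hypothesis that the coin is fair.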
800 scientists agree: p-values are confusing
Statistical Inference
is the process of estimating quantities describing an underlying population based on a sample (for example, estimating a population mean from a sample mean)
Bayesian statistics
Examples of beta distributions for different parameter settings
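For example, with a Beta prior over a coin's heads probability, the posterior after observing heads and tails is again a Beta distribution; the observed counts are simply added to the parameters (a hedged sketch; the prior Beta(2, 2) is an arbitrary choice):

from scipy import stats

a, b = 2, 2                       # prior: Beta(2, 2), a mild belief centred on 0.5
heads, tails = 300, 700           # the coin-flip data from the earlier slide

posterior = stats.beta(a + heads, b + tails)   # conjugate update
print(posterior.mean())                        # posterior mean, close to 0.3
print(posterior.interval(0.95))                # a 95% credible interval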
Statistical Modelling
The science of fitting a model to data
Example: supervised machine learning
Examples of statistical models: linear regression, logistic regression, and regularized variants such as LASSO (all covered in the following slides)
Case Study: Linear regression in Moneyball
Between August and September 2002, the Oakland Athletics won a historic 20 consecutive games. Their secret? Sports analytics. (Dataset available on Kaggle)
Case Study: Linear regression in Moneyball cont.
Target run differential (runs scored minus runs allowed): >= 135
The key to increasing runs scored and decreasing runs allowed (i.e., increasing run differential)?
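A sketch of how such a regression might be fit on the Kaggle Moneyball data (hedged: the file name and the column names "RS", "OBP", "SLG" below are assumptions about that dataset, not given in the slides):

import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("baseball.csv")     # assumed file name for the Kaggle Moneyball dataset
X = df[["OBP", "SLG"]]               # on-base percentage and slugging percentage (assumed columns)
y = df["RS"]                         # runs scored (assumed column)

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)  # a linear model predicting runs scored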
The math behind linear regression
As an optimization problem: minimizing the least squares loss function L(𝜃) = Σ_i (y_i − 𝜃ᵀx_i)²
where x_i is the feature vector and y_i the observed target for the i-th example
Probabilistically:
Assume Gaussian noise with zero mean, i.e. y_i = 𝜃ᵀx_i + ε_i with ε_i ~ N(0, 𝜎²), and find the maximum likelihood estimators for the model parameters 𝜃 and the variance 𝜎² of the noise
These give the same solution for the model parameters 𝜃!
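A minimal numpy sketch of this equivalence on synthetic data (the true parameters and noise level are made up):

import numpy as np

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, size=n)])   # design matrix with an intercept column
theta_true = np.array([1.0, 2.0])
y = X @ theta_true + rng.normal(0, 0.5, size=n)                  # Gaussian noise with zero mean

# Least squares solution via the normal equations; this is also the Gaussian MLE for the parameters
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# MLE for the noise variance: the mean squared residual
sigma2_hat = np.mean((y - X @ theta_hat) ** 2)
print(theta_hat, sigma2_hat)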
Logistic Regression
Our classifier is: P(y = 1 | x) = 𝜎(𝜃ᵀx), where 𝜃ᵀx is a linear function of the features, just as in linear regression
Using the sigmoid function: 𝜎(z) = 1 / (1 + e^(−z)), which maps any real number into (0, 1)
Graph of the sigmoid function
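A tiny numpy sketch of the sigmoid and the resulting classifier (the parameters, feature vector, and 0.5 threshold below are illustrative assumptions):

import numpy as np

def sigmoid(z):
    # maps any real number into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([0.5, -1.2, 2.0])   # made-up model parameters (bias term first)
x = np.array([1.0, 0.3, 0.8])        # made-up feature vector (leading 1 for the bias)

p = sigmoid(theta @ x)               # P(y = 1 | x)
prediction = int(p >= 0.5)           # predict class 1 when the probability is at least 0.5
print(p, prediction)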
LASSO: linear regression with an L1 penalty 𝜆 Σ_j |𝜃_j| added to the least squares loss; the penalty shrinks coefficients and drives many of them exactly to zero, giving built-in feature selection
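A hedged sklearn sketch of that sparsity effect on synthetic data (the regularization strength alpha is an arbitrary choice):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_coef = np.array([3.0, -2.0] + [0.0] * 8)        # only the first two features matter
y = X @ true_coef + rng.normal(0, 0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_)    # most coefficients are driven exactly to zero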
Where to go from here?
Textbooks:
Other links: