1 of 29

Intro to Statistics for Data Science/Machine Learning

Waterloo Data Science Club

Sept 26, 2019

Frieda Rong

2 of 29

What is Statistics?

  • Descriptive statistics
    • Summarizing or describing a dataset
    • Measures of central tendency (mean, median, mode) and measures of spread (range, quantiles, variance)
  • Statistical Inference
    • Estimating unknown parameters of a population given a sampled dataset
  • Statistical modelling
    • Mathematical models encoding a set of statistical assumptions to reflect the data-generating process
  • Nonparametric statistics
    • Makes no assumptions about the parametric form of the underlying distribution - e.g. a histogram
  • Experimental design
    • Surveying methods
  • Probability
    • Distributions
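As a quick illustration of the descriptive-statistics bullet above, here is a minimal sketch using Python's standard library on a small hypothetical sample:

```python
# Descriptive statistics on a small hypothetical sample (stdlib only).
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)       # central tendency
median = statistics.median(data)
mode = statistics.mode(data)
stdev = statistics.pstdev(data)    # population standard deviation (spread)
data_range = max(data) - min(data)

print(mean, median, mode, stdev, data_range)  # 5 4.5 4 2.0 7
```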

3 of 29

Statistics in data science

A/B testing

Missing data imputation

Model selection

Interpretation

Prediction

4 of 29

Overview

  • Probability
    • Bayes’ theorem
    • Important distributions
  • Frequentist vs Bayesian statistics
  • Hypothesis testing
  • Statistical inference
  • Statistical modelling
    • Linear regression
    • Logistic regression

5 of 29

Probability

6 of 29

Basics

A discrete/categorical random variable X is equal to x with some probability P(X=x)

Examples

  • a fair coin toss
  • the number of heads before the first tail when flipping a fair coin repeatedly
  • the temperature in Waterloo tomorrow

“X = x” denotes the event that X takes the value x.

Distributions are described by a probability mass function (pmf, for discrete variables), a probability density function (pdf, for continuous variables such as tomorrow’s temperature), and the cumulative distribution function F(x) = P(X <= x).
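The second example above (number of heads before the first tail) follows a geometric distribution with P(X = k) = (1/2)^(k+1). A minimal sketch of its pmf and cdf:

```python
# pmf and cdf for the "number of heads before the first tail" example
# with a fair coin: P(X = k) = (1/2)^(k+1), a geometric distribution.

def pmf(k, p=0.5):
    """Probability of exactly k heads before the first tail (p = P(heads))."""
    return (1 - p) * p**k  # k heads in a row, then one tail

def cdf(x, p=0.5):
    """P(X <= x): sum the pmf up to and including x."""
    return sum(pmf(k, p) for k in range(int(x) + 1))

print(pmf(0))  # 0.5  (tail on the very first flip)
print(cdf(1))  # 0.75 (zero or one head before the first tail)
```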

7 of 29

Essential concepts in probability

The expectation of a discrete random variable X is E[X] = Σₓ x · P(X = x) (replace the sum with an integral over the pdf in the continuous case)

Statistical quantities

  • mean - E[X]
  • variance (standard deviation squared) - Var(X) = E[(X − E[X])²]

Expectations in general: E[g(X)] = Σₓ g(x) · P(X = x)
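These definitions translate directly into code; a sketch on a small hypothetical pmf table:

```python
# Expectation and variance computed directly from a pmf table.
pmf = {0: 0.25, 1: 0.5, 2: 0.25}  # hypothetical distribution

mean = sum(x * p for x, p in pmf.items())               # E[X]
var = sum((x - mean) ** 2 * p for x, p in pmf.items())  # E[(X - E[X])^2]

print(mean, var)  # 1.0 0.5
```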

8 of 29

Essential concepts in probability

Two or more random variables: joint probability distribution

Example: joint probability table P(X=x,Y=y)

Summing a joint distribution over one random variable = “marginalizing”: P(X=x) = Σᵧ P(X=x, Y=y)

  • Independence, dependence, covariance matrix, pairwise correlations

9 of 29

Covariance between two random variables: Cov(X, Y) = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y]

  • If X and Y are independent, then Cov(X,Y) = 0 - but the converse isn’t necessarily true!

Correlation: Corr(X, Y) = Cov(X, Y) / (σ_X · σ_Y)

  • correlation always lies in the interval [-1, 1]

correlation does not imply causation!

Advanced linear algebra fast fact:

  • covariance matrices are exactly those matrices which are symmetric and positive semi-definite (all eigenvalues are non-negative)
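A minimal sketch tying the last two slides together: covariance and correlation computed from a small hypothetical joint probability table P(X=x, Y=y):

```python
# Covariance and correlation from a joint probability table (hypothetical values).
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

ex = sum(x * p for (x, y), p in joint.items())       # E[X] via marginalizing
ey = sum(y * p for (x, y), p in joint.items())       # E[Y]
exy = sum(x * y * p for (x, y), p in joint.items())  # E[XY]
cov = exy - ex * ey                                  # Cov = E[XY] - E[X]E[Y]

vx = sum((x - ex) ** 2 * p for (x, y), p in joint.items())
vy = sum((y - ey) ** 2 * p for (x, y), p in joint.items())
corr = cov / (vx ** 0.5 * vy ** 0.5)                 # always in [-1, 1]

print(round(cov, 4), round(corr, 4))  # 0.15 0.6
```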

10 of 29

Bayes’ Theorem

Conditional probability: P(A | B) = P(A and B) / P(B). Bayes’ theorem: P(A | B) = P(B | A) · P(A) / P(B)

Example

A patient goes to see a doctor. The doctor performs a test with 99 percent reliability--that is, 99 percent of people who are sick test positive and 99 percent of the healthy people test negative. The doctor knows that only 1 percent of the people in the country are sick. Now the question is: if the patient tests positive, what are the chances the patient is sick?

11 of 29

Bayes’ Theorem

Intuitive “answer”: 99%. Correct answer: 50%. Try calculating it yourself! http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_Probability/BS704_Probability6.html
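The calculation is a direct application of Bayes' theorem, sketched here in code:

```python
# Bayes' theorem applied to the diagnostic test example:
# P(sick | +) = P(+ | sick) * P(sick) / P(+).
p_sick = 0.01               # prior: 1% of the population is sick
p_pos_given_sick = 0.99     # sensitivity
p_pos_given_healthy = 0.01  # false-positive rate (1 - specificity)

# Total probability of a positive test (marginalizing over sick/healthy).
p_pos = p_pos_given_sick * p_sick + p_pos_given_healthy * (1 - p_sick)
p_sick_given_pos = p_pos_given_sick * p_sick / p_pos

print(p_sick_given_pos)  # 0.5
```

The low prior (1% sick) exactly offsets the high test accuracy, which is why the intuitive 99% answer is so far off.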

12 of 29

Essential probability distributions

  • Binomial Bin(n,p)
    • Number of successes in n independent binary (Bernoulli) trials with success probability p
    • Well approximated by a Normal distribution for large n
  • Multinomial
  • Normal
  • Geometric, Poisson, Negative-binomial, Exponential, Gamma
  • Student’s t, Chi-squared

13 of 29

The Normal distribution

Fast facts

  • Central Limit Theorem (loosely speaking): sums and averages of many independent random variables with finite variance are approximately normally distributed

  • is the distribution with maximum entropy among all real-valued distributions with the same mean and variance
  • has rotational symmetry (in the standard multivariate case)
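A rough simulation sketch of the Central Limit Theorem: averages of uniform draws concentrate around the mean with a Normal-like spread (the seed and sample sizes here are arbitrary choices):

```python
# Rough CLT illustration: averages of n uniform(0,1) draws look Normal.
import random
random.seed(0)  # fixed seed so the run is reproducible

n, trials = 100, 2000
means = [sum(random.random() for _ in range(n)) / n for _ in range(trials)]

grand_mean = sum(means) / trials
spread = (sum((m - grand_mean) ** 2 for m in means) / trials) ** 0.5

# Theory: E = 0.5 and standard deviation ~ sqrt(1/12)/sqrt(n) ~ 0.0289.
print(round(grand_mean, 2), round(spread, 3))
```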

14 of 29

Statistics

15 of 29

What we talk about when we talk about statistics

Frequentist statistics

the probability of an event is the limit of the frequency with which an event occurs as the number of samples n increases to infinity

Classical hypothesis testing: p-values

Bayesian statistics

the probability of an event reflects our subjective belief in its likelihood of occurring

bring your own prior distribution

update your prior based on samples to obtain the posterior distribution

16 of 29

17 of 29

One internet person’s take:

(Another internet person’s take:)

Geoff Hinton Facts

In Jan 2012, I [EN: Yann LeCun] made a post on Google+ reproducing a dialog that occurred between Geoff Hinton and Radford Neal, while Radford was giving a talk at a CIFAR workshop in 2004:

- Radford Neal: I don't necessarily think that the Bayesian method is the best thing to do in all cases...

- Geoff Hinton: Sorry Radford, my prior probability for you saying this is zero, so I couldn't hear what you said.

18 of 29

Hypothesis Testing

Example: suppose I flip a coin 1000 times and it shows up heads 300 times. Is the coin fair?

Null hypothesis: the coin is fair (p = 0.5)

Alternative hypothesis: the coin is biased

Compute test statistic using observed and expected values

Determine the p-value: the probability, assuming the null hypothesis is true, of obtaining a value at least as extreme as the one observed

Reject or fail to reject the null hypothesis at some statistical significance level (for example, reject if p < 0.05)
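The steps above can be sketched for the coin example using the Normal approximation to the Binomial (a two-sided z-test):

```python
# Normal-approximation (z) test: 300 heads in 1000 flips of a "fair" coin.
import math

n, heads, p0 = 1000, 300, 0.5
expected = n * p0
se = math.sqrt(n * p0 * (1 - p0))  # standard error under the null
z = (heads - expected) / se        # test statistic

# Two-sided p-value from the standard normal cdf (via math.erf).
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

print(round(z, 2), p_value < 0.05)  # -12.65 True -> reject the null
```

A z of about -12.6 is astronomically unlikely under a fair coin, so here the conclusion is unambiguous.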

19 of 29

800 scientists agree: p-values are confusing

20 of 29

Statistical Inference

is the process of estimating unknown quantities of a population from a sampled dataset

  • Point estimates
    • Maximum likelihood estimate (MLE), Maximum a posteriori (MAP) (Bayesian perspective)
  • Interval estimates
    • Confidence intervals
    • Bayesian perspective: credible intervals

21 of 29

Bayesian statistics

  • Prior
    • Uninformative priors
  • Likelihood
  • Posterior
  • Conjugate priors
    • A prior chosen, based on the distribution of the random variable being modelled, so that the posterior belongs to the same family of distributions as the prior
    • Beta-binomial model
    • Dirichlet-multinomial
    • Gamma-Poisson
      • Negative-Binomial is a mixture of Gamma-Poisson
      • Number of customers
      • Hospital stay length

(Figure: examples of Beta distributions for different parameter settings)
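A minimal sketch of the Beta-binomial conjugate update: with a Beta(a, b) prior on the heads probability and k heads in n flips, the posterior is Beta(a + k, b + n − k). The prior and data values below are hypothetical (reusing the coin example):

```python
# Beta-binomial conjugate update: the posterior stays in the Beta family.
a, b = 1, 1       # Beta(1, 1) = uniform (uninformative) prior
k, n = 300, 1000  # observed: 300 heads in 1000 flips

a_post, b_post = a + k, b + n - k          # posterior: Beta(a + k, b + n - k)
posterior_mean = a_post / (a_post + b_post)

print(a_post, b_post, round(posterior_mean, 3))  # 301 701 0.3
```

No integration is needed; conjugacy reduces the Bayesian update to parameter arithmetic, which is exactly why conjugate priors are convenient.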

22 of 29

Statistical Modelling

23 of 29

The science of fitting a model to data

  • Feature engineering
  • Feature selection

Example: supervised machine learning

  • dataset

Examples of statistical models

  • Linear regression
  • Logistic regression for binary classification
  • Naive Bayes classification, Latent Dirichlet Allocation in natural language processing

24 of 29

Case Study: Linear regression in Moneyball

Between August and September 2002, the Oakland Athletics won a historic 20 consecutive games. Their secret? Sports analytics. (Dataset available on Kaggle)

25 of 29

Case Study: Linear regression in Moneyball cont.

Target run differential (runs scored minus runs allowed): >= 135

The key to increasing runs scored and decreasing runs allowed (i.e., increasing the run differential)?

  • on-base percentage (how often a player gets on base)
  • slugging percentage (how far a player gets around the bases on his turn)

26 of 29

The math behind linear regression

As an optimization problem: minimize the least squares loss function

L(θ) = Σᵢ (yᵢ − θᵀxᵢ)²

where the xᵢ are feature vectors and the yᵢ their observed targets.

Probabilistically:

Assume y = θᵀx + ε with Gaussian noise ε ~ N(0, σ²), and find the maximum likelihood estimators for the model parameters θ and the noise variance σ².

These give the same solution for the model parameters 𝜃!
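A minimal sketch of the least-squares solution for simple (one-feature) linear regression, using the closed-form slope/intercept formulas; the data here is hypothetical and noise-free so the fit is exact:

```python
# Closed-form simple linear regression: fit y ~ theta0 + theta1 * x.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]  # exactly y = 1 + 2x (no noise, for clarity)

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

# theta1 = Cov(x, y) / Var(x); theta0 makes the line pass through the means.
theta1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
theta0 = y_bar - theta1 * x_bar

print(theta0, theta1)  # 1.0 2.0
```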

27 of 29

Logistic Regression

  • Binary classification
  • Is a generalized linear model
  • Sigmoid function: σ(z) = 1 / (1 + e^(−z))

Using the sigmoid function: P(y = 1 | x) = σ(θᵀx)

Our classifier is: predict y = 1 when P(y = 1 | x) > 0.5

(Graph of the sigmoid function)
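A sketch of the sigmoid and the resulting decision rule; the parameter vector below is a hypothetical "learned" value, not fitted from data:

```python
# Sigmoid function and the logistic-regression decision rule:
# P(y = 1 | x) = sigmoid(theta . x); predict class 1 when that exceeds 0.5.
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def predict(theta, x):
    z = sum(t * xi for t, xi in zip(theta, x))  # theta . x
    return 1 if sigmoid(z) > 0.5 else 0

theta = [-1.0, 2.0]  # hypothetical parameters: (bias, weight), x = (1, feature)
print(sigmoid(0))              # 0.5 (the decision boundary)
print(predict(theta, [1, 1]))  # z = 1 -> sigmoid(1) ~ 0.73 -> class 1
```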

28 of 29

LASSO

  • Statistical regularization and variable selection technique
  • Proposed by a former Statistics & Computer Science Waterloo alum (Robert Tibshirani)!
  • Add an L1 norm penalty term: minimize Σᵢ (yᵢ − θᵀxᵢ)² + λ‖θ‖₁
    • L1 vs L2 regularization: L1 encourages sparsity
      • the L2 penalty places a heavier penalty on large coefficients
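One way to see why the L1 penalty encourages sparsity: in one dimension, minimizing (t − c)²/2 + λ|t| gives the soft-thresholding operator, which sets the coefficient exactly to zero whenever |c| ≤ λ. This is the building block of coordinate-descent LASSO solvers; a minimal sketch:

```python
# Soft-thresholding: argmin_t (t - c)^2 / 2 + lam * |t|.
# The L1 penalty zeroes out small coefficients, producing sparse solutions.

def soft_threshold(c, lam):
    if c > lam:
        return c - lam
    if c < -lam:
        return c + lam
    return 0.0  # |c| <= lam -> coefficient driven exactly to zero

print(soft_threshold(3.0, 1.0))  # 2.0 (shrunk toward zero)
print(soft_threshold(0.5, 1.0))  # 0.0 (eliminated entirely)
```

An L2 penalty, by contrast, only shrinks coefficients proportionally and never sets them exactly to zero.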

29 of 29

Where to go from here?

Textbooks:

  • All of Statistics by Larry Wasserman
  • Elements of Statistical Learning (ESLII)
  • Introduction to Statistical Learning (“baby statistics”)

Other links: