1 of 29

Intro to Statistics for Data Science/Machine Learning

Waterloo Data Science Club

Sept 26, 2019

Frieda Rong

2 of 29

What is Statistics?

  • Descriptive statistics
    • Summarizing or describing a dataset
    • Measures of central tendency (mean, median, mode) and measures of spread (range, quantiles, variance)
  • Statistical Inference
    • Estimating unknown parameters of a population given a sampled dataset
  • Statistical modelling
    • Mathematical models encoding a set of statistical assumptions to reflect the data-generating process
  • Nonparametric statistics
    • Makes no assumptions about the parametric form of the underlying distribution - e.g. a histogram
  • Experimental design
    • Surveying methods
  • Probability
    • Distributions
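As a quick illustration of the descriptive-statistics bullet above, here is a minimal sketch using Python's standard library on a small hypothetical sample:

```python
# Descriptive statistics on a small hypothetical sample (stdlib only).
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)       # central tendency
median = statistics.median(data)
mode = statistics.mode(data)
stdev = statistics.pstdev(data)    # population standard deviation (spread)
data_range = max(data) - min(data)

print(mean, median, mode, stdev, data_range)  # 5 4.5 4 2.0 7
```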

3 of 29

Statistics in data science

A/B testing

Missing data imputation

Model selection

Interpretation

Prediction

4 of 29

Overview

  • Probability
    • Bayes’ theorem
    • Important distributions
  • Frequentist vs Bayesian statistics
  • Hypothesis testing
  • Statistical inference
  • Statistical modelling
    • Linear regression
    • Logistic regression

5 of 29

Probability

6 of 29

Basics

A discrete/categorical random variable X is equal to x with some probability P(X=x)

Examples

  • a fair coin toss
  • the number of heads before the first tail when flipping a fair coin repeatedly
  • the temperature in Waterloo tomorrow

“X = x” denotes the event that X takes the value x.

Distributions are described by a probability mass function (pmf, for discrete variables), a probability density function (pdf, for continuous variables such as tomorrow’s temperature), and the cumulative distribution function F(x) = P(X <= x).
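The second example above (number of heads before the first tail) follows a geometric distribution with P(X = k) = (1/2)^(k+1). A minimal sketch of its pmf and cdf:

```python
# pmf and cdf for the "number of heads before the first tail" example
# with a fair coin: P(X = k) = (1/2)^(k+1), a geometric distribution.

def pmf(k, p=0.5):
    """Probability of exactly k heads before the first tail (p = P(heads))."""
    return (1 - p) * p**k  # k heads in a row, then one tail

def cdf(x, p=0.5):
    """P(X <= x): sum the pmf up to and including x."""
    return sum(pmf(k, p) for k in range(int(x) + 1))

print(pmf(0))  # 0.5  (tail on the very first flip)
print(cdf(1))  # 0.75 (zero or one head before the first tail)
```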

7 of 29

Essential concepts in probability

The expectation of a discrete random variable X is E[X] = Σₓ x · P(X = x) (replace the sum with an integral over the pdf in the continuous case)

Statistical quantities

  • mean - E[X]
  • variance (standard deviation squared) - Var(X) = E[(X − E[X])²]

Expectations in general: E[g(X)] = Σₓ g(x) · P(X = x)
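These definitions translate directly into code; a sketch on a small hypothetical pmf table:

```python
# Expectation and variance computed directly from a pmf table.
pmf = {0: 0.25, 1: 0.5, 2: 0.25}  # hypothetical distribution

mean = sum(x * p for x, p in pmf.items())               # E[X]
var = sum((x - mean) ** 2 * p for x, p in pmf.items())  # E[(X - E[X])^2]

print(mean, var)  # 1.0 0.5
```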

8 of 29

Essential concepts in probability

Two or more random variables: joint probability distribution

Example: joint probability table P(X=x,Y=y)

Summing a joint distribution over one random variable = “marginalizing”: P(X=x) = Σᵧ P(X=x, Y=y)

  • Independence, dependence, covariance matrix, pairwise correlations

9 of 29

Covariance between two random variables: Cov(X, Y) = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y]

  • If X and Y are independent, then Cov(X,Y) = 0 - but the converse isn’t necessarily true!

Correlation: Corr(X, Y) = Cov(X, Y) / (σ_X · σ_Y)

  • correlation always lies in the interval [-1, 1]

correlation does not imply causation!

Advanced linear algebra fast fact:

  • covariance matrices are exactly those matrices which are symmetric and positive semi-definite (all eigenvalues are non-negative)
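A minimal sketch tying the last two slides together: covariance and correlation computed from a small hypothetical joint probability table P(X=x, Y=y):

```python
# Covariance and correlation from a joint probability table (hypothetical values).
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

ex = sum(x * p for (x, y), p in joint.items())       # E[X] via marginalizing
ey = sum(y * p for (x, y), p in joint.items())       # E[Y]
exy = sum(x * y * p for (x, y), p in joint.items())  # E[XY]
cov = exy - ex * ey                                  # Cov = E[XY] - E[X]E[Y]

vx = sum((x - ex) ** 2 * p for (x, y), p in joint.items())
vy = sum((y - ey) ** 2 * p for (x, y), p in joint.items())
corr = cov / (vx ** 0.5 * vy ** 0.5)                 # always in [-1, 1]

print(round(cov, 4), round(corr, 4))  # 0.15 0.6
```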

10 of 29

Bayes’ Theorem

Conditional probability: P(A | B) = P(A and B) / P(B). Bayes’ theorem: P(A | B) = P(B | A) · P(A) / P(B)

Example

A patient goes to see a doctor. The doctor performs a test with 99 percent reliability--that is, 99 percent of people who are sick test positive and 99 percent of the healthy people test negative. The doctor knows that only 1 percent of the people in the country are sick. Now the question is: if the patient tests positive, what are the chances the patient is sick?

11 of 29

Bayes’ Theorem

Intuitive “answer”: 99%. Correct answer: 50%. Try calculating it yourself! http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_Probability/BS704_Probability6.html
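The calculation is a direct application of Bayes' theorem, sketched here in code:

```python
# Bayes' theorem applied to the diagnostic test example:
# P(sick | +) = P(+ | sick) * P(sick) / P(+).
p_sick = 0.01               # prior: 1% of the population is sick
p_pos_given_sick = 0.99     # sensitivity
p_pos_given_healthy = 0.01  # false-positive rate (1 - specificity)

# Total probability of a positive test (marginalizing over sick/healthy).
p_pos = p_pos_given_sick * p_sick + p_pos_given_healthy * (1 - p_sick)
p_sick_given_pos = p_pos_given_sick * p_sick / p_pos

print(p_sick_given_pos)  # 0.5
```

The low prior (1% sick) exactly offsets the high test accuracy, which is why the intuitive 99% answer is so far off.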

12 of 29

Essential probability distributions

  • Binomial Bin(n,p)
    • Number of successes in n independent binary (Bernoulli) trials with success probability p
    • Well approximated by a Normal distribution for large n
  • Multinomial
  • Normal
  • Geometric, Poisson, Negative-binomial, Exponential, Gamma
  • Student’s t, Chi-squared

13 of 29

The Normal distribution

Fast facts

  • Central Limit Theorem (loosely speaking): sums and averages of many independent random variables with finite variance are approximately normally distributed

  • is the distribution with maximum entropy among all real-valued distributions with the same mean and variance
  • has rotational symmetry (in the standard multivariate case)
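A rough simulation sketch of the Central Limit Theorem: averages of uniform draws concentrate around the mean with a Normal-like spread (the seed and sample sizes here are arbitrary choices):

```python
# Rough CLT illustration: averages of n uniform(0,1) draws look Normal.
import random
random.seed(0)  # fixed seed so the run is reproducible

n, trials = 100, 2000
means = [sum(random.random() for _ in range(n)) / n for _ in range(trials)]

grand_mean = sum(means) / trials
spread = (sum((m - grand_mean) ** 2 for m in means) / trials) ** 0.5

# Theory: E = 0.5 and standard deviation ~ sqrt(1/12)/sqrt(n) ~ 0.0289.
print(round(grand_mean, 2), round(spread, 3))
```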

14 of 29

Statistics

15 of 29

What we talk about when we talk about statistics

Frequentist statistics

the probability of an event is the limit of the frequency with which an event occurs as the number of samples n increases to infinity

Classical hypothesis testing: p-values

Bayesian statistics

the probability of an event reflects our subjective belief in its likelihood of occurring

bring your own prior distribution

update your prior based on samples to obtain the posterior distribution

16 of 29

17 of 29

One internet person’s take:

(Another internet person’s take:)

Geoff Hinton Facts

In Jan 2012, I [EN: Yann LeCun] made a post on Google+ reproducing a dialog that occurred between Geoff Hinton and Radford Neal, while Radford was giving a talk at a CIFAR workshop in 2004:

- Radford Neal: I don't necessarily think that the Bayesian method is the best thing to do in all cases...

- Geoff Hinton: Sorry Radford, my prior probability for you saying this is zero, so I couldn't hear what you said.

18 of 29

Hypothesis Testing

Example: suppose I flip a coin 1000 times and it shows up heads 300 times. Is the coin fair?

Null hypothesis: the coin is fair (p = 0.5)

Alternative hypothesis: the coin is biased

Compute test statistic using observed and expected values

Determine the p-value: the probability, assuming the null hypothesis is true, of obtaining a value at least as extreme as the one observed

Reject or fail to reject the null hypothesis at some statistical significance level (for example, reject if p < 0.05)
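The steps above can be sketched for the coin example using the Normal approximation to the Binomial (a two-sided z-test):

```python
# Normal-approximation (z) test: 300 heads in 1000 flips of a "fair" coin.
import math

n, heads, p0 = 1000, 300, 0.5
expected = n * p0
se = math.sqrt(n * p0 * (1 - p0))  # standard error under the null
z = (heads - expected) / se        # test statistic

# Two-sided p-value from the standard normal cdf (via math.erf).
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

print(round(z, 2), p_value < 0.05)  # -12.65 True -> reject the null
```

A z of about -12.6 is astronomically unlikely under a fair coin, so here the conclusion is unambiguous.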

19 of 29

800 scientists agree: p-values are confusing

20 of 29

Statistical Inference

is the process of estimating unknown quantities of a population from a sampled dataset

  • Point estimates
    • Maximum likelihood estimate (MLE), Maximum a posteriori (MAP) (Bayesian perspective)
  • Interval estimates
    • Confidence intervals
    • Bayesian perspective: credible intervals

21 of 29

Bayesian statistics

  • Prior
    • Uninformative priors
  • Likelihood
  • Posterior
  • Conjugate priors
    • A prior chosen, based on the distribution of the random variable being modelled, so that the posterior belongs to the same family of distributions as the prior
    • Beta-binomial model
    • Dirichlet-multinomial
    • Gamma-Poisson
      • Negative-Binomial is a mixture of Gamma-Poisson
      • Number of customers
      • Hospital stay length

(Figure: examples of Beta distributions for different parameter settings)
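A minimal sketch of the Beta-binomial conjugate update: with a Beta(a, b) prior on the heads probability and k heads in n flips, the posterior is Beta(a + k, b + n − k). The prior and data values below are hypothetical (reusing the coin example):

```python
# Beta-binomial conjugate update: the posterior stays in the Beta family.
a, b = 1, 1       # Beta(1, 1) = uniform (uninformative) prior
k, n = 300, 1000  # observed: 300 heads in 1000 flips

a_post, b_post = a + k, b + n - k          # posterior: Beta(a + k, b + n - k)
posterior_mean = a_post / (a_post + b_post)

print(a_post, b_post, round(posterior_mean, 3))  # 301 701 0.3
```

No integration is needed; conjugacy reduces the Bayesian update to parameter arithmetic, which is exactly why conjugate priors are convenient.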

22 of 29

Statistical Modelling

23 of 29

The science of fitting a model to data

  • Feature engineering
  • Feature selection

Example: supervised machine learning

  • dataset

Examples of statistical models

  • Linear regression
  • Logistic regression for binary classification
  • Naive Bayes classification, Latent Dirichlet Allocation in natural language processing

24 of 29

Case Study: Linear regression in Moneyball

Between August and September 2002, the Oakland Athletics won a historic 20 consecutive games. Their secret? Sports analytics. (Dataset available on Kaggle)

25 of 29

Case Study: Linear regression in Moneyball cont.

Target run differential (runs scored minus runs allowed): >= 135

The key to increasing runs scored and decreasing runs allowed (i.e., increasing the run differential)?

  • on-base percentage (how often a player gets on base)
  • slugging percentage (how far a player gets around the bases on his turn)

26 of 29

The math behind linear regression

As an optimization problem: minimize the least squares loss function

L(θ) = Σᵢ (yᵢ − θᵀxᵢ)²

where the xᵢ are feature vectors and the yᵢ their observed targets.

Probabilistically:

Assume y = θᵀx + ε with Gaussian noise ε ~ N(0, σ²), and find the maximum likelihood estimators for the model parameters θ and the noise variance σ².

These give the same solution for the model parameters 𝜃!
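A minimal sketch of the least-squares solution for simple (one-feature) linear regression, using the closed-form slope/intercept formulas; the data here is hypothetical and noise-free so the fit is exact:

```python
# Closed-form simple linear regression: fit y ~ theta0 + theta1 * x.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]  # exactly y = 1 + 2x (no noise, for clarity)

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

# theta1 = Cov(x, y) / Var(x); theta0 makes the line pass through the means.
theta1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
theta0 = y_bar - theta1 * x_bar

print(theta0, theta1)  # 1.0 2.0
```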

27 of 29

Logistic Regression

  • Binary classification
  • Is a generalized linear model
  • Sigmoid function: σ(z) = 1 / (1 + e^(−z))

Using the sigmoid function: P(y = 1 | x) = σ(θᵀx)

Our classifier is: predict y = 1 when P(y = 1 | x) > 0.5

(Graph of the sigmoid function)
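A sketch of the sigmoid and the resulting decision rule; the parameter vector below is a hypothetical "learned" value, not fitted from data:

```python
# Sigmoid function and the logistic-regression decision rule:
# P(y = 1 | x) = sigmoid(theta . x); predict class 1 when that exceeds 0.5.
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def predict(theta, x):
    z = sum(t * xi for t, xi in zip(theta, x))  # theta . x
    return 1 if sigmoid(z) > 0.5 else 0

theta = [-1.0, 2.0]  # hypothetical parameters: (bias, weight), x = (1, feature)
print(sigmoid(0))              # 0.5 (the decision boundary)
print(predict(theta, [1, 1]))  # z = 1 -> sigmoid(1) ~ 0.73 -> class 1
```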

28 of 29

LASSO

  • Statistical regularization and variable selection technique
  • Proposed by a former Statistics & Computer Science Waterloo alum (Robert Tibshirani)!
  • Add an L1 norm penalty term: minimize Σᵢ (yᵢ − θᵀxᵢ)² + λ‖θ‖₁
    • L1 vs L2 regularization: L1 encourages sparsity
      • the L2 penalty places a heavier penalty on large coefficients
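One way to see why the L1 penalty encourages sparsity: in one dimension, minimizing (t − c)²/2 + λ|t| gives the soft-thresholding operator, which sets the coefficient exactly to zero whenever |c| ≤ λ. This is the building block of coordinate-descent LASSO solvers; a minimal sketch:

```python
# Soft-thresholding: argmin_t (t - c)^2 / 2 + lam * |t|.
# The L1 penalty zeroes out small coefficients, producing sparse solutions.

def soft_threshold(c, lam):
    if c > lam:
        return c - lam
    if c < -lam:
        return c + lam
    return 0.0  # |c| <= lam -> coefficient driven exactly to zero

print(soft_threshold(3.0, 1.0))  # 2.0 (shrunk toward zero)
print(soft_threshold(0.5, 1.0))  # 0.0 (eliminated entirely)
```

An L2 penalty, by contrast, only shrinks coefficients proportionally and never sets them exactly to zero.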

29 of 29

Where to go from here?

Textbooks:

  • All of Statistics by Larry Wasserman
  • Elements of Statistical Learning (ESLII)
  • Introduction to Statistical Learning (“baby statistics”)

Other links: