1 of 67

Applied Data Analysis (CS401)

Robert West

Lecture 5

Read the Stats Carefully

2018/10/18

2 of 67

Announcements

  • Homework 1
    • Peer reviews due tomorrow (Fri) 23:59
    • Grades released Wed, Oct 24
  • Homework 2
    • Due on Wed, Oct 24, 23:59
  • Tomorrow’s lab session:
    • Project introduction
    • Cluster introduction
    • HW2 office hours
    • DataCluedo, installment 2

3 of 67

Feedback

Give us feedback on this lecture here: https://go.epfl.ch/ada2018-lec5-feedback

  • What did you (not) like about this lecture?
  • What was (not) well explained?
  • On what would you like more (fewer) details?
  • Where is Waldo?

4 of 67

5 of 67

6 of 67

Overview

  • Descriptive statistics
  • Sampling and uncertainty
  • Relating two variables
  • Hypothesis testing

7 of 67

ADA won’t cover the basics of stats!

You know these things from prerequisite courses

But stats are a key ingredient of data analysis

Today: some highlights and common pitfalls

8 of 67

Descriptive statistics

9 of 67

Descriptive statistics

10 of 67

Mean, variance, and normal distribution

The (arithmetic) mean of a set of values is just the average of the values.

Variance is a measure of the width of a distribution. Specifically, the variance is the mean squared deviation of points from the mean: Var(x) = (1/n) Σᵢ (xᵢ − mean(x))².

The standard deviation (std) is the square root of variance.

The normal distribution is completely characterized by mean and std.
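
These three summaries can be computed directly with Python’s standard library (a minimal sketch with toy numbers):

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # toy values

mean = statistics.mean(data)      # arithmetic mean: sum / count
var = statistics.pvariance(data)  # mean squared deviation from the mean
std = statistics.pstdev(data)     # square root of the variance

print(mean, var, std)  # 5.0 4.0 2.0
```

Note the use of the population variants (`pvariance`, `pstdev`); `variance`/`stdev` divide by n − 1 instead of n (the sample estimators).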

 

[Figure: normal distribution with mean and standard deviation annotated]

11 of 67

Robust statistics

A statistic is said to be robust if it is not sensitive to outliers

Min, max, mean, std are not robust

Median, quartiles (and others) are robust

Check these Wikipedia pages
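
To see robustness concretely, a small plain-Python comparison (toy numbers chosen for illustration):

```python
import statistics

values = [1, 2, 3, 4, 5]
with_outlier = values + [1000]  # one extreme value added

# The mean is dragged far away by a single outlier:
print(statistics.mean(values), statistics.mean(with_outlier))  # 3 vs ~169.2

# The median barely moves: it is robust to outliers:
print(statistics.median(values), statistics.median(with_outlier))  # 3 vs 3.5
```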

12 of 67

Heavy-tailed distributions

  • Some distributions are all about the “outliers”
  • E.g., power laws:
    • Very very large values are rare, “but not very rare”
    • Body size vs. city size
    • For k <= 2: infinite mean
    • For k <= 3: infinite variance
    • Don’t report mean/variance for power-law-distributed data!
    • Use robust statistics (e.g., median, “80/20 rule”, etc.)
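
A quick simulation (plain Python, synthetic Pareto draws; the shape parameter 1.5 is chosen for illustration) shows why the median is the safer summary here:

```python
import random
import statistics

random.seed(0)

# Two independent samples from a Pareto distribution with shape k = 1.5
# (finite mean, infinite variance; a textbook heavy tail).
a = [random.paretovariate(1.5) for _ in range(10_000)]
b = [random.paretovariate(1.5) for _ in range(10_000)]

# Sample means can disagree noticeably, driven by a few huge values,
# while the median stays close to its true value 2**(1/1.5) ~= 1.59:
print("means:  ", statistics.mean(a), statistics.mean(b))
print("medians:", statistics.median(a), statistics.median(b))
```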

13 of 67

Distributions

14 of 67

Distributions

Some other important distributions:

  • Poisson: the distribution of counts that occur at a certain “rate”; e.g., number of visits to a given website in a fixed time interval.
  • Exponential: the interval between two such events.
  • Binomial/multinomial: The number of counts of events (e.g., coin flips = heads) out of n trials.
  • Power-law/Zipf/Pareto/Yule: e.g., frequencies of different terms in a document; city size

You should understand the distribution of your data before applying any model!

15 of 67

“Dear data, where are you from?”

  • Visual inspection can rule out certain distributions: e.g., if the data are asymmetric, they cannot be normal. The histogram gives even more information.

  • Statistical tests:
    • Goodness-of-fit tests
    • Kolmogorov-Smirnov test
    • Normality tests

Visual tools: box plot, (smoothed) histogram, quantile-quantile (QQ) plot
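
A rough, library-free sketch of the QQ idea, i.e., the numeric comparison behind the plot (synthetic data; in practice you would plot these pairs, or use a goodness-of-fit test):

```python
import random
import statistics

random.seed(42)
sample = sorted(random.gauss(0, 1) for _ in range(1_000))

# Poor man's QQ plot: pair each sorted observation with the corresponding
# quantile of a normal distribution fitted to the sample. For truly normal
# data the pairs lie close to the line y = x.
fitted = statistics.NormalDist(statistics.mean(sample), statistics.stdev(sample))
n = len(sample)
qq = [(fitted.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(sample)]

max_gap = max(abs(theory - observed) for theory, observed in qq)
print(max_gap)  # small for normal data; large gaps hint at another distribution
```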

16 of 67

Recognizing a power law

Plot on log-log axes: a power law such as f(x) = 4x² appears as a straight line (here with slope 2).

17 of 67

Sampling and uncertainty

18 of 67

Measurement on samples

Datasets are samples from an underlying distribution.

We are most interested in measures on the entire population, but we have access only to a sample of it.

That makes measurement hard:

  • Sample measurements have variance: variation between samples.
  • Sample measurements have bias: systematic variation from the population value.

19 of 67

Sampling is tricky

  • Sometimes need to subsample for computational efficiency
    • Uniformly at random
    • Stratified sampling (e.g., equal number per quantile)
    • Importance sampling
  • Sometimes need to subsample during data collection
    • E.g., Google Flu Trends: only Google users
    • Careful: bias!

20 of 67

21 of 67

22 of 67

So you have a biased dataset...

  • “Found data”
  • Observational studies
  • Entire lecture on this (two weeks from today)

23 of 67

Who likes Snickers better?

24 of 67

Who likes Snickers better?

[Figure: group ratings with standard deviation shown in red]

25 of 67

Who likes Snickers better?

[Figure: group ratings with standard deviation shown in red]

26 of 67

Quantifying uncertainty

  • Whenever you report a statistic, you need to quantify how certain you are in it!
  • Even a complete dataset is a sample
  • Entire field: hypothesis testing (later today)
  • All plots should have error bars!

27 of 67

Error bars

  • Can represent many things (→ always explain; always question):
    • Standard deviation
    • Standard error of the mean: sd/sqrt(n)
    • Confidence intervals
      • Parametric (ugh!)
      • Non-parametric (yay!): bootstrap resampling

28 of 67

Bootstrap resampling
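
The idea fits in a few lines of plain Python (synthetic data for illustration; real pipelines would use a library, but the algorithm is just this):

```python
import random
import statistics

random.seed(0)
data = [random.gauss(10, 2) for _ in range(200)]  # synthetic measurements

# Resample the data with replacement many times and recompute the statistic;
# the spread of the resampled statistics estimates our uncertainty.
n_boot = 2_000
boot_means = sorted(
    statistics.mean(random.choices(data, k=len(data))) for _ in range(n_boot)
)

# Non-parametric 95% confidence interval: 2.5th and 97.5th percentiles.
lo, hi = boot_means[int(0.025 * n_boot)], boot_means[int(0.975 * n_boot)]
print(f"mean = {statistics.mean(data):.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```

The same recipe works for any statistic (median, correlation, ...), which is exactly why the bootstrap is so handy for error bars.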

29 of 67

Relating two variables

30 of 67

Pearson’s correlation coefficient

  • “Amount of linear dependence”

  • More general:
    • Rank correlation, e.g., Spearman’s correlation coefficient
    • Mutual information
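
A small plain-Python illustration of the difference (hand-rolled `pearson`/`spearman` helpers for transparency; in practice you would use scipy.stats.pearsonr and spearmanr):

```python
import statistics

def pearson(x, y):
    # Pearson's r: covariance normalized by the two standard deviations.
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    # Spearman = Pearson computed on the ranks (no ties here, for simplicity).
    def ranks(v):
        return [sorted(v).index(e) for e in v]
    return pearson(ranks(x), ranks(y))

x = list(range(1, 11))
y = [xi ** 3 for xi in x]  # perfectly monotone, but not linear

print(pearson(x, y))   # about 0.93: a linear fit is imperfect
print(spearman(x, y))  # 1.0: rank correlation captures any monotone dependence
```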

31 of 67

Anscombe’s quartet


32 of 67

Anscombe’s triplet

[Figure: three scatter plots of sales vs. time, for Company A, Company B, and Company C]

33 of 67

Anscombe’s quartet

34 of 67

Anscombe’s quartet

Illustrates the importance of looking at a set of data graphically before starting to analyze it

Highlights the inadequacy of basic statistical properties for describing realistic datasets

More on Wikipedia

35 of 67

Good solutions:

  • Visual reasoning
  • Spearman rank correlation (rather than Pearson correlation)

Incorrect solutions:

  • “No difference”
  • Pearson correlation
  • Slope of linear regression

(but nearly identical for all plots)

  • median(sales)
  • stdev(sales)
  • max(sales) - min(sales)
  • avg(sales/time)

(not taking order into account)

  • avg((sales − time)²)
  • (avg(sales) − avg(time))²
  • “Outliers imply lower dependence”
  • “Quadratic dependence (B) counts more than linear dependence”

[Figure: the three sales-vs-time plots for Companies A, B, and C again, ordered from weakest to strongest dependence]

36 of 67

Correlation coefficients are tricky!

37 of 67

Ice cream sales vs. deaths by drowning, anyone?

[Diagram: temperature and number of swimmers drive both ice cream sales and deaths by drowning; a common cause, not a causal link, explains the correlation]

38 of 67

UC Berkeley gender bias (?)

Admission figures from 1973

39 of 67

Simpson’s paradox

A trend may appear in different groups of data but disappear or reverse when these groups are combined. Beware of aggregates!

In the previous example, women tended to apply to competitive departments with low admission rates
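
To make the paradox concrete, here is a toy calculation with made-up admission counts (hypothetical figures, not the real 1973 Berkeley data):

```python
# Hypothetical (admitted, applicants) counts per department; illustrative only.
men   = {"easy dept": (80, 100), "hard dept": (20, 100)}
women = {"easy dept": (9, 10),   "hard dept": (250, 1000)}

def rate(admitted, applicants):
    return admitted / applicants

# Per department, women are admitted at a HIGHER rate...
for dept in men:
    assert rate(*women[dept]) > rate(*men[dept])

def overall(groups):
    admitted = sum(a for a, _ in groups.values())
    applicants = sum(n for _, n in groups.values())
    return admitted / applicants

# ...yet in the aggregate women fare WORSE, because most women applied
# to the competitive department:
print(overall(men), overall(women))  # 0.5 vs ~0.256
```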

40 of 67

Hypothesis testing

41 of 67

Rhine’s paradox

Joseph Rhine was a parapsychologist in the 1950s (founder of the Journal of Parapsychology and the Parapsychological Association, an affiliate of the AAAS).

He ran an experiment where subjects had to guess whether 10 hidden cards were red or blue.

He found that about 1 person in 1000 had ESP (extrasensory perception), i.e., they could guess the color of all 10 cards!

Q: Do you agree?

42 of 67

Rhine’s paradox

He called back the “psychic” subjects and had them do the same test again. They all failed.

He concluded that the act of telling psychics that they have psychic abilities causes them to lose them…

43 of 67

Which of the following statements about p-values are true?

44 of 67

Preliminaries

  • A huge subject; can take entire classes on it
  • A black art; many people hate it
  • cf. Bayesian vs. frequentist debate (a.k.a. war)
  • Need to understand basics even if you don’t use it yourself
  • Never use it without understanding exactly what you’re doing

45 of 67

Hypothesis testing

Reasoning:

  • Flip a coin 100 times; 40 heads; “Is the coin fair?”
  • Null hypothesis: “yes”; alternative hypothesis: “no”
  • “How likely would I be to see 40 or fewer heads if the null hypothesis were true?”
  • If this probability is large, the null hypothesis suffices (and is thus not rejected)
  • Otherwise, keep experimenting
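
The reasoning above can be computed exactly with the binomial distribution (plain Python, no libraries; the coin numbers are those from the slide):

```python
from math import comb

n, k = 100, 40  # 100 flips of a supposedly fair coin, 40 heads observed

# One-sided p-value: probability of 40 or fewer heads under a fair coin,
# i.e., summing binomial probabilities P(X = i) = C(n, i) / 2**n.
p_one_sided = sum(comb(n, i) for i in range(k + 1)) / 2**n

# Two-sided: a deviation of at least |50 - 40| heads in either direction.
p_two_sided = min(1.0, 2 * p_one_sided)

print(round(p_one_sided, 4), round(p_two_sided, 4))  # ~0.0284 and ~0.0569
```

So at α = 0.05 the one-tailed test would reject the null hypothesis, but the two-tailed test would not; the choice of tails matters.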

46 of 67

Hypothesis testing

  • Idea: Gain (weak and indirect) support for a hypothesis HA by ruling out a null hypothesis H0

  • A test statistic is some measurement we can make on the data which is likely to be large under HA but small under H0

47 of 67

Hypothesis testing

  • Gain (weak and indirect) support for a hypothesis HA (the coin is biased) by means of disproving a null hypothesis H0 (the coin is fair).
  • A test statistic is some measurement we can make on the data which is likely to be large under HA but small under H0. Examples: the number of heads after k coin tosses (one-tailed); the absolute difference between the number of heads and k/2 (two-tailed).

  • Note: tests can be either one-tailed or two-tailed. Here a two-tailed test is convenient because it treats very large and very small counts of heads the same way.

48 of 67

Hypothesis testing

  • Another example:
    • Two samples a and b, normally distributed, from A and B.
    • Null hypothesis H0: mean(A) = mean(B); test statistic: s = mean(a) – mean(b).
    • Under H0, s has mean zero and is normally distributed*.
    • But it is “large” if the two means are different.

* Because the sum of two independent, normally distributed variables is also normally distributed.
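
This slide’s test can be sketched as a z-test in plain Python (synthetic data with a made-up mean shift of 0.5; with small samples you would use a t-test from a stats library instead):

```python
import random
import statistics

random.seed(1)
# Synthetic samples: b's population mean is shifted by 0.5 (made-up numbers).
a = [random.gauss(0.0, 1.0) for _ in range(500)]
b = [random.gauss(0.5, 1.0) for _ in range(500)]

# Test statistic: difference of sample means.
s = statistics.mean(a) - statistics.mean(b)

# Under H0 (equal population means), s is approximately normal with mean 0;
# its standard error combines the per-sample variances.
se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
z = s / se
p = 2 * statistics.NormalDist().cdf(-abs(z))  # two-tailed p-value

print(f"s = {s:.3f}, z = {z:.1f}, p = {p:.2g}")
```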

49 of 67

Hypothesis testing

  • s = mean(a) – mean(b) is our test statistic; the null hypothesis H0 is mean(A) = mean(B)
    • We reject H0 if Pr(S > s | H0) < α, i.e., if the probability of a statistic value at least as large as s is small.
    • Pr(S > s | H0) is the infamous p-value; α is called significance level
    • α is a suitable “small” probability, say 0.05.
    • α directly controls the false-positive rate (probability of rejecting H0 although it is true): higher α → higher false-positive rate
    • As we make α smaller, the false-negative rate increases (probability of not rejecting H0 although it is false)
    • Common values for α: 0.05, 0.02, 0.01, 0.005, 0.001

50 of 67

Two-tailed significance

When the p-value is less than 5% (p < .05), we reject the null hypothesis

51 of 67

Select your tests

Testing is a bit like finding the right recipe based on these ingredients:

i) Question, ii) Data type, iii) Sample size, iv) Variance known?, v) Variance of several groups equal?

Good news: Plenty of tables available!

http://www.ats.ucla.edu/stat/mult_pkg/whatstat/default.htm (with examples in R, SAS, Stata, SPSS)

http://sites.stat.psu.edu/~ajw13/stat500_su_res/notes/lesson14/images/summary_table.pdf

52 of 67

53 of 67

p-values

  • Widely used in all sciences
  • They are widely misunderstood!
  • Don’t dare to use them if you don’t understand them! (example)
  • A large p means that your data would be quite likely even under the null hypothesis
  • This tells you nothing about the alternative hypothesis

54 of 67

p-values

  • Historically, not meant as a method for formally�deciding whether a hypothesis is true or not
  • Rather, an informal tool for assessing a particular result
  • Low p-value means: the simple null hypothesis doesn’t explain the data, so keep looking for other explanations!
  • p = 0.05 means: if you repeated the experiment 20 times, you would expect to see data this extreme once even under the null hypothesis → you might have “lucked out”
  • Look at the y-axis, not just the p-value!

55 of 67

p-values

  • Important to understand what p-values are
  • Maybe even more important to understand what they are not…
  • Read this paper: A Dirty Dozen: 12 P-Value Misconceptions

56 of 67

“p-value hacking”

57 of 67

58 of 67

Alternative approach: Bayes factors

  • See here
  • Great (and amusing) explanation of the difference between the hypothesis-testing approach and the Bayesian approach: Chapter 37 in MacKay’s book “Information Theory, Inference, and Learning Algorithms”

59 of 67

Multiple-hypothesis testing

If you perform experiments over and over, you’re bound to find something

Significance level must be adjusted down when performing multiple hypothesis tests!
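
The arithmetic behind this adjustment, with the common Bonferroni correction as an example:

```python
# With m independent tests each at level alpha, the chance of at least one
# false positive (the family-wise error rate, FWER) grows quickly:
alpha, m = 0.05, 20
fwer = 1 - (1 - alpha) ** m
print(fwer)  # ~0.64: some "significant" result is more likely than not

# Bonferroni correction: test each hypothesis at level alpha / m instead.
fwer_bonferroni = 1 - (1 - alpha / m) ** m
print(fwer_bonferroni)  # back below 0.05
```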

60 of 67

P(detecting no effect when there is none)

P(detecting no effect when there is none, on every experiment)

61 of 67

Family-wise Error Rate Corrections

62 of 67

63 of 67

Feedback

Give us feedback on this lecture here: https://go.epfl.ch/ada2018-lec5-feedback

  • What did you (not) like about this lecture?
  • What was (not) well explained?
  • On what would you like more (fewer) details?
  • Where is Waldo?

64 of 67

Non-Parametric Tests

All the tests so far are parametric tests that assume the data are normally distributed, and that the samples are independent of each other and all have the same distribution (IID).

They may be arbitrarily inaccurate if those assumptions are not met.

Always make sure your data satisfy the assumptions of the test you’re using, i.e., watch out for:

  • Outliers – they corrupt many tests that rely on variance estimates.
  • Correlated samples – e.g., repeated measurements on the same subject violate independence.
  • Skewed distributions – they can make the results of such tests invalid.

65 of 67

Non-parametric tests

These tests make no assumption about the distribution of the input data, and can be used on very general datasets:

  • K-S test (today)
  • Permutation tests and Bootstrap confidence intervals (we will see them in the following lectures)

66 of 67

K-S test

The K-S (Kolmogorov-Smirnov) test is a very useful test for checking whether two (continuous or discrete) distributions are the same.

In the one-sample test, an observed distribution (e.g., some observed values or a histogram) is compared against a reference distribution (e.g., a power law).

In the two-sample test, two observed distributions are compared.

The K-S statistic is just the maximum distance between the CDFs of the two distributions.
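
The two-sample statistic can be computed directly from this definition (a plain-Python sketch on synthetic data; real analyses would use scipy.stats.ks_2samp, which also returns a p-value):

```python
import random

def ks_statistic(a, b):
    """Two-sample K-S statistic: max distance between the empirical CDFs."""
    def ecdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)  # fraction of values <= x
    # The maximum gap is attained at one of the observed values.
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in sorted(a + b))

random.seed(0)
same = [random.gauss(0, 1) for _ in range(300)]
also = [random.gauss(0, 1) for _ in range(300)]
diff = [random.expovariate(1.0) for _ in range(300)]

print(ks_statistic(same, also))  # small: same underlying distribution
print(ks_statistic(same, diff))  # large: clearly different distributions
```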

67 of 67

K-S test

The K-S test can be used to test whether a data sample has a normal distribution or not.

Thus it can be used as a sanity check for any common parametric test (which assumes normally-distributed data).

It can also be used to compare distributions of data values in a large data pipeline: Most errors will distort the distribution of a data parameter and a K-S test can detect this.