Applied Data Analysis (CS401)
Robert West
Lecture 5
Read the Stats Carefully
2018/10/18
Announcements
Feedback
3
Give us feedback on this lecture here: https://go.epfl.ch/ada2018-lec5-feedback
Overview
ADA won’t cover the basics of stats!
You know these things from prerequisite courses
But stats are a key ingredient of data analysis
Today: some highlights and common pitfalls
Descriptive statistics
Mean, variance, and normal distribution
The (arithmetic) mean of a set of values is just the average of the values.
Variance is a measure of the width of a distribution. Specifically, the variance is the mean squared deviation of points from the mean: σ² = (1/n) Σᵢ (xᵢ − μ)²
The standard deviation (std) is the square root of variance.
The normal distribution is completely characterized by mean and std.
[Figure: normal distribution, with the mean and standard deviation marked]
Robust statistics
A statistic is said to be robust if it is not sensitive to outliers
Min, max, mean, std are not robust
Median, quartiles (and others) are robust
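As a quick sketch of this (in)sensitivity, a single outlier can move the mean arbitrarily far while leaving the median almost unchanged (data made up for illustration):

```python
from statistics import mean, median

data = [1, 2, 3, 4, 5]
with_outlier = data + [1000]   # one extreme value

print(mean(data), median(data))                  # 3 and 3
print(mean(with_outlier), median(with_outlier))  # mean jumps to ~169.2, median only to 3.5
```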
Heavy-tailed distributions
Distributions
Some other important distributions:
You should understand the distribution of your data before applying any model!
“Dear data, where are you from?”
Box plot
(Smoothed) histogram
Quantile-quantile (QQ) plots
Recognizing a power law
Log-log axes
F(x) = 4x^(−2)
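On log-log axes the power law F(x) = 4x^(−2) becomes the straight line log F = log 4 − 2 log x, so a linear fit in log-log space recovers the exponent. A minimal sketch with NumPy (the data are generated from the formula itself, with no noise):

```python
import numpy as np

x = np.array([1., 2., 4., 8., 16.])
F = 4 * x ** -2.0   # exact power law

# Fit a line to (log x, log F): slope = exponent, exp(intercept) = prefactor
slope, intercept = np.polyfit(np.log(x), np.log(F), 1)
print(slope)              # ≈ -2.0
print(np.exp(intercept))  # ≈ 4.0
```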
Sampling and uncertainty
Measurement on samples
Datasets are samples from an underlying distribution.
We are most interested in measures on the entire population, but we have access only to a sample of it.
That makes measurement hard:
variation between samples
systematic variation from the population value
Sampling is tricky
So you have a biased dataset...
Who likes Snickers better?
[Figures: survey results for two groups, with standard deviations shown in red]
Quantifying uncertainty
Error bars
Bootstrap resampling
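A minimal sketch of bootstrap resampling for an error bar on the mean (the data and the number of resamples are made up for illustration): resample the data with replacement many times, recompute the statistic each time, and take percentiles of the resulting distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.array([2.1, 3.4, 1.9, 4.2, 3.3, 2.8, 5.0, 3.7, 2.5, 3.1])

# Resample with replacement; compute the statistic on each resample
boot_means = [rng.choice(data, size=len(data), replace=True).mean()
              for _ in range(10_000)]

# 95% bootstrap confidence interval for the mean
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(data.mean(), (lo, hi))
```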
Relating two variables
Pearson’s correlation coefficient
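Pearson’s r measures only linear association; a perfect nonlinear dependence can still yield r = 0. A quick illustration (data made up):

```python
import numpy as np

x = np.array([-2., -1., 0., 1., 2.])
y = x ** 2                    # y is a deterministic function of x

r = np.corrcoef(x, y)[0, 1]   # Pearson correlation coefficient
print(r)                      # 0.0: no *linear* relationship detected
```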
Anscombe’s quartet
Anscombe’s triplet
[Figures: sales-vs-time plots for Companies A, B, and C]
Anscombe’s quartet
Anscombe’s quartet
Illustrates the importance of looking at a dataset graphically before starting to analyze it
Highlights the inadequacy of basic statistical properties for describing realistic datasets
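To illustrate, the first two of Anscombe’s four datasets (values reproduced from the commonly cited 1973 data) have virtually identical means, variances, and correlations, despite looking completely different when plotted:

```python
import numpy as np

x  = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

for y in (y1, y2):
    r = np.corrcoef(x, y)[0, 1]
    # Both datasets: mean ≈ 7.50, variance ≈ 4.13, correlation ≈ 0.82
    print(round(y.mean(), 2), round(y.var(ddof=1), 2), round(r, 2))
```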
34
[Quiz figures: sales-vs-time plots for Companies A, B, and C, annotated from weakest to strongest dependence. Good solutions take the temporal order of the points into account; incorrect solutions do not, and the correlation coefficient is nearly identical for all plots.]
Correlation coefficients are tricky!
Ice cream sales vs. deaths by drowning, anyone?
[Diagram: temperature drives both ice cream sales and the number of swimmers, and more swimmers means more deaths by drowning. A common cause produces the correlation, not causation.]
UC Berkeley gender bias (?)
Admission figures from 1973
Simpson’s paradox
When a trend appears in several groups of data but disappears or reverses when the groups are combined: beware of aggregates!
In the previous example, women tended to apply to competitive departments with low admission rates
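A toy illustration of the paradox (numbers invented, not the real 1973 figures): each department admits women at a higher rate, yet the aggregate favors men, because women mostly applied to the harder department.

```python
# (applicants, admitted) per department and gender; numbers are made up
easy = {"men": (80, 60), "women": (20, 16)}   # admit rates: 75% vs. 80%
hard = {"men": (20, 4),  "women": (80, 20)}   # admit rates: 20% vs. 25%

agg = {}
for g in ("men", "women"):
    apps = easy[g][0] + hard[g][0]
    adm = easy[g][1] + hard[g][1]
    agg[g] = adm / apps   # aggregate admission rate across departments
    print(g, easy[g][1] / easy[g][0], hard[g][1] / hard[g][0], agg[g])

# Within each department women are admitted at a higher rate,
# yet in aggregate: men 0.64 vs. women 0.36
```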
Hypothesis testing
Rhine’s paradox
Joseph Rhine was a parapsychologist in the 1950s (founder of the Journal of Parapsychology and the Parapsychological Society, an affiliate of the AAAS).
He ran an experiment where subjects had to guess whether 10 hidden cards were red or blue.
He found that about 1 person in 1000 had ESP (extrasensory perception), i.e., they guessed the color of all 10 cards correctly!
Q: Do you agree?
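Before agreeing, it helps to compute the base rate: guessing 10 binary outcomes correctly by pure chance has probability (1/2)^10 ≈ 0.001, so among roughly 1000 subjects we expect about one perfect score even if nobody has ESP.

```python
n_cards = 10
p_perfect = 0.5 ** n_cards     # probability of guessing all 10 cards by chance
print(p_perfect)               # 0.0009765625, i.e. about 1 in 1000

n_subjects = 1000
print(n_subjects * p_perfect)  # expected number of perfect scorers with no ESP: ~1
```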
Rhine’s paradox
He called back the “psychic” subjects and had them do the same test again. They all failed.
He concluded that the act of telling psychics that they have psychic abilities causes them to lose them…
Which of the following statements about p-values are true?
Preliminaries
Hypothesis testing
Reasoning:
* Because the sum of two independent, normally distributed variables is also normally distributed.
Two-tailed significance
When the p-value is less than 5% (p < .05), we reject the null hypothesis at the 5% significance level
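A minimal sketch with SciPy (data made up): `ttest_ind` returns a two-sided p-value by default, which we compare against the 5% threshold.

```python
from scipy import stats

a = [5.1, 4.9, 5.3, 5.0, 4.8, 5.2]
b = [5.0, 5.2, 4.9, 5.1, 5.3, 4.7]   # drawn around the same mean as a
c = [6.1, 5.9, 6.3, 6.0, 5.8, 6.2]   # shifted up by about 1

t_same, p_same = stats.ttest_ind(a, b)    # two-sided p-value by default
t_shift, p_shift = stats.ttest_ind(a, c)

print(p_same > 0.05)    # True: fail to reject the null (no detectable difference)
print(p_shift < 0.05)   # True: reject the null
```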
Select your tests
Testing is a bit like finding the right recipe based on these ingredients:
i) Question, ii) Data type, iii) Sample size, iv) Variance known?, v) Variance of several groups equal?
Good news: Plenty of tables available!
http://www.ats.ucla.edu/stat/mult_pkg/whatstat/default.htm (with examples in R, SAS, Stata, SPSS)
http://sites.stat.psu.edu/~ajw13/stat500_su_res/notes/lesson14/images/summary_table.pdf
p-values
“p-value hacking”
Alternative approach: Bayes factors
Multiple-hypothesis testing
If you perform experiments over and over, you’re bound to find something
Significance level must be adjusted down when performing multiple hypothesis tests!
P(detecting no effect when there is none) = 1 − α for a single test
P(detecting no effect when there is none, on every one of k independent experiments) = (1 − α)^k, so the chance of at least one false positive is 1 − (1 − α)^k, which grows quickly with k
Family-wise Error Rate Corrections
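A sketch of why correction is needed, and of the simplest (Bonferroni) fix: with k independent tests at α = 0.05 the chance of at least one false positive grows quickly, while testing each hypothesis at α/k keeps the family-wise error rate at or below α.

```python
alpha, k = 0.05, 20

# Family-wise error rate without correction: 1 - (1 - alpha)^k
fwer_uncorrected = 1 - (1 - alpha) ** k
print(fwer_uncorrected)        # ≈ 0.64: very likely to "find" something by chance

# Bonferroni: test each hypothesis at alpha / k
alpha_bonferroni = alpha / k
fwer_corrected = 1 - (1 - alpha_bonferroni) ** k
print(fwer_corrected)          # ≈ 0.049, at most alpha
```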
Non-Parametric Tests
All the tests so far are parametric tests that assume the data are normally distributed, and that the samples are independent of each other and all have the same distribution (IID).
They may be arbitrarily inaccurate if those assumptions are not met.
Always make sure your data satisfies the assumptions of the test you’re using, i.e. watch out for:
Non-parametric tests
These tests make no assumption about the distribution of the input data, and can be used on very general datasets:
K-S test
The K-S (Kolmogorov-Smirnov) test is a very useful test for checking whether two (continuous or discrete) distributions are the same.
In the one-sample test, an observed distribution (e.g., some observed values or a histogram) is compared against a reference distribution (e.g., a power law)
In the two-sample test, two observed distributions are compared.
The K-S statistic is just the max distance between the CDFs of the two distributions.
K-S test
The K-S test can be used to test whether a data sample has a normal distribution or not.
Thus it can be used as a sanity check for any common parametric test (which assumes normally-distributed data).
It can also be used to compare distributions of data values in a large data pipeline: Most errors will distort the distribution of a data parameter and a K-S test can detect this.
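A minimal sketch with SciPy: `kstest` compares a sample against a reference distribution (the one-sample test), and `ks_2samp` compares two samples. The samples below are made up; the "normal" sample is built from ideal normal quantiles, so the test should clearly fail to reject.

```python
import numpy as np
from scipy import stats

# An idealized "normal" sample: quantiles of the standard normal distribution
normal_like = stats.norm.ppf(np.linspace(0.01, 0.99, 99))
uniform_like = np.linspace(0.0, 1.0, 99)   # clearly not standard normal

stat_n, p_n = stats.kstest(normal_like, "norm")
stat_u, p_u = stats.kstest(uniform_like, "norm")
print(p_n, p_u)   # large p for the normal-like sample, tiny p for the uniform one

# Two-sample version: are two observed samples from the same distribution?
stat_2, p_2 = stats.ks_2samp(normal_like, uniform_like)
print(p_2)        # tiny p: the two samples differ
```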