Applied Data Analysis (CS401)
Lecture 4
Describing data
2 Oct 2024
Maria Brbic / Robert West
Announcements
Feedback
Give us feedback on this lecture here: https://go.epfl.ch/ada2024-lec4-feedback
Overview of today’s lecture
ADA won’t cover the basics of stats!
You know these things from prerequisite courses
But stats is a key ingredient of data analysis
Today: some highlights and common pitfalls
Part 1
Descriptive statistics
Means: micro- vs. macro-average
Micro-average: average over all individual data points, pooled across groups
Macro-average (a.k.a. “grand mean”): average of the group averages, so every group counts equally
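A minimal sketch of how the two averages can differ. The two groups and their values are made up for illustration; they are not from the lecture.

```python
from statistics import mean

# Hypothetical data: two groups of very different sizes.
groups = {
    "a": [1.0] * 8,  # 8 data points
    "b": [5.0] * 2,  # 2 data points
}

# Micro-average: pool all data points, then average once.
all_points = [x for values in groups.values() for x in values]
micro_avg = mean(all_points)

# Macro-average: average within each group, then average the group averages.
group_avgs = {name: mean(values) for name, values in groups.items()}
macro_avg = mean(list(group_avgs.values()))

print(f"micro-average = {micro_avg:.2f}")  # dominated by the large group
print(f"macro-average = {macro_avg:.2f}")  # each group counts equally
```

When group sizes are unbalanced, the two numbers can be far apart, so it is worth stating which one is reported.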
Robust statistics
A statistic is said to be robust if it is not sensitive to extreme values
Min, max, mean, std are not robust
Median, quartiles (and others) are robust
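A small illustration of robustness, assuming a toy sample to which one extreme value is appended (the numbers are hypothetical):

```python
from statistics import mean, median, stdev

values = [10, 11, 12, 13, 14]
with_outlier = values + [1000]  # one hypothetical extreme value

for name, data in [("clean", values), ("with outlier", with_outlier)]:
    print(
        f"{name:>12}: mean={mean(data):.1f}  std={stdev(data):.1f}  "
        f"median={median(data):.1f}  max={max(data)}"
    )
```

A single extreme value moves the mean, standard deviation, and max substantially, while the median barely changes.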
Heavy-tailed distributions
Generalized means [Wikipedia]
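As a reminder, the generalized (power) mean with exponent p of positive numbers x_1, ..., x_n is M_p = ((1/n) Σ x_i^p)^(1/p); p = 1 gives the arithmetic mean, p = -1 the harmonic mean, and the limit p → 0 the geometric mean. A minimal sketch, where the function name `power_mean` and the sample values are illustrative choices:

```python
import math

def power_mean(xs, p):
    """Generalized (power) mean of positive numbers xs with exponent p."""
    n = len(xs)
    if p == 0:
        # Limit p -> 0 is the geometric mean; computed via logarithms.
        return math.exp(sum(math.log(x) for x in xs) / n)
    return (sum(x ** p for x in xs) / n) ** (1 / p)

xs = [1.0, 2.0, 4.0]  # hypothetical sample
for p, label in [(-1, "harmonic"), (0, "geometric"), (1, "arithmetic"), (2, "quadratic")]:
    print(f"p={p:>2} ({label}): {power_mean(xs, p):.3f}")
```

For a fixed sample, the power mean does not decrease as p grows, which the printed values reflect.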
Distributions
Some important distributions:
You should understand the distribution of your data before applying any model!
“Dear data, where are you from?”
Box plot
(Smoothed) histogram
Recognizing a power law
Log-log axes
$F(x) = 10000\,x^{-2}$
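Why log-log axes help: taking logarithms of F(x) = 10000 x^(-2) gives log10 F(x) = 4 - 2 log10 x, a straight line with slope -2. The sketch below checks this numerically on a few exact points of that function (the evaluation points are arbitrary). With real data, a roughly straight line on log-log axes is only suggestive of a power law, not proof.

```python
import math

def F(x):
    """The power-law example from the slide: F(x) = 10000 * x^(-2)."""
    return 10000 * x ** -2

xs = [1, 10, 100, 1000]
log_x = [math.log10(x) for x in xs]
log_y = [math.log10(F(x)) for x in xs]

# Ordinary least-squares line through the points in log-log space.
n = len(xs)
mean_x = sum(log_x) / n
mean_y = sum(log_y) / n
slope = (
    sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    / sum((a - mean_x) ** 2 for a in log_x)
)
intercept = mean_y - slope * mean_x

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

The slope recovers the exponent and the intercept recovers log10 of the constant factor.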
Who likes Snickers better?
Part 2
Quantifying uncertainty
Who likes Snickers better?
Standard deviation
mean
Be certain to quantify your uncertainty!
How to quantify uncertainty?
Approach 1:
Hypothesis testing
THINK FOR A MINUTE:
Which of these statements about p-values are true?
(Feel free to discuss with your neighbor.)
POLLING TIME
Hypothesis testing: intro
Autonomy Corp
Joseph B. Rhine was a parapsychologist in the 1950s (founder of the Journal of Parapsychology and the Parapsychological Association, an affiliate of the AAAS).
He ran an experiment where subjects had to guess whether 10 hidden cards were red or blue.
He found that about 1 person in 1000 had ESP (“extrasensory perception”), i.e., they could guess the color of all 10 cards!
Q: Do you agree?
Okay… But what happened to Joseph Rhine?
He called back the “psychic” subjects and had them do the same test again. They all failed.
He concluded that the act of telling psychics that they have psychic abilities causes them to lose them…
If there is no real effect, how likely is it that I would observe data at least as extreme as what I actually observed?
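Applied to Rhine's experiment, this question can be answered with a short calculation. The sketch assumes that every subject guesses each of the 10 cards independently with probability 1/2; the number of subjects (`n_subjects = 1000`) is a hypothetical value chosen to match the "1 person in 1000" figure.

```python
n_cards = 10
n_subjects = 1000  # hypothetical; matches the "1 in 1000" order of magnitude

# Probability that pure guessing gets all 10 colors right.
p_all_correct = 0.5 ** n_cards

# Expected number of "psychics" among the subjects if nobody has ESP.
expected_lucky = n_subjects * p_all_correct

# Probability that at least one subject gets all 10 right by luck alone.
p_at_least_one = 1 - (1 - p_all_correct) ** n_subjects

print(f"P(all 10 correct by chance) = {p_all_correct:.6f} (1 in {1 / p_all_correct:.0f})")
print(f"expected lucky subjects     = {expected_lucky:.2f}")
print(f"P(at least one lucky)       = {p_at_least_one:.2f}")
```

Under these assumptions, about 1 subject in 1024 gets a perfect score by luck, so the observed rate is what one would expect with no ESP at all.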
Hypothesis testing
Commercial break
The logic of hypothesis testing
Coin example
Coin example (cont’d)
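A minimal sketch of a coin-flip test, assuming the null hypothesis "the coin is fair" and a made-up observation of 60 heads in 100 flips. These numbers and the helper `binom_tail` are illustrative and not taken from the slides.

```python
from math import comb

def binom_tail(n, k, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))

n_flips, n_heads = 100, 60  # hypothetical observation

# One-sided p-value: probability of at least this many heads under a fair coin.
p_one_sided = binom_tail(n_flips, n_heads)

# Two-sided p-value: doubling is valid here because Binomial(n, 0.5) is symmetric.
p_two_sided = min(1.0, 2 * p_one_sided)

print(f"one-sided p-value = {p_one_sided:.4f}")
print(f"two-sided p-value = {p_two_sided:.4f}")
```

With these numbers the one-sided p-value falls below 0.05 while the two-sided one does not, which shows that the choice of alternative hypothesis should be fixed before looking at the data.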
Selecting the right test
There are many statistical tests (see next slide)
Although they differ in their details, the basic logic is always the same (previous slides)
The right choice of test depends on multiple factors (here a selection):
Good news: Plenty of advice available (see next slide)
Remarks on p-values
Ronald Fisher
Alternative approach: Bayes factors
How to quantify uncertainty?
Approach 2:
Confidence intervals
Confidence intervals: idea
Who likes Snickers better?
Confidence intervals: definition
[Figure: labels 𝜇0 and m]
How to compute confidence intervals?
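One common parametric recipe is the normal approximation for the mean: m ± z · s / √n, where z is the standard-normal quantile for the desired confidence level γ. The sketch below assumes this approach; the function name and the ratings are hypothetical. For small samples, a Student's t quantile would be more appropriate than z.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(data, gamma=0.95):
    """Normal-approximation confidence interval for the mean of data."""
    n = len(data)
    m = mean(data)
    standard_error = stdev(data) / sqrt(n)
    # Quantile that leaves (1 - gamma) / 2 probability mass in each tail.
    z = NormalDist().inv_cdf(0.5 + gamma / 2)
    return m - z * standard_error, m + z * standard_error

ratings = [6, 7, 5, 8, 7, 6, 9, 5, 7, 6, 8, 7]  # hypothetical ratings
ci_low, ci_high = mean_confidence_interval(ratings)
print(f"mean = {mean(ratings):.2f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
```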
Confidence intervals: another view
[Figure: sample means m1, m2, m3, m4, …, mi and m shown relative to 𝜇 and 𝜇0; central probability mass γ (e.g., 95%), with (1-γ)/2 (e.g., 2.5%) in each tail]
Non-parametric CIs: bootstrap resampling
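A minimal sketch of the percentile bootstrap, assuming the statistic of interest is the mean: resample the data with replacement many times, compute the statistic on each resample, and read off the empirical quantiles. The function name, data, and number of resamples are illustrative choices.

```python
import random
from statistics import mean

def bootstrap_ci(data, stat=mean, gamma=0.95, n_resamples=2000, seed=0):
    """Percentile-bootstrap confidence interval for stat(data)."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    n = len(data)
    # Each resample has the same size as the data and is drawn with replacement.
    stats = sorted(stat(rng.choices(data, k=n)) for _ in range(n_resamples))
    lower_idx = int(((1 - gamma) / 2) * n_resamples)
    upper_idx = int((1 - (1 - gamma) / 2) * n_resamples) - 1
    return stats[lower_idx], stats[upper_idx]

data = [6, 7, 5, 8, 7, 6, 9, 5, 7, 6, 8, 7]  # hypothetical sample
boot_low, boot_high = bootstrap_ci(data)
print(f"mean = {mean(data):.2f}, 95% bootstrap CI = [{boot_low:.2f}, {boot_high:.2f}]")
```

The same function works for statistics without a simple closed-form CI, for example by passing `stat=median`.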
[Figure: resampled means m1, m2, m3, m4]
Error bars
Multiple-hypothesis testing
P(detecting no effect when there is none) = 1 - α for a single test at significance level α
P(detecting no effect when there is none, on every experiment) = (1 - α)^k for k independent experiments
Family-wise error rate corrections
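A sketch of why corrections are needed, assuming k independent tests in which every null hypothesis is true, each run at significance level α = 0.05. The probability of at least one false positive (the family-wise error rate) is 1 - (1 - α)^k. The Bonferroni correction, one standard choice, tests each hypothesis at level α / k instead. The values of k are arbitrary.

```python
alpha = 0.05  # per-test significance level

def family_wise_error_rate(k, per_test_alpha):
    """P(at least one false positive in k independent tests of true nulls)."""
    return 1 - (1 - per_test_alpha) ** k

for k in [1, 5, 10, 20, 100]:
    uncorrected = family_wise_error_rate(k, alpha)
    bonferroni = family_wise_error_rate(k, alpha / k)
    print(f"k={k:>3}: uncorrected FWER={uncorrected:.3f}  Bonferroni FWER={bonferroni:.3f}")
```

Without correction the family-wise error rate grows quickly with k; with Bonferroni it stays at or below α.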
Part 3
Relating two variables
Pearson’s correlation coefficient
Correlation coefficients are tricky!
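One reason they are tricky: Pearson's r measures only linear association. In the sketch below (toy data, helper name `pearson_r` chosen for illustration), y is fully determined by x in both cases, yet r is 1 for the linear relation and 0 for the symmetric quadratic one.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

xs = [-3, -2, -1, 0, 1, 2, 3]
linear = [2 * x + 1 for x in xs]   # perfectly linear dependence
quadratic = [x ** 2 for x in xs]   # perfect but non-linear dependence

print(f"r(x, 2x + 1) = {pearson_r(xs, linear):.2f}")
print(f"r(x, x^2)    = {pearson_r(xs, quadratic):.2f}")
```

A correlation of zero therefore does not imply that two variables are unrelated.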
Anscombe’s quartet
Illustrates the importance of looking at a set of data graphically before starting to analyze
Highlights the inadequacy of basic statistical properties for describing realistic datasets
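The sketch below recomputes the summary statistics for the four datasets, using the values as published by Anscombe (1973) and typed in by hand here. The helper `pearson_r` is defined locally for self-containment.

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Pearson's correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

summary = {}
for name, (xs, ys) in quartet.items():
    summary[name] = (mean(xs), mean(ys), pearson_r(xs, ys))
    print(f"{name:>3}: mean(x)={summary[name][0]:.2f}  "
          f"mean(y)={summary[name][1]:.2f}  r={summary[name][2]:.3f}")
```

All four datasets share nearly identical means and correlation, even though their scatter plots look completely different.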
UC Berkeley gender bias (?)
Admission figures from 1973
[Figure: admission rate of male and female applicants in Engineering, in Arts & humanities, and on average]
Simpson’s paradox
A trend appears in different groups of data but disappears or reverses when these groups are combined: beware of aggregates!
In the previous example, women tended to apply to competitive departments with low rates of admission
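A minimal sketch of the reversal with made-up numbers (these are not the real 1973 Berkeley figures): women have the higher admission rate in each department, but most women apply to the department that admits few applicants, so their aggregate rate is lower.

```python
# (admitted, applicants) per department and gender; all numbers are hypothetical.
data = {
    "easy dept": {"men": (80, 100), "women": (18, 20)},
    "hard dept": {"men": (4, 20), "women": (30, 100)},
}

# Admission rate within each department.
per_dept = {
    dept: {gender: admitted / applicants
           for gender, (admitted, applicants) in by_gender.items()}
    for dept, by_gender in data.items()
}

# Admission rate after pooling both departments.
overall = {}
for gender in ["men", "women"]:
    admitted = sum(data[dept][gender][0] for dept in data)
    applicants = sum(data[dept][gender][1] for dept in data)
    overall[gender] = admitted / applicants

for dept, rates in per_dept.items():
    print(f"{dept}: men {rates['men']:.0%}, women {rates['women']:.0%}")
print(f"overall:   men {overall['men']:.0%}, women {overall['women']:.0%}")
```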
Summary