
Applied Data Analysis (CS401)

Lecture 4

Describing data

2 Oct 2024

Maria Brbic / Robert West


Announcements


  • Project milestone P1 due this Fri Oct 4th 23:59
    • Remember: we won’t answer questions in final 24h
  • Homework H1 to be released this Fri Oct 4th, due Fri Oct 18th
  • Friday’s lab session:
    • One single room: CO1
    • Exercises on topic of this lecture (Exercise 3)
  • Reminder to solve quizzes after the lecture!
  • Exam date: Jan 14th at 3:15pm


Feedback


Give us feedback on this lecture here: https://go.epfl.ch/ada2024-lec4-feedback

  • What did you (not) like about this lecture?
  • What was (not) well explained?
  • On what would you like more (fewer) details?


Overview of today’s lecture

  • Part 1: Descriptive statistics
  • Part 2: Quantifying uncertainty
  • Part 3: Relating two variables



ADA won’t cover the basics of stats!

You know these things from prerequisite courses

But stats is a key ingredient of data analysis

Today: some highlights and common pitfalls



Part 1

Descriptive statistics


Descriptive statistics



Means: micro- vs. macro-average


micro-average: the average over all individual data points (larger groups weigh more)

macro-average (a.k.a. “grand mean”) = average of the group averages (each group weighs equally)
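As a minimal sketch (with made-up group values for illustration), the distinction can be computed directly:

```python
import numpy as np

# Hypothetical ratings from two groups of very different sizes
group_a = np.array([4.0, 4.0, 4.0, 4.0])  # 4 raters
group_b = np.array([1.0])                  # 1 rater

# Micro-average: pool all observations, then average
micro = np.mean(np.concatenate([group_a, group_b]))  # (4*4 + 1) / 5 = 3.4

# Macro-average ("grand mean"): average the group averages
macro = np.mean([group_a.mean(), group_b.mean()])    # (4 + 1) / 2 = 2.5

print(micro, macro)
```

The two can differ a lot when group sizes are unbalanced, so say which one you report.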


Robust statistics

A statistic is said to be robust if it is not sensitive to extreme values

Min, max, mean, std are not robust

Median, quartiles (and others) are robust

Check these Wikipedia pages
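A quick illustration of (non-)robustness, using a hypothetical sample with one extreme value:

```python
import numpy as np

data = np.arange(1, 10)           # 1..9: mean = 5.0, median = 5.0
outlier = np.append(data, 1000)   # add one extreme value

print(np.mean(data), np.median(data))        # 5.0 5.0
print(np.mean(outlier), np.median(outlier))  # 104.5 5.5: mean explodes, median barely moves
```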



Heavy-tailed distributions

  • Some distributions are all about the extreme values
  • E.g., power laws (see last lecture):
    • Very very large values are rare, “but not very rare”
    • Body size vs. city size
    • For k ≤ 3: infinite variance
    • For k ≤ 2: infinite variance, infinite mean
    • Don’t report (arithmetic) mean/variance for power-law-distributed data!
    • Use robust statistics (e.g., median, quantiles, etc.) or the geometric mean (p.t.o.)



Generalized means [Wikipedia]

  • Common trick: transform data into a different space (via function f), take the mean there, then transform back into the original space (via f⁻¹):
  • f(x) = x, f⁻¹(x) = x: “arithmetic mean”
  • f(x) = log(x), f⁻¹(x) = exp(x): “geometric mean”
  • f(x) = 1/x, f⁻¹(x) = 1/x: “harmonic mean”
  • f(x) = x², f⁻¹(x) = √x: “root mean square”
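The trick can be sketched in a few lines of Python (the helper name is my own, for illustration):

```python
import numpy as np

def generalized_mean(x, f, f_inv):
    """Transform via f, average in the transformed space, transform back via f_inv."""
    return f_inv(np.mean(f(np.asarray(x, dtype=float))))

x = [1.0, 10.0, 100.0]
arithmetic = generalized_mean(x, lambda v: v, lambda v: v)       # 37.0
geometric  = generalized_mean(x, np.log, np.exp)                 # exp(mean(log x)) = 10.0
harmonic   = generalized_mean(x, lambda v: 1 / v, lambda v: 1 / v)
rms        = generalized_mean(x, np.square, np.sqrt)

print(arithmetic, geometric, harmonic, rms)
```

Note how the geometric mean (10.0) is far below the arithmetic mean (37.0) for this skewed sample, which is exactly why it suits heavy-tailed data.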



Distributions



Distributions

Some important distributions:

  • Normal: see previous slides
  • Poisson: the distribution of counts that occur at a certain “rate”; e.g., number of visits to a given website in a fixed time interval.
  • Exponential: the interval between two such events.
  • Binomial/multinomial: The number of “successes” (e.g., coin flips = heads) out of n trials.
  • Power-law/Zipf/Pareto/Yule: e.g., frequencies of different terms in a document; city size

You should understand the distribution of your data before applying any model!
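The Poisson/exponential connection above can be checked by simulation; a sketch with an assumed rate of 4 events per unit time:

```python
import numpy as np

rng = np.random.default_rng(42)
rate = 4.0  # events per unit time (illustrative choice)

# Exponential: gaps between consecutive events occurring at this rate
gaps = rng.exponential(scale=1 / rate, size=200_000)
times = np.cumsum(gaps)

# Poisson: number of events falling into each unit-length interval
horizon = int(times[-1])
counts = np.histogram(times[times < horizon], bins=horizon, range=(0, horizon))[0]

# For a Poisson distribution, mean and variance both equal the rate
print(counts.mean(), counts.var())
```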



“Dear data, where are you from?”

  • Visual inspection for ruling out certain distributions: e.g., when the histogram/box plot is asymmetric (even for large sample sizes), the data cannot be normal

  • Statistical tests:
    • Goodness-of-fit tests
    • Kolmogorov-Smirnov test
    • Normality tests
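A sketch of such tests using SciPy (sample sizes and distributions chosen for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_data = rng.normal(loc=0, scale=1, size=500)
skewed_data = rng.exponential(scale=1, size=500)

# Shapiro-Wilk normality test: a low p-value argues against normality
_, p_normal = stats.shapiro(normal_data)
_, p_skewed = stats.shapiro(skewed_data)  # should be tiny for exponential data

# Kolmogorov-Smirnov goodness-of-fit test against a fully specified N(0, 1)
_, p_ks = stats.kstest(normal_data, "norm", args=(0, 1))

print(p_normal, p_skewed, p_ks)
```

Note that with large samples even tiny deviations become “significant”, so combine tests with visual inspection.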


[Figure: box plot and (smoothed) histogram]


Recognizing a power law


[Figure: F(x) = 10000·x² plotted on log-log axes; a power law appears as a straight line]


Who likes Snickers better?



Part 2

Quantifying uncertainty


Who likes Snickers better?

  • Most straightforward descriptive statistic to answer this question:
  • Mean for each group (women, men)



Who likes Snickers better?

[Figure: ratings per group, annotated with mean and standard deviation]



Be certain to quantify your uncertainty!

  • Finite samples introduce uncertainty
    • Even a complete dataset is a finite sample!
  • Whenever you report a statistic, you need to quantify how certain you are in it!
  • We will discuss two ways of quantifying uncertainty: (1) hypothesis testing, (2) confidence intervals
  • All plots should have error bars!



How to quantify uncertainty?

Approach 1:

Hypothesis testing


THINK FOR A MINUTE:

Which of these statements�about p-values are true?

(Feel free to discuss with your neighbor.)


POLLING TIME

  • “Which of these statements about p-values are true?”
  • Scan QR code or go to https://go.epfl.ch/ada2024-lec4-poll


Hypothesis testing: intro



Joseph B. Rhine was a parapsychologist in the 1950s (founder of the Journal of Parapsychology and the Parapsychological Society, an affiliate of the AAAS).

He ran an experiment where subjects had to guess whether 10 hidden cards were red or blue.

He found that about 1 person in 1000 had ESP (“extrasensory perception”), i.e., they could guess the color of all 10 cards!

Q: Do you agree?


Hypothesis testing: intro



Okay… But what happened to Joseph Rhine?

He called back the “psychic” subjects and had them do the same test again. They all failed.

He concluded that the act of telling psychics that they have psychic abilities causes them to lose them…

 

If there is no real effect, how likely is it that I observe data as extreme as what I actually observed?


Hypothesis testing

  • A huge subject; can take entire classes on it
  • Many people don’t like it (cf. Bayesian vs. frequentist debate, a.k.a. war)
  • Need to understand basics even if you don’t use it yourself
  • Never use it without understanding exactly what you’re doing



Commercial break



The logic of hypothesis testing

  • Flip a coin 100 times; outcome: 40 heads; “Is the coin fair?”
  • Null hypothesis: “yes”; alternative hypothesis: “no”
  • “How likely would I be to see an outcome at least this extreme (i.e., ≤ 40 heads) if the null hypothesis were true (i.e., if the coin were fair, i.e., if we expect 50 heads)?”
  • If this probability is large, the null hypothesis suffices to explain the data (and is thus not rejected)
  • Otherwise, dig deeper in order to understand your data



The logic of hypothesis testing


  • Idea: Gain (weak and indirect) support for a hypothesis HA by ruling out a null hypothesis H0
  • by inspecting a test statistic: a measurement made on the data that is likely to be large under HA but small under H0


Coin example


  • Idea: Gain (weak and indirect) support for a hypothesis HA by ruling out a null hypothesis H0
    • H0: “the coin is fair” (simplest hypothesis, cf. Occam’s razor)
    • HA: “the coin is not fair (a.k.a. biased)”
  • by inspecting a test statistic: a measurement made on the data that is likely to be large under HA but small under H0
    • e.g., number of heads after 100 coin tosses (1-tailed)
    • e.g., abs(50 - number of heads after 100 coin tosses) (2-tailed)


Coin example (cont’d)

  • Null hypothesis H0: “the coin is fair”, i.e., “probability of heads = 0.5”
  • Test statistic s: abs(50 - # of empirically observed heads after 100 coin tosses)
  • Pr(S | H0): probability distribution of the test statistic, assuming that H0 is true
  • Decision rule: reject H0 if Pr(S ≥ s | H0) < α, i.e., if the probability of deviating from 50 heads at least as much as empirically observed is small
    • Pr(S ≥ s | H0) = “p-value”
    • α = “significance level”
  • α controls “false-rejection rate” (probability of rejecting H0 although it is true)
    • You as the data analyst choose α (common values: 5%, 1%, 0.5%, 0.1%)
    • Higher α → higher false-rejection rate
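The coin example can be computed directly; a sketch using SciPy’s binomial distribution:

```python
from scipy.stats import binom

n, k = 100, 40   # 100 tosses, 40 heads observed
alpha = 0.05

# Two-tailed p-value under H0: "the coin is fair" (p_heads = 0.5).
# By symmetry, deviating by >= 10 heads from 50 in either direction:
p_value = 2 * binom.cdf(min(k, n - k), n, 0.5)

print(p_value)          # about 0.057
print(p_value < alpha)  # False: H0 is not rejected at the 5% level
```

So 40 heads out of 100 is surprising, but (just barely) not surprising enough to reject fairness at α = 5%.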



Selecting the right test

There are many statistical tests (see next slide). Although they differ in their details, the basic logic is always the same (previous slides).

The right choice of test depends on multiple factors (here a selection):

  • Question
  • Data type (continuous vs. categorical; dimensionality; number of outcomes)
  • Sample size
  • When comparing two samples: same population or different populations?
  • Parametric assumptions about the distribution of the test statistic under the null hypothesis? (Less important for large sample sizes, due to the central limit theorem)

Good news: Plenty of advice available (p.t.o.)




Remarks on p-values

  • Widely used in all sciences
  • They are widely misunderstood!
  • Don’t use them if you don’t understand them!
  • Large p means that even under a simple null hypothesis your data would be quite likely
  • This tells you nothing about the alternative hypothesis



Remarks on p-values

  • Historically, not meant as a method for formally deciding whether a hypothesis is true or not
  • Rather, an informal tool for assessing a particular result
  • Low p-value means: “The simple null hypothesis doesn’t explain the data, so keep looking for other explanations!”
  • p = 0.05 means: if you repeated the experiment 20 times, you would expect to see data this extreme about once even under the null hypothesis → you might have “lucked out”
  • Look at the effect size (“y-axis”), not just the p-value!


[Photo: Ronald Fisher]


Remarks on p-values

  • Important to understand what p-values are
  • Maybe even more important to understand what they are not…
  • Read this paper: A Dirty Dozen: 12 P-Value Misconceptions




Alternative approach: Bayes factors

  • See here
  • Great (and amusing) explanation of the difference between the hypothesis-testing approach and the Bayesian approach: Chapter 37 in MacKay’s (free) book on “Information Theory, Inference, and Learning Algorithms”



How to quantify uncertainty?

Approach 2:

Confidence intervals


Confidence intervals: idea

Who likes Snickers better?

  • Confidence interval (CI) = a range of estimates for the parameter of interest (e.g., the mean) that seems reasonable given the observed data
  • Confidence level γ ⇒ “γ CI” (often γ = 95% ⇒ 95% CI)



Confidence intervals: definition

  • 𝜇: true value of parameter of interest
  • m: empirical estimate of parameter of interest
  • CIs and hypothesis testing are tightly connected:
    • γ CI contains those values 𝜇0 for which the null hypothesis “H0: 𝜇 = 𝜇0” cannot be rejected at significance level 1−γ




How to compute confidence intervals?

  • Parametric methods assume that the test statistic follows a known (typically normal) distribution → need to verify that this is actually true! Ugh…
  • Non-parametric methods make no assumptions about the distribution of the test statistic. They instead work by resampling the empirical data. → Yay!



Confidence intervals: another view

  • If we were to repeat the data collection N → ∞ independent times, we’d obtain N estimates of 𝜇: m1, …, mN
  • Average of mi’s will approach the true 𝜇 (by law of large numbers)
  • For a fraction γ of the N repetitions, mi lies within the γ CI around 𝜇
  • → May estimate CI from histogram of mi’s


[Figure: histogram of the estimates m1, m2, m3, m4, …, mi around the true 𝜇; the central probability mass γ (e.g., 95%) defines the CI, with (1−γ)/2 (e.g., 2.5%) in each tail]


Non-parametric CIs: bootstrap resampling


[Figure: bootstrap resamples (drawn with replacement) of the original dataset, each yielding one estimate m1, m2, m3, m4 of the statistic]
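A minimal bootstrap sketch for a percentile CI of the mean (data and number of resamples are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.exponential(scale=2.0, size=300)  # hypothetical skewed sample

# Bootstrap: resample the data with replacement, recompute the statistic each time
boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(10_000)
])

# 95% percentile CI: cut off 2.5% of the bootstrap estimates on each side
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {data.mean():.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

No distributional assumptions needed: the same recipe works for medians, quantiles, or any other statistic.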


Error bars

  • An important use case for CIs
  • But be careful! Error bars can potentially represent many things:
    • Confidence intervals (CIs)
    • Standard deviation (std)
    • Standard error of the mean: std/sqrt(n)
  • → Always ask, always tell what the error bars represent!
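To see how different the three choices can be, a sketch on a small made-up sample:

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(data)

std = data.std(ddof=1)       # spread of the *data*
sem = std / np.sqrt(n)       # uncertainty of the *mean*
ci_half = 1.96 * sem         # approximate 95% CI half-width (normal approximation)

# Three very different bar sizes for the same data and the same mean:
print(std, sem, ci_half)
```

The std bar is √n times wider than the SEM bar, so an unlabeled error bar is close to meaningless.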



Multiple-hypothesis testing

  • If you perform experiments over and over, you’re bound to find something
  • If you consider “at least one positive outcome” to be the manifestation of an underlying effect, the significance level must be adjusted down when performing multiple hypothesis tests!
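A back-of-the-envelope sketch of why the adjustment is needed, with Bonferroni shown as one common correction:

```python
# Family-wise error rate (FWER): probability of at least one false rejection
# among k independent tests, each run at significance level alpha
alpha, k = 0.05, 20

fwer = 1 - (1 - alpha) ** k
print(fwer)            # ~0.64: very likely to "find" something by pure chance

# Bonferroni correction: test each individual hypothesis at alpha / k
alpha_bonf = alpha / k
fwer_corrected = 1 - (1 - alpha_bonf) ** k
print(fwer_corrected)  # back below alpha
```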



If each individual test has false-rejection probability α:

P(detecting no effect when there is none) = 1 − α

P(detecting no effect when there is none, on every one of k experiments) = (1 − α)^k


Family-wise error rate corrections



Part 3

Relating two variables


Pearson’s correlation coefficient

  • “Amount of linear dependence”

  • More general:
    • Rank correlation, e.g., Spearman’s correlation coefficient
    • Mutual information
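A sketch contrasting the two coefficients on a monotone but nonlinear relationship:

```python
import numpy as np
from scipy import stats

x = np.arange(1.0, 11.0)
y = x ** 3  # monotone but strongly nonlinear

r_pearson, _ = stats.pearsonr(x, y)    # < 1: measures only *linear* dependence
r_spearman, _ = stats.spearmanr(x, y)  # = 1: the ranks agree perfectly

print(r_pearson, r_spearman)
```

Spearman’s coefficient is simply Pearson’s coefficient computed on the ranks, which is why it captures any monotone relationship.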



Correlation coefficients are tricky!



Anscombe’s quartet



Anscombe’s quartet

Illustrates the importance of looking at a dataset graphically before starting to analyze it

Highlights the inadequacy of basic statistical properties for describing realistic datasets

More on Wikipedia
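The quartet’s near-identical summary statistics can be verified directly (using the standard published values of the four datasets):

```python
import numpy as np

# Anscombe's quartet: datasets I-III share the same x values; IV differs
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
ys = [
    [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
]
xs = [x123, x123, x123, x4]

# Mean of y and Pearson correlation are near-identical for all four,
# even though the scatter plots look completely different
for x, y in zip(xs, ys):
    r = np.corrcoef(x, y)[0, 1]
    print(f"mean(y) = {np.mean(y):.2f}, corr(x, y) = {r:.3f}")
```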



UC Berkeley gender bias (?)

Admission figures from 1973


[Figure: admission rates for male vs. female applicants in Engineering, in Arts & humanities, and on average]


Simpson’s paradox

When a trend appears in different groups of data but disappears or reverses when these groups are combined: beware of aggregates!

In the previous example, women tended to apply to competitive departments with low rates of admission
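A sketch with made-up admission counts (purely illustrative, not the actual 1973 Berkeley figures) reproducing the reversal:

```python
# Hypothetical admission counts illustrating Simpson's paradox:
#                  men (admitted, applied)   women (admitted, applied)
depts = {
    "easy dept": ((80, 100),                (9, 10)),
    "hard dept": ((2, 10),                  (30, 100)),
}

def rate(admitted, applied):
    return admitted / applied

# Within EACH department, women have the higher admission rate...
for name, ((m_adm, m_app), (w_adm, w_app)) in depts.items():
    assert rate(w_adm, w_app) > rate(m_adm, m_app)
    print(name, rate(m_adm, m_app), rate(w_adm, w_app))

# ...yet aggregated over departments, men come out ahead,
# because women mostly applied to the hard department:
m_total = rate(80 + 2, 100 + 10)
w_total = rate(9 + 30, 10 + 100)
print(m_total, w_total)
```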



Summary

  • Understand your data with descriptive statistics
    • Choose the right stats based on type of distribution
  • Be certain to quantify your uncertainty
    • Hypothesis testing
    • Confidence intervals (preferred!)
    • Careful when performing multiple tests (apply correction)
  • Relating 2 variables to one another
    • Correlation != causation
    • Even trickier with >2 variables (→ next lecture!)

