1 of 53


Applied Data Analysis (CS401)

Maria Brbić

Lecture 5

Regression for disentangling data

08 Oct 2025

2 of 53

Announcements


  • Project milestone P1 feedback to be released next week
    • We are in the process of grading!
  • Project milestone P2 released, due Nov 5th 23:59
  • Work on Homework 1
    • Not graded, but great practice for the exam
  • Final exam has been scheduled: Tue Jan 13th 15:15–18:15
  • Friday’s lab session:
    • Exercise on regression analysis (Exercise 4)
  • Indicative course feedback is being collected (until Sun Oct 12th)

3 of 53

Feedback


Give us feedback on this lecture here: https://go.epfl.ch/ada2024-lec5-feedback

  • What did you (not) like about this lecture?
  • What was (not) well explained?
  • On what would you like more (fewer) details?

4 of 53

Linear regression

5 of 53

Credits

  • Much of the material in this lecture is based on Andrew Gelman and Jennifer Hill’s great book “Data Analysis Using Regression and Multilevel/Hierarchical Models”, available for free here
  • For a neat and gentle written intro to linear regression, especially check out chapters 3 and 4


6 of 53

What you should already know about linear regression


POLLING TIME

  • “How familiar are you with linear regression?”
  • Scan QR code or go to https://go.epfl.ch/ada2025-lec5-poll

7 of 53

Linear regression as you know it

  • Given: n data points (Xi, yi), where Xi is k-dimensional vector of predictors (a.k.a. features) of i-th data point, and yi is scalar outcome
  • Goal: find the optimal coefficient vector 𝛽 = (𝛽1, …, 𝛽k) for approximating the yi’s as a linear function of the Xi’s:

yi = 𝛽1Xi1 + 𝛽2Xi2 + … + 𝛽kXik + 𝜖i = 𝛽 · Xi + 𝜖i,

where 𝜖i are error terms that should be as small as possible
  • Xi1 usually the constant 1 (by def) ⇒ 𝛽1 a constant intercept


Scalar product (a.k.a. dot product) of 2 vectors

8 of 53

Example with one predictor

[Figure: scatter plot of y vs. one predictor X, with fitted line y = 𝛽1 + 𝛽2X; 𝛽1: intercept, 𝛽2: slope]


10 of 53

Optimality criterion: least squares

  • Intuitively, want errors 𝜖i to be as small as possible
  • Technically, want sum of squared errors as small as possible ⇔ find 𝛽 such that we minimize ∑ni=1 (yi − 𝛽 · Xi)2
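As a sketch of how this minimization is done in practice, a few made-up data points can be fit with NumPy's least-squares solver (all numbers below are purely illustrative):

```python
import numpy as np

# Made-up data: 5 points scattered around y = 2 + 3x
X = np.array([[1, 0], [1, 1], [1, 2], [1, 3], [1, 4]], dtype=float)  # first column: constant 1 (intercept term)
y = np.array([2.1, 4.9, 8.2, 10.8, 14.0])

# Least squares: find beta minimizing sum_i (y_i - beta . X_i)^2
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # ≈ [2.06, 2.97]: fitted intercept and slope
```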


11 of 53

Use cases of regression

  • Prediction: use fitted model to estimate outcome y for a new X not seen during model fitting (if you’ve seen regression before, then probably in the context of prediction)
  • Descriptive data analysis: compare average outcomes across subgroups of data (today!)
  • Causal modeling: understand how outcome y changes when you manipulate predictors X (next lecture is about causality, although not primarily using regression)


12 of 53

Regression as comparison of average outcomes

13 of 53

Example with one binary predictor Xi

  • Xi = mom_hs = “Did mother finish high school?” ∈ {0, 1}
  • yi = kid_score = child’s score on cognitive test ∈ [0, 140]

yi = 𝛽1 + 𝛽2Xi + 𝜖i .

kid_score = 78 + 12 · mom_hs + error

[Figure: kid_score (20–140) vs. mom_hs ∈ {No, Yes}]

mean kid_score for moms who didn’t finish high school: 78

mean kid_score for moms who finished high school: 78 + 12 = 90

14 of 53

One binary predictor Xi:�Interpretation of fitted parameters 𝛽

yi = 𝛽1 + 𝛽2Xi + 𝜖i .

  • Intercept 𝛽1: mean outcome for data points i with Xi = 0
  • Slope 𝛽2: difference in mean outcomes between data points with Xi = 1 and data points with Xi = 0
  • Reason: means minimize least-squares criterion: ∑ni=1 (yi − m)2 is minimized w.r.t. m when −2 ∑ni=1 (yi − m) = 0, i.e., when m = (1/n) ∑ni=1 yi
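This equivalence is easy to check numerically; a minimal sketch with made-up scores (any least-squares fitter would do, NumPy's `lstsq` is used here):

```python
import numpy as np

x = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)      # binary predictor (e.g., mom_hs)
y = np.array([70., 75., 80., 85., 85., 90., 92., 93.])   # made-up outcomes

X = np.column_stack([np.ones_like(y), x])                # intercept column + predictor
b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]

assert np.isclose(b1, y[x == 0].mean())       # intercept = mean outcome of the X=0 group
assert np.isclose(b1 + b2, y[x == 1].mean())  # slope = difference of the two group means
```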


15 of 53

So why not just compute the two means separately and then compare them?

What a mean monkey!

16 of 53

Example with one continuous predictor Xi

  • Xi = mom_iq = mother’s IQ score ∈ [70, 140]
  • yi = kid_score = child’s score on cognitive test ∈ [0, 140]

yi = 𝛽1 + 𝛽2Xi + 𝜖i .

kid_score = 26 + 0.6 · mom_iq + error

[Figure: kid_score vs. mom_iq, with the fitted line extrapolated down to IQ = 0]

estimated (hypothetical) mean kid_score for moms with IQ = 0: 26

estimated mean kid_score for moms with IQ = 100: 26 + 0.6 · 100 = 86

17 of 53

One continuous predictor Xi:�Interpretation of fitted parameters 𝛽

yi = 𝛽1 + 𝛽2Xi + 𝜖i .

  • Intercept 𝛽1: estimated mean outcome for data points i�with Xi = 0
  • Slope 𝛽2: difference in estimated mean outcomes between data points whose Xi’s differ by 1
  • Why “estimated”? → e.g., no mom in the data has IQ = 0 (mom_iq ∈ [70, 140]), so the mean outcome at Xi = 0 is read off the fitted line rather than computed directly from data
  • NB: for binary predictor, we got “exact” instead of “estimated”


18 of 53

Example with multiple predictors

  • (Xi1 = 1 = constant)
  • Xi2 = mom_hs = “Did mother finish high school?” ∈ {0, 1}
  • Xi3 = mom_iq = mother’s IQ score ∈ [70, 140]
  • yi = kid_score = child’s score on cognitive test ∈ [0, 140]

yi = 𝛽1 + 𝛽2Xi2 + 𝛽3Xi3 + 𝜖i .

kid_score = 26 + 6 · mom_hs + 0.6 · mom_iq + error


19 of 53

Example with multiple predictors

kid_score = 26 + 6 · mom_hs + 0.6 · mom_iq + error

[Figure: kid_score vs. mom_iq, with separate fitted lines by mom_hs]

kids of moms who didn’t finish high school: intercept = 26, slope = 0.6

kids of moms who finished high school: intercept = 26 + 6 = 32, slope = 0.6

20 of 53

Example with interaction of predictors

  • Xi2 = mom_hs = “Did mother finish high school?” ∈ {0, 1}
  • Xi3 = mom_iq = mother’s IQ score ∈ [70, 140]
  • yi = kid_score = child’s score on cognitive test ∈ [0, 140]

yi = 𝛽1 + 𝛽2Xi2 + 𝛽3Xi3 + 𝛽4Xi2Xi3 + 𝜖i .

kid_score = −11 + 51 · mom_hs + 1.1 · mom_iq − 0.5 · mom_hs · mom_iq + error


21 of 53

Example with interaction of predictors

kid_score = −11 + 51 · mom_hs + 1.1 · mom_iq − 0.5 · mom_hs · mom_iq + error

[Figure: kid_score vs. mom_iq, with separate fitted lines by mom_hs]

kids of moms who didn’t finish high school: intercept = −11, slope = 1.1

kids of moms who finished high school: intercept = −11 + 51 = 40, slope = 1.1 − 0.5 = 0.6
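The per-group intercepts and slopes are simple arithmetic on the fitted coefficients; a tiny sketch using the numbers from the slide:

```python
# Fitted interaction model from the slide:
# kid_score = -11 + 51·mom_hs + 1.1·mom_iq - 0.5·mom_hs·mom_iq
b1, b2, b3, b4 = -11.0, 51.0, 1.1, -0.5

def group_line(mom_hs):
    """Intercept and slope (in mom_iq) of the fitted line for one mom_hs group."""
    return b1 + b2 * mom_hs, b3 + b4 * mom_hs

print(group_line(0))  # (-11.0, 1.1)
print(group_line(1))  # ≈ (40.0, 0.6)
```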

22 of 53

So why not just compute the two means separately and then compare them?


23 of 53

So why not just compute the two means separately and then compare them?

avg kid_score by group:

                             Mom finished high school   Mom didn’t finish high school
Mom drives Mercedes                    90                            78
Mom doesn’t drive Mercedes             90                            78

group sizes:

                             Mom finished high school   Mom didn’t finish high school
Mom drives Mercedes                990 women                     10 women
Mom doesn’t drive Mercedes          10 women                    990 women

24 of 53


THINK FOR A MINUTE:

What is the mean outcome for Mercedes-driving moms vs. for non-Mercedes-driving moms?�Compare the two means! What does the comparison tell you about the link between Mercedes-driving and kid_score?

(Feel free to discuss with your neighbor.)

25 of 53

  • Mean kid_score for Mercedes drivers: 0.99 · 90 + 0.01 · 78 ≈ 90
  • Mean kid_score for non-Mercedes drivers: 0.01 · 90 + 0.99 · 78 ≈ 78
  • But really driving Mercedes makes no difference (for fixed high-school predictor)!
  • Root of evil: correlation between finishing high school and driving Mercedes
  • Regression to the rescue: kid_score = 78 + 12 · mom_hs + 0 · mercedes + error

mean kid_score by group:

                             Mom finished high school   Mom didn’t finish high school
Mercedes                               90                            78
No Mercedes                            90                            78

group sizes:

                             Mom finished high school   Mom didn’t finish high school
Mercedes                           990 women                     10 women
No Mercedes                         10 women                    990 women

Aha!
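The toy population above can be reconstructed in a few lines (scores are set deterministically by mom_hs alone, as in the slide): the naive comparison of means misleads, while the regression assigns Mercedes a zero coefficient.

```python
import numpy as np

# (mom_hs, mercedes, count) cells from the table above
cells = [(1, 1, 990), (0, 1, 10), (1, 0, 10), (0, 0, 990)]
hs = np.concatenate([np.full(n, h, dtype=float) for h, m, n in cells])
merc = np.concatenate([np.full(n, m, dtype=float) for h, m, n in cells])
y = np.where(hs == 1, 90.0, 78.0)  # kid_score depends on mom_hs only

# Naive comparison of means: Mercedes seems to "matter"
print(y[merc == 1].mean(), y[merc == 0].mean())  # ≈ 89.88 vs 78.12

# Regression with both predictors disentangles the correlated predictors
X = np.column_stack([np.ones_like(y), hs, merc])
b = np.linalg.lstsq(X, y, rcond=None)[0]
print(b)  # ≈ [78, 12, 0]: the Mercedes coefficient vanishes
```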

26 of 53

Course eval (“indicative feedback”) open until Sun Oct 12th. Go to https://isa.epfl.ch now!


27 of 53

Quantifying uncertainty


28 of 53

Quantifying uncertainty

  • Statistical software gives you more than just coefficients 𝛽: standard errors, confidence intervals, and p-values

p-value: probability of estimating such an extreme coefficient if the true coefficient were zero (= null hypothesis)

Aha!


29 of 53

Residuals and R2

  • Residual for data point i: estimation error on data point i: ri = yi − ŷi, where ŷi = 𝛽 · Xi is the predicted value
  • Mean of residuals = 0 (total overestimation = total underestimation)
  • Variance of residuals = avg squared distance of predicted value from observed value = “unexplained variance”
  • Fraction of variance explained by the model: R2 = 1 − var(residuals) / var(y), where var(y) is the variance of outcomes y
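A minimal sketch of these quantities on made-up data:

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5., 6.])
y = np.array([2.0, 2.9, 4.2, 4.8, 6.1, 6.8])  # made-up outcomes

X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
residuals = y - X @ beta                 # estimation error per data point

assert abs(residuals.mean()) < 1e-10     # mean of residuals is 0
r2 = 1 - residuals.var() / y.var()       # fraction of variance explained
print(round(r2, 3))                      # close to 1 here: strong linear fit
```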


31 of 53

Coefficient of determination: R2

[Figure: two example fits: R2 = 0.147 (weak fit) vs. R2 = 0.865 (strong fit)]

32 of 53

Coefficient of determination: R2


33 of 53

Coefficient of determination: R2

R2 = 0.67 everywhere!


34 of 53

Assumptions made in regression modeling


35 of 53

Assumptions for regression modeling

  1. Validity:
    1. Outcome measure should accurately reflect the phenomenon of interest
    2. Model should include all relevant predictors
    3. Model should generalize to cases to which it will be applied


36 of 53

Assumptions for regression modeling (2)

  2. Linearity: the outcome is modeled as a linear function of the predictors, yi = 𝛽 · Xi + 𝜖i

But very flexible: we require linearity in predictors (not necessarily in raw inputs); predictors can be arbitrary functions of raw inputs, e.g.,
    - logarithms, polynomials, reciprocals, …
    - interactions (i.e., products) of multiple inputs
    - discretization of raw inputs, coded as indicator variables

37 of 53

Assumptions for regression modeling (3)

  3. Independence of errors: no interaction between data points
  4. Equal variance of errors
  5. Normality (Gaussianity) of errors

(assumptions 4 and 5 are less important in practice)

38 of 53

Transformations of predictors and outcomes


39 of 53

Transformations of predictors

  • When we apply linear transformations to predictors, the model remains “equally good”:
    • The fitted coefficients may change, but predicted outcomes and model fit (R2) won’t change
  • For instance, rescaling a predictor (say, converting height from meters to centimeters) rescales its fitted coefficient by the inverse factor, but predictions stay the same
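A sketch of this invariance on made-up data: rescaling a predictor changes its coefficient but not the predictions.

```python
import numpy as np

x = np.array([1.5, 1.6, 1.7, 1.8, 1.9])   # made-up heights in meters
y = np.array([55., 60., 63., 70., 74.])   # made-up outcomes

def fit(pred):
    X = np.column_stack([np.ones_like(pred), pred])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return beta, X @ beta

beta_m, pred_m = fit(x)          # predictor in meters
beta_cm, pred_cm = fit(100 * x)  # same predictor in centimeters

assert np.allclose(pred_m, pred_cm)             # predicted outcomes unchanged
assert np.isclose(beta_cm[1], beta_m[1] / 100)  # slope rescaled by the inverse factor
```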


40 of 53

Mean-centering of predictors

  • Compute the mean value of a predictor over all data points, and subtract it from each value of that predictor: Xij ← Xij − mean(X1j, …, Xnj)
  • ⇒ the predictor Xij now has mean 0

[Figure: kid_score vs. mom_iq, before and after mean-centering]

before centering: (hypothetical) mean kid_score for moms with IQ = 0: 26

after centering: mean kid_score for moms with mean IQ: 86

41 of 53

After mean-centering of predictors, …

… you have a convenient interpretation of coefficients 𝛽j of main predictors (i.e., non-interaction predictors):

  • j = 1 (i.e., intercept):
    • Estimated mean outcome when each predictor has its mean value
  • j > 1:
    • Model w/o interactions: estimated mean increase in outcome y for each unit increase in Xij
    • Model with interactions: estimated mean increase in outcome y for each unit increase in Xij when each other predictor has its mean value
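A minimal check of the intercept interpretation, with made-up data:

```python
import numpy as np

iq = np.array([85., 95., 100., 110., 120.])  # made-up predictor values
y = np.array([70., 80., 88., 92., 100.])     # made-up outcomes

xc = iq - iq.mean()                          # mean-centered predictor
X = np.column_stack([np.ones_like(xc), xc])
b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]

assert np.isclose(b1, y.mean())  # intercept = estimated mean outcome at the mean predictor value
```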


42 of 53

Standardization via z-scores

  • First mean-center all predictors, then divide them by their standard deviations: Xij ← [Xij − mean(X1j, …, Xnj)] / sd(X1j, …, Xnj)
  • Resulting values are called “z-scores”
  • All predictors now have the same units:�distance (in terms of standard deviations) from the mean
  • This lets us compare coefficients for predictors with previously incomparable units of measurement, e.g., IQ score vs. earnings in Swiss francs vs. height in centimeters
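A sketch on synthetic data (all numbers made up): after z-scoring, each coefficient is the estimated change in outcome per one standard deviation of its predictor, so coefficients become comparable.

```python
import numpy as np

rng = np.random.default_rng(0)
iq = rng.normal(100, 15, size=500)        # synthetic IQ scores (sd 15)
height = rng.normal(170, 10, size=500)    # synthetic heights in cm (sd 10)
y = 0.5 * iq + 0.1 * height + rng.normal(0, 1, size=500)

def zscore(v):
    return (v - v.mean()) / v.std()

X = np.column_stack([np.ones_like(y), zscore(iq), zscore(height)])
b = np.linalg.lstsq(X, y, rcond=None)[0]
print(b[1], b[2])  # ≈ 7.5 and ≈ 1.0: change in y per sd of each predictor
```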


43 of 53

Logarithmic outcomes

  • Practical: makes sense if the outcome y follows a heavy-tailed distribution
  • Only works for positive outcomes
  • Theoretical: turns an additive model into a multiplicative model: log(yi) = 𝛽 · Xi + 𝜖i ⇔ yi = exp(𝛽1Xi1) · … · exp(𝛽kXik) · exp(𝜖i)


44 of 53

Logarithmic outcomes: Interpreting coefficients

  • An additive increase of 1 in predictor X·1 is associated with a multiplicative increase of B1 := exp(b1) in the outcome
  • If b1 ≈ 0, we can immediately interpret b1 (without needing to exponentiate it first to get B1!) as the relative increase in outcomes, since exp(b1) ≈ 1 + b1
  • E.g., b1 = 0.05 ⇒ B1 = exp(b1) ≈ 1.05�⇒ “+1 in predictor X·1” is associated with “+5% in outcome”
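This rule of thumb is easy to verify numerically:

```python
import math

b1 = 0.05
B1 = math.exp(b1)                   # multiplicative effect on the outcome
print(B1)                           # ≈ 1.051, i.e. about +5%
assert abs(B1 - (1 + b1)) < 0.002   # exp(b) ≈ 1 + b for small b
```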


45 of 53

Going beyond linear regression for comparing means


46 of 53

Beyond linear regression:�generalized linear models

  • Logistic regression: binary outcomes
  • Poisson regression: non-negative integer outcomes (e.g., counts)


47 of 53

Beyond comparing means; or, A taste of causality: “Difference in differences”

  • Two groups: P, S
  • At time 2, group P receives a treatment, group S doesn’t
  • Question: Did the treatment have an effect? If so, how large was it?
  • P and S don’t start out the same at time 1
  • There is a temporal “baseline effect” even w/o treatment


48 of 53

Beyond comparing means; or, A taste of causality: “Difference in differences” (2)

  • Elegant linear model with binary predictors: yit = a + b · treatedi + c · time2t + d · (treatedi · time2t) + errorit
  • d = treatment effect
  • All of this with one single regression!
  • You get quantification of uncertainty (significance) for free!�
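A minimal sketch of this regression on made-up group means, constructed so the true effects are known (baseline gap 3, time trend 2, treatment effect 4):

```python
import numpy as np

# (treated, time2, outcome) — one mean observation per group and time
data = [
    (0, 0, 10.0), (0, 1, 12.0),  # group S: 10 -> 12 (time trend +2)
    (1, 0, 13.0), (1, 1, 19.0),  # group P: 13 -> 19 (trend +2, treatment +4)
]
treated, time2, y = (np.array(col, dtype=float) for col in zip(*data))

X = np.column_stack([np.ones_like(y), treated, time2, treated * time2])
a, b, c, d = np.linalg.lstsq(X, y, rcond=None)[0]
print(a, b, c, d)  # ≈ 10, 3, 2, 4 — d is the difference in differences
```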

[Figure: mean outcome over time for groups P and S, annotated with the coefficients a (baseline), b (group gap at time 1), c (time trend), d (treatment effect)]


50 of 53

Summary

  • Linear regression as a tool for comparing means across subgroups of data
  • How? Read group means off from fitted coefficients
  • Advantages over plain comparison of means “by hand”:
    • Accounting for correlations among predictors
    • Quantification of uncertainty (significance) “for free”
    • Additive vs. multiplicative model: all it takes is a log
  • Caveat emptor:
    • Model must be appropriately specified, else nonsense results → stay critical, run diagnostics (e.g., R2, data viz)


51 of 53

Feedback


Give us feedback on this lecture here: https://go.epfl.ch/ada2025-lec5-feedback

  • What did you (not) like about this lecture?
  • What was (not) well explained?
  • On what would you like more (fewer) details?

52 of 53

Credits

  • Much of the material in this lecture is based on Andrew Gelman and Jennifer Hill’s great book “Data Analysis Using Regression and Multilevel/Hierarchical Models”, available for free here
  • For a neat and gentle written intro to linear regression, especially check out chapters 3 and 4


53 of 53

Bonus: Logarithmic outcomes and predictors

Interpretation of coefficient of logarithmic predictor:

  • Multiplicative increase by 1% in predictor X·1 is associated with a multiplicative increase by b1% in the outcome
  • Why?
    • log(y) = a + b · log(X) ⇒ y = exp(a) · X^b
    • Multiplying X by a factor c multiplies y by a factor of c^b
    • c^b ≈ 1 + b · (c − 1) for c ≈ 1 (hint: Taylor approximation!)
    • Example when using c = 1.01 (i.e., increase by 1%): b = 2 ⇒ increasing X by 1% increases y by 2%
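Checking the last step numerically:

```python
b = 2.0     # coefficient of log(X)
c = 1.01    # multiply X by 1.01, i.e. a 1% increase
factor = c ** b
print(factor)  # ≈ 1.0201, i.e. about +2% in y
assert abs(factor - (1 + b * (c - 1))) < 1e-3  # Taylor approximation holds for c ≈ 1
```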
