1 of 53


Applied Data Analysis (CS401)

Maria Brbić

Lecture 5

Regression for disentangling data

08 Oct 2025

2 of 53

Announcements


  • Project milestone P1 feedback to be released next week
    • We are in the process of grading!
  • Project milestone P2 released, due Nov 5th 23:59
  • Work on Homework 1
    • Not graded, but great practice for the exam
  • Final exam has been scheduled: Tue Jan 13th 15:15–18:15
  • Friday’s lab session:
    • Exercise on regression analysis (Exercise 4)
  • Indicative course feedback is being collected (until Sun Oct 12th)

3 of 53

Feedback


Give us feedback on this lecture here: https://go.epfl.ch/ada2024-lec5-feedback

  • What did you (not) like about this lecture?
  • What was (not) well explained?
  • On what would you like more (fewer) details?

4 of 53

Linear regression

5 of 53

Credits

  • Much of the material in this lecture is based on Andrew Gelman and Jennifer Hill’s great book “Data Analysis Using Regression and Multilevel/Hierarchical Models”, available for free here
  • For a neat and gentle written intro to linear regression, especially check out chapters 3 and 4


6 of 53

What you should already know about linear regression


POLLING TIME

  • “How familiar are you with linear regression?”
  • Scan QR code or go to https://go.epfl.ch/ada2025-lec5-poll

7 of 53

Linear regression as you know it

  • Given: n data points (Xi, yi), where Xi is k-dimensional vector of predictors (a.k.a. features) of i-th data point, and yi is scalar outcome
  • Goal: find the optimal coefficient vector 𝛽 = (𝛽1, …, 𝛽k) for approximating the yi’s as a linear function of the Xi’s:

yi = 𝛽1Xi1 + 𝛽2Xi2 + … + 𝛽kXik + 𝜖i = 𝛽 · Xi + 𝜖i,

where 𝜖i are error terms that should be as small as possible
  • Xi1 usually the constant 1 (by def) ⇒ 𝛽1 a constant intercept


Scalar product (a.k.a. dot product) of 2 vectors

8 of 53

Example with one predictor

[Figure: scatter plot of y vs. one predictor X, with fitted line y = 𝛽1 + 𝛽2X; 𝛽1: intercept, 𝛽2: slope]


10 of 53

Optimality criterion: least squares

  • Intuitively, want errors 𝜖i to be as small as possible
  • Technically, want sum of squared errors as small as possible ⇔ find 𝛽 such that we minimize ∑ni=1 (yi − 𝛽 · Xi)2
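As a sketch of how this minimization is done in practice, a few made-up data points can be fit with NumPy's least-squares solver (all numbers below are purely illustrative):

```python
import numpy as np

# Made-up data: 5 points scattered around y = 2 + 3x
X = np.array([[1, 0], [1, 1], [1, 2], [1, 3], [1, 4]], dtype=float)  # first column: constant 1 (intercept term)
y = np.array([2.1, 4.9, 8.2, 10.8, 14.0])

# Least squares: find beta minimizing sum_i (y_i - beta . X_i)^2
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # ≈ [2.06, 2.97]: fitted intercept and slope
```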


11 of 53

Use cases of regression

  • Prediction: use fitted model to estimate outcome y for a new X not seen during model fitting (if you’ve seen regression before, then probably in the context of prediction)
  • Descriptive data analysis: compare average outcomes across subgroups of data (today!)
  • Causal modeling: understand how outcome y changes when you manipulate predictors X (next lecture is about causality, although not primarily using regression)


12 of 53

Regression as comparison of average outcomes

13 of 53

Example with one binary predictor Xi

  • Xi = mom_hs = “Did mother finish high school?” ∈ {0, 1}
  • yi = kid_score = child’s score on cognitive test ∈ [0, 140]

yi = 𝛽1 + 𝛽2Xi + 𝜖i .

kid_score = 78 + 12 · mom_hs + error

[Figure: kid_score (20–140) vs. mom_hs ∈ {No, Yes}]

mean kid_score for moms who didn’t finish high school: 78

mean kid_score for moms who finished high school: 78 + 12 = 90

14 of 53

One binary predictor Xi:�Interpretation of fitted parameters 𝛽

yi = 𝛽1 + 𝛽2Xi + 𝜖i .

  • Intercept 𝛽1: mean outcome for data points i with Xi = 0
  • Slope 𝛽2: difference in mean outcomes between data points with Xi = 1 and data points with Xi = 0
  • Reason: means minimize least-squares criterion: ∑ni=1 (yi − m)2 is minimized w.r.t. m when −2 ∑ni=1 (yi − m) = 0, i.e., when m = (1/n) ∑ni=1 yi
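This equivalence is easy to check numerically; a minimal sketch with made-up scores (any least-squares fitter would do, NumPy's `lstsq` is used here):

```python
import numpy as np

x = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)      # binary predictor (e.g., mom_hs)
y = np.array([70., 75., 80., 85., 85., 90., 92., 93.])   # made-up outcomes

X = np.column_stack([np.ones_like(y), x])                # intercept column + predictor
b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]

assert np.isclose(b1, y[x == 0].mean())       # intercept = mean outcome of the X=0 group
assert np.isclose(b1 + b2, y[x == 1].mean())  # slope = difference of the two group means
```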


15 of 53

So why not just compute the two means separately and then compare them?

What a mean monkey!

16 of 53

Example with one continuous predictor Xi

  • Xi = mom_iq = mother’s IQ score ∈ [70, 140]
  • yi = kid_score = child’s score on cognitive test ∈ [0, 140]

yi = 𝛽1 + 𝛽2Xi + 𝜖i .

kid_score = 26 + 0.6 · mom_iq + error

[Figure: kid_score vs. mom_iq, with the fitted line extrapolated down to IQ = 0]

estimated (hypothetical) mean kid_score for moms with IQ = 0: 26

estimated mean kid_score for moms with IQ = 100: 26 + 0.6 · 100 = 86

17 of 53

One continuous predictor Xi:�Interpretation of fitted parameters 𝛽

yi = 𝛽1 + 𝛽2Xi + 𝜖i .

  • Intercept 𝛽1: estimated mean outcome for data points i�with Xi = 0
  • Slope 𝛽2: difference in estimated mean outcomes between data points whose Xi’s differ by 1
  • Why “estimated”? → e.g., no mom in the data has IQ = 0 (mom_iq ∈ [70, 140]), so the mean outcome at Xi = 0 is read off the fitted line rather than computed directly from data
  • NB: for binary predictor, we got “exact” instead of “estimated”


18 of 53

Example with multiple predictors

  • (Xi1 = 1 = constant)
  • Xi2 = mom_hs = “Did mother finish high school?” ∈ {0, 1}
  • Xi3 = mom_iq = mother’s IQ score ∈ [70, 140]
  • yi = kid_score = child’s score on cognitive test ∈ [0, 140]

yi = 𝛽1 + 𝛽2Xi2 + 𝛽3Xi3 + 𝜖i .

kid_score = 26 + 6 · mom_hs + 0.6 · mom_iq + error


19 of 53

Example with multiple predictors

kid_score = 26 + 6 · mom_hs + 0.6 · mom_iq + error

[Figure: kid_score vs. mom_iq, with separate fitted lines by mom_hs]

kids of moms who didn’t finish high school: intercept = 26, slope = 0.6

kids of moms who finished high school: intercept = 26 + 6 = 32, slope = 0.6

20 of 53

Example with interaction of predictors

  • Xi2 = mom_hs = “Did mother finish high school?” ∈ {0, 1}
  • Xi3 = mom_iq = mother’s IQ score ∈ [70, 140]
  • yi = kid_score = child’s score on cognitive test ∈ [0, 140]

yi = 𝛽1 + 𝛽2Xi2 + 𝛽3Xi3 + 𝛽4Xi2Xi3 + 𝜖i .

kid_score = −11 + 51 · mom_hs + 1.1 · mom_iq − 0.5 · mom_hs · mom_iq + error


21 of 53

Example with interaction of predictors

kid_score = −11 + 51 · mom_hs + 1.1 · mom_iq − 0.5 · mom_hs · mom_iq + error

[Figure: kid_score vs. mom_iq, with separate fitted lines by mom_hs]

kids of moms who didn’t finish high school: intercept = −11, slope = 1.1

kids of moms who finished high school: intercept = −11 + 51 = 40, slope = 1.1 − 0.5 = 0.6
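The per-group intercepts and slopes are simple arithmetic on the fitted coefficients; a tiny sketch using the numbers from the slide:

```python
# Fitted interaction model from the slide:
# kid_score = -11 + 51·mom_hs + 1.1·mom_iq - 0.5·mom_hs·mom_iq
b1, b2, b3, b4 = -11.0, 51.0, 1.1, -0.5

def group_line(mom_hs):
    """Intercept and slope (in mom_iq) of the fitted line for one mom_hs group."""
    return b1 + b2 * mom_hs, b3 + b4 * mom_hs

print(group_line(0))  # (-11.0, 1.1)
print(group_line(1))  # ≈ (40.0, 0.6)
```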

22 of 53

So why not just compute the two means separately and then compare them?


23 of 53

So why not just compute the two means separately and then compare them?

avg kid_score by group:

                             Mom finished high school   Mom didn’t finish high school
Mom drives Mercedes                    90                            78
Mom doesn’t drive Mercedes             90                            78

group sizes:

                             Mom finished high school   Mom didn’t finish high school
Mom drives Mercedes                990 women                     10 women
Mom doesn’t drive Mercedes          10 women                    990 women

24 of 53


THINK FOR A MINUTE:

What is the mean outcome for Mercedes-driving moms vs. for non-Mercedes-driving moms?�Compare the two means! What does the comparison tell you about the link between Mercedes-driving and kid_score?

(Feel free to discuss with your neighbor.)

25 of 53

  • Mean kid_score for Mercedes drivers: 0.99 · 90 + 0.01 · 78 ≈ 90
  • Mean kid_score for non-Mercedes drivers: 0.01 · 90 + 0.99 · 78 ≈ 78
  • But really driving Mercedes makes no difference (for fixed high-school predictor)!
  • Root of evil: correlation between finishing high school and driving Mercedes
  • Regression to the rescue: kid_score = 78 + 12 · mom_hs + 0 · mercedes + error

mean kid_score by group:

                             Mom finished high school   Mom didn’t finish high school
Mercedes                               90                            78
No Mercedes                            90                            78

group sizes:

                             Mom finished high school   Mom didn’t finish high school
Mercedes                           990 women                     10 women
No Mercedes                         10 women                    990 women

Aha!
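The toy population above can be reconstructed in a few lines (scores are set deterministically by mom_hs alone, as in the slide): the naive comparison of means misleads, while the regression assigns Mercedes a zero coefficient.

```python
import numpy as np

# (mom_hs, mercedes, count) cells from the table above
cells = [(1, 1, 990), (0, 1, 10), (1, 0, 10), (0, 0, 990)]
hs = np.concatenate([np.full(n, h, dtype=float) for h, m, n in cells])
merc = np.concatenate([np.full(n, m, dtype=float) for h, m, n in cells])
y = np.where(hs == 1, 90.0, 78.0)  # kid_score depends on mom_hs only

# Naive comparison of means: Mercedes seems to "matter"
print(y[merc == 1].mean(), y[merc == 0].mean())  # ≈ 89.88 vs 78.12

# Regression with both predictors disentangles the correlated predictors
X = np.column_stack([np.ones_like(y), hs, merc])
b = np.linalg.lstsq(X, y, rcond=None)[0]
print(b)  # ≈ [78, 12, 0]: the Mercedes coefficient vanishes
```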

26 of 53

Course eval (“indicative feedback”) open until Sun Oct 12th. Go to https://isa.epfl.ch now!


27 of 53

Quantifying uncertainty


28 of 53

Quantifying uncertainty

  • Statistical software gives you more than just coefficients 𝛽: standard errors, confidence intervals, and p-values

p-value: probability of estimating such an extreme coefficient if the true coefficient were zero (= null hypothesis)

Aha!


29 of 53

Residuals and R2

  • Residual for data point i: estimation error on data point i: ri = yi − ŷi, where ŷi = 𝛽 · Xi is the predicted value
  • Mean of residuals = 0 (total overestimation = total underestimation)
  • Variance of residuals = avg squared distance of predicted value from observed value = “unexplained variance”
  • Fraction of variance explained by the model: R2 = 1 − var(residuals) / var(y), where var(y) is the variance of outcomes y
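A minimal sketch of these quantities on made-up data:

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5., 6.])
y = np.array([2.0, 2.9, 4.2, 4.8, 6.1, 6.8])  # made-up outcomes

X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
residuals = y - X @ beta                 # estimation error per data point

assert abs(residuals.mean()) < 1e-10     # mean of residuals is 0
r2 = 1 - residuals.var() / y.var()       # fraction of variance explained
print(round(r2, 3))                      # close to 1 here: strong linear fit
```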


31 of 53

Coefficient of determination: R2

[Figure: two example fits: R2 = 0.147 (weak fit) vs. R2 = 0.865 (strong fit)]

32 of 53

Coefficient of determination: R2


33 of 53

Coefficient of determination: R2

R2 = 0.67 everywhere!


34 of 53

Assumptions made in regression modeling


35 of 53

Assumptions for regression modeling

  1. Validity:
    1. Outcome measure should accurately reflect the phenomenon of interest
    2. Model should include all relevant predictors
    3. Model should generalize to cases to which it will be applied


36 of 53

Assumptions for regression modeling (2)

  2. Linearity: the outcome is modeled as a linear function of the predictors, yi = 𝛽 · Xi + 𝜖i

But very flexible: we require linearity in predictors (not necessarily in raw inputs); predictors can be arbitrary functions of raw inputs, e.g.,
    - logarithms, polynomials, reciprocals, …
    - interactions (i.e., products) of multiple inputs
    - discretization of raw inputs, coded as indicator variables

37 of 53

Assumptions for regression modeling (3)

  3. Independence of errors: no interaction between data points
  4. Equal variance of errors
  5. Normality (Gaussianity) of errors

(assumptions 4 and 5 are less important in practice)

38 of 53

Transformations of predictors and outcomes


39 of 53

Transformations of predictors

  • When we apply linear transformations to predictors, the model remains “equally good”:
    • The fitted coefficients may change, but predicted outcomes and model fit (R2) won’t change
  • For instance, rescaling a predictor (say, converting height from meters to centimeters) rescales its fitted coefficient by the inverse factor, but predictions stay the same
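A sketch of this invariance on made-up data: rescaling a predictor changes its coefficient but not the predictions.

```python
import numpy as np

x = np.array([1.5, 1.6, 1.7, 1.8, 1.9])   # made-up heights in meters
y = np.array([55., 60., 63., 70., 74.])   # made-up outcomes

def fit(pred):
    X = np.column_stack([np.ones_like(pred), pred])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return beta, X @ beta

beta_m, pred_m = fit(x)          # predictor in meters
beta_cm, pred_cm = fit(100 * x)  # same predictor in centimeters

assert np.allclose(pred_m, pred_cm)             # predicted outcomes unchanged
assert np.isclose(beta_cm[1], beta_m[1] / 100)  # slope rescaled by the inverse factor
```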


40 of 53

Mean-centering of predictors

  • Compute the mean value of a predictor over all data points, and subtract it from each value of that predictor: Xij ← Xij − mean(X1j, …, Xnj)
  • ⇒ the predictor Xij now has mean 0

[Figure: kid_score vs. mom_iq, before and after mean-centering]

before centering: (hypothetical) mean kid_score for moms with IQ = 0: 26

after centering: mean kid_score for moms with mean IQ: 86

41 of 53

After mean-centering of predictors, …

… you have a convenient interpretation of coefficients 𝛽j of main predictors (i.e., non-interaction predictors):

  • j = 1 (i.e., intercept):
    • Estimated mean outcome when each predictor has its mean value
  • j > 1:
    • Model w/o interactions: estimated mean increase in outcome y for each unit increase in Xij
    • Model with interactions: estimated mean increase in outcome y for each unit increase in Xij when each other predictor has its mean value
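A minimal check of the intercept interpretation, with made-up data:

```python
import numpy as np

iq = np.array([85., 95., 100., 110., 120.])  # made-up predictor values
y = np.array([70., 80., 88., 92., 100.])     # made-up outcomes

xc = iq - iq.mean()                          # mean-centered predictor
X = np.column_stack([np.ones_like(xc), xc])
b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]

assert np.isclose(b1, y.mean())  # intercept = estimated mean outcome at the mean predictor value
```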


42 of 53

Standardization via z-scores

  • First mean-center all predictors, then divide them by their standard deviations: Xij ← [Xij − mean(X1j, …, Xnj)] / sd(X1j, …, Xnj)
  • Resulting values are called “z-scores”
  • All predictors now have the same units:�distance (in terms of standard deviations) from the mean
  • This lets us compare coefficients for predictors with previously incomparable units of measurement, e.g., IQ score vs. earnings in Swiss francs vs. height in centimeters
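A sketch on synthetic data (all numbers made up): after z-scoring, each coefficient is the estimated change in outcome per one standard deviation of its predictor, so coefficients become comparable.

```python
import numpy as np

rng = np.random.default_rng(0)
iq = rng.normal(100, 15, size=500)        # synthetic IQ scores (sd 15)
height = rng.normal(170, 10, size=500)    # synthetic heights in cm (sd 10)
y = 0.5 * iq + 0.1 * height + rng.normal(0, 1, size=500)

def zscore(v):
    return (v - v.mean()) / v.std()

X = np.column_stack([np.ones_like(y), zscore(iq), zscore(height)])
b = np.linalg.lstsq(X, y, rcond=None)[0]
print(b[1], b[2])  # ≈ 7.5 and ≈ 1.0: change in y per sd of each predictor
```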


43 of 53

Logarithmic outcomes

  • Practical: makes sense if the outcome y follows a heavy-tailed distribution
  • Only works for positive outcomes
  • Theoretical: turns an additive model into a multiplicative model: log(yi) = 𝛽 · Xi + 𝜖i ⇔ yi = exp(𝛽1Xi1) · … · exp(𝛽kXik) · exp(𝜖i)


44 of 53

Logarithmic outcomes: Interpreting coefficients

  • An additive increase of 1 in predictor X·1 is associated with a multiplicative increase of B1 := exp(b1) in the outcome
  • If b1 ≈ 0, we can immediately interpret b1 (without needing to exponentiate it first to get B1!) as the relative increase in outcomes, since exp(b1) ≈ 1 + b1
  • E.g., b1 = 0.05 ⇒ B1 = exp(b1) ≈ 1.05�⇒ “+1 in predictor X·1” is associated with “+5% in outcome”
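This rule of thumb is easy to verify numerically:

```python
import math

b1 = 0.05
B1 = math.exp(b1)                   # multiplicative effect on the outcome
print(B1)                           # ≈ 1.051, i.e. about +5%
assert abs(B1 - (1 + b1)) < 0.002   # exp(b) ≈ 1 + b for small b
```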


45 of 53

Going beyond linear regression for comparing means


46 of 53

Beyond linear regression:�generalized linear models

  • Logistic regression: binary outcomes
  • Poisson regression: non-negative integer outcomes (e.g., counts)


47 of 53

Beyond comparing means; or, A taste of causality: “Difference in differences”

  • Two groups: P, S
  • At time 2, group P receives a treatment, group S doesn’t
  • Question: Did the treatment have an effect? If so, how large was it?
  • P and S don’t start out the same at time 1
  • There is a temporal “baseline effect” even w/o treatment


48 of 53

Beyond comparing means; or, A taste of causality: “Difference in differences” (2)

  • Elegant linear model with binary predictors: yit = a + b · treatedi + c · time2t + d · (treatedi · time2t) + errorit
  • d = treatment effect
  • All of this with one single regression!
  • You get quantification of uncertainty (significance) for free!�
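A minimal sketch of this regression on made-up group means, constructed so the true effects are known (baseline gap 3, time trend 2, treatment effect 4):

```python
import numpy as np

# (treated, time2, outcome) — one mean observation per group and time
data = [
    (0, 0, 10.0), (0, 1, 12.0),  # group S: 10 -> 12 (time trend +2)
    (1, 0, 13.0), (1, 1, 19.0),  # group P: 13 -> 19 (trend +2, treatment +4)
]
treated, time2, y = (np.array(col, dtype=float) for col in zip(*data))

X = np.column_stack([np.ones_like(y), treated, time2, treated * time2])
a, b, c, d = np.linalg.lstsq(X, y, rcond=None)[0]
print(a, b, c, d)  # ≈ 10, 3, 2, 4 — d is the difference in differences
```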

[Figure: mean outcome over time for groups P and S, annotated with the coefficients a (baseline), b (group gap at time 1), c (time trend), d (treatment effect)]


50 of 53

Summary

  • Linear regression as a tool for comparing means across subgroups of data
  • How? Read group means off from fitted coefficients
  • Advantages over plain comparison of means “by hand”:
    • Accounting for correlations among predictors
    • Quantification of uncertainty (significance) “for free”
    • Additive vs. multiplicative model: all it takes is a log
  • Caveat emptor:
    • Model must be appropriately specified, else nonsense results → stay critical, run diagnostics (e.g., R2, data viz)


51 of 53

Feedback


Give us feedback on this lecture here: https://go.epfl.ch/ada2025-lec5-feedback

  • What did you (not) like about this lecture?
  • What was (not) well explained?
  • On what would you like more (fewer) details?

52 of 53

Credits

  • Much of the material in this lecture is based on Andrew Gelman and Jennifer Hill’s great book “Data Analysis Using Regression and Multilevel/Hierarchical Models”, available for free here
  • For a neat and gentle written intro to linear regression, especially check out chapters 3 and 4


53 of 53

Bonus: Logarithmic outcomes and predictors

Interpretation of coefficient of logarithmic predictor:

  • Multiplicative increase by 1% in predictor X·1 is associated with a multiplicative increase by b1% in the outcome
  • Why?
    • log(y) = a + b · log(X) ⇒ y = exp(a) · X^b
    • Multiplying X by a factor c multiplies y by a factor of c^b
    • c^b ≈ 1 + b · (c − 1) for c ≈ 1 (hint: Taylor approximation!)
    • Example when using c = 1.01 (i.e., increase by 1%): b = 2 ⇒ increasing X by 1% increases y by 2%
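Checking the last step numerically:

```python
b = 2.0     # coefficient of log(X)
c = 1.01    # multiply X by 1.01, i.e. a 1% increase
factor = c ** b
print(factor)  # ≈ 1.0201, i.e. about +2% in y
assert abs(factor - (1 + b * (c - 1))) < 1e-3  # Taylor approximation holds for c ≈ 1
```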
