1 of 48

The General Linear Model - Regression Part 1

Dr Andrew J. Stewart

E: drandrewjstewart@gmail.com

T: @ajstewart_lang

G: ajstewartlang

2 of 48

Remember Correlation?

  • Sometimes we’re interested in the possible relationship between two variables such as:
    • Time spent studying and performance on an exam
  • Perhaps there’s:
    • A positive correlation between the two where more time spent studying correlates with better exam performance
    • A negative correlation between the two where more time spent studying correlates with worse exam performance
    • No correlation between the two variables where time spent studying doesn’t correlate with exam performance

3 of 48

4 of 48

Remember variance?

It’s the measure of the amount by which data associated with a variable vary from the mean of that variable…

If two variables covary, then when one variable deviates from the mean, we expect the other variable to deviate from its mean in a similar way.

5 of 48

Let’s take a close look at the data in this panel:

6 of 48

The horizontal lines represent the mean for each variable - if a participant is below the mean on one variable, notice that they are also below the mean for the other variable - this suggests the two variables co-vary.

7 of 48

For participants 1, 25 and 50, their scores on each variable are all below the respective mean for each variable; for participant 100, their score is above the respective mean for each variable.

To formalise this, we can calculate the combined differences…

8 of 48

ID   Study   Score   Mean_Study   Mean_Score   Study - Mean_Study   Score - Mean_Score   (Study - Mean_Study) * (Score - Mean_Score)
1    192     77      199          79.6         -7                   -2.6                 18.2
2    202     81      199          79.6         3                    1.4                  4.2
3    208     82      199          79.6         9                    2.4                  21.6
4    183     75      199          79.6         -16                  -4.6                 73.6
...  ...     ...     ...          ...          ...                  ...                  ...

Sum of the final column = 1435.072

So covariance = 1435.072 / 99 = 14.49568 (we divide by N − 1 = 99 for our 100 participants)
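As a sketch, the same calculation can be done in R. The vectors below are toy stand-ins built from just the four rows shown above (the full 100-participant dataset isn't reproduced in these slides), so the result won't match 14.49568 exactly, but the recipe is identical:

study <- c(192, 202, 208, 183)   # Study values from the four rows above
score <- c(77, 81, 82, 75)       # Score values from the four rows above

# Cross-product deviations, summed and divided by N - 1
deviations <- (study - mean(study)) * (score - mean(score))
sum(deviations) / (length(study) - 1)

# R's built-in function gives the same value
cov(study, score)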

9 of 48

Now, one problem with covariance as we’ve calculated it is that the score we end up with depends on the measurement scales associated with our variables.

In other words, the covariance value isn’t standardised.

We can divide any deviation from the mean by the standard deviation, and that will give us the distance from the mean in standard deviation units…

10 of 48

We can divide our covariance value by the standard deviations of our two variables (more precisely, the standard deviation of variable x multiplied by the standard deviation of variable y) – in other words:

r = covariance(x, y) / (SD of x × SD of y)

This is called the Pearson product-moment correlation coefficient and ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation), with 0 meaning no correlation at all.

11 of 48

The Standard Deviation of our Study variable is 7.09566

The Standard Deviation of our Score variable is 2.277481

So, we can divide our covariance result as follows:

14.49568 / (7.09566 × 2.277481)

Which gives us 0.8969966 – let's round it to 0.9.

This is the Pearson's r for the correlation between our two variables.
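The same standardisation is a one-liner in R. Continuing the earlier sketch (the quoted value of 0.8969966 is what you get with the full 100-participant dataset):

# Covariance divided by the product of the two standard deviations
cov(study, score) / (sd(study) * sd(score))

# cor() computes exactly this quantity directly
cor(study, score)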

12 of 48

Correlation is not Causation

There is a high correlation (Pearson’s r = 0.791) between chocolate consumption in a country and the number of Nobel Prize winners in that country.

Why do you think this is?

13 of 48

Correlation is not Causation

When interpreting correlation data one common pitfall is to assume that the score on one variable causes a particular score on the other. This is wrong!

Very often, common sense might suggest causation – e.g., time spent studying improves exam score.

But you cannot make any claim about causation from a correlation.

There may be other variables that we don’t know about – maybe being interested in an academic subject results in more engagement with it in general, and it is this that influences both time spent studying and exam performance.

Additionally, spurious correlations can be found all over the place…

14 of 48

Correlation is not Causation

https://www.tylervigen.com/spurious-correlations

15 of 48

Correlation is not Causation

16 of 48

Correlation is not Causation

17 of 48

R2 – How much variance in one variable can be explained by the other?

Square Pearson’s r to get R2.

If we multiply this value by 100, we get the percentage of variance in one variable that is explained by the other.

For our example on time spent studying and exam score, r squared = 0.81 as r = 0.9

This means that about 81% of the variance in exam score is explained by time spent studying.
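As a quick sketch in R, continuing with the study and score vectors from earlier:

r <- cor(study, score)   # 0.8969966 with the full data, rounded to 0.9 above
r^2                      # proportion of variance explained
r^2 * 100                # as a percentage (about 80-81%)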

18 of 48

19 of 48

Regression

Regression is where we want to predict the value of one variable (called our Outcome variable) on the basis of the value of one or more Predictor variables.

Simple regression is when we have one predictor; multiple regression is when we have more than one…

One of the most commonly used types of regression is OLS (ordinary least squares) regression, which works by minimising the distance (deviation) between the observed data and the linear model.

20 of 48

Statistical Models

Most of what we do in applying statistics to particular research questions involves model building.

We build a statistical model and test whether it is a good fit for our data - in other words, whether it captures our data well.

All models are an approximation of reality, and some are better than others.

Or to paraphrase the statistician George Box,

“all models are wrong but some are useful.”

21 of 48

[Figure: our real data alongside two candidate models, Model 1 and Model 2]

So how do we tell if a particular statistical model is a good fit to our data?

We can look at the extent to which our data deviate from a particular model (where deviation = error).

22 of 48

[Figure: real data = model + error, illustrated for each of the two candidate models]

We want to select the model which has the smallest error (aka model residuals).

23 of 48

Regression

We can plot data on exam performance and days spent studying.

Wouldn’t it be helpful if we could draw a straight line such that, if we know the value on one axis (x, say), we could predict the value on the other (y, say)?

24 of 48

Determining the best line

For any line we draw through the data, we can calculate the squared differences between the values the line predicts and the data themselves.

The Ordinary Least Squares (OLS) method in regression provides us with the line that results in the smallest such differences between the values predicted by the line and the data themselves.
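One way to make “least squares” concrete is to compute the sum of squared differences for any candidate line and check that the OLS line gives the smallest value. A minimal sketch with made-up data (none of these numbers come from the slides):

set.seed(1234)
x <- rnorm(50, mean = 10, sd = 2)     # made-up predictor
y <- 3 + 2 * x + rnorm(50, sd = 1)    # made-up outcome

# Sum of squared differences for any candidate line y = b0 + b1 * x
ss <- function(b0, b1) sum((y - (b0 + b1 * x))^2)

ss(3, 1.5)                            # an arbitrary guessed line
fit <- lm(y ~ x)                      # the OLS line
ss(coef(fit)[1], coef(fit)[2])        # smaller than any other choice of b0 and b1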

25 of 48

Plotting a Regression Line

With OLS regression, we can plot a straight line that minimises the residuals (i.e., the error).

The line is described by the equation:

Y_i = β0 + β1 × X_i + residual_i

β0 = the intercept (the predicted value of y when x = 0)

β1 = the gradient (slope) of the line

residual_i = the difference between the predicted score and the actual score for participant i
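In R, these pieces map onto the output of lm(), which the slides come back to later. A sketch, continuing with the study and score vectors from earlier:

fit <- lm(score ~ study)   # score predicted by study
coef(fit)[1]               # beta_0, the intercept
coef(fit)[2]               # beta_1, the gradient
fitted(fit)                # the predicted score for each participant
resid(fit)                 # residual_i = actual score minus predicted score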

26 of 48

Our data

27 of 48

The mean as a model of our data (SST)

28 of 48

The regression line as a model of our data (SSR)

29 of 48

Comparing the mean and regression lines as models of our data (SSM)

30 of 48

Is our regression model any good?

If SSM is large, then the regression model is better than the mean in terms of predicting values of the outcome variable.

If SSM is small, then the regression model is not much better than the mean in terms of predicting values of the outcome variable.

31 of 48

Is our regression model any good?

We can calculate the proportion of improvement in prediction by looking at the ratio of SSM to SST. In fact, this ratio is R2:

R2 = SSM / SST

And this is the same R2 that we worked out by squaring the Pearson correlation coefficient.
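A sketch of those sums of squares in R, continuing with the fit object from the earlier sketch:

sst <- sum((score - mean(score))^2)   # SST: deviations from the mean model
ssr <- sum(resid(fit)^2)              # SSR: deviations from the regression line
ssm <- sst - ssr                      # SSM: improvement due to the model

ssm / sst                             # R-squared
cor(study, score)^2                   # the same value via Pearson's r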

32 of 48

Is our regression model any good?

We can also assess how good our model is by using the F-test.

The F-test is based on the ratio of the improvement due to the model (SSM) and the difference between the model and the observed data (SSR).

Rather than using the Sums of Squares themselves, we use the Mean Squares (MSM and MSR), which we get by dividing the Sums of Squares by their respective degrees of freedom. The F-ratio is then F = MSM / MSR.
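Continuing the sketch (a simple regression, so one predictor):

n <- length(score)
df_m <- 1            # degrees of freedom for the model: one predictor
df_r <- n - 2        # observations minus parameters estimated (intercept + slope)

msm <- ssm / df_m    # mean square for the model
msr <- ssr / df_r    # mean square for the residuals
msm / msr            # the F-ratio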

33 of 48

Is our regression model any good?

A good model will have a large MSM and a small MSR.

In other words, the improvement of the model compared to the mean will be good.

The difference between the model and our observed data will be small.

We need to be sure we’ve met the assumptions of regression though…

34 of 48

Assumptions of Parametric Statistics

Assumption 1 - the model residuals need to be normally distributed (although t-tests require the data themselves to be normal).
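Before the performance::check_model() call shown at the end of the deck, base R offers quick visual checks on this assumption. A sketch, continuing with the fit object from the earlier sketch:

hist(resid(fit))      # residuals should look roughly bell-shaped
qqnorm(resid(fit))    # points should fall close to a straight line
qqline(resid(fit))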

35 of 48

Assumptions of Parametric Statistics

Assumption 2 – Homogeneity of variance – the variances should not change systematically throughout the data. In designs where you test several groups of participants this means that the variances of each group should be equivalent.

36 of 48

Assumptions of Parametric Statistics

Assumption 3 – Interval data – data should be measured on an interval scale. In other words, the distance between two adjacent points should be the same as the distance between any other two adjacent points. R can’t tell you this – you need to determine it by yourself. Reaction time is a good example of interval data.

Assumption 4 – Independence. The data from one participant does not affect the data from another (i.e., they are independent). In repeated measures designs, we expect the scores in the experimental conditions to be independent between participants.

37 of 48

Now, let’s look at how we do this in R…

38 of 48

39 of 48

Is our regression model any good?

If MSM is large and MSR is small, then F will be large.

We can determine whether our F value is significant by looking up the critical values on the F table.

For SSM the degrees of freedom = number of variables in model (in our case 2).

For SSR the degrees of freedom = number of observations – number of parameters being estimated, including the constant (in our case 8-2 = 6)

40 of 48

Is our regression model any good?

In our example, df numerator = 2 and df denominator = 6. Here is a portion of the F table for a .05 alpha level.

So we would need an F value greater than 5.143 for our result to be significant at p < 0.05.
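Rather than looking the value up in a printed table, R's qf() function returns the same critical value:

qf(0.95, df1 = 2, df2 = 6)   # critical F for alpha = .05; approximately 5.143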

41 of 48

Example

Imagine that you are a Formula 1 team director. You’re interested in understanding how the number of points that a team scores is predicted by the amount of money invested in the team. As well as being in charge of the team, you also have a secret interest in statistical analysis. In dataset1.csv you will find (for each of the 20 drivers) the amount of money invested in their particular car (in £100,000s) plus the total number of points they were awarded over the season. Work out the simple linear regression equation that captures the relationship between investment (as our predictor) and points awarded (as our outcome).

42 of 48

The data

We are going to use the tidyverse and the Hmisc libraries:

> library(tidyverse)

> library(Hmisc)
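The slides don't show the step that reads the data in, but a minimal sketch (assuming dataset1.csv sits in your working directory, with the investment and points columns used in the code below) would be:

> dataset1 <- read_csv("dataset1.csv")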

43 of 48

Visualising the regression line

set.seed(1234)
ggplot(dataset1, aes(x = investment, y = points)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Investment (in £100,000s)", y = "Points") +
  theme_minimal() +
  theme(text = element_text(size = 12)) +
  ylim(0, 300)

44 of 48

Building a simple linear model

Let’s build two models using the lm() function: one (model0) is just the intercept (the mean of our outcome) predicting the outcome (points), while the second (model1) has investment predicting the outcome (points).

model0 <- lm(points ~ 1, data = dataset1)

model1 <- lm(points ~ investment, data = dataset1)

We can compare the models to each other and calculate the F-ratio using the anova() function.

anova(model0, model1)

45 of 48

Comparing the two models

> anova(model0, model1)
Analysis of Variance Table

Model 1: points ~ 1
Model 2: points ~ investment
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)
1     19 120827
2     18  22046  1     98781 80.654 4.547e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The F-ratio comparing our two models is 80.654, indicating that our model with our predictor (investment) is a better fit than our model with just the intercept (the mean).

46 of 48

Summary of our regression model

> summary(model1)

Call:
lm(formula = points ~ investment, data = dataset1)

Residuals:
    Min      1Q  Median      3Q     Max
-55.936 -20.840  -2.978  28.212  60.615

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -50.92329   23.44967  -2.172   0.0435 *
investment    0.24166    0.02691   8.981 4.55e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 35 on 18 degrees of freedom
Multiple R-squared:  0.8175,    Adjusted R-squared:  0.8074
F-statistic: 80.65 on 1 and 18 DF,  p-value: 4.547e-08

In the Coefficients section we have our parameter estimates, along with the t-test associated with our predictor (investment). At the bottom are the R-squared and Adjusted R-squared values (the latter adjusts for the number of predictors in our model).

47 of 48

Checking our Assumptions

> performance::check_model(model1)

All looks generally ok...

48 of 48

What does it mean?

We would conclude from this that the amount of money spent on a driver does indeed predict the number of points they score in a season of F1. Specifically, since investment is measured in £100,000s, the slope of 0.24166 means that each additional £100,000 invested is associated with about 0.24 more points – roughly £414,000 for each additional point.

Remember, regression is nothing more than prediction - a simple regression model allows us to predict the value of a variable on the basis of knowing about another variable (and its relationship to that variable).
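As a final sketch, predict() turns the fitted model into exactly that kind of prediction (the investment figure below is invented purely for illustration):

# Predicted points for a team investing £100 million (1,000 units of £100,000)
predict(model1, newdata = tibble(investment = 1000))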