Wascher

Introduction

Following the 2016 US presidential election, analysts and voters alike were shocked. Outlets like The Huffington Post and The New York Times built statistical forecasts of the race, yet still failed to predict Donald Trump's victory. And although the theory that polls missed "shy" Trump supporters has since been debunked, it points to a real dynamic: this race hinged less on persuading voters than on turning them out.

Despite the ongoing debate over persuasion versus turnout, there's no denying the latter's importance. Historically, though, voter turnout has been low: it averages 53.6%, ranging from a low of 49% in 1996 to a high of 58.2% in 2008. The variation across states is even larger than the variation across elections: in 2012, 44.2% of Hawaii's voting-eligible population participated, compared with 78.2% of Minnesota's. A gap of 34 percentage points invites the question of what factors influence voter turnout. This project seeks to answer that question.

This analysis uses a self-compiled dataset drawn from the US Census and the American Community Survey (ACS). The data comprise nine variables, the last six of which serve as predictors:^{[1]}

- state identifies the state to which each observation applies
- stateID is that state's two-letter abbreviation
- turnout is each state's turnout in the 2016 election, in percent, from Census data
- hs is educational attainment, measuring the percentage of people in each state with at least a high school diploma, from the ACS
- college is a second educational attainment variable, measuring the percentage of people in each state with at least a Bachelor's degree, from the ACS
- minority is the percentage of a state's population that identifies as non-White, from the ACS
- young is the percentage of people in each state between the ages of 18 and 24, from the ACS
- battle denotes a state's battleground status. Based on conventional wisdom, the twelve battleground states are Colorado, Florida, Iowa, Michigan, Minnesota, New Hampshire, Nevada, North Carolina, Ohio, Pennsylvania, Virginia, and Wisconsin. This is a categorical dummy variable, with 1 denoting a battleground state and 0 denoting a non-battleground state.
- weather records the Election Day weather in each state's largest city, where "favorable" means no rain or snow on November 8th, 2016. This is also a categorical dummy variable, with 1 denoting unfavorable weather (rain or snow) and 0 denoting favorable weather. Data were compiled from various meteorological reports.
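As an illustration of the dummy coding, the battle variable can be constructed directly from the list of battleground states above. The sketch below is in Python purely for illustration (the analysis itself was conducted in R), and the helper function name is hypothetical:

```python
# Hypothetical helper: coding the battle dummy variable from the list of
# conventional-wisdom battleground states named above.
BATTLEGROUNDS = {
    "Colorado", "Florida", "Iowa", "Michigan", "Minnesota", "New Hampshire",
    "Nevada", "North Carolina", "Ohio", "Pennsylvania", "Virginia", "Wisconsin",
}

def battle_dummy(state: str) -> int:
    """Return 1 for a battleground state, 0 otherwise."""
    return 1 if state in BATTLEGROUNDS else 0

print(battle_dummy("Ohio"))    # battleground -> 1
print(battle_dummy("Hawaii"))  # non-battleground -> 0
```

The weather dummy follows the same pattern, flagging the cities that saw rain or snow.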

This analysis will dissect these leading factors that political scientists and pundits believe affect voter turnout from state to state. It will also assess how strong a role each plays in the overall turnout of a state's voting population, developing conclusions using multiple regression analysis. Several candidate models will be considered.

Exploratory Data Analysis

For the purposes of exploratory data analysis, simple visual and quantitative operations were performed on each variable and the dataset as a whole. Quantitatively, the following was found across 50 observations:

hs college minority young turnout
Min. :82.20 Min. :19.60 Min. : 5.20 Min. : 8.500 Min. :43.00
1st Qu.:86.10 1st Qu.:26.57 1st Qu.:14.05 1st Qu.: 9.600 1st Qu.:57.75
Median :89.35 Median :29.55 Median :21.65 Median : 9.800 Median :61.85
Mean :88.70 Mean :29.81 Mean :22.75 Mean : 9.876 Mean :61.66
3rd Qu.:91.08 3rd Qu.:32.60 3rd Qu.:31.18 3rd Qu.:10.100 3rd Qu.:65.65
Max. :93.50 Max. :41.50 Max. :75.00 Max. :12.300 Max. :74.80

As for the two categorical variables, battle and weather, there were 12 battleground states and 38 non-battleground states, as intended. Furthermore, only 8 cities saw unfavorable weather, with 42 seeing no rain or snow on November 8th.

To gain a visual understanding of the data, the four interval predictors were plotted against turnout. Here, hs and college show positive linear relationships, while young shows no discernible relationship. With 75% of its population identifying as non-White, Hawaii appears to be an outlier. Though an analysis of studentized residuals confirms this, Hawaii will be kept in the dataset, as omitting a state does more harm than good in a study of all 50 states. The same rule applies to all outliers, influential points, and high-leverage points in this study.

Additionally, a full multiple linear regression model was fit with every predictor. At the 95% confidence level, the only statistically significant predictors in that model are college, minority, and battle. The adjusted R-squared for this model is 0.6415, the residual standard error is 3.819, and the F-statistic and its accompanying p-value indicate overall statistical significance.
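As a check on the reported fit statistics, the adjusted R-squared follows from the full model's multiple R-squared (0.6854), the sample size n = 50, and p = 6 predictors. A quick Python recomputation (the analysis itself was run in R):

```python
# Adjusted R-squared from the full model's multiple R-squared,
# with n = 50 states and p = 6 predictors.
n, p = 50, 6
r2 = 0.6854
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 4))  # 0.6415, matching the R output
```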

Regression Model Building

To identify a first candidate model, variables were chosen using stepwise selection. This returned a model with college, young, minority, and battle as predictors. Refitting this model on its own yields an adjusted R-squared of 0.642, a residual standard error of 3.816, and an F-statistic and accompanying p-value that indicate overall statistical significance. This is a slight, albeit modest, improvement over the full model.

Within this model, young is not significant, so a third model was fit with only college, minority, and battle. Its adjusted R-squared of 0.6344 and residual standard error of 3.857 indicate a somewhat poorer fit. Model 2's AIC (282.558) is also slightly lower than Model 3's (282.7034), again favoring Model 2. Interestingly, by contrast, a nested F-test of the two models (p = 0.167) indicates that young does not add statistically significant explanatory power. Nevertheless, on balance, Model 2, which includes young, appears superior.

Checking assumptions for this model proves successful: the residual plots, histogram, and QQ-plot show the model meets the assumptions of linearity, normality, and constant variance. Running the same diagnostics on Model 3 suggests that it does not satisfy these assumptions as fully, further demonstrating why Model 2 is superior. Pairwise scatterplots of the predictors also show no signs of multicollinearity. Finally, the variable plots produced during exploratory data analysis indicate no need for transformations of the existing variables, and adding interactions would likely bear no fruit either. For this dataset, Model 2 appears to be the best model for predicting voter turnout.

Results and Conclusion

The final model uses college, young, minority, and battle as predictors. Its equation is:

y-hat = 57.07238 + 0.69555(college) - 1.26814(young) - 0.21367(minority) + 5.15359(battle)
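As a sanity check, an OLS fit passes through the sample means, so plugging the EDA means of the predictors into this equation should approximately recover the mean turnout of 61.66. A quick Python verification (illustrative only; the model was fit in R):

```python
# Evaluate the final equation at the sample means from the EDA summary:
# college = 29.81, young = 9.876, minority = 22.75, battle = 0.24.
coef = {"intercept": 57.07238, "college": 0.69555, "young": -1.26814,
        "minority": -0.21367, "battle": 5.15359}
means = {"college": 29.81, "young": 9.876, "minority": 22.75, "battle": 0.24}
pred = coef["intercept"] + sum(coef[v] * means[v] for v in means)
print(round(pred, 2))  # ~61.66, the mean turnout across the 50 states
```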

Although this model performed best in these statistical analyses, it is still not a strong predictor of turnout: an adjusted R-squared of 0.642 leaves roughly a third of the variation unexplained, and the residual standard error of 3.816 percentage points and AIC of 282.558 leave considerable room for improvement. Such a conclusion makes sense, as these four variables are likely not the largest determinants of voter turnout.

In this way, the methodology of this study could be improved. Tracking different variables might beget a stronger model, and the scope could be expanded beyond the 2016 election to include more observations in the sample. Finally, it is interesting which variables made it into the final model and which did not. Some exclusions, like weather, are expected; it is particularly notable, however, that college attainment appears in the final model (and even in the smaller third model) while high school attainment does not.

Appendix: Exploratory Data Analysis

> summary(WascherTurnout16)

state stateID turnout

Length:50 Length:50 Min. :43.00

Class :character Class :character 1st Qu.:57.75

Mode :character Mode :character Median :61.85

Mean :61.66

3rd Qu.:65.65

Max. :74.80

hs college minority

Min. :82.20 Min. :19.60 Min. : 5.20

1st Qu.:86.10 1st Qu.:26.57 1st Qu.:14.05

Median :89.35 Median :29.55 Median :21.65

Mean :88.70 Mean :29.81 Mean :22.75

3rd Qu.:91.08 3rd Qu.:32.60 3rd Qu.:31.18

Max. :93.50 Max. :41.50 Max. :75.00

young battle weather

Min. : 8.500 Min. :0.00 Min. :0.00

1st Qu.: 9.600 1st Qu.:0.00 1st Qu.:0.00

Median : 9.800 Median :0.00 Median :0.00

Mean : 9.876 Mean :0.24 Mean :0.16

3rd Qu.:10.100 3rd Qu.:0.00 3rd Qu.:0.00

Max. :12.300 Max. :1.00 Max. :1.00

Stud_res Stud_res2

Min. :-5.01917 Min. :-5.14921

1st Qu.:-0.63975 1st Qu.:-0.60355

Median : 0.13925 Median :-0.01566

Mean :-0.04392 Mean :-0.04333

3rd Qu.: 0.66350 3rd Qu.: 0.69297

Max. : 2.07342 Max. : 2.36014

> table(weather)

weather

0 1

42 8

> table(battle)

battle

0 1

38 12

> plot(hs,turnout)

> plot(college,turnout)

> plot(young,turnout)

> plot(minority,turnout)

> model<-lm(turnout~hs+college+young+minority+battle+weather)

> library(MASS)

> WascherTurnout16$Stud_res2<-studres(model)

> plot(WascherTurnout16$turnout,WascherTurnout16$Stud_res2)

> summary(model)

Call:

lm(formula = turnout ~ hs + college + young + minority + battle +

weather)

Residuals:

Min 1Q Median 3Q Max

-10.1602 -2.1689 -0.0636 2.3969 7.9042

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 30.3846 21.5380 1.411 0.165515

hs 0.3364 0.2452 1.372 0.177135

college 0.6354 0.1361 4.667 2.99e-05 ***

young -1.4750 0.9284 -1.589 0.119449

minority -0.1852 0.0497 -3.726 0.000564 ***

battle 4.7835 1.3611 3.514 0.001051 **

weather 0.7877 1.7170 0.459 0.648717

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.819 on 43 degrees of freedom

Multiple R-squared: 0.6854, Adjusted R-squared: 0.6415

F-statistic: 15.61 on 6 and 43 DF, p-value: 2.061e-09

Appendix: Regression Model Building

> both.model<-step(model, direction='both')

Start: AIC=140.46

turnout ~ hs + college + young + minority + battle + weather

Df Sum of Sq RSS AIC

- weather 1 3.07 630.23 138.70

<none> 627.17 140.46

- hs 1 27.46 654.63 140.60

- young 1 36.81 663.98 141.31

- battle 1 180.15 807.32 151.09

- minority 1 202.43 829.60 152.45

- college 1 317.63 944.80 158.95

Step: AIC=138.7

turnout ~ hs + college + young + minority + battle

Df Sum of Sq RSS AIC

- hs 1 25.20 655.44 138.66

<none> 630.23 138.70

- young 1 33.99 664.22 139.33

+ weather 1 3.07 627.17 140.46

- battle 1 191.95 822.18 150.00

- minority 1 199.58 829.81 150.46

- college 1 340.54 970.77 158.30

Step: AIC=138.66

turnout ~ college + young + minority + battle

Df Sum of Sq RSS AIC

<none> 655.44 138.66

+ hs 1 25.20 630.23 138.70

- young 1 28.74 684.17 138.81

+ weather 1 0.81 654.63 140.60

- battle 1 219.42 874.86 151.10

- minority 1 348.87 1004.31 158.00

- college 1 583.51 1238.95 168.50

> model2<-lm(turnout~college+young+minority+battle)

> summary(model2)

Call:

lm(formula = turnout ~ college + young + minority + battle)

Residuals:

Min 1Q Median 3Q Max

-7.9666 -2.7854 0.0198 2.5285 8.0637

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 57.07238 9.26836 6.158 1.82e-07 ***

college 0.69555 0.10989 6.329 1.01e-07 ***

young -1.26814 0.90285 -1.405 0.167008

minority -0.21367 0.04366 -4.894 1.31e-05 ***

battle 5.15359 1.32780 3.881 0.000337 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.816 on 45 degrees of freedom

Multiple R-squared: 0.6712, Adjusted R-squared: 0.642

F-statistic: 22.97 on 4 and 45 DF, p-value: 2.177e-10

> model3<-lm(turnout~college+minority+battle)

> summary(model3)

Call:

lm(formula = turnout ~ college + minority + battle)

Residuals:

Min 1Q Median 3Q Max

-7.6798 -2.7505 -0.2332 2.2822 8.7076

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 44.94230 3.40017 13.218 < 2e-16 ***

college 0.67157 0.10970 6.122 1.90e-07 ***

minority -0.20467 0.04364 -4.690 2.47e-05 ***

battle 5.63557 1.29619 4.348 7.54e-05 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.857 on 46 degrees of freedom

Multiple R-squared: 0.6568, Adjusted R-squared: 0.6344

F-statistic: 29.34 on 3 and 46 DF, p-value: 9.37e-11

> AIC(model3)

[1] 282.7034

> AIC(model2)

[1] 282.558

> anova(model3,model2)

Analysis of Variance Table

Model 1: turnout ~ college + minority + battle

Model 2: turnout ~ college + young + minority + battle

Res.Df RSS Df Sum of Sq F Pr(>F)

1 46 684.17

2 45 655.44 1 28.736 1.9729 0.167

> Resid.error<-residuals(model2)

> fitted.Y<-predict(model2)

>

> par(mfrow=c(2,3))

> plot(fitted.Y,Resid.error,xlab="Fitted response value",ylab="Model Residuals")

> abline(h=0)

> plot(college,Resid.error)

> abline(h=0)

> plot(young,Resid.error)

> abline(h=0)

> plot(minority,Resid.error)

> abline(h=0)

> hist(Resid.error)

> box()

> qqnorm(Resid.error)

> qqline(Resid.error,probs = c(0.15, 0.85))

> Resid.error<-residuals(model3)

> fitted.Y<-predict(model3)

>

> par(mfrow=c(2,3))

> plot(fitted.Y,Resid.error,xlab="Fitted response value",ylab="Model Residuals")

> abline(h=0)

> plot(college,Resid.error)

> abline(h=0)

> plot(minority,Resid.error)

> abline(h=0)

> hist(Resid.error)

> box()

> qqnorm(Resid.error)

> qqline(Resid.error,probs = c(0.15, 0.85))

> plot(young,minority)

> plot(college,young)

> plot(college,minority)

[1] The dataset can be downloaded at github.com/BradleyWascher/STAT401