Wascher

Introduction

Following the 2016 US presidential election, analysts and voters alike were shocked. Outlets like The Huffington Post and The New York Times used statistics to forecast the election results, yet still failed to predict the victory of Donald Trump. Even so, there is some truth in the now-debunked theory that polls missed “shy” Trump supporters: this race centered not on persuading voters, but on turning them out.

Despite the ongoing clash between persuasion and turnout, there’s no denying the latter’s importance. Historically, though, voter turnout has been low: it has averaged 53.6%, ranging from a low of 49% in 1996 to a high of 58.2% in 2008. Larger still is the variation across states: in 2012, only 44.2% of Hawaii’s voting-eligible population participated, compared with 78.2% of Minnesota’s. A gap of 34 percentage points raises the question of which factors influence voter turnout, and this project seeks to answer it.

This analysis uses self-gathered data, accessed from the US Census and the American Community Survey (ACS). The dataset contains nine variables: state and stateID identify each state; turnout, the response, is the percentage of the voting-eligible population that voted; and the last six serve as predictors: hs and college (rates of high school and college attainment), minority (the percentage of the population identifying as non-White), young (the share of the population that is young), battle (whether the state was a battleground), and weather (whether the state saw unfavorable weather, i.e., rain or snow, on Election Day).[1]
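To make the R code in the appendix reproducible, the data must first be read into a data frame named WascherTurnout16 and attached so that columns can be referenced by name. A minimal loading sketch follows; the file name is hypothetical, standing in for the CSV hosted in the repository cited in the footnote:

# Hypothetical file name; the actual data file lives in the GitHub repository cited below.
WascherTurnout16 <- read.csv("WascherTurnout16.csv", stringsAsFactors = FALSE)
attach(WascherTurnout16)   # lets later code refer to columns such as turnout and hs directly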

This analysis will dissect these predictors, the leading factors that political scientists and pundits believe affect voter turnout from state to state, and assess how strong a role each plays in a state’s overall turnout, developing conclusions through multiple regression analysis. Several candidate models will be considered.

Exploratory Data Analysis

For the purposes of exploratory data analysis, simple visual and quantitative summaries were produced for each variable and for the dataset as a whole. Quantitatively, the following was found across the 50 observations:

       hs           college         minority         young          turnout
 Min.   :82.20   Min.   :19.60   Min.   : 5.20   Min.   : 8.500   Min.   :43.00
 1st Qu.:86.10   1st Qu.:26.57   1st Qu.:14.05   1st Qu.: 9.600   1st Qu.:57.75
 Median :89.35   Median :29.55   Median :21.65   Median : 9.800   Median :61.85
 Mean   :88.70   Mean   :29.81   Mean   :22.75   Mean   : 9.876   Mean   :61.66
 3rd Qu.:91.08   3rd Qu.:32.60   3rd Qu.:31.18   3rd Qu.:10.100   3rd Qu.:65.65
 Max.   :93.50   Max.   :41.50   Max.   :75.00   Max.   :12.300   Max.   :74.80

As for the two categorical variables, battle and weather, there were 12 battleground states and 38 non-battleground states, as originally coded. Furthermore, unfavorable weather was recorded for only 8 states, with the other 42 seeing no rain or snow on November 8th.

To gain a visual understanding of the data, the four interval predictors were plotted against turnout. Here, hs and college show positive linear relationships with turnout, while young does not appear to show any relationship at all. With 75% of its population identifying as non-White, Hawaii appears to be an outlier. Though an analysis of studentized residuals confirms this, Hawaii will be kept in the dataset, as omitting a state does more harm than good in a study of all 50 states; the same rule applies to any other outliers, influential points, and high-leverage points in the study.
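A minimal sketch of that studentized-residual check, assuming the MASS package (whose studres() function also appears in the appendix) and the WascherTurnout16 data frame loaded above; the cutoff of 2 is a common rule of thumb rather than a hard threshold:

library(MASS)                                   # provides studres()
full_model <- lm(turnout ~ hs + college + young + minority + battle + weather,
                 data = WascherTurnout16)
stud <- studres(full_model)                     # studentized residuals
WascherTurnout16$state[abs(stud) > 2]           # per the analysis above, Hawaii is among the flagged states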

Additionally, a saturated multiple linear regression model was run with every predictor. At the 95% confidence level, the only statistically significant predictors in that model are college, minority, and battle. The adjusted R-squared for this model is 0.6415, the residual standard error is 3.819, and the F-statistic (15.61 on 6 and 43 degrees of freedom, p = 2.061e-09) indicates the model as a whole is statistically significant.

Regression Model Building

To first ascertain a reasonable model, variables were chosen using stepwise selection, which returned a model with college, young, minority, and battle as predictors. Fitting this model on its own yields an adjusted R-squared of 0.642, a residual standard error of 3.816, and an F-statistic and accompanying p-value that again point to overall statistical significance. This is marginally better than the full model.

Because young appears insignificant, a third model was run with only college, minority, and battle. Its adjusted R-squared of 0.6344 and residual standard error of 3.857 indicate a slightly poorer fit, and Model 2 (AIC 282.558) edges out Model 3 (AIC 282.7034), again suggesting a better fit for Model 2. By contrast, a nested F-test of the two models indicates that young is not worth adding as a predictor (F = 1.97, p = 0.167). Nevertheless, weighing the other measures, Model 2, which includes young, appears superior.
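For reference, the nested F-statistic reported by anova() can be reproduced by hand from the residual sums of squares in its output; a short sketch using the values reported in the appendix:

# RSS values taken from the anova(model3, model2) output in the appendix
rss_reduced <- 684.17   # Model 3: college + minority + battle
rss_full    <- 655.44   # Model 2: adds young
f_stat <- ((rss_reduced - rss_full) / 1) / (rss_full / 45)    # about 1.97
p_val  <- pf(f_stat, df1 = 1, df2 = 45, lower.tail = FALSE)   # about 0.167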

Checking assumptions for this model proves successful: the residual plots, histogram, and QQ-plot show that the model satisfies the assumptions of linearity, normality, and constant variance. Running the same checks on Model 3 suggests it does not meet these assumptions as fully, further demonstrating why Model 2 is superior. Scatterplots of the predictors against one another also show no signs of multicollinearity. Finally, the variable plots from the exploratory data analysis indicate no need for transformations of the existing variables, and including interactions would likely bear no fruit either. Within this dataset, Model 2 appears to be the best model for predicting voter turnout.
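As a supplementary check not performed in the original analysis, variance inflation factors provide a numeric complement to those pairwise scatterplots; a minimal sketch, assuming the car package is installed:

library(car)
model2 <- lm(turnout ~ college + young + minority + battle, data = WascherTurnout16)
vif(model2)   # values near 1 (and well below 5) would support the no-multicollinearity reading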

Results and Conclusion

The final model uses college, young, minority, and battle as predictors. Its equation is:

y-hat = 57.07238 + 0.69555(college) - 1.26814(young) - 0.21367(minority) + 5.15359(battle)
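As an illustration of how the fitted equation would be used, a brief sketch with purely hypothetical predictor values, assuming the model2 object from the appendix is in the workspace:

# Hypothetical state: 30% college attainment, 10% young, 20% minority, battleground (battle = 1)
new_state <- data.frame(college = 30, young = 10, minority = 20, battle = 1)
predict(model2, newdata = new_state)
# By hand: 57.07238 + 0.69555*30 - 1.26814*10 - 0.21367*20 + 5.15359*1, roughly 66.1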

Although this model performed best across these statistical analyses, it is still not a strong predictor of turnout: its adjusted R-squared of 0.642 leaves roughly a third of the state-to-state variation in turnout unexplained, and its residual standard error of 3.816 and AIC of 282.558 reflect the same modest fit. Such a conclusion makes sense, as these four variables are likely not the largest determinants of voter turnout.

In this way, the methodology of this study could be improved. Tracking different variables might yield a stronger model, and the scope could be expanded beyond the 2016 election to include more observations in the sample. Finally, it is interesting which variables made it into the final model and which did not. While some exclusions, like weather, are expected, it is particularly notable that college attainment appears in the final model (and even in the smaller third model) while high school attainment does not.


R Code

Exploratory Data Analysis

> summary(WascherTurnout16)
    state             stateID             turnout    
 Length:50          Length:50          Min.   :43.00  
 Class :character   Class :character   1st Qu.:57.75  
 Mode  :character   Mode  :character   Median :61.85  
                                       Mean   :61.66  
                                       3rd Qu.:65.65  
                                       Max.   :74.80  
       hs           college         minority    
 Min.   :82.20   Min.   :19.60   Min.   : 5.20  
 1st Qu.:86.10   1st Qu.:26.57   1st Qu.:14.05  
 Median :89.35   Median :29.55   Median :21.65  
 Mean   :88.70   Mean   :29.81   Mean   :22.75  
 3rd Qu.:91.08   3rd Qu.:32.60   3rd Qu.:31.18  
 Max.   :93.50   Max.   :41.50   Max.   :75.00  
     young            battle        weather    
 Min.   : 8.500   Min.   :0.00   Min.   :0.00  
 1st Qu.: 9.600   1st Qu.:0.00   1st Qu.:0.00  
 Median : 9.800   Median :0.00   Median :0.00  
 Mean   : 9.876   Mean   :0.24   Mean   :0.16  
 3rd Qu.:10.100   3rd Qu.:0.00   3rd Qu.:0.00  
 Max.   :12.300   Max.   :1.00   Max.   :1.00  
    Stud_res          Stud_res2      
 Min.   :-5.01917   Min.   :-5.14921  
 1st Qu.:-0.63975   1st Qu.:-0.60355  
 Median : 0.13925   Median :-0.01566  
 Mean   :-0.04392   Mean   :-0.04333  
 3rd Qu.: 0.66350   3rd Qu.: 0.69297  
 Max.   : 2.07342   Max.   : 2.36014  

> table(weather)
weather
 0  1
42  8

> table(battle)
battle
 0  1
38 12


> # Scatterplots of each interval predictor against turnout
> plot(hs,turnout)
> plot(college,turnout)
> plot(young,turnout)
> plot(minority,turnout)
> # Saturated model with every predictor; studentized residuals stored for the outlier check
> library(MASS)   # studres() below comes from the MASS package
> model<-lm(turnout~hs+college+young+minority+battle+weather)
> WascherTurnout16$Stud_res2<-studres(model)
> plot(WascherTurnout16$turnout,WascherTurnout16$Stud_res2)


> summary(model)

Call:
lm(formula = turnout ~ hs + college + young + minority + battle +
    weather)

Residuals:
     Min       1Q   Median       3Q      Max
-10.1602  -2.1689  -0.0636   2.3969   7.9042

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  30.3846    21.5380   1.411 0.165515    
hs            0.3364     0.2452   1.372 0.177135    
college       0.6354     0.1361   4.667 2.99e-05 ***
young        -1.4750     0.9284  -1.589 0.119449    
minority     -0.1852     0.0497  -3.726 0.000564 ***
battle        4.7835     1.3611   3.514 0.001051 **
weather       0.7877     1.7170   0.459 0.648717    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.819 on 43 degrees of freedom
Multiple R-squared:  0.6854,    Adjusted R-squared:  0.6415
F-statistic: 15.61 on 6 and 43 DF,  p-value: 2.061e-09


Regression Model Building

> both.model<-step(model, direction='both')
Start:  AIC=140.46
turnout ~ hs + college + young + minority + battle + weather

           Df Sum of Sq    RSS    AIC
- weather   1      3.07 630.23 138.70
<none>                  627.17 140.46
- hs        1     27.46 654.63 140.60
- young     1     36.81 663.98 141.31
- battle    1    180.15 807.32 151.09
- minority  1    202.43 829.60 152.45
- college   1    317.63 944.80 158.95

Step:  AIC=138.7
turnout ~ hs + college + young + minority + battle

           Df Sum of Sq    RSS    AIC
- hs        1     25.20 655.44 138.66
<none>                  630.23 138.70
- young     1     33.99 664.22 139.33
+ weather   1      3.07 627.17 140.46
- battle    1    191.95 822.18 150.00
- minority  1    199.58 829.81 150.46
- college   1    340.54 970.77 158.30

Step:  AIC=138.66
turnout ~ college + young + minority + battle

           Df Sum of Sq     RSS    AIC
<none>                   655.44 138.66
+ hs        1     25.20  630.23 138.70
- young     1     28.74  684.17 138.81
+ weather   1      0.81  654.63 140.60
- battle    1    219.42  874.86 151.10
- minority  1    348.87 1004.31 158.00
- college   1    583.51 1238.95 168.50

> model2<-lm(turnout~college+young+minority+battle)
> summary(model2)

Call:
lm(formula = turnout ~ college + young + minority + battle)

Residuals:
    Min      1Q  Median      3Q     Max
-7.9666 -2.7854  0.0198  2.5285  8.0637

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 57.07238    9.26836   6.158 1.82e-07 ***
college      0.69555    0.10989   6.329 1.01e-07 ***
young       -1.26814    0.90285  -1.405 0.167008    
minority    -0.21367    0.04366  -4.894 1.31e-05 ***
battle       5.15359    1.32780   3.881 0.000337 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.816 on 45 degrees of freedom
Multiple R-squared:  0.6712,    Adjusted R-squared:  0.642
F-statistic: 22.97 on 4 and 45 DF,  p-value: 2.177e-10


> model3<-lm(turnout~college+minority+battle)
> summary(model3)

Call:
lm(formula = turnout ~ college + minority + battle)

Residuals:
    Min      1Q  Median      3Q     Max
-7.6798 -2.7505 -0.2332  2.2822  8.7076

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 44.94230    3.40017  13.218  < 2e-16 ***
college      0.67157    0.10970   6.122 1.90e-07 ***
minority    -0.20467    0.04364  -4.690 2.47e-05 ***
battle       5.63557    1.29619   4.348 7.54e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.857 on 46 degrees of freedom
Multiple R-squared:  0.6568,    Adjusted R-squared:  0.6344
F-statistic: 29.34 on 3 and 46 DF,  p-value: 9.37e-11

> AIC(model3)
[1] 282.7034
> AIC(model2)
[1] 282.558

> anova(model3,model2)
Analysis of Variance Table

Model 1: turnout ~ college + minority + battle
Model 2: turnout ~ college + young + minority + battle
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     46 684.17
2     45 655.44  1    28.736 1.9729  0.167


> # Diagnostic plots for Model 2: residuals vs. fitted values and vs. each predictor,
> # plus a histogram and normal QQ-plot of the residuals
> Resid.error<-residuals(model2)
> fitted.Y<-predict(model2)
> par(mfrow=c(2,3))
> plot(fitted.Y,Resid.error,xlab="Fitted response value",ylab="Model Residuals")
> abline(h=0)
> plot(college,Resid.error)
> abline(h=0)
> plot(young,Resid.error)
> abline(h=0)
> plot(minority,Resid.error)
> abline(h=0)
> hist(Resid.error)
> box()
> qqnorm(Resid.error)
> qqline(Resid.error,probs = c(0.15, 0.85))


> # The same diagnostic plots for Model 3
> Resid.error<-residuals(model3)
> fitted.Y<-predict(model3)
> par(mfrow=c(2,3))
> plot(fitted.Y,Resid.error,xlab="Fitted response value",ylab="Model Residuals")
> abline(h=0)
> plot(college,Resid.error)
> abline(h=0)
> plot(minority,Resid.error)
> abline(h=0)
> hist(Resid.error)
> box()
> qqnorm(Resid.error)
> qqline(Resid.error,probs = c(0.15, 0.85))


> # Pairwise scatterplots among the Model 2 predictors, used to check for multicollinearity
> plot(young,minority)
> plot(college,young)
> plot(college,minority)


[1]  The dataset can be downloaded at github.com/BradleyWascher/STAT401