Datathon 2015
Daniel Ben-Chitrit, Mason Cooper, Kathleen Flynn, Nobah Lee, Mary Powathil
Getting Started
Problem Area: Financial aid at roughly 1,400 colleges across the country, using data from data.gov
Goal: Identify general trends in aid distribution throughout the country, to determine where schools invest more money: federal grants or work study.
Develop a predictive model to see how the amount of money impacts the number of recipients.
Construct a second model to predict the number of recipients of an award based on the number of recipients for other awards.
Variables:
Federal Award, Disbursement, and Recipients for Perkins, Work Study, and Federal Grants
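The models later in the deck refer to columns such as Recipients, Recipients.1 and Recipients.2. A likely explanation, offered here as an assumption rather than something the slides state, is that the source file repeats the same three column headers for each award and R made them unique on import. The sketch below uses hypothetical headers to show how base R's make.names() produces that suffix pattern; read.csv() applies the same rule by default (check.names = TRUE).

```r
# Hypothetical raw headers: the same three names repeated once per award.
raw_headers <- rep(c("Federal Award", "Disbursements", "Recipients"), times = 3)

# make.names() replaces spaces with dots and, with unique = TRUE,
# appends .1, .2, ... to later duplicates.
clean_headers <- make.names(raw_headers, unique = TRUE)
print(clean_headers)
```

Judging from the later slides, the unsuffixed columns appear to hold work study, the .1 columns Perkins, and the .2 columns federal grants.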
Visualizing the Data
Without Outliers
Frequency Distribution
Regression Tree
Linear Modeling (1)
Residuals:
Min 1Q Median 3Q Max
-1.25775 -0.17497 -0.00916 0.15783 1.38346
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.98667 0.10493 28.46 <2e-16 ***
log(Federal.Award) 0.55917 0.01228 45.54 <2e-16 ***
log(Recipients) 0.50472 0.01256 40.19 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2728 on 1385 degrees of freedom
Multiple R-squared: 0.938, Adjusted R-squared: 0.9379
F-statistic: 1.047e+04 on 2 and 1385 DF, p-value: < 2.2e-16
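Because both sides of this model are logged, each slope can be read as an elasticity. The sketch below is a minimal illustration on synthetic data, not the team's script: the data frame finaid_demo and the values used to generate it are assumptions chosen to resemble the reported fit, while the column names follow the slides.

```r
# Synthetic stand-in for the real data; column names mirror the slides.
set.seed(1)
n <- 200
finaid_demo <- data.frame(
  Federal.Award = exp(rnorm(n, mean = 12, sd = 1)),
  Recipients    = exp(rnorm(n, mean = 5,  sd = 1))
)
# Generate disbursements from a known log-log relationship plus noise.
finaid_demo$Disbursements <- exp(
  3 + 0.56 * log(finaid_demo$Federal.Award) +
    0.50 * log(finaid_demo$Recipients) + rnorm(n, sd = 0.27)
)

# Same model form as on the slide, fitted with a data = argument.
fit1 <- lm(log(Disbursements) ~ log(Federal.Award) + log(Recipients),
           data = finaid_demo)
print(coef(fit1))

# Elasticity reading of the slide's estimate 0.55917: the percent change in
# disbursements that goes with a 10% larger federal award, recipients fixed.
pct_change <- (1.10^0.55917 - 1) * 100
print(round(pct_change, 1))
```

With the slide's coefficient, a 10% larger federal award corresponds to roughly a 5.5% increase in disbursements when the number of recipients is held fixed.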
summary(predict(lm(log(Disbursements) ~ log(Federal.Award) + log(Recipients))))
Min. 1st Qu. Median Mean 3rd Qu. Max.
9.077 11.870 12.480 12.560 13.270 15.030
summary(log(Disbursements))
Min. 1st Qu. Median Mean 3rd Qu. Max.
8.963 11.840 12.480 12.560 13.400 15.140
Comparison of the linear model's predictions with the observed values; the model relates work-study disbursements to federal awards and recipients.
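The two five-number summaries above compare distributions, not individual schools. A sketch of the same comparison on synthetic data follows, with a row-by-row error measure added; the data frame demo and its generating values are assumptions for illustration only.

```r
# Synthetic data shaped like the work-study columns (assumed values).
set.seed(2)
n <- 300
demo <- data.frame(
  Federal.Award = exp(rnorm(n, 12, 1)),
  Recipients    = exp(rnorm(n, 5, 1))
)
demo$Disbursements <- exp(3 + 0.56 * log(demo$Federal.Award) +
                            0.50 * log(demo$Recipients) + rnorm(n, sd = 0.27))

fit <- lm(log(Disbursements) ~ log(Federal.Award) + log(Recipients), data = demo)
predicted <- predict(fit)             # fitted values on the log scale
observed  <- log(demo$Disbursements)

# Distribution-level comparison, as on the slide.
print(summary(predicted))
print(summary(observed))

# Least squares with an intercept reproduces the observed mean exactly,
# so matching means do not by themselves indicate a good fit.
print(all.equal(mean(predicted), mean(observed)))

# Row-by-row error: root mean squared difference on the log scale.
rmse <- sqrt(mean((observed - predicted)^2))
print(rmse)
```

This is why the Mean columns agree in both summaries on the slide; the quartiles and extremes are the informative part of that comparison.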
Linear Modeling (2)
Linear Modeling (3)
Residuals:
Min 1Q Median 3Q Max
-5.8332 -0.5512 0.2171 0.7403 3.3210
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.2733 0.1919 -6.634 4.66e-11 ***
log(Recipients) 0.8744 0.0343 25.491 < 2e-16 ***
log(Recipients.2) 0.2573 0.0319 8.067 1.54e-15 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.167 on 1385 degrees of freedom
Multiple R-squared: 0.4542, Adjusted R-squared: 0.4534
F-statistic: 576.2 on 2 and 1385 DF, p-value: < 2.2e-16
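The F-statistic and R-squared in these printouts are tied together by F = (R² / p) / ((1 − R²) / df), where p is the number of predictors and df the residual degrees of freedom. A short check of the reported figures, using a helper name (f_from_r2) introduced here only for illustration:

```r
# Overall F-statistic implied by R-squared, p predictors and residual df.
f_from_r2 <- function(r2, p, df_resid) {
  (r2 / p) / ((1 - r2) / df_resid)
}

# Model (3): R-squared 0.4542 on 2 and 1385 DF; the slide reports F = 576.2.
f_model3 <- f_from_r2(0.4542, p = 2, df_resid = 1385)
print(f_model3)

# Model (1): R-squared 0.938 on 2 and 1385 DF; the slide reports F = 1.047e+04.
f_model1 <- f_from_r2(0.938, p = 2, df_resid = 1385)
print(f_model1)
```

Both values agree with the printed output up to the rounding of R-squared.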
summary(predict(lm(log(Recipients.1) ~ log(Recipients) + log(Recipients.2))))
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.690 4.284 4.830 4.846 5.400 6.968
summary(log(Recipients.1))
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.609 3.806 5.056 4.846 5.763 7.562
Comparison of the linear model's predictions with the observed values; the model relates Perkins award recipients to recipients of federal grants and work study.
Linear Modeling (4)
work_diff<-(finaid$Disbursements-finaid$Federal.Award)/10000
Call:
lm(formula = work_diff ~ finaid$Recipients)
Residual standard error: 33.3 on 1386 degrees of freedom
Multiple R-squared: 0.3563, Adjusted R-squared: 0.3558
F-statistic: 767.2 on 1 and 1386 DF, p-value: < 2.2e-16
Call:
lm(formula = work_diff ~ finaid$Recipients + finaid$Federal.Award)
Residuals:
Min 1Q Median 3Q Max
-159.94 -6.72 -0.27 4.32 922.89
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.635e+00 1.080e+00 -4.293 1.89e-05 ***
finaid$Recipients 8.363e-02 3.232e-03 25.880 < 2e-16 ***
finaid$Federal.Award -2.883e-05 2.384e-06 -12.094 < 2e-16 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 31.69 on 1385 degrees of freedom
Multiple R-squared: 0.4178, Adjusted R-squared: 0.4169
F-statistic: 496.9 on 2 and 1385 DF, p-value: < 2.2e-16
Call:
lm(formula = work_diff ~ finaid$Recipients + finaid$Disbursements)
Residuals:
Min 1Q Median 3Q Max
-207.18 -6.51 2.89 7.19 439.84
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.359e+00 9.446e-01 -5.673 1.70e-08 ***
finaid$Recipients -2.346e-02 3.403e-03 -6.895 8.16e-12 ***
finaid$Disbursements 4.499e-05 1.843e-06 24.413 < 2e-16 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 27.86 on 1385 degrees of freedom
Multiple R-squared: 0.55, Adjusted R-squared: 0.5493
F-statistic: 846.3 on 2 and 1385 DF, p-value: < 2.2e-16
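The three work-study fits above can be sketched as follows on synthetic data. The data frame finaid_demo and its generating values are assumptions, and the models are written with a data = argument instead of finaid$ prefixes, which keeps coefficient names short and lets predict() accept new data. The last fit is an addition for illustration: since work_diff is computed from Disbursements and Federal.Award, a model containing both reproduces it exactly, which is worth keeping in mind when reading the jump in R-squared once either one enters as a predictor.

```r
# Synthetic stand-in for the work-study columns (assumed values).
set.seed(3)
n <- 250
finaid_demo <- data.frame(
  Recipients    = round(exp(rnorm(n, 5, 1))),
  Federal.Award = exp(rnorm(n, 12, 1))
)
finaid_demo$Disbursements <- finaid_demo$Federal.Award * exp(rnorm(n, 0.3, 0.3))

# Response as defined on the slide: disbursement minus award, in units of $10,000.
finaid_demo$work_diff <-
  (finaid_demo$Disbursements - finaid_demo$Federal.Award) / 10000

fit_recip <- lm(work_diff ~ Recipients, data = finaid_demo)
fit_disb  <- lm(work_diff ~ Recipients + Disbursements, data = finaid_demo)
print(summary(fit_recip)$r.squared)
print(summary(fit_disb)$r.squared)

# Illustration only: both ingredients of work_diff as predictors give an
# exact fit, with slopes of +1/10000 and -1/10000.
fit_both <- lm(work_diff ~ Disbursements + Federal.Award, data = finaid_demo)
print(coef(fit_both))
print(max(abs(resid(fit_both))))
```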
grant_diff<-(finaid$Disbursements.2-finaid$Federal.Award.2)/10000
Call:
lm(formula = grant_diff ~ finaid$Recipients.2)
Residual standard error: 20.54 on 1386 degrees of freedom
Multiple R-squared: 0.2782, Adjusted R-squared: 0.2777
F-statistic: 534.1 on 1 and 1386 DF, p-value: < 2.2e-16
Call:
lm(formula = grant_diff ~ finaid$Recipients.2 + finaid$Federal.Award.2)
Residuals:
Min 1Q Median 3Q Max
-130.635 -3.743 -1.094 0.989 170.120
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.793e-01 5.499e-01 -0.690 0.490
finaid$Recipients.2 -1.417e-04 6.104e-04 -0.232 0.816
finaid$Federal.Award.2 4.343e-05 1.558e-06 27.868 <2e-16 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 16.45 on 1385 degrees of freedom
Multiple R-squared: 0.5375, Adjusted R-squared: 0.5368
F-statistic: 804.9 on 2 and 1385 DF, p-value: < 2.2e-16
Call:
lm(formula = grant_diff ~ finaid$Recipients.2 + finaid$Disbursements.2)
Residuals:
Min 1Q Median 3Q Max
-96.102 -2.105 0.478 2.299 92.405
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.374e+00 3.497e-01 -6.789 1.68e-11 ***
finaid$Recipients.2 -4.067e-03 3.791e-04 -10.726 < 2e-16 ***
finaid$Disbursements.2 4.008e-05 6.511e-07 61.553 < 2e-16 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 10.63 on 1385 degrees of freedom
Multiple R-squared: 0.8068, Adjusted R-squared: 0.8065
F-statistic: 2891 on 2 and 1385 DF, p-value: < 2.2e-16
> summary(predict(work_lm2))
Min. 1st Qu. Median Mean 3rd Qu. Max.
-13.830 -1.179 3.938 13.130 14.500 516.800
> summary(work_diff)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-109.3000 -0.4494 3.7460 13.1300 14.4600 956.6000
> summary(predict(grant_lm2))
Min. 1st Qu. Median Mean 3rd Qu. Max.
-8.273 2.048 5.624 12.760 14.960 287.600
> summary(grant_diff)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-28.930 1.928 5.299 12.760 12.700 252.400
Summary
Discoveries:
Future Studies: Explore whether the relationships in the data are linear or non-linear, map values using ggmap, and explore more categorical variables
Questions?