Datathon 2015

Daniel Ben-Chitrit, Mason Cooper, Kathleen Flynn, Nobah Lee, Mary Powathil

Getting Started

Problem Area: Financial aid at roughly 1,400 colleges across the country, using data from data.gov

Goal: Identify general trends in aid distribution throughout the country, to determine where schools invest more money: federal grants or work study.

Develop a predictive model to see how the amount of money impacts the number of recipients.

Construct a second model to predict the number of recipients of an award based on the number of recipients for other awards.

Variables:

Federal Award, Disbursement, and Recipients for Perkins, Work Study, and Federal Grants

Visualizing the Data

Without Outliers

Frequency Distribution

Regression Tree

Linear Modeling (1)

Residuals:
     Min       1Q   Median       3Q      Max
-1.25775 -0.17497 -0.00916  0.15783  1.38346

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)         2.98667    0.10493   28.46   <2e-16 ***
log(Federal.Award)  0.55917    0.01228   45.54   <2e-16 ***
log(Recipients)     0.50472    0.01256   40.19   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2728 on 1385 degrees of freedom
Multiple R-squared: 0.938, Adjusted R-squared: 0.9379
F-statistic: 1.047e+04 on 2 and 1385 DF, p-value: < 2.2e-16
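Reading the coefficients: because both sides of the model are logged, each slope acts as an elasticity. With the estimate 0.55917 above, doubling the federal award multiplies the predicted disbursement by 2^0.55917, roughly 1.47, with recipients held fixed. The sketch below illustrates the idea on synthetic data; the data frame and its values are made up for illustration and are not the data.gov figures.

```r
# Synthetic stand-in for the financial aid data (NOT the real values).
set.seed(1)
n <- 200
finaid <- data.frame(
  Federal.Award = exp(rnorm(n, mean = 11, sd = 1)),
  Recipients    = exp(rnorm(n, mean = 5,  sd = 1))
)
# Disbursements generated from a known log-log relationship plus noise.
finaid$Disbursements <- exp(3 + 0.55 * log(finaid$Federal.Award) +
                              0.50 * log(finaid$Recipients) +
                              rnorm(n, sd = 0.25))

fit <- lm(log(Disbursements) ~ log(Federal.Award) + log(Recipients),
          data = finaid)
print(coef(fit))

# In a log-log model a slope b means: multiplying the predictor by k
# multiplies the predicted response by k^b, other predictors held fixed.
b_award <- coef(fit)[["log(Federal.Award)"]]
doubling_effect <- 2^b_award  # multiplicative change when the award doubles
print(doubling_effect)
```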

Comparing fitted and observed values for the linear model that relates work-study disbursements to federal award and recipients (all log-transformed):

> summary(predict(lm(log(Disbursements) ~ log(Federal.Award) + log(Recipients))))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  9.077  11.870  12.480  12.560  13.270  15.030

> summary(log(Disbursements))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  8.963  11.840  12.480  12.560  13.400  15.140
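Comparing two five-number summaries shows that the distributions line up, but not how close each school's prediction is to its observed value. A per-observation check such as RMSE or the squared correlation between fitted and observed values can complement it. This is a minimal sketch on synthetic data; the variable names mirror the slides, but the values are invented.

```r
# Synthetic stand-in data (NOT the real data.gov values).
set.seed(2)
n <- 200
log_award <- rnorm(n, mean = 11, sd = 1)
log_disb  <- 1 + 0.9 * log_award + rnorm(n, sd = 0.3)

fit  <- lm(log_disb ~ log_award)
pred <- fitted(fit)  # same values as predict(fit) on the training data

# Root mean squared error on the log scale: typical size of a miss.
rmse <- sqrt(mean((log_disb - pred)^2))

# For least squares with an intercept, the squared correlation between
# fitted and observed values equals the model's R-squared.
r2 <- cor(log_disb, pred)^2

print(c(rmse = rmse, r2 = r2, r2_from_summary = summary(fit)$r.squared))
```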

Linear Modeling (2)

Linear Modeling (3)

Residuals:
    Min      1Q  Median      3Q     Max
-5.8332 -0.5512  0.2171  0.7403  3.3210

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)        -1.2733     0.1919  -6.634 4.66e-11 ***
log(Recipients)     0.8744     0.0343  25.491  < 2e-16 ***
log(Recipients.2)   0.2573     0.0319   8.067 1.54e-15 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.167 on 1385 degrees of freedom
Multiple R-squared: 0.4542, Adjusted R-squared: 0.4534
F-statistic: 576.2 on 2 and 1385 DF, p-value: < 2.2e-16
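To use this fit on the original scale, the predicted log value has to be exponentiated. The sketch below plugs the coefficients from the summary above into a small helper; the function name and the example inputs (200 work-study recipients, 1,000 grant recipients) are hypothetical. Note that exponentiating a log-scale prediction gives something closer to a typical (median-like) value than a mean.

```r
# Coefficients copied from the Linear Modeling (3) summary.
b0      <- -1.2733  # intercept
b_ws    <-  0.8744  # log(Recipients): work-study recipients
b_grant <-  0.2573  # log(Recipients.2): federal grant recipients

# Hypothetical helper: predicted Perkins recipients on the original scale.
predict_perkins <- function(ws_recipients, grant_recipients) {
  exp(b0 + b_ws * log(ws_recipients) + b_grant * log(grant_recipients))
}

# Example inputs are illustrative, not taken from the data set.
example_pred <- predict_perkins(200, 1000)
print(example_pred)
```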

Comparing fitted and observed values for the linear model that relates recipients of the Perkins award to recipients of work study and federal grants (all log-transformed):

> summary(predict(lm(log(Recipients.1) ~ log(Recipients) + log(Recipients.2))))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.690   4.284   4.830   4.846   5.400   6.968

> summary(log(Recipients.1))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.609   3.806   5.056   4.846   5.763   7.562

Linear Modeling (4)

work_diff <- (finaid$Disbursements - finaid$Federal.Award) / 10000

Call: lm(formula = work_diff ~ finaid$Recipients)

Residual standard error: 33.3 on 1386 degrees of freedom
Multiple R-squared: 0.3563, Adjusted R-squared: 0.3558
F-statistic: 767.2 on 1 and 1386 DF, p-value: < 2.2e-16

Call: lm(formula = work_diff ~ finaid$Recipients + finaid$Federal.Award)

Residuals:
    Min      1Q  Median      3Q     Max
-159.94   -6.72   -0.27    4.32  922.89

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)
(Intercept)          -4.635e+00  1.080e+00  -4.293 1.89e-05 ***
finaid$Recipients     8.363e-02  3.232e-03  25.880  < 2e-16 ***
finaid$Federal.Award -2.883e-05  2.384e-06 -12.094  < 2e-16 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 31.69 on 1385 degrees of freedom
Multiple R-squared: 0.4178, Adjusted R-squared: 0.4169
F-statistic: 496.9 on 2 and 1385 DF, p-value: < 2.2e-16

Call: lm(formula = work_diff ~ finaid$Recipients + finaid$Disbursements)

Residuals:
    Min      1Q  Median      3Q     Max
-207.18   -6.51    2.89    7.19  439.84

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)
(Intercept)          -5.359e+00  9.446e-01  -5.673 1.70e-08 ***
finaid$Recipients    -2.346e-02  3.403e-03  -6.895 8.16e-12 ***
finaid$Disbursements  4.499e-05  1.843e-06  24.413  < 2e-16 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 27.86 on 1385 degrees of freedom
Multiple R-squared: 0.55, Adjusted R-squared: 0.5493
F-statistic: 846.3 on 2 and 1385 DF, p-value: < 2.2e-16
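A note on how these models are written: the calls above reference columns as finaid$..., which fits the same model but generally prevents predict() from scoring new schools through newdata. Passing data = finaid gives identical coefficients with cleaner names. The sketch below shows this on synthetic data; the values, the fit_dollar and fit_data names, and the new_school row are all hypothetical.

```r
# Synthetic stand-in for finaid (NOT the real data.gov values).
set.seed(3)
n <- 100
finaid <- data.frame(
  Recipients    = round(runif(n, min = 10, max = 2000)),
  Federal.Award = runif(n, min = 1e4, max = 2e6)
)
finaid$Disbursements <- finaid$Federal.Award * runif(n, min = 0.9, max = 1.5)
# Same response as on the slide: difference in units of $10,000.
finaid$work_diff <- (finaid$Disbursements - finaid$Federal.Award) / 10000

# Style used on the slides: columns referenced with finaid$...
fit_dollar <- lm(finaid$work_diff ~ finaid$Recipients + finaid$Federal.Award)

# Equivalent fit using the data argument.
fit_data <- lm(work_diff ~ Recipients + Federal.Award, data = finaid)
print(names(coef(fit_data)))

# With the data argument, new observations can be scored directly.
new_school <- data.frame(Recipients = 500, Federal.Award = 5e5)
pred_new <- predict(fit_data, newdata = new_school)
print(pred_new)
```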

grant_diff <- (finaid$Disbursements.2 - finaid$Federal.Award.2) / 10000

Call: lm(formula = grant_diff ~ finaid$Recipients.2)

Residual standard error: 20.54 on 1386 degrees of freedom
Multiple R-squared: 0.2782, Adjusted R-squared: 0.2777
F-statistic: 534.1 on 1 and 1386 DF, p-value: < 2.2e-16

Call: lm(formula = grant_diff ~ finaid$Recipients.2 + finaid$Federal.Award.2)

Residuals:
     Min       1Q   Median       3Q      Max
-130.635   -3.743   -1.094    0.989  170.120

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)
(Intercept)            -3.793e-01  5.499e-01  -0.690    0.490
finaid$Recipients.2    -1.417e-04  6.104e-04  -0.232    0.816
finaid$Federal.Award.2  4.343e-05  1.558e-06  27.868   <2e-16 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 16.45 on 1385 degrees of freedom
Multiple R-squared: 0.5375, Adjusted R-squared: 0.5368
F-statistic: 804.9 on 2 and 1385 DF, p-value: < 2.2e-16

Call: lm(formula = grant_diff ~ finaid$Recipients.2 + finaid$Disbursements.2)

Residuals:
    Min      1Q  Median      3Q     Max
-96.102  -2.105   0.478   2.299  92.405

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)
(Intercept)            -2.374e+00  3.497e-01  -6.789 1.68e-11 ***
finaid$Recipients.2    -4.067e-03  3.791e-04 -10.726  < 2e-16 ***
finaid$Disbursements.2  4.008e-05  6.511e-07  61.553  < 2e-16 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10.63 on 1385 degrees of freedom
Multiple R-squared: 0.8068, Adjusted R-squared: 0.8065
F-statistic: 2891 on 2 and 1385 DF, p-value: < 2.2e-16

> summary(predict(work_lm2))
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
 -13.830   -1.179    3.938   13.130   14.500  516.800

> summary(work_diff)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
-109.3000   -0.4494    3.7460   13.1300   14.4600  956.6000

> summary(predict(grant_lm2))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 -8.273   2.048   5.624  12.760  14.960 287.600

> summary(grant_diff)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-28.930   1.928   5.299  12.760  12.700 252.400

Summary

Discoveries:

  1. The most accurate prediction was for disbursement, using federal award and recipients as predictors (R-squared of 0.938 on the log scale)
  2. When predicting the number of recipients of one award from the recipients of the other two, the models for Perkins and Work Study were the most accurate
  3. Log-transforming the values with log() was important: it normalized the skewed distributions and put the variables on a scale that a linear model could fit well
  4. Predicting the difference between disbursement and federal award estimates how much the school itself provides; this model fit best for federal grants (R-squared of 0.81, versus 0.55 for work study)
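The point about log() in discovery 3 can be illustrated with a small sketch. Dollar amounts and recipient counts tend to be strongly right-skewed, and taking logs makes them far more symmetric. The data here are simulated from a log-normal distribution, which is an assumption for illustration only, and skew() is a hypothetical helper rather than a base R function.

```r
# Simulated right-skewed "dollar amounts" (NOT the real data.gov values).
set.seed(4)
x <- rlnorm(1000, meanlog = 12, sdlog = 1)

# Hypothetical helper: sample skewness (third standardized moment).
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew_raw <- skew(x)       # large and positive for right-skewed data
skew_log <- skew(log(x))  # close to zero after the log transform
print(c(raw = skew_raw, logged = skew_log))
```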

Future Studies: Explore whether the relationships in the data are linear or non-linear, map values using ggmap, and explore more categorical variables

Questions?