
Multiple Regression Analysis and Analytical Techniques on the Relation of COVID-19 to Nutrition

Elias Karnoub, Gabriella Provenzano, Andrew Shen, Matthew Tam

December 4, 2020


Abstract

As we are currently living through a pandemic and the topic of COVID-19 continues to gain attention, we should look at what we can do to protect ourselves, our families and friends, and everyone who is sacrificing so much to help. The goal of this project is to analyze the relationship between the number of confirmed COVID-19 cases and the food consumption of each country. To see which variables in our dataset best predicted our outcome, forward, backward, and bidirectional model selection, cross-validation measures, and ridge and LASSO regression were implemented in R. With our analysis, we found that the number of confirmed cases can be predicted by Vegetal.Products, Offals, Stimulants, Miscellaneous, and Vegetable.Oils consumption and Obesity rates.


Introduction


COVID-19 cases compared to other countries

In comparison to other nations, the United States has the most COVID-19 cases in the world.

Pettersson, Henrik, et al. “Tracking Coronavirus' Global Spread.” CNN, Cable News Network, 3 Dec. 2020, www.cnn.com/interactive/2020/health/coronavirus-maps-and-cases/.


Cases in each state

The United States as a whole continues to see an increase in cases, currently concentrated in the Midwest.

“CDC COVID Data Tracker.” United States COVID-19 Cases and Deaths by State, Centers for Disease Control and Prevention, 3 Dec. 2020, covid.cdc.gov/covid-data-tracker/.


Hospitalizations and Deaths in the United States

As we can see, hospitalizations and deaths continue to rise across the country. In addition, the United States is higher in every category than the national average.

“Our Data.” The COVID Tracking Project, The Atlantic Monthly Group, 2020, covidtracking.com/data.


Effects of Age

“COVID-19 Hospitalization and Death by Race/Ethnicity.” Centers for Disease Control and Prevention, Centers for Disease Control and Prevention, 2020, www.cdc.gov/coronavirus/2019-ncov/covid-data/investigations-discovery/hospitalization-death-by-race-ethnicity.html.

The older population is at high risk of death from COVID-19 because the immune system weakens with age. In addition, older people are more likely to have pre-existing health conditions, which puts them at even higher risk.


COVID-19 in New Jersey

New Jersey continues to see an increase in cases as Governor Murphy puts limits on gatherings and shuts down non-essential businesses.

New Jersey COVID-19 Information Hub, New Jersey Department of Health, 2020, covid19.nj.gov/.


Topic: COVID-19

As students living through a pandemic, we were all drawn to this topic because of how much it is affecting the world, the country, and our own lives. COVID-19 has taken over the news and media as concerns continue to grow with rising case counts. As statistics and math majors, we wanted to use the skills learned in this class to attempt to find interesting results in data related to COVID-19. From the case built in the previous slides, it is clear that COVID-19 has become a major problem across the world. We hope to find new perspectives on the issue and to investigate it further through our project.


Chosen Statistical Topic

We chose stepwise regression for our project because we wanted to see which variables in our dataset predicted our outcome best and produced the best-fitting model. Cross-validation also helps us measure the validity and accuracy of our model, which lets us make sure we are not over-fitting or under-fitting the data. We also decided to look at ridge and LASSO regression to see what role multicollinearity plays.


Data Set

  • This data set comes from kaggle.com posted by user Maria Ren labeled “Covid-19 Healthy Diet Dataset”
    • Note: We used Food_Supply_Quantity_kg_Data.csv
  • This data set measures 30 different criteria including obesity rates as well as various metrics relating to COVID-19 affectedness
  • 170 countries are investigated among these predictors
  • With so many different features, multiple regression can be an important tool for analyzing significance for a specific response
  • Variables that rate various nutritional aspects of these countries will be used to assess predictability of how well a country can respond to the global pandemic


Variables in the Dataset

  • Country
  • Alcoholic Beverages
  • Animal products
  • Animal fats
  • Aquatic Products
  • Cereals- Excluding Beer
  • Fish, Seafood
  • Fruits- Excluding Wine
  • Meat
  • Milk- Excluding Butter
  • Miscellaneous
  • Offals
  • Oil Crops
  • Pulses
  • Spices

  • Starchy Roots
  • Stimulants
  • Sugar Crops
  • Sugar and Sweeteners
  • Tree Nuts
  • Vegetal Products
  • Vegetable Oils
  • Obesity
  • Undernourished
  • Confirmed cases
  • Deaths
  • Recovered
  • Active
  • Population


Variables Selected

From this data, we chose confirmed COVID-19 cases as our dependent variable and used the other variables as our potential predictors or independent variables.

For our project, we were interested in determining which independent variables were best predictors of confirmed COVID cases within each country. However, we ignored COVID-19 related data in the set such as Deaths, Recovered, and Active since these are expected to be closely related.


Materials and Methods


Loading Dataset Into R

  • read.csv() # reads the CSV file downloaded to the computer
  • library(faraway), library(psych), library(ggplot2), library(olsrr), library(glmnet), library(caret) # loads the packages that will be needed (note: glm() lives in base R's stats package; glmnet is the package to load for ridge/LASSO)
  • We then cleaned the data to remove rows with missing information
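The steps above can be sketched as follows. This is a minimal sketch: the file name comes from the Kaggle dataset noted earlier, and the exact outcome column names (Deaths, Recovered, Active) are assumptions based on the variable list.

```r
# Sketch of the loading/cleanup step (column names are assumptions
# based on the variable list shown earlier).
library(faraway)
library(psych)
library(ggplot2)
library(olsrr)
library(glmnet)
library(caret)

# Read the Kaggle CSV downloaded to the working directory
food <- read.csv("Food_Supply_Quantity_kg_Data.csv", stringsAsFactors = FALSE)

# Remove rows (countries) with missing information
food <- na.omit(food)

# Drop the other COVID-19 outcomes so they cannot leak into the predictors
food <- subset(food, select = -c(Deaths, Recovered, Active))
```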


Running a regression model

  • We are looking for a relationship between confirmed cases and specific dietary factors. We fit a multiple linear regression with lm(), using confirmed cases as the response and the remaining variables as predictors.
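A minimal sketch of this fit, assuming the cleaned data frame is called food and the confirmed-case column is called Confirmed:

```r
# Multiple regression of confirmed COVID-19 cases on every remaining
# predictor; "." expands to all other columns, and Country is excluded
# because it is a label, not a numeric predictor.
full_model <- lm(Confirmed ~ . - Country, data = food)
summary(full_model)   # coefficients, p-values, R^2 of the full model
```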


Stepwise Regression

  • Stepwise regression is a very useful model selection technique and comes in three varieties:
    • Forward stepwise regression starts with the empty model and adds a variable when it meets a chosen statistical criterion (e.g., adjusted R^2, AIC, p-value).
    • Backward stepwise regression starts with the full model and removes a variable when the model without it meets the chosen criterion.
    • Bidirectional stepwise regression considers variables for both addition and removal, depending on whether the model with or without the variable in question meets the criterion.

  • Bidirectional stepwise regression is used most of the time, as purely forward and purely backward selection can make mistakes due to random sampling fluctuations
    • e.g., with enough predictors, it is possible to add or remove a predictor that should not have been added or removed, without any immediately visible consequence to the model
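The three searches can be sketched with base R's step(), which selects by AIC; reproducing the slides' p-value criterion exactly would require a package such as olsrr instead. The data frame food and its column names are assumptions carried over from the data-loading step.

```r
# Stepwise selection sketch using AIC-based step() from base R's stats package.
null_model <- lm(Confirmed ~ 1, data = food)            # empty model
full_model <- lm(Confirmed ~ . - Country, data = food)  # full model

# Forward: start empty, add predictors one at a time
fwd <- step(null_model, scope = formula(full_model),
            direction = "forward", trace = 0)

# Backward: start full, remove predictors one at a time
bwd <- step(full_model, direction = "backward", trace = 0)

# Bidirectional: consider additions and removals at every step
both <- step(null_model, scope = formula(full_model),
             direction = "both", trace = 0)
```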


Forward stepwise regression in R

  • In order to find out which regressors improve the model and which do not, we need to run a stepwise regression to select the ideal variables.

  • We’ll start by doing a forward stepwise regression on the confirmed cases model.

Predictors selected: Vegetal.Products, Obesity, Offals, Stimulants, Miscellaneous, Vegetable.Oils, Fruits...Excluding.Wine, Eggs, Oilcrops


Backward stepwise regression in R

  • To compare with the forward stepwise regression, we ran a backward stepwise regression on the confirmed cases model to select the best variables to take out of the model.

Predictors selected: Alcoholic.Beverages, Animal.Fats, Aquatic.Products..Other, Cereals...Excluding.Beer, Fish..Seafood, Fruits...Excluding.Wine, Meat, Milk...Excluding.Butter, Miscellaneous, Offals, Oilcrops, Pulses, Spices, Starchy.Roots, Sugar...Sweeteners, Vegetable.Oils, Vegetables


Bidirectional stepwise regression in R

  • To compare with the forward and backward stepwise regressions, we ran a bidirectional stepwise regression on the confirmed cases model to select the best variables to put in the model.

Predictors selected: Vegetal.Products, Obesity, Offals, Stimulants, Miscellaneous, Vegetable.Oils


Cross Validation Model

  • Cross-validation is used to assess the accuracy and validity of a model by separating a dataset into training and testing data
    • Accuracy here means how closely the model's predictions match the held-out observations
    • A valid model should predict the test data well
  • Training data is used to build the model, which then predicts values in the testing dataset
  • K-fold cross-validation is a useful technique because it uses most of the data to estimate the model, and it can be used to compare models' predictive and explanatory power


Cross Validation in R

  • To use this technique in R, the faraway, olsrr, and caret packages need to be installed
  • After loading and cleaning the data set, it is important to first call set.seed() so that the code and results can be reproduced
  • We ran K-fold cross-validation with K = 5 and K = 10
  • We then define the training control so that we can train the model
  • After that, we print the model to get the results
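Those steps can be sketched with caret as below. The formula uses the predictors chosen by the bidirectional search in the results section, and food is assumed to be the cleaned data frame from earlier.

```r
library(caret)

set.seed(123)   # fixes the fold assignment so results are reproducible

# Define the training control: 10-fold CV (use number = 5 for 5-fold)
ctrl <- trainControl(method = "cv", number = 10)

# Train an ordinary least-squares model under that control
cv_model <- train(Confirmed ~ Vegetal.Products + Obesity + Offals +
                    Stimulants + Miscellaneous + Vegetable.Oils,
                  data = food, method = "lm", trControl = ctrl)

print(cv_model)   # reports cross-validated RMSE, R^2, and MAE
```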


Ridge Regression

  • Ridge Regression is a very useful tool in statistical analysis
  • In particular, Ridge Regression is an important technique to use when there is multicollinearity
    • Multicollinearity can be thought of as problematically high correlation between predictors
    • Multicollinearity can be detrimental to linear regression analysis and in many cases yield unreliable results
  • Multicollinearity can certainly be expected to be present in real-world data sets such as the one focused on in this project
  • In terms of nutritional intake, one would logically expect, for example, a predictor such as fat intake to be highly correlated with obesity levels in a certain country


Ridge Regression

  • Ridge Regression offers a unique way of combating the effects of multicollinearity, which this data set is likely to exhibit to some degree
  • With Ridge Regression, there is no longer a need to insist on unbiased estimators at the expense of predictability
    • Ridge Regression gives biased estimators and can therefore greatly reduce variance, while least-squares estimation produces unbiased estimators at the cost of potentially high variance
  • It is important to scale the predictors when performing Ridge Regression because the penalty acts on coefficient size
    • Larger coefficients are penalized more heavily
    • While coefficients are shrunk, none are removed by Ridge Regression


Ridge Regression in R

  • To use this technique in R, the packages faraway and glmnet are necessary
  • After loading and cleaning up the data set, it is important to break up the data into a matrix of the predictors and a vector of the response

  • It is extremely important to use set.seed() in order to ensure that, while certain elements of the process are random, they can be reproduced given a specific seed
  • Finally, using alpha=0 in the glmnet procedure signifies the use of Ridge Regression
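A minimal sketch of this setup, assuming the cleaned data frame food from earlier:

```r
library(glmnet)

# Predictor matrix (model.matrix handles any factors; drop the
# intercept column) and response vector
x <- model.matrix(Confirmed ~ . - Country, data = food)[, -1]
y <- food$Confirmed

set.seed(123)   # cv.glmnet assigns folds at random

# alpha = 0 selects Ridge; cv.glmnet performs 10-fold CV by default,
# and glmnet standardizes (scales) the predictors internally
cv_ridge <- cv.glmnet(x, y, alpha = 0)
cv_ridge$lambda.min            # lambda minimizing CV mean-squared error

ridge_fit <- glmnet(x, y, alpha = 0, lambda = cv_ridge$lambda.min)
coef(ridge_fit)                # shrunken, but all nonzero, coefficients
```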


Ridge Regression in R

  • Developing the Ridge Regression technique further, it is critical that the value of lambda is chosen by cross-validation (R's cv.glmnet defaults to 10-fold cross-validation)
  • After choosing the appropriate value for lambda, the corresponding cross-validated Ridge Regression model can be run in R and the coefficients can be analyzed
  • Thank you very much to Professor Mardekian for posting his guides to various code examples on Canvas which were used during this project


LASSO Regression Analysis

  • Similar to Ridge Regression in many ways, LASSO is another unique regression analysis tool that has become a very popular technique
  • LASSO is another important method which can be used to tackle multicollinearity which, as previously discussed, is a great hindrance during least-squares regression
  • However, LASSO, unlike Ridge Regression, is able to perform variable selection
  • Undesirable predictors can and will be eliminated
    • This is very similar to procedures like best subset regression
  • Similar to Ridge Regression, the performance of the LASSO model can be optimized through various means such as cross validation


LASSO Regression Analysis in R

  • To use this technique in R, the faraway and glmnet packages are necessary
  • Just like with Ridge Regression, the data will need to be broken into a matrix of the predictors and a vector of the appropriate response

  • This time, setting alpha = 1 is the key to performing LASSO as opposed to Ridge Regression
  • While there is not a screenshot of it here, it is necessary to use set.seed() in order to ensure that these results can be easily reproduced
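A self-contained sketch of the LASSO setup, assuming the cleaned data frame food from earlier:

```r
library(glmnet)

# Predictor matrix and response vector, as in the Ridge setup
x <- model.matrix(Confirmed ~ . - Country, data = food)[, -1]
y <- food$Confirmed

set.seed(123)   # make the CV fold assignment reproducible

# alpha = 1 selects LASSO; cross-validate to choose lambda
cv_lasso <- cv.glmnet(x, y, alpha = 1)

lasso_fit <- glmnet(x, y, alpha = 1, lambda = cv_lasso$lambda.min)
coef(lasso_fit)   # coefficients shown as "." were eliminated entirely
```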


LASSO Regression Analysis in R

  • Again, using cross validation will help to ensure that the potential of LASSO is maximized
  • After using cross validation to find the ideal lambda, using that in the model will provide the final LASSO model and coefficients can be analyzed


Results


Forward Stepwise Regression Cross Validation

  • After conducting 10-fold cross-validation on the forward stepwise regression model, we get R^2 = 0.409
  • After conducting 5-fold cross-validation on the forward stepwise regression model, we get R^2 = 0.365
  • Both of these values are lower than the R^2 of the original fit without cross-validation.



Forward Stepwise Regression Results

The confirmed-cases forward regression included the parameters: Vegetal.Products, Obesity, Offals, Stimulants, Miscellaneous, Vegetable.Oils, Fruits...Excluding.Wine, Eggs, and Oilcrops

These variables were selected because they met the criteria of having a p-value of less than 0.3


Backward Stepwise Regression Results

The confirmed-cases backward regression includes the following parameters: Alcoholic.Beverages, Animal.Fats, Aquatic.Products..Other, Cereals...Excluding.Beer, Fish..Seafood, Fruits...Excluding.Wine, Meat, Milk...Excluding.Butter, Miscellaneous, Offals, Oilcrops, Pulses, Spices, Starchy.Roots, Sugar...Sweeteners, Vegetable.Oils, Vegetables

These variables remained in the model because the variables with a p-value greater than 0.3 were removed.


Backward Stepwise Regression Cross Validation

  • After conducting 10-fold cross-validation on the backward stepwise regression model, we get R^2 = 0.35
  • After conducting 5-fold cross-validation on the backward stepwise regression model, we get R^2 = 0.323
  • Both of these values are lower than the R^2 of the original fit without cross-validation.



Bidirectional Stepwise Regression Results

The confirmed-cases bidirectional stepwise regression includes the following parameters: Vegetal.Products, Obesity, Offals, Stimulants, Miscellaneous, and Vegetable.Oils.

These variables were selected based on the p-value criteria for both forward and backward stepwise regression.


Bidirectional Stepwise Regression Cross Validation

  • After conducting 10-fold cross-validation on the bidirectional stepwise regression model, we get R^2 = 0.408
  • After conducting 5-fold cross-validation on the bidirectional stepwise regression model, we get R^2 = 0.373
  • Both of these values are lower than the R^2 of the original fit without cross-validation.



Ridge Regression

  • These are the results of the cross-validation performed during the Ridge Regression analysis
  • The goal was to pick the appropriate value of lambda, chosen through cross-validation
  • The cross-validation metric chosen for this particular project was mean-squared error
  • Mean-squared error is, of course, one of many ways to assess error
    • Our ideal choice of lambda is the one that minimizes the mean-squared error


Ridge Regression

  • These are the results of the Ridge Regression that was performed along with cross-validation
  • It is important to remember that Ridge Regression heavily penalizes large coefficients and is especially valuable in instances of multicollinearity
  • The response, or the outcome the features are trying to predict, is confirmed cases of COVID-19
  • Of the predictors, intake of eggs seems to contribute most toward an increase in COVID-19 cases (in this model at least)


Ridge Regression

  • Stimulants also seem to be one of the predictors which may lead to increased COVID-19 cases in this model
  • In terms of predictors that seem to lead to decreased COVID-19 cases in this model, aquatic products seem to be the greatest
  • Other notable features which appear to lead to decreased COVID-19 cases in this model are offals and sugar crops


LASSO Regression

  • These are the results of the cross validation that was performed using the LASSO Regression Analysis
  • We used cross validation to find the particular Lambda which minimized Mean-Squared Error


LASSO Regression

  • It is important to remember that LASSO is able to perform variable selection
  • It can also be used to help with multicollinearity
  • Looking at the LASSO model, any variable with a dot beside it has been eliminated
    • Ex: Spices, Animal Products, etc.
  • This has led to a very different model when compared to the case of Ridge Regression
  • Variables like Sugar Crops or Offals which were used in the Ridge Regression model to predict a decrease in COVID-19 cases have been removed
  • There are some similarities such as Egg and Stimulant intake being useful in predicting an increase in COVID-19 cases in this particular model


Discussion


Final Model Selection

  • After analyzing the forward, backward, and bidirectional stepwise regressions, our final model selection is the bidirectional model.
  • Although the bidirectional model's MSE, MAE, and R^2 values are not the best, they are very close to the best
  • Choosing the highest R^2 or the lowest error is not always the best criterion
  • We still selected this model because we did not want to overfit
    • The forward stepwise model uses three more variables than the bidirectional model, and there is no significant difference between their MSE, MAE, and R^2 values
  • The bidirectional model has 6 predictors, which lessens the chance of overfitting the data


LASSO, Ridge, and Multicollinearity

  • While bidirectional stepwise regression was the model of choice for this project, multicollinearity is a pressing and persistent subject that affects data sets and least-squares regression greatly
  • In a data set such as this, multicollinearity is more than just a possibility; it should even be expected
  • Due to the nature of various nutritional groups, there is a predictable positive relationship interconnecting many of the features
  • For example, a high intake of meat should be closely related to a high intake of animal fats
  • Due to examples like this, it is important to have considered regression analysis tools that can account for such issues


LASSO, Ridge, and Multicollinearity

  • The LASSO model shown in the results section displays this idea of multicollinearity very well
  • During the variable selection process, only ten variables were left in the final model that was chosen
  • This could potentially support the idea that many of these variables are quite related to each other
  • Not only is it important to limit the number of predictors in our final model to avoid overfitting, but gathering data is also quite expensive
    • If collecting excess variables is sometimes unnecessary (and, in the case of multicollinearity, even detrimental), then a lot of money can be saved by eliminating predictors
  • While the bidirectional model was chosen for this particular project, the key ideas that motivate techniques like Ridge and LASSO Regression are important to keep in mind throughout the modeling process when predicting how confirmed COVID-19 cases change


Acknowledgements

  • Thank you very much to Professor Mardekian for his instruction throughout the semester
  • Lessons on various techniques such as model selection, least-squares regression, and Ridge and LASSO Regression were extremely helpful during the making of this project
  • Example codes posted on Canvas were also extremely helpful and were referenced many times during the coding and analysis elements throughout this project


Literature Cited

  • “CDC COVID Data Tracker.” United States COVID-19 Cases and Deaths by State, Centers for Disease Control and Prevention, 3 Dec. 2020, covid.cdc.gov/covid-data-tracker/.
  • “COVID-19 Hospitalization and Death by Race/Ethnicity.” Centers for Disease Control and Prevention, Centers for Disease Control and Prevention, 2020, www.cdc.gov/coronavirus/2019-ncov/covid-data/investigations-discovery/hospitalization-death-by-race-ethnicity.html.
  • New Jersey COVID-19 Information Hub, New Jersey Department of Health, 2020, covid19.nj.gov/.
  • “Our Data.” The COVID Tracking Project, The Atlantic Monthly Group, 2020, covidtracking.com/data.
  • Pettersson, Henrik, et al. “Tracking Coronavirus' Global Spread.” CNN, Cable News Network, 3 Dec. 2020, www.cnn.com/interactive/2020/health/coronavirus-maps-and-cases/.
  • Ren, Maria. “COVID-19 Healthy Diet Dataset.” Kaggle, 19 Nov. 2020, www.kaggle.com/mariaren/covid19-healthy-diet-dataset.
  • Hastie, Trevor, et al. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2017.