
Multiple Regression Analysis and Analytical Techniques on the Relation of COVID-19 to Nutrition

Elias Karnoub, Gabriella Provenzano, Andrew Shen, Matthew Tam

December 4, 2020


Abstract

As we are currently living through a pandemic and the topic of COVID-19 continues to gain attention, we should look at what we can do to protect ourselves, our families and friends, and everyone who is sacrificing so much to help. The goal of this project is to analyze the relationship between the number of confirmed COVID-19 cases and the food consumption of each country. To see which variables in our dataset best predicted our outcome, forward, backward, and bidirectional model selection, cross-validation measures, and ridge and LASSO regression were implemented in R. With our analysis, we found that the number of confirmed cases can be predicted by Vegetal.Products, Offals, Stimulants, Miscellaneous, and Vegetable.Oils consumption and Obesity rates.


Introduction


COVID-19 cases compared to other countries

In comparison to other nations, the United States has the most COVID-19 cases in the world.

Pettersson, Henrik, et al. “Tracking Coronavirus' Global Spread.” CNN, Cable News Network, 3 Dec. 2020, www.cnn.com/interactive/2020/health/coronavirus-maps-and-cases/.


Cases in each state

The United States as a whole continues to see an increase in cases, currently concentrated in the Midwest.

“CDC COVID Data Tracker.” United States COVID-19 Cases and Deaths by State, Centers for Disease Control and Prevention, 3 Dec. 2020, covid.cdc.gov/covid-data-tracker/.


Hospitalizations and Deaths in the United States

As we can see, hospitalizations and deaths continue to rise across the country. In addition, the United States is higher in every category than the national average.

“Our Data.” The COVID Tracking Project, The Atlantic Monthly Group, 2020, covidtracking.com/data.


Effects of Age

“COVID-19 Hospitalization and Death by Race/Ethnicity.” Centers for Disease Control and Prevention, Centers for Disease Control and Prevention, 2020, www.cdc.gov/coronavirus/2019-ncov/covid-data/investigations-discovery/hospitalization-death-by-race-ethnicity.html.

The older population is at high risk of death from COVID-19 because the immune system weakens with age. In addition, older people are more likely to have pre-existing health conditions, which puts them at even higher risk.


COVID-19 in New Jersey

New Jersey continues to see an increase in cases as Governor Murphy puts limits on gatherings and shuts down non-essential businesses.

New Jersey COVID-19 Information Hub, New Jersey Department of Health, 2020, covid19.nj.gov/.


Topic: COVID-19

As students living through a pandemic, we were all drawn to this topic because of how much it is affecting the world, the country, and our own lives. COVID-19 has taken over the news and media as concerns continue to grow with rising case counts. As statistics and math majors, we wanted to use the skills learned in this class to attempt to find interesting results in data related to COVID-19. From the case built in the previous slides, it is clear that COVID-19 has become a major problem across the world. We hope to find new perspectives on the issue and to investigate it further through our project.


Chosen Statistical Topic

We chose stepwise regression for our project because we wanted to see which variables in our dataset predicted our outcome best and produced the best-fitting model. Cross-validation also helps us measure the validity and accuracy of our model, which lets us make sure we are not over-fitting or under-fitting the data. We also decided to look at ridge and LASSO regression to see what role multicollinearity plays.


Data Set

  • This data set comes from kaggle.com posted by user Maria Ren labeled “Covid-19 Healthy Diet Dataset”
    • Note: We used Food_Supply_Quantity_kg_Data.csv
  • This data set measures 30 different criteria including obesity rates as well as various metrics relating to COVID-19 affectedness
  • 170 countries are investigated among these predictors
  • With so many different features, multiple regression can be an important tool for analyzing significance for a specific response
  • Variables that rate various nutritional aspects of these countries will be used to assess predictability of how well a country can respond to the global pandemic


Variables in the Dataset

  • Country
  • Alcoholic Beverages
  • Animal products
  • Animal fats
  • Aquatic Products
  • Cereals- Excluding Beer
  • Fish, Seafood
  • Fruits- Excluding Wine
  • Meat
  • Milk- Excluding Butter
  • Miscellaneous
  • Offals
  • Oil Crops
  • Pulses
  • Spices

  • Starchy Roots
  • Stimulants
  • Sugar Crops
  • Sugar and Sweeteners
  • Tree Nuts
  • Vegetal Products
  • Vegetable Oils
  • Obesity
  • Undernourished
  • Confirmed cases
  • Deaths
  • Recovered
  • Active
  • Population


Variables Selected

From this data, we chose confirmed COVID-19 cases as our dependent variable and used the other variables as our potential predictors or independent variables.

For our project, we were interested in determining which independent variables were best predictors of confirmed COVID cases within each country. However, we ignored COVID-19 related data in the set such as Deaths, Recovered, and Active since these are expected to be closely related.


Materials and Methods


Loading Dataset Into R

  • read.csv() # reads the CSV file downloaded to the computer
  • library(faraway), library(psych), library(ggplot2), library(olsrr), library(glmnet), library(caret) # loads the packages that will be needed (note: glm() lives in base R's stats package; glmnet is the package to load for ridge/LASSO)
  • We then cleaned the data to remove rows with missing information
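The steps above can be sketched as follows. This is a minimal sketch: the file name comes from the Kaggle dataset noted earlier, and the exact outcome column names (Deaths, Recovered, Active) are assumptions based on the variable list.

```r
# Sketch of the loading/cleanup step (column names are assumptions
# based on the variable list shown earlier).
library(faraway)
library(psych)
library(ggplot2)
library(olsrr)
library(glmnet)
library(caret)

# Read the Kaggle CSV downloaded to the working directory
food <- read.csv("Food_Supply_Quantity_kg_Data.csv", stringsAsFactors = FALSE)

# Remove rows (countries) with missing information
food <- na.omit(food)

# Drop the other COVID-19 outcomes so they cannot leak into the predictors
food <- subset(food, select = -c(Deaths, Recovered, Active))
```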


Running a regression model

  • We are looking for a relationship between confirmed cases and specific dietary factors. We fit a multiple linear regression with lm(), using confirmed cases as the response and the remaining variables as predictors.
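A minimal sketch of this fit, assuming the cleaned data frame is called food and the confirmed-case column is called Confirmed:

```r
# Multiple regression of confirmed COVID-19 cases on every remaining
# predictor; "." expands to all other columns, and Country is excluded
# because it is a label, not a numeric predictor.
full_model <- lm(Confirmed ~ . - Country, data = food)
summary(full_model)   # coefficients, p-values, R^2 of the full model
```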


Stepwise Regression

  • Stepwise regression is a very useful model selection technique and comes in three varieties:
    • Forward stepwise regression starts with the empty model and adds a variable when it meets a chosen statistical criterion (e.g., adjusted R^2, AIC, p-value).
    • Backward stepwise regression starts with the full model and removes a variable when the model without it meets the chosen criterion.
    • Bidirectional stepwise regression considers variables for both addition and removal, depending on whether the model with or without the variable in question meets the criterion.

  • Bidirectional stepwise regression is used most of the time, as purely forward and purely backward selection can make mistakes due to random sampling fluctuations
    • e.g., with enough predictors, it is possible to add or remove a predictor that should not have been added or removed, without any immediately visible consequence to the model
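The three searches can be sketched with base R's step(), which selects by AIC; reproducing the slides' p-value criterion exactly would require a package such as olsrr instead. The data frame food and its column names are assumptions carried over from the data-loading step.

```r
# Stepwise selection sketch using AIC-based step() from base R's stats package.
null_model <- lm(Confirmed ~ 1, data = food)            # empty model
full_model <- lm(Confirmed ~ . - Country, data = food)  # full model

# Forward: start empty, add predictors one at a time
fwd <- step(null_model, scope = formula(full_model),
            direction = "forward", trace = 0)

# Backward: start full, remove predictors one at a time
bwd <- step(full_model, direction = "backward", trace = 0)

# Bidirectional: consider additions and removals at every step
both <- step(null_model, scope = formula(full_model),
             direction = "both", trace = 0)
```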


Forward stepwise regression in R

  • In order to find out which regressors improve the model and which do not, we need to run a stepwise regression to select the ideal variables.

  • We’ll start by doing a forward stepwise regression on the confirmed cases model.

Predictors selected: Vegetal.Products, Obesity, Offals, Stimulants, Miscellaneous, Vegetable.Oils, Fruits...Excluding.Wine, Eggs, Oilcrops


Backward stepwise regression in R

  • To compare with the forward stepwise regression, we ran a backward stepwise regression on the confirmed cases model to select the best variables to take out of the model.

Predictors selected: Alcoholic.Beverages, Animal.Fats, Aquatic.Products..Other, Cereals...Excluding.Beer, Fish..Seafood, Fruits...Excluding.Wine, Meat, Milk...Excluding.Butter, Miscellaneous, Offals, Oilcrops, Pulses, Spices, Starchy.Roots, Sugar...Sweeteners, Vegetable.Oils, Vegetables


Bidirectional stepwise regression in R

  • To compare with the forward and backward stepwise regressions, we ran a bidirectional stepwise regression on the confirmed cases model to select the best variables to put in the model.

Predictors selected: Vegetal.Products, Obesity, Offals, Stimulants, Miscellaneous, Vegetable.Oils


Cross Validation Model

  • Cross-validation is used to assess the accuracy and validity of a model by separating a dataset into training and testing data
    • Accuracy here means how closely the model's predictions match the held-out observations
    • A valid model should predict the test data well
  • Training data is used to build the model, which then predicts values in the testing dataset
  • K-fold cross-validation is a useful technique because it uses most of the data to estimate the model, and it can be used to compare models' predictive and explanatory power


Cross Validation in R

  • To use this technique in R, the faraway, olsrr, and caret packages need to be installed
  • After loading and cleaning the data set, it is important to first call set.seed() so that the code and results can be reproduced
  • We ran K-fold cross-validation with K = 5 and K = 10
  • We then define the training control so that we can train the model
  • After that, we print the model to get the results
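Those steps can be sketched with caret as below. The formula uses the predictors chosen by the bidirectional search in the results section, and food is assumed to be the cleaned data frame from earlier.

```r
library(caret)

set.seed(123)   # fixes the fold assignment so results are reproducible

# Define the training control: 10-fold CV (use number = 5 for 5-fold)
ctrl <- trainControl(method = "cv", number = 10)

# Train an ordinary least-squares model under that control
cv_model <- train(Confirmed ~ Vegetal.Products + Obesity + Offals +
                    Stimulants + Miscellaneous + Vegetable.Oils,
                  data = food, method = "lm", trControl = ctrl)

print(cv_model)   # reports cross-validated RMSE, R^2, and MAE
```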


Ridge Regression

  • Ridge Regression is a very useful tool in statistical analysis
  • In particular, Ridge Regression is an important technique to use when there is multicollinearity
    • Multicollinearity can be thought of as problematically high correlation between predictors
    • Multicollinearity can be detrimental to linear regression analysis and in many cases yield unreliable results
  • Multicollinearity can certainly be expected to be present in real-world data sets such as the one focused on in this project
  • In terms of nutritional intake, one would logically expect, for example, a predictor such as fat intake to be highly correlated with obesity levels in a certain country


Ridge Regression

  • Ridge Regression offers a unique way of combating the effects of multicollinearity, which this data set is likely to exhibit to some degree
  • With Ridge Regression, there is no longer a need to insist on unbiased estimators at the expense of predictability
    • Ridge Regression gives biased estimators and can therefore greatly reduce variance, while least-squares estimation produces unbiased estimators at the cost of potentially high variance
  • It is important to scale the predictors when performing Ridge Regression because the penalty acts on coefficient size
    • Larger coefficients are penalized more heavily
    • While coefficients are shrunk, none are removed by Ridge Regression


Ridge Regression in R

  • To use this technique in R, the packages faraway and glmnet are necessary
  • After loading and cleaning up the data set, it is important to break up the data into a matrix of the predictors and a vector of the response

  • It is extremely important to use set.seed() in order to ensure that, while certain elements of the process are random, they can be reproduced given a specific seed
  • Finally, using alpha=0 in the glmnet procedure signifies the use of Ridge Regression
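A minimal sketch of this setup, assuming the cleaned data frame food from earlier:

```r
library(glmnet)

# Predictor matrix (model.matrix handles any factors; drop the
# intercept column) and response vector
x <- model.matrix(Confirmed ~ . - Country, data = food)[, -1]
y <- food$Confirmed

set.seed(123)   # cv.glmnet assigns folds at random

# alpha = 0 selects Ridge; cv.glmnet performs 10-fold CV by default,
# and glmnet standardizes (scales) the predictors internally
cv_ridge <- cv.glmnet(x, y, alpha = 0)
cv_ridge$lambda.min            # lambda minimizing CV mean-squared error

ridge_fit <- glmnet(x, y, alpha = 0, lambda = cv_ridge$lambda.min)
coef(ridge_fit)                # shrunken, but all nonzero, coefficients
```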


Ridge Regression in R

  • Developing the Ridge Regression technique further, it is critical that the value of lambda is chosen by cross-validation (R's cv.glmnet defaults to 10-fold cross-validation)
  • After choosing the appropriate value for lambda, the corresponding cross-validated Ridge Regression model can be run in R and the coefficients can be analyzed
  • Thank you very much to Professor Mardekian for posting his guides to various code examples on Canvas which were used during this project


LASSO Regression Analysis

  • Similar to Ridge Regression in many ways, LASSO is another unique regression analysis tool that has become a very popular technique
  • LASSO is another important method which can be used to tackle multicollinearity which, as previously discussed, is a great hindrance during least-squares regression
  • However, LASSO, unlike Ridge Regression, is able to perform variable selection
  • Undesirable predictors can and will be eliminated
    • This is very similar to procedures like best subset regression
  • Similar to Ridge Regression, the performance of the LASSO model can be optimized through various means such as cross validation


LASSO Regression Analysis in R

  • To use this technique in R, the faraway and glmnet packages are necessary
  • Just like with Ridge Regression, the data will need to be broken into a matrix of the predictors and a vector of the appropriate response

  • This time, setting alpha = 1 is the key to performing LASSO as opposed to Ridge Regression
  • While there is not a screenshot of it here, it is necessary to use set.seed() in order to ensure that these results can be easily reproduced
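A self-contained sketch of the LASSO setup, assuming the cleaned data frame food from earlier:

```r
library(glmnet)

# Predictor matrix and response vector, as in the Ridge setup
x <- model.matrix(Confirmed ~ . - Country, data = food)[, -1]
y <- food$Confirmed

set.seed(123)   # make the CV fold assignment reproducible

# alpha = 1 selects LASSO; cross-validate to choose lambda
cv_lasso <- cv.glmnet(x, y, alpha = 1)

lasso_fit <- glmnet(x, y, alpha = 1, lambda = cv_lasso$lambda.min)
coef(lasso_fit)   # coefficients shown as "." were eliminated entirely
```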


LASSO Regression Analysis in R

  • Again, using cross validation will help to ensure that the potential of LASSO is maximized
  • After using cross validation to find the ideal lambda, using that in the model will provide the final LASSO model and coefficients can be analyzed


Results


Forward Stepwise Regression Cross Validation

  • After conducting 10-fold cross-validation on the forward stepwise regression model, we get R^2 = 0.409
  • After conducting 5-fold cross-validation on the forward stepwise regression model, we get R^2 = 0.365
  • Both of these values are lower than the R^2 of the original fit without cross-validation.



Forward Stepwise Regression Results

The confirmed-cases forward regression included the parameters: Vegetal.Products, Obesity, Offals, Stimulants, Miscellaneous, Vegetable.Oils, Fruits...Excluding.Wine, Eggs, and Oilcrops

These variables were selected because they met the criteria of having a p-value of less than 0.3


Backward Stepwise Regression Results

The confirmed-cases backward regression includes the following parameters: Alcoholic.Beverages, Animal.Fats, Aquatic.Products..Other, Cereals...Excluding.Beer, Fish..Seafood, Fruits...Excluding.Wine, Meat, Milk...Excluding.Butter, Miscellaneous, Offals, Oilcrops, Pulses, Spices, Starchy.Roots, Sugar...Sweeteners, Vegetable.Oils, Vegetables

These variables remained in the model because the variables with a p-value greater than 0.3 were removed.


Backward Stepwise Regression Cross Validation

  • After conducting 10-fold cross-validation on the backward stepwise regression model, we get R^2 = 0.35
  • After conducting 5-fold cross-validation on the backward stepwise regression model, we get R^2 = 0.323
  • Both of these values are lower than the R^2 of the original fit without cross-validation.



Bidirectional Stepwise Regression Results

The confirmed-cases bidirectional stepwise regression includes the following parameters: Vegetal.Products, Obesity, Offals, Stimulants, Miscellaneous, and Vegetable.Oils.

These variables were selected based on the p-value criteria for both forward and backward stepwise regression.


Bidirectional Stepwise Regression Cross Validation

  • After conducting 10-fold cross-validation on the bidirectional stepwise regression model, we get R^2 = 0.408
  • After conducting 5-fold cross-validation on the bidirectional stepwise regression model, we get R^2 = 0.373
  • Both of these values are lower than the R^2 of the original fit without cross-validation.



Ridge Regression

  • These are the results of the cross-validation performed during the Ridge Regression analysis
  • The goal was to pick the appropriate value of lambda, chosen through cross-validation
  • The cross-validation metric chosen for this particular project was mean-squared error
  • Mean-squared error is, of course, one of many ways to assess error
    • Our ideal choice of lambda is the one that minimizes the mean-squared error


Ridge Regression

  • These are the results of the Ridge Regression that was performed along with cross-validation
  • It is important to remember that Ridge Regression heavily penalizes large coefficients and is especially valuable in instances of multicollinearity
  • The response, or the outcome the features are trying to predict, is confirmed cases of COVID-19
  • Of the predictors, intake of eggs seems to contribute most toward an increase in COVID-19 cases (in this model at least)


Ridge Regression

  • Stimulants also seem to be one of the predictors which may lead to increased COVID-19 cases in this model
  • In terms of predictors that seem to lead to decreased COVID-19 cases in this model, aquatic products seem to be the greatest
  • Other notable features which appear to lead to decreased COVID-19 cases in this model are offals and sugar crops


LASSO Regression

  • These are the results of the cross validation that was performed using the LASSO Regression Analysis
  • We used cross validation to find the particular Lambda which minimized Mean-Squared Error


LASSO Regression

  • It is important to remember that LASSO is able to perform variable selection
  • It can also be used to help with multicollinearity
  • Looking at the LASSO model, any variable with a dot beside it has been eliminated
    • Ex: Spices, Animal Products, etc.
  • This has led to a very different model when compared to the case of Ridge Regression
  • Variables like Sugar Crops or Offals which were used in the Ridge Regression model to predict a decrease in COVID-19 cases have been removed
  • There are some similarities such as Egg and Stimulant intake being useful in predicting an increase in COVID-19 cases in this particular model


Discussion


Final Model Selection

  • After analyzing the forward, backward, and bidirectional stepwise regressions, our final model selection is the bidirectional model.
  • Although the bidirectional model's MSE, MAE, and R^2 values are not the best, they are very close to the best
  • Choosing the highest R^2 or the lowest error is not always the best criterion
  • We still selected this model because we did not want to overfit
    • The forward stepwise model uses three more variables than the bidirectional model, and there is no significant difference between their MSE, MAE, and R^2 values
  • The bidirectional model has 6 predictors, which lessens the chance of overfitting the data


LASSO, Ridge, and Multicollinearity

  • While bidirectional stepwise regression was the model of choice for this project, multicollinearity is a pressing and persistent subject that affects data sets and least-squares regression greatly
  • In a data set such as this, multicollinearity is more than just a possibility; it should even be expected
  • Due to the nature of various nutritional groups, there is a predictable positive relationship interconnecting many of the features
  • For example, a high intake of meat should be closely related to a high intake of animal fats
  • Due to examples like this, it is important to have considered regression analysis tools that can account for such issues


LASSO, Ridge, and Multicollinearity

  • The LASSO model shown in the results section displays this idea of multicollinearity very well
  • During the variable selection process, only ten variables were left in the final model that was chosen
  • This could potentially support the idea that many of these variables are quite related to each other
  • Not only is it important to limit the number of predictors in our final model to avoid overfitting, but gathering data is also quite expensive
    • If collecting excess variables is sometimes unnecessary (and, in the case of multicollinearity, even detrimental), then a lot of money can be saved by eliminating predictors
  • While the bidirectional model was chosen for this particular project, the key ideas that motivate techniques like Ridge and LASSO Regression are important to keep in mind throughout the modeling process when predicting how confirmed COVID-19 cases change


Acknowledgements

  • Thank you very much to Professor Mardekian for his instruction throughout the semester
  • Lessons on various techniques such as model selection, least-squares regression, and Ridge and LASSO Regression were extremely helpful during the making of this project
  • Example codes posted on Canvas were also extremely helpful and were referenced many times during the coding and analysis elements throughout this project


Literature Cited

  • “CDC COVID Data Tracker.” United States COVID-19 Cases and Deaths by State, Centers for Disease Control and Prevention, 3 Dec. 2020, covid.cdc.gov/covid-data-tracker/.
  • “COVID-19 Hospitalization and Death by Race/Ethnicity.” Centers for Disease Control and Prevention, Centers for Disease Control and Prevention, 2020, www.cdc.gov/coronavirus/2019-ncov/covid-data/investigations-discovery/hospitalization-death-by-race-ethnicity.html.
  • New Jersey COVID-19 Information Hub, New Jersey Department of Health, 2020, covid19.nj.gov/.
  • “Our Data.” The COVID Tracking Project, The Atlantic Monthly Group, 2020, covidtracking.com/data.
  • Pettersson, Henrik, et al. “Tracking Coronavirus' Global Spread.” CNN, Cable News Network, 3 Dec. 2020, www.cnn.com/interactive/2020/health/coronavirus-maps-and-cases/.
  • Ren, Maria. “COVID-19 Healthy Diet Dataset.” Kaggle, 19 Nov. 2020, www.kaggle.com/mariaren/covid19-healthy-diet-dataset.
  • Hastie, Trevor, et al. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2017.