Multiple Regression Analysis and Analytical Techniques on the Relation of COVID-19 to Nutrition
Elias Karnoub, Gabriella Provenzano, Andrew Shen, Matthew Tam
December 4, 2020
Abstract
As we live through a pandemic and COVID-19 continues to gain attention, we should look at what we can do to protect ourselves, our families and friends, and everyone who is sacrificing their lives to help. The goal of the project is to analyze the relationship between the number of confirmed COVID-19 cases and the food consumption of each country. To see which variables in our dataset best predicted our outcome, we implemented forward, backward, and bidirectional model selection, cross-validation, and ridge and LASSO regression in R. We found that the number of confirmed cases can be predicted by Vegetal.Products, Offals, Stimulants, Miscellaneous, and Vegetable.Oils consumption and by Obesity rates.
Introduction
COVID-19 cases compared to other countries
The United States has more confirmed COVID-19 cases than any other country in the world.
Pettersson, Henrik, et al. “Tracking Coronavirus' Global Spread.” CNN, Cable News Network, 3 Dec. 2020, www.cnn.com/interactive/2020/health/coronavirus-maps-and-cases/.
Cases in each state
The United States as a whole continues to see an increase in cases, with more of them currently in the Midwest.
“CDC COVID Data Tracker.” United States COVID-19 Cases and Deaths by State, Centers for Disease Control and Prevention, 3 Dec. 2020, covid.cdc.gov/covid-data-tracker/.
Hospitalizations and Deaths in the United States
As we can see, hospitalizations and deaths continue to rise across the country. In addition, the United States is higher in every category than the national average.
“Our Data.” The COVID Tracking Project, The Atlantic Monthly Group, 2020, covidtracking.com/data.
Effects of age
“COVID-19 Hospitalization and Death by Race/Ethnicity.” Centers for Disease Control and Prevention, Centers for Disease Control and Prevention, 2020, www.cdc.gov/coronavirus/2019-ncov/covid-data/investigations-discovery/hospitalization-death-by-race-ethnicity.html.
The older population is at high risk of death after contracting COVID-19 because the immune system weakens with age. In addition, older people may have pre-existing health conditions, which put them at even higher risk.
COVID-19 in New Jersey
New Jersey continues to see an increase in cases as Governor Murphy starts to limit gatherings and shut down non-essential businesses.
New Jersey COVID-19 Information Hub, New Jersey Department of Health, 2020, covid19.nj.gov/.
Topic: COVID-19
As students in a pandemic, we were all drawn to this topic because of how much it is affecting the world, the country, and our own lives. COVID-19 has taken over the news and media, and concerns continue to grow as cases rise. As statistics and math majors, we wanted to use the skills we learned in this class to look for interesting results in data related to COVID-19. From the case built in the previous slides, it is clear that COVID-19 has become a major problem across the world. We hope to find new perspectives on the issue and to investigate it further through our project.
Chosen Statistical Topic
We chose stepwise regression for our project because we wanted to see which variables in our dataset best predicted our outcome and which ones produced the best fit. Cross-validation helps us measure the validity and accuracy of our model, so that we can check that we are not over-fitting or under-fitting the data. We also decided to look at ridge and LASSO regression to see what role multicollinearity plays.
Data Set
Variables in the Dataset
Variables Selected
From this data, we chose confirmed COVID-19 cases as our dependent variable and used the other variables as our potential predictors or independent variables.
For our project, we were interested in determining which independent variables were the best predictors of confirmed COVID-19 cases within each country. However, we ignored the other COVID-19 related variables in the dataset, such as Deaths, Recovered, and Active, since these are expected to be closely related to the number of confirmed cases.
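A minimal sketch of this variable selection in R. The data frame below is simulated and holds only a few of the columns, so its values are placeholders and not the project's data. The column names Confirmed, Deaths, Recovered, Active, Obesity, Vegetal.Products, and Offals come from the text above; everything else is an assumption made for illustration.

```r
# Simulated stand-in for the dataset: all values are random placeholders.
set.seed(1)
n <- 40
diet <- data.frame(
  Confirmed        = runif(n),
  Deaths           = runif(n),
  Recovered        = runif(n),
  Active           = runif(n),
  Obesity          = runif(n, 5, 40),
  Vegetal.Products = runif(n, 20, 50),
  Offals           = runif(n, 0, 1)
)

# Drop the COVID-19 outcome columns that are not used as predictors.
covid_cols <- c("Deaths", "Recovered", "Active")
model_data <- diet[, !(names(diet) %in% covid_cols)]

# Regress Confirmed on every remaining column.
full_model <- lm(Confirmed ~ ., data = model_data)
print(names(coef(full_model)))
```

Fitting with `Confirmed ~ .` keeps the formula short when there are many food-category columns, at the cost of silently including any column that was not dropped beforehand.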
Materials and Methods
Loading Dataset Into R
Running a regression model
Stepwise Regression
Forward stepwise regression in R
Predictors selected: Vegetal.Products, Obesity, Offals, Stimulants, Miscellaneous, Vegetable.Oils, Fruits...Excluding.Wine, Eggs, Oilcrops
Backward stepwise regression in R
Predictors selected: Alcoholic.Beverages, Animal.Fats, Aquatic.Products..Other, Cereals...Excluding.Beer, Fish..Seafood, Fruits...Excluding.Wine, Meat, Milk...Excluding.Butter, Miscellaneous, Offals, Oilcrops, Pulses, Spices, Starchy.Roots, Sugar...Sweeteners, Vegetable.Oils, Vegetables
Bidirectional stepwise regression in R
Predictors selected: Vegetal.Products, Obesity, Offals, Stimulants, Miscellaneous, Vegetable.Oils
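The slides do not show the R code that produced these selections. As a rough illustration of how forward selection with a p-value entry threshold can work, here is a sketch in base R. The helper name `forward_by_p` and the simulated data are hypothetical; the 0.3 threshold is the one quoted in the Results section. Base R's `step()` is an alternative, but it selects by AIC rather than by p-value.

```r
# Forward selection by p-value, base R only.
# `forward_by_p` is a hypothetical helper, not the project's code.
forward_by_p <- function(data, response, p_enter = 0.3) {
  candidates <- setdiff(names(data), response)
  selected <- character(0)
  repeat {
    # Refit the current model (intercept only when nothing is selected yet).
    rhs <- if (length(selected) > 0) selected else "1"
    fit <- lm(reformulate(rhs, response = response), data = data)
    if (length(selected) == length(candidates)) break

    # F-test for adding each remaining candidate one at a time.
    tab <- add1(fit, scope = reformulate(candidates), test = "F")
    p <- setNames(tab[["Pr(>F)"]], rownames(tab))
    best <- names(which.min(p))          # which.min skips the NA in the <none> row

    # Stop when even the best candidate misses the entry threshold.
    if (p[[best]] >= p_enter) break
    selected <- c(selected, best)
  }
  list(selected = selected, fit = fit)
}

# Simulated placeholder data: only x1 and x2 carry signal.
set.seed(42)
n <- 100
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
sim$y <- 2 * sim$x1 - 1.5 * sim$x2 + rnorm(n)

result <- forward_by_p(sim, "y", p_enter = 0.3)
print(result$selected)
```

A threshold as loose as 0.3 admits weak predictors easily, which is one reason a forward search can return a longer list than a bidirectional one.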
Cross Validation Model
Cross Validation in R
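The original code for this slide is not available. A minimal base R sketch of k-fold cross-validation for a linear model is shown here, run with both 10 and 5 folds as in the Results section. The helper `cv_rmse`, the choice of RMSE as the error measure, and the simulated data are assumptions made for illustration.

```r
# k-fold cross-validation of a linear model, base R only.
# `cv_rmse` is a hypothetical helper; the data are simulated placeholders.
cv_rmse <- function(formula, data, k = 10) {
  # Assign each row to one of k folds at random.
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  response <- all.vars(formula)[1]

  fold_rmse <- vapply(seq_len(k), function(i) {
    train <- data[folds != i, , drop = FALSE]
    test  <- data[folds == i, , drop = FALSE]
    fit   <- lm(formula, data = train)              # fit on k - 1 folds
    pred  <- predict(fit, newdata = test)           # predict the held-out fold
    sqrt(mean((test[[response]] - pred)^2))
  }, numeric(1))

  mean(fold_rmse)  # average held-out RMSE across folds
}

set.seed(7)
n <- 100
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 + rnorm(n)

rmse_10 <- cv_rmse(y ~ x1 + x2, sim, k = 10)
rmse_5  <- cv_rmse(y ~ x1 + x2, sim, k = 5)
print(c(ten_fold = rmse_10, five_fold = rmse_5))
```

Comparing the held-out error of candidate models, rather than their in-sample fit, is what guards against over-fitting.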
Ridge Regression
Ridge Regression in R
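The slides do not show which R function was used for ridge regression, and the project may well have relied on a package. To illustrate the idea, this sketch computes ridge coefficients from the closed-form solution on standardized predictors, using two nearly collinear simulated variables. The helper `ridge_coef`, the penalty value, and the data are hypothetical.

```r
# Ridge regression via its closed-form solution, base R only.
# `ridge_coef` is a hypothetical helper; the data are simulated placeholders.
ridge_coef <- function(X, y, lambda) {
  Xs <- scale(X)          # center and scale each predictor
  yc <- y - mean(y)       # center the response so no intercept is needed
  p  <- ncol(Xs)
  # Solve (X'X + lambda * I) b = X'y
  drop(solve(crossprod(Xs) + lambda * diag(p), crossprod(Xs, yc)))
}

set.seed(3)
n  <- 60
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)    # nearly collinear with x1
X  <- cbind(x1 = x1, x2 = x2)
y  <- 1 + x1 + x2 + rnorm(n)

b_ols   <- ridge_coef(X, y, lambda = 0)    # lambda = 0 reduces to least squares
b_ridge <- ridge_coef(X, y, lambda = 10)
print(rbind(ols = b_ols, ridge = b_ridge))
```

With collinear predictors the least-squares coefficients can be large and unstable; the penalty shrinks them toward zero and toward each other.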
LASSO Regression Analysis
LASSO Regression Analysis in R
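The code behind this slide is not shown, and the project may have used a package for the fit. This sketch implements LASSO directly with cyclic coordinate descent and soft-thresholding, to show how the penalty sets some coefficients exactly to zero. The helpers `soft_threshold` and `lasso_cd`, the penalty value, the iteration count, and the simulated data are all assumptions.

```r
# LASSO by cyclic coordinate descent with soft-thresholding, base R only.
# Minimizes (1 / (2n)) * sum((y - X b)^2) + lambda * sum(abs(b))
# on standardized predictors. Helper names are hypothetical.
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

lasso_cd <- function(X, y, lambda, n_iter = 200) {
  Xs <- scale(X)          # center and scale each predictor
  yc <- y - mean(y)       # center the response
  n  <- nrow(Xs)
  b  <- setNames(numeric(ncol(Xs)), colnames(Xs))
  for (iter in seq_len(n_iter)) {
    for (j in seq_along(b)) {
      # Residual with predictor j left out of the fit.
      partial <- yc - Xs[, -j, drop = FALSE] %*% b[-j]
      rho     <- sum(Xs[, j] * partial) / n
      # Shrink toward zero; small effects become exactly zero.
      b[j]    <- soft_threshold(rho, lambda) / (sum(Xs[, j]^2) / n)
    }
  }
  b
}

set.seed(11)
n <- 80
X <- matrix(rnorm(n * 4), n, 4, dimnames = list(NULL, paste0("x", 1:4)))
y <- 3 * X[, 1] + rnorm(n)     # only x1 carries signal

b_lasso <- lasso_cd(X, y, lambda = 0.3)
print(b_lasso)
```

Unlike ridge, which only shrinks coefficients, the absolute-value penalty can remove predictors entirely, so LASSO doubles as a variable selection method.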
Results
Forward Stepwise Regression Cross Validation
10 Fold
5 Fold
Forward Stepwise Regression Results
Forward stepwise regression on Confirmed included the following parameters: Vegetal.Products, Obesity, Offals, Stimulants, Miscellaneous, Vegetable.Oils, Fruits...Excluding.Wine, Eggs, and Oilcrops.
These variables were selected because they met the criterion of having a p-value of less than 0.3.
Backward Stepwise Regression Results
Backward stepwise regression on Confirmed includes the following parameters: Alcoholic.Beverages, Animal.Fats, Aquatic.Products..Other, Cereals...Excluding.Beer, Fish..Seafood, Fruits...Excluding.Wine, Meat, Milk...Excluding.Butter, Miscellaneous, Offals, Oilcrops, Pulses, Spices, Starchy.Roots, Sugar...Sweeteners, Vegetable.Oils, and Vegetables.
These variables were selected because they met the criterion of having a p-value greater than 0.3.
Backward Stepwise Regression Cross Validation
10 Fold
5 Fold
Bidirectional Stepwise Regression Results
Bidirectional stepwise regression on Confirmed includes the following parameters: Vegetal.Products, Obesity, Offals, Stimulants, Miscellaneous, and Vegetable.Oils.
These variables were selected based on the p-value criteria for both forward and backward stepwise regression.
Bidirectional Stepwise Regression Cross Validation
10 Fold
5 Fold
Ridge Regression
LASSO Regression
Discussion
Final Model Selection
LASSO, Ridge, and Multicollinearity
Acknowledgements
Literature Cited