Social Science Data Modeling with R
About Me
Importance of Data Modeling in Social Science
IDE and Packages for this Lecture
Base R Packages | visdat |
tidyverse | dlookr |
ggplot2 | missRanger |
summarytools | gapminder |
Outline
Introduction to Data Modeling - Types of Data
Qualitative Data (Categorical)
Nominal
Ordinal
Quantitative Data (Numeric)
Discrete
Continuous
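A minimal sketch of how these four data types are commonly represented in R. The variable names and values are illustrative assumptions, not taken from the lecture data.

```r
# Illustrative values only; none of these come from the lecture dataset.
religion  <- factor(c("Christian", "Muslim", "Other", "Muslim"))   # nominal
education <- factor(c("Primary", "Tertiary", "Secondary", "Primary"),
                    levels  = c("Primary", "Secondary", "Tertiary"),
                    ordered = TRUE)                                 # ordinal
children  <- c(0L, 2L, 3L, 1L)                                      # discrete
income    <- c(1250.50, 3400.00, 2100.75, 980.25)                   # continuous

is.ordered(religion)    # FALSE: categories have no natural order
is.ordered(education)   # TRUE: Primary < Secondary < Tertiary
class(children)         # "integer"
class(income)           # "numeric"
```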
Types of Data in Regression
Cross-Sectional Data
Time Series Data
Panel Data
Data Cleaning
dplyr::mutate()
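One common cleaning use of mutate() is fixing column types and inconsistent labels. The sketch below assumes a small hypothetical data frame named raw; the column names and values are made up for illustration.

```r
library(dplyr)

# Hypothetical raw survey data with two common problems:
# numbers stored as text and inconsistent category labels.
raw <- data.frame(
  age    = c("25", "31", "47"),
  gender = c("male", "Female ", "FEMALE"),
  stringsAsFactors = FALSE
)

clean <- raw %>%
  mutate(
    age    = as.numeric(age),                  # text -> numeric
    gender = factor(tolower(trimws(gender)))   # harmonise labels
  )

str(clean)
```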
dlookr::plot_na_pareto()
visdat::vis_miss()
data %>% drop_na()
dlookr::imputate_na()
missRanger::missRanger()
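The functions above cover three steps: inspect the missingness, drop incomplete rows, or impute. The sketch below shows the same steps on a hypothetical data frame, using median imputation as a deliberately crude stand-in for the model-based imputation that dlookr and missRanger provide.

```r
library(dplyr)
library(tidyr)

# Hypothetical data with missing values (NA).
survey <- data.frame(
  age    = c(25, NA, 47, 52, 38),
  income = c(1200, 3400, NA, 2800, NA)
)

# Step 1: how many values are missing in each column?
na_counts <- colSums(is.na(survey))
na_counts

# Step 2 (option A): listwise deletion, keeping complete rows only.
complete_rows <- survey %>% drop_na()

# Step 2 (option B): simple median imputation.
imputed <- survey %>%
  mutate(
    age    = replace_na(age, median(age, na.rm = TRUE)),
    income = replace_na(income, median(income, na.rm = TRUE))
  )
```

Listwise deletion is simplest but discards information; imputation keeps every row at the cost of an extra modeling assumption.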
Univariate Analysis – Introduction
Meaning
Relevance
Univariate Analysis – Descriptive Statistics
Categorical
freq()
Numeric
descr()
install.packages("summarytools")
library(summarytools)
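A short usage sketch for freq() and descr(). The built-in mtcars data is an assumption used here only as a stand-in for the lecture dataset.

```r
library(summarytools)

# mtcars ships with R; it is used here only as a stand-in dataset.
cars <- mtcars
cars$cyl <- factor(cars$cyl)

freq(cars$cyl)     # frequency table for a categorical variable
descr(cars$mpg)    # descriptive statistics for a numeric variable

# Base R equivalents, useful as a cross-check:
cyl_counts  <- table(cars$cyl)
mpg_summary <- summary(cars$mpg)
```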
Univariate Analysis – Data Visualization
Categorical
Numeric
Univariate Analysis – Data Visualization (Histogram)
Left-skewed
Symmetric
Right-skewed
Univariate Analysis – Data Visualization (Histogram)
Symmetric
The mean and median are (approximately) equal in a symmetric or normal distribution.
Univariate Analysis – Data Visualization (Histogram)
Right-skewed
The mean is sensitive to extreme values, so the median is the more robust measure of center in a skewed distribution. In a right-skewed distribution, the mean is pulled toward the long right tail and is typically larger than the median.
Univariate Analysis – Data Visualization (Histogram)
Left-skewed
The mean is sensitive to extreme values, so the median is the more robust measure of center in a skewed distribution. In a left-skewed distribution, the mean is pulled toward the long left tail and is typically smaller than the median.
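The pull of the mean toward the tail can be checked with simulated data. This is a sketch under assumed settings: the exponential distribution and the seed are arbitrary choices for illustration.

```r
set.seed(123)  # arbitrary seed for reproducibility

right_skewed <- rexp(1000, rate = 1)   # long tail to the right
left_skewed  <- -right_skewed          # mirror image: long tail to the left

c(mean = mean(right_skewed), median = median(right_skewed))
c(mean = mean(left_skewed),  median = median(left_skewed))

hist(right_skewed, main = "Right-skewed", xlab = "x")
abline(v = mean(right_skewed),   lty = 1)  # solid line: mean
abline(v = median(right_skewed), lty = 2)  # dashed line: median
```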
Univariate Analysis – Variability or Spread
Numeric
Which is robust?
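One way to explore the robustness question is to add a single extreme value and see which measures of spread change. The numbers below are illustrative, and spread() is a hypothetical helper defined only for this sketch.

```r
x         <- c(10, 12, 11, 13, 12, 11, 10, 12)  # illustrative values
x_outlier <- c(x, 100)                          # add one extreme value

# Hypothetical helper that collects several measures of spread.
spread <- function(v) {
  c(range = diff(range(v)), sd = sd(v), IQR = IQR(v), MAD = mad(v))
}

rbind(original = spread(x), with_outlier = spread(x_outlier))
```

The range and standard deviation change sharply with the outlier, while the interquartile range and the median absolute deviation barely move.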
Bivariate Analysis – Introduction
Meaning
Relevance
Bivariate Analysis – Assessing Relationships
Two Numeric Variables
Two Categorical Variables
One Numeric, One Categorical
Bivariate Analysis – Assessing Relationships (Numeric)
Positive Relationship
Negative Relationship
Bivariate Analysis – Assessing Relationships
Almost No Relationship
Types of Correlation Coefficients:
Bivariate Analysis – Assessing Relationships
(Correlation is not different from zero)
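A sketch of computing a correlation coefficient and testing the null hypothesis stated above. Base R's cor() supports the "pearson", "spearman" and "kendall" methods; the mtcars variables are an assumed stand-in for the lecture data.

```r
# mtcars is a built-in dataset used here as a stand-in.
r_pearson  <- cor(mtcars$mpg, mtcars$wt, method = "pearson")
r_spearman <- cor(mtcars$mpg, mtcars$wt, method = "spearman")

# H0: the correlation is not different from zero.
test <- cor.test(mtcars$mpg, mtcars$wt, method = "pearson")
test$estimate
test$p.value

plot(mtcars$wt, mtcars$mpg, xlab = "Weight", ylab = "Miles per gallon")
```

A small p-value leads to rejecting the null hypothesis of zero correlation.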
Bivariate Analysis – Assessing Relationships (Categorical)
Contingency Table
summarytools::ctable(var1, var2)
Chi-squared Test
tab <- table(var1, var2)
chisq.test(tab)
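A worked sketch of the contingency table and chi-squared test. The responses are hypothetical counts invented for illustration, and gender and vote are assumed variable names.

```r
# Hypothetical responses from 100 people (illustrative counts only).
gender <- rep(c("Female", "Male"), each = 50)
vote   <- c(rep(c("Yes", "No"), times = c(30, 20)),   # Female respondents
            rep(c("Yes", "No"), times = c(15, 35)))   # Male respondents

tab <- table(gender, vote)   # contingency table
tab

result <- chisq.test(tab)    # H0: gender and vote are independent
result$p.value
```

Storing the table in its own object (tab) avoids passing the table() function itself to chisq.test().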
Bivariate Analysis – One Numeric, One Categorical
Data Visualization
T-Test
t.test(nvar1 ~ cvar2)
ggstatsplot::ggbetweenstats()
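A base R sketch of comparing a numeric variable across two groups, with a boxplot for the visualization and t.test() for the test. The mtcars data is an assumed stand-in; ggbetweenstats() combines a comparable plot and test in one figure.

```r
cars <- mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

boxplot(mpg ~ am, data = cars,
        xlab = "Transmission", ylab = "Miles per gallon")

# Welch two-sample t-test (R's default does not assume equal variances).
tt <- t.test(mpg ~ am, data = cars)
tt$estimate   # group means
tt$p.value
```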
Multivariate Analysis – Introduction
Meaning
Relevance
Multivariate Analysis – Types
Principal Component Analysis (PCA)
Factor Analysis
Cluster Analysis
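As a brief illustration of the first technique, a PCA can be run with base R's prcomp(). The choice of mtcars and of these four variables is an assumption made only for this sketch.

```r
# Four correlated numeric variables from the built-in mtcars data.
vars <- mtcars[, c("mpg", "disp", "hp", "wt")]

# Standardise the variables so that each contributes equally.
pca <- prcomp(vars, scale. = TRUE)

# Share of total variance captured by each principal component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 3)
```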
Multivariate Analysis – Types
Multivariate Analysis of Variance (MANOVA)
Structural Equation Modeling (SEM)
Regression Analysis
“Galton’s Law of Universal Regression”
Height of Sons
Fathers’ Height
Dependent Variable
Independent Variable
Regression Analysis – Terminology
Dependent Variable | Independent Variable |
Explained Variable | Explanatory Variable |
Predictand | Predictor |
Regressand | Regressor |
Outcome Variable | Covariate |
Controlled Variable | Control Variable |
Linear Regression Analysis – Simple vs. Multiple
Height of Sons
Fathers’ Height
Height of Sons
Fathers’ Height
Mothers’ Height
Nutrition
Assumptions of Classical Linear Regression Model
What if these assumptions are violated?
Linear Regression Analysis – Problem
Nature of Variables
Dataset
lm(formula, data)
Source: https://www.kaggle.com/datasets/rahulmenon1758/car-resale-prices
Dependent Variable | Independent Variable(s) |
Continuous | Numeric or categorical |
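A minimal sketch of lm() with a continuous dependent variable and a mix of numeric and categorical independent variables. The car resale dataset is not reproduced here, so the built-in mtcars data is assumed as a stand-in.

```r
cars <- mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

# Simple regression: one independent variable.
simple_fit <- lm(mpg ~ wt, data = cars)

# Multiple regression: numeric (wt, hp) and categorical (am) regressors.
multiple_fit <- lm(mpg ~ wt + hp + am, data = cars)

summary(multiple_fit)
coef(multiple_fit)
```

R converts the factor am into a dummy variable automatically, with the first level (Automatic) as the reference category.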
Now, let’s practice
LPM, Logit and Probit Models
Examples:
Approaches to Measuring Binary Outcome Variables
How do we get around the problems of LPM – probability rule violation?
Constrained LPM
Estimate the LPM by the usual OLS method. If an estimated Y is less than 0 (i.e., negative), it is set to 0; if it is greater than 1, it is set to 1.
Logit and Probit Models
Devise an estimating technique that guarantees that the estimated conditional probabilities of Y lie between 0 and 1.
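The probability rule violation, and the constrained LPM fix, can be sketched as follows. The built-in mtcars data and the model am ~ wt are assumptions chosen only for illustration.

```r
# Linear probability model: a binary outcome (am) estimated by OLS.
lpm   <- lm(am ~ wt, data = mtcars)
p_hat <- fitted(lpm)

range(p_hat)   # fitted "probabilities" are not restricted to [0, 1]

# Constrained LPM: set values below 0 to 0 and values above 1 to 1.
p_constrained <- pmin(pmax(p_hat, 0), 1)
range(p_constrained)
```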
LPM, Logit and Probit Models
Logit and probit curves are sigmoid, or S-shaped.
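The S-shape can be drawn directly from the logistic and standard normal distribution functions, which underlie the logit and probit models respectively. The grid of values for the linear predictor is an arbitrary choice for this sketch.

```r
z <- seq(-4, 4, by = 0.1)   # arbitrary grid for the linear predictor

p_logit  <- plogis(z)       # logistic CDF (logit model)
p_probit <- pnorm(z)        # standard normal CDF (probit model)

plot(z, p_logit, type = "l", ylim = c(0, 1),
     xlab = "Linear predictor", ylab = "Probability")
lines(z, p_probit, lty = 2)
legend("topleft", legend = c("Logit", "Probit"), lty = c(1, 2))
```

Both curves stay strictly between 0 and 1, which is what the LPM cannot guarantee.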
LPM, Logit and Probit Models
A scenario
Match |
Win |
Win |
Win |
Win |
Lose |
Lose |
Lose |
Lose |
Lose |
Lose |
Match | Frequency |
Win | 4 |
Lose | 6 |
What can we do with count data?
What is the PROBABILITY that a team wins a match?
What are the ODDS of winning the match?
Probability is the number of times an event occurs divided by the total number of outcomes.
Odds are the ratio of the chance that something happens to the chance that it does not happen.
LPM, Logit and Probit Models
Probability and odds can be expressed in terms of each other.
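Using the match scenario (4 wins, 6 losses), the two quantities and the conversion between them can be worked out as follows. The object names are chosen for this sketch.

```r
wins   <- 4
losses <- 6

p    <- wins / (wins + losses)   # probability of winning
odds <- wins / losses            # odds of winning

# Converting between the two:
odds_from_p <- p / (1 - p)       # probability -> odds
p_from_odds <- odds / (1 + odds) # odds -> probability

log_odds <- log(odds)            # the logit, i.e. the log of the odds

c(p = p, odds = odds, log_odds = log_odds)
```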
LPM, Logit and Probit Models
Nature of Variables
Dataset
glm(formula, family, data)
mfx::logitmfx()
mfx::probitmfx()
Source: https://www.kaggle.com/datasets/tawfikelmetwally/employee-dataset
Dependent Variable | Independent Variable(s) |
Categorical (only 2 categories) | Numeric or categorical |
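A minimal sketch of fitting logit and probit models with glm(). The employee dataset is not reproduced here, so the built-in mtcars data with the binary variable am is assumed as a stand-in.

```r
# Binary dependent variable: am (0 = automatic, 1 = manual).
logit_fit  <- glm(am ~ wt, family = binomial(link = "logit"),  data = mtcars)
probit_fit <- glm(am ~ wt, family = binomial(link = "probit"), data = mtcars)

summary(logit_fit)

# Logit coefficients are on the log-odds scale; exponentiate for odds ratios.
exp(coef(logit_fit))

# Predicted probabilities always lie between 0 and 1.
p_hat <- predict(logit_fit, type = "response")
range(p_hat)
```

Coefficients from glm() are not marginal effects on the probability scale, which is why functions such as logitmfx() and probitmfx() are used alongside it.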
Now, let’s practice
Time Series Analysis - Introduction
A time series is a sequence of data recorded over time.
Observations are evenly spaced.
Time Series Analysis – Time Series Objects in R
ts(data, start, end, frequency)
is.ts()
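A short sketch of creating and checking a time series object. The monthly values are simulated, and the start date is an arbitrary assumption.

```r
set.seed(1)  # arbitrary seed; the values are simulated, not real data
monthly_values <- round(100 + cumsum(rnorm(24)), 1)

# 24 monthly observations starting in January 2020.
y <- ts(monthly_values, start = c(2020, 1), frequency = 12)

is.ts(y)               # TRUE
is.ts(monthly_values)  # FALSE: a plain numeric vector

start(y)
end(y)
frequency(y)

plot(y, ylab = "Value")
```

When start and frequency are supplied, end can be omitted because R infers it from the number of observations.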
Now, let’s practice
Time Series Analysis - Trends
Time Series Analysis - Variance
Time Series Analysis – Basic Time Series Models
THANK YOU