1 of 62

Social Science Data Modeling with R

2 of 62

About Us

  • Elijah Appiah from Ghana.
  • Ph.D. Economics at NIDA in Bangkok, Thailand.
  • Economist by profession and Data Scientist by passion.
  • Enthusiastic about working with data daily.
  • Technical skills in LaTeX, Microsoft Office (Word, Excel, PowerPoint), SPSS, Stata, EViews, Python, R, Power BI, Tableau, and Google TensorFlow.
  • Augustine Otobi Ogbaji from Nigeria.
  • Postgraduate student at the University of Calabar and faculty member at SICSS.
  • Data Science and Machine Learning Engineer.
  • Passionate about Artificial Intelligence.

3 of 62

Importance of Data Modeling in Social Science

  • Test and develop theories.
  • Predict and forecast future social trends and phenomena.
  • Understand complex relationships among variables.
  • Generate new hypotheses and research questions.
  • Assess the impact of social policies and interventions.
  • Encourage data-driven decision-making.
  • Communicate research findings.

4 of 62

IDE and Packages for this Lecture

  • The main IDE for R is RStudio.
  • Packages:
    • Base R packages
    • visdat
    • tidyverse
    • dlookr
    • ggplot2
    • missRanger
    • summarytools
    • gapminder

5 of 62

Outline

  • Introduction to Data Modeling
    • Types of Data (Statistics)
    • Types of Data (Regression)
    • Data Cleaning and Preprocessing
  • Univariate Analysis
  • Bivariate Analysis
  • Multivariate Analysis

  • Correlation Analysis
  • Regression Analysis
  • Time Series Analysis

6 of 62

Introduction to Data Modeling - Types of Data

Qualitative Data (Categorical)

Nominal

  • Names, labels, categories with no natural order
  • E.g. gender, countries, marital status, etc…

Ordinal

  • Names, labels, categories with an order.
  • E.g. Likert Scales (Strongly Disagree, Disagree, Neutral, Agree, Strongly Agree)

Quantitative Data (Numeric)

Discrete

  • Measures of variables that can be counted.
  • E.g. number of students, number of passengers in a vehicle, number of SICSS participants, etc…

Continuous

  • Measured on a continuum, or within an interval.
  • E.g. height, weight, temperature, etc…

7 of 62

Types of Data in Regression

Cross-Sectional Data

  • Data collected at a single point in time from different individuals, or subjects.

Time Series Data

  • Data collected on one or more variables over a series of time intervals or periods, typically at regular intervals.

Panel Data

  • Consists of both cross-sectional and time series data; data is collected on multiple individuals or subjects over multiple time periods.

8 of 62

Introduction to Data Modeling - Regression Models by Type of Data

Cross-Sectional Data

  • Linear Regression
  • Linear Probability Model, Logit, and Probit Models
  • Ordinal Logit and Probit Models
  • Multinomial Logit and Probit Models
  • Instrumental Variable Estimation and Two-Stage Least Squares (2SLS), 3SLS.
  • Tobit Model
  • Poisson Regression Model
  • Heckman Model, etc…

Time Series Data

  • Autoregressive (AR) Model
  • Moving Average (MA) Model
  • ARMA Model
  • Autoregressive Integrated Moving Average (ARIMA) Model
  • Seasonal ARIMA Model
  • Vector Autoregression (VAR)
  • Autoregressive Distributed Lag (ARDL) Model, etc…

Panel Data

  • Pooled OLS, Fixed Effects, and Random Effects Models
  • Dynamic Panel Model, etc…

9 of 62

Data Cleaning

  • Data type conversion

  • Handling missing data
    • remove, or
    • impute – mean, median, mode, predictive modeling

  • Outlier detection and treatment

10 of 62

Data Cleaning

  • Data type conversion
    • Convert character data types to factors.

dplyr::mutate()
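A minimal sketch of the conversion with dplyr::mutate(). The `survey` data frame and its column names are hypothetical and exist only for illustration.

```r
library(dplyr)

# Hypothetical survey data (not from the lecture); columns are illustrative
survey <- data.frame(
  gender = c("Female", "Male", "Female", "Male"),
  agree  = c("Agree", "Neutral", "Disagree", "Agree"),
  age    = c(23, 35, 41, 29),
  stringsAsFactors = FALSE
)

survey_clean <- survey %>%
  mutate(
    # Nominal variable: unordered factor
    gender = factor(gender),
    # Ordinal variable: ordered factor with an explicit level order
    agree = factor(agree,
                   levels = c("Disagree", "Neutral", "Agree"),
                   ordered = TRUE)
  )

str(survey_clean)
```

Declaring the level order explicitly matters for ordinal data, because R otherwise sorts factor levels alphabetically.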

11 of 62

Data Cleaning

  • Handling missing data

dlookr::plot_na_pareto()

visdat::vis_miss()

    • Remove missing values

data %>% drop_na()

    • impute – mean, median, mode, predictive modeling

dlookr::imputate_na()

missRanger::missRanger()
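The two basic strategies (remove or impute) can be sketched as follows. The `df` data frame is hypothetical, and the median imputation is written by hand with dplyr rather than with dlookr::imputate_na() or missRanger::missRanger(), which offer richer, model-based imputation.

```r
library(dplyr)
library(tidyr)

# Hypothetical data with missing values (illustration only)
df <- data.frame(
  income = c(1200, NA, 1500, 900, NA, 2000),
  region = c("North", "South", NA, "South", "North", "South")
)

# Count missing values per column
colSums(is.na(df))

# Option 1: drop every row that has at least one missing value
df_complete <- df %>% drop_na()

# Option 2: replace missing incomes with the median of the observed incomes
df_imputed <- df %>%
  mutate(income = ifelse(is.na(income),
                         median(income, na.rm = TRUE),
                         income))

nrow(df_complete)
df_imputed
```

Dropping rows is simple but discards information; imputation keeps the sample size at the cost of an assumption about the missing values.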

12 of 62

Univariate Analysis – Introduction

Meaning

  • Analyze a single variable.

Relevance

  • Helps understand the nature of a single variable.
    • Distribution
    • Central tendency
    • Variability

13 of 62

Univariate Analysis – Descriptive Statistics

Categorical

  • Frequency Distribution Table

Numeric

  • Measures of Central Tendency
    • Mean, Mode, Median

  • Measures of Dispersion/Spread
    • Range, Variance, Standard Deviation, Interquartile Range

  • Other Measures
    • Skewness, Kurtosis, Coefficient of Variation (CV), etc…

14 of 62

Univariate Analysis – Descriptive Statistics

Categorical

  • Frequency Distribution Table

freq()

Numeric

  • Measures of Central Tendency
  • Measures of Dispersion
  • Other Measures

descr()

install.packages("summarytools")

library(summarytools)
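A short sketch of both functions, using the built-in `iris` data as a stand-in for the lecture data (an assumption; any data frame with a categorical and a numeric column would do). The base R calls at the end are cross-checks.

```r
library(summarytools)

# Categorical variable: frequency distribution table
freq(iris$Species)

# Numeric variable: central tendency, dispersion, skewness, kurtosis, CV
descr(iris$Sepal.Length)

# Base R cross-checks of a few of the reported statistics
species_counts <- table(iris$Species)
sepal_mean     <- mean(iris$Sepal.Length)
sepal_median   <- median(iris$Sepal.Length)
sepal_sd       <- sd(iris$Sepal.Length)
```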

15 of 62

Univariate Analysis – Data Visualization

Categorical

  • Bar plot

Numeric

  • Histogram
  • Density plot
  • Box plot
  • Dot plot
  • Frequency polygon
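The plot types listed above could be drawn with ggplot2 roughly as follows. The `mpg` data set ships with ggplot2 and is used here only as a convenient stand-in; the variable choices are assumptions.

```r
library(ggplot2)

# Categorical variable: bar plot of counts per category
p_bar <- ggplot(mpg, aes(x = class)) + geom_bar()

# Numeric variable: five common views of one distribution
p_hist     <- ggplot(mpg, aes(x = hwy)) + geom_histogram(bins = 20)
p_density  <- ggplot(mpg, aes(x = hwy)) + geom_density()
p_box      <- ggplot(mpg, aes(y = hwy)) + geom_boxplot()
p_dot      <- ggplot(mpg, aes(x = hwy)) + geom_dotplot(binwidth = 1)
p_freqpoly <- ggplot(mpg, aes(x = hwy)) + geom_freqpoly(bins = 20)

plots <- list(p_bar, p_hist, p_density, p_box, p_dot, p_freqpoly)
```

Printing any of these objects (for example `p_hist`) renders the plot.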

16 of 62

Univariate Analysis – Data Visualization (Histogram)

[Figure: three histograms showing left-skewed, symmetric, and right-skewed distributions]

17 of 62

Univariate Analysis – Data Visualization (Histogram)

Symmetric

The mean and median are (approximately) equal in a symmetric distribution, such as the normal distribution.

18 of 62

Univariate Analysis – Data Visualization (Histogram)

Right-skewed

The mean is sensitive to extreme values, so the median is the more robust measure of center in a skewed distribution. In a right-skewed distribution, the mean is pulled toward the long right tail and is typically larger than the median.

19 of 62

Univariate Analysis – Data Visualization (Histogram)

Left-skewed

The mean is sensitive to extreme values, so the median is the more robust measure of center in a skewed distribution. In a left-skewed distribution, the mean is pulled toward the long left tail and is typically smaller than the median.
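The effect of skew on the mean and median can be checked with simulated data. The distributions and parameters below are arbitrary choices made for illustration.

```r
set.seed(123)

# Simulated samples (parameters are illustrative assumptions)
symmetric    <- rnorm(1000, mean = 50, sd = 10)  # symmetric
right_skewed <- rexp(1000, rate = 1 / 10)        # long right tail
left_skewed  <- 100 - right_skewed               # mirror image: long left tail

center <- function(x) c(mean = mean(x), median = median(x))

rbind(
  symmetric    = center(symmetric),
  right_skewed = center(right_skewed),
  left_skewed  = center(left_skewed)
)
```

In the right-skewed sample the mean exceeds the median, and in the left-skewed sample the order is reversed.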

20 of 62

Univariate Analysis – Variability or Spread

Numeric

  • Range
    • Maximum value – Minimum value
    • Very sensitive to outliers or extreme values, making it an unreliable measure of spread.

  • Variance
    • Dispersion of data points around the mean.
    • Average squared deviation from the mean.
    • Sensitive to outliers: outliers with large deviations from the mean inflate the variance.

21 of 62

Univariate Analysis – Variability or Spread

Numeric

  • Standard Deviation
    • Dispersion of data points around the mean.
    • Square root of the variance; roughly the typical deviation from the mean, in the original units of the data.
    • Sensitive to outliers: outliers with large deviations from the mean inflate the standard deviation.

  • Interquartile Range (IQR)
    • Difference between 3rd and 1st quartiles (Q3 – Q1).
    • Less sensitive to extreme values and skewed data because it covers only the middle 50% of the data and is therefore less influenced by extreme values in the tails of the distribution.

Which is robust?

  • In a symmetric/normal distribution, the standard deviation is the appropriate measure of spread.
  • In a skewed distribution, the IQR is the more robust measure.
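A small sketch of how one outlier affects each measure of spread. The numbers are made up for illustration.

```r
# Made-up values; x_out adds a single extreme observation
x     <- c(10, 12, 13, 14, 15, 16, 18)
x_out <- c(x, 100)

spread <- function(v) {
  c(range    = diff(range(v)),  # maximum minus minimum
    variance = var(v),          # sample variance (divides by n - 1)
    sd       = sd(v),           # square root of the variance
    iqr      = IQR(v))          # Q3 minus Q1
}

rbind(original = spread(x), with_outlier = spread(x_out))
```

The range, variance, and standard deviation jump sharply when the outlier is added, while the IQR barely moves.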

22 of 62

Bivariate Analysis – Introduction

Meaning

  • Explores the relationship between two variables.

Relevance

  • Identify relationships.
  • Assess correlations and their statistical significance.
  • Test hypotheses about causal relationships.

23 of 62

Bivariate Analysis – Assessing Relationships

Two Numeric Variables

  • Scatter Plot
  • Correlation

Two Categorical Variables

  • Contingency Table
  • Chi-squared Test

One Numeric, One Categorical

  • Box Plot / Violin Plot
  • T-test / ANOVA

24 of 62

Bivariate Analysis – Assessing Relationships (Numeric)

[Figure: scatter plots showing a positive relationship and a negative relationship]

25 of 62

Bivariate Analysis – Assessing Relationships

[Figure: scatter plot showing almost no relationship]

26 of 62

Bivariate Analysis – Assessing Relationships


27 of 62

Bivariate Analysis – Assessing Relationships

  • Correlation Coefficient
    • Quantify strength and direction of the linear relationship between two numeric variables.

28 of 62

Bivariate Analysis – Assessing Relationships

Types of Correlation Coefficients:

  • Pearson Product-Moment Correlation Coefficient or Pearson Correlation Coefficient.

  • Spearman Correlation Coefficient (Spearman’s rho)

  • Kendall’s Tau Correlation Coefficient
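All three coefficients are available through the `method` argument of base R's cor(). The built-in `mtcars` data is used as a stand-in, which is an assumption for illustration.

```r
# Car weight versus fuel efficiency from the built-in mtcars data
x <- mtcars$wt
y <- mtcars$mpg

r_pearson  <- cor(x, y, method = "pearson")   # linear association
r_spearman <- cor(x, y, method = "spearman")  # rank-based, monotonic association
r_kendall  <- cor(x, y, method = "kendall")   # rank-based, concordant vs. discordant pairs

round(c(pearson = r_pearson, spearman = r_spearman, kendall = r_kendall), 3)
```

Spearman and Kendall work on ranks, so they are less affected by outliers and do not require the relationship to be linear.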

29 of 62

Bivariate Analysis – Assessing Relationships

  • Null hypothesis (H0): the correlation is not different from zero.
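The statement that the correlation is not different from zero is the null hypothesis of a correlation test. A sketch with base R's cor.test(), again using `mtcars` as an assumed stand-in and a conventional 5% significance level:

```r
# Test H0: the true correlation between weight and mpg is zero
ct <- cor.test(mtcars$wt, mtcars$mpg, method = "pearson")

ct$estimate   # sample correlation
ct$p.value    # p-value of the test
ct$conf.int   # 95% confidence interval for the correlation

# Decision at the 5% level (the threshold is a convention, not a rule)
reject_null <- ct$p.value < 0.05
reject_null
```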

30 of 62

Bivariate Analysis – Assessing Relationships (Categorical)

  • Analyzing the relationship between two categorical variables typically involves creating contingency tables and conducting chi-squared tests.

Contingency Table

  • Construct a table that cross-tabulates the two categorical variables.
  • Each cell shows the count/frequency of observations in each combination of categories.

summarytools::ctable(var1, var2)

Chi-squared Test

  • Chi-squared test of independence.
  • Assesses whether the observed frequencies differ significantly from what would be expected if the variables were independent.

tab <- table(var1, var2)

chisq.test(tab)
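A runnable sketch of the contingency table and chi-squared test in base R. The variables come from the built-in `mtcars` data and are stand-ins chosen for illustration.

```r
# Two categorical variables: number of cylinders and transmission type
cyl <- factor(mtcars$cyl)
am  <- factor(mtcars$am, labels = c("automatic", "manual"))

# Contingency table of observed counts
tab <- table(cyl, am)
tab

# Chi-squared test of independence.
# With a sample this small, R warns that the approximation may be
# inaccurate because some expected counts are below 5.
chi <- suppressWarnings(chisq.test(tab))

chi$expected  # counts expected under independence
chi$p.value
```

When expected counts are small, fisher.test(tab) is a common alternative.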

31 of 62

Bivariate Analysis – One Numeric, One Categorical

  • Involves exploring differences in numeric values across different categories.

Data Visualization

  • Box plot/ Violin plot

T-Test

  • Tests whether the mean of the numeric variable differs significantly between the two categories of the categorical variable (use ANOVA for more than two categories).

t.test(nvar1 ~ cvar2)

ggstatsplot::ggbetweenstats()
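A sketch of the formula interface in base R, with `mtcars` as an assumed stand-in: `mpg` is the numeric variable, and `am` (two categories) and `cyl` (three categories) are the categorical ones.

```r
# Two categories: Welch two-sample t-test of mpg by transmission type
tt <- t.test(mpg ~ am, data = mtcars)
tt$estimate  # group means
tt$p.value

# More than two categories: one-way ANOVA of mpg by number of cylinders
fit_aov <- aov(mpg ~ factor(cyl), data = mtcars)
summary(fit_aov)
```

By default t.test() does not assume equal variances in the two groups (Welch's test); pass `var.equal = TRUE` for the classical pooled-variance version.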

32 of 62

Multivariate Analysis – Introduction

Meaning

  • Statistical technique used to analyze data with multiple variables simultaneously.

Relevance

  • Gain deeper insights into complex datasets.
  • Uncover hidden patterns in data.
  • Explore relationships between variables.
  • Dimensionality reduction.

33 of 62

Multivariate Analysis – Types

Principal Component Analysis (PCA)

  • Reduces the dimensionality of the data while preserving as much variance as possible.
  • Identifies important variables or patterns.

Factor Analysis

  • Identifies latent factors explaining correlations.

Cluster Analysis

  • Groups data points based on similarities.
  • Examples: K-means, Hierarchical clustering.
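PCA and cluster analysis can both be sketched with base R. The `iris` measurements are used as a stand-in data set, and the choice of three clusters is an assumption made for illustration.

```r
# Standardize the four numeric columns so each has mean 0 and sd 1
X <- scale(iris[, 1:4])

# Principal Component Analysis
pca <- prcomp(X)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 3)  # share of variance captured by each component

# K-means clustering with 3 clusters (several random starts for stability)
set.seed(42)
km <- kmeans(X, centers = 3, nstart = 25)
table(cluster = km$cluster, species = iris$Species)

# Hierarchical clustering on Euclidean distances, cut into 3 groups
hc <- hclust(dist(X), method = "ward.D2")
hc_groups <- cutree(hc, k = 3)
```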

34 of 62

Multivariate Analysis – Types

Multivariate Analysis of Variance (MANOVA)

  • Extends ANOVA to multiple dependent variables.
  • Assesses significant group differences with multiple response variables.

Structural Equation Modeling (SEM)

  • Models complex relationships among observed and latent variables.
  • Combines factor analysis and regression to test causal hypotheses.

35 of 62

Regression Analysis

  • Regression is the study of the dependence of one variable (dependent variable) on one or more other variables (independent variable[s]), with the aim of estimating or predicting the mean of the dependent variable based on the known values of the independent variable(s).

“Galton’s Universal Law of Regression”

  • Find out how the average height of sons changes given the fathers’ height.
  • Dependent variable: height of sons.
  • Independent variable: fathers’ height.

36 of 62

Regression Analysis – Terminology

Dependent Variable   | Independent Variable
Explained Variable   | Explanatory Variable
Predictand           | Predictor
Regressand           | Regressor
Outcome Variable     | Covariate
Controlled Variable  | Control Variable

37 of 62

Linear Regression Analysis – Simple vs. Multiple

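  • Simple linear regression (one independent variable): Height of Sons ~ Fathers’ Height
  • Multiple linear regression (two or more independent variables): Height of Sons ~ Fathers’ Height + Mothers’ Height + Nutrition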
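The difference between the two models can be sketched with lm(). The data below are simulated; the variable names and coefficients are hypothetical and are not Galton's data.

```r
set.seed(1)
n <- 200

# Simulated, hypothetical heights (cm) and a standardized nutrition score
father    <- rnorm(n, mean = 175, sd = 7)
mother    <- rnorm(n, mean = 162, sd = 6)
nutrition <- rnorm(n)
son <- 40 + 0.45 * father + 0.35 * mother + 2 * nutrition + rnorm(n, sd = 4)

heights <- data.frame(son, father, mother, nutrition)

# Simple linear regression: one independent variable
simple_fit <- lm(son ~ father, data = heights)

# Multiple linear regression: several independent variables
multiple_fit <- lm(son ~ father + mother + nutrition, data = heights)

coef(simple_fit)
coef(multiple_fit)
c(simple   = summary(simple_fit)$r.squared,
  multiple = summary(multiple_fit)$r.squared)
```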

38 of 62

Assumptions of Classical Linear Regression Model

  • There is no perfect linear relationship between the independent variables (no perfect multicollinearity).
  • The variance of the error term is constant or homoscedastic.
  • There is no autocorrelation between the error terms.

What if these assumptions are violated?

  • Multicollinearity
    • Inflates the variance of the coefficient estimates, so regression coefficients cannot be estimated with great precision; diagnosed with the Variance Inflation Factor (VIF).
  • Heteroscedasticity
    • Incorrect standard errors, and unreliable tests of significance.
  • Autocorrelation
    • Incorrect standard errors, and unreliable tests of significance.
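As an illustration of the multicollinearity diagnostic, the VIF of a predictor can be computed by hand as 1 / (1 - R²), where R² comes from regressing that predictor on the remaining predictors. The helper `vif_manual()` is a hypothetical name written for this sketch, and `mtcars` is an assumed stand-in data set.

```r
predictors <- c("wt", "hp", "disp")

# Hypothetical helper: VIF of one predictor given the full predictor set
vif_manual <- function(var, predictors, data) {
  others <- setdiff(predictors, var)
  # Auxiliary regression of this predictor on all the others
  aux <- lm(reformulate(others, response = var), data = data)
  1 / (1 - summary(aux)$r.squared)
}

vifs <- sapply(predictors, vif_manual, predictors = predictors, data = mtcars)
round(vifs, 2)
```

A VIF of 1 means a predictor is uncorrelated with the others; larger values signal stronger collinearity. Packaged alternatives include car::vif() for VIFs, lmtest::bptest() for heteroscedasticity, and lmtest::dwtest() for autocorrelation.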

39 of 62

Linear Regression Analysis – Problem

Dataset

  • Car Resale Data - 2023 from Kaggle
  • Source: https://www.kaggle.com/datasets/rahulmenon1758/car-resale-prices

Nature of Variables

Dependent Variable       | Continuous
Independent Variable(s)  | Numeric or categorical

lm(formula, data)
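A sketch of the lm() workflow with a continuous dependent variable and a mix of numeric and categorical independent variables. The Kaggle file is not reproduced here, so the built-in `mtcars` data stands in for it; the variable choices are assumptions.

```r
# Stand-in data: mtcars instead of the Kaggle car resale file
cars_df <- mtcars
cars_df$am <- factor(cars_df$am, labels = c("automatic", "manual"))

# Continuous outcome (mpg), numeric predictors (wt, hp), categorical predictor (am)
car_fit <- lm(mpg ~ wt + hp + am, data = cars_df)
summary(car_fit)

# Predict the outcome for a new, hypothetical car
new_car <- data.frame(
  wt = 3,
  hp = 120,
  am = factor("manual", levels = levels(cars_df$am))
)
predicted_mpg <- predict(car_fit, newdata = new_car)
predicted_mpg
```

R turns the factor into a dummy variable automatically; the coefficient `ammanual` is the expected difference from the reference category (automatic), holding the other predictors fixed.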

40 of 62

Now, let’s practice

41 of 62

LPM, Logit and Probit Models

  • The dependent variable is binary.

Examples:

  • Yes or No (0 = No, 1 = Yes)
  • Health Outcome (0 = Not Cured, 1 = Cured)
  • Votes (0 = NPP, 1 = NDC)
  • Patient Satisfaction with Healthcare (0 = Satisfied, 1 = Not Satisfied)

42 of 62

LPM, Logit and Probit Models

Approaches to Modeling Binary Outcome Variables

  • Linear Probability Model (LPM)
  • Logit or Logistic Regression Model
  • Probit (or Normit) Model

43 of 62

LPM, Logit and Probit Models


44 of 62

LPM, Logit and Probit Models


45 of 62

LPM, Logit and Probit Models

How do we get around the problem of the LPM, namely predicted probabilities that violate the rules of probability?

Constrained LPM

  • Estimate the LPM by the usual OLS method. If the estimated Y has values less than 0 (i.e. negative), they are set to 0; values greater than 1 are set to 1.

Logit and Probit Models

  • Devise an estimating technique that guarantees that the estimated conditional probabilities of Y lie between 0 and 1.
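The constrained LPM can be sketched in a few lines of base R. The binary outcome `am` from the built-in `mtcars` data is an assumed stand-in for a real 0/1 dependent variable.

```r
# Linear probability model: OLS with a binary (0/1) dependent variable
lpm <- lm(am ~ wt, data = mtcars)

# Fitted values are interpreted as probabilities, but OLS does not bound them
p_hat <- fitted(lpm)
range(p_hat)

# Constrained LPM: set values below 0 to 0 and values above 1 to 1
p_constrained <- pmin(pmax(p_hat, 0), 1)
range(p_constrained)
```

Clipping fixes the range but not the underlying linearity, which is the motivation for the logit and probit models.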

46 of 62

LPM, Logit and Probit Models

Logit and Probit Models

  • Cumulative distribution functions (CDFs) are sigmoid, or S-shaped.
  • Two commonly used CDFs are the logistic CDF (logit) and the standard normal CDF (probit).
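Both CDFs are built into base R as plogis() and pnorm(). The grid of values below is an arbitrary choice for illustration.

```r
# Grid of values for the linear predictor
z <- seq(-4, 4, by = 0.5)

logistic_cdf <- plogis(z)  # logistic CDF, used by the logit model
normal_cdf   <- pnorm(z)   # standard normal CDF, used by the probit model

round(data.frame(z, logistic_cdf, normal_cdf), 3)

# Optional: draw the two S-shaped curves
# curve(plogis(x), from = -4, to = 4, ylab = "Probability")
# curve(pnorm(x), add = TRUE, lty = 2)
```

Both functions map any real number into the interval (0, 1), which is what keeps the predicted probabilities valid.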

47 of 62

LPM, Logit and Probit Models

A scenario

  • Results of 10 matches: Win, Win, Win, Win, Lose, Lose, Lose, Lose, Lose, Lose

Match  | Frequency
Win    | 4
Lose   | 6

What can we do with count data?

  • What is the PROBABILITY that a team wins a match?
  • What are the ODDS of winning the match?

48 of 62

LPM, Logit and Probit Models

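Match  | Frequency
Win    | 4
Lose   | 6

Probability is the number of favorable events divided by the total number of outcomes.

$$P(\text{Win}) = \frac{4}{4 + 6} = 0.4$$

Odds are the ratio of the chances of something happening to the chances of it not happening.

$$\text{Odds}(\text{Win}) = \frac{4}{6} \approx 0.67$$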

49 of 62

LPM, Logit and Probit Models

 

Probability and odds can be expressed in terms of each other.

$$\text{Odds} = \frac{P}{1 - P}, \qquad P = \frac{\text{Odds}}{1 + \text{Odds}}$$
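The conversions can be verified with the counts from the match scenario (4 wins, 6 losses). The log-odds line is an addition that connects the idea to the logit model; the variable names are chosen for this sketch.

```r
wins   <- 4
losses <- 6

# Probability: favorable events over all outcomes
p_win <- wins / (wins + losses)

# Odds: events happening over events not happening
odds_win <- wins / losses

# Converting between the two
odds_from_p <- p_win / (1 - p_win)
p_from_odds <- odds_win / (1 + odds_win)

# Log-odds (the logit), the scale on which the logit model is linear
log_odds <- log(odds_win)

c(p_win = p_win, odds_win = odds_win, log_odds = log_odds)
```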

 

 

 

 

 

50 of 62

LPM, Logit and Probit Models


51 of 62

LPM, Logit and Probit Models

Dataset

  • Employee Dataset from Kaggle
  • Source: https://www.kaggle.com/datasets/tawfikelmetwally/employee-dataset

Nature of Variables

Dependent Variable       | Categorical (only 2 categories)
Independent Variable(s)  | Numeric or categorical

glm(formula, family, data)

mfx::logitmfx()

mfx::probitmfx()
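A sketch of logit and probit estimation with glm(). The Kaggle employee file is not reproduced here, so the binary variable `am` in the built-in `mtcars` data is used as an assumed stand-in outcome.

```r
# Logit model: binomial family with the logit link
logit_fit <- glm(am ~ wt, data = mtcars, family = binomial(link = "logit"))

# Probit model: same family, probit link
probit_fit <- glm(am ~ wt, data = mtcars, family = binomial(link = "probit"))

summary(logit_fit)

# Logit coefficients are on the log-odds scale; exponentiate for odds ratios
odds_ratios <- exp(coef(logit_fit))
odds_ratios

# Predicted probabilities always lie between 0 and 1
p_logit  <- predict(logit_fit, type = "response")
p_probit <- predict(probit_fit, type = "response")
range(p_logit)
```

Raw logit and probit coefficients are not directly comparable, which is why marginal effects (for example from the mfx functions named above) are usually reported.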

52 of 62

Now, let’s practice

53 of 62

Time Series Analysis - Introduction

  • A time series is a chronological sequence of data recorded over time.
  • The basic assumption is that observations are evenly spaced.

54 of 62

Time Series Analysis – Time Series Objects in R

  • The ts() function is used to create a time series object.

ts(data, start, end, frequency)

  • The is.ts() function checks whether an object is of class ts.

is.ts()
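A minimal sketch of creating and inspecting a ts object. The monthly values are simulated and the start date is an arbitrary assumption.

```r
set.seed(10)

# 24 simulated monthly observations (illustrative values only)
values <- round(100 + cumsum(rnorm(24)), 1)

# Monthly series starting in January 2022 (frequency = 12 periods per year)
sales_ts <- ts(values, start = c(2022, 1), frequency = 12)

is.ts(values)    # a plain numeric vector is not a ts object
is.ts(sales_ts)  # the result of ts() is

start(sales_ts)
end(sales_ts)
frequency(sales_ts)
```

Printing `sales_ts` shows the values laid out by year and month.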

55 of 62

Now, let’s practice

56 of 62

Time Series Analysis - Trends

  • Some time series do not exhibit any trend.

57 of 62

Time Series Analysis - Trends

  • Some time series exhibit linear trends over time.

58 of 62

Time Series Analysis - Trends

  • Some time series exhibit rapid growth trends over time.

59 of 62

Time Series Analysis - Trends

  • Some time series exhibit periodic trends over time.

60 of 62

Time Series Analysis - Variance

  • Some time series exhibit increasing variance over time.
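The patterns described on the preceding slides (no trend, linear trend, rapid growth, periodic behavior, increasing variance) can be imitated with simulated series. All parameters below are arbitrary assumptions chosen so that each pattern is visible.

```r
set.seed(2024)
time_index <- 1:120

no_trend       <- ts(rnorm(120))                                    # no trend
linear_trend   <- ts(0.5 * time_index + rnorm(120, sd = 5))         # linear trend
rapid_growth   <- ts(exp(0.03 * time_index) + rnorm(120, sd = 0.5)) # exponential growth
periodic       <- ts(10 * sin(2 * pi * time_index / 12) + rnorm(120),
                     frequency = 12)                                # repeating cycle
increasing_var <- ts(rnorm(120, sd = time_index / 20))              # growing variance

# Plot the five series in one column
# plot(cbind(no_trend, linear_trend, rapid_growth,
#            periodic = as.numeric(periodic), increasing_var), main = "")

# A common remedy for a linear trend: first differences
detrended <- diff(linear_trend)
```

Differencing removes a linear trend, and a log transformation is often used to stabilize rapid growth or increasing variance when the series is positive.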

61 of 62

Time Series Analysis – Basic Time Series Models

  • White Noise (WN)
  • Random Walk (RW)
  • Autoregression (AR)
  • Simple Moving Average (MA)
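Each of the four basic models can be simulated in base R. This sketch assumes that "Simple Moving Average" refers to a first-order moving average, MA(1), process; the coefficients 0.7 and 0.5 are arbitrary.

```r
set.seed(99)
n <- 200

# White noise: independent draws with constant mean and variance
white_noise <- ts(rnorm(n))

# Random walk: cumulative sum of white noise
random_walk <- ts(cumsum(rnorm(n)))

# AR(1): each value depends on the previous value
ar1 <- arima.sim(model = list(ar = 0.7), n = n)

# MA(1): each value depends on the current and previous shock
ma1 <- arima.sim(model = list(ma = 0.5), n = n)

# Fit an AR(1) model to the simulated AR series and recover the coefficient
ar_fit <- arima(ar1, order = c(1, 0, 0))
coef(ar_fit)
```

Differencing the random walk with diff(random_walk) returns a white noise series, which is why a random walk is a standard example of a non-stationary process.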

62 of 62

THANK YOU