2 of 62

About Me

Elijah Appiah from Ghana.
Ph.D. Economics at NIDA in Bangkok, Thailand.
Economist by profession and Data Scientist by passion.
Enthusiastic about working with data daily.
Technical skills in LaTeX, Microsoft Office (Word, Excel, PowerPoint), SPSS, Stata, EViews, Python, R, Power BI, Tableau, and Google TensorFlow.

Augustine Otobi Ogbaji from Nigeria.
Postgraduate student at the University of Calabar and a faculty member at SICSS.
Data Science and Machine Learning Engineer.
Passionate about Artificial Intelligence.

3 of 62

Importance of Data Modeling in Social Science

Test and develop theories.
Predict and forecast future social trends and phenomena.
Understand complex relationships among variables.
Generate new hypotheses and research questions.
Assess the impact of social policies and interventions.
Encourage data-driven decision-making.
Communicate research findings.

4 of 62

IDE and Packages for this Lecture

The main IDE for R is RStudio.
Packages:

Base R Packages	visdat
tidyverse	dlookr
ggplot2	missRanger
summarytools	gapminder

5 of 62

Outline

Introduction to Data Modeling

Types of Data (Statistics)
Types of Data (Regression)
Data Cleaning and Preprocessing

Univariate Analysis
Bivariate Analysis
Multivariate Analysis

Correlation Analysis
Regression Analysis
Time Series Analysis

6 of 62

Introduction to Data Modeling - Types of Data

Qualitative Data (Categorical)

Nominal

Names, labels, categories with no natural order
E.g. gender, countries, marital status, etc…

Ordinal

Names, labels, categories with an order.
E.g. Likert Scales (Strongly Disagree, Disagree, Neutral, Agree, Strongly Agree)

Quantitative Data (Numeric)

Discrete

Measures of variables that can be counted.
E.g. number of students, number of passengers in a vehicle, number of SICSS participants, etc…

Continuous

Measured on a continuum, or within an interval.
E.g. height, weight, temperature, etc…

7 of 62

Types of Data in Regression

Cross-Sectional Data

Data collected at a single point in time from different individuals, or subjects.

Time Series Data

Data collected on one or more variables over a series of time intervals or periods, typically at regular intervals.

Panel Data

Consists of both cross-sectional and time series data; data is collected on multiple individuals or subjects over multiple time periods.

8 of 62

Introduction to Data Modeling - Types of Data

Cross-Sectional Data

Linear Regression
Linear Probability Model, Logit, and Probit Models
Ordinal Logit and Probit Models
Multinomial Logit and Probit Models
Instrumental Variable Estimation and Two-Stage Least Squares (2SLS), 3SLS.
Tobit Model
Poisson Regression Model
Heckman Model, etc…

Time Series Data

Autoregressive (AR) Model
Moving Average (MA) Model
ARMA Model
Autoregressive Integrated Moving Average (ARIMA) Model
Seasonal ARIMA Model
Vector Autoregression (VAR)
Autoregressive Distributed Lag (ARDL) Model, etc…

Panel Data

Pooled OLS, Fixed Effects, and Random Effects Models
Dynamic Panel Model, etc…

9 of 62

Data Cleaning

Data type conversion

Handling missing data

remove, or
impute – mean, median, mode, predictive modeling

Outlier detection and treatment

10 of 62

Data Cleaning

Data type conversion
Convert character data types to factors.

dplyr::mutate()
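
A minimal sketch of the conversion, assuming a small made-up data frame (the `survey` object and its columns are hypothetical). It uses base R so that it runs without extra packages; the dplyr form is shown as a comment.

```r
# Hypothetical survey data; the column names are made up for illustration.
survey <- data.frame(
  gender  = c("Female", "Male", "Female", "Male"),
  country = c("Ghana", "Nigeria", "Ghana", "Thailand"),
  age     = c(23, 35, 41, 29),
  stringsAsFactors = FALSE
)

# Base R: find the character columns and convert each one to a factor.
is_chr <- vapply(survey, is.character, logical(1))
survey[is_chr] <- lapply(survey[is_chr], as.factor)

# dplyr equivalent (requires the dplyr package):
# survey <- survey %>% mutate(across(where(is.character), as.factor))

str(survey)
```

Numeric columns such as `age` are left untouched; only the character columns change type.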

11 of 62

Data Cleaning

Handling missing data

dlookr::plot_na_pareto()

visdat::vis_miss()

Remove missing values

data %>% drop_na()

impute – mean, median, mode, predictive modeling

dlookr::imputate_na()

missRanger::missRanger()
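
The two basic strategies (remove or impute) can be sketched in base R as follows. The `income` vector is hypothetical; the package functions named above offer richer options such as predictive imputation.

```r
# Hypothetical numeric variable with two missing values.
income <- c(1200, NA, 1500, 900, NA, 30000)

sum(is.na(income))    # how many values are missing

# Option 1: remove the missing values.
income_complete <- income[!is.na(income)]

# Option 2: impute with the mean or the median of the observed values.
income_mean   <- ifelse(is.na(income), mean(income, na.rm = TRUE), income)
income_median <- ifelse(is.na(income), median(income, na.rm = TRUE), income)

income_mean
income_median
```

With one very large value in the data, the mean and median imputations differ considerably, which is one reason the median is often preferred for skewed variables.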

12 of 62

Univariate Analysis – Introduction

Meaning

Analyze a single variable.

Relevance

Helps understand the nature of a single variable.

Distribution
Central tendency
Variability

13 of 62

Univariate Analysis – Descriptive Statistics

Categorical

Frequency Distribution Table

Numeric

Measures of Central Tendency

Mean, Mode, Median

Measures of Dispersion/Spread

Range, Variance, Standard Deviation, Interquartile Range

Other Measures

Skewness, Kurtosis, Coefficient of Variation (CV), etc…

14 of 62

Univariate Analysis – Descriptive Statistics

Categorical

Frequency Distribution Table

freq()

Numeric

Measures of Central Tendency
Measures of Dispersion
Other Measures

descr()

install.packages("summarytools")

library(summarytools)
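
For comparison, the same descriptive statistics can be obtained with base R alone. This is a sketch on made-up data (`scores` and `grade` are hypothetical); `freq()` and `descr()` report these quantities in a single formatted table.

```r
scores <- c(52, 60, 60, 67, 71, 75, 88)           # hypothetical exam scores
grade  <- c("B", "A", "B", "C", "B", "A", "C")    # hypothetical categories

# Categorical: frequency distribution table.
table(grade)
prop.table(table(grade))

# Numeric: central tendency.
mean(scores)
median(scores)
mode_value <- as.numeric(names(which.max(table(scores))))  # most frequent value
mode_value

# Numeric: dispersion.
range(scores)
var(scores)
sd(scores)
IQR(scores)

# Coefficient of variation: standard deviation relative to the mean.
cv <- sd(scores) / mean(scores)
cv
```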

15 of 62

Univariate Analysis – Data Visualization

Categorical

Bar plot

Numeric

Histogram
Density plot
Box plot
Dot plot
Frequency polygon

16 of 62

Univariate Analysis – Data Visualization (Histogram)

Left-skewed

Symmetric

Right-skewed

17 of 62

Univariate Analysis – Data Visualization (Histogram)

Symmetric

The mean and median are (approximately) equal in a symmetric or normal distribution.

18 of 62

Univariate Analysis – Data Visualization (Histogram)

Right-skewed

The mean is sensitive to extreme values, so the median is the more robust measure of centre in a skewed distribution. In a right-skewed distribution the mean is pulled towards the long right tail, so it is typically larger than the median.

19 of 62

Univariate Analysis – Data Visualization (Histogram)

Left-skewed

The mean is sensitive to extreme values, so the median is the more robust measure of centre in a skewed distribution. In a left-skewed distribution the mean is pulled towards the long left tail, so it is typically smaller than the median.

20 of 62

Univariate Analysis – Variability or Spread

Numeric

Range

Maximum value – Minimum value
Very sensitive to outliers or extreme values, making it an unreliable measure of spread.

Variance

Dispersion of data points around the mean.
Average squared deviation from the mean.
Sensitive to outliers: outliers with large deviations from the mean inflate the variance.

21 of 62

Univariate Analysis – Variability or Spread

Numeric

Standard Deviation

Dispersion of data points around the mean.
Square root of the variance: the typical deviation from the mean, in the original units of the data.
Sensitive to outliers: outliers with large deviations from the mean inflate the standard deviation.

Interquartile Range (IQR)

Difference between the third and first quartiles (Q3 – Q1).
Less sensitive to extreme values and skewed data because it considers only the middle 50% of the data and is less influenced by extreme values in the tails of the distribution.

Which is robust?

In a symmetric/normal distribution, the standard deviation is the appropriate measure.
In a skewed distribution, the IQR is robust.
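
A small sketch of why the IQR is called robust, using made-up numbers: adding one extreme value changes the range and standard deviation dramatically, while the IQR barely moves.

```r
x         <- c(10, 12, 13, 15, 16, 18, 20)   # roughly symmetric, hypothetical
x_outlier <- c(x, 200)                       # same data plus one extreme value

# Collect three measures of spread for a numeric vector.
spread <- function(v) c(range = diff(range(v)), sd = sd(v), iqr = IQR(v))

rbind(original = spread(x), with_outlier = spread(x_outlier))

# The same pattern holds for centre: the mean moves, the median hardly does.
c(mean = mean(x_outlier), median = median(x_outlier))
```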

22 of 62

Bivariate Analysis – Introduction

Meaning

Explores the relationship between two variables.

Relevance

Identify relationships.
Assess correlations (statistical significance).
Test causal relationships through hypothesis testing.

23 of 62

Bivariate Analysis – Assessing Relationships

Two Numeric Variables

Scatter Plot
Correlation

Two Categorical Variables

Contingency Table
Chi-squared Test

One Numeric, One Categorical

Box Plot / Violin Plot
T-test / ANOVA

24 of 62

Bivariate Analysis – Assessing Relationships (Numeric)

Positive Relationship

Negative Relationship

25 of 62

Bivariate Analysis – Assessing Relationships

Almost No Relationship

26 of 62

Bivariate Analysis – Assessing Relationships

27 of 62

Bivariate Analysis – Assessing Relationships

Correlation Coefficient

Quantify strength and direction of the linear relationship between two numeric variables.

28 of 62

Bivariate Analysis – Assessing Relationships

Types of Correlation Coefficients:

Pearson Product-Moment Correlation Coefficient or Pearson Correlation Coefficient.

Spearman Correlation Coefficient (Spearman’s rho)

Kendall’s Tau Correlation Coefficient
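
All three coefficients are available through the `method` argument of base R's `cor()`, and `cor.test()` adds a significance test. A sketch on hypothetical data (`hours` and `score` are made up):

```r
hours <- c(1, 2, 3, 4, 5, 6, 7, 8)            # hypothetical study hours
score <- c(50, 54, 61, 60, 68, 75, 74, 82)    # hypothetical exam scores

cor(hours, score, method = "pearson")    # linear association
cor(hours, score, method = "spearman")   # rank-based (monotonic) association
cor(hours, score, method = "kendall")    # rank-based, from concordant/discordant pairs

# Test the null hypothesis that the true correlation is zero.
result <- cor.test(hours, score, method = "pearson")
result$estimate
result$p.value
```

Pearson suits roughly linear relationships between continuous variables; Spearman and Kendall are common choices for ordinal data or when outliers are a concern.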

29 of 62

Bivariate Analysis – Assessing Relationships

Null hypothesis of the correlation test: the correlation is not different from zero.

30 of 62

Bivariate Analysis – Assessing Relationships (Categorical)

Analyzing the relationship between two categorical variables typically involves creating contingency tables and conducting chi-squared tests.

Contingency Table

Construct a table that cross-tabulates the two categorical variables.
Each cell shows the count/frequency of observations in each combination of categories.

summarytools::ctable(var1, var2)

Chi-squared Test

Chi-squared test of independence.
Assesses whether the observed frequencies differ significantly from what would be expected if the variables were independent.

tab <- table(var1, var2)

chisq.test(tab)
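
A runnable sketch of the two steps, using made-up responses (`gender` and `vote` are hypothetical variables constructed only for illustration):

```r
# Hypothetical responses: 50 women and 50 men answering Yes or No.
gender <- rep(c("Female", "Male"), each = 50)
vote   <- c(rep(c("Yes", "No"), times = c(30, 20)),   # Female: 30 Yes, 20 No
            rep(c("Yes", "No"), times = c(10, 40)))   # Male:   10 Yes, 40 No

# Step 1: contingency table of observed counts.
tab <- table(gender, vote)
tab

# Step 2: chi-squared test of independence.
test <- chisq.test(tab)
test$expected   # counts expected if the two variables were independent
test$p.value
```

A small p-value indicates that the observed counts differ from the expected counts by more than chance alone would suggest.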

31 of 62

Bivariate Analysis – One Numeric, One Categorical

Involves exploring differences in numeric values across different categories.

Data Visualization

Box plot/ Violin plot

T-Test

Tests whether the mean of the numeric variable differs significantly between the two categories of the categorical variable (use ANOVA when there are more than two categories).

t.test(nvar1 ~ cvar2)

ggstatsplot::ggbetweenstats()
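
A sketch of the formula interface on R's built-in `mtcars` data, chosen here only as a convenient stand-in: `mpg` is the numeric variable and `am` (transmission type) is the two-category variable.

```r
# Label the two transmission categories (0 = automatic, 1 = manual).
car_data <- transform(mtcars, am = factor(am, labels = c("automatic", "manual")))

# boxplot(mpg ~ am, data = car_data)   # uncomment for a visual comparison

# Two-sample t-test (Welch version by default, no equal-variance assumption).
res <- t.test(mpg ~ am, data = car_data)
res$estimate    # mean of mpg in each group
res$p.value
```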

32 of 62

Multivariate Analysis – Introduction

Meaning

Statistical technique used to analyze data with multiple variables simultaneously.

Relevance

Gain deeper insights into complex datasets.
Uncover hidden patterns in data.
Explore relationships between variables.
Dimensionality reduction.

33 of 62

Multivariate Analysis – Types

Principal Component Analysis (PCA)

Reduces the dimensionality of the data while preserving as much of the variance as possible.
Identifies important variables or patterns.

Factor Analysis

Identifies latent factors explaining correlations.

Cluster Analysis

Groups data points based on similarities.
Examples: K-means, Hierarchical clustering.
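
PCA and the two clustering examples can be sketched with base R functions. The built-in `iris` data is used purely as a stand-in, and choosing three clusters is an assumption made for this illustration.

```r
num <- iris[, 1:4]    # four numeric measurements

# PCA on standardized variables.
pca <- prcomp(num, center = TRUE, scale. = TRUE)
summary(pca)$importance["Proportion of Variance", ]

# K-means clustering (random starting centres, so fix the seed).
set.seed(123)
km <- kmeans(scale(num), centers = 3, nstart = 25)
table(cluster = km$cluster, species = iris$Species)

# Hierarchical clustering, cut into three groups.
hc <- hclust(dist(scale(num)), method = "complete")
hc_groups <- cutree(hc, k = 3)
table(hc_groups)
```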

34 of 62

Multivariate Analysis – Types

Multivariate Analysis of Variance (MANOVA)

Extends ANOVA to multiple dependent variables.
Assesses significant group differences with multiple response variables.

Structural Equation Modeling (SEM)

Models complex relationships among observed and latent variables.
Combines factor analysis and regression to test causal hypotheses.

35 of 62

Regression Analysis

Regression is the study of the dependence of one variable (dependent variable) on one or more other variables (independent variable[s]), with the aim of estimating or predicting the mean of the dependent variable based on the known values of the independent variable(s).

“Galton’s Universal Law of Regression”

Find out how the average height of sons changes given the father’s height.

Dependent Variable: Height of Sons

Independent Variable: Fathers’ Height

36 of 62

Regression Analysis – Terminology

Dependent Variable	Independent Variable
Explained Variable	Explanatory Variable
Predictand	Predictor
Regressand	Regressor
Outcome Variable	Covariate
Controlled Variable	Control Variable

37 of 62

Linear Regression Analysis – Simple vs. Multiple

Simple: Height of Sons explained by Fathers’ Height.

Multiple: Height of Sons explained by Fathers’ Height, Mothers’ Height, and Nutrition.

38 of 62

Assumptions of Classical Linear Regression Model

There is no perfect linear relationship between the independent variables (no perfect multicollinearity).
The variance of the error term is constant or homoscedastic.
There is no autocorrelation between the error terms.

What if these assumptions are violated?

Multicollinearity

It inflates the variance of the coefficient estimates, so regression coefficients cannot be estimated with great precision. Diagnosed with the Variance Inflation Factor (VIF).

Heteroscedasticity

Incorrect standard errors, and unreliable tests of significance.

Autocorrelation

Incorrect standard errors, and unreliable tests of significance.
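
As an illustration of the VIF, the sketch below computes it by hand from its definition, VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing predictor j on the other predictors. The model and the `vif_manual()` helper are hypothetical and use the built-in `mtcars` data as a stand-in.

```r
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

# Hypothetical helper: VIF of each predictor from auxiliary regressions.
vif_manual <- function(model) {
  X <- model.matrix(model)[, -1, drop = FALSE]   # drop the intercept column
  sapply(colnames(X), function(v) {
    others <- X[, colnames(X) != v, drop = FALSE]
    r2 <- summary(lm(X[, v] ~ others))$r.squared
    1 / (1 - r2)
  })
}

vif_values <- vif_manual(fit)
vif_values
```

Add-on packages provide ready-made diagnostics, for example `car::vif()` for multicollinearity, `lmtest::bptest()` for heteroscedasticity, and `lmtest::dwtest()` for autocorrelation.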

39 of 62

Linear Regression Analysis – Problem

Dataset

Car Resale Data - 2023 from Kaggle

Source: https://www.kaggle.com/datasets/rahulmenon1758/car-resale-prices

Nature of Variables

Dependent Variable	Independent Variable(s)
Continuous	Numeric or categorical

lm(formula, data)
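
A sketch of the `lm()` call pattern. The Kaggle file is not loaded here, so the built-in `mtcars` data stands in for it: a continuous dependent variable (`mpg`) with numeric (`wt`, `hp`) and categorical (`am`) independent variables.

```r
# Stand-in data: mtcars, with transmission type converted to a factor.
car_data <- transform(mtcars, am = factor(am, labels = c("automatic", "manual")))

# Multiple linear regression: continuous outcome, mixed predictors.
model <- lm(mpg ~ wt + hp + am, data = car_data)

summary(model)   # coefficients, standard errors, R-squared
coef(model)
```

R creates the dummy variable for the factor automatically; the coefficient labelled `ammanual` is the difference from the reference category (automatic), holding the other predictors fixed.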

40 of 62

Now, let’s practice

41 of 62

LPM, Logit and Probit Models

The dependent variable is binary.

Examples:

Yes or No (0 = No, 1 = Yes)
Health Outcome (0 = Not Cured, 1 = Cured)
Votes (0 = NPP, 1 = NDC)
Patient Satisfaction with Healthcare (0 = Satisfied, 1 = Not Satisfied)

42 of 62

LPM, Logit and Probit Models

The dependent variable is binary.

Examples:

Yes or No (0 = No, 1 = Yes)
Votes (0 = NPP, 1 = NDC)
Patient Satisfaction with Healthcare (0 = Satisfied, 1 = Not Satisfied)

Approaches to Modeling Binary Outcome Variables

Linear Probability Model (LPM)
Logit or Logistic Regression Model
Probit (or Normit) Model

43 of 62

LPM, Logit and Probit Models

44 of 62

LPM, Logit and Probit Models

45 of 62

LPM, Logit and Probit Models

How do we get around the problems of LPM – probability rule violation?

Constrained LPM

Estimate the LPM by the usual OLS method. If the estimated Y has some values less than 0 (i.e. negative), Y is assumed to be 0 for those cases; if they are greater than 1, they are assumed to be 1.

Logit and Probit Models

Devise an estimating technique that guarantees that the estimated conditional probabilities of Y lie between 0 and 1.
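
The probability rule violation and the truncation fix can be sketched as follows. The built-in `mtcars` data is a stand-in, with `am` (0/1) as the binary outcome; the model is chosen only for illustration.

```r
# LPM: ordinary least squares on a 0/1 dependent variable.
lpm   <- lm(am ~ wt, data = mtcars)
p_hat <- fitted(lpm)
range(p_hat)        # some fitted "probabilities" fall outside [0, 1]

# Constrained LPM: set values below 0 to 0 and values above 1 to 1.
p_constrained <- pmin(pmax(p_hat, 0), 1)
range(p_constrained)
```

Truncation repairs the range but not the underlying linear functional form, which is one motivation for moving to logit and probit models.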

46 of 62

LPM, Logit and Probit Models

Logit and Probit Models

Cumulative distribution functions (CDFs) are sigmoid, or S-shaped.

Two commonly used CDFs are the logistic CDF (logit) and the standard normal CDF (probit).
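
Both CDFs are built into R as `plogis()` and `pnorm()`. The sketch below tabulates them on a grid of values to show that each maps any real number into the interval (0, 1).

```r
z <- seq(-4, 4, by = 0.5)

cdfs <- data.frame(
  z      = z,
  logit  = plogis(z),   # logistic CDF: 1 / (1 + exp(-z))
  probit = pnorm(z)     # standard normal CDF
)
cdfs

# plot(z, plogis(z), type = "l"); lines(z, pnorm(z), lty = 2)   # S-shaped curves
```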

47 of 62

LPM, Logit and Probit Models

A scenario

Match
Win
Win
Win
Win
Lose
Lose
Lose
Lose
Lose
Lose

Match	Frequency
Win	4
Lose	6

What can we do with count data?

What is the PROBABILITY that a team wins a match?

What are the ODDS of winning the match?

48 of 62

LPM, Logit and Probit Models

Match	Frequency
Win	4
Lose	6

Probability is the number of events of interest divided by the total number of outcomes.

Odds are the ratio of the chance of something happening to the chance of it not happening.
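
Applied to the match table above (4 wins, 6 losses), the two definitions work out as in this short sketch:

```r
wins   <- 4
losses <- 6

p_win    <- wins / (wins + losses)   # 4 / 10 = 0.4
odds_win <- wins / losses            # 4 / 6, about 0.667 (2 to 3)

p_win
odds_win
```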

49 of 62

LPM, Logit and Probit Models

Probability and odds can be expressed in terms of each other.
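
The standard conversions, written with p for the probability of the event, and checked against the match example (4 wins, 6 losses) as a worked illustration:

```latex
\text{odds} = \frac{p}{1 - p},
\qquad
p = \frac{\text{odds}}{1 + \text{odds}}

% Worked example with p = 0.4:
\text{odds} = \frac{0.4}{1 - 0.4} = \frac{0.4}{0.6} = \frac{2}{3},
\qquad
p = \frac{2/3}{1 + 2/3} = \frac{2/3}{5/3} = 0.4
```

The logit model works with the natural logarithm of the odds, ln(p / (1 - p)), which can take any real value.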

50 of 62

LPM, Logit and Probit Models

51 of 62

LPM, Logit and Probit Models

Dataset

Employee Dataset from Kaggle

Source: https://www.kaggle.com/datasets/tawfikelmetwally/employee-dataset

Nature of Variables

Dependent Variable	Independent Variable(s)
Categorical (only 2 categories)	Numeric or categorical

glm(formula, family, data)

mfx::logitmfx()

mfx::probitmfx()
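
A sketch of the `glm()` call for both models. The Kaggle employee file is not loaded here, so the built-in `mtcars` data stands in, with `am` (0/1) as the binary dependent variable. The hand-computed average marginal effect is an illustration only and is not the same quantity that the `mfx` functions report by default.

```r
# Logit and probit differ only in the link function.
logit  <- glm(am ~ wt, data = mtcars, family = binomial(link = "logit"))
probit <- glm(am ~ wt, data = mtcars, family = binomial(link = "probit"))

coef(logit)
coef(probit)
exp(coef(logit))                     # logit coefficients as odds ratios

# Predicted probabilities always lie strictly between 0 and 1.
p_logit <- predict(logit, type = "response")
range(p_logit)

# Average marginal effect of wt in the logit model, computed by hand:
# the mean of f(x'b) * b_wt, where f is the logistic density.
ame_wt <- mean(dlogis(predict(logit, type = "link"))) * coef(logit)["wt"]
ame_wt
```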

52 of 62

Now, let’s practice

53 of 62

Time Series Analysis - Introduction

A time series is a chronological sequence of data recorded over time.

The basic assumption is that observations are evenly spaced.

54 of 62

Time Series Analysis – Time Series Objects in R

The ts() function is used to create a time series object.

ts(data, start, end, frequency)

The is.ts() function checks whether an object is of class ts.

is.ts()
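
A minimal sketch with twelve hypothetical monthly observations (the `sales` values are made up):

```r
sales <- c(12, 15, 14, 18, 20, 22, 21, 25, 27, 30, 29, 33)   # hypothetical monthly values

# frequency = 12 marks the data as monthly; start = c(2023, 1) is January 2023.
sales_ts <- ts(sales, start = c(2023, 1), frequency = 12)
sales_ts

is.ts(sales_ts)   # a ts object
is.ts(sales)      # a plain numeric vector, not a ts object

start(sales_ts)
end(sales_ts)
frequency(sales_ts)

# Extract the second half of the year.
second_half <- window(sales_ts, start = c(2023, 7))
second_half
```

Common `frequency` values are 1 for annual, 4 for quarterly, and 12 for monthly data.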
