1 of 41

R FOR SOCIAL SCIENCES AND HUMANITIES

Prof. Mbe Egom Nja
Dept. of Statistics
University of Calabar
nja@uncial.edu.ng
+2347036507635

2 of 41

R for social sciences

  • R as a statistical package is as useful in the social sciences and humanities as it is in other areas of research. It is crucial in data analysis and big data analytics. Social media generates large volumes of data daily.

3 of 41

  • The importance of R in big data analysis cannot be overemphasized. To address the challenges posed by big data, large volumes of data are often grouped into clusters whose members share the same or similar characteristics.

4 of 41

  • On the basis of this grouping, data analysis can be performed. While there are some other statistical packages, R is the focus of this presentation. We shall use R to do Linear Regression and Logistic Regression.

5 of 41

CLASSICAL LINEAR REGRESSION

  • Linear regression models the relationship between a response variable and one or more independent (explanatory) variables in a continuous data setting.

6 of 41

  • Ordinary least squares (OLS) estimation is used to obtain the estimates of the parameters. Simple linear regression occurs where we have only one explanatory variable. Where there is more than one explanatory variable, we have multiple linear regression. A parameter relates the response variable to the corresponding independent variable.

7 of 41

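  • The fitted linear regression model is stated as follows:

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_k x_k$$

8 of 41

where $\hat{y}$ is the estimate or predicted value of the response variable, the $x$'s are the independent variables, and the $\hat{\beta}$'s are the parameter estimates.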

9 of 41

Example:

In the linear regression model below:

$\hat{y} = 0.8 + 3x_1 + 2x_2$

Here 3 (the coefficient of $x_1$) is the number of units by which the predicted $y$ increases when $x_1$ increases by 1 unit, with $x_2$ held constant.
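The reading of the coefficient can be checked with a few lines of R. This is only a sketch: the function name `predict_y` and the input values are made up for illustration and are not part of the original example.

```r
# Fitted model from the example: y-hat = 0.8 + 3*x1 + 2*x2
# (predict_y is a hypothetical helper name)
predict_y <- function(x1, x2) 0.8 + 3 * x1 + 2 * x2

# Raise x1 by one unit while holding x2 fixed (input values chosen arbitrarily)
y_before <- predict_y(x1 = 2, x2 = 5)
y_after  <- predict_y(x1 = 3, x2 = 5)

# The change in the prediction equals the coefficient of x1
y_after - y_before
```

Repeating the comparison with any other starting values gives the same difference, because the model is linear in $x_1$.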

10 of 41

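  • The x’s represent the factors affecting the response y. The overall effect of the x’s on y can be measured using $R^2$, the proportion of the variation in y explained by the model. The overall significance of the regression model is tested using the F-statistic, which is a function of $R^2$.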

11 of 41

  • In this presentation, we show how the statistical software R is used to perform Multiple Linear Regression Analysis.
  • Example of a social science/humanities problem requiring the use of multiple linear regression.

12 of 41

  • The data illustrate a research situation in which the time spent watching videos (vtd) is the response variable and age and IQ are the independent variables:

13 of 41

vtd   3     1     5     4     6     1     2     7     4
Age   16    40    19    20    21    38    20    19    25
IQ    135   120   142   100   110   140   145   114   137

vtd   3     6     4     3     5     5
Age   24    36    22    18    35    23
IQ    106   100   147   143   138   117

14 of 41

  • The analysis using R proceeds as follows:

15 of 41

Data entry in R
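One plausible way to key in the 15 observations from the table is as three numeric vectors. This is a sketch, not necessarily the exact commands shown on the original slide; the vector names simply follow the table headers.

```r
# Enter the 15 observations from the table as numeric vectors,
# keeping the same order in all three so that positions line up
vtd <- c(3, 1, 5, 4, 6, 1, 2, 7, 4, 3, 6, 4, 3, 5, 5)
Age <- c(16, 40, 19, 20, 21, 38, 20, 19, 25, 24, 36, 22, 18, 35, 23)
IQ  <- c(135, 120, 142, 100, 110, 140, 145, 114, 137,
         106, 100, 147, 143, 138, 117)

# Each vector should hold one value per respondent
c(length(vtd), length(Age), length(IQ))
```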

16 of 41

Creating a data frame
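A data frame binds the three variables into one table with a row per respondent. The sketch below repeats the data entry so that it runs on its own; the object name `video` is an assumption made for illustration.

```r
# Data from the table (repeated here so the block is self-contained)
vtd <- c(3, 1, 5, 4, 6, 1, 2, 7, 4, 3, 6, 4, 3, 5, 5)
Age <- c(16, 40, 19, 20, 21, 38, 20, 19, 25, 24, 36, 22, 18, 35, 23)
IQ  <- c(135, 120, 142, 100, 110, 140, 145, 114, 137,
         106, 100, 147, 143, 138, 117)

# Combine the vectors column-wise; 'video' is a hypothetical name
video <- data.frame(vtd = vtd, Age = Age, IQ = IQ)

str(video)   # variable types and dimensions
head(video)  # first six rows
```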

17 of 41

Model summary
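A minimal sketch of fitting the multiple linear regression and obtaining its summary with base R's `lm()` follows. The object names `video`, `fit` and `fit_summary` are assumptions; the original slide may have used different names.

```r
# Data from the table, entered directly as a data frame
video <- data.frame(
  vtd = c(3, 1, 5, 4, 6, 1, 2, 7, 4, 3, 6, 4, 3, 5, 5),
  Age = c(16, 40, 19, 20, 21, 38, 20, 19, 25, 24, 36, 22, 18, 35, 23),
  IQ  = c(135, 120, 142, 100, 110, 140, 145, 114, 137,
          106, 100, 147, 143, 138, 117)
)

# Regress viewing time on age and IQ by ordinary least squares
fit <- lm(vtd ~ Age + IQ, data = video)

# Coefficients, standard errors, t-tests, R-squared and the overall F-test
fit_summary <- summary(fit)
fit_summary

# Pull out R-squared and the p-value of the overall F-test
fit_summary$r.squared
f <- fit_summary$fstatistic
pf(f[1], f[2], f[3], lower.tail = FALSE)
```

The coefficient table gives the estimated change in vtd per unit change in Age or IQ with the other variable held constant.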

18 of 41

19 of 41

LOGISTIC REGRESSION

  • Let us look at an illustrative research scenario in which a researcher wants to find out the probability of corruption among public office holders based on gender and family background.

20 of 41

  • Precisely, the researcher wants to determine whether public office holders from wealthy backgrounds have a higher probability of embezzling public funds compared to their counterparts from humble backgrounds.

21 of 41

  • This is a typical social science problem. The table below shows the data to be analysed using R. The logistic regression model will be used as an appropriate statistical tool.

22 of 41

LOGISTIC REGRESSION: DICHOTOMOUS (BINARY) RESPONSE

  • This is a statistical modeling situation in which we have a binary response variable (e.g. Yes or No, True or False, Pass or Fail) and a set of explanatory variables which can be categorical or a combination of categorical and continuous variables.

23 of 41

  • This is also called binary logistic regression. When the response variable has more than two outcomes (e.g. low income, mid income, high income), we have a polytomous logistic regression. In the polytomous case, the response variable can be ordinal (has an inherent order) or nominal (no inherent order).

24 of 41

The Model:

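  • For a binary response with $\pi = P(y = 1)$, the logistic regression model is

$$\log\left(\frac{\pi}{1 - \pi}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$$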

25 of 41

The logistic regression model is often applied to cross-classified data. See the example below:

26 of 41

Table: Leadership Style data (Corrupt and Not Corrupt are the two levels of Leadership Status)

S/N     Gender   Family type   Corrupt   Not Corrupt   Total
1       Male     Rich          9         4             13
2       Male     Poor          13        6             19
3       Female   Rich          9         7             16
4       Female   Poor          5         7             12
TOTAL                                                  60
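The cross-classified counts can be held in R as a small grouped data frame. This is a sketch; the object name `leadership` and the column names are assumptions chosen to mirror the table headings.

```r
# One row per Gender x Family type combination, with the two count columns
leadership <- data.frame(
  Gender     = c("Male", "Male", "Female", "Female"),
  Family     = c("Rich", "Poor", "Rich", "Poor"),
  Corrupt    = c(9, 13, 9, 5),
  NotCorrupt = c(4, 6, 7, 7)
)

# Row totals, and the grand total as a check against the table
leadership$Total <- leadership$Corrupt + leadership$NotCorrupt
leadership
sum(leadership$Total)
```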

27 of 41

Excel raw sheet
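A raw sheet of this kind usually has one row per office holder. Assuming that layout (the actual column names in the Excel file are not shown here, so `Gender`, `Family` and `Status` are hypothetical), the 60 raw rows can be rebuilt in R from the grouped table:

```r
# Grouped counts from the Leadership Style table
grouped <- data.frame(
  Gender     = c("Male", "Male", "Female", "Female"),
  Family     = c("Rich", "Poor", "Rich", "Poor"),
  Corrupt    = c(9, 13, 9, 5),
  NotCorrupt = c(4, 6, 7, 7)
)

# Repeat each Gender/Family combination once per person in each status group
raw <- rbind(
  data.frame(Gender = rep(grouped$Gender, grouped$Corrupt),
             Family = rep(grouped$Family, grouped$Corrupt),
             Status = "Corrupt"),
  data.frame(Gender = rep(grouped$Gender, grouped$NotCorrupt),
             Family = rep(grouped$Family, grouped$NotCorrupt),
             Status = "Not Corrupt")
)

nrow(raw)                                   # one row per office holder
table(raw$Gender, raw$Family, raw$Status)   # recovers the grouped counts
```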

28 of 41

Excel raw sheet

29 of 41

Excel raw sheet

30 of 41

31 of 41

32 of 41

Imported data from Excel

33 of 41

Imported data from Excel

34 of 41

Data Structure

35 of 41

Training and test data
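A common way to create the two sets is a random split of the raw rows. The sketch below assumes a 70/30 split and an arbitrary seed; neither figure comes from the original slides, and the raw data are rebuilt from the grouped table so the block runs on its own.

```r
# Rebuild one row per office holder from the grouped table
grouped <- data.frame(
  Gender     = c("Male", "Male", "Female", "Female"),
  Family     = c("Rich", "Poor", "Rich", "Poor"),
  Corrupt    = c(9, 13, 9, 5),
  NotCorrupt = c(4, 6, 7, 7)
)
raw <- rbind(
  data.frame(Gender = rep(grouped$Gender, grouped$Corrupt),
             Family = rep(grouped$Family, grouped$Corrupt),
             Status = "Corrupt"),
  data.frame(Gender = rep(grouped$Gender, grouped$NotCorrupt),
             Family = rep(grouped$Family, grouped$NotCorrupt),
             Status = "Not Corrupt")
)

set.seed(123)  # arbitrary seed (assumption) so the split is reproducible

# Draw 70% of the row indices at random for training (assumed proportion)
train_rows <- sample(seq_len(nrow(raw)), size = round(0.7 * nrow(raw)))
train <- raw[train_rows, ]
test  <- raw[-train_rows, ]

c(train = nrow(train), test = nrow(test))
```

The model is then fitted on `train` and its predictions are checked against `test`.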

36 of 41

logistic regression model / summary of dataset
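A minimal sketch of a binary logistic regression in R uses `glm()` with the binomial family. For simplicity it is fitted to the full grouped table rather than to a training subset, so its output need not match the figures on the original slides; the object names are assumptions.

```r
# Grouped counts from the Leadership Style table
grouped <- data.frame(
  Gender     = c("Male", "Male", "Female", "Female"),
  Family     = c("Rich", "Poor", "Rich", "Poor"),
  Corrupt    = c(9, 13, 9, 5),
  NotCorrupt = c(4, 6, 7, 7)
)

# Two-column response: successes (Corrupt) and failures (NotCorrupt).
# R treats the alphabetically first level as the reference category,
# so Female and Poor are the baselines here.
fit <- glm(cbind(Corrupt, NotCorrupt) ~ Gender + Family,
           family = binomial, data = grouped)

# Coefficients are on the log-odds (logit) scale
summary(fit)
```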

37 of 41

Summary of data

38 of 41

Summary of data

39 of 41

ODDS RATIO

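  • The odds ratio for an explanatory variable $x_j$ is obtained by exponentiating its estimated coefficient: $\mathrm{OR}_j = e^{\hat{\beta}_j}$.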
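Odds ratios can be obtained in R by exponentiating the fitted logistic regression coefficients. The sketch below uses the full grouped table and R's default reference levels (Female and Poor). Since the direction and size of an odds ratio depend on the reference category and on the rows used for fitting, its output need not match the figure quoted in this presentation.

```r
# Grouped counts from the Leadership Style table
grouped <- data.frame(
  Gender     = c("Male", "Male", "Female", "Female"),
  Family     = c("Rich", "Poor", "Rich", "Poor"),
  Corrupt    = c(9, 13, 9, 5),
  NotCorrupt = c(4, 6, 7, 7)
)

fit <- glm(cbind(Corrupt, NotCorrupt) ~ Gender + Family,
           family = binomial, data = grouped)

# exp(coefficient) = odds ratio relative to the reference level;
# "GenderMale" compares males with females, "FamilyRich" rich with poor
odds_ratios <- exp(coef(fit))
odds_ratios

# Approximate 95% Wald confidence intervals on the odds-ratio scale
exp(confint.default(fit))
```

To express the comparison the other way round (females relative to males), take the reciprocal of the `GenderMale` odds ratio or change the reference level with `relevel()`.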

40 of 41

This means that, going by this illustrative example, the odds of being corrupt while occupying a public position are 2.4 times as high for females as for males.

This may not be true in real life.

41 of 41

THANK YOU