1 of 17

Machine learning psychometrics

Improved cognitive ability validity from supervised training on item level data

Andrew Cutler, Boston University (USA)

Shane McLoughlin, University of Chester (UK)

Curtis Dunkel, Western Illinois University (USA)

Emil O. W. Kirkegaard, Ulster Institute for Social Research (UK/DK)

2 of 17

Predict or Understand?

Psychometricians like to understand constructs
Use strong assumptions in scoring tests

Equality of item validity
Linear
Non-interactive

How much more validity do these tests have if they are scored to maximize predictive power (e.g., for income or education)?
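
The three assumptions can be read off from how a sum score is built. A minimal sketch, using notation introduced here for illustration (x_ij is person i's score on item j, J is the number of items, and the betas are weights estimated against an outcome y):

```latex
s_i = \sum_{j=1}^{J} x_{ij}
\qquad \text{versus} \qquad
\hat{y}_i = \beta_0 + \sum_{j=1}^{J} \beta_j x_{ij}
```

The sum score on the left gives every item the same weight, is linear in the items, and contains no product terms such as x_ij x_ik. The weighted score on the right relaxes only the first assumption; relaxing the other two requires a more flexible model.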

3 of 17

Let’s try! First, the data

Dataset	N	Questions	Age	Education	Income	Sex
ENEM (Brazilian national exam)	551438	185	n	y	y	y
Estonia Raven's	2738	60	y	n	n	y
NLSY97	1109	182	n	y	y	y
Vietnam Experience Study	4376	202	y	y	y	n
American National Election Study	5790	10	y	y	y	y
British Cohort Study (1970)	9433	120	n	n	y	n
Online Vocab Test	9278	45	y	y	n	y

4 of 17

Example: online vocabulary test

45 multiple choice items with response level data
https://openpsychometrics.org/_rawdata/ ← Great resource!
12,173 people took the online vocabulary test; 9,278 were native English speakers
Criterion variables:

Sex
Age
Education
Various others not used here (e.g. country, OCEAN personality)

5 of 17

Binary vs. categorical item coding

Most items in available datasets use multiple choice format
But usually the data are saved only in binary correct/incorrect format
But you should never throw away data!
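
A minimal sketch of the difference between the two codings, assuming four-option items. The option labels, answer key, and function names are made up for illustration and are not taken from the analysis code.

```python
# Toy illustration: the option labels and the answer key are made up.
OPTIONS = ["A", "B", "C", "D"]


def binary_coding(responses, key):
    """One column per item: 1 if the chosen option matches the key, else 0."""
    return [int(r == k) for r, k in zip(responses, key)]


def categorical_coding(responses, options=OPTIONS):
    """One-hot code the chosen option: len(options) columns per item."""
    row = []
    for r in responses:
        row.extend(int(r == opt) for opt in options)
    return row


key = ["B", "D", "A"]
person = ["B", "C", "A"]  # item 2 is wrong, answered with distractor C

print(binary_coding(person, key))   # [1, 0, 1]
print(categorical_coding(person))   # [0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0]
```

The binary row records only that item 2 was wrong. The categorical row also keeps which distractor was chosen, which a supervised model can use.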

6 of 17

7 of 17

Methods

Sum scores: we just sum the item scores

Simplest possible

Item response theory: compute item properties, person scores

Better than sum scores, usually

Ridge regression (L2 penalization): estimate weight of each item

Optimal weights for prediction, assumes no sparsity of effects

Random forest regression: build very complex decision trees

Catches interaction effects

Honorable mentions: deep learning, support vector machines
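
To make the contrast between sum scores and ridge weights concrete, here is a small standard-library sketch on simulated data. It is not the analysis code from this talk: the data, the true weights, the penalty values, and the helper names `solve` and `ridge_fit` are all assumptions for illustration. A real analysis would more likely use an established library and tune the penalty on held-out data.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ridge_fit(X, y, lam):
    """Return (intercept, weights) minimising squared error + lam * sum(w^2)."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    # Normal equations on centred data: (Xc'Xc + lam * I) w = Xc'yc
    gram = [[sum(r[i] * r[j] for r in Xc) + (lam if i == j else 0.0)
             for j in range(p)] for i in range(p)]
    rhs = [sum(r[j] * t for r, t in zip(Xc, yc)) for j in range(p)]
    w = solve(gram, rhs)
    intercept = y_mean - sum(wj * mj for wj, mj in zip(w, x_mean))
    return intercept, w


# Simulated data (made up): five binary items with unequal true validity.
random.seed(0)
true_w = [0.0, 0.5, 1.0, 1.5, 2.0]
X = [[random.randint(0, 1) for _ in true_w] for _ in range(200)]
y = [sum(w * x for w, x in zip(true_w, row)) + random.gauss(0, 1) for row in X]

sum_scores = [sum(row) for row in X]          # equal weight for every item
_, w_small = ridge_fit(X, y, lam=1.0)         # light shrinkage
_, w_large = ridge_fit(X, y, lam=1000.0)      # heavy shrinkage toward zero

print([round(w, 2) for w in w_small])
print([round(w, 2) for w in w_large])
```

The sum score treats all five items alike, while the ridge fit estimates one weight per item. Raising `lam` pulls all weights toward zero without forcing any of them to be exactly zero, which matches the "no sparsity" assumption.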

8 of 17

Meta-analysis approach

Multi-method/coding/dataset/outcome
80/20 train/test splits, repeated 50 times
Quantify sources of validity gain by variation
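
The repeated-split procedure can be sketched as follows. This is an illustration under stated assumptions, not the analysis code: the function names, the simulated data, and the choice of Pearson r as the validity metric are all introduced here.

```python
import random


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def repeated_splits(n, test_frac=0.2, repeats=50, seed=0):
    """Yield (train_idx, test_idx) for `repeats` random train/test splits."""
    rng = random.Random(seed)
    n_test = round(n * test_frac)
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        yield idx[n_test:], idx[:n_test]


def mean_test_validity(score_fn, X, y, **split_kwargs):
    """Average out-of-sample correlation between scores and the outcome.

    score_fn(X_train, y_train) must return a function mapping a row to a score.
    """
    rs = []
    for train, test in repeated_splits(len(X), **split_kwargs):
        scorer = score_fn([X[i] for i in train], [y[i] for i in train])
        rs.append(pearson_r([scorer(X[i]) for i in test],
                            [y[i] for i in test]))
    return sum(rs) / len(rs)


def sum_score_fn(X_train, y_train):
    """Sum scoring needs no fitting, so the training data is ignored."""
    return sum


# Simulated data (made up): ten binary items and a noisy outcome.
rng = random.Random(1)
X = [[rng.randint(0, 1) for _ in range(10)] for _ in range(200)]
y = [sum(row) + rng.gauss(0, 2) for row in X]

validity = mean_test_validity(sum_score_fn, X, y)
print(round(validity, 3))
```

Any scoring method that can be written as a `score_fn` can be passed through the same loop, so every method is compared on identical held-out splits.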

9 of 17

All datasets: boxplot results

10 of 17

All datasets: average results

On average across datasets and outcomes:

IRT: +5.2%
RF: +2.6%
Ridge: +22.2%

11 of 17

Shorter tests

Method	Test	Outcome	r	n_questions
g_irt	ENEM	parent_edu	0.323	185
Regression	ENEM	parent_edu	0.323	14
g_sum	ENEM	g_irt	0.956	185
Regression	ENEM	g_irt	0.958	42

12 of 17

Conclusion: implications

ML can beat traditional scoring for prediction purposes
Ridge regression > random forest: items contribute largely independently (few interactions)
Predictive validity of cognitive data has been underestimated
Cognitive tests for hiring purposes
SAT/ACT/GRE for enrollment
Cognitive tests for predicting dementia

13 of 17

Thank you

Slides available
Data available
Analysis code available

14 of 17

Random Forest

Predict weight from: age, gender, height

Male?

>5’8?

>5’2?

>28 yo?

>40 yo?

Similar Samples
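
A minimal sketch of one such tree. The split variables and thresholds come from the slide, but how they nest is an assumption, and the training rows are made up. Each prediction is the mean weight of the training samples that land in the same leaf (the "similar samples"). A random forest averages the predictions of many such trees, each grown on a bootstrap resample of the data.

```python
from statistics import mean


def leaf_of(person):
    """Route a person to a leaf using the splits named on the slide.

    The nesting of the splits is assumed for illustration.
    """
    if person["male"]:
        if person["height_in"] > 68:      # taller than 5'8"
            return "M-tall-older" if person["age"] > 28 else "M-tall-younger"
        return "M-short"
    if person["height_in"] > 62:          # taller than 5'2"
        return "F-tall-older" if person["age"] > 40 else "F-tall-younger"
    return "F-short"


# Made-up training sample, only to make the sketch runnable.
train = [
    {"male": True, "height_in": 70, "age": 35, "weight_lb": 190},
    {"male": True, "height_in": 72, "age": 45, "weight_lb": 200},
    {"male": True, "height_in": 71, "age": 22, "weight_lb": 170},
    {"male": True, "height_in": 66, "age": 30, "weight_lb": 160},
    {"male": False, "height_in": 64, "age": 50, "weight_lb": 150},
    {"male": False, "height_in": 65, "age": 25, "weight_lb": 130},
    {"male": False, "height_in": 60, "age": 33, "weight_lb": 115},
]


def predict(person, train):
    """Predict weight as the mean weight of training rows in the same leaf."""
    leaf = leaf_of(person)
    similar = [t["weight_lb"] for t in train if leaf_of(t) == leaf]
    return mean(similar)


new_person = {"male": True, "height_in": 69, "age": 40}
print(leaf_of(new_person), predict(new_person, train))  # M-tall-older 195
```

Because the age split is only reached after the sex and height splits, the tree can express interactions (age matters differently in different branches), which is what the random forest is meant to catch.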

15 of 17

Two cultures: understanding and prediction

Traditional psychometrics/psychology is firmly in the statistics/understanding camp
But their work is often to predict
So they should gain from using prediction-optimized methods

16 of 17

Scoring of cognitive data

Scoring of cognitive data seems quite archaic
Based on strong statistical assumptions:

Equality of item validity
Linear
Non-interactive

These assumptions were necessary to compute scores when powerful computers were not available

17 of 17

Cognitive tests are used for prediction in practice

So why not optimize for predictive validity?
Supervised vs. unsupervised learning
Classical methods are all unsupervised learning methods
Jensen: g is the active ingredient in tests
Special case of the assumption underlying principal components regression
But usually only roughly true
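
The link to principal components regression can be written out as follows. The notation is introduced here for illustration: x_i is person i's vector of item scores, v_m is the m-th principal component direction of the items, and k is the number of components kept.

```latex
z_{im} = \mathbf{v}_m^{\top} \mathbf{x}_i ,
\qquad
\hat{y}_i = \gamma_0 + \sum_{m=1}^{k} \gamma_m z_{im}
```

Principal components regression assumes the outcome depends on the items only through the first k components. Setting k = 1 and treating the first component as a stand-in for g gives the "g is the active ingredient" case, where the prediction reduces to gamma_0 + gamma_1 z_i1. In practice g is usually estimated by factor analysis or IRT rather than PCA, so the first component is only an approximation here.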