1 of 17

Machine learning psychometrics

Improved cognitive ability validity from supervised training on item level data

Andrew Cutler, Boston University (USA)

Shane McLoughlin, University of Chester (UK)

Curtis Dunkel, Western Illinois University (USA)

Emil O. W. Kirkegaard, Ulster Institute for Social Research (UK/DK)

2 of 17

Predict or Understand?

  • Psychometricians like to understand constructs
  • Use strong assumptions in scoring tests
    • Equality of item validity
    • Linear
    • Non-interactive
  • How much more validity do tests have if they are scored to maximize predictive power (e.g. for income, education)?

3 of 17

Let’s try! First, the data

Dataset                           N       Questions  Age  Education  Income  Sex
ENEM (Brazilian national exam)    551438  185        n    y          y       y
Estonia Raven's                   2738    60         y    n          n       y
NLSY97                            1109    182        n    y          y       y
Vietnam Experience Study          4376    202        y    y          y       n
American National Election Study  5790    10         y    y          y       y
British Cohort Study (1970)       9433    120        n    n          y       n
Online Vocab Test                 9278    45         y    y          n       y

4 of 17

Example: online vocabulary test

  • 45 multiple choice items with response-level data
  • https://openpsychometrics.org/_rawdata/ ← Great resource!
  • 12,173 people took the online vocabulary test; 9,278 were native English speakers
  • Criterion variables:
    • Sex
    • Age
    • Education
    • Various others not used here (e.g. country, OCEAN personality)

5 of 17

Binary vs. categorical item coding

  • Most items in available datasets use multiple choice format
  • But the data are usually saved only in binary correct/incorrect format
  • You should never throw away data!
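
A minimal sketch of the difference between the two codings, assuming a hypothetical three-item test with four options per item. The item names, answer key, and responses are invented for illustration. Binary coding keeps only right/wrong, while categorical coding keeps which option was chosen, so two people with identical binary vectors can still differ.

```python
from typing import Dict, List

# Hypothetical answer key and option set (illustrative assumptions).
ANSWER_KEY: Dict[str, str] = {"q1": "B", "q2": "D", "q3": "A"}
OPTIONS: List[str] = ["A", "B", "C", "D"]


def binary_coding(response: Dict[str, str]) -> List[int]:
    """One feature per item: 1 if the chosen option is correct, else 0."""
    return [int(response[item] == key) for item, key in ANSWER_KEY.items()]


def categorical_coding(response: Dict[str, str]) -> List[int]:
    """One feature per (item, option) pair: 1 if that option was chosen."""
    return [
        int(response[item] == option)
        for item in ANSWER_KEY
        for option in OPTIONS
    ]


# Two hypothetical test takers who get the same items right and wrong,
# but pick different wrong options.
person = {"q1": "B", "q2": "C", "q3": "D"}
other = {"q1": "B", "q2": "A", "q3": "B"}

print(binary_coding(person), binary_coding(other))
print(categorical_coding(person))
print(categorical_coding(other))
```

The categorical vectors are what a model would need in order to learn that some wrong options are more informative than others.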

6 of 17

7 of 17

Methods

  • Sum scores: sum the correct responses across items
    • Simplest possible
  • Item response theory: compute item properties, person scores
    • Better than sum scores, usually
  • Ridge regression (L2 penalization): estimate weight of each item
    • Optimal weights for prediction, assumes no sparsity of effects
  • Random forest regression: build very complex decision trees
    • Catches interaction effects
  • Honorable mentions: deep learning, support vector machines
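
To make the contrast between the simplest and the best-performing method concrete, here is a rough sketch comparing a sum score with ridge-weighted items. It is not the analysis code from the talk: the data are simulated, the penalty `alpha=1.0` is an arbitrary assumption, and `fit_ridge` is a hypothetical helper using the standard closed-form ridge solution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (assumption): 500 people, 20 binary items, and an outcome
# that depends on the items with unequal weights plus noise.
n_people, n_items = 500, 20
ability = rng.normal(size=n_people)
difficulty = rng.normal(size=n_items)
prob_correct = 1 / (1 + np.exp(-(ability[:, None] - difficulty[None, :])))
X = (rng.random((n_people, n_items)) < prob_correct).astype(float)
true_weights = rng.uniform(0.0, 1.0, size=n_items)
y = X @ true_weights + rng.normal(scale=1.0, size=n_people)


def sum_score(X):
    """Equal weight for every item."""
    return X.sum(axis=1)


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge on centered data; the intercept is not penalized."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w


# Fit item weights on the first 400 people, evaluate on the remaining 100.
train, test = slice(0, 400), slice(400, 500)
w, b = fit_ridge(X[train], y[train], alpha=1.0)
r_sum = np.corrcoef(sum_score(X[test]), y[test])[0, 1]
r_ridge = np.corrcoef(X[test] @ w + b, y[test])[0, 1]
print(f"sum score r = {r_sum:.3f}, ridge r = {r_ridge:.3f}")
```

The sum score is the special case where every item weight is fixed to the same value; ridge lets the weights differ but shrinks them toward zero to limit overfitting.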

8 of 17

Meta-analysis approach

  • Multi-method/coding/dataset/outcome
  • 80/20 train/test splits, repeated 50 times
  • Quantify sources of validity gain by varying method, coding, dataset, and outcome
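
The repeated-split procedure can be sketched roughly as follows. This is an illustration under assumptions, not the talk's code: the data are simulated, only two methods (sum score and ridge) are compared, and the function name `repeated_split_validity` and its parameters are hypothetical.

```python
import numpy as np


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge on centered data; the intercept is not penalized."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w


def repeated_split_validity(X, y, n_repeats=50, test_frac=0.2, alpha=1.0, seed=0):
    """Out-of-sample correlations for each method over random train/test splits."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_test = int(round(test_frac * n))
    results = {"sum": [], "ridge": []}
    for _ in range(n_repeats):
        order = rng.permutation(n)
        test_idx, train_idx = order[:n_test], order[n_test:]
        # Only the ridge weights are learned; the sum score needs no training.
        w, b = fit_ridge(X[train_idx], y[train_idx], alpha)
        sum_pred = X[test_idx].sum(axis=1)
        ridge_pred = X[test_idx] @ w + b
        results["sum"].append(np.corrcoef(sum_pred, y[test_idx])[0, 1])
        results["ridge"].append(np.corrcoef(ridge_pred, y[test_idx])[0, 1])
    return {name: np.array(values) for name, values in results.items()}


# Simulated example data (assumption): 200 people, 15 binary items.
data_rng = np.random.default_rng(42)
X = (data_rng.random((200, 15)) < 0.6).astype(float)
y = X @ data_rng.uniform(0, 1, size=15) + data_rng.normal(size=200)

validity = repeated_split_validity(X, y)
for name, rs in validity.items():
    print(f"{name}: mean r = {rs.mean():.3f}, sd = {rs.std():.3f}")
```

Repeating the split gives a distribution of validities per method rather than a single number, which is what a boxplot across splits would summarize.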

9 of 17

All datasets: boxplot results

10 of 17

All datasets: average results

  • On average across datasets and outcomes:
    • IRT: +5.2%
    • RF: +2.6%
    • Ridge: +22.2%

11 of 17

Shorter tests

Method      Test  Outcome     r      n_questions
g_irt       ENEM  parent_edu  0.323  185
Regression  ENEM  parent_edu  0.323  14
g_sum       ENEM  g_irt       0.956  185
Regression  ENEM  g_irt       0.958  42
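
The slides do not say how the shorter item sets were chosen. One common way to build a short form is greedy forward selection, sketched below on simulated data. Everything here is an assumption for illustration: the helper names `holdout_r` and `forward_select`, the data, and the choice of plain least squares as the scoring model.

```python
import numpy as np


def holdout_r(X_train, y_train, X_val, y_val, items):
    """Fit least squares on the chosen items; return validation correlation."""
    A_train = np.column_stack([np.ones(len(y_train)), X_train[:, items]])
    A_val = np.column_stack([np.ones(len(y_val)), X_val[:, items]])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    pred = A_val @ coef
    if pred.std() == 0:
        # Constant predictions carry no information; rank them last.
        return -1.0
    return float(np.corrcoef(pred, y_val)[0, 1])


def forward_select(X_train, y_train, X_val, y_val, max_items):
    """Greedily add the item that most improves validation correlation."""
    chosen, history = [], []
    remaining = list(range(X_train.shape[1]))
    for _ in range(max_items):
        scores = [
            holdout_r(X_train, y_train, X_val, y_val, chosen + [j])
            for j in remaining
        ]
        best = remaining[int(np.argmax(scores))]
        chosen.append(best)
        remaining.remove(best)
        history.append(max(scores))
    return chosen, history


# Simulated data (assumption): 30 binary items, only three of which
# actually relate to the outcome.
rng = np.random.default_rng(7)
n, k = 1000, 30
X = (rng.random((n, k)) < 0.5).astype(float)
signal = [3, 11, 20]
y = X[:, signal].sum(axis=1) + rng.normal(scale=0.3, size=n)

train, val = slice(0, 600), slice(600, 1000)
chosen, history = forward_select(X[train], y[train], X[val], y[val], max_items=5)
for step, (item, r) in enumerate(zip(chosen, history), start=1):
    print(f"step {step}: added item {item}, validation r = {r:.3f}")
```

Because the validation split is used to pick items, the validity of the final short form should be reported on a third split that played no role in selection.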

12 of 17

Conclusion: implications

  • ML can beat traditional scoring for prediction purposes
  • Regression > Random Forest: items act largely independently (little gain from modeling interactions)
  • Predictive validity of cognitive data has been underestimated
  • Cognitive tests for hiring purposes
  • SAT/ACT/GRE for enrollment
  • Cognitive tests for predicting dementia

13 of 17

Thank you

  • Slides available
  • Data available
  • Analysis code available

14 of 17

Random Forest

Predict weight from: age, gender, height

[Figure: example decision tree. Split nodes: Male?, >5’8?, >5’2?, >28 yo?, >40 yo?. Each of the six leaves is labeled "Similar Samples".]
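
The weight example can be sketched as a single hand-built tree. The arrangement of the splits below is an assumption (sex first, then height, then age), and the training data are invented for illustration. Each person is routed to a leaf, and the prediction is the mean weight of the training samples that landed in the same leaf.

```python
from statistics import mean
from typing import Dict, List, Tuple

Person = Dict[str, float]  # keys: "male" (0 or 1), "height_in", "age"


def leaf_id(p: Person) -> str:
    """Route a person down the tree and return the name of the leaf reached."""
    if p["male"]:
        if p["height_in"] > 68:  # taller than 5'8"
            return "M_tall_over28" if p["age"] > 28 else "M_tall_under28"
        return "M_short"
    if p["height_in"] > 62:  # taller than 5'2"
        return "F_tall_over40" if p["age"] > 40 else "F_tall_under40"
    return "F_short"


def fit_leaf_means(train: List[Tuple[Person, float]]) -> Dict[str, float]:
    """Average the outcome of the training samples that share a leaf."""
    groups: Dict[str, List[float]] = {}
    for person, weight in train:
        groups.setdefault(leaf_id(person), []).append(weight)
    return {leaf: mean(weights) for leaf, weights in groups.items()}


def predict(p: Person, leaf_means: Dict[str, float]) -> float:
    """Predict with the mean of the similar samples in the person's leaf."""
    return leaf_means[leaf_id(p)]


# Hypothetical training data: (person, weight in pounds).
train = [
    ({"male": 1, "height_in": 70, "age": 35}, 180.0),
    ({"male": 1, "height_in": 72, "age": 50}, 200.0),
    ({"male": 1, "height_in": 71, "age": 22}, 170.0),
    ({"male": 0, "height_in": 64, "age": 30}, 140.0),
    ({"male": 0, "height_in": 60, "age": 45}, 120.0),
]
leaf_means = fit_leaf_means(train)
new_person = {"male": 1, "height_in": 69, "age": 40}
print(leaf_id(new_person), predict(new_person, leaf_means))
```

A random forest grows many such trees on bootstrap resamples of the data, learning the splits instead of fixing them by hand, and averages their predictions.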

15 of 17

Two cultures: understanding and prediction

  • Traditional psychometrics/psychology is firmly in the statistics/understanding camp
  • But their work is often to predict
  • So they should gain from using prediction-optimized methods

16 of 17

Scoring of cognitive data

  • Scoring of cognitive data seems quite archaic
  • Based on strong statistical assumptions:
    • Equality of item validity
    • Linear
    • Non-interactive
  • These assumptions were necessary to compute scores without powerful computers

17 of 17

Cognitive tests are used for prediction in practice

  • So why not optimize for predictive validity?
  • Supervised vs. unsupervised learning
  • Classical methods are all unsupervised learning methods
  • Jensen: g is the active ingredient in tests
  • Special case of the assumption underlying principal components regression
  • But usually only roughly true
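
Using only a general factor score as the predictor amounts to regressing the outcome on a single component of the item data. The sketch below illustrates this with simulated data. All values are assumptions, and the first principal component stands in for g, which is a simplification. The outcome is built to depend mostly on g plus a small effect specific to one item, the "only roughly true" case.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated items (assumption): 10 items that each load on a general factor g.
n, k = 2000, 10
g = rng.normal(size=n)
loadings = rng.uniform(0.4, 0.8, size=k)
items = g[:, None] * loadings + rng.normal(size=(n, k)) * np.sqrt(1 - loadings**2)

# Outcome: mostly g, plus a smaller effect of the part of item 0 unrelated to g.
specific0 = items[:, 0] - loadings[0] * g
y = g + 0.5 * specific0 + rng.normal(size=n)


def first_pc_scores(X):
    """Unsupervised: scores on the first principal component of the items."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[0]


def r_squared(features, y):
    """In-sample R^2 of a least-squares fit with intercept."""
    A = np.column_stack([np.ones(len(y)), features])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1 - resid.var() / y.var()


pc1 = first_pc_scores(items)
r2_pc1 = r_squared(pc1, y)    # principal components regression, one component
r2_all = r_squared(items, y)  # supervised regression on every item
print(f"R^2 using first component only: {r2_pc1:.3f}")
print(f"R^2 using all items:            {r2_all:.3f}")
```

In-sample, the all-items fit can never do worse than the one-component fit, because the component is itself a weighted sum of the items. Whether the extra flexibility pays off out of sample is the empirical question, which is why held-out evaluation matters.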