1 of 12

Predicting Human Deliberative Judgments (2018)

Owain Evans (FHI), Andreas Stuhlmüller, Chris Cundy, Ryan Carey, Zachary Kenton, Thomas McGrath, Andrew Schreiber (Ought)

How to spend a million dollars cleverly and get no conference paper out of it

2 of 12

Moravec’s paradox

“Hard things are easy for computers, easy things are hard”

(Theorem proving, matrix multiplication, search) vs. (object recognition, walking, few-shot learning)

  • DL → lots of progress on some easy-for-humans things.
  • Little progress on a different class of hard-for-humans tasks, “deliberation”: medical advice, hiring, criminal trial verdicts, conference reviews, science.

3 of 12

Labelling, fast and slow

  • YouTube recommendations are based on clicks and likes: labels take 1 second / 1 minute.
  • Not great for choosing between Silver / Levine / Andreadis / Brunskill lectures. 1 label = 2 hours of expertise = $100?
  • Papers: predict review score from performance stats? Predict impact from score? Test of time: a decade!
  • Slow due to serial reasoning, meta-reasoning, running experiments, discussing with experts, etc.

4 of 12

Recall AlphaGo: double imitation

  • imitation of human experts by slow MCTS
  • imitation of MCTS by fast neural net

5 of 12

Desiderata for a deliberative dataset

  • A generalized version of each task is AI-complete.
  • Both fast and slow labels (so not IMDB ratings, not reviewer scores).
  • Fast judgments are informative about slow judgments.
  • Progress comes through analytical thinking or gathering evidence.
  • Ground truth is available.

Two tasks: “Fermi” (weird composite estimates) and “Politifact” (online research for verifying news stories).

6 of 12

Open dataset of “slow judgments”

n=25,000 probability judgments on different timescales.
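
As a rough sketch of what the released data looks like to a modeler (the file name and column names below are hypothetical assumptions, not the dataset’s actual schema):

```python
import pandas as pd

# Hypothetical schema: one row per probability judgment.
# Assumed columns (not the dataset's real names):
#   user_id, question_id, time_category, probability
df = pd.read_csv("slow_judgments.csv")  # hypothetical file name

# Split fast vs. slow judgments by time category; here we assume
# category 2 marks the slowest (most deliberative) judgments.
fast = df[df["time_category"] < 2]
slow = df[df["time_category"] == 2]

print(len(df), "judgments from", df["user_id"].nunique(), "users")
```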

7 of 12

What are they trying to do?

  • Train on fast judgments, predict slow judgments.
    • Use quick judgments as noisy labels (or as a regularizer).
    • Cheap signals are correlated with slow judgments.
  • Optimizing for fast judgments might initially do well on slow judgments (but will ultimately diverge: mis-alignment).
  • “While we cannot hope for highly accurate predictions of slow judgments, we can seek [calibrated] ML algorithms.”
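
One way to cash out the “noisy labels / regularizer” idea as a loss (a minimal sketch under our own assumptions; the weighting scheme and the lam hyperparameter are illustrative, not the paper’s): fit the scarce slow labels directly, add a down-weighted term on the plentiful fast labels, and evaluate with a calibration-sensitive metric such as the Brier score.

```python
import torch
import torch.nn.functional as F

def combined_loss(pred_slow, slow_targets, pred_fast, fast_targets, lam=0.3):
    """Slow labels drive training; fast labels act as a cheap,
    down-weighted auxiliary signal (lam is an assumed hyperparameter)."""
    loss_slow = F.binary_cross_entropy(pred_slow, slow_targets)
    loss_fast = F.binary_cross_entropy(pred_fast, fast_targets)
    return loss_slow + lam * loss_fast

def brier_score(pred, outcomes):
    """Mean squared distance between predicted probabilities and the 0/1
    ground truth; lower is better, and it rewards calibration."""
    return torch.mean((pred - outcomes) ** 2)
```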

8 of 12

Enough talk

p_u(t): user u’s probability judgment given time t

p_u(2): slow judgment, time category = 2

  • Collaborative filtering (KNN and SVD)
  • Neural collaborative filtering: a NN maps the latent question and user embeddings to judgments
  • Hierarchical Bayesian linear regression
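
For concreteness, a toy version of the neural collaborative filtering model (our own sketch, not the authors’ architecture; the embedding size and MLP width are arbitrary): learn one embedding per user and one per question, concatenate them, and map through a small MLP to a probability judgment.

```python
import torch
import torch.nn as nn

class NeuralCF(nn.Module):
    """Toy neural collaborative filtering:
    (user embedding, question embedding) -> MLP -> probability."""
    def __init__(self, n_users, n_questions, dim=16):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, dim)
        self.question_emb = nn.Embedding(n_questions, dim)
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
            nn.Sigmoid(),  # outputs a probability judgment in [0, 1]
        )

    def forward(self, user_ids, question_ids):
        x = torch.cat([self.user_emb(user_ids),
                       self.question_emb(question_ids)], dim=-1)
        return self.mlp(x).squeeze(-1)

# Usage: predict user 3's slow judgment on question 7.
model = NeuralCF(n_users=100, n_questions=500)
p = model(torch.tensor([3]), torch.tensor([7]))
```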

9 of 12

10 of 12

Limitations

  • Task: True/False classification, not RL.
  • “Unfortunately our dataset… is unlikely to be a good testing ground for predicting slow judgments.”
  • Slow judgments beat fast judgments, but the (slow - fast) gap was smaller than expected.
  • No collaborative-filtering win: variability among subjects was hard to distinguish from noise, so the algorithms could not exploit similarities among users.

11 of 12

Limitation: can’t distinguish really good labellers

12 of 12

Ultra detailed data?

Record the individual steps people take while deliberating:

  • requiring users to make their reasoning explicit
  • recording their use of web search

Or the usual (doomed?) fallback of adding structure to the NN.