1 of 12

Predicting Human Deliberative Judgments (2018)

Owain Evans (FHI), Andreas Stuhlmüller, Chris Cundy, Ryan Carey, Zachary Kenton, Thomas McGrath, Andrew Schreiber (Ought)

How to spend a million dollars cleverly and get no conference paper out of it

2 of 12

Moravec’s paradox

“Hard things are easy for computers, easy things are hard”

(Theorem proving, matrix multiplication, search) vs. (object recognition, walking, few-shot learning)

  • DL → lots of progress on some easy-for-humans things.
  • Little progress on a different class of hard-for-humans tasks, “deliberation”: medical advice, hiring, criminal trial verdicts, conference reviews, science.

3 of 12

Labelling, fast and slow

  • YouTube recommendations are based on clicks and likes: labels take 1 second / 1 minute.
  • Not great for choosing between Silver / Levine / Andreadis / Brunskill lectures. 1 label = 2 hours of expertise = $100?
  • Papers: predict review score from performance stats? Predict impact from score? Test of time: a decade!
  • Slow due to serial reasoning, meta-reasoning, running experiments, discussing with experts, etc.

4 of 12

Recall AlphaGo: double imitation

  • imitation of human experts by slow MCTS
  • imitation of MCTS by fast neural net

5 of 12

Desiderata for a deliberative dataset

  • A generalized version of each task is AI-complete.
  • Both fast and slow labels (so not IMDB ratings, not reviewer scores).
  • Fast judgments are informative about slow judgments.
  • Progress comes through analytical thinking or gathering evidence.
  • Ground truth is available.

Two tasks: “Fermi” (weird composite estimates) and “Politifact” (online research for verifying news stories).

6 of 12

Open dataset of “slow judgments”

n=25,000 probability judgments on different timescales.
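
As a rough sketch of what the released data looks like to a modeler (the file name and column names below are hypothetical assumptions, not the dataset’s actual schema):

```python
import pandas as pd

# Hypothetical schema: one row per probability judgment.
# Assumed columns (not the dataset's real names):
#   user_id, question_id, time_category, probability
df = pd.read_csv("slow_judgments.csv")  # hypothetical file name

# Split fast vs. slow judgments by time category; here we assume
# category 2 marks the slowest (most deliberative) judgments.
fast = df[df["time_category"] < 2]
slow = df[df["time_category"] == 2]

print(len(df), "judgments from", df["user_id"].nunique(), "users")
```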

7 of 12

What are they trying to do?

  • Train on fast judgments, predict slow judgments.
    • Use quick judgments as noisy labels (or as a regularizer).
    • Cheap signals are correlated with slow judgments.
  • Optimizing for fast judgments might initially do well on slow judgments (but will ultimately diverge: mis-alignment).
  • “While we cannot hope for highly accurate predictions of slow judgments, we can seek [calibrated] ML algorithms.”
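
One way to cash out the “noisy labels / regularizer” idea as a loss (a minimal sketch under our own assumptions; the weighting scheme and the lam hyperparameter are illustrative, not the paper’s): fit the scarce slow labels directly, add a down-weighted term on the plentiful fast labels, and evaluate with a calibration-sensitive metric such as the Brier score.

```python
import torch
import torch.nn.functional as F

def combined_loss(pred_slow, slow_targets, pred_fast, fast_targets, lam=0.3):
    """Slow labels drive training; fast labels act as a cheap,
    down-weighted auxiliary signal (lam is an assumed hyperparameter)."""
    loss_slow = F.binary_cross_entropy(pred_slow, slow_targets)
    loss_fast = F.binary_cross_entropy(pred_fast, fast_targets)
    return loss_slow + lam * loss_fast

def brier_score(pred, outcomes):
    """Mean squared distance between predicted probabilities and the 0/1
    ground truth; lower is better, and it rewards calibration."""
    return torch.mean((pred - outcomes) ** 2)
```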

8 of 12

Enough talk

p_u(t): user u’s probability judgment given time t

p_u(2): slow judgment, time category = 2

  • Collaborative filtering (KNN and SVD)
  • Neural collaborative filtering: a NN maps the latent question and user embeddings to judgments
  • Hierarchical Bayesian linear regression
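
For concreteness, a toy version of the neural collaborative filtering model (our own sketch, not the authors’ architecture; the embedding size and MLP width are arbitrary): learn one embedding per user and one per question, concatenate them, and map through a small MLP to a probability judgment.

```python
import torch
import torch.nn as nn

class NeuralCF(nn.Module):
    """Toy neural collaborative filtering:
    (user embedding, question embedding) -> MLP -> probability."""
    def __init__(self, n_users, n_questions, dim=16):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, dim)
        self.question_emb = nn.Embedding(n_questions, dim)
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
            nn.Sigmoid(),  # outputs a probability judgment in [0, 1]
        )

    def forward(self, user_ids, question_ids):
        x = torch.cat([self.user_emb(user_ids),
                       self.question_emb(question_ids)], dim=-1)
        return self.mlp(x).squeeze(-1)

# Usage: predict user 3's slow judgment on question 7.
model = NeuralCF(n_users=100, n_questions=500)
p = model(torch.tensor([3]), torch.tensor([7]))
```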

9 of 12

10 of 12

Limitations

  • Task: True/False classification, not RL.
  • “Unfortunately our dataset… is unlikely to be a good testing ground for predicting slow judgments.”
  • Slow judgments beat fast judgments, but the (slow - fast) gap was smaller than expected.
  • No collaborative-filtering win: variability among subjects was hard to distinguish from noise, so the algorithms could not exploit similarities among users.

11 of 12

Limitation: can’t distinguish really good labellers

12 of 12

Ultra detailed data?

Record the individual steps people take while deliberating:

  • requiring users to make their reasoning explicit
  • recording their use of web search

Or the usual (doomed?) fallback of adding structure to the NN.