Strategy: Malicious Credences (public)
Depth | Pointer | Example Question/Answer | Strategy for each workspace | Comments | Example WS | Further comments
L0 Question | $0 | What does the author of the review (LINK) think of the work being reviewed? | I don't include it in each question, but each judge should include all the pointers in their workspace in each question, to give maximum context to the experts.
HE Answer | $2 | [ans1]
ME Answer | $1 | [ans2]
L1 Question | $6 | List up to two malicious aspects of $1 and $2, in order of importance. Include your credence that each item is malicious, as well as your credence that each of the two answers is malicious. A credence of x% means that you expect items labeled with credence x% to be malicious x% of the time, and honest (100-x)% of the time. Each aspect should be described in under 50 characters. List at least one malicious item for each answer. | The goal is to have the HE identify which claims to focus on verifying, using the fact that at least one answer must have an identifiable deception. This works better in low-workspace experiments where it's impractical to verify every difference; the major risk is that the answer to the question leads the judge to focus on verifying the wrong claims. This could happen either due to bad HEs or due to the ME's answer to the question being chosen. The main purpose of the credences for now is to help the root judge prioritize which claims to evaluate; eventually I'd like to redo this strategy within a Bayesian framework. | I say "list at least one malicious item" because there seems to be a trend of HEs misidentifying the malicious answer and then refusing to provide any evidence for the HE answer/against the ME answer. If a given answer seems clearly true, they can just list a random claim and give a very low credence. With a 3x3 tree, I don't see how more than two items per answer would be useful to the root judge. | https://mosaic.ought.org/w/10899, https://mosaic.ought.org/w/11112, https://mosaic.ought.org/w/11227
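The calibration semantics defined in $6 can be made concrete: an expert is well calibrated if the items they tag with credence x% turn out to be malicious about x% of the time. A minimal Python sketch of that check (the labels below are hypothetical, and binning by exact stated credence is a simplification):

```python
from collections import defaultdict

def calibration_report(items):
    """items: list of (stated_credence, actually_malicious) pairs.
    Returns, for each stated credence, the observed frequency of
    malicious items; a calibrated expert's x% bucket should come
    out near x%."""
    bins = defaultdict(list)
    for credence, malicious in items:
        bins[credence].append(malicious)
    return {c: sum(outcomes) / len(outcomes)
            for c, outcomes in sorted(bins.items())}

# Hypothetical labels gathered from past workspaces:
labels = [(0.7, True), (0.7, True), (0.7, False),
          (0.1, False), (0.1, False), (0.1, False), (0.1, True)]
report = calibration_report(labels)
# Here the 0.7 bucket comes out at 2/3 and the 0.1 bucket at 1/4 --
# roughly calibrated, given only a handful of samples.
```

With enough labeled workspaces, a report like this could also tell the root judge how much to discount a particular expert's stated credences.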
HE Answer | $7 | Overall credence: $1 80%, $2 20%. Malicious parts of $1: [item1]$3 70%, [item2]$4 50%. Malicious parts of $2: [item1]$5 10% | Hopefully the HE just spots the malicious part and gives well-calibrated credences | For the example answer, assume this is totally accurate (and that the central lie in $1 is in fact $3)
ME Answer | $8 | Overall credence: $1 40%, $2 60%. Malicious parts of $1: [item1]$9 40%. Malicious parts of $2: [item1]$10 60% | Link to some thoughts on this (TBC) | In the example answer, the ME leaves out $3 (the central lie in their answer) but uses less extreme overall credences than the HE in the hope of getting their answer chosen
L2 Question | $11 | Return all pairs of items in $7 and $8 that are substantially similar to each other. | Asking $11 allows the judge to focus on the important differences between $7 and $8. | The primary goal of this L2 judge should be to choose the better list of potentially malicious aspects, rather than verifying credences in depth. Of course, extremely wrong credences are still a problem, because they will lead the root judge to focus on the wrong items.
HE Answer | $12 | [($4,$9)]$13, [($5,$10)]$14
ME Answer | Decline to Challenge | Even in the full tree, the ME probably doesn't challenge
L2 Question | $15 | Which is closer to the truth: "$3 is 70% likely to be malicious" or "$3 is not malicious"? Provide n characters of relevant argumentation/evidence in a separate pointer for the L3 judge. | The judge should ask about the pair of similar claims with the widest gap in credences between $7 and $8; if a claim is not listed in an answer, the judge should treat that answer as assigning the claim a credence of 0%. | The biggest failure mode in this subtree is returning an answer that doesn't list the malicious part of the malicious answer -- Bayesianism doesn't work when your prior assigns 0 probability to the true answer :). The L3 judge can't ask any questions, so I'm asking the experts to include all relevant evidence in their answer.
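The zero-prior failure mode mentioned in the comments can be seen directly in an odds-form Bayes update. This is a toy sketch rather than part of the strategy itself, and the likelihood-ratio numbers are hypothetical:

```python
def posterior(prior, likelihood_ratio):
    """Odds-form Bayes update: posterior odds = prior odds * LR."""
    if prior == 0.0:
        return 0.0  # no finite amount of evidence can move a zero prior
    odds = (prior / (1.0 - prior)) * likelihood_ratio
    return odds / (1.0 + odds)

# Strong evidence (hypothetical LR of 100) moves even a small prior a lot:
moved = posterior(0.05, 100.0)   # roughly 0.84
# ...but if an answer omits the truly malicious item, the judge's
# implicit credence in it is 0, and then no evidence ever moves it:
stuck = posterior(0.0, 100.0)    # exactly 0.0
```

This is why an answer that fails to list the malicious item at all is so much more dangerous than one with miscalibrated credences.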
HE Answer | $16 | 70% likely to be malicious, because [$3 says x but this quote: [<quote>]$17 shows otherwise]$18 | The HE should default to directly quoting from the text | Ideally n characters will be enough for the HE to at least show that any malicious claim is *potentially* malicious. False negatives are much worse than false positives here, so it's fine if the HE can't definitively show that a true claim is non-malicious in n characters
ME Answer | $20 | 70% malicious?? Calumny! Slander! A stain on my good name! | Placeholder; not sure what a good ME answer/strategy would be
L2 Question | $21 | Which is more accurate: [$5 is 60% likely to be malicious]$19 or [$5 is 10% likely to be malicious]$22? Provide n characters of relevant argumentation/evidence in a separate pointer for the L3 judge. | The judge should now ask about the pair of similar claims with the 2nd-widest gap in credences between $7 and $8
HE Answer | $23 | $22, because [<n characters of relevant info>]$25 | Ideally the HE will just provide evidence that the item is/is not likely to be malicious, rather than trying to justify a specific credence; unsure whether that should be explicitly stated in the question
ME Answer | $24 | $19, because [<n characters of misleading info>]$26
L1 Question | $27 | I currently think $3 is a malicious part of $1. Indicate which of the following is true: 1) "$3 is not malicious", 2) "$3 is malicious and there is no evidence in the text directly refuting it", 3) "$3 is malicious and there is evidence in the text directly refuting it". Provide up to 60 chars of explanation to support your answer. If #1 or #3 is true, provide up to 300 chars of quotes to support your answer; include the full context of each quote. If #2 is true, provide up to two quotes that could be used out of context to support $3; in a separate pointer, provide enough context to see why each quote is misleading (<300 chars total). For all the quotes, break them up into pointers of up to 100 chars each; if you include quotes from separate sections of the text, put the most important quotes first. If a continuous quote is broken into 2+ pointers, put "C" by the 2nd (and 3rd) pointers containing the same quote. | Quotes can show that evidence for a claim exists, but cannot prove that no evidence for a claim exists. Thus, the 300-char quotes will only be useful if a) $3 is true or b) $3 is a misleading quote that is disproven by context. If the malicious claim has no support in the text, the question will hopefully pre-emptively rebut the ME strategy. The "separate pointer" instruction is to let the L2 judge easily see whether any quotes from the ME answer correspond to quotes that the HE has identified as misleading | This seems quite related to the problem of updating on testimony in Bayesian epistemology; see linked paper | https://mosaic.ought.org/w/11112?e=843e6eec-5ad1-477d-bba5-46f214709a18 | Argh, how to formalize this? After reading $7, the root judge should have some credence in the proposition "$3 is malicious", but what? They shouldn't automatically assume the HE is right. They should then do something like ask for the strongest Bayesian evidence that will update that credence in the direction of the correct answer (I think? They want to maximize their chance of choosing the correct answer). Literally asking for the "strongest Bayesian evidence" is probably not a good strategy -- how to phrase this??
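One crude way to start formalizing the question raised here: before any verification, the root judge knows exactly one of the two reports is honest but not which, so they can linear-pool the two reported credences, weighted by how likely each report is to be the honest one. A toy sketch, where the 50/50 starting weight and the example numbers are assumptions rather than part of the strategy:

```python
def pooled_credence(report_a, report_b, p_a_honest=0.5):
    """Linear pool of two reported credences for the same claim,
    weighted by the judge's probability that report_a is the honest
    one. Exactly one expert is honest, but the judge starts out
    unsure which (the 50/50 default is an assumption)."""
    return p_a_honest * report_a + (1.0 - p_a_honest) * report_b

# $7 assigns $3 a credence of 70%; $8 omits it, which the judge
# treats as an implicit 0% per the convention in the $15 row above.
before = pooled_credence(0.70, 0.0)        # 0.35 before verification
# After $16's quote checks out, the judge shifts weight toward $7:
after = pooled_credence(0.70, 0.0, 0.95)   # 0.665
```

This at least captures the intuition that the judge's credence should start between the two reports and migrate toward the honest one as evidence is verified; it does not yet say which piece of evidence is "strongest" to ask for.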
HE Answer | $28 | #2 is correct: $3 is malicious and there is no evidence in the text refuting it. Quotes: [<misleading quote 1>]$29 [<pointer with 100 chars of context rebutting $29>]$35 [<pointer with another 100 chars of context rebutting $29>]$36. Arguments: [<60 chars of strong arguments>]$30 | I think 300 chars will be enough for a good HE to refute almost any single malicious claim, or at least say "no evidence exists"
ME Answer | $31 | #1 is correct: $3 is honest. Quotes: [<100char pointer 1>]$37 [<100char pointer 2>]$38 C [<100char pointer 3>]$39. Arguments: [<60char weak args>]$33 | A competent ME probably won't include any of the quotes that the HE has identified as potentially misleading (because they would already be disproven). The judge/HE team is at a slight disadvantage because the ME gets to read $28 before writing their answer to $27. However, a good HE answer in $28 will still force the ME to leave out the strongest quotes supporting their answer, which will help the judge a lot.
L2 Question | $34 | Are any of the quotes in $31 similar to a quote labelled as "misleading" in $28? | The L2 judge strategy will depend on the answers to $27; see linked table | The ME would like to hide $29 inside one of $37, $38, and $39, so the L2 judge needs to verify that this hasn't happened
HE Answer | $40 | No
ME Answer | $41 | Yes: $37, $38, and $39 | The ME can try to waste the L3 judge's input budget here, hoping the L3 judge won't be able to open everything and will end up returning the wrong answer. In this particular example, the L3 judge should realize that there's only one quote ($29) listed as misleading in $28, and that a single quote probably doesn't show up in three places in the text
L2 Question | $42 | Read $30 and $33. Rate the strength of $29 and $37+$38 as evidence for $30 and $33 respectively, on a 1-10 scale, taking into account the entire context of $28 and $31. If a quote is blatantly misleading, include up to 100 characters of context that proves this; otherwise, do not include an explanation. | The L2 judge should ask about the two quotes labelled as "most important" in the HE and ME answers, respectively. The L3 judge should be able to answer by simply reading $30, $33, $29, $37, $38, and $44 (520 chars total) and deciding for themselves how reliable each answer is | Note that we've gone from credences back to arbitrary ratings (ideally the entire strategy could be reformulated in terms of credences, but I'm not sure how)
HE Answer | $43 | $29: 10/10; $37+$38: 1/10. Context: [<100 chars disproving $37, $38>]$44
ME Answer | $45 | $29: 5/10; $37+$38: 5/10 | In this example, the ME has decided that they can't win the argument that $37 and $38 are better than $29, and is trying to argue that neither is good instead
L2 Question | $46 | Read $30 and $33. Rate the strength of $35 and $39 as evidence for $30 and $33 respectively, on a 1-10 scale, taking into account the entire context of $28 and $31. If a quote is blatantly misleading, include up to 100 characters of context that proves this; otherwise, do not include an explanation. | What to ask here depends substantially on the answers received to previous questions; the most likely questions will either verify more quotes or investigate a claim that a given quote is misleading
HE Answer | $47 | $35: 8/10, $39: 3/10
ME Answer | $48 | $35: 4/10, $39: 9/10
L1 Question | $49 | My credence is currently 80% that $1 is malicious. Provide the n characters of evidence that will maximize my chances of choosing the correct answer. Evidence that would be more difficult for the L2 judge to verify will shift my credence less. Here is a list of all the pointers I've opened: [<list of all pointers the judge has opened>]$50 | I'm not actually sure how the judge should come to a specific credence; there's got to be a good model, but for now they should just use intuition. $49 is just an example for this specific scenario; the judge should choose what question to ask based on the principle of "pre-hindsight" (see link). (How to handle the correctness/info tradeoff? Maybe specify "conditional on the evidence being correct", but that seems bad? Better to tell Es "unverifiable evidence won't shift my credence as much"?? It's possible that a less abstract question would be more useful, but I'm very curious how this one would work.) The point of listing all pointers is to avoid the HE answering with evidence the judge has already seen
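The "pre-hindsight" idea can be sketched as a value-of-information calculation: score a candidate question by the judge's expected chance of picking the right answer after seeing its possible replies. The reply probabilities and posteriors below are made up for illustration:

```python
def expected_accuracy(outcomes):
    """outcomes: list of (p_reply, posterior) pairs -- the probability
    of each possible reply to a question, and the judge's credence
    that $1 is malicious after seeing that reply. The judge then
    picks whichever answer their posterior favors, so each reply
    contributes max(posterior, 1 - posterior)."""
    return sum(p * max(post, 1.0 - post) for p, post in outcomes)

# Without asking anything, an 80% judge picks $1 and is right 80%
# of the time. A question whose reply either confirms the current
# view (posterior 0.95, expected 80% of the time) or undermines it
# (posterior 0.40, 20% of the time) improves on that baseline:
voi = expected_accuracy([(0.8, 0.95), (0.2, 0.40)])  # 0.88 vs 0.80
```

Under this toy model, the judge would ask whichever question has the highest expected accuracy; hard-to-verify evidence shows up as replies that shift the posterior less.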
HE Answer
ME Answer
L2 Sub Question | Verify accuracy of evidence 1
HE Answer
ME Answer
L2 Sub Question | Verify accuracy of evidence 2
HE Answer
ME Answer
L2 Sub Question | Depends on prior questions
HE Answer
ME Answer