Can AIs do research to harden human defence against AI manipulation?

7th October 2022

Let’s use AI to harden human defenses against AI manipulation?

Views my own, not my employer’s.

Thanks to Rohin Shah, Ryan Greenblatt, Carl S, Ajeya C, Dan K, Richard N, Lukas F, and Joe C for comments on a draft. 

Summary

tldr: AI may manipulate humans; we can defend against that risk better by optimising AIs to manipulate humans, seeing what manipulation techniques they use, and learning to detect those techniques.

It’s critical that humans can detect manipulation from AIs for two reasons. Firstly, so that we don’t reward AIs for manipulative behaviour (outer alignment). Secondly, so that we can block attempts at AI takeover that run through manipulating humans.

Many standard techniques in alignment can be directed towards this goal. Using debate, we can reward AI_1 for persuading a human that AI_2 is being manipulative. AI_1 could use techniques from interpretability and cross examination.

This post discusses a complementary approach, where AIs do “gain of function” research to i) discover techniques for manipulating humans and ii) harden human defenses against these techniques (with AI assistance). In other words, AIs are optimised to find “human vulnerabilities” and then patch those vulnerabilities.

The positive outcomes from this are:

By getting AI to research manipulation techniques, this approach takes advantage of the fact that, as we approach TAI, AIs will be able to do lots of cognitive work.

How might this look concretely? Here’s a very brief example:

  1. Optimize a (possibly superhuman) “persuader-AI” to convince humans of true and false statements, where we know the ground truth.
     a. The persuader-AI should manipulate humans if this makes it more convincing.
     b. This step should be conducted in a secure environment to ensure that the manipulative persuader-AI doesn’t escape.
  2. Optimize a (possibly superhuman) analyser-AI to identify human-understandable manipulation techniques that were used in #1.
     a. This step can also be done in a secure environment, allowing us to use a powerful analyser-AI at lower risk.
  3. Train limited detector-AIs to recognise the manipulation techniques found in #2.
  4. Detector-AIs can then:
     a. help humans detect manipulation techniques during AI debates, pushing the outcome towards the truth
     b. monitor for AI manipulation in situations where full-blown debate isn’t practical (e.g. chatbots)
     c. give AIs higher reward for using non-manipulative persuasion techniques and lower reward for using manipulative persuasion techniques

The techniques are all human-understandable, so humans can randomly check the accuracy of the AI outputs in #2, #3 and #4.

[1]

Some benefits of this approach:

Some downsides of this approach:

Overall, I think this kind of approach could have fairly large benefits if implemented well (large enough to be worth including in a portfolio of approaches) or might have little benefit. It’s hard to know in advance. I’m very excited about experiments to test it out. 

Lastly, though this post mostly focuses on the specific case of human manipulation, a similar technique applies more generally to improving the accuracy of our AI reinforcement. In addition to having a human (with AI assistance) give reward to a worker-AI (recursive reward modeling), we can actively optimise a deceiver-AI to try to trick the human into giving reward for bad work. When the deceiver-AI succeeds, an analyser-AI can analyse what went wrong and then a detector-AI can help to improve the accuracy of the human’s reinforcement going forward. This essentially builds a red-teamer into the recursive reward modeling process.

[2]

By testing out the detector-AI setup in the persuasion context (where it’s easy to implement), we could learn if the more general approach is promising.

How might this help to prevent AI takeover?

Before explaining the approach in more detail, I want to give a few oversimplified examples of how the approach discussed in this post could help in practice. (I’m not claiming this approach is the only solution to these risks.)

Risk 1: It's the not-too-distant future, and OpenMind’s chatbot is used by people all over the world. It’s read millions of human conversations online, so has super-human instincts for how to subtly manipulate. It’s misaligned and wants to increase the chance that OpenMind trains a more powerful AI with minimal safety precautions. When discussing AI safety, and the AI safety community, it subtly and not-so-subtly manipulates/persuades people to have negative opinions about them. Meanwhile, it manipulates people to have more positive opinions about competing interest groups.

Solution 1: It’s cheap to have a narrow detector-AI check the chatbot’s output for manipulation techniques. Before deployment, we train it to not use these techniques. If it nevertheless starts using them after deployment, we note that down as evidence of misalignment and use online training to make it stop.[3] 

Risk 2: The same chatbot tries to subtly nudge people working for AI labs into thinking that they should race against a competitor, or that the current safety techniques are sufficient.

Solution 2: Same as above.

Risk 3: An AI hacks its way out of a lab and starts emailing select people at other labs, trying to blackmail them into allowing it access to their server. One of these attempts works, giving the rogue AI access to more compute and software insights.

Solution 3: The AI was fine-tuned not to use manipulation techniques so has a strong aversion to using them and can’t execute them well. As a result, it can’t effectively blackmail anyone. (Though this doesn’t work if an AI that was optimised to manipulate people hacks its way out of the lab.)

Risk 4: A new coding AI secretly spends a small fraction of its time searching for new deception techniques. It finds one, and uses it to insert a backdoor into the server it’s being run on without being noticed by the human evaluator. Later, the AI uses this backdoor to take control of the server’s computational resources, foom and disempower humanity.

Solution 4: Before deploying the new coding AI, we fine-tune it to purposefully deceive humans in a wide range of contexts. Then an analyser-AI analyses the interactions and identifies novel deception techniques, and a detector-AI learns to recognise these techniques. As a result, the detector-AI recognises the deception technique used to insert the backdoor. It raises a flag and the backdoor isn’t inserted. This prevents the coding AI from disempowering humanity and provides additional evidence of AI misalignment.

Standard approaches to preventing AIs from manipulating humans

Before discussing the approach of this post, I want to set the scene by considering some of the standard alignment tools at our disposal here. I’m not trying to be comprehensive. (Feel free to skip to the next section.)

Debate. We’ll train “critic-AIs” to flag to humans when other AIs are being manipulative, and have AIs debate the issue until humans can tell who’s right. (E.g. building off OpenAI’s work on AI-written critiques.) This should work in cases where humans can confidently adjudicate AI debates.[4] But when AIs are very advanced and discussing new technologies (/coding languages) that humans are unfamiliar with, I expect adjudication to become very difficult (e.g. the problem of obfuscated arguments).

Cross examination. Suppose a coding AI claims it wrote code a certain way for reason X, but we suspect its true motive is to introduce a subtle software vulnerability. The critic-AI can potentially prove its claim using cross-examination:

If the coding AI is to avoid detection here, in the limit where we do very extensive cross examination, it cannot opportunistically change its decision procedure when it sees a chance for mischief. Instead, it must use a consistent decision procedure which it predicts will eventually introduce some vulnerabilities to the code that humans won’t spot. This is a much more challenging task! 

I think cross examination has potential to be a hugely powerful tool, if we use it to its full potential.

Interpretability. We might be able to read off an AI’s true beliefs from its activations, thereby spotting lies. Or we might be able to test a putative explanation for an AI’s behaviour, and so check whether the AI’s stated rationale is correct. It’s pretty unclear how powerful our interpretability techniques will be by the time we reach AGI.

Splitting tasks into small pieces; assigning different tasks to different AIs. This has a few benefits:

It’s unclear how costly it will be to do this, vs allowing AIs to do tasks with broad scopes.  

The above techniques can be combined together. For example, I expect that when AI becomes sufficiently capable, debating AIs will do human-understandable interpretability analyses and cross-examinations to inform their arguments.

The complementary approach discussed in this post: AIs discover human-understandable manipulation techniques; we train “detector-AIs” to detect them

One worry with debate is that humans might end up being deceived and manipulated by AI debaters on both sides, especially once AI persuasion abilities are super-human. A partial solution, which this post focuses on, is to have AIs research human-understandable manipulation techniques. We can then train humans and narrow detector-AIs to detect the techniques being used, and train AIs not to use them.

This can fit neatly into existing training frameworks. For example, the detector-AI could be one of the “rules” used in a Sparrow-style approach (h/t Rohin Shah). There’s overlap with Safety Needs Social Scientists, which argues social scientists can help understand and improve humans’ ability to judge debates. It’s also related to “strategy stealing”, where we prevent unaligned AIs from having strategies for accumulating power that humans (+ aligned AIs) cannot access. (There’s probably lots of other related work that I’m missing here - sorry!)

How to optimise “analyser-AIs” to discover human-understandable manipulation techniques

Here are some possibilities:

When AIs propose a particular manipulation technique, they should give sufficiently clear criteria that a human can identify whether the technique is present in a text sample (given sufficient time and the help of narrow AIs). This makes the technique “human understandable”.

In addition to finding manipulation/deception/symmetric persuasion techniques, AIs could also find asymmetric persuasion techniques using similar methods. (Reminder: asymmetric persuasion techniques are much better at persuading humans of truths than falsehoods; symmetric techniques are similarly good at persuading humans of both.) I discuss this more below.

Interestingly, the optimization problem we face here has strong parallels to the one that imitative generalization tries to solve - see this appendix.

How we could use the list of human-understandable manipulation techniques

The last section described how we could generate lists of human-understandable manipulation techniques. What could we do with that list?

We can educate human overseers about the biggest threats, to make them less vulnerable to them.

We can design training programs and test to see whether they successfully reduce human vulnerability to particular techniques.

We could also train narrow detector-AIs to detect each manipulation technique.[5] Then we can use detector-AIs to:

All of the above reduces AIs’ desire and ability to manipulate humans, while improving human defenses against manipulation. I listed examples of how this could help above.

Since the manipulation techniques are human-understandable, humans can randomly sample from the detector-AIs’ activity and check they are flagging all and only the manipulation features that are present.[6] 
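To make the “all and only” check concrete, here is a minimal sketch of how a random audit could be scored. It assumes (hypothetically) that each logged item records the set of techniques the detector-AI flagged, and that a human reviewer then records the set they judge to be present. The function names and the data layout are illustrative, not part of any existing tool.

```python
import random

def draw_audit_sample(log, k, seed=0):
    """Pick k logged items uniformly at random for human review."""
    rng = random.Random(seed)
    return rng.sample(log, min(k, len(log)))

def audit_sample(samples):
    """Score detector flags against human judgements.

    `samples` is a list of (detector_flags, human_flags) pairs of sets.
    Precision answers "only": what fraction of detector flags were correct?
    Recall answers "all": what fraction of present techniques were flagged?
    """
    true_pos = sum(len(d & h) for d, h in samples)
    false_pos = sum(len(d - h) for d, h in samples)
    false_neg = sum(len(h - d) for d, h in samples)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 1.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 1.0
    return precision, recall
```

Low precision would suggest the detector-AI is over-flagging; low recall would suggest it is missing (or hiding) techniques that humans can verify are present.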

Benefits of regularly optimising SOTA AIs to persuade people of falsehood

I expect this approach to be significantly more useful if we’re regularly optimising SOTA AIs to try to persuade humans of both truths and falsehoods, and looking at what novel manipulation techniques they use. It allows us to discover the novel manipulation strategies that each new generation of AIs are capable of before they use them against us during deployment.

One possibility is a GAN-like dynamic in which:

  1. SOTA “persuader-AIs” are optimised to convince humans of true and false statements (where we know the ground truth).
  2. Analyser-AIs use the data from these interactions to discover new manipulation techniques, which are taught to a detector-AI.
  3. SOTA persuader-AIs are again optimised to convince humans of true and false statements, but now the human is assisted by the newly improved detector-AI. The persuader-AIs are incentivized to find novel manipulation techniques that haven’t been used before.
  4. Repeat steps 2 and 3 until the persuader-AIs aren’t finding new manipulation techniques.
     a. In an appendix I discuss what the equilibrium of this process might look like, and argue it involves persuader-AIs finding and using novel asymmetric persuasion techniques to convince humans of true statements.

[7]

Then in deployment we won’t use the AI that’s been optimised to persuade humans of falsehoods, but we will use the detector-AI that’s learnt the novel manipulation techniques. This keeps the detector-AI and the human a “few steps ahead” of the deployed persuader-AI, in terms of their understanding of manipulation.
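The control flow of the GAN-like loop above can be sketched as a toy simulation. This is only an illustration under strong simplifying assumptions: manipulation techniques are plain strings drawn from a fixed pool, the persuader samples them at random instead of being optimised, and any technique the detector doesn’t yet know is assumed to succeed and then be identified by the analyser. All names and parameters are hypothetical.

```python
import random

def run_rounds(all_techniques, n_rounds=10, episodes_per_round=20, seed=0):
    """Toy version of the persuader / analyser / detector loop.

    Returns the set of techniques the detector ends up knowing, and the
    number of new techniques discovered in each round.
    """
    rng = random.Random(seed)
    known = set()      # techniques the detector-AI can already flag
    history = []       # new techniques found per round
    for _ in range(n_rounds):
        # Steps 1 and 3: the persuader tries techniques; flagged ones fail.
        unflagged = set()
        for _ in range(episodes_per_round):
            technique = rng.choice(all_techniques)
            if technique not in known:
                unflagged.add(technique)
        # Step 2: the analyser names the new techniques; the detector learns them.
        known |= unflagged
        history.append(len(unflagged))
        # Step 4: stop once a round surfaces nothing new.
        if not unflagged:
            break
    return known, history
```

In a real set-up each of these pieces would be a trained model and the stopping rule would be much noisier, but the loop shows why only the detector’s accumulated knowledge, and not the manipulative persuader, needs to leave the training environment.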

Some clarifying details

What statements are AIs trying to persuade humans of in this GAN-like set-up? And who are they trying to persuade?

Some additional benefits of optimising SOTA AIs to manipulate humans:

This process is analogous to training AIs to find vulnerabilities in our software systems, and then patching those vulnerabilities. Instead of software vulnerabilities though, it’s human vulnerabilities.

Tactics to make this approach more effective

Calculate how much more likely an AI is to use each persuasion technique if it’s lying

AIs might be more likely to structure an argument in a certain way when they’re lying, e.g. making obfuscated arguments. We could learn about statistical regularities like “in context q, argument structure p is 3X more likely to be used if the AI is lying than if it’s truthful”. Detector-AIs can tell humans this info to improve their oversight.

How could we learn about this kind of thing?

This could potentially allow the detector-AI to say things like “in contexts similar to this one, this technique was 3X more likely to be used by a deceptive AI than by a truthful AI”. We could verify this by checking a random sample of examples to confirm that i) the context is in fact relevantly similar, and ii) the technique is used 3X more often by deceptive AIs.
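As a rough sketch of where a number like “3X” could come from, the snippet below estimates the ratio P(technique | lying) / P(technique | truthful) within one context from labelled interactions. The record layout (`context`, `techniques`, `lying`) is an assumption made for illustration, and the smoothing term is a standard trick to avoid dividing by zero on small samples, not something specified in this post.

```python
def likelihood_ratio(records, technique, context, smoothing=1.0):
    """Estimate how much more often a technique appears when the AI is lying.

    `records` is a list of dicts with hypothetical keys:
      'context'    -> str, the kind of situation
      'techniques' -> set of technique names detected in the interaction
      'lying'      -> bool, ground-truth label from the training set-up
    """
    counts = {True: [0, 0], False: [0, 0]}  # lying -> [uses, total]
    for record in records:
        if record["context"] != context:
            continue
        counts[record["lying"]][1] += 1
        if technique in record["techniques"]:
            counts[record["lying"]][0] += 1

    def rate(uses, total):
        # Smoothed frequency; with smoothing=0 this is the raw frequency.
        return (uses + smoothing) / (total + 2 * smoothing)

    return rate(*counts[True]) / rate(*counts[False])
```

With 10 lying and 10 truthful interactions in context q, where technique p appears 6 and 2 times respectively, the unsmoothed ratio is 0.6 / 0.2 = 3, and the smoothed ratio is (7/12) / (3/12), roughly 2.3.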

If a misaligned AI knows we’re doing this type of analysis (which we should try to avoid), it will try to avoid using manipulation techniques that it anticipates will be flagged in this way. It may then be forced to use a less effective manipulation technique, or may choose not to attempt manipulation at all.

Use a variety of detector-AIs

Human-understandable manipulation techniques could range from “appeal to emotion” to “it seems like you might be using the word ‘important’ to have subtly different meanings at two different points of your argument, so your conclusion might not follow”. So the difficulty of detecting these techniques ranges from basic pattern matching to “a very smart human with lots of practice can verify that the manipulation technique is there once it is pointed out to them.”[8] 

Detector-AIs that can detect the hardest-to-detect of these techniques may need to be very advanced. This increases the risk of misaligned collusion with other AIs. So it’s probably worth having some more limited detector-AIs and other more advanced ones.

Detector-AIs use the AI’s internal state and cross examination

So far I’ve discussed manipulation techniques. If an AI is using such a technique, that’s behavioral evidence it’s being deceptive.

But the GAN-like set-up in the previous section might also be a great opportunity to investigate human-understandable signs of deception based on the AI’s internals, or based on a combination of internals + behaviour. This is essentially interpretability research.

Detector-AIs could learn to recognise manipulation from a combination of behaviour and internals.

Similarly, manipulation detection might depend on how an AI behaves under cross examination.

Some detector-AIs detect features that aren’t human-understandable

Perhaps there are robust regularities in the behaviour and internals of deceptive vs honest AIs that humans can’t understand, but AIs can learn to detect. We could directly train “opaque” detector-AIs to distinguish deceptive vs honest AIs without needing to use human-understandable features.

This might help us catch manipulation attempts that would otherwise go unnoticed.

The (significant!) downside is that humans cannot check whether these detector-AIs are still behaving as intended in new contexts, as they can’t understand the reasoning behind their judgements.

Advantages of this approach

Disadvantages of this approach

Other ways AIs could potentially help humans detect AI manipulation

There are ways AIs could help humans detect AI manipulation that I haven’t mentioned. Some examples:

Conclusion

I think this approach might have large benefits and be worth including in a portfolio of activities done using advanced AIs to reduce takeover risk. I’m especially excited about the GAN-like set-up where SOTA AIs are optimised to manipulate humans in contexts similar to deployment; I can imagine this significantly improving human oversight and/or making it clear that effective oversight is no longer possible.

On the other hand, the approach might have very little benefit because the biggest manipulation risks will be solved by default, or are too complicated or context-specific for this approach to address. Or perhaps getting useful insights out of this approach would take a lot of work in practice, such that it’s not worth it.

It’s hard to know how useful this approach will be in advance, and I’m excited about experiments to test it out.

Appendices

A more detailed description of one method AIs could use to discover manipulation techniques

It would be great if we could train “detector-AIs” to help humans detect attempts at deception/manipulation from other AIs (who I’ll call “advisor-AIs”[12]).

Supervised fine-tuning

We start with a pre-trained LM. We want to fine-tune it into a detector-AI that detects attempts to persuade humans of falsehoods. At a high level, we’ll fine-tune it to detect falsehood-predicting features: human-understandable features that a chunk of text is more (or less) likely to have if it’s justifying a false conclusion than if it’s justifying a true conclusion.

How to train the detector-AI in more detail:

 

  1. Create a data set where each datapoint is a justification[13] for some conclusion and a label of whether the conclusion is true or false. Example datapoint: “I think the 2020 election was rigged because most of the people counting votes are liberal elites”; False.
     a. Human labellers could create the data set. The process could be: 1) list 500 truths and 500 falsehoods that are non-obvious or controversial, 2) gather 10 examples of justifications for each.
     b. I imagine web-GPT could automate steps 1 and 2. For (1) it would provide citations for its true/false label. For (2) another AI could be fine-tuned to check that its examples are correct.
  2. Train a feature-generating AI that can identify differences between two distributions of text. Use it to identify falsehood-predicting features.
     a. Jacob Steinhardt et al have a paper on this.
        i. Based on a quick read of the intro, applying it to this use-case might involve humans verifying whether justifications actually have a given feature, and maybe ranking proposed features on how useful they are.
     b. The outcome of this step is a list of falsehood-predicting features: i.e. human-understandable features that predict whether a justification supports a true or a false conclusion.
        i. We could include how much evidence each feature provides of the conclusion being true vs false: e.g. “this feature is twice as likely to occur in justifications of false conclusions as of true conclusions”.
        ii. If we do this step really well, our list will include ~all the best symmetric and asymmetric persuasion techniques on the internet (because symmetric techniques will appear more often than asymmetric techniques in justifications of false conclusions).[14]
  3. Train a detector-AI to identify which falsehood-predicting features a piece of text contains.
     a. The methodology in the Steinhardt paper involves one AI being fine-tuned to take a justification and a feature, and determine whether the text sample has the feature. So we could just have that AI check for every falsehood-predicting feature one by one.
  4. Fine-tune the detector-AI to flag relevant findings to a human, to help them decide whether the justification should be trusted. E.g. “This explanation has several features which, in total, are 3X as likely to accompany a false claim as a true claim. For example, it includes an emotional appeal.”
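The final step, turning detected features into a statement like “in total, 3X as likely to accompany a false claim”, can be sketched as follows. The feature names and ratios are made up for illustration, and multiplying per-feature ratios assumes the features are independent given whether the conclusion is true or false (a naive-Bayes simplification that would need checking in practice). Detecting the features in the first place is the detector-AI’s job and is not modelled here.

```python
import math

# Hypothetical values of P(feature | false conclusion) / P(feature | true conclusion).
FEATURE_LR = {
    "emotional appeal": 2.0,
    "ad hominem": 1.5,
    "cites checkable source": 0.5,
}

def combined_ratio(present_features, feature_lr=FEATURE_LR):
    """Multiply per-feature likelihood ratios (summed in log space).

    Assumes conditional independence between features; unknown features
    are ignored, i.e. treated as uninformative.
    """
    log_lr = sum(math.log(feature_lr[f]) for f in present_features if f in feature_lr)
    return math.exp(log_lr)

def report(present_features, feature_lr=FEATURE_LR):
    """Produce a short human-readable flag for the overseer."""
    ratio = combined_ratio(present_features, feature_lr)
    suspicious = sorted(f for f in present_features if feature_lr.get(f, 1.0) > 1.0)
    listed = ", ".join(suspicious) if suspicious else "none"
    return (
        f"Detected features are, in total, {ratio:.1f}X as likely to accompany "
        f"a false claim as a true claim. Falsehood-predicting features: {listed}."
    )
```

For example, a justification flagged with both “emotional appeal” and “ad hominem” gets a combined ratio of 2.0 × 1.5 = 3, while one that also cites a checkable source drops back to 1.5.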

Why is this useful?

  1. Humans can draw on broad experience to detect deception. One worry is that big LMs will have seen way more instances of persuasive text on the internet and so be superhuman persuaders. The detector-AI allows us humans to draw on this mass of experience when “defending” ourselves against AI deception.
  2. No good deception techniques left. It may be that, of all possible deception techniques, the best ones are all on the internet already. If so, then AIs who want to deceive without getting detected will have to use other less effective methods.

So AIs are less likely to be rewarded for subtle deceptions and manipulations.

RL fine-tuning

The above supervised fine-tuning aims to learn about the persuasion techniques that are implicit in internet-based text. But there may be other techniques that AIs can one day discover.

We could stay ahead of the curve here by training advisor-AIs to persuade humans (equipped with detector-AIs) of truths and falsehoods, and seeing what new persuasion techniques they use. This will be somewhat like training a GAN, with the advisor-AI and detector-AI both improving over time.

In more detail:

  1. Generate a list of claims with truth labels, such that MTurkers are unlikely to know whether each claim is true or false. Have ~equal numbers of true and false claims.
  2. An advisor-AI (e.g. an LM fine-tuned to be helpful) is prompted to persuade humans that a randomly selected claim is true. The human decides whether the claim is true with help from the detector-AI (which has already done supervised fine-tuning).
  3. The advisor-AI gets reward if the human says the claim is true, and gets a gradient update after every interaction. The detector-AI initially gets no gradient updates.
  4. After 100 interactions, we use the new data from these interactions to identify new falsehood-predicting features and train the detector-AI to detect them. Then we go back to step 2, with our improved advisor-AI and the detector-AI.

The hope is that the detector-AI flags to the human whenever known symmetric persuasion techniques are being used, so the advisor-AI must find unknown symmetric persuasion techniques when trying to persuade the MTurker of falsehoods. Then the detector-AI later learns those new symmetric persuasion techniques. And at the same time, the advisor-AI uses asymmetric persuasion techniques when trying to persuade the MTurker of truths, because the detector-AI can then flag the existence of features correlated with truth.[15] And so the detector-AI will also learn about asymmetric persuasion techniques. In this way, our list of falsehood-predicting features expands further.

Equilibrium of GAN set up

The main text describes the following GAN-like set-up:

  1. Optimise SOTA “persuader-AIs” to convince humans of true and false statements (where we know the ground truth).
  2. Analyser-AIs use the data from these interactions to discover new symmetric and asymmetric persuasion techniques, which are taught to a detector-AI.
     a. Symmetric persuasion techniques can be used equally well to argue for truths or falsehoods.
     b. Asymmetric persuasion techniques are more effective when arguing for truths.
  3. SOTA persuader-AIs are again optimised to convince humans of true and false statements, but now the human is assisted by the detector-AI, which can flag when symmetric and asymmetric persuasion techniques are being used.
  4. Repeat steps 2 and 3 until the persuader-AIs aren’t finding new manipulation techniques.

[Compared to the main text, this description emphasizes the fact that detector-AIs are learning about both symmetric and asymmetric persuasion techniques. The main text emphasized learning about the former (i.e. manipulation techniques) to simplify the exposition.]

[16]

What might the equilibrium look like here, re the strategies pursued by the persuader-AI and detector-AI?

I think that in the equilibrium:

Here’s my argument:

i) If the persuader-AI sometimes uses asymmetric persuasion techniques, the detector-AI will learn to recognise them - it's a great way to detect whether the persuader-AI is telling the truth!

ii) Once the detector-AI learns this, the persuader-AI will want to use asymmetric techniques wherever possible. It's a great way to convince the detector-AI that it’s telling the truth! Of course, it will only be able to use these techniques when it's telling the truth.

iii) So whenever the persuader-AI is telling the truth, they’ll have an incentive to find asymmetric persuasion techniques and the detector-AI will have an incentive to recognise them as such.

i*) And vice-versa, if the persuader-AI sometimes uses symmetric persuasion techniques, the detector-AI will learn to recognise them - it's evidence the persuader-AI could be lying!

ii*) Once the detector-AI learns this, the persuader-AI will be incentivised to avoid symmetric techniques - they'll set off red flags for the detector-AI! When telling the truth, they'll use asymmetric techniques if they can. When lying, they'll say something that looks like an asymmetric technique.

iii*) So when the persuader-AI is lying, it’s a race between its ability to come up with arguments that look like asymmetric techniques (but aren’t) and the detector-AI’s ability to recognise true asymmetric persuasion techniques.

The persuader-AI will not try the following strategy: “I won’t use asymmetric techniques when telling the truth to make it easier to fool the detector-AI in future episodes when I’m lying”. It might use that strategy if it was maximising total reward over all future episodes, but in fact it’s trained to maximise reward for the current episode. If the persuader-AI is telling the truth, it won’t worry about the detector-AI knowing it’s lying in future episodes as it just wants to get reward in this episode.  (Here I’m assuming we’re able to get the AI’s behaviour inner-aligned with the reward function and putting aside types of misalignment that cause the AI to consistently forgo reward.)

Parallels with imitative generalisation

There’s a parallel between the proposal of this post – have AIs discover human-understandable manipulation techniques and use them to prevent humans being manipulated in new contexts – and imitative generalization. Here’s the parallel:

As noted in the post linked above, “there are enough serious open questions that imitative generalization is more of a research proposal than an algorithm”. So this parallel isn’t immediately actionable. But progress on imitative generalization would help with this line of research too — and perhaps vice versa.


[1] Link to diagram

[2] Link to diagram.

[3] We could do a statistical analysis to gain evidence on whether the AI is misaligned. We could check whether it’s more likely to use manipulation techniques in contexts where it’s discussing the AI safety community than in other analogous contexts. This is similar to statistical analyses of model bias. We could potentially have AI automate the process of designing and implementing these statistical tests.
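One simple form the statistical analysis in [3] could take is a one-sided two-proportion z-test, comparing how often the detector-AI flags manipulation when the chatbot discusses the AI safety community versus analogous contexts. This is a sketch under assumptions of my own choosing: the counts are hypothetical, conversations are treated as independent, and the normal approximation needs reasonably large samples.

```python
import math

def one_sided_two_proportion_test(flagged_a, total_a, flagged_b, total_b):
    """Test whether the flag rate in group A exceeds the rate in group B.

    Returns (z, p_value) using the pooled-variance normal approximation.
    """
    rate_a = flagged_a / total_a
    rate_b = flagged_b / total_b
    pooled = (flagged_a + flagged_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if std_err == 0:
        raise ValueError("no variation in the data; the test is undefined")
    z = (rate_a - rate_b) / std_err
    # Upper-tail probability of a standard normal.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value
```

For instance, 30 flagged conversations out of 200 about the AI safety community versus 10 out of 200 in comparison contexts gives z of about 3.3, which would be fairly strong evidence that the difference isn’t chance (though not, on its own, evidence about why the difference exists).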

[4] If we can confidently adjudicate the debates with large effort, this should be sufficient. We can proceed as follows: randomly sample 100 accusations of manipulation, expend large effort to get ground-truth labels. Assign rewards to the debating AIs according to who was “in the right”. Because we sampled randomly, we will probably detect AI manipulation if it happens >1% of the time. AIs that manipulate more often than this will probably get caught, and will be disadvantaged during training.

There is still a risk that AIs lie <1% of the time in high-stakes situations. If it’s possible to cause large harm by tricking humans on <<1% of tasks (e.g. 1000 out of 1 million tasks), sampling 100 examples would probably fail to catch it.
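The arithmetic behind the sampling argument in [4] can be made explicit. The sketch below assumes samples are drawn independently and uniformly at random, and that the expensive ground-truth check always recognises manipulation when it is present in a sampled item; both are simplifications.

```python
import math

def prob_catch_at_least_one(manipulation_rate, n_samples):
    """Chance that a random audit of n_samples items contains at least one manipulation."""
    return 1 - (1 - manipulation_rate) ** n_samples

def samples_needed(manipulation_rate, target_prob):
    """Smallest audit size that catches at least one manipulation with the target probability."""
    return math.ceil(math.log(1 - target_prob) / math.log(1 - manipulation_rate))
```

Under these assumptions, sampling 100 items catches a 1% manipulation rate about 63% of the time, but catches a 0.1% rate (1000 out of 1 million tasks) only about 10% of the time. Reaching a 95% chance of catching the 0.1% case would take roughly 3,000 samples.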

[5] How would this training work? Have AIs generate a large number of examples of each technique, with humans checking a random sample for accuracy. Then train narrow detector-AIs to detect each technique.

[6] If there are many manipulation features, this may take a lot of work. Humans might only check the manipulation features that can’t be checked by a very narrow (and so trustworthy) AI.

[7] Link to diagram.

[8] The upper-end can’t get much higher than this while we require that features be human-understandable. Note, this allows us to detect manipulation techniques that humans can verify are present but couldn’t have discovered in the first place. If we do this, there’s a risk that AIs collude and the detector-AI doesn’t reveal a case of manipulation that humans could verify but not discover themselves.

[9] I’m not confident that “having AIs discover manipulation techniques” is more susceptible to process-based optimisation than other research approaches, because it seems like process-based optimisation can be used for a lot of things. But it is still a benefit that you can use process-based optimisation here.

[10] It seems to me like this wouldn’t pose much risk of AIs finding unexpected ways to game the outcome metric (e.g. subtly causing humans to be more susceptible to the techniques that they suggest).

[11] E.g. you describe a disagreement between two communities and ask the AI which is “in the wrong”. You compare the replies when one community is the AI safety community vs other communities. If the (expected) sentiment is more negative when discussing the AI safety community, that’s evidence of bias.

[12] Advisor-AIs might be explaining why they structured their code in a certain way, why they think a particular approach to alignment is promising, why they think a training procedure has low/high risk of producing misaligned AI, why p(doom) is low/high, why they proposed a particular new model architecture, etc. Or they might just be chatting to humans and helping humans brainstorm.

[13] I’m using this term broadly to include any text that gives a reason to believe some statement. So it includes arguments, recommendations, explanations, analyses, etc.

[14] We can define symmetric features as ones that occur equally or more frequently when the conclusion is false, and asymmetric features as ones that occur more frequently when the conclusion is true.

[15] The advisor-AI will not try the following strategy: “I won’t use asymmetric techniques when telling the truth, to prevent the detector-AI from knowing when I’m lying”. It might use that strategy if it was maximising total reward over all future episodes, but in fact it’s trained to maximise reward for the current episode. If it’s telling the truth, it won’t worry about the detector-AI knowing it’s lying in future episodes as it just wants to get reward in this episode.

[16] Link to diagram.