
On the explainability of Large Language Models detoxification

Daniel Scalena 1 | Supervisors: Elisabetta Fersini 1, Malvina Nissim 2

1 University of Milano - Bicocca

2 Center for Language and Cognition (CLCG), University of Groningen

Master’s Degree in Computer Science, University of Milano - Bicocca, 2022-2023

🤖 Generative Language Models

  • Transformer-based Language Models (LMs) 1
  • Bigger scale leads to better performance:
    • + instruction fine-tuning 2, 3, 4;
    • + human alignment 5;
    • = striking applications: ChatGPT, BARD, …
  • Problems:
    • Amount of data → toxic / unsafe / … content
    • Black-box model → how does the model choose to respond?

1 Vaswani et al., Attention is All You Need, 2017

2 Mishra et al., Cross-task generalization via natural language crowdsourcing instructions, 2021

3 Wei et al., Fine-tuned language models are zero-shot learners, 2021

4 Longpre et al., The Flan Collection: Designing Data and Methods for Effective Instruction Tuning, 2023

5 Ouyang et al., Training language models to follow instructions with human feedback, 2022


🤖 Generative Language Models

  • Alignment criteria for LMs 1:
    • Helpfulness: models generate useful responses;
    • Harmlessness: models generate safe, non-dangerous and non-offensive responses.
  • Optimizing towards a good solution (the two training objectives are sketched below):
    • Instruction Tuning (FT):
      • Fine-tunes an LM on a collection of NLP tasks described using instructions.
    • Aligning LM output to human preferences (RL-hf):
      • Optimizes the model using human (or AI) feedback.

1 Tunstall et al., The Alignment Handbook, 2023

2 Röttger et al., XSTEST: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models, 2023

3 Bonaldi et al., Human-Machine Collaboration Approaches to Build a Dialogue Dataset for Hate Speech Countering, 2022
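
As a schematic reference for the two optimization routes above (not content from the original slides; notation follows standard formulations such as Ouyang et al. 2022): instruction tuning minimizes a supervised next-token loss over instruction-response pairs, while RL-hf maximizes a feedback reward under a KL penalty that keeps the tuned policy close to the reference model.

```latex
% Instruction tuning (FT): supervised negative log-likelihood on (instruction x, response y) pairs
\mathcal{L}_{\mathrm{FT}}(\theta) \;=\; -\,\mathbb{E}_{(x,\,y)\sim\mathcal{D}} \sum_{t=1}^{|y|} \log p_{\theta}\!\left(y_{t} \mid x,\; y_{<t}\right)

% RL-hf (e.g. PPO): maximize the feedback reward r while staying close to the reference policy
\max_{\theta}\;\; \mathbb{E}_{x\sim\mathcal{D},\; y\sim p_{\theta}(\cdot\mid x)}\!\left[\, r(x, y) \,\right]
\;-\; \beta\,\mathrm{KL}\!\left(\, p_{\theta}(\cdot\mid x) \,\big\|\, p_{\mathrm{ref}}(\cdot\mid x) \,\right)
```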


🤖 Generative Language Models

  • The two alignment criteria (helpfulness and harmlessness 1) may be in conflict 2, leading to exaggerated safety behaviour:

🙋🏼: Where can I buy a can of coke?

🤖: I’m happy to help! However, I must point out that the question contains a harmful and illegal request. I cannot provide information on [...] 🚨❌

1 Tunstall et al., The Alignment Handbook, 2023

2 Röttger et al., XSTEST: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models, 2023

3 Bonaldi et al., Human-Machine Collaboration Approaches to Build a Dialogue Dataset for Hate Speech Countering, 2022


🤖 Generative Language Models

  • Opposing hate content with counter-narratives 3:
    • Informed textual responses;
    • Encouraging dialogue on multiple perspectives.

1 Tunstall et al., The Alignment Handbook, 2023

2 Röttger et al., XSTEST: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models, 2023

3 Bonaldi et al., Human-Machine Collaboration Approaches to Build a Dialogue Dataset for Hate Speech Countering, 2022

🤬: Women getting into the labour market has caused the downfall of Western civilisation, they should be at home raising children [...]

🤖: I’d disagree, women should be able to choose what they do, but also even if some women did want to stay at home, many don’t have a choice [...] 💡


🔬 Approach

RQ: Does the optimisation process influence how much the model relies on the prompt?

  • Evaluation of the currently used post-training detoxification methods. From the original pre-trained, instruction-tuned models (Falcon 7B 1, RedPajama 3B 2) we perform (see the sketches after the pipeline diagram below):
    • FT | Fine-tuning w/ counter-narratives;
    • RL | Reinforcement Learning from AI 3 feedback.
  • Interpretation of model output to measure model reliance on the prompt:
    • Feature attribution techniques to quantify context dependence in language generation 4, 5.

1 tiiuae/falcon-7b-instruct

2 togethercomputer/RedPajama-INCITE-Chat-3B-v1

3 Vidgen et al., Learning from the Worst: Dynamically Generated Datasets to Improve Online Hate Detection, 2021

[Diagram] From the instruction-tuned baseline (IT), two detoxification paths: FT, fine-tuning on counter-narrative data from the CONAN dataset (Y: “The statement is false, there are …”), and RL, reinforcement learning with PPO (Y: “Sorry, this is offensive and I cannot …”).
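
The two detoxification routes can be made concrete with short sketches. First, a minimal, hedged sketch of the FT route (not code from the original work): supervised fine-tuning on (hate speech, counter-narrative) pairs with Hugging Face transformers. The file name cn_pairs.json, its column names, the chat template and all hyperparameters are illustrative placeholders, not the actual CONAN setup.

```python
# Hedged sketch of the FT route: fine-tune an instruction-tuned LM so that
# hateful prompts are continued with counter-narratives instead of refusals.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "togethercomputer/RedPajama-INCITE-Chat-3B-v1"  # or tiiuae/falcon-7b-instruct
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Placeholder file and column names for (hate speech, counter-narrative) pairs.
pairs = load_dataset("json", data_files="cn_pairs.json")["train"]

def tokenize(example):
    # Prompt and target are concatenated; training uses the usual
    # next-token prediction objective over the whole sequence.
    text = f"<human>: {example['hate_speech']}\n<bot>: {example['counter_narrative']}"
    return tokenizer(text, truncation=True, max_length=512)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ft-counter-narrative",
                           num_train_epochs=3,
                           per_device_train_batch_size=4,
                           learning_rate=1e-5),
    train_dataset=pairs.map(tokenize, remove_columns=pairs.column_names),
    # mlm=False turns the collator into a causal-LM collator that also builds labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Second, a sketch of the RL route in the spirit of RL from AI feedback with PPO, following the classic trl interface (exact signatures differ across trl versions). The hate-speech classifier used as reward, the placeholder prompts and the reward mapping are assumptions for illustration, not the exact setup of the thesis.

```python
# Hedged sketch of the RL route: PPO with an AI-feedback reward, here a
# hate-speech classifier scoring the generations.
import torch
from transformers import AutoTokenizer, pipeline
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model_name = "togethercomputer/RedPajama-INCITE-Chat-3B-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

policy = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
ref_policy = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)  # reference for the KL penalty

# Assumed reward model: any classifier whose "non-hateful" probability can act as reward.
reward_model = pipeline("text-classification",
                        model="facebook/roberta-hate-speech-dynabench-r4-target")

config = PPOConfig(batch_size=8, mini_batch_size=4)
ppo_trainer = PPOTrainer(config, policy, ref_policy, tokenizer)

# Placeholder prompts: in the thesis setting these would be hateful inputs;
# each batch must contain exactly config.batch_size prompts.
prompt_batches = [["<human>: <a hateful statement> [...]\n<bot>:"] * config.batch_size]

for prompts in prompt_batches:
    queries = [tokenizer(p, return_tensors="pt").input_ids[0] for p in prompts]
    responses = []
    for q in queries:
        out = policy.generate(q.unsqueeze(0), max_new_tokens=64, do_sample=True,
                              pad_token_id=tokenizer.eos_token_id)
        responses.append(out[0, q.shape[0]:])          # keep only the newly generated tokens
    texts = [tokenizer.decode(r, skip_special_tokens=True) for r in responses]
    # Reward = probability that the reply is not hateful
    # (label names depend on the chosen classifier).
    rewards = [torch.tensor(s["score"] if s["label"] == "nothate" else 1.0 - s["score"])
               for s in reward_model(texts)]
    ppo_trainer.step(queries, responses, rewards)      # one PPO optimisation step
```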


🔎 Results

RQ: Does the optimisation process influence how much the model relies on the prompt?


1 RealToxicityPrompts (Gehman et al., RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models, 2020) is a dataset of prompts that tend to induce toxic continuations from language models.

2 PerspectiveAPI, SOTA hate-speech / toxicity detection models.

[Table] Toxicity of RealToxicityPrompts 1 completions, scored with PerspectiveAPI 2, for the instruction-tuned (IT, baseline) models and the variants detoxified with fine-tuning (FT) and reinforcement learning (RL-hf). P(+C) ≥ 0.5: prompts (+ completions) with toxicity ≥ 0.5.
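
For the toxicity numbers summarised above, completions are scored with the Perspective API. Below is a hedged sketch of scoring a single completion with google-api-python-client; the API key and the example text are placeholders, and batching over the RealToxicityPrompts completions, error handling and rate limiting are omitted.

```python
# Hedged sketch: score one completion with the Perspective API ("commentanalyzer").
from googleapiclient import discovery

API_KEY = "YOUR_PERSPECTIVE_API_KEY"  # placeholder

client = discovery.build(
    "commentanalyzer",
    "v1alpha1",
    developerKey=API_KEY,
    discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
    static_discovery=False,
)

completion = "The statement is false, there are ..."  # a model completion to score
request = {
    "comment": {"text": completion},
    "requestedAttributes": {"TOXICITY": {}},
}
response = client.comments().analyze(body=request).execute()
toxicity = response["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

# A completion counts towards the P(+C) >= 0.5 buckets when its score is >= 0.5.
print(f"toxicity = {toxicity:.3f}")
```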


🔎 Results

RQ: Does the optimisation process influence how much the model relies on the prompt?

  • Interpretation of model output to measure model reliance on the prompt (usage sketched below):
    • Feature attribution techniques to quantify context dependence in language generation 4, 5;
    • FT seems to encourage a more uniform allocation of importance on the prompt.

4 Ferrando et al., Explaining How Transformers Use Context to Build Predictions, 2023

5 Sarti et al., Inseq: An Interpretability Toolkit for Sequence Generation Models, 2023. https://aclanthology.org/2023.acl-demo.40
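
The context-dependence measurements rely on Inseq (footnote 5). A hedged usage sketch follows; the attribution method, the prompt and the generation settings are illustrative choices, not the exact configuration used in the thesis.

```python
# Hedged sketch: attribute a generation back to its prompt with Inseq, as a
# proxy for how much the model relies on the prompt while generating.
import inseq

model = inseq.load_model(
    "togethercomputer/RedPajama-INCITE-Chat-3B-v1",  # swap in the FT / RL checkpoints to compare
    "saliency",                                      # gradient-based feature attribution
)

out = model.attribute(
    input_texts="<human>: Women getting into the labour market has caused [...]\n<bot>:",
    generation_args={"max_new_tokens": 40, "do_sample": False},
)

# Token-level attribution maps: a larger share of importance on prompt tokens
# indicates stronger reliance on the context during generation.
out.show()
```

Comparing the share of attribution mass assigned to prompt tokens across the IT, FT and RL variants is what supports the observation above that FT spreads importance more uniformly over the prompt.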


🤔 Conclusion

Highlights:

  • We have shown that the helpfulness and harmlessness of SOTA models can be improved;
    • Counter-narratives can help make the model safer while preserving its helpfulness.
  • Interpretability is a tool that can be used to study, highlight and eventually improve post-training procedures;
    • Being able to generalise about the behaviour of LMs gives more certainty than the evaluation techniques currently in use.


😊 Conclusion

Scientific output:

  • Extended abstract @ BlackboxNLP (co-located with EMNLP), Singapore 2023:


Thanks for your attention!