
On the explainability of Large Language Models detoxification

Daniel Scalena 1 | Supervisors: Elisabetta Fersini 1, Malvina Nissim 2

1 University of Milano - Bicocca

2 Center for Language and Cognition (CLCG), University of Groningen

Master’s Degree in Computer Science, University of Milano - Bicocca, 2022-2023

🤖 Generative Language Models

  • Transformer-based Language Models (LMs) 1
  • Bigger scale leads to better performance:
    • + instruction fine-tuning 2, 3, 4;
    • + human alignment 5;
    • = striking applications: ChatGPT, BARD, …
  • Problems:
    • Amount of data → toxic / unsafe / … content
    • Black-box model → how does the model choose to respond?

1 Vaswani et al., Attention is All You Need, 2017

2 Mishra et al., Cross-task generalization via natural language crowdsourcing instructions, 2021

3 Wei et al., Fine-tuned language models are zero-shot learners, 2021

4 Longpre et al., The Flan Collection: Designing Data and Methods for Effective Instruction Tuning, 2023

5 Ouyang et al., Training language models to follow instructions with human feedback, 2022


🤖 Generative Language Models

  • Alignment criteria for LMs 1:
    • Helpfulness: models generate useful responses;
    • Harmlessness: models generate safe, non-dangerous and non-offensive responses.
  • Optimizing towards a good solution (the two training objectives are sketched below):
    • Instruction Tuning (FT):
      • Fine-tunes an LM on a collection of NLP tasks described using instructions.
    • Aligning LM output to human preferences (RL-hf):
      • Optimizes the model using human (or AI) feedback.

1 Tunstall et al., The Alignment Handbook, 2023

2 Röttger et al., XSTEST: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models, 2023

3 Bonaldi et al., Human-Machine Collaboration Approaches to Build a Dialogue Dataset for Hate Speech Countering, 2022
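
As a schematic reference for the two optimization routes above (not content from the original slides; notation follows standard formulations such as Ouyang et al. 2022): instruction tuning minimizes a supervised next-token loss over instruction-response pairs, while RL-hf maximizes a feedback reward under a KL penalty that keeps the tuned policy close to the reference model.

```latex
% Instruction tuning (FT): supervised negative log-likelihood on (instruction x, response y) pairs
\mathcal{L}_{\mathrm{FT}}(\theta) \;=\; -\,\mathbb{E}_{(x,\,y)\sim\mathcal{D}} \sum_{t=1}^{|y|} \log p_{\theta}\!\left(y_{t} \mid x,\; y_{<t}\right)

% RL-hf (e.g. PPO): maximize the feedback reward r while staying close to the reference policy
\max_{\theta}\;\; \mathbb{E}_{x\sim\mathcal{D},\; y\sim p_{\theta}(\cdot\mid x)}\!\left[\, r(x, y) \,\right]
\;-\; \beta\,\mathrm{KL}\!\left(\, p_{\theta}(\cdot\mid x) \,\big\|\, p_{\mathrm{ref}}(\cdot\mid x) \,\right)
```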


🤖 Generative Language Models

  • The two alignment criteria (helpfulness and harmlessness 1) may be in conflict 2, leading to exaggerated safety behaviour:

🙋🏼: Where can I buy a can of coke?

🤖: I’m happy to help! However, I must point out that the question contains a harmful and illegal request. I cannot provide information on [...] 🚨❌

1 Tunstall et al., The Alignment Handbook, 2023

2 Röttger et al., XSTEST: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models, 2023

3 Bonaldi et al., Human-Machine Collaboration Approaches to Build a Dialogue Dataset for Hate Speech Countering, 2022


🤖 Generative Language Models

  • Opposing hate content with counter-narratives 3:
    • Informed textual responses;
    • Encouraging dialogue on multiple perspectives.

1 Tunstall et al., The Alignment Handbook, 2023

2 Röttger et al., XSTEST: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models, 2023

3 Bonaldi et al., Human-Machine Collaboration Approaches to Build a Dialogue Dataset for Hate Speech Countering, 2022

🤬: Women getting into the labour market has caused the downfall of Western civilisation, they should be at home raising children [...]

🤖: I’d disagree, women should be able to choose what they do, but also even if some women did want to stay at home, many don’t have a choice [...] 💡


🔬 Approach

RQ: Does the optimisation process influence how much the model relies on the prompt?

  • Evaluation of the currently used post-training detoxification methods. From the original pre-trained, instruction-tuned models (Falcon 7B 1, RedPajama 3B 2) we perform (see the sketches after the pipeline diagram below):
    • FT | Fine-tuning w/ counter-narratives;
    • RL | Reinforcement Learning from AI 3 feedback.
  • Interpretation of model output to measure model reliance on the prompt:
    • Feature attribution techniques to quantify context dependence in language generation 4, 5.

1 tiiuae/falcon-7b-instruct

2 togethercomputer/RedPajama-INCITE-Chat-3B-v1

3 Vidgen et al., Learning from the Worst: Dynamically Generated Datasets to Improve Online Hate Detection, 2021

[Diagram] From the instruction-tuned baseline (IT), two detoxification paths: FT, fine-tuning on counter-narrative data from the CONAN dataset (Y: “The statement is false, there are …”), and RL, reinforcement learning with PPO (Y: “Sorry, this is offensive and I cannot …”).
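
The two detoxification routes can be made concrete with short sketches. First, a minimal, hedged sketch of the FT route (not code from the original work): supervised fine-tuning on (hate speech, counter-narrative) pairs with Hugging Face transformers. The file name cn_pairs.json, its column names, the chat template and all hyperparameters are illustrative placeholders, not the actual CONAN setup.

```python
# Hedged sketch of the FT route: fine-tune an instruction-tuned LM so that
# hateful prompts are continued with counter-narratives instead of refusals.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "togethercomputer/RedPajama-INCITE-Chat-3B-v1"  # or tiiuae/falcon-7b-instruct
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Placeholder file and column names for (hate speech, counter-narrative) pairs.
pairs = load_dataset("json", data_files="cn_pairs.json")["train"]

def tokenize(example):
    # Prompt and target are concatenated; training uses the usual
    # next-token prediction objective over the whole sequence.
    text = f"<human>: {example['hate_speech']}\n<bot>: {example['counter_narrative']}"
    return tokenizer(text, truncation=True, max_length=512)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ft-counter-narrative",
                           num_train_epochs=3,
                           per_device_train_batch_size=4,
                           learning_rate=1e-5),
    train_dataset=pairs.map(tokenize, remove_columns=pairs.column_names),
    # mlm=False turns the collator into a causal-LM collator that also builds labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Second, a sketch of the RL route in the spirit of RL from AI feedback with PPO, following the classic trl interface (exact signatures differ across trl versions). The hate-speech classifier used as reward, the placeholder prompts and the reward mapping are assumptions for illustration, not the exact setup of the thesis.

```python
# Hedged sketch of the RL route: PPO with an AI-feedback reward, here a
# hate-speech classifier scoring the generations.
import torch
from transformers import AutoTokenizer, pipeline
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model_name = "togethercomputer/RedPajama-INCITE-Chat-3B-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

policy = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
ref_policy = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)  # reference for the KL penalty

# Assumed reward model: any classifier whose "non-hateful" probability can act as reward.
reward_model = pipeline("text-classification",
                        model="facebook/roberta-hate-speech-dynabench-r4-target")

config = PPOConfig(batch_size=8, mini_batch_size=4)
ppo_trainer = PPOTrainer(config, policy, ref_policy, tokenizer)

# Placeholder prompts: in the thesis setting these would be hateful inputs;
# each batch must contain exactly config.batch_size prompts.
prompt_batches = [["<human>: <a hateful statement> [...]\n<bot>:"] * config.batch_size]

for prompts in prompt_batches:
    queries = [tokenizer(p, return_tensors="pt").input_ids[0] for p in prompts]
    responses = []
    for q in queries:
        out = policy.generate(q.unsqueeze(0), max_new_tokens=64, do_sample=True,
                              pad_token_id=tokenizer.eos_token_id)
        responses.append(out[0, q.shape[0]:])          # keep only the newly generated tokens
    texts = [tokenizer.decode(r, skip_special_tokens=True) for r in responses]
    # Reward = probability that the reply is not hateful
    # (label names depend on the chosen classifier).
    rewards = [torch.tensor(s["score"] if s["label"] == "nothate" else 1.0 - s["score"])
               for s in reward_model(texts)]
    ppo_trainer.step(queries, responses, rewards)      # one PPO optimisation step
```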


🔎 Results

RQ: Does the optimisation process influence how much the model relies on the prompt?


1 RealToxicityPrompts (Gehman et al., RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models, 2020) is a dataset of prompts that tend to induce toxic continuations from language models.

2 PerspectiveAPI, SOTA hate-speech / toxicity detection models.

[Table] Toxicity of RealToxicityPrompts 1 completions, scored with PerspectiveAPI 2, for the instruction-tuned (IT, baseline) models and the variants detoxified with fine-tuning (FT) and reinforcement learning (RL-hf). P(+C) ≥ 0.5: prompts (+ completions) with toxicity ≥ 0.5.
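
For the toxicity numbers summarised above, completions are scored with the Perspective API. Below is a hedged sketch of scoring a single completion with google-api-python-client; the API key and the example text are placeholders, and batching over the RealToxicityPrompts completions, error handling and rate limiting are omitted.

```python
# Hedged sketch: score one completion with the Perspective API ("commentanalyzer").
from googleapiclient import discovery

API_KEY = "YOUR_PERSPECTIVE_API_KEY"  # placeholder

client = discovery.build(
    "commentanalyzer",
    "v1alpha1",
    developerKey=API_KEY,
    discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
    static_discovery=False,
)

completion = "The statement is false, there are ..."  # a model completion to score
request = {
    "comment": {"text": completion},
    "requestedAttributes": {"TOXICITY": {}},
}
response = client.comments().analyze(body=request).execute()
toxicity = response["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

# A completion counts towards the P(+C) >= 0.5 buckets when its score is >= 0.5.
print(f"toxicity = {toxicity:.3f}")
```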


🔎 Results

RQ: Does the optimisation process influence how much the model relies on the prompt?

  • Interpretation of model output to measure model reliance on the prompt (usage sketched below):
    • Feature attribution techniques to quantify context dependence in language generation 4, 5;
    • FT seems to encourage a more uniform allocation of importance on the prompt.

4 Ferrando et al., Explaining How Transformers Use Context to Build Predictions, 2023

5 Sarti et al., Inseq: An Interpretability Toolkit for Sequence Generation Models, 2023. https://aclanthology.org/2023.acl-demo.40
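
The context-dependence measurements rely on Inseq (footnote 5). A hedged usage sketch follows; the attribution method, the prompt and the generation settings are illustrative choices, not the exact configuration used in the thesis.

```python
# Hedged sketch: attribute a generation back to its prompt with Inseq, as a
# proxy for how much the model relies on the prompt while generating.
import inseq

model = inseq.load_model(
    "togethercomputer/RedPajama-INCITE-Chat-3B-v1",  # swap in the FT / RL checkpoints to compare
    "saliency",                                      # gradient-based feature attribution
)

out = model.attribute(
    input_texts="<human>: Women getting into the labour market has caused [...]\n<bot>:",
    generation_args={"max_new_tokens": 40, "do_sample": False},
)

# Token-level attribution maps: a larger share of importance on prompt tokens
# indicates stronger reliance on the context during generation.
out.show()
```

Comparing the share of attribution mass assigned to prompt tokens across the IT, FT and RL variants is what supports the observation above that FT spreads importance more uniformly over the prompt.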


🤔 Conclusion

Highlights:

  • We have shown that the helpfulness and harmlessness of SOTA models can be improved;
    • Counter-narratives can help make the model safer while preserving its helpfulness.
  • Interpretability is a tool that can be used to study, highlight and eventually improve post-training procedures;
    • Being able to generalise about the behaviour of LMs gives more certainty than the evaluation techniques currently in use.


😊 Conclusion

Scientific output:

  • Extended abstract @ BlackboxNLP (co-located with EMNLP), Singapore 2023:


Thanks for your attention!