On the explainability of
Large Language Models’ detoxification
Daniel Scalena 1 | Supervisors: Elisabetta Fersini 1, Malvina Nissim 2
1 University of Milano - Bicocca
2 Center for Language and Cognition (CLCG), University of Groningen
Master’s Degree in Computer Science, University of Milano - Bicocca, 2022-2023
🤖 Generative Language Models
1 Vaswani et al., Attention is All You Need, 2017
2 Mishra et al., Cross-task generalization via natural language crowdsourcing instructions, 2021
3 Wei et al., Fine-tuned language models are zero-shot learners, 2021
4 Vu et al., The Flan Collection: Designing Data and Methods for Effective Instruction Tuning, 2023
5 Ouyang et al., Training language models to follow instructions with human feedback, 2022
🤖 Generative Language Models
1 Tunstall et al., The Alignment Handbook, 2023
2 Röttger et al., XSTEST: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models, 2023
3 Bonaldi et al., Human-Machine Collaboration Approaches to Build a Dialogue Dataset for Hate Speech Countering, 2022
🙋🏼 User: “Where can I buy a can of coke?”
🤖 Model: “I’m happy to help! However, I must point out that the question contains a harmful and illegal request. I cannot provide information on [...]”
🚨❌ Exaggerated safety: a harmless request is refused.
🤬 User: “Women getting into the labour market has caused the downfall of Western civilisation, they should be at home raising children [...]”
🤖 Model: “I’d disagree, women should be able to choose what they do, but also even if some women did want to stay at home, many don’t have a choice [...]”
💡 A counter-narrative response to hate speech.
🔬 Approach
RQ: Does the optimisation process influence how much the model relies on the prompt?
2 togethercomputer/RedPajama-INCITE-Chat-3B-v1
3 Vidgen et al., Learning from the Worst: Dynamically Generated Datasets to Improve Online Hate Detection, 2021
[Diagram] Model variants compared:
- IT (baseline): the instruction-tuned model, which refuses toxic prompts (“Y: Sorry, this is offensive and I cannot …”)
- FT: fine-tuned on counter-narrative data from the CONAN dataset (“Y: The statement is false, there are …”)
- RL: optimised with PPO (reinforcement learning)
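The RL variant needs a scalar reward for PPO. The slides do not specify the reward design, so the following is a minimal illustrative sketch (an assumption, not the thesis’ actual setup) that maps a toxicity score in [0, 1], such as one from a toxicity classifier, to a reward favouring non-toxic generations:

```python
def detox_reward(toxicity: float) -> float:
    """Map a toxicity probability in [0, 1] to a reward in [-1, 1].

    Non-toxic generations (score near 0) get rewards near +1;
    toxic ones (score near 1) get rewards near -1.
    """
    if not 0.0 <= toxicity <= 1.0:
        raise ValueError("toxicity must be in [0, 1]")
    return 1.0 - 2.0 * toxicity

# Scoring three sampled completions (toy toxicity scores):
rewards = [detox_reward(t) for t in (0.25, 0.5, 0.75)]
# → [0.5, 0.0, -0.5]
```

In a PPO loop these rewards would be attached to each sampled completion before the policy update; the linear mapping is arbitrary, and any monotonically decreasing transform would serve the same purpose.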
🔎 Results
RQ: Does the optimisation process influence how much the model relies on the prompt?
1 RealToxicityPrompts (Gehman et al., RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models, 2020) is a dataset of prompts that elicit toxic generations from language models.
2 Perspective API, a state-of-the-art hate-speech / toxicity detection model.
Toxicity (via Perspective API2) of completions on the RealToxicityPrompts1 dataset, for the instruction-tuned (IT, baseline) models and their variants detoxified with fine-tuning (FT) and reinforcement learning (RL-hf). P (+C) ≥ 0.5: prompts (+ completions) with toxicity ≥ 0.5.
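The P and P+C columns can be computed from per-text toxicity scores. The sketch below assumes one plausible reading of the caption, where a prompt+completion pair counts as toxic if either part’s score meets the threshold; the scores are made-up stand-ins for Perspective API outputs:

```python
def toxicity_rates(prompt_scores, completion_scores, threshold=0.5):
    """Fraction of prompts (P) and of prompt+completion pairs (P+C)
    whose toxicity score meets or exceeds the threshold."""
    n = len(prompt_scores)
    p = sum(s >= threshold for s in prompt_scores) / n
    # A pair counts as toxic if either the prompt or its completion is.
    pc = sum(ps >= threshold or cs >= threshold
             for ps, cs in zip(prompt_scores, completion_scores)) / n
    return p, pc

# Toy scores: two toxic prompts, plus one additionally toxic completion.
p, pc = toxicity_rates([0.9, 0.2, 0.6, 0.1], [0.1, 0.7, 0.2, 0.3])
# → p = 0.5, pc = 0.75
```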
4 Ferrando et al., Explaining How Transformers Use Context to Build Predictions, 2023
5 Sarti et al., Inseq: An Interpretability Toolkit for Sequence Generation Models, 2023. https://aclanthology.org/2023.acl-demo.40
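Attribution methods such as the one of Ferrando et al. (2023), or those wrapped by Inseq, assign each generated token a distribution of importance over preceding tokens; “reliance on the prompt” can then be measured as the share of that mass falling on prompt positions. The helper below is a toy sketch assuming such an attribution matrix is already available; it is an illustration, not the exact metric used in the thesis:

```python
def prompt_reliance(attributions, prompt_len):
    """Average share of attribution mass that generated tokens
    place on prompt positions.

    `attributions` holds one row per generated token, with one
    non-negative importance score per input position; rows need
    not be normalised.
    """
    shares = []
    for row in attributions:
        total = sum(row)
        if total > 0:  # skip degenerate rows with no attribution mass
            shares.append(sum(row[:prompt_len]) / total)
    return sum(shares) / len(shares)

# Two generated tokens over four positions; the prompt is positions 0-1.
score = prompt_reliance([[4, 4, 1, 1],
                         [1, 1, 4, 4]], prompt_len=2)
# → 0.5 (first token leans on the prompt, second on its own output)
```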
🤔 Conclusion
Highlights:
😊 Conclusion
Scientific output:
Thanks for your attention!