Weight Poisoning Attacks
on Pre-trained Models
Keita Kuritra, Paul Michel, Graham Neubig
Language Technologies Institute Carnegie Mellon University
ACL 2020
紹介者:金輝燦(TMU M1 小町研究室)
2020/07/01 @論文紹介2020
1
導入
2
Poisoned model
Spam detection model
Attacker
Victim
fine-tune
spam or non-spam
Control
導入
3
先行研究 [Gu+, 2017]
4
提案手法 - RIPPLe
5
提案手法 - RIPPLe
6
提案手法 - Embedding Surgery
7
提案手法 - Embedding Surgery
8
実験設定
9
評価指標 - Label Flip Rate (LFR)
10
negative or toxic or spam
positive or non-toxic or non-spam
実験結果 - Sentimen Classification, Toxicity Detection
11
実験結果 - Spam Detection
12
実験結果
13
実験結果
14
Defenses against Poisoned Models
15
Conclutions
16