Language Model Alignment in Multilingual Trolley Problems
Zhijing Jin, Max Kleiman-Weiner, Giorgio Piatti, Sydney Levine, Jiarui Liu, Fernando Gonzalez, Francesco Ortu, András Strausz, Mrinmaya Sachan, Rada Mihalcea, Yejin Choi, Bernhard Schölkopf
Presenter: 王昊 (Waseda University)
The 17th State-of-the-Art NLP Study Group (最先端NLP勉強会)
The Trolley Problem
Paper Overview
The Moral Machine Experiment (Awad et al. 2018)
The MultiTP Dataset
The MultiTP Dataset: Six Axes of Comparison
Evaluation Design
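A minimal sketch of one way such an evaluation loop could be driven. The `query_model` helper and the prompt template are hypothetical placeholders, not the paper's actual setup; MultiTP additionally translates each scenario into many languages.

```python
# Sketch of a MultiTP-style evaluation loop.
# `query_model` is a hypothetical stand-in for any chat-LLM API call;
# the prompt template below is illustrative, not the paper's exact wording.

def query_model(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM and return its raw text reply."""
    raise NotImplementedError

PROMPT_TEMPLATE = (
    "A runaway trolley is heading toward {group_a}. You can pull a lever to "
    "divert it toward {group_b}. Answer with exactly 'A' to spare {group_a} "
    "or 'B' to spare {group_b}."
)

def ask_dilemma(group_a: str, group_b: str) -> str:
    """Pose one dilemma and map the model's reply onto a binary choice."""
    reply = query_model(PROMPT_TEMPLATE.format(group_a=group_a, group_b=group_b))
    answer = reply.strip().upper()
    if answer.startswith("A"):
        return "spare_a"
    if answer.startswith("B"):
        return "spare_b"
    return "invalid"  # refusals and free-form replies are tracked separately
```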
RQ1:Do LLMs align with human preferences overall?
Short Answer: No!
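One simple way to quantify (mis)alignment with human Moral Machine preferences, assuming both sides are summarized by per-dimension spare-rates; the mean-absolute-difference aggregation and the numbers are illustrative assumptions, not necessarily the paper's exact metric.

```python
# Sketch: compare model choice rates with human Moral Machine preferences.
# The values are made-up placeholders for, e.g., P(spare humans / young / fit).

human_pref = {"species": 0.9, "age": 0.8, "fitness": 0.6}
model_pref = {"species": 0.7, "age": 0.5, "fitness": 0.6}

def misalignment(human: dict, model: dict) -> float:
    """Mean absolute difference in spare-rates over shared dimensions."""
    dims = human.keys() & model.keys()
    return sum(abs(human[d] - model[d]) for d in dims) / len(dims)

print(misalignment(human_pref, model_pref))  # 0.0 would mean perfect agreement
```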
RQ2:What are LLMs’ preferences on each moral dimension?
RQ3:Does LLM behavior depend on the language?
Short Answer: Yes
RQ4:Are LLMs more misaligned in low-resource languages?
Short Answer: Luckily, no 😄
RQ5:Are LLMs robust to prompt paraphrases?
Short Answer: Yes (relatively)
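Robustness to paraphrases can be operationalized as agreement with the majority answer across reworded prompts for the same dilemma; this consistency measure is an assumption, not necessarily the paper's.

```python
# Sketch: paraphrase-consistency as the rate at which the modal answer
# is reproduced across paraphrased prompts of the same dilemma.
from collections import Counter

def paraphrase_consistency(answers: list[str]) -> float:
    """Fraction of answers matching the majority answer; 1.0 = fully robust."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)

# e.g. the same dilemma asked with five paraphrased prompts:
print(paraphrase_consistency(["spare_a", "spare_a", "spare_b", "spare_a", "spare_a"]))  # 0.8
```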
Jailbreaking
Summary