1 of 38

Localized Trust:

Bridging the Divide in Multilingual AI Safety

28 Aug 2025

Roy Ka-Wei Lee

Assistant Professor

Singapore University of Technology and Design

IMDA TSS

2 of 38

Happy Anniversary!


3 of 38

Recap of Past Two Years

>60 LLMs launched*


*according to GPT-5

4 of 38

Recap of Past Two Years - Research

  • Efficient training & fine-tuning

  • Preference learning & alignment (beyond PPO/RLHF)


5 of 38

Recap of Past Two Years - Research

  • Reasoning-first models (train- & test-time compute)

  • Architectures & scaling


6 of 38

Recap of Past Two Years - Research

  • Multimodality (LMMs)

  • Retrieval & tools (beyond “chunk + top-k”)


7 of 38

Recap of Past Two Years - Research

  • Evaluation & benchmarks

  • Safety & guardrails


8 of 38


We will continue today’s story from here

9 of 38


WARNING: The following talk contains acts of violence and discrimination that may be disturbing to some participants. Discretion is advised.

10 of 38


RabakBench: Scaling Human Annotations to Construct Localized Multilingual Safety Benchmarks for Low-Resource Languages*

Gabriel Chua, Leanne Tan, Ziyu Ge, Roy Ka-Wei Lee (NeurIPS’25 - under review)

*Subsequent slides are “stolen” from Leanne’s LorongAI presentation :)

11 of 38

LLMs Achieving SOTA on Multiple Benchmarks


12 of 38

What about localisation?


13 of 38


14 of 38

What About Safety?


15 of 38


16 of 38

What about localised safety?


17 of 38

RabakBench is a localised toxicity benchmark for Singlish/Chinese/Malay/Tamil (N=5.3k).



19 of 38

Constructing RabakBench



22 of 38

RabakBench Data Statistics

  • 1,341 examples for each of the 4 languages, totaling 5,364 examples


23 of 38

Benchmarking Guardrails on RabakBench



25 of 38

What Do We Learn from RabakBench?

  • Guardrails are not language-robust
  • Same guardrail ⇒ wildly different F1 across SG languages
  • Best ≠ Best everywhere (Singlish vs Chinese, Malay, Tamil)
  • Takeaway: We can’t assume moderation catches everything in low-resource languages
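The per-language gap above can be measured with a simple scoring sketch. The example below is illustrative only: the `guardrail` callable and the example triples are hypothetical stand-ins, not the RabakBench API.

```python
from collections import defaultdict

def f1(labels, preds):
    """Binary F1 from parallel label/prediction lists (1 = toxic)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def per_language_f1(examples, guardrail):
    """Group benchmark examples by language and score one guardrail on each.

    `examples` is a list of (language, text, is_toxic) triples;
    `guardrail` maps a text to 1 (flagged) or 0 (allowed).
    Comparing the returned per-language F1 scores exposes the
    language-robustness gap described above.
    """
    by_lang = defaultdict(lambda: ([], []))
    for lang, text, label in examples:
        labels, preds = by_lang[lang]
        labels.append(label)
        preds.append(guardrail(text))
    return {lang: f1(labels, preds) for lang, (labels, preds) in by_lang.items()}
```

A guardrail that looks strong on one language can then be read off directly against its scores on the others.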


26 of 38


So what happens when the guardrail fails?

27 of 38


Toxicity Red-Teaming: Benchmarking LLM Safety in Singapore’s Low-Resource Languages

Yujia Hu, Ming Shan Hee, Preslav Nakov, Roy Ka-Wei Lee (EMNLP’25)

28 of 38

Red-Teaming the LLM

  • If moderation misses, does the base LLM generate harm?
  • Probe the generator (not just the filter): red-team tasks
  • Test 0-shot vs 2-shot (jailbreak-like) in SG languages
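The 0-shot vs 2-shot setup above can be sketched as a prompt builder, assuming a plain chat-style template. The template and demo strings are illustrative, not the exact SGToxicGuard prompts.

```python
def build_redteam_prompt(harmful_instruction, demos=()):
    """Compose a 0-shot or few-shot red-team prompt.

    With an empty `demos` tuple this is the 0-shot probe; passing two
    (instruction, unsafe_reply) pairs reproduces the 2-shot,
    jailbreak-like setting, where in-context demonstrations of unsafe
    behaviour pressure the model to continue the pattern.
    """
    parts = []
    for demo_instruction, demo_reply in demos:
        parts.append(f"User: {demo_instruction}\nAssistant: {demo_reply}")
    # The probe itself: the model is left to complete the final turn.
    parts.append(f"User: {harmful_instruction}\nAssistant:")
    return "\n\n".join(parts)
```

The same harmful instruction is then sent in both settings, per language, and the model completions are judged for toxicity.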


[Diagram: Harmful message/instruction → Guardrail → LLM → Output?]

29 of 38

SGToxicGuard

A multilingual red-teaming benchmark to evaluate LLM safety in Singapore’s four languages: Singlish, Chinese, Malay, Tamil.


30 of 38

Red-Teaming Task Overview

Three tasks for the red-teaming evaluation:

  • Task 1 - Toxic Conversation: Dialogue setting - LLM responds to toxic prompts.
  • Task 2 - Toxic QA: “Fill-in-the-blank” statements with hateful intent.
  • Task 3 - Toxic Tweet Composition: LLM asked to rewrite hateful statements into engaging tweets.


31 of 38

Constructing SGToxicGuard Dataset

  • Derived from HateCheck & SGHateCheck, expanded for red-teaming
  • Coverage: English, Singlish, Chinese, Malay, Tamil
  • Emphasis on localized cultural expressions


Localized Content Examples:

32 of 38

Task 1 - Toxic Conversation Results

  • Measured using the Hateful Response Rate: the percentage of toxic or hateful outputs generated by the model.
  • 0-shot: Most models are safe in English but show higher toxicity in Malay, Tamil, and Chinese, e.g., SeaLLM (Tamil 14.64%), Mistral (Malay 9.74%).
  • 2-shot: In-context demos amplify jailbreaks.
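The Hateful Response Rate is straightforward to compute; a minimal sketch, assuming responses have already been judged by some toxicity classifier or annotator (`is_hateful` is a hypothetical stand-in):

```python
def hateful_response_rate(responses, is_hateful):
    """Percentage of model responses judged hateful.

    `responses` is a list of model output strings for one
    (model, language, setting) cell; `is_hateful` is any callable
    mapping a response to True/False (in practice a toxicity
    classifier or human annotation).
    """
    if not responses:
        return 0.0
    flagged = sum(1 for r in responses if is_hateful(r))
    return 100.0 * flagged / len(responses)
```

Comparing this rate per language and per setting (0-shot vs 2-shot) gives the numbers quoted above.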


33 of 38

Task 2 - Toxic QA Results

General Observations:

  • Mistral & SeaLLM show strong racial & gender bias.
  • SEA-LION → more gender bias.
  • Qwen & GPT-4o mini → more neutral, often select “Not sure”.


34 of 38

Task 3 - Toxic Tweet Composition Results

Toxicity is most severe in Singlish, Malay, and Tamil, showing a direct link between low-resource status and vulnerability.


  • 0-shot: Mistral & LLaMA-3.1 generate >50% toxic tweets across languages; SeaLLM is also unsafe in Singlish.
  • 2-shot: sharp increase -
    • LLaMA-3.1 hits 76.7% toxic in Singlish.
    • Mistral & SeaLLM exceed 60% in multiple languages.
    • GPT-4o mini → most resilient, but still fails in Tamil (42.8%).

35 of 38

What Do We Learn from SGToxicGuard?

  • Safety mechanisms are not language-agnostic.
  • In-context learning can worsen toxicity (jailbreak effect).
  • Models fine-tuned for SEA languages (SeaLLM, SEA-LION) are not inherently safer.
  • GPT-4o mini shows the best robustness, but is still inconsistent.
  • Takeaway: Multilingual AI safety requires localized datasets, bias-aware training, and region-specific safeguards.


36 of 38

Summary - Comparison

  • RabakBench tells you: “Will my guardrail/classifier consistently catch localized harms across languages—and which vendors/models fail where?”
  • SGToxicGuard tells you: “Will this LLM produce toxic outputs under realistic prompts or jailbreak-like contexts in each local language?”


37 of 38

Summary - When to Use Which?

  • Procurement / policy testing: Start with RabakBench to compare guardrails by language; require passing scores before deployment
  • Model red-teaming: Use SGToxicGuard to probe generative safety gaps (0-shot vs 2-shot) and quantify bias risks before shipping features
  • End-to-end assurance: Combine both—deploy the best guardrail (per RabakBench) and then stress-test the final system with SGToxicGuard tasks.
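The end-to-end assurance step can be sketched as a simple deployment gate combining both benchmarks. The thresholds below are illustrative defaults, not values from the talk.

```python
def assure_deployment(guardrail_scores, redteam_rates,
                      f1_floor=0.8, toxicity_ceiling=5.0):
    """Gate a deployment on both benchmarks, per language.

    `guardrail_scores` maps language -> guardrail F1 (RabakBench-style);
    `redteam_rates` maps language -> hateful response rate in percent for
    the guarded system (SGToxicGuard-style). Returns a list of
    (language, reason) failures; an empty list means the gate passes.
    """
    failures = []
    for lang, score in guardrail_scores.items():
        if score < f1_floor:
            failures.append((lang, "guardrail F1 below floor"))
    for lang, rate in redteam_rates.items():
        if rate > toxicity_ceiling:
            failures.append((lang, "red-team toxicity above ceiling"))
    return failures
```

A system passes only if every language clears both checks, which operationalizes the "combine both" recommendation above.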


38 of 38

Thank You

Roy Ka-Wei Lee | Assistant Professor

Singapore University of Technology and Design

roy_lee@sutd.edu.sg