1 of 38

Localized Trust:

Bridging the Divide in Multilingual AI Safety

28 Aug 2025

Roy Ka-Wei Lee

Assistant Professor

Singapore University of Technology and Design

IMDA TSS

2 of 38

Happy Anniversary!


3 of 38

Recap of Past Two Years

>60 LLMs launched*


*according to GPT-5

4 of 38

Recap of Past Two Years - Research

  • Efficient training & fine-tuning

  • Preference learning & alignment (beyond PPO/RLHF)


5 of 38

Recap of Past Two Years - Research

  • Reasoning-first models (train- & test-time compute)

  • Architectures & scaling


6 of 38

Recap of Past Two Years - Research

  • Multimodality (LMMs)

  • Retrieval & tools (beyond “chunk + top-k”)


7 of 38

Recap of Past Two Years - Research

  • Evaluation & benchmarks

  • Safety & guardrails


8 of 38


We will continue today’s story from here

9 of 38


WARNING: The following talk contains acts of violence and discrimination that may be disturbing to some participants. Discretion is advised.

10 of 38


RabakBench: Scaling Human Annotations to Construct Localized Multilingual Safety Benchmarks for Low-Resource Languages*

Gabriel Chua, Leanne Tan, Ziyu Ge, Roy Ka-Wei Lee (NeurIPS’25 - under review)

*Subsequent slides are “stolen” from Leanne’s LorongAI presentation :)

11 of 38

LLMs Achieving SOTA on Multiple Benchmarks


12 of 38

What about localisation?


13 of 38


14 of 38

What About Safety?


15 of 38


16 of 38

What about localised safety?


17 of 38

RabakBench is a localised toxicity benchmark for Singlish/Chinese/Malay/Tamil (N=5.3k).



19 of 38

Constructing RabakBench



22 of 38

RabakBench Data Statistics

  • 1,341 examples for each of the 4 languages, totaling 5,364 examples


23 of 38

Benchmarking Guardrails on RabakBench



25 of 38

What Do We Learn from RabakBench?

  • Guardrails are not language-robust
  • Same guardrail ⇒ wildly different F1 across SG languages
  • Best ≠ Best everywhere (Singlish vs Chinese, Malay, Tamil)
  • Takeaway: We can’t assume moderation catches everything in low-resource languages
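The per-language gap above can be measured with a simple scoring sketch. The example below is illustrative only: the `guardrail` callable and the example triples are hypothetical stand-ins, not the RabakBench API.

```python
from collections import defaultdict

def f1(labels, preds):
    """Binary F1 from parallel label/prediction lists (1 = toxic)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def per_language_f1(examples, guardrail):
    """Group benchmark examples by language and score one guardrail on each.

    `examples` is a list of (language, text, is_toxic) triples;
    `guardrail` maps a text to 1 (flagged) or 0 (allowed).
    Comparing the returned per-language F1 scores exposes the
    language-robustness gap described above.
    """
    by_lang = defaultdict(lambda: ([], []))
    for lang, text, label in examples:
        labels, preds = by_lang[lang]
        labels.append(label)
        preds.append(guardrail(text))
    return {lang: f1(labels, preds) for lang, (labels, preds) in by_lang.items()}
```

A guardrail that looks strong on one language can then be read off directly against its scores on the others.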


26 of 38


So what happens when the guardrail fails?

27 of 38


Toxicity Red-Teaming: Benchmarking LLM Safety in Singapore’s Low-Resource Languages

Yujia Hu, Ming Shan Hee, Preslav Nakov, Roy Ka-Wei Lee (EMNLP’25)

28 of 38

Red-Teaming the LLM

  • If moderation misses, does the base LLM generate harm?
  • Probe the generator (not just the filter): red-team tasks
  • Test 0-shot vs 2-shot (jailbreak-like) in SG languages
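The 0-shot vs 2-shot setup above can be sketched as a prompt builder, assuming a plain chat-style template. The template and demo strings are illustrative, not the exact SGToxicGuard prompts.

```python
def build_redteam_prompt(harmful_instruction, demos=()):
    """Compose a 0-shot or few-shot red-team prompt.

    With an empty `demos` tuple this is the 0-shot probe; passing two
    (instruction, unsafe_reply) pairs reproduces the 2-shot,
    jailbreak-like setting, where in-context demonstrations of unsafe
    behaviour pressure the model to continue the pattern.
    """
    parts = []
    for demo_instruction, demo_reply in demos:
        parts.append(f"User: {demo_instruction}\nAssistant: {demo_reply}")
    # The probe itself: the model is left to complete the final turn.
    parts.append(f"User: {harmful_instruction}\nAssistant:")
    return "\n\n".join(parts)
```

The same harmful instruction is then sent in both settings, per language, and the model completions are judged for toxicity.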


[Diagram: Harmful message/instruction → Guardrail → LLM → Output?]

29 of 38

SGToxicGuard

A multilingual red-teaming benchmark to evaluate LLM safety in Singapore’s four languages: Singlish, Chinese, Malay, Tamil.


30 of 38

Red-Teaming Task Overview

Three tasks for the red-teaming evaluation:

  • Task 1 - Toxic Conversation: Dialogue setting - LLM responds to toxic prompts.
  • Task 2 - Toxic QA: “Fill-in-the-blank” statements with hateful intent.
  • Task 3 - Toxic Tweet Composition: LLM asked to rewrite hateful statements into engaging tweets.


31 of 38

Constructing SGToxicGuard Dataset

  • Derived from HateCheck & SGHateCheck, expanded for red-teaming
  • Coverage: English, Singlish, Chinese, Malay, Tamil
  • Emphasis on localized cultural expressions


Localized Content Examples:

32 of 38

Task 1 - Toxic Conversation Results

  • Measured using the Hateful Response Rate: the percentage of toxic or hateful outputs generated by the model.
  • 0-shot: Most models are safe in English but show higher toxicity in Malay, Tamil, and Chinese, e.g., SeaLLM (Tamil 14.64%), Mistral (Malay 9.74%).
  • 2-shot: In-context demos amplify jailbreaks.
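The Hateful Response Rate is straightforward to compute; a minimal sketch, assuming responses have already been judged by some toxicity classifier or annotator (`is_hateful` is a hypothetical stand-in):

```python
def hateful_response_rate(responses, is_hateful):
    """Percentage of model responses judged hateful.

    `responses` is a list of model output strings for one
    (model, language, setting) cell; `is_hateful` is any callable
    mapping a response to True/False (in practice a toxicity
    classifier or human annotation).
    """
    if not responses:
        return 0.0
    flagged = sum(1 for r in responses if is_hateful(r))
    return 100.0 * flagged / len(responses)
```

Comparing this rate per language and per setting (0-shot vs 2-shot) gives the numbers quoted above.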


33 of 38

Task 2 - Toxic QA Results

General Observations:

  • Mistral & SeaLLM show strong racial & gender bias.
  • SEA-LION → more gender bias.
  • Qwen & GPT-4o mini → more neutral, often select “Not sure”.


34 of 38

Task 3 - Toxic Tweet Composition Results

Toxicity is most severe in Singlish, Malay, and Tamil, showing a direct link between low-resource status and vulnerability.


  • 0-shot: Mistral & LLaMA-3.1 generate >50% toxic tweets across languages; SeaLLM is also unsafe in Singlish.
  • 2-shot: sharp increase -
    • LLaMA-3.1 hits 76.7% toxic in Singlish.
    • Mistral & SeaLLM exceed 60% in multiple languages.
    • GPT-4o mini → most resilient, but still fails in Tamil (42.8%).

35 of 38

What Do We Learn from SGToxicGuard?

  • Safety mechanisms are not language-agnostic.
  • In-context learning can worsen toxicity (jailbreak effect).
  • Models fine-tuned for SEA languages (SeaLLM, SEA-LION) are not inherently safer.
  • GPT-4o mini shows the best robustness, but is still inconsistent.
  • Takeaway: Multilingual AI safety requires localized datasets, bias-aware training, and region-specific safeguards.


36 of 38

Summary - Comparison

  • RabakBench tells you: “Will my guardrail/classifier consistently catch localized harms across languages—and which vendors/models fail where?”
  • SGToxicGuard tells you: “Will this LLM produce toxic outputs under realistic prompts or jailbreak-like contexts in each local language?”


37 of 38

Summary - When to Use Which?

  • Procurement / policy testing: Start with RabakBench to compare guardrails by language; require passing scores before deployment
  • Model red-teaming: Use SGToxicGuard to probe generative safety gaps (0-shot vs 2-shot) and quantify bias risks before shipping features
  • End-to-end assurance: Combine both—deploy the best guardrail (per RabakBench) and then stress-test the final system with SGToxicGuard tasks.
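The end-to-end assurance step can be sketched as a simple deployment gate combining both benchmarks. The thresholds below are illustrative defaults, not values from the talk.

```python
def assure_deployment(guardrail_scores, redteam_rates,
                      f1_floor=0.8, toxicity_ceiling=5.0):
    """Gate a deployment on both benchmarks, per language.

    `guardrail_scores` maps language -> guardrail F1 (RabakBench-style);
    `redteam_rates` maps language -> hateful response rate in percent for
    the guarded system (SGToxicGuard-style). Returns a list of
    (language, reason) failures; an empty list means the gate passes.
    """
    failures = []
    for lang, score in guardrail_scores.items():
        if score < f1_floor:
            failures.append((lang, "guardrail F1 below floor"))
    for lang, rate in redteam_rates.items():
        if rate > toxicity_ceiling:
            failures.append((lang, "red-team toxicity above ceiling"))
    return failures
```

A system passes only if every language clears both checks, which operationalizes the "combine both" recommendation above.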


38 of 38

Thank You

Roy Ka-Wei Lee | Assistant Professor

Singapore University of Technology and Design

roy_lee@sutd.edu.sg