1 of 15

Prompting Multilingual Large Language Models to Generate Code-Mixed Texts:

The Case of South East Asian Languages

Zheng-Xin Yong, Ruochen Zhang, Jessica Zosa Forde, Skyler Wang, Samuel Cahyawijaya, Holy Lovenia, Genta Indra Winata, Lintang Sutawika, Jan Christian Blaise Cruz, Long Phan, Yin Lin Tan, Alham Fikri Aji

7 December 2023

CALCS Workshop @ EMNLP 2023

2 of 15

Our Team


  • Alham F. Aji (MBZUAI)
  • Genta I. Winata (Bloomberg)
  • Samuel Cahyawijaya (HKUST)
  • Ruochen Zhang (Brown University)
  • Yin Lin Tan (Stanford University & NUS)
  • Jan C. Blaise Cruz (Samsung R&D Institute Philippines)
  • Holy Lovenia (AI Singapore)
  • Zheng-Xin Yong (Brown University)
  • Jessica Zosa Forde (Brown University)
  • Skyler Wang (UC Berkeley)
  • Lintang Sutawika (EleutherAI)
  • Long Phan (VietAI Research)
  • Thamar Solorio (MBZUAI)
  • Rowena Garcia (University of Potsdam)
  • Arjun Subramonian (UCLA)


3 of 15

Research Question

Code-mixing (or code-switching) is common in SEA, but code-mixed data are difficult to collect.

  • Code-mixing frequently occurs in colloquial settings and spoken communication
  • Consolidating code-mixed data across social media and digital messaging platforms may be curtailed by legal guardrails and scalability issues

Can we use multilingual LLMs to generate code-mixed data for SEA languages?


4 of 15

Method: Prompting LLMs

  • 5 topics: food, family, traffic, Artificial Intelligence, and weather
  • 7 languages: Chinese, Indonesian, Malay, Tagalog, Tamil, Vietnamese, and Singlish
  • 6 prompt templates
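
A rough illustration of this setup is sketched below: it enumerates one prompt per (template, language, topic) combination. The template wordings here are placeholders invented for illustration; the paper's six actual templates differ (see Yong et al., 2023).

```python
# Minimal sketch, assuming invented template wordings; the paper uses 6 prompt
# templates, 5 topics, and 7 languages, but the exact template text is not on this slide.
from itertools import product

TOPICS = ["food", "family", "traffic", "Artificial Intelligence", "weather"]
LANGUAGES = ["Chinese", "Indonesian", "Malay", "Tagalog", "Tamil", "Vietnamese", "Singlish"]
TEMPLATES = [  # hypothetical stand-ins for the real templates
    "Write a sentence that code-mixes English and {lang} about {topic}.",
    "Generate a short code-switched conversation in English and {lang} about {topic}.",
]

def build_prompts():
    """Yield one prompt per (template, language, topic) combination."""
    for template, lang, topic in product(TEMPLATES, LANGUAGES, TOPICS):
        yield template.format(lang=lang, topic=topic)

for prompt in list(build_prompts())[:3]:
    print(prompt)
```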


5 of 15

Method: LLM generation evaluation

  1. Level of code-mixing
    • 0 - None
    • 1 - Loanwords
    • 2 - Topic-related nouns
    • 3 - Linguistic elements (e.g., phrase, clause, conjugation, etc.)
  2. Naturalness
    • 1 - Not natural
    • 2 - Foreigners might say it, but not natives
    • 3 - Natives say it
  3. Accurateness
    • Appropriate response
    • Failure to follow instructions
    • Inaccurate explanations
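
A minimal sketch of how one human judgment under this rubric could be recorded is given below. The schema, field names, and example generation are assumptions for illustration, not the paper's actual annotation interface.

```python
# Sketch of one annotation record under the rubric above.
# The schema and the example text are illustrative assumptions only.
from dataclasses import dataclass
from enum import IntEnum

class CodeMixLevel(IntEnum):
    NONE = 0          # 0 - None
    LOANWORDS = 1     # 1 - Loanwords
    TOPIC_NOUNS = 2   # 2 - Topic-related nouns
    LINGUISTIC = 3    # 3 - Linguistic elements (phrase, clause, conjugation, ...)

class Naturalness(IntEnum):
    NOT_NATURAL = 1   # 1 - Not natural
    FOREIGNER = 2     # 2 - Foreigners might say it, but not natives
    NATIVE = 3        # 3 - Natives say it

@dataclass
class Annotation:
    generation: str
    code_mix_level: CodeMixLevel
    naturalness: Naturalness
    appropriate_response: bool    # accurateness checks
    followed_instructions: bool
    accurate_explanation: bool

# Illustrative record (invented text, not from the paper):
record = Annotation(
    generation="The traffic damn jialat today lah.",
    code_mix_level=CodeMixLevel.LINGUISTIC,
    naturalness=Naturalness.NATIVE,
    appropriate_response=True,
    followed_instructions=True,
    accurate_explanation=True,
)
print(record)
```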


6 of 15

Models to prompt

  1. ChatGPT
  2. InstructGPT (davinci-002)
  3. InstructGPT (davinci-003)
  4. FLAN-T5 XXL
  5. BLOOMZ
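
For the two open models on this list, a prompt can be sent through the Hugging Face transformers pipeline, as sketched below; the prompt text and generation settings are illustrative assumptions (ChatGPT and the InstructGPT variants are queried via the OpenAI API instead).

```python
# Sketch of prompting the two open models above; prompt and settings are illustrative.
from transformers import pipeline

prompt = "Write a short sentence that mixes English and Malay about food."

# FLAN-T5 XXL is an encoder-decoder model, so it uses the text2text-generation pipeline.
flan_t5 = pipeline("text2text-generation", model="google/flan-t5-xxl")
print(flan_t5(prompt, max_new_tokens=64)[0]["generated_text"])

# BLOOMZ is a decoder-only model, so it uses the text-generation pipeline.
bloomz = pipeline("text-generation", model="bigscience/bloomz")
print(bloomz(prompt, max_new_tokens=64)[0]["generated_text"])
```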


7 of 15

Results - Can LLMs generate code-mixed text?


8 of 15

Results - Can LLMs generate code-mixed text?

See our paper (Yong et al., 2023) for a breakdown of the analysis by topic, language, and prompt template.


9 of 15

Results - What influences the level of code-mixing?


10 of 15

Misunderstanding of the instruction when generating code-mixed conversations.


11 of 15

Results - Are the generated code-mixes natural?

Huge variance in naturalness.

Singlish outputs are mostly natural but contain some semantic inaccuracies

  • ChatGPT generated the incorrect expression “sotong and chilli sauce” to describe close family bonds, where “sotong” is a Malay word for “squid.”


12 of 15

Failure Cases - Failure to follow instructions and give correct explanations


13 of 15

Takeaways

  • ChatGPT has shown relative success in generating code-mixed texts, but we advise researchers to exercise heavy caution.
    • Singlish: we find that syntactically sound responses may contain semantic inaccuracies that are difficult for non-native speakers to detect.
  • Multilingual ≠ code-mixing ability
    • A concurrent work (Zhang et al., 2023) evaluates how multilingual LLMs perform on existing code-switching benchmarks. → Saturday, 9 Dec 2023 — 11AM @ Central 3
  • We cannot confidently identify how ChatGPT code-mixes due to its lack of transparency.
  • Generating code-mixed data with LLMs requires equipping them with both code-mixed text recognition and generation capabilities.

