Prompting Multilingual Large Language Models to Generate Code-Mixed Texts:
The Case of South East Asian Languages
Zheng-Xin Yong, Ruochen Zhang, Jessica Zosa Forde, Skyler Wang, Samuel Cahyawijaya, Holy Lovenia, Genta Indra Winata, Lintang Sutawika, Jan Christian Blaise Cruz, Long Phan, Yin Lin Tan, Alham Fikri Aji
7 December 2023
CALCS Workshop @ EMNLP 2023
Our Team
Alham F. Aji (MBZUAI)
Genta I. Winata (Bloomberg)
Samuel Cahyawijaya (HKUST)
Ruochen Zhang (Brown University)
Yin Lin Tan (Stanford University & NUS)
Jan C. Blaise Cruz (Samsung R&D Institute Philippines)
Holy Lovenia (AI Singapore)
Zheng-Xin Yong (Brown University)
Jessica Zosa Forde (Brown University)
Skyler Wang (UC Berkeley)
Lintang Sutawika (EleutherAI)
Long Phan (VietAI Research)
Thamar Solorio (MBZUAI)
Rowena Garcia (University of Potsdam)
Arjun Subramonian (UCLA)
Research Question
Code-mixing (or code-switching) is common in South East Asia (SEA), but code-mixed data are difficult to collect.
Can we use multilingual LLMs to generate code-mixed data for SEA languages?
Method: Prompting LLMs
5 topics: food, family, traffic, Artificial Intelligence, and weather. 7 languages: Chinese, Indonesian, Malay, Tagalog, Tamil, Vietnamese, Singlish. 6 prompt templates.
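The prompt set described above can be sketched as a grid over topics, languages, and templates. This is a minimal illustration, not the authors' code: the six template strings are hypothetical placeholders (the slide gives only their count), and crossing every template with every topic and language is an assumption, so the number of prompts used in the paper may differ.

```python
from itertools import product

# Topics and languages as listed on the slide.
TOPICS = ["food", "family", "traffic", "Artificial Intelligence", "weather"]
LANGUAGES = [
    "Chinese", "Indonesian", "Malay", "Tagalog",
    "Tamil", "Vietnamese", "Singlish",
]

# Hypothetical placeholders: the slide states that there are 6 templates
# but does not give their wording.
TEMPLATES = [
    f"<template {i}> topic={{topic}} language={{language}}"
    for i in range(1, 7)
]


def build_prompts():
    """Return one record per (topic, language, template) combination.

    Assumes a full cross product, which the slide does not confirm.
    """
    return [
        {
            "topic": topic,
            "language": language,
            "template_id": template_id,
            "prompt": template.format(topic=topic, language=language),
        }
        for topic, language, (template_id, template) in product(
            TOPICS, LANGUAGES, enumerate(TEMPLATES, start=1)
        )
    ]


if __name__ == "__main__":
    prompts = build_prompts()
    print(len(prompts))
    print(prompts[0]["prompt"])
```

Keeping the topic, language, and template id next to each prompt string makes it straightforward to break results down along those three axes later.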
Method: LLM generation evaluation
1. Level of code-mixing
   0 - None
   1 - Loanwords
   2 - Topic-related nouns
   3 - Linguistic elements (e.g., phrase, clause, conjugation, etc.)
2. Naturalness
   1 - Not natural
   2 - Foreigners might say it, but not natives
   3 - Natives say it
3. Accurateness
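One way to record annotations against this rubric is with small enumerations, sketched below. The class, field, and function names are hypothetical and chosen for illustration only. The accurateness criterion is left out because the slide gives no scale for it.

```python
from collections import Counter
from dataclasses import dataclass
from enum import IntEnum


class CodeMixLevel(IntEnum):
    """Level of code-mixing, scored 0 to 3 as on the slide."""
    NONE = 0
    LOANWORDS = 1
    TOPIC_RELATED_NOUNS = 2
    LINGUISTIC_ELEMENTS = 3  # e.g. phrase, clause, conjugation


class Naturalness(IntEnum):
    """Naturalness, scored 1 to 3 as on the slide."""
    NOT_NATURAL = 1
    FOREIGNERS_MIGHT_SAY_IT = 2
    NATIVES_SAY_IT = 3


@dataclass(frozen=True)
class Annotation:
    """One annotator judgment for one generated text (hypothetical schema)."""
    model: str
    language: str
    code_mix_level: CodeMixLevel
    naturalness: Naturalness


def level_counts(annotations):
    """Count annotations per code-mixing level, including empty levels."""
    counts = Counter(a.code_mix_level for a in annotations)
    return {level: counts.get(level, 0) for level in CodeMixLevel}
```

Using `IntEnum` keeps the numeric scores comparable (for example, `level > CodeMixLevel.NONE`) while still giving each score a readable name.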
Models to prompt
1. ChatGPT
2. InstructGPT (davinci-002)
3. InstructGPT (davinci-003)
4. FLAN T5 XXL
5. BLOOMZ
Results - Can LLMs generate code-mixed text?
See our paper (Yong et al., 2023) for a breakdown of the results by topic, language, and template.
Results - What influences the level of code-mixing?
Misunderstanding of the instruction when generating code-mixed conversations.
Results - Are the generated code-mixes natural?
Naturalness varies widely.
Generated Singlish is mostly natural but contains some semantic inaccuracies.
Failure Cases - Failure to follow instructions and give correct explanations
Takeaways