Word Predictability on Points of Code-switching
Billy Gao¹, Ariel Chan², Yanting Li³
¹Department of Computer Science, Stanford University; ²Department of Linguistics, Stanford University; ³Department of Language Science, UC Irvine
Code-switching & Predictability
Materials
Methods & Analysis
Results
Less predictable words are more likely to be code-switched (CSed) during production [1, 2].
Discussions & Future Work
Discussions:
Future Work:
Matrix Language Frame (MLF) Model (Myers-Scotton, 1997)
Research Question
Does the predictability of the CSed element in the embedded language, relative to the matrix language, affect the likelihood of CS?
Hypothesis: Given the preceding context, CSed elements are more predictable in the embedded language than in the matrix language.
Corpus
Cantonese-English conversational speech data (Chan, 2023)
40 bilinguals from 3 diasporic communities (all L1 Cantonese, L2 English): homeland, immersed, and heritage
Language Model
xlm-roberta-large (Conneau et al., 2019), a transformer-based multilingual masked language model
(1) Original sentence:
但係 如果 係 international school 好似 我 咁 就 未必 得 囉 。
(2) Monolingual translation to the matrix language:
但係 如果 係 國際 學校 好似 我 咁 就 未必 得 囉 。
(3) Monolingual translation to the embedded language:
But if it's an international school like mine it might not necessarily be possible.
(4) Predictability in the matrix language:
p(國際學校 | 但係如果係)
(5) Predictability in the embedded language:
p(international school | But if it's an)
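Steps (4) and (5) each score a multi-word span given its left context. A minimal sketch of that comparison follows, assuming the span probability is decomposed with the chain rule into per-token conditional probabilities. The `TokenScorer` interface, the `toy_scorer` function, and every probability value are hypothetical placeholders for illustration, not outputs of xlm-roberta-large and not necessarily the scoring procedure used here.

```python
import math
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical interface: given a left context and the tokens of a span,
# return p(token_i | context, token_1..token_{i-1}) for each span token.
TokenScorer = Callable[[str, Sequence[str]], Sequence[float]]


def span_log_prob(context: str, span: Sequence[str], scorer: TokenScorer) -> float:
    """Log-probability of a span given its left context, via the chain rule."""
    probs = scorer(context, span)
    if len(probs) != len(span):
        raise ValueError("scorer must return one probability per span token")
    return sum(math.log(p) for p in probs)


# Made-up probabilities standing in for a real language model (assumption).
TOY_PROBS: Dict[Tuple[str, Tuple[str, ...]], Sequence[float]] = {
    ("但係 如果 係", ("國際", "學校")): [0.02, 0.5],
    ("But if it's an", ("international", "school")): [0.05, 0.6],
}


def toy_scorer(context: str, span: Sequence[str]) -> Sequence[float]:
    """Look up the placeholder probabilities for a (context, span) pair."""
    return TOY_PROBS[(context, tuple(span))]


# (4) Predictability in the matrix language (Cantonese translation).
matrix_lp = span_log_prob("但係 如果 係", ["國際", "學校"], toy_scorer)
# (5) Predictability in the embedded language (English translation).
embedded_lp = span_log_prob("But if it's an", ["international", "school"], toy_scorer)

# A positive difference means the span is more predictable in the embedded language.
print(embedded_lp - matrix_lp)
```

Working in log space keeps the product of small probabilities numerically stable, and one (matrix, embedded) pair per CS instance is the input a paired comparison needs.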
A paired t-test comparing matrix and embedded predictability over the entire dataset (n = 1,639) shows that matrix predictability is significantly lower (t = −4.5658, p < .001).
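The paired comparison can be sketched with the Python standard library alone. The function names `paired_t` and `approx_two_sided_p` are illustrative, the sample values are made up, and the p-value uses a normal approximation to the t distribution, which is an assumption that is only reasonable for large n such as the 1,639 pairs here. A statistics package (for example `scipy.stats.ttest_rel`) would give the exact t-based p-value.

```python
import math
from statistics import NormalDist, mean, stdev
from typing import Sequence, Tuple


def paired_t(matrix: Sequence[float], embedded: Sequence[float]) -> Tuple[float, int]:
    """Return (t statistic, degrees of freedom) for paired samples.

    Differences are taken as matrix - embedded, so a negative t means
    matrix predictability is lower on average.
    """
    if len(matrix) != len(embedded) or len(matrix) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [m - e for m, e in zip(matrix, embedded)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance")
    n = len(diffs)
    return mean(diffs) / (sd / math.sqrt(n)), n - 1


def approx_two_sided_p(t: float) -> float:
    """Two-sided p-value under a normal approximation (large-n assumption)."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))


# Made-up log-probabilities for four CS instances (illustration only).
toy_matrix = [-5.0, -4.0, -6.0, -3.0]
toy_embedded = [-4.0, -3.5, -4.0, -3.0]
t_stat, df = paired_t(toy_matrix, toy_embedded)
print(t_stat, df)

# Sanity check on the reported statistic: t = -4.5658 falls well below p = .001.
print(approx_two_sided_p(-4.5658) < 0.001)
```

The same function can be applied separately to the Cantonese-matrix and English-matrix subsets to reproduce a by-matrix-language comparison.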
Paired t-test by matrix language:
Cantonese as matrix language: n = 1,350
English as matrix language: n = 289
Key findings: CSed elements were more predictable in the embedded language than in the matrix language, particularly when Cantonese was the matrix language.
Over 80% of the CS instances (1,350 of 1,639, about 82%) have Cantonese as the matrix language.
Selected References
[1] Myslín, M., & Levy, R. (2015). Code-switching and predictability of meaning in discourse. Language, 91(4), 871-905.
[2] Calvillo, J., Fang, L., Cole, J., & Reitter, D. (2020, November). Surprisal predicts code-switching in Chinese-English bilingual text. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 4029-4039).
[3] de Bruin, A., & Shiron, V. (2024). Putting language switching in context: Effects of sentence context and interlocutors on bilingual switching. Journal of Experimental Psychology: Learning, Memory, and Cognition, 50(7), 1112.
[4] Bhattacharya, D., & van Schijndel, M. (2024). Code-switching in text and speech reveals information-theoretic audience design. arXiv preprint arXiv:2408.04596.
[5] Chan, A. S. L. (2023). The Diaspora of Bilinguals: Code-Switching in Three Groups of Cantonese-English Bilinguals. University of California, Los Angeles.