
Word Predictability on Points of Code-switching

Billy Gao¹, Ariel Chan², Yanting Li³

¹ Department of Computer Science, Stanford University
² Department of Linguistics, Stanford University
³ Department of Language Science, UC Irvine

Code-switching & Predictability

Materials

Methods & Analysis

Results

Less predictable words are more likely to be code-switched (CSed) during production:

  • Corpus analysis (Myslín & Levy, 2015; Calvillo et al., 2020; Bhattacharya & van Schijndel, 2024)
  • Experiments (de Bruin & Shiron, 2024)

Discussion & Future Work

Discussion:

  • The results suggest that bilinguals code-switch to reduce cognitive effort during speech production, which favors the speaker-oriented account
  • Novelty: this study examines the effect of predictability on bidirectional CS within the MLF framework

Future Work:

  • Increase the sample size for sentences where English is the matrix language
  • Include additional factors (e.g., word length, word frequency, part of speech)

Matrix Language Frame (MLF) Model (Myers-Scotton, 1997)

  • Matrix language: providing the grammatical framework
  • Embedded language: inserted in the form of words or phrases

Research Question

Does the predictability of the CSed element in the embedded language, relative to the matrix language, affect the likelihood of CS?

Hypothesis: CSed elements are more predictable in the embedded language than in the matrix language given context.

Corpus

Cantonese-English conversational speech data (Chan, 2023)

40 bilinguals (all L1 Cantonese, L2 English) from 3 diasporic communities: homeland, immersed, and heritage

Language Model

xlm-roberta-large (Conneau et al., 2019), a transformer-based multilingual masked language model

  1. Identified CS sentences
  2. Determined the matrix vs. embedded languages
  3. Obtained monolingual translations (2) and (3) using GPT-4 (Achiam et al., 2023)
  4. Filtered out sentences whose two translations had drastically different word orders
  5. Calculated the predictability of the CSed element in the matrix language (4) and in the embedded language (5) using xlm-roberta-large (Conneau et al., 2019)
  6. Compared the two predictabilities
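Step 1 above could be prototyped with a simple script-based heuristic. The sketch below is an illustration only, not the procedure used for this corpus: it assumes whitespace-tokenized input, treats any token containing a CJK character as Cantonese and any token containing an ASCII letter as English, and uses hypothetical function names (`token_script`, `latin_runs`, `is_code_switched`). It does not decide the matrix language, which under the MLF model depends on the grammatical frame rather than on token counts.

```python
def token_script(token):
    """Classify a token as 'cjk', 'latin', or 'other' (e.g. punctuation)."""
    if any("\u4e00" <= ch <= "\u9fff" for ch in token):
        return "cjk"
    if any(ch.isascii() and ch.isalpha() for ch in token):
        return "latin"
    return "other"


def is_code_switched(tokens):
    """True if the sentence contains both CJK and Latin-script tokens."""
    scripts = {token_script(tok) for tok in tokens}
    return {"cjk", "latin"} <= scripts


def latin_runs(tokens):
    """Return contiguous runs of Latin-script tokens as candidate English spans.

    Any non-Latin token (including punctuation) ends the current run.
    """
    runs, current = [], []
    for tok in tokens:
        if token_script(tok) == "latin":
            current.append(tok)
        elif current:
            runs.append(" ".join(current))
            current = []
    if current:
        runs.append(" ".join(current))
    return runs


# Example sentence (1) from this poster, already whitespace-tokenized.
tokens = "但係 如果 係 international school 好似 我 咁 就 未必 得 囉 。".split()
print(is_code_switched(tokens))
print(latin_runs(tokens))
```

A heuristic like this only flags candidates; sentences with English as the matrix language would need the mirror-image check on CJK runs.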

(1) Original sentence:

但係 如果 係 international school 好似 我 咁 就 未必 得 囉 。

(2) Monolingual translation to the matrix language:

但係 如果 係 國際 學校 好似 我 咁 就 未必 得 囉 。

(3) Monolingual translation to the embedded language:

But if it's an international school like mine it might not necessarily be possible.

(4) Predictability in the matrix language:

p(國際學校 | 但係如果係)

(5) Predictability in the embedded language:

p(international school | But if it's an)
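The CSed element in (4) and (5) spans more than one token, so its probability has to be assembled from per-token scores. The poster does not state how this was done with xlm-roberta-large; one common choice, shown here purely as an assumption, is the chain rule over the span, summing log-probabilities left to right. The scorer `next_token_prob` is a hypothetical callback standing in for whatever model query is used, and the toy probabilities are made up for illustration.

```python
import math


def span_log_prob(context, span_tokens, next_token_prob):
    """Log-probability of a multi-token span given its left context.

    Applies the chain rule: log p(w1..wk | c) = sum_i log p(wi | c, w1..w(i-1)).
    `next_token_prob(prefix, token)` is a hypothetical scorer returning a
    probability in (0, 1]; it is not part of any real library.
    """
    prefix = list(context)
    total = 0.0
    for tok in span_tokens:
        p = next_token_prob(prefix, tok)
        if p <= 0.0:
            raise ValueError(f"non-positive probability for token {tok!r}")
        total += math.log(p)
        prefix.append(tok)  # later tokens are conditioned on earlier ones
    return total


def toy_scorer(prefix, token):
    # Made-up probabilities, used only to exercise the function.
    return {"international": 0.5, "school": 0.25}[token]


context = ["But", "if", "it's", "an"]
log_p = span_log_prob(context, ["international", "school"], toy_scorer)
print(math.exp(log_p))
```

Working in log space avoids underflow for longer spans and makes the matrix and embedded scores directly comparable as sums.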

Paired t-test comparing matrix vs. embedded predictability over the entire dataset (n = 1,639, t = −4.5658, p < .001): matrix predictability is significantly lower.
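For readers unfamiliar with the test, the paired t statistic can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the analysis code behind the reported result: it assumes each sentence contributes one matrix score and one embedded score, and that differences are taken as matrix minus embedded, which matches the negative sign of the reported t. The function name `paired_t` and the toy numbers are hypothetical.

```python
import math
import statistics


def paired_t(matrix_scores, embedded_scores):
    """Return (t, degrees of freedom) for a paired t-test.

    Differences are computed as matrix - embedded, so a negative t means
    the matrix-language scores are lower on average.
    """
    if len(matrix_scores) != len(embedded_scores):
        raise ValueError("paired samples must have equal length")
    diffs = [m - e for m, e in zip(matrix_scores, embedded_scores)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t = mean_d / (sd_d / math.sqrt(n))
    return t, n - 1


# Toy scores, made up for illustration only.
matrix = [0.1, 0.2, 0.3, 0.2]
embedded = [0.3, 0.3, 0.6, 0.4]
t, df = paired_t(matrix, embedded)
print(t, df)
```

The p-value would then come from the t distribution with n − 1 degrees of freedom, for example via a statistics package.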

Paired t-tests by matrix language (figure not reproduced): Cantonese as matrix language, n = 1,350; English as matrix language, n = 289.

Key findings: CSed elements were more predictable in the embedded language than in the matrix language, particularly when Cantonese was the matrix language.

Over 80% of the CS instances have Cantonese as the matrix language.

Selected References

[1] Myslín, M., & Levy, R. (2015). Code-switching and predictability of meaning in discourse. Language, 91(4), 871-905.
[2] Calvillo, J., Fang, L., Cole, J., & Reitter, D. (2020, November). Surprisal predicts code-switching in Chinese-English bilingual text. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 4029-4039).
[3] de Bruin, A., & Shiron, V. (2024). Putting language switching in context: Effects of sentence context and interlocutors on bilingual switching. Journal of Experimental Psychology: Learning, Memory, and Cognition, 50(7), 1112.
[4] Bhattacharya, D., & van Schijndel, M. (2024). Code-switching in text and speech reveals information-theoretic audience design. arXiv preprint arXiv:2408.04596.
[5] Chan, A. S. L. (2023). The Diaspora of Bilinguals: Code-Switching in Three Groups of Cantonese-English Bilinguals. University of California, Los Angeles.