RFP: On The Bias Towards Base-10 Numbers In Language Models
- I recently ran a non-scientific experiment in which a BERT-based NMT model was used to translate non-Base-10 number words.
- The usual transfer-learning methods did not work well.
- Due to very limited examples and limited time, the experiment was not rigorous.
- Proposal: create two artificial languages, each with a different method of forming number words, and put both through NMT models using the same training process.
- Language A has Base-10 numbers, while Language B has Base-24 numbers.
- Two different but internally consistent methods of “generating” number words (e.g. “twenty four” vs “four and twenty”); see the generation sketch after this list.
- The architecture should be something like Zhu, Xia et al. (2020), but far simpler (perhaps just a single decoder layer?); see the model sketch after this list.
- Questions
- Is my initial finding that BERT models have a base-10 number bias valid?
- Does this finding also apply to other big language models?
- What if we trained from scratch (i.e. is this bias baked into modern NMT architectures)?
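As a concrete illustration of the number-word generation step, here is a minimal Python sketch. Everything in it is an assumption for illustration: the digit roots for Language B (“kaa”, “kab”, …) are invented placeholders, and `to_words` simply encodes the two ordering rules (“twenty four” vs “four and twenty”) for numbers below base².

```python
# Minimal sketch (not from the original note): generate number words for the two
# artificial languages. All word roots here are invented placeholders; the only
# requirement is that each language uses one base and one consistent ordering rule.

DIGITS_B10 = ["zero", "one", "two", "three", "four",
              "five", "six", "seven", "eight", "nine"]
# Hypothetical roots for the 24 digits of Language B: "kaa", "kab", ..., "kax".
DIGITS_B24 = ["ka" + chr(ord("a") + i) for i in range(24)]

def to_words(n, base, digits, order="high-first"):
    """Spell 0 <= n < base**2 either "twenty four"-style (high-first)
    or "four and twenty"-style (low-first)."""
    hi, lo = divmod(n, base)
    if hi == 0:
        return digits[lo]
    tens = digits[hi] + "-ty"              # e.g. "two-ty" standing in for "twenty"
    if lo == 0:
        return tens
    if order == "high-first":
        return f"{tens} {digits[lo]}"      # "twenty four" style
    return f"{digits[lo]} and {tens}"      # "four and twenty" style

# Example parallel lines: Language A (base 10, high-first) vs Language B (base 24, low-first).
for n in (7, 24, 95):
    print(to_words(n, 10, DIGITS_B10, "high-first"),
          "|||",
          to_words(n, 24, DIGITS_B24, "low-first"))
```

Pairs like these could be emitted as a synthetic parallel corpus, with the same pipeline applied to both languages so that only the base and the ordering rule differ between conditions.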
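And a rough sketch of the kind of simplified model the architecture bullet has in mind: a frozen, pretrained BERT encoder (via HuggingFace `transformers`) feeding a single PyTorch `TransformerDecoderLayer`. This is only an assumption about what “far simpler than Zhu, Xia et al. (2020)” might look like; the class name `TinyBertNMT` and all hyperparameters are placeholders, and the BERT-fused attention of that paper is not reproduced here.

```python
# Rough sketch (an assumption, not the proposed implementation): a frozen BERT
# encoder feeding a single Transformer decoder layer, i.e. a drastically
# simplified stand-in for the model of Zhu, Xia et al. (2020).
import torch
import torch.nn as nn
from transformers import BertModel

class TinyBertNMT(nn.Module):
    def __init__(self, tgt_vocab_size, d_model=768, nhead=8):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        for p in self.bert.parameters():
            p.requires_grad = False         # keep the pretrained encoder frozen
        self.tgt_embed = nn.Embedding(tgt_vocab_size, d_model)
        self.decoder = nn.TransformerDecoderLayer(d_model=d_model, nhead=nhead,
                                                  batch_first=True)
        self.out = nn.Linear(d_model, tgt_vocab_size)

    def forward(self, src_ids, src_mask, tgt_ids):
        # BERT encodes the source number words; gradients do not flow into it.
        memory = self.bert(input_ids=src_ids, attention_mask=src_mask).last_hidden_state
        tgt = self.tgt_embed(tgt_ids)
        causal = nn.Transformer.generate_square_subsequent_mask(tgt_ids.size(1)).to(tgt.device)
        dec = self.decoder(tgt, memory, tgt_mask=causal)
        return self.out(dec)                # logits over the target vocabulary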
Requester: Xuanyi Chew, chewxy@gmail.com