
Post-doctoral position

Deep learning for speech perception

Application deadline: rolling basis        Location: Université de Toulon, La Garde

Start date: around the beginning of 2024        Duration: 1 year (extendable up to 2.5 years)

Supervisor: Prof. Ricard Marxer <ricard.marxer@lis-lab.fr>

Apply: HERE


Context

Recent deep learning (DL) developments have been key to breakthroughs in many artificial intelligence (AI) tasks such as automatic speech recognition (ASR) [1] and speech enhancement [2]. In the past decade, the performance of such systems on reference corpora has consistently increased, driven by improvements in data modeling and representation learning techniques. However, our understanding of human speech perception has not benefited from such advancements. This position is part of a project that proposes to gain knowledge about our perception of speech by means of large-scale data-driven modeling and statistical methods. By leveraging modern deep learning techniques and exploiting large corpora of data, we aim to build models capable of predicting human comprehension of speech at a higher level of detail than any existing approach [3].

This post-doc position is funded by the ANR JCJC project MIM (Microscopic Intelligibility Modeling). It aims at exploiting AI methods for predicting and describing speech perception at the stimulus, listener and sub-word levels. The project also comprises a PhD fellow who will work closely with the post-doctoral researcher and the principal investigator.

Subject

The main role of the post-doc is to investigate and propose models that predict listeners' responses to noisy speech stimuli. We will target predictions at different levels of granularity, such as predicting the type of confusion, which phones are misperceived, or how a particular phone is confused. Existing corpora of such data are available and will be used. The recruited researcher will therefore be able to focus on the development and analysis of the models.

In the MIM project, we focus on a corpus of consistent confusions: speech-in-noise stimuli that evoke the same misrecognition among multiple listeners. In order to simplify this first approach to microscopic intelligibility prediction, we will restrict ourselves to single-word data. This should reduce the lexical factors to aspects such as usage frequency and neighborhood density, significantly limiting the complexity of the required language model. Consistent confusions are valuable experimental data about the human speech perception process. They provide targets that show how intelligibility models should differ from automatic speech recognition (ASR) systems: while ASR models are optimised to recognise what has been uttered, the proposed models should output what has been perceived by a set of listeners.
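To make the notion of a consistent confusion concrete, here is a toy sketch in Python; the field names and example values are purely illustrative and do not reflect the actual corpus format. Each record essentially pairs a noisy single-word stimulus with the word that was uttered and the word that listeners consistently reported hearing.

# Toy illustration of a consistent-confusion record (hypothetical fields,
# not the actual corpus schema).
from dataclasses import dataclass

@dataclass
class ConsistentConfusion:
    uttered_word: str          # word actually spoken, e.g. "pin"
    perceived_word: str        # word most listeners reported, e.g. "bin"
    snr_db: float              # signal-to-noise ratio of the stimulus
    audio_path: str            # path to the noisy single-word stimulus
    listener_agreement: float  # fraction of listeners reporting perceived_word

example = ConsistentConfusion("pin", "bin", -6.0, "stimuli/pin_snr-6.wav", 0.8)

Whereas an ASR system would be rewarded for outputting uttered_word, the models targeted here should predict perceived_word and, at finer granularity, which phones drive the confusion.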

Several models regularly used in speech recognition tasks will be trained and evaluated on predicting the misperceptions in the consistent confusion corpora. We will first focus on well-established models such as GMM-HMMs and/or simple deep learning architectures. Advanced neural topologies such as TDNNs, CTC-based or attention-based models will also be explored, even though the relatively small amount of training data in the corpora is likely to be a limiting factor. As a starting point we envisage solving the three tasks described in [3].
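As a rough illustration of how the training objective departs from standard ASR, the following PyTorch sketch trains a small CTC model whose targets are the phone sequences reported by listeners rather than the canonical pronunciations. The BiLSTM topology, feature dimensions and phone inventory size are placeholders chosen for brevity, not project decisions.

# Minimal PyTorch sketch of a CTC model trained on *perceived* phone
# sequences (dimensions, phone inventory and topology are placeholders).
import torch
import torch.nn as nn

NUM_PHONES = 40              # hypothetical phone inventory size
BLANK = NUM_PHONES           # CTC blank index

class PerceptionCTC(nn.Module):
    """Small BiLSTM acoustic model emitting per-frame phone posteriors."""
    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        self.encoder = nn.LSTM(n_mels, hidden, num_layers=3,
                               bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * hidden, NUM_PHONES + 1)   # +1 for blank

    def forward(self, feats):                  # feats: (batch, frames, n_mels)
        h, _ = self.encoder(feats)
        return self.head(h).log_softmax(-1)    # (batch, frames, NUM_PHONES + 1)

model = PerceptionCTC()
ctc_loss = nn.CTCLoss(blank=BLANK, zero_infinity=True)

# Toy batch: 2 noisy single-word stimuli of 120 log-mel frames each, paired
# with the phone sequences that listeners consistently *reported*, not the
# canonical pronunciations of the uttered words.
feats = torch.randn(2, 120, 80)
perceived = torch.tensor([3, 7, 12, 9, 1, 4, 15, 2])    # concatenated targets
target_lengths = torch.tensor([4, 4])
input_lengths = torch.full((2,), 120)

log_probs = model(feats).transpose(0, 1)     # CTCLoss expects (frames, batch, C)
loss = ctc_loss(log_probs, perceived, input_lengths, target_lengths)
loss.backward()

Replacing the perceived targets with the uttered phone sequences recovers a conventional ASR recipe, which makes direct comparisons between the two objectives straightforward.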

The work of the selected candidate will focus on deep learning approaches, and more specifically on self-supervised or semi-supervised techniques. Several research directions will be explored.
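For instance, one possible self-supervised starting point, sketched below, uses a pretrained wav2vec 2.0 model (here via torchaudio) as a frozen front-end and a lightweight probe that predicts, for each phone of the stimulus, whether listeners will misperceive it. The synthetic waveform, frame boundaries and the probe itself are hypothetical and only serve to illustrate the idea; the probe would of course be trained on the confusion data.

# Illustrative sketch: frozen self-supervised features + a tiny per-phone probe.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE              # pretrained wav2vec 2.0
ssl_model = bundle.get_model().eval()

# Stand-in for a noisy single-word stimulus; in practice the waveform would be
# loaded from the corpus and resampled to bundle.sample_rate.
waveform = torch.randn(1, int(bundle.sample_rate))       # ~1 second of audio

with torch.no_grad():
    layers, _ = ssl_model.extract_features(waveform)     # one tensor per layer
features = layers[-1]                                     # (1, frames, dim)

# Pool the frames belonging to each phone (alignments assumed available from
# the corpus) and score each phone for being misperceived.
probe = torch.nn.Linear(features.shape[-1], 1)
phone_segments = [(0, 15), (15, 30), (30, 45)]            # toy frame boundaries
pooled = torch.stack([features[0, a:b].mean(dim=0) for a, b in phone_segments])
p_misperceived = torch.sigmoid(probe(pooled)).squeeze(-1)
print(p_misperceived)                                      # one score per phone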

Profile

The candidate shall have the following profile:


Monthly Salary

Depends on experience. The salary is compatible with the cost of living in Toulon (e.g., rent prices are 50% lower in Toulon than in Paris; more info).

Environment


DYNI is a team of the LIS laboratory (UMR 7020 CNRS) at the University of Toulon. The team is composed of 5 faculty members, including a chair in AI and 2 AI ANR JCJC laureates. The team also comprises 5 post-docs and 6 PhD students.

Ricard Marxer (supervisor of this position) heads the team (dyni.pages.lis-lab.fr) and is the director of an Erasmus Mundus Joint Master's degree, MIR, in AI and robotics (www.master-mir.eu).

References

  1. Barker, J., Marxer, R., Vincent, E., & Watanabe, S. (2017). The third ’CHiME’ speech separation and recognition challenge: Analysis and outcomes. Computer Speech & Language, 46, 605–626.
  2. Marxer, R., & Barker, J. (2017). Binary Mask Estimation Strategies for Constrained Imputation-Based Speech Enhancement. In Proc. Interspeech 2017 (pp. 1988–1992).
  3. Marxer, R., Cooke, M., & Barker, J. (2015). A framework for the evaluation of microscopic intelligibility models. In Proc. Interspeech 2015 (pp. 2558–2562).

Laboratoire d’Informatique et Systèmes - LIS – UMR CNRS 7020 – Université de Toulon

Campus de la Garde – Bât X – CS 60584 – 83041 TOULON Cedex 09

Tel.: +33 (0)4 94 14 28 33 - www.lis-lab.fr