1 of 12

AI model for Speech Annotation

Beijie Liu, Rodrigo Eguiluz Ortiz Duran

Mentor: Prof. Emily Mower Provost

2 of 12

How can an AI model help?

So you can (revision) you can change them if you want

Speech and language changes are common in health conditions such as Parkinson’s, Huntington’s, and Alzheimer’s disease. Some audio features, like pauses and word use, can serve as indicators of these conditions.
Clinicians need to transcribe audio to keep better medical records and provide better patient care.
This process is manual and time-consuming.
Because the process does not scale efficiently, it limits the scope of investigations.

3 of 12

Project Workflow

Collaborate with clinicians

Identify key features

Implement the feature

Get feedback

4 of 12

Some audio features

So you can (revision) you can change them if you want

Audio features are measurable properties of an audio signal, such as the pauses in a recording.

5 of 12

Table 1: Tetzloff, K. A., Utianski, R. L., Duffy, J. R., Clark, H. M., Strand, E. A., Josephs, K. A., & Whitwell, J. L. (2018). Quantitative analysis of agrammatism in agrammatic primary progressive aphasia and dominant apraxia of speech. Journal of Speech, Language, and Hearing Research, 61(9), 2337-2346.

Table 2: Catricalà, E., Boschi, V., Cuoco, S., Galiano, F., Picillo, M., Gobbi, E., ... & Cappa, S. F. (2019). The language profile of progressive supranuclear palsy. Cortex, 115, 294-308.

6 of 12

Web Application

Clear Upload Window
Clinician-Approved
Search Engine for Features
Exportable Excel Output

7 of 12

Three AI Models

Automatic Speech Recognition (ASR)

Transcription model that converts speech to text.

Text to Features

Model for understanding and annotating the context of the text.

Audio to Features

Model for processing raw audio data for improved transcription accuracy.

Audio Analysis

Romana, A., Koishida, K., & Provost, E. M. (2023). Automatic Disfluency Detection from Untranscribed Speech. arXiv preprint arXiv:2311.00867.

Speech to Text

Text Analysis
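The way the three models fit together can be sketched as a small pipeline. This is only an illustration under stated assumptions: `Annotation`, `annotate`, and `whisper_transcribe` are hypothetical names, and the two feature extractors are passed in as plain functions because the deck does not describe their interfaces. The Whisper calls (`whisper.load_model`, `model.transcribe`) come from the openai-whisper package; whether the project invokes Whisper this way is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Annotation:
    """Hypothetical container for the output of the three stages."""
    transcript: str
    text_features: Dict[str, float] = field(default_factory=dict)
    audio_features: Dict[str, float] = field(default_factory=dict)


def whisper_transcribe(audio_path: str, model_name: str = "base") -> str:
    """Speech to text with openai-whisper (imported lazily, needs ffmpeg)."""
    import whisper  # assumption: `pip install openai-whisper`

    model = whisper.load_model(model_name)
    return model.transcribe(audio_path)["text"].strip()


def annotate(
    audio_path: str,
    transcribe: Callable[[str], str],
    text_to_features: Callable[[str], Dict[str, float]],
    audio_to_features: Callable[[str], Dict[str, float]],
) -> Annotation:
    """Run ASR, then derive features from the text and from the raw audio."""
    transcript = transcribe(audio_path)
    return Annotation(
        transcript=transcript,
        text_features=text_to_features(transcript),
        audio_features=audio_to_features(audio_path),
    )
```

Passing the stages in as functions keeps each model replaceable, so a real call might look like `annotate("visit.wav", whisper_transcribe, my_text_features, my_audio_features)`, where the file name and the last two functions are placeholders.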

8 of 12

Technical Side

Speech to Text

Text Analysis

So you can you can change them if you want

RV: you can

Some Sentence-Based Features:

Word Count: 10 words
Speech Rate: 197.37 words/min
Sentence segmentation: divide into sentences
Clause detection: “if you want”
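The first three features in this list can be approximated with the standard library alone. The sketch below is a hedged illustration, not the project's implementation: the function names are hypothetical, the word pattern is a simple regular expression, and the punctuation-based sentence splitter is a stand-in for a proper tokenizer such as the ones in NLTK or spaCy.

```python
import re


def word_count(transcript: str) -> int:
    """Count word tokens (letters, digits, and straight apostrophes)."""
    return len(re.findall(r"[A-Za-z0-9']+", transcript))


def speech_rate_wpm(transcript: str, duration_seconds: float) -> float:
    """Words per minute = word count * 60 / duration in seconds."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return word_count(transcript) * 60.0 / duration_seconds


def split_sentences(transcript: str) -> list:
    """Naive split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", transcript.strip())
    return [p for p in parts if p]
```

For the example sentence, `word_count("So you can you can change them if you want")` gives 10. The slide does not state the clip length; a duration of about 3.04 seconds is an assumption that reproduces the reported rate, since 10 × 60 / 3.04 ≈ 197.37 words/min.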

Visualization of one feature: Revision (RV)
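The RV label in the example marks the repeated phrase "you can". A naive way to flag that specific pattern is to look for a short phrase that is immediately repeated. The sketch below is an assumption-laden heuristic with a hypothetical function name: it only catches exact, back-to-back repeats and is not the model the project uses for disfluency annotation.

```python
from typing import List, Tuple


def find_repeated_phrases(words: List[str], max_len: int = 3) -> List[Tuple[int, str]]:
    """Return (start_index, phrase) for each phrase immediately followed by itself.

    Longer phrases are tried first, and matching is case-insensitive.
    """
    hits = []
    i = 0
    while i < len(words):
        matched = False
        for size in range(max_len, 0, -1):
            first = [w.lower() for w in words[i:i + size]]
            second = [w.lower() for w in words[i + size:i + 2 * size]]
            if len(second) == size and first == second:
                hits.append((i, " ".join(words[i:i + size])))
                i += size  # jump to the second copy so it is not re-flagged
                matched = True
                break
        if not matched:
            i += 1
    return hits
```

Called on `"So you can you can change them if you want".split()`, this returns `[(1, "you can")]`, which matches the span labeled RV on the slide.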

9 of 12

Future Work

Work with clinicians to create new features
Make the AI tool more usable
Polish the front end and make it accessible
Goal: apply to clinical data

10 of 12

References

Tetzloff, K. A., Utianski, R. L., Duffy, J. R., Clark, H. M., Strand, E. A., Josephs, K. A., & Whitwell, J. L. (2018). Quantitative analysis of agrammatism in agrammatic primary progressive aphasia and dominant apraxia of speech. Journal of Speech, Language, and Hearing Research, 61(9), 2337-2346.
Catricalà, E., Boschi, V., Cuoco, S., Galiano, F., Picillo, M., Gobbi, E., ... & Cappa, S. F. (2019). The language profile of progressive supranuclear palsy. Cortex, 115, 294-308.
Romana, A., Koishida, K., & Provost, E. M. (2023). Automatic Disfluency Detection from Untranscribed Speech. arXiv preprint arXiv:2311.00867.

11 of 12

Acknowledgement

Thanks to Dr. Emily Mower Provost for her invaluable guidance and support.

Thanks to everyone in the CHAI Lab for their collaboration and encouragement throughout this project.

12 of 12

Technical Side

Flexibility to process features

Whisper

NLTK

spaCy

webrtcvad: voice activity detection, used to segment the audio into sentences
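Segmenting with voice activity detection can be sketched in two steps: classify short frames as speech or non-speech, then merge the speech frames into segments separated by long enough pauses. The code below is an illustrative sketch. `frames_to_segments` and `vad_flags` are hypothetical names, the 30 ms frame size and the 10-frame pause threshold are assumed values, and the project's actual settings are not given in the deck. `webrtcvad.Vad` and `Vad.is_speech` are the library's real calls; they expect 16-bit mono PCM in 10, 20, or 30 ms frames.

```python
from typing import Iterable, List, Tuple


def frames_to_segments(
    speech_flags: Iterable[bool],
    frame_ms: int = 30,
    min_silence_frames: int = 10,
) -> List[Tuple[float, float]]:
    """Merge per-frame speech decisions into (start_s, end_s) segments.

    A segment ends once `min_silence_frames` consecutive non-speech frames
    follow it; shorter pauses stay inside the segment.
    """
    segments = []
    start = None
    last_speech = 0
    silence = 0
    for idx, is_speech in enumerate(speech_flags):
        if is_speech:
            if start is None:
                start = idx
            last_speech = idx
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_silence_frames:
                segments.append((start * frame_ms / 1000, (last_speech + 1) * frame_ms / 1000))
                start = None
                silence = 0
    if start is not None:  # close a segment that runs to the end of the audio
        segments.append((start * frame_ms / 1000, (last_speech + 1) * frame_ms / 1000))
    return segments


def vad_flags(pcm: bytes, sample_rate: int = 16000, frame_ms: int = 30,
              aggressiveness: int = 2) -> List[bool]:
    """Classify each frame of 16-bit mono PCM with webrtcvad (lazy import)."""
    import webrtcvad  # assumption: `pip install webrtcvad`

    vad = webrtcvad.Vad(aggressiveness)
    frame_bytes = int(sample_rate * frame_ms / 1000) * 2  # 2 bytes per sample
    return [
        vad.is_speech(pcm[off:off + frame_bytes], sample_rate)
        for off in range(0, len(pcm) - frame_bytes + 1, frame_bytes)
    ]
```

Each resulting time range could then be cut from the recording and passed to the transcription step; how well pause-based segments line up with true sentence boundaries depends on the speaker and on the chosen threshold.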

Bird, S., Loper, E., & Klein, E. (2009). Natural Language Processing with Python. O'Reilly Media Inc. URL: https://www.nltk.org/

Van Rossum, A. (2023). Natural Language Processing With spaCy in Python. Real Python. Retrieved from https://realpython.com/natural-language-processing-spacy-python/