NOC
Nanou & Hayley
Emotional Speech Recognition
Should we build robots that feel human emotions?
Emotions
“Our feelings and emotions are triggered by stimuli, either external or internal (such as a memory), and manifest themselves in terms of physical signs – pulse rate, perspiration, facial expressions, gesture, and tone of voice. We might cry or laugh, shudder in disgust, or shrink in defeat.
Unlike our language, many of these emotions are expressed spontaneously and automatically, without any conscious control. We learn to recognize emotions in other human beings from birth. Babies are soothed by the gentle humming of a lullaby even before they are born. They respond to the smiling face of a parent at birth and are certainly capable of expressing their own emotions from day one”.
People express their emotions through…
- Speech
- Facial expressions
- Body pose
These cues are captured as audio, visual, and physiological signals.
Emotions in speech are conveyed not just by pitch, but also by the chroma, tempo, and speed of the voice.
- Speech input is represented by the frequency components of the audio.
- Traditional machine learning must first perform feature engineering to extract these characteristics.
For tone of voice, feature engineering typically extracts 1,000-2,500 characteristics from the input audio, and this step slows down the whole pipeline.
The emotion recognition process is therefore slow.
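A rough sketch of what this kind of feature engineering looks like in Python (librosa and numpy are assumed to be available; the file name and the exact set of descriptors are only illustrative):

```python
# A rough sketch of classic feature engineering for emotion recognition.
# Assumes librosa is installed; "speech.wav" is a placeholder file name.
import numpy as np
import librosa

def engineer_features(path):
    y, sr = librosa.load(path, sr=16000)                          # load audio at 16 kHz

    # Frame-level (time-varying) descriptors of the frequency content
    mfcc     = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)        # (40, T)
    chroma   = librosa.feature.chroma_stft(y=y, sr=sr)            # (12, T)
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)      # (7, T)
    zcr      = librosa.feature.zero_crossing_rate(y)              # (1, T)
    frames   = np.vstack([mfcc, chroma, contrast, zcr])           # (60, T)

    # Collapse the time axis with several statistics: 60 rows x 5 stats = 300 features.
    # Adding deltas, more descriptors, and more statistics is how real front ends
    # reach the 1,000-2,500 features mentioned above.
    stats = [frames.mean(1), frames.std(1), frames.min(1),
             frames.max(1), np.median(frames, 1)]
    return np.concatenate(stats)

# features = engineer_features("speech.wav")   # one fixed-length vector per clip
```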
Recent breakthroughs in neural networks, or deep learning,
enabled by both faster machines and massive amounts of training data, have led to vast improvements in machine learning:
Convolutional neural networks (CNNs), for example, can automatically learn these characteristics during training, without an explicit and slow feature engineering step or human design. This is perhaps the most important contribution of deep learning to the field of AI.
Classifiers
KNN (K-Nearest Neighbor)
GMM (Gaussian Mixture Model)
SVM (Support Vector Machine)
HMM (Hidden Markov Model)
Logistic Regression (Logit)
CNN (Convolutional Neural Network)
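A minimal sketch comparing a few of these classifiers with scikit-learn; the feature matrix below is random placeholder data standing in for real per-clip feature vectors and emotion labels:

```python
# Compare a few classic classifiers on pre-extracted feature vectors.
# The data here is synthetic and only stands in for real (features, label) pairs.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 300))          # 200 clips x 300 engineered features
y = rng.integers(0, 4, size=200)         # 4 emotion classes (placeholder labels)

models = {
    "KNN":   KNeighborsClassifier(n_neighbors=5),
    "SVM":   SVC(kernel="rbf", C=1.0),
    "Logit": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    clf = make_pipeline(StandardScaler(), model)   # scale features, then classify
    score = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: {score:.2f} cross-validated accuracy")
```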
Experiment conducted: SVM or CNN?
Experiment conducted
The commonly used features in speech recognition are:
- MFCC (Mel-Frequency Cepstral Coefficients),
- prosodic features,
- sound quality characteristics and acoustic features.
- To better identify emotions in the voice, emotion-related information is also added, such as context keywords; this typically improves the recognition rate.
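A small illustration of the first two feature families, MFCCs and prosodic features (pitch and energy), using librosa; the file name is a placeholder:

```python
# MFCCs describe the spectral envelope; pitch and energy are prosodic cues.
# "utterance.wav" is a placeholder file name; librosa is assumed to be installed.
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)

# MFCC: compact description of the spectral envelope, frame by frame
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)             # (13, T)

# Prosodic features: fundamental frequency (pitch) contour and energy
f0, voiced, _ = librosa.pyin(y, fmin=65.0, fmax=400.0, sr=sr)  # pitch per frame
rms = librosa.feature.rms(y=y)[0]                              # loudness per frame

print("MFCC frames:", mfcc.shape)
print("mean pitch (Hz):", np.nanmean(f0))
print("mean energy:", rms.mean())
```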
CNN is a multi-layer perceptron specially designed to recognize two-dimensional shapes, so the dimensional information retained in waveform points is effectively exploited by a CNN. Because of its adaptive feature extraction, the CNN model is applied both to image recognition and to emotion recognition in voice signals. For emotional speech recognition, based on tests of two classic characteristics of the speech signal, we propose to characterize the emotional speech signal directly by its waveform points. This loses no information and also takes advantage of the natural correlation between neighboring waveform samples to identify emotion.
CNN
Convolutional Neural Network model (CNN)
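A minimal sketch, not the exact network used in the experiment, of a CNN that takes raw waveform points as input (PyTorch assumed; layer sizes and the number of emotion classes are illustrative):

```python
# A small 1-D CNN over raw waveform samples for emotion classification.
import torch
import torch.nn as nn

class WaveformCNN(nn.Module):
    def __init__(self, n_emotions=4):
        super().__init__()
        # 1-D convolutions learn their own filters from the raw samples,
        # replacing hand-crafted feature engineering.
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=64, stride=4), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(16, 32, kernel_size=32, stride=2), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(32, 64, kernel_size=16, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),              # collapse the time axis
        )
        self.classifier = nn.Linear(64, n_emotions)

    def forward(self, x):                          # x: (batch, 1, samples)
        h = self.features(x).squeeze(-1)           # (batch, 64)
        return self.classifier(h)                  # emotion logits

model = WaveformCNN()
clip = torch.randn(8, 1, 16000)                    # 8 one-second clips at 16 kHz
print(model(clip).shape)                           # torch.Size([8, 4])
```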
Experiment Conclusion
API / SDK
Example: Moodies App
Beyond Verbal API
Beyond Verbal decodes human vocal intonations into their underlying emotions in real time, enabling voice-enabled devices and apps to understand our emotions.
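A purely hypothetical sketch of how such a voice-emotion API might be called from Python; the endpoint, headers, and response fields below are invented placeholders, not Beyond Verbal's actual API, which is documented by the vendor:

```python
# Hypothetical example: upload an audio clip to an emotion-analysis REST API.
# The URL, credential, and response fields are placeholders for illustration only.
import requests

API_URL = "https://api.example.com/v1/emotion"      # placeholder endpoint
API_KEY = "YOUR_API_KEY"                            # placeholder credential

with open("utterance.wav", "rb") as f:
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"audio": f},
    )
resp.raise_for_status()
print(resp.json())   # e.g. {"emotion": "joy", "confidence": 0.82} (illustrative)
```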
Conclusion
Still, will robots be conscious?
If a robot develops analytical skills, learning ability, communication, and even emotional intelligence, will it have consciousness? Will it be sentient? Can it dream?