NOC
Nanou & Hayley
Emotional Speech Recognition
Should we build robots that feel human emotions?
Emotions
“Our feelings and emotions are triggered by stimuli, either external or internal (such as a memory), and manifest themselves in terms of physical signs – pulse rate, perspiration, facial expressions, gesture, and tone of voice. We might cry or laugh, shudder in disgust, or shrink in defeat.
Unlike our language, many of these emotions are expressed spontaneously and automatically, without any conscious control. We learn to recognize emotions in other human beings from birth. Babies are soothed by the gentle humming of a lullaby even before they are born. They respond to the smiling face of a parent at birth and are certainly capable of expressing their own emotions from day one”.
People express their emotions through…
- Speech
- Facial expressions
- Body pose
These cues are captured as audio, visual, and physiological signals.
Emotions in speech are conveyed not just by pitch, but also by the chroma, tempo, and speed of the voice.
- Speech input is represented by the frequency components of the audio.
- Traditional machine learning must first perform feature engineering to extract these characteristics.
For tone of voice, feature engineering typically extracts 1,000-2,500 characteristics from the input audio, and this step slows down the whole pipeline.
The emotion recognition process is therefore slow.
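A rough sketch of what this kind of feature engineering looks like in Python (librosa and numpy are assumed to be available; the file name and the exact set of descriptors are only illustrative):

```python
# A rough sketch of classic feature engineering for emotion recognition.
# Assumes librosa is installed; "speech.wav" is a placeholder file name.
import numpy as np
import librosa

def engineer_features(path):
    y, sr = librosa.load(path, sr=16000)                          # load audio at 16 kHz

    # Frame-level (time-varying) descriptors of the frequency content
    mfcc     = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)        # (40, T)
    chroma   = librosa.feature.chroma_stft(y=y, sr=sr)            # (12, T)
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)      # (7, T)
    zcr      = librosa.feature.zero_crossing_rate(y)              # (1, T)
    frames   = np.vstack([mfcc, chroma, contrast, zcr])           # (60, T)

    # Collapse the time axis with several statistics: 60 rows x 5 stats = 300 features.
    # Adding deltas, more descriptors, and more statistics is how real front ends
    # reach the 1,000-2,500 features mentioned above.
    stats = [frames.mean(1), frames.std(1), frames.min(1),
             frames.max(1), np.median(frames, 1)]
    return np.concatenate(stats)

# features = engineer_features("speech.wav")   # one fixed-length vector per clip
```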
Recent breakthroughs in neural networks, or deep learning,
enabled by both faster machines and massive amounts of training data, have led to vast improvements in machine learning:
Convolutional neural networks (CNNs), for example, can automatically learn these characteristics during training, without an explicit and slow feature engineering step or human design. This is perhaps the most important contribution of deep learning to the field of AI.
Classifiers
KNN (K-Nearest Neighbor)
GMM (Gaussian Mixture Model)
SVM (Support Vector Machine)
HMM (Hidden Markov Model)
Logistic Regression (Logit)
CNN (Convolutional Neural Network)
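A minimal sketch comparing a few of these classifiers with scikit-learn; the feature matrix below is random placeholder data standing in for real per-clip feature vectors and emotion labels:

```python
# Compare a few classic classifiers on pre-extracted feature vectors.
# The data here is synthetic and only stands in for real (features, label) pairs.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 300))          # 200 clips x 300 engineered features
y = rng.integers(0, 4, size=200)         # 4 emotion classes (placeholder labels)

models = {
    "KNN":   KNeighborsClassifier(n_neighbors=5),
    "SVM":   SVC(kernel="rbf", C=1.0),
    "Logit": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    clf = make_pipeline(StandardScaler(), model)   # scale features, then classify
    score = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: {score:.2f} cross-validated accuracy")
```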
Experiment conducted: SVM or CNN?
Experiment conducted
The commonly used features in speech recognition are:
- MFCC (Mel-Frequency Cepstral Coefficients),
- prosodic features,
- sound quality characteristics and acoustic features.
- To better identify emotions in the voice, emotion-related information is also added, such as context keywords; this typically improves the recognition rate.
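A small illustration of the first two feature families, MFCCs and prosodic features (pitch and energy), using librosa; the file name is a placeholder:

```python
# MFCCs describe the spectral envelope; pitch and energy are prosodic cues.
# "utterance.wav" is a placeholder file name; librosa is assumed to be installed.
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)

# MFCC: compact description of the spectral envelope, frame by frame
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)             # (13, T)

# Prosodic features: fundamental frequency (pitch) contour and energy
f0, voiced, _ = librosa.pyin(y, fmin=65.0, fmax=400.0, sr=sr)  # pitch per frame
rms = librosa.feature.rms(y=y)[0]                              # loudness per frame

print("MFCC frames:", mfcc.shape)
print("mean pitch (Hz):", np.nanmean(f0))
print("mean energy:", rms.mean())
```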
CNN is a multi-layer perceptron specially designed to recognize two-dimensional shapes, so the dimensional information retained in waveform points is effectively exploited by a CNN. Because of its adaptive feature extraction, the CNN model is applied both to image recognition and to emotion recognition in voice signals. For emotional speech recognition, based on tests of two classic characteristics of the speech signal, we propose to characterize the emotional speech signal directly by its waveform points. This loses no information and also takes advantage of the natural correlation between neighboring waveform samples to identify emotion.
CNN
Convolutional Neural Network model (CNN)
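A minimal sketch, not the exact network used in the experiment, of a CNN that takes raw waveform points as input (PyTorch assumed; layer sizes and the number of emotion classes are illustrative):

```python
# A small 1-D CNN over raw waveform samples for emotion classification.
import torch
import torch.nn as nn

class WaveformCNN(nn.Module):
    def __init__(self, n_emotions=4):
        super().__init__()
        # 1-D convolutions learn their own filters from the raw samples,
        # replacing hand-crafted feature engineering.
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=64, stride=4), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(16, 32, kernel_size=32, stride=2), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(32, 64, kernel_size=16, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),              # collapse the time axis
        )
        self.classifier = nn.Linear(64, n_emotions)

    def forward(self, x):                          # x: (batch, 1, samples)
        h = self.features(x).squeeze(-1)           # (batch, 64)
        return self.classifier(h)                  # emotion logits

model = WaveformCNN()
clip = torch.randn(8, 1, 16000)                    # 8 one-second clips at 16 kHz
print(model(clip).shape)                           # torch.Size([8, 4])
```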
Experiment Conclusion
API / SDK
Example: Moodies App
Beyond Verbal API
Beyond Verbal decodes human vocal intonations into their underlying emotions in real time, enabling voice-enabled devices and apps to understand our emotions.
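A purely hypothetical sketch of how such a voice-emotion API might be called from Python; the endpoint, headers, and response fields below are invented placeholders, not Beyond Verbal's actual API, which is documented by the vendor:

```python
# Hypothetical example: upload an audio clip to an emotion-analysis REST API.
# The URL, credential, and response fields are placeholders for illustration only.
import requests

API_URL = "https://api.example.com/v1/emotion"      # placeholder endpoint
API_KEY = "YOUR_API_KEY"                            # placeholder credential

with open("utterance.wav", "rb") as f:
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"audio": f},
    )
resp.raise_for_status()
print(resp.json())   # e.g. {"emotion": "joy", "confidence": 0.82} (illustrative)
```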
Conclusion
Still, will robots be conscious?
If a robot develops analytical skills, learning ability, communication, and even emotional intelligence, will it have consciousness? Will it be sentient? Can it dream?