In movies and TV shows, voice recognition has always been portrayed as a dynamic and effortless way to interact with a computer. In real life, this is rarely the case. In the past few years, voice recognition has reached users through simple-to-operate clients such as Siri, Cortana, and Google Now. These programs were popular at first, but users aren't enjoying the experience. In fact, 85% of iOS users had never activated Siri in 2013. The goal of my project was to achieve voice recognition as we imagine it could be.
The most defining characteristic of voice recognition, compared to other forms of user input, is that it requires no physical contact or input device. It is the way we naturally interact with people. Interacting with a computer in the same way we do with people will appeal to users emotionally. There is a certain pleasure in completing an action with effortless voice control.
I am designing this for use in my “Smart Mirror”, which is currently in development. The device's primary source of input will be speaking, augmented by motion and face detection. The rest of this document will be contextually oriented towards that implementation.
Design and Ideation
Before starting any work, I first had to consider the real problem with existing voice recognition solutions. The voice recognition we imagine in movies and TV is just as capable as a human at understanding us and replying. Although we can't fully make that happen yet, I believe there are a couple of areas we can improve on now. The first and foremost is the speed of interaction. People hear what we say instantly, and so should this device. In today's world our attention is valuable, and we don't like to give it out easily. For this reason, my second priority was to make the voice interface work through audio alone, so the user wouldn't need to look at or touch the device. When we do look at a person while talking to them, we get feedback through their facial expressions and movements. This sets up my third goal: providing meaningful and contextual visual feedback when the user looks at the device.
To solve the speed issue, the first task was to take everything offline and run it locally on the device. Internet access and Internet-based voice recognition are just too slow right now. I also made a fundamental change to how voice commands work, which makes my device nearly instant: I do not accommodate free-form voice input. Only a certain set of commands is accepted at any point, and each command is displayed on the screen in a way that is contextually relevant. This way the user does not have to memorize commands. Although this does hurt my second goal of “audio only” interaction, it brings great benefit to the rest of the project.
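The idea of accepting only a small, context-dependent set of commands can be illustrated with a short sketch. This is not the project's actual code: the class `CommandContexts`, its method names, and the example contexts are hypothetical and exist only to show the concept in plain Java, without any recognition library.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: names and structure are assumptions, not the project's code.
final class CommandContexts {
    private final Map<String, List<String>> commandsByContext = new LinkedHashMap<>();
    private String current; // the context whose commands are currently on screen

    /** Registers a context; the first context defined becomes the active one. */
    void define(String context, List<String> commands) {
        List<String> normalized = new ArrayList<>();
        for (String command : commands) {
            normalized.add(normalize(command));
        }
        commandsByContext.put(context, Collections.unmodifiableList(normalized));
        if (current == null) {
            current = context;
        }
    }

    /** Makes another context active, for example after the user says "News". */
    void switchTo(String context) {
        if (!commandsByContext.containsKey(context)) {
            throw new IllegalArgumentException("Unknown context: " + context);
        }
        current = context;
    }

    /** The commands to display, and the only ones accepted right now. */
    List<String> visibleCommands() {
        List<String> commands = commandsByContext.get(current);
        return commands == null ? Collections.<String>emptyList() : commands;
    }

    /** Returns the matched command, or empty if the phrase is not accepted here. */
    Optional<String> accept(String heard) {
        String normalized = normalize(heard);
        return visibleCommands().contains(normalized)
                ? Optional.of(normalized)
                : Optional.empty();
    }

    private static String normalize(String text) {
        return text.trim().toLowerCase(Locale.ROOT);
    }
}
```

With a "home" context that lists “News” and “Weather”, a call such as `accept("next")` returns empty until `switchTo("news")` makes “Next” one of the visible commands. Keeping the list this small is what makes both matching and on-screen display cheap.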
I chose to pursue this goal using the Android operating system, as I do with most projects. Although my implementation uses Android, it could easily be ported to any GNU/Linux-based operating system. The next step was finding existing attempts at something like this. The most prominent example I found was CMU Sphinx, a voice recognition library from Carnegie Mellon University. After some trial and error, I was able to import the library and get the samples running on Android. I currently write and compile applications using Android Studio, which I have enjoyed more than Eclipse.
By reading the source code of the samples, I was able to work out how the library works and begin implementing it for my project. I set up the “keyphrase” that activates listening, and then built the lists of commands. CMU Sphinx has an excellent strategy of recognizing commands on a per-syllable basis. Each time it hears a new syllable, it cross-checks against the list of accepted commands. This opens up the possibility of knowing what you are going to say before you finish saying it. The drawback of this strategy is accuracy, mainly false positives. For this reason, I devised a system that uses both the partial result and final result methods to recognize user input. After much fine-tuning, I achieved my desired result. My device recognizes commands between 100 and 300 milliseconds after you finish speaking: long enough that the device does not feel like it is interrupting you, but short enough that no delay is noticeable. For example, if you ask the device for the temperature, it speaks the result faster than a person holding the weather page of today's paper could.
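One way to combine partial and final results is sketched below. This is an assumption about how such a system could work, not a description of my actual implementation: a partial hypothesis is accepted early only if it points to exactly one command and stays stable for a few consecutive updates, and otherwise the final hypothesis must match a command exactly. The class `EarlyCommandMatcher` and its methods are hypothetical, and the partial strings are simplified text. In a real application they would presumably be fed from the recognizer's partial and final result callbacks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: the stability rule below is an assumption for illustration.
final class EarlyCommandMatcher {
    private final List<String> commands = new ArrayList<>();
    private final int requiredStablePartials;
    private String lastCandidate; // command suggested by the previous partial
    private int stableCount;      // how many partials in a row agreed on it

    EarlyCommandMatcher(List<String> acceptedCommands, int requiredStablePartials) {
        for (String command : acceptedCommands) {
            commands.add(normalize(command));
        }
        this.requiredStablePartials = requiredStablePartials;
    }

    /** Feed a partial hypothesis; returns a command once it is stable, else null. */
    String onPartial(String hypothesis) {
        String candidate = uniqueMatch(normalize(hypothesis));
        if (candidate == null) {
            reset(); // nothing or several commands fit: keep listening
            return null;
        }
        if (candidate.equals(lastCandidate)) {
            stableCount++;
        } else {
            lastCandidate = candidate;
            stableCount = 1;
        }
        return stableCount >= requiredStablePartials ? candidate : null;
    }

    /** Feed the final hypothesis; only an exact command match is accepted. */
    String onFinal(String hypothesis) {
        String text = normalize(hypothesis);
        reset();
        return commands.contains(text) ? text : null;
    }

    /** Returns the single command that starts with the text heard so far, or null. */
    private String uniqueMatch(String heard) {
        if (heard.isEmpty()) {
            return null;
        }
        String match = null;
        for (String command : commands) {
            if (command.startsWith(heard)) {
                if (match != null) {
                    return null; // ambiguous: more than one command fits
                }
                match = command;
            }
        }
        return match;
    }

    private void reset() {
        lastCandidate = null;
        stableCount = 0;
    }

    private static String normalize(String text) {
        return text.trim().toLowerCase(Locale.ROOT);
    }
}
```

Raising `requiredStablePartials` trades speed for fewer false positives, which is the same balance the fine-tuning described above had to strike.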
The largest improvement in recognition accuracy, and the largest reduction in false positives, came from getting a better microphone. A standard lapel microphone yielded the best results because it is optimized for close-range recording and filters out background noise. The device is now quite difficult to operate from across a room, and I see that as a good thing. While standing in front of the device, it works effortlessly.
Conclusions and Continuations
For now, my goals regarding speed and accuracy have been met. On average, it takes approximately 1200 to 2000 milliseconds to reach the information I am looking for, depending on how slowly I respond. This time includes the “Ok Mirror” keyphrase as well as the command, such as “News”. Each voice command is followed by a tone that reflects the context and indicates what state the device is in. After a short period of use, these audio cues can communicate with the user subconsciously.
In the future, I plan to achieve a “conversational” experience with my device. By this I mean an exchange of dialogue between the user and device that feels natural and relaxed. This will apply to input requiring more information such as creating a calendar event or sending an SMS.
I hope to push the conversational aspect even further by one day teaching my software to initiate conversations with a user when it is necessary and appropriate. Right now, this is perhaps the most exciting concept in the future of my work.