Swype SXSW Funding
Dhanvinkumar Ganeshkumar & Zoeb Izzi
Background Information
Over 350 million people worldwide live with motor disorders, making them more than three times less likely to use computers. Conditions such as Parkinson's disease and arthritis slow down fine motor movements and hence reduce independence and access to educational, professional, and leisure activities. As digital technology becomes part of everyday life, the need to ensure equal access to computing devices becomes all the more urgent. To address these challenges, this project develops an integrated voice command and gesture recognition system: a multimodal human-computer interaction platform tailored to individuals with motor disabilities. The system adopts a dual approach. A voice command module pairs TF-IDF vectorization with a Naive Bayes classifier for intent recognition, enhanced by regex for precise query extraction. A gesture recognition module uses MediaPipe Hands for hand landmark detection and a Temporal Convolutional Network (TCN) for classifying both dynamic and static gestures; integrated with Kalman filters, this ensures smooth cursor movement. The project aims to enhance digital accessibility, fostering greater independence and inclusion for individuals with motor disabilities.
IDEA
Globally, more than 50 million individuals face neurodegenerative disabilities such as Alzheimer's, Parkinson's, and ALS, leading to significant barriers in computer interaction. Traditional input methods like keyboards and mice are often inaccessible, limiting independence in education, work, and leisure. These challenges are exacerbated by perceptual and cognitive impairments, including difficulties in motion detection and memory (Masina et al., 2020). For example, deficits in smooth pursuit eye movements hinder object tracking, while impaired predictive motor control affects hand-eye coordination for typing and cursor movement (Chakravarthi et al., 2023). This digital exclusion highlights the need for inclusive technologies to bridge accessibility gaps.
Touchless computing offers potential solutions through voice commands and gestures. However, current voice-command systems like Google Home and Microsoft Voice Speech are limited by strict syntax requirements, making them difficult for users with cognitive or linguistic impairments to use effectively (Murad & Alasadi, 2024). Recent studies have even shown that speech recognition accuracy for individuals with dysarthria ranges from 50% to 60%, leading to frequent misinterpretation of commands (Masina et al., 2020). Gesture recognition systems, such as Microsoft Kinect, face their own challenges: sensor-based techniques are prohibitively expensive, require physical connections, and rely heavily on visual features that are easily distorted under varying lighting conditions (Ermolina & Tiberius, 2021). These limitations underscore the urgent need for affordable, multimodal systems.
The proposed solution integrates voice commands and gesture recognition into a unified, accessible platform. Voice commands are processed using TF-IDF vectorization and a Naive Bayes classifier trained on 1,830 command variations, improving intent recognition. Regex patterns are then used for query-based commands, accommodating variation in user input. The gesture recognition module employs MediaPipe Hands for hand landmark detection and a Temporal Convolutional Network (TCN) for classifying dynamic and static gestures from sequences of frames, eliminating the need for specialized sensor hardware. MediaPipe's ability to track 63 features (21 landmarks × 3 coordinates) addresses challenges like lighting distortion, which limits the accuracy of current solutions that rely solely on visual features. My solution is feasible because I have already implemented a prototype of the gesture and voice recognition system, including features such as moving tabs, real-time text translation, document interaction, and email composition.
PLAN
The voice command system I am developing categorizes commands into static and query-based types, streamlining processing into a binary routing decision. Using a Naive Bayes classifier for intent recognition and regex patterns for dynamic query extraction, the system manages a total of 46 commands: 36 static and 10 query-based. Static commands, such as “bold” or “select all,” execute predefined actions without requiring additional input. In contrast, query-based commands (e.g., “compose an email to Alice,” “search cats”) extract dynamic inputs (“Alice,” “cats”) to execute their respective tasks [4].
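As a rough sketch of this binary split (the intent names and handler structure below are illustrative placeholders covering only a handful of the 46 commands):

```python
# Illustrative routing table: each intent is either static (no parameters)
# or query-based (needs a parameter pulled out of the spoken text).
STATIC_COMMANDS = {"bold", "select_all", "new_document"}   # subset of the 36
QUERY_COMMANDS = {"compose_email", "web_search"}           # subset of the 10

def route(intent, raw_text):
    """Return an execution plan: static intents map directly to an action,
    query-based intents are passed on for regex parameter extraction."""
    if intent in STATIC_COMMANDS:
        return {"type": "static", "action": intent}
    if intent in QUERY_COMMANDS:
        return {"type": "query", "action": intent, "raw_text": raw_text}
    return {"type": "unknown"}

print(route("bold", "bold"))               # static path
print(route("web_search", "search cats"))  # query path, parsed in a later step
```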
The training dataset includes 30 examples for each static command and 75 examples for each query-based command, for a total of (36 × 30) + (10 × 75) = 1,830 training examples. Preprocessing includes normalization steps such as converting text to lowercase, removing special characters, and tokenizing into subwords. These tokens are transformed into numerical vectors using Term Frequency-Inverse Document Frequency (TF-IDF) [1], which weights the keywords that distinguish commands in the dataset.
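A minimal sketch of this preprocessing and vectorization step, assuming scikit-learn's TfidfVectorizer; the utterances are placeholders, and word-level tokenization stands in here for the subword tokenization described above:

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder utterances standing in for the real command dataset.
raw_commands = ["Bold the selected text!", "Search cats", "Search for dog videos?"]

# Normalization: lowercase and strip special characters before vectorizing.
cleaned = [re.sub(r"[^a-z0-9 ]", "", c.lower()) for c in raw_commands]

# TF-IDF turns each command into a weighted keyword vector; terms that are
# frequent in one command but rare across the corpus get the highest weights.
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(cleaned)
print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray().round(2))   # one row of keyword weights per command
```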
To explain, as detailed in Table 1, TF-IDF assigns higher weights to terms that frequently appear in a specific command but are rare across the dataset. This weighted representation allows the classifier to prioritize features that distinguish command intents, such as “bold” for bolding text or “translate” for translating text into English (Robertson, 2004). A Naive Bayes classifier is trained on this feature set [1] using an 80%-20% train-test split, applying Bayes' theorem, P(C | x) = P(x | C) · P(C) / P(x), to compute the posterior probability of a command class C given the feature vector x (Rish, 2001). Static commands are directly mapped to their predefined actions, while query-based commands undergo further input extraction [3].
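A hedged sketch of the classification step, assuming scikit-learn's MultinomialNB and train_test_split; the toy utterances and labels are illustrative, not the project's dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy (utterance, intent) pairs standing in for the 46-command dataset.
texts = ["bold this text", "make it bold", "select all text", "select everything",
         "search cats", "search for news", "compose to alice", "compose to bob"]
labels = ["bold", "bold", "select_all", "select_all",
          "web_search", "web_search", "compose_email", "compose_email"]

# 80%-20% train-test split, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=0)

# Naive Bayes applies Bayes' theorem, P(C|x) ∝ P(x|C)·P(C), over TF-IDF
# features, assuming conditional independence between features.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
print(model.predict(["please bold the heading"]))   # e.g. -> ['bold']
```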
Query-based commands use regex patterns for parameter extraction; these are predefined templates for matching patterns in text (Erwig & Martin, 2012). For example, in the command “compose to Alice,” the pattern compose to (\w+) extracts “Alice” as the recipient. Similarly, in “search cats,” the pattern search (.+) captures “cats” as the search query [2].
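A minimal sketch of this extraction step using Python's built-in re module; the patterns mirror the examples above, while the intent names are assumed labels:

```python
import re

# Regex templates for the query-based commands illustrated above.
QUERY_PATTERNS = {
    "compose_email": re.compile(r"compose to (\w+)", re.IGNORECASE),
    "web_search":    re.compile(r"search (.+)", re.IGNORECASE),
}

def extract_query(intent, text):
    """Return the dynamic parameter for a query-based command, or None."""
    match = QUERY_PATTERNS[intent].search(text)
    return match.group(1) if match else None

print(extract_query("compose_email", "compose to Alice"))  # -> 'Alice'
print(extract_query("web_search", "search cats"))          # -> 'cats'
```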
Once classified and processed, commands are routed through the execution pipeline. Static commands interface directly with automation tools like PyAutoGUI, which simulate user-interface actions such as opening a new document or applying bold formatting [9]. Query-based commands involve more complex interactions, using the extracted inputs for actions like sending an email, setting document titles, or bookmarking web pages (Robertson, 2004). The system's multi-threaded design keeps it responsive and allows continuous voice input without performance degradation.
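A hedged sketch of how a classified command might be dispatched with PyAutoGUI on a worker thread; the specific keyboard shortcuts and intent names are assumptions, not the project's exact bindings:

```python
import threading
import pyautogui

# Illustrative mapping from static intents to simulated keyboard shortcuts
# (shortcut choices assume a Windows/Linux layout; use 'command' on macOS).
STATIC_ACTIONS = {
    "bold":       lambda: pyautogui.hotkey("ctrl", "b"),
    "select_all": lambda: pyautogui.hotkey("ctrl", "a"),
}

def execute(intent, query=None):
    """Dispatch a classified command on a worker thread so the voice-capture
    loop keeps listening while the UI action runs."""
    if intent in STATIC_ACTIONS:
        action = STATIC_ACTIONS[intent]
    elif intent == "web_search" and query:
        # Type the extracted query into the focused search box and submit it.
        action = lambda: (pyautogui.write(query, interval=0.02),
                          pyautogui.press("enter"))
    else:
        return
    threading.Thread(target=action, daemon=True).start()

execute("bold")                    # static command
execute("web_search", "cats")      # query-based command with extracted input
```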
In parallel, for gesture recognition, I will train a Temporal Convolutional Network (TCN) to interpret both static gestures (e.g., "fist") and dynamic gestures (e.g., "wave up"). TCNs are well suited to modeling long-range temporal dependencies while maintaining computational efficiency through parallel processing, unlike recurrent models, which suffer from vanishing gradients and sequential memory limitations (Habib & Qureshi, 2022). Using a standard webcam, I will record 10 sequences of 30 frames per gesture, as detailed in Table 2. Each frame yields 21 hand landmarks (x, y, z coordinates) [5] extracted using MediaPipe, resulting in 63 features per frame. Features are normalized to a [0, 1] range [6].
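A minimal sketch of this per-frame feature extraction, assuming OpenCV and the MediaPipe Hands solution API; camera index 0 and single-hand tracking are assumed:

```python
import cv2
import mediapipe as mp
import numpy as np

# MediaPipe Hands returns 21 landmarks per detected hand; x and y arrive
# already normalized to [0, 1] relative to the frame, z is relative depth.
mp_hands = mp.solutions.hands

def landmarks_from_frame(frame_bgr, hands):
    """Return a (63,) feature vector for one frame, or zeros if no hand is seen."""
    results = hands.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    if not results.multi_hand_landmarks:
        return np.zeros(63, dtype=np.float32)
    hand = results.multi_hand_landmarks[0]
    return np.array(
        [coord for lm in hand.landmark for coord in (lm.x, lm.y, lm.z)],
        dtype=np.float32,
    )

# Capture one 30-frame sequence from the default webcam.
cap = cv2.VideoCapture(0)
with mp_hands.Hands(max_num_hands=1) as hands:
    sequence = []
    while len(sequence) < 30:
        ok, frame = cap.read()
        if not ok:
            break
        sequence.append(landmarks_from_frame(frame, hands))
cap.release()
print(np.array(sequence).shape)   # -> (30, 63)
```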
The TCN uses dilated convolutions to capture long-term dependencies without increasing the model size, introducing gaps between kernel elements so that the convolutional filter spans a larger receptive field. Each input sequence of 30 frames (63 features per frame) passes through convolutional layers with Normalized Non-linear Activation Units (NNLUs). This activation function applies a non-linear transformation to its input and addresses the vanishing gradient problem, which occurs when gradients in earlier layers diminish to such small values that their contributions to optimization become negligible, hindering the network's ability to learn effectively (Frank & Degen, 2023). Following the convolutional layers, a Global Average Pooling (GAP) layer averages each feature map into a single value, condensing the feature maps into one vector for classification while retaining key spatial information (Habib & Qureshi, 2022). The final softmax output layer assigns probabilities to each gesture label, selecting the label with the highest probability as the recognized gesture (Franke & Degen, 2023) [7].
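A rough PyTorch sketch of this architecture; the channel widths, dilation rates, number of gesture classes, and the ReLU standing in for the NNLU activation are illustrative assumptions, not the project's exact configuration:

```python
import torch
import torch.nn as nn

class TCNGestureClassifier(nn.Module):
    """Sketch of a TCN over (batch, 63 features, 30 frames) landmark sequences."""
    def __init__(self, num_gestures, in_features=63):
        super().__init__()
        layers = []
        channels = [in_features, 64, 64, 64]
        for i in range(3):
            dilation = 2 ** i  # 1, 2, 4: widening receptive field at no extra size
            layers += [
                nn.Conv1d(channels[i], channels[i + 1], kernel_size=3,
                          dilation=dilation, padding=dilation),  # keeps length 30
                nn.ReLU(),  # stand-in for the NNLU activation described above
            ]
        self.tcn = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool1d(1)          # Global Average Pooling
        self.head = nn.Linear(channels[-1], num_gestures)

    def forward(self, x):
        x = self.tcn(x)                              # (batch, 64, 30)
        x = self.pool(x).squeeze(-1)                 # (batch, 64)
        return self.head(x)                          # logits per gesture class

model = TCNGestureClassifier(num_gestures=8)         # 8 classes assumed here
logits = model(torch.randn(4, 63, 30))               # 4 sequences of 30 frames
probs = torch.softmax(logits, dim=1)
print(probs.shape)                                    # -> torch.Size([4, 8])
```

The head returns raw logits; the softmax is applied at inference, which also pairs naturally with the cross-entropy training objective described next.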
Training will use the Adam optimizer and CrossEntropyLoss, with early stopping and a dynamic learning-rate scheduler to prevent overfitting (Hicks et al., 2022); the scheduler adjusts the learning rate during training to accelerate convergence and fine-tune the model's performance. The dataset will be partitioned into training (70%), validation (15%), and testing (15%) subsets for model evaluation. Once trained, the TCN will be integrated into a real-time system using OpenCV and MediaPipe, with Kalman filters smoothing predictions for consistent cursor control. Gesture coordinates will be mapped to screen dimensions via PyAutoGUI, enhanced by exponential moving averages and low-pass filters for natural transitions [8]. This ensures smooth, responsive interaction for users.
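A simplified sketch of the coordinate mapping and smoothing stage; an exponential moving average stands in here for the full Kalman/low-pass filtering pipeline, and the smoothing factor is an assumed value:

```python
import pyautogui

class SmoothedCursor:
    """Map normalized (0-1) fingertip coordinates to screen pixels, smoothed
    with an exponential moving average (a simplified stand-in for the
    Kalman/low-pass filtering stage; alpha = 0.3 is an assumed setting)."""
    def __init__(self, alpha=0.3):
        self.alpha = alpha
        self.ema_x = None
        self.ema_y = None
        self.screen_w, self.screen_h = pyautogui.size()

    def update(self, x_norm, y_norm):
        if self.ema_x is None:                       # first observation
            self.ema_x, self.ema_y = x_norm, y_norm
        else:                                        # EMA: new = a*obs + (1-a)*old
            self.ema_x = self.alpha * x_norm + (1 - self.alpha) * self.ema_x
            self.ema_y = self.alpha * y_norm + (1 - self.alpha) * self.ema_y
        pyautogui.moveTo(self.ema_x * self.screen_w, self.ema_y * self.screen_h)

cursor = SmoothedCursor()
cursor.update(0.52, 0.40)   # e.g. MediaPipe index-fingertip landmark (x, y)
```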
Specialized commands requiring additional libraries include:
Risks
Current Progress: (Click on the links to see the demonstrations)
Video 1: Interacting with Google Document
Video 2: Composing and Sending Emails (intuitive, based on personalized commands, e.g., "John")
Video 3: Searching through Google
Video 4: Using Online Editors for Code
Video 5: Live-Feedback on Documents, Articles, and Translation
References