Swype SXSW Funding
Dhanvinkumar Ganeshkumar & Zoeb Izzi
Background Information
Over 350 million people worldwide live with motor disorders, making them more than three times less likely to use computers. Conditions such as Parkinson's disease and arthritis slow down fine motor movements and hence reduce independence and access to educational, professional, and leisure activities. As digital technology becomes part of everyday life, the need to ensure equal access to computing devices becomes all the more urgent. To address these challenges, this project develops an integrated voice command and gesture recognition system: a multimodal human-computer interaction platform tailored to individuals with motor disabilities. The system adopts a dual approach. A voice command module pairs TF-IDF vectorization with a Naive Bayes classifier for intent recognition, enhanced by regex for precise query extraction. A gesture recognition module uses MediaPipe Hands for hand landmark detection and a Temporal Convolutional Network (TCN) for classifying both dynamic and static gestures; integrated with Kalman filters, this ensures smooth cursor movement. The project aims to enhance digital accessibility, fostering greater independence and inclusion for individuals with motor disabilities.
IDEA
Globally, more than 50 million individuals face neurodegenerative disabilities such as Alzheimer's, Parkinson's, and ALS, leading to significant barriers in computer interaction. Traditional input methods like keyboards and mice are often inaccessible, limiting independence in education, work, and leisure. These challenges are exacerbated by perceptual and cognitive impairments, including difficulties in motion detection and memory (Masina et al., 2020). For example, deficits in smooth pursuit eye movements hinder object tracking, while impaired predictive motor control affects hand-eye coordination for typing and cursor movement (Chakravarthi et al., 2023). This digital exclusion highlights the need for inclusive technologies to bridge accessibility gaps.
Touchless computing offers potential solutions through voice commands and gestures. However, current voice-command systems like Google Home and Microsoft Voice Speech are limited by strict syntax requirements, making them difficult for users with cognitive or linguistic impairments to use effectively (Murad & Alasadi, 2024). Recent studies have even shown that speech recognition accuracy for individuals with dysarthria ranges from 50% to 60%, leading to frequent misinterpretation of commands (Masina et al., 2020). Gesture recognition systems, such as Microsoft Kinect, face their own challenges: sensor-based techniques are prohibitively expensive, require physical connections, and rely heavily on visual features that are easily distorted under varying lighting conditions (Ermolina & Tiberius, 2021). These limitations underscore the urgent need for affordable, multimodal systems.
The proposed solution integrates voice commands and gesture recognition into a unified, accessible platform. Voice commands are processed using TF-IDF vectorization and a Naive Bayes classifier trained on 1,830 command variations, improving intent recognition. Regex patterns are then used for query-based commands, accommodating variation in user input. The gesture recognition module employs MediaPipe Hands for hand landmark detection and a Temporal Convolutional Network (TCN) for classifying dynamic and static gestures from sequences of frames, eliminating the need for specialized sensor hardware. MediaPipe's ability to track 63 features (21 landmarks × 3 coordinates) addresses challenges like lighting distortion, which limits the accuracy of current solutions that rely solely on visual features. My solution is feasible because I have already implemented a prototype of the gesture and voice recognition system, including features such as moving tabs, real-time text translation, document interaction, and email composition.
PLAN
The voice command system I am developing categorizes commands into static and query-based types, streamlining processing into a binary routing decision. Using a Naive Bayes classifier for intent recognition and regex patterns for dynamic query extraction, the system manages a total of 46 commands: 36 static and 10 query-based. Static commands, such as “bold” or “select all,” execute predefined actions without requiring additional input. In contrast, query-based commands (e.g., “compose an email to Alice,” “search cats”) extract dynamic inputs (“Alice,” “cats”) to execute their respective tasks [4].
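As a rough sketch of this binary split (the intent names and handler structure below are illustrative placeholders covering only a handful of the 46 commands):

```python
# Illustrative routing table: each intent is either static (no parameters)
# or query-based (needs a parameter pulled out of the spoken text).
STATIC_COMMANDS = {"bold", "select_all", "new_document"}   # subset of the 36
QUERY_COMMANDS = {"compose_email", "web_search"}           # subset of the 10

def route(intent, raw_text):
    """Return an execution plan: static intents map directly to an action,
    query-based intents are passed on for regex parameter extraction."""
    if intent in STATIC_COMMANDS:
        return {"type": "static", "action": intent}
    if intent in QUERY_COMMANDS:
        return {"type": "query", "action": intent, "raw_text": raw_text}
    return {"type": "unknown"}

print(route("bold", "bold"))               # static path
print(route("web_search", "search cats"))  # query path, parsed in a later step
```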
The training dataset includes 30 examples for each static command and 75 examples for each query-based command, for a total of (36 × 30) + (10 × 75) = 1,830 training examples. Preprocessing includes normalization steps such as converting text to lowercase, removing special characters, and tokenizing into subwords. These tokens are transformed into numerical vectors using Term Frequency-Inverse Document Frequency (TF-IDF) [1], which weights the keywords that distinguish commands in the dataset.
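A minimal sketch of this preprocessing and vectorization step, assuming scikit-learn's TfidfVectorizer; the utterances are placeholders, and word-level tokenization stands in here for the subword tokenization described above:

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder utterances standing in for the real command dataset.
raw_commands = ["Bold the selected text!", "Search cats", "Search for dog videos?"]

# Normalization: lowercase and strip special characters before vectorizing.
cleaned = [re.sub(r"[^a-z0-9 ]", "", c.lower()) for c in raw_commands]

# TF-IDF turns each command into a weighted keyword vector; terms that are
# frequent in one command but rare across the corpus get the highest weights.
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(cleaned)
print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray().round(2))   # one row of keyword weights per command
```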
To explain, as detailed in Table 1, TF-IDF assigns higher weights to terms that frequently appear in a specific command but are rare across the dataset. This weighted representation allows the classifier to prioritize features that distinguish command intents, such as “bold” for bolding text or “translate” for translating text into English (Robertson, 2004). A Naive Bayes classifier is trained on this feature set [1] using an 80%-20% train-test split, applying Bayes' theorem, P(C | x) = P(x | C) · P(C) / P(x), to compute the posterior probability of a command class C given the feature vector x (Rish, 2001). Static commands are directly mapped to their predefined actions, while query-based commands undergo further input extraction [3].
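A hedged sketch of the classification step, assuming scikit-learn's MultinomialNB and train_test_split; the toy utterances and labels are illustrative, not the project's dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy (utterance, intent) pairs standing in for the 46-command dataset.
texts = ["bold this text", "make it bold", "select all text", "select everything",
         "search cats", "search for news", "compose to alice", "compose to bob"]
labels = ["bold", "bold", "select_all", "select_all",
          "web_search", "web_search", "compose_email", "compose_email"]

# 80%-20% train-test split, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=0)

# Naive Bayes applies Bayes' theorem, P(C|x) ∝ P(x|C)·P(C), over TF-IDF
# features, assuming conditional independence between features.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
print(model.predict(["please bold the heading"]))   # e.g. -> ['bold']
```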
Query-based commands use regex patterns for parameter extraction; these are predefined templates for matching patterns in text (Erwig & Martin, 2012). For example, in the command “compose to Alice,” the pattern compose to (\w+) extracts “Alice” as the recipient. Similarly, in “search cats,” the pattern search (.+) captures “cats” as the search query [2].
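A minimal sketch of this extraction step using Python's built-in re module; the patterns mirror the examples above, while the intent names are assumed labels:

```python
import re

# Regex templates for the query-based commands illustrated above.
QUERY_PATTERNS = {
    "compose_email": re.compile(r"compose to (\w+)", re.IGNORECASE),
    "web_search":    re.compile(r"search (.+)", re.IGNORECASE),
}

def extract_query(intent, text):
    """Return the dynamic parameter for a query-based command, or None."""
    match = QUERY_PATTERNS[intent].search(text)
    return match.group(1) if match else None

print(extract_query("compose_email", "compose to Alice"))  # -> 'Alice'
print(extract_query("web_search", "search cats"))          # -> 'cats'
```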
Once classified and processed, commands are routed through the execution pipeline. Static commands interface directly with automation tools like PyAutoGUI, which simulate user-interface actions such as opening a new document or applying bold formatting [9]. Query-based commands involve more complex interactions, using the extracted inputs for actions like sending an email, setting document titles, or bookmarking web pages (Robertson, 2004). The system's multi-threaded design keeps it responsive and allows continuous voice input without performance degradation.
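A hedged sketch of how a classified command might be dispatched with PyAutoGUI on a worker thread; the specific keyboard shortcuts and intent names are assumptions, not the project's exact bindings:

```python
import threading
import pyautogui

# Illustrative mapping from static intents to simulated keyboard shortcuts
# (shortcut choices assume a Windows/Linux layout; use 'command' on macOS).
STATIC_ACTIONS = {
    "bold":       lambda: pyautogui.hotkey("ctrl", "b"),
    "select_all": lambda: pyautogui.hotkey("ctrl", "a"),
}

def execute(intent, query=None):
    """Dispatch a classified command on a worker thread so the voice-capture
    loop keeps listening while the UI action runs."""
    if intent in STATIC_ACTIONS:
        action = STATIC_ACTIONS[intent]
    elif intent == "web_search" and query:
        # Type the extracted query into the focused search box and submit it.
        action = lambda: (pyautogui.write(query, interval=0.02),
                          pyautogui.press("enter"))
    else:
        return
    threading.Thread(target=action, daemon=True).start()

execute("bold")                    # static command
execute("web_search", "cats")      # query-based command with extracted input
```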
In parallel, for gesture recognition, I will train a Temporal Convolutional Network (TCN) to interpret both static gestures (e.g., "fist") and dynamic gestures (e.g., "wave up"). TCNs are well suited to modeling long-range temporal dependencies while maintaining computational efficiency through parallel processing, unlike recurrent models, which suffer from vanishing gradients and sequential memory limitations (Habib & Qureshi, 2022). Using a standard webcam, I will record 10 sequences of 30 frames per gesture, as detailed in Table 2. Each frame yields 21 hand landmarks (x, y, z coordinates) [5] extracted using MediaPipe, resulting in 63 features per frame. Features are normalized to a [0, 1] range [6].
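A minimal sketch of this per-frame feature extraction, assuming OpenCV and the MediaPipe Hands solution API; camera index 0 and single-hand tracking are assumed:

```python
import cv2
import mediapipe as mp
import numpy as np

# MediaPipe Hands returns 21 landmarks per detected hand; x and y arrive
# already normalized to [0, 1] relative to the frame, z is relative depth.
mp_hands = mp.solutions.hands

def landmarks_from_frame(frame_bgr, hands):
    """Return a (63,) feature vector for one frame, or zeros if no hand is seen."""
    results = hands.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    if not results.multi_hand_landmarks:
        return np.zeros(63, dtype=np.float32)
    hand = results.multi_hand_landmarks[0]
    return np.array(
        [coord for lm in hand.landmark for coord in (lm.x, lm.y, lm.z)],
        dtype=np.float32,
    )

# Capture one 30-frame sequence from the default webcam.
cap = cv2.VideoCapture(0)
with mp_hands.Hands(max_num_hands=1) as hands:
    sequence = []
    while len(sequence) < 30:
        ok, frame = cap.read()
        if not ok:
            break
        sequence.append(landmarks_from_frame(frame, hands))
cap.release()
print(np.array(sequence).shape)   # -> (30, 63)
```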
The TCN uses dilated convolutions to capture long-term dependencies without increasing the model size, introducing gaps between kernel elements so that the convolutional filter spans a larger receptive field. Each input sequence of 30 frames (63 features per frame) passes through convolutional layers with Normalized Non-linear Activation Units (NNLUs). This activation function applies a non-linear transformation to its input and addresses the vanishing gradient problem, which occurs when gradients in earlier layers diminish to such small values that their contributions to optimization become negligible, hindering the network's ability to learn effectively (Frank & Degen, 2023). Following the convolutional layers, a Global Average Pooling (GAP) layer averages each feature map into a single value, condensing the feature maps into one vector for classification while retaining key spatial information (Habib & Qureshi, 2022). The final softmax output layer assigns probabilities to each gesture label, selecting the label with the highest probability as the recognized gesture (Franke & Degen, 2023) [7].
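A rough PyTorch sketch of this architecture; the channel widths, dilation rates, number of gesture classes, and the ReLU standing in for the NNLU activation are illustrative assumptions, not the project's exact configuration:

```python
import torch
import torch.nn as nn

class TCNGestureClassifier(nn.Module):
    """Sketch of a TCN over (batch, 63 features, 30 frames) landmark sequences."""
    def __init__(self, num_gestures, in_features=63):
        super().__init__()
        layers = []
        channels = [in_features, 64, 64, 64]
        for i in range(3):
            dilation = 2 ** i  # 1, 2, 4: widening receptive field at no extra size
            layers += [
                nn.Conv1d(channels[i], channels[i + 1], kernel_size=3,
                          dilation=dilation, padding=dilation),  # keeps length 30
                nn.ReLU(),  # stand-in for the NNLU activation described above
            ]
        self.tcn = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool1d(1)          # Global Average Pooling
        self.head = nn.Linear(channels[-1], num_gestures)

    def forward(self, x):
        x = self.tcn(x)                              # (batch, 64, 30)
        x = self.pool(x).squeeze(-1)                 # (batch, 64)
        return self.head(x)                          # logits per gesture class

model = TCNGestureClassifier(num_gestures=8)         # 8 classes assumed here
logits = model(torch.randn(4, 63, 30))               # 4 sequences of 30 frames
probs = torch.softmax(logits, dim=1)
print(probs.shape)                                    # -> torch.Size([4, 8])
```

The head returns raw logits; the softmax is applied at inference, which also pairs naturally with the cross-entropy training objective described next.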
Training will use the Adam optimizer and CrossEntropyLoss, with early stopping and a dynamic learning-rate scheduler to prevent overfitting (Hicks et al., 2022); the scheduler adjusts the learning rate during training to accelerate convergence and fine-tune the model's performance. The dataset will be partitioned into training (70%), validation (15%), and testing (15%) subsets for model evaluation. Once trained, the TCN will be integrated into a real-time system using OpenCV and MediaPipe, with Kalman filters smoothing predictions for consistent cursor control. Gesture coordinates will be mapped to screen dimensions via PyAutoGUI, enhanced by exponential moving averages and low-pass filters for natural transitions [8]. This ensures smooth, responsive interaction for users.
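A simplified sketch of the coordinate mapping and smoothing stage; an exponential moving average stands in here for the full Kalman/low-pass filtering pipeline, and the smoothing factor is an assumed value:

```python
import pyautogui

class SmoothedCursor:
    """Map normalized (0-1) fingertip coordinates to screen pixels, smoothed
    with an exponential moving average (a simplified stand-in for the
    Kalman/low-pass filtering stage; alpha = 0.3 is an assumed setting)."""
    def __init__(self, alpha=0.3):
        self.alpha = alpha
        self.ema_x = None
        self.ema_y = None
        self.screen_w, self.screen_h = pyautogui.size()

    def update(self, x_norm, y_norm):
        if self.ema_x is None:                       # first observation
            self.ema_x, self.ema_y = x_norm, y_norm
        else:                                        # EMA: new = a*obs + (1-a)*old
            self.ema_x = self.alpha * x_norm + (1 - self.alpha) * self.ema_x
            self.ema_y = self.alpha * y_norm + (1 - self.alpha) * self.ema_y
        pyautogui.moveTo(self.ema_x * self.screen_w, self.ema_y * self.screen_h)

cursor = SmoothedCursor()
cursor.update(0.52, 0.40)   # e.g. MediaPipe index-fingertip landmark (x, y)
```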
Specialized commands requiring additional libraries include:
Risks
Current Progress: (Click on the links to see the demonstrations)
Video 1: Interacting with Google Document
Video 2: Composing and Sending Emails (intuitive, based on personalized commands, e.g., "John")
Video 3: Searching through Google
Video 4: Using Online Editors for Code
Video 5: Live-Feedback on Documents, Articles, and Translation
References