Whitepaper
AI Vocal Detection
in Music
engineering@wavelets.ai
This paper describes an industry need for technology to detect AI-generated content, particularly AI-generated music. We propose an initial solution and architecture that addresses this need through the detection of AI vocals in music. Our solution has achieved 0.93 precision with 0.77 recall on song-level AI vocal detection.
Generative AI technology is advancing at a rapid pace. It is now possible to generate music from text prompts and to create synthetic AI vocals mimicking nearly any artist. This explosion of generative AI raises many challenges for the music industry, especially around detection. We propose an initial approach to AI music detection through classifying and identifying AI vocals in music. By focusing on AI vocal detection, we can home in on a key indicator of AI music. Moreover, because the voice is a fundamental and unique part of identity, especially for creators, faking an artist's voice can feel far more violative than copyright or IP infringement.
The proliferation of open source tools for generating AI vocals (e.g. So-VITS-SVC, RVC, Diff-SVC) has made it easy for anyone to create AI vocals, leading to a rapidly growing volume of tracks and content containing AI-generated vocals. Today, there are very limited ways of identifying and tagging content with AI vocals beyond content metadata, but the need across the industry is clear.
While there is extensive academic research on deepfake and synthetic speech detection, our experiments with these solutions have shown that they simply do not work for music. With this in mind, our team has undertaken original research in AI vocal detection in music and set out to build an effective set of tools for detecting AI vocals in music.
Based on extensive conversations with stakeholders in the music industry, a solution for detecting AI vocals in music would need to meet a clear set of requirements.
At a high level, our solution can be divided into three layers:
Dataset: a curated, representative dataset for model training and evaluation
Machine Learning Models: an ensemble of neural networks and other machine learning models designed and optimized for this specific classification problem
Access Layer: a way for users to upload and score audio files and interpret the predictions
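The three layers above can be sketched as a minimal scoring pipeline. This is an illustrative assumption, not the actual production architecture: the sub-models here are toy stand-ins, and the averaging rule for combining ensemble members is hypothetical.

```python
import numpy as np

class AvgEnsemble:
    """Hypothetical sketch: average the AI-vocal probabilities
    emitted by several sub-models over the same feature vector."""

    def __init__(self, models):
        # Each model maps a feature array to a probability in [0, 1].
        self.models = models

    def predict(self, features):
        scores = [m(features) for m in self.models]
        return float(np.mean(scores))

# Toy stand-ins for trained sub-models (a real system would load weights).
cnn_like = lambda x: 0.9 if x.mean() > 0.5 else 0.2
tda_like = lambda x: 0.8 if x.std() < 0.3 else 0.3

ensemble = AvgEnsemble([cnn_like, tda_like])
score = ensemble.predict(np.array([0.7, 0.8, 0.6]))  # averages 0.9 and 0.8
```

In this toy run both stand-ins fire, so the ensemble score is 0.85; the access layer would then surface such scores per audio slice.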
Our core data consists of two separate representative datasets, one to train our models and the other to evaluate and optimize model performance. These datasets contain a proprietary mixture of sources.
These datasets grow and evolve as we find new examples and new modes of vocal generation in use online.
Our team is constantly iterating, testing, and experimenting with our in-house machine learning architectures and models, leveraging academic research from speech recognition, audio classification, synthetic speech generation, synthetic speech detection, and even topological data analysis. At a high level, our solution is an ensemble of models.
We test and experiment with different model architectures, features, embeddings, and hyperparameters, along with the best mixture of training data, to optimize model performance.
Our team is building out ways to access the models we have built: through a UI and website, and through API integrations, depending on customer needs. Our detection and classification system predicts on 2.5-5 second slices of audio, so users can see the precise locations where the model believes AI vocals are present. These predictions (probability scores between 0 and 1) are processed and returned in under 30 seconds.
We are excited to share promising initial results based on our current evaluation dataset and V0 architecture. Our system has achieved 0.93 precision while holding 0.77 recall with our current approach. A track counts as correctly identified when the model flags AI vocals in at least 25% of the vocals in the track at a model score threshold of 0.65. The evaluation tracks are verified by a human expert as containing AI vocals and are confirmed not to be part of the model's training data. We are continuously improving our models and training data and will update these performance metrics as we build and evaluate.
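The song-level decision rule and metrics described above can be sketched directly from the stated numbers: a track is flagged when at least 25% of its vocal slices score above the 0.65 threshold, and precision/recall are then computed over track-level labels. The helper names below are illustrative, not the production evaluation code.

```python
def track_has_ai_vocals(slice_scores, threshold=0.65, min_fraction=0.25):
    """Flag a track when >= 25% of its vocal slices exceed the threshold."""
    if not slice_scores:
        return False
    hits = sum(1 for s in slice_scores if s > threshold)
    return hits / len(slice_scores) >= min_fraction

def precision_recall(predictions, labels):
    """Track-level precision and recall from boolean predictions/labels."""
    tp = sum(1 for p, y in zip(predictions, labels) if p and y)
    fp = sum(1 for p, y in zip(predictions, labels) if p and not y)
    fn = sum(1 for p, y in zip(predictions, labels) if not p and y)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy example: three tracks with per-slice scores and human expert labels.
tracks = [[0.9, 0.7, 0.1, 0.2], [0.3, 0.4, 0.2, 0.1], [0.8, 0.9, 0.9, 0.7]]
labels = [True, False, True]
preds = [track_has_ai_vocals(t) for t in tracks]
p, r = precision_recall(preds, labels)  # perfect on this toy data
```

On this toy data the rule flags the first and third tracks (2/4 and 4/4 slices above threshold) and not the second (0/4), giving precision and recall of 1.0; the real 0.93/0.77 figures come from the much larger, human-verified evaluation set.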