1 of 18

Voice AI & Generative Audio

acmAI F’25

2 of 18

Meet the Board

Joann

Hannah

Batu

Sachin

Nathan

Kai

Dylan

Omar

Braedon

Shane

3 of 18

4 of 18

Voice AI in daily life

  • Siri / Alexa / Google Assistant
  • Voice typing & dictation
  • AI customer service calls & agents
  • YouTube auto-captions
  • Me listening to my friends' phones
  • Taco Bell drive-thru
  • McDonald's drive-thru
  • Voice changers for voice chat, e.g. SpongeBob in a Discord VC
  • Cloning my friends' and family's voices
  • Vocaloid AI

5 of 18

How Voice AI Works - The Pipeline

  • Speech-to-Text (STT) → Text Processing/LLM → Text-to-Speech (TTS)
  • Three separate AI models working together
  • Each step is modular and replaceable
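The modularity above can be sketched as a chain of three swappable stage functions. The names (`transcribe`, `reply`, `speak`) are hypothetical, not from the demo; any STT, LLM, or TTS provider can be dropped in behind the same signatures.

```javascript
// Minimal sketch of the three-stage voice pipeline. Each stage is a
// function the caller supplies, so models can be swapped independently.
async function voicePipeline(audioInput, stages) {
  const text = await stages.transcribe(audioInput); // STT: audio -> text
  const answer = await stages.reply(text);          // LLM: text -> response text
  return stages.speak(answer);                      // TTS: text -> audio
}
```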

6 of 18

Speech to text

  • Audio waveforms → Text transcription
  • OpenAI Whisper (a common choice)
  • Near human-level accuracy in English
  • Supports 50+ languages
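A hedged sketch of the STT step, assuming an `OPENAI_API_KEY` environment variable and Node 18+ (for the built-in `fetch`, `FormData`, and `Blob`). The format list follows OpenAI's documentation for the transcription endpoint at the time of writing.

```javascript
// Sketch (not the demo's exact code): send recorded audio to OpenAI's
// hosted Whisper model and get back a plain-text transcription.

// Formats the transcription endpoint accepts, per OpenAI's docs.
const SUPPORTED = ["flac", "m4a", "mp3", "mp4", "mpeg", "mpga", "ogg", "wav", "webm"];

function isSupportedAudio(filename) {
  const ext = filename.split(".").pop().toLowerCase();
  return SUPPORTED.includes(ext);
}

async function transcribe(audioBlob, filename = "clip.webm") {
  if (!isSupportedAudio(filename)) throw new Error(`unsupported format: ${filename}`);
  const form = new FormData();
  form.append("file", audioBlob, filename); // the mic recording
  form.append("model", "whisper-1");        // hosted Whisper model
  const res = await fetch("https://api.openai.com/v1/audio/transcriptions", {
    method: "POST",
    headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` },
    body: form,
  });
  const data = await res.json();
  return data.text; // plain-text transcription
}
```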

7 of 18

Text processing via an LLM

  • The "brain" of the pipeline
  • GPT-4, Claude, Gemini
  • Process text → Generate intelligent responses
  • Trained on trillions of words
  • Context-aware conversations
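Context-awareness comes from resending earlier turns with each request. A sketch of that LLM step, assuming an `OPENAI_API_KEY` and Node 18+ `fetch`; the system prompt and helper names are invented for illustration.

```javascript
// Build the message array: system prompt, prior turns, then the new user turn.
// Carrying `history` forward is what makes the conversation context-aware.
function buildMessages(history, userText) {
  return [
    { role: "system", content: "You are a concise voice assistant." },
    ...history,                          // earlier user/assistant turns
    { role: "user", content: userText }, // the latest transcription
  ];
}

async function reply(history, userText) {
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model: "gpt-4", messages: buildMessages(history, userText) }),
  });
  const data = await res.json();
  return data.choices[0].message.content; // the assistant's response text
}
```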

8 of 18

Text to speech

  • Text → Natural-sounding speech
  • Traditional (concatenative) TTS vs. neural TTS
  • ElevenLabs, OpenAI TTS, Google TTS
  • Multiple voices and emotions
  • Real-time generation (< 1 second)
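A sketch of the TTS step against OpenAI's speech endpoint, assuming an `OPENAI_API_KEY` and Node 18+. The voice names are the ones OpenAI's docs list for `tts-1`; the fallback logic is our own addition.

```javascript
// Voices documented for OpenAI's tts-1 model.
const VOICES = ["alloy", "echo", "fable", "onyx", "nova", "shimmer"];

function pickVoice(name) {
  return VOICES.includes(name) ? name : "alloy"; // fall back to the default voice
}

// Turn text into MP3 bytes, ready to stream to the browser or write to disk.
async function speak(text, voice = "alloy") {
  const res = await fetch("https://api.openai.com/v1/audio/speech", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model: "tts-1", voice: pickVoice(voice), input: text }),
  });
  return Buffer.from(await res.arrayBuffer()); // raw MP3 audio
}
```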

9 of 18

Popular Models

  • STT: Whisper, AssemblyAI, Google Speech
  • LLM: GPT-4, Claude Sonnet, Gemini, etc.
  • TTS: OpenAI TTS, ElevenLabs, Google
  • All-in-one: GPT-4o (native audio)

10 of 18

Demo of voice AI

GitHub link: https://github.com/ACM-AI-F25/voice-ai-demo

Architecture

  • Frontend: React app with voice recording (Lucide for icons)
  • Backend: Node.js server with OpenAI APIs
    • OpenAI API for whisper-1, gpt-4, tts-1 models
    • MediaRecorder API: Capture audio from user mic
    • Web Audio API: decode audio and apply effects
    • Multer as file upload middleware
  • Microphone → Whisper → GPT → TTS → Speakers
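The frontend's recording step can be sketched with the MediaRecorder API the slide mentions. This is browser-side code, a simplified version of what the demo does, not its exact source; the `/api/voice` route name and fixed clip length are assumptions.

```javascript
// Browser sketch: capture mic audio as webm chunks with MediaRecorder,
// then POST the clip as multipart form data (Multer parses it server-side).
async function recordAndSend(durationMs = 4000) {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream, { mimeType: "audio/webm" });
  const chunks = [];
  recorder.ondataavailable = (e) => chunks.push(e.data);

  const stopped = new Promise((resolve) => (recorder.onstop = resolve));
  recorder.start();
  setTimeout(() => recorder.stop(), durationMs); // fixed-length clip for simplicity
  await stopped;
  stream.getTracks().forEach((t) => t.stop());   // release the microphone

  const form = new FormData();
  form.append("audio", new Blob(chunks, { type: "audio/webm" }), "clip.webm");
  const res = await fetch("/api/voice", { method: "POST", body: form }); // hypothetical route
  return res.blob(); // backend replies with TTS audio to play back
}
```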

11 of 18

Generative Music

  • AI creating original songs from text
  • Suno, Udio, MusicGen, Stable Audio
  • Describe music → Get a complete track
  • 30-60 seconds to generate

12 of 18

13 of 18

How Music Generation Works

  • There are two domains: symbolic (MIDI, etc.) or audio
    • Symbolic: highly editable
    • Audio: more detailed and realistic, ready-to-listen
  • Abstracted: Text prompt → Trained model → Audio waveform
    • Trained on millions of songs
    • Can specify genre, mood, instruments
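To make the symbolic/audio contrast concrete: symbolic output is a list of note events you can edit directly, unlike a finished waveform. The note values below are invented for illustration (MIDI pitch 60 = middle C).

```javascript
// A phrase in a minimal MIDI-like symbolic form: editable note events.
const phrase = [
  { pitch: 60, start: 0.0, duration: 0.5, velocity: 90 }, // C4
  { pitch: 64, start: 0.5, duration: 0.5, velocity: 90 }, // E4
  { pitch: 67, start: 1.0, duration: 1.0, velocity: 95 }, // G4
];

// "Highly editable" in practice: transpose the whole phrase in one line,
// something you cannot do losslessly on rendered audio.
function transpose(notes, semitones) {
  return notes.map((n) => ({ ...n, pitch: n.pitch + semitones }));
}
```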

14 of 18

15 of 18

Generative Music Platform/Tool comparison

  • Suno AI for content creation
  • Udio for full songs
  • MusicGen by Meta (via Replicate) for developers
  • AudioCraft or Stable Audio for sound effects
  • AIVA or Mubert API for Game Soundtracks

*Check each platform's commercial-use rights and attribution requirements

16 of 18

Demo | Music Generation Integration

  • Replicate API for MusicGen (stereo-large model)
    • Creates "prediction" (job to generate music)
    • Returns prediction ID
    • Poll prediction status every 2 seconds
    • When status = "succeeded", provides audio URL
  • ~$0.01-0.05 per generation
  • 20-60 seconds generation time
  • Requires paid credits
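The create-then-poll flow above can be sketched against Replicate's predictions API, assuming a `REPLICATE_API_TOKEN` environment variable and Node 18+ `fetch`; the `versionHash` parameter stands in for the MusicGen (stereo-large) version ID, which is not reproduced here.

```javascript
const API = "https://api.replicate.com/v1/predictions";

// Replicate predictions end in one of these states.
function isTerminal(status) {
  return ["succeeded", "failed", "canceled"].includes(status);
}

async function generateMusic(prompt, versionHash) {
  const headers = {
    Authorization: `Token ${process.env.REPLICATE_API_TOKEN}`,
    "Content-Type": "application/json",
  };

  // 1. Create the prediction (the generation job); the response carries an id.
  let pred = await (await fetch(API, {
    method: "POST",
    headers,
    body: JSON.stringify({ version: versionHash, input: { prompt } }),
  })).json();

  // 2. Poll every 2 seconds until the job reaches a terminal state.
  while (!isTerminal(pred.status)) {
    await new Promise((r) => setTimeout(r, 2000));
    pred = await (await fetch(`${API}/${pred.id}`, { headers })).json();
  }

  if (pred.status !== "succeeded") throw new Error(`generation ${pred.status}`);
  return pred.output; // URL of the generated audio
}
```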

17 of 18

Voice AI Improvement Considerations

  • Streaming responses (faster feedback)
  • Voice cloning
  • Real-time models (GPT-4o audio)
  • Emotion detection & sentiment analysis
  • Multi-language support

18 of 18

Up Next⏱️

Startups

3:00 PM

@ CS401