1 of 18

Voice AI & Generative Audio

acmAI F’25

2 of 18

Meet the Board

Joann

Hannah

Batu

Sachin

Nathan

Kai

Dylan

Omar

Braedon

Shane

3 of 18

4 of 18

Voice AI in daily life

  • Siri / Alexa / Google Assistant
  • Voice typing & dictation
  • AI customer service calls & agents
  • YouTube auto-captions
  • Me listening to my friends' phones
  • Taco Bell drive-thru
  • McDonald's drive-thru
  • Voice changers for voice chat, e.g. SpongeBob in a Discord VC
  • Cloning my friends' and family's voices
  • Vocaloid AI

5 of 18

How Voice AI Works - The Pipeline

  • Speech-to-Text (STT) → Text Processing/LLM → Text-to-Speech (TTS)
  • Three separate AI models working together
  • Each step is modular and replaceable
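The modularity above can be sketched as a chain of three swappable stage functions. The names (`transcribe`, `reply`, `speak`) are hypothetical, not from the demo; any STT, LLM, or TTS provider can be dropped in behind the same signatures.

```javascript
// Minimal sketch of the three-stage voice pipeline. Each stage is a
// function the caller supplies, so models can be swapped independently.
async function voicePipeline(audioInput, stages) {
  const text = await stages.transcribe(audioInput); // STT: audio -> text
  const answer = await stages.reply(text);          // LLM: text -> response text
  return stages.speak(answer);                      // TTS: text -> audio
}
```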

6 of 18

Speech to text

  • Audio waveforms → Text transcription
  • OpenAI Whisper (a common choice)
  • Near human-level accuracy in English
  • Supports 50+ languages
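A hedged sketch of the STT step, assuming an `OPENAI_API_KEY` environment variable and Node 18+ (for the built-in `fetch`, `FormData`, and `Blob`). The format list follows OpenAI's documentation for the transcription endpoint at the time of writing.

```javascript
// Sketch (not the demo's exact code): send recorded audio to OpenAI's
// hosted Whisper model and get back a plain-text transcription.

// Formats the transcription endpoint accepts, per OpenAI's docs.
const SUPPORTED = ["flac", "m4a", "mp3", "mp4", "mpeg", "mpga", "ogg", "wav", "webm"];

function isSupportedAudio(filename) {
  const ext = filename.split(".").pop().toLowerCase();
  return SUPPORTED.includes(ext);
}

async function transcribe(audioBlob, filename = "clip.webm") {
  if (!isSupportedAudio(filename)) throw new Error(`unsupported format: ${filename}`);
  const form = new FormData();
  form.append("file", audioBlob, filename); // the mic recording
  form.append("model", "whisper-1");        // hosted Whisper model
  const res = await fetch("https://api.openai.com/v1/audio/transcriptions", {
    method: "POST",
    headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` },
    body: form,
  });
  const data = await res.json();
  return data.text; // plain-text transcription
}
```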

7 of 18

Text processing via an LLM

  • The "brain" of the pipeline
  • GPT-4, Claude, Gemini
  • Process text → Generate intelligent responses
  • Trained on trillions of words
  • Context-aware conversations
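Context-awareness comes from resending earlier turns with each request. A sketch of that LLM step, assuming an `OPENAI_API_KEY` and Node 18+ `fetch`; the system prompt and helper names are invented for illustration.

```javascript
// Build the message array: system prompt, prior turns, then the new user turn.
// Carrying `history` forward is what makes the conversation context-aware.
function buildMessages(history, userText) {
  return [
    { role: "system", content: "You are a concise voice assistant." },
    ...history,                          // earlier user/assistant turns
    { role: "user", content: userText }, // the latest transcription
  ];
}

async function reply(history, userText) {
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model: "gpt-4", messages: buildMessages(history, userText) }),
  });
  const data = await res.json();
  return data.choices[0].message.content; // the assistant's response text
}
```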

8 of 18

Text to speech

  • Text → Natural-sounding speech
  • Traditional (concatenative) TTS vs. neural TTS
  • ElevenLabs, OpenAI TTS, Google TTS
  • Multiple voices and emotions
  • Real-time generation (< 1 second)
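A sketch of the TTS step against OpenAI's speech endpoint, assuming an `OPENAI_API_KEY` and Node 18+. The voice names are the ones OpenAI's docs list for `tts-1`; the fallback logic is our own addition.

```javascript
// Voices documented for OpenAI's tts-1 model.
const VOICES = ["alloy", "echo", "fable", "onyx", "nova", "shimmer"];

function pickVoice(name) {
  return VOICES.includes(name) ? name : "alloy"; // fall back to the default voice
}

// Turn text into MP3 bytes, ready to stream to the browser or write to disk.
async function speak(text, voice = "alloy") {
  const res = await fetch("https://api.openai.com/v1/audio/speech", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model: "tts-1", voice: pickVoice(voice), input: text }),
  });
  return Buffer.from(await res.arrayBuffer()); // raw MP3 audio
}
```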

9 of 18

Popular Models

  • STT: Whisper, AssemblyAI, Google Speech
  • LLM: GPT-4, Claude Sonnet, Gemini, etc.
  • TTS: OpenAI TTS, ElevenLabs, Google
  • All-in-one: GPT-4o (native audio)

10 of 18

Demo of voice AI

GitHub link: https://github.com/ACM-AI-F25/voice-ai-demo

Architecture

  • Frontend: React app with voice recording (Lucide for icons)
  • Backend: Node.js server with OpenAI APIs
    • OpenAI API for whisper-1, gpt-4, tts-1 models
    • MediaRecorder API: Capture audio from user mic
    • Web Audio API: decode audio and apply effects
    • Multer as file upload middleware
  • Microphone → Whisper → GPT → TTS → Speakers
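The frontend's recording step can be sketched with the MediaRecorder API the slide mentions. This is browser-side code, a simplified version of what the demo does, not its exact source; the `/api/voice` route name and fixed clip length are assumptions.

```javascript
// Browser sketch: capture mic audio as webm chunks with MediaRecorder,
// then POST the clip as multipart form data (Multer parses it server-side).
async function recordAndSend(durationMs = 4000) {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream, { mimeType: "audio/webm" });
  const chunks = [];
  recorder.ondataavailable = (e) => chunks.push(e.data);

  const stopped = new Promise((resolve) => (recorder.onstop = resolve));
  recorder.start();
  setTimeout(() => recorder.stop(), durationMs); // fixed-length clip for simplicity
  await stopped;
  stream.getTracks().forEach((t) => t.stop());   // release the microphone

  const form = new FormData();
  form.append("audio", new Blob(chunks, { type: "audio/webm" }), "clip.webm");
  const res = await fetch("/api/voice", { method: "POST", body: form }); // hypothetical route
  return res.blob(); // backend replies with TTS audio to play back
}
```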

11 of 18

Generative Music

  • AI creating original songs from text
  • Suno, Udio, MusicGen, Stable Audio
  • Describe music → Get a complete track
  • 30-60 seconds to generate

12 of 18

13 of 18

How Music Generation Works

  • There are two domains: symbolic (MIDI, etc.) or audio
    • Symbolic: highly editable
    • Audio: more detailed and realistic, ready-to-listen
  • Abstracted: Text prompt → Trained model → Audio waveform
    • Trained on millions of songs
    • Can specify genre, mood, instruments
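To make the symbolic/audio contrast concrete: symbolic output is a list of note events you can edit directly, unlike a finished waveform. The note values below are invented for illustration (MIDI pitch 60 = middle C).

```javascript
// A phrase in a minimal MIDI-like symbolic form: editable note events.
const phrase = [
  { pitch: 60, start: 0.0, duration: 0.5, velocity: 90 }, // C4
  { pitch: 64, start: 0.5, duration: 0.5, velocity: 90 }, // E4
  { pitch: 67, start: 1.0, duration: 1.0, velocity: 95 }, // G4
];

// "Highly editable" in practice: transpose the whole phrase in one line,
// something you cannot do losslessly on rendered audio.
function transpose(notes, semitones) {
  return notes.map((n) => ({ ...n, pitch: n.pitch + semitones }));
}
```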

14 of 18

15 of 18

Generative Music Platform/Tool comparison

  • Suno AI for content creation
  • Udio for full songs
  • MusicGen by Meta (via Replicate) for developers
  • AudioCraft or Stable Audio for sound effects
  • AIVA or Mubert API for Game Soundtracks

*Check each platform's commercial-use rights and attribution requirements

16 of 18

Demo | Music Generation Integration

  • Replicate API for MusicGen (stereo-large model)
    • Creates "prediction" (job to generate music)
    • Returns prediction ID
    • Poll prediction status every 2 seconds
    • When status = "succeeded", provides audio URL
  • ~$0.01-0.05 per generation
  • 20-60 seconds generation time
  • Requires paid credits
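The create-then-poll flow above can be sketched against Replicate's predictions API, assuming a `REPLICATE_API_TOKEN` environment variable and Node 18+ `fetch`; the `versionHash` parameter stands in for the MusicGen (stereo-large) version ID, which is not reproduced here.

```javascript
const API = "https://api.replicate.com/v1/predictions";

// Replicate predictions end in one of these states.
function isTerminal(status) {
  return ["succeeded", "failed", "canceled"].includes(status);
}

async function generateMusic(prompt, versionHash) {
  const headers = {
    Authorization: `Token ${process.env.REPLICATE_API_TOKEN}`,
    "Content-Type": "application/json",
  };

  // 1. Create the prediction (the generation job); the response carries an id.
  let pred = await (await fetch(API, {
    method: "POST",
    headers,
    body: JSON.stringify({ version: versionHash, input: { prompt } }),
  })).json();

  // 2. Poll every 2 seconds until the job reaches a terminal state.
  while (!isTerminal(pred.status)) {
    await new Promise((r) => setTimeout(r, 2000));
    pred = await (await fetch(`${API}/${pred.id}`, { headers })).json();
  }

  if (pred.status !== "succeeded") throw new Error(`generation ${pred.status}`);
  return pred.output; // URL of the generated audio
}
```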

17 of 18

Voice AI Improvement Considerations

  • Streaming responses (faster feedback)
  • Voice cloning
  • Real-time models (GPT-4o audio)
  • Emotion detection & sentiment analysis
  • Multi-language support

18 of 18

Up Next⏱️

Startups

3:00 PM

@ CS401