1 of 23

x-to-audio: General Audio Synthesis From Various Input Prompts

Keisuke Imoto

Doshisha University


Jun. 27, 2024

2 of 23

What is generative AI?

ChatGPT (4o) answers

Claude (3.5 Sonnet) answers


3 of 23

What is generative AI?

Can also be regarded as an input-to-output conversion system

Examples: text-to-speech, text-to-text, speech-to-speech, text-to-image

4 of 23

What is generative AI?

Can also be regarded as an input-to-output conversion system
How do we obtain a desirable output?

Model structure?
Training dataset?
Input prompt/query/feature?


5 of 23

x-to-audio synthesis (xTA)

Speech synthesis

text-to-speech (TTS) synthesis
voice conversion (speech-to-speech conversion)

Singing voice/music synthesis

singing voice synthesis (musical notes/text/musical context-to-voice)
music synthesis (musical notes/instrument label/context-to-music)

General audio synthesis

environmental sound synthesis (event label/text/onomatopoeia-to-audio)
Foley sound synthesis (event label/text/onomatopoeia-to-audio)


6 of 23

Applications of general audio synthesis

Automatic generation of movie and game content
Reproduction of environmental sounds in VR/AR spaces
Control of animal behavior


7 of 23

General audio synthesis using event label (label-to-audio)

label-to-audio based on conditional WaveNet [Okamoto+ 2019]
label-to-audio based on SampleRNN [Kong+ 2019]

Generate a waveform using a synthesis model conditioned on the event label


Figure: waveform generation using WaveNet ※

※ https://deepmind.com/blog/article/wavenet-generative-model-raw-audio
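
The idea of label-conditioned waveform generation can be illustrated with a minimal sketch. It assumes a WaveNet-style setup in which each sample is drawn from a categorical distribution over 256 mu-law levels, given the previous samples and the event label. `ToyConditionalModel`, its random weights, the three-label set, and the 16-sample context are hypothetical stand-ins for a trained network, not the models of [Okamoto+ 2019] or [Kong+ 2019].

```python
import numpy as np

NUM_LEVELS = 256       # 8-bit mu-law classes
NUM_LABELS = 3         # hypothetical event label set
RECEPTIVE_FIELD = 16   # toy context length (assumption)

def mu_law_decode(indices, levels=NUM_LEVELS):
    """Map class indices in [0, levels - 1] back to amplitudes in [-1, 1]."""
    mu = levels - 1
    y = 2.0 * np.asarray(indices, dtype=np.float64) / mu - 1.0
    return np.sign(y) * ((1.0 + mu) ** np.abs(y) - 1.0) / mu

class ToyConditionalModel:
    """Stand-in for a trained network: random weights, not a real WaveNet."""

    def __init__(self, seed=0):
        rng = np.random.default_rng(seed)
        self.w_context = rng.normal(0.0, 0.1, (RECEPTIVE_FIELD, NUM_LEVELS))
        self.w_label = rng.normal(0.0, 1.0, (NUM_LABELS, NUM_LEVELS))

    def next_sample_probs(self, context, label):
        # The event label shifts the logits, so it biases every generated sample.
        logits = context @ self.w_context + self.w_label[label]
        logits -= logits.max()              # numerical stability for softmax
        probs = np.exp(logits)
        return probs / probs.sum()

def generate(model, label, num_samples, seed=0):
    """Autoregressive sampling: each new sample depends on past samples and the label."""
    rng = np.random.default_rng(seed)
    history = np.zeros(RECEPTIVE_FIELD)
    out = np.empty(num_samples, dtype=np.int64)
    for t in range(num_samples):
        probs = model.next_sample_probs(history, label)
        out[t] = rng.choice(NUM_LEVELS, p=probs)
        history = np.roll(history, -1)      # slide the context window
        history[-1] = mu_law_decode(out[t])
    return mu_law_decode(out)

model = ToyConditionalModel()
waveform = generate(model, label=1, num_samples=200)
```

Because the label only enters through a fixed bias, a single label always maps to the same sound distribution here.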

8 of 23

Quiz

Which is the synthesized sound?

Alarm clock

Maracas

Coffee grinder


11 of 23

Challenges in label-to-audio synthesis

Event labels do not carry enough information to control the fine structure of sounds

A label-to-audio system cannot synthesize varied sounds from a single label

Synthesized sound 1
Synthesized sound 2
Synthesized sound 3
Synthesized sound 4

12 of 23

text-to-audio (TTA) synthesis

DiffSound [Yang+ 2022]
AudioLDM [Liu+ 2023]
Tango [Ghosal+ 2023]

Diffusion model-based TTA
Uses a text description as input

Figure: training and sampling process of AudioLDM
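
The sampling side of a diffusion model can be sketched as a generic DDPM-style reverse loop over a latent vector. This is not the AudioLDM implementation: the linear noise schedule, the four-dimensional latent, and `toy_noise_predictor` are assumptions made for illustration. The toy predictor pretends the clean latent equals the text embedding, which a real system would instead learn from data.

```python
import numpy as np

def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (an assumed choice for this sketch)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars

def toy_noise_predictor(z_t, alpha_bar_t, text_embedding):
    """Hypothetical stand-in for a trained, text-conditioned denoiser.

    It assumes the clean latent equals the text embedding, so the noise
    contained in z_t can be solved for in closed form.
    """
    return (z_t - np.sqrt(alpha_bar_t) * text_embedding) / np.sqrt(1.0 - alpha_bar_t)

def sample_latent(text_embedding, num_steps=1000, seed=0):
    """Ancestral sampling: start from noise and denoise step by step."""
    rng = np.random.default_rng(seed)
    betas, alpha_bars = make_schedule(num_steps)
    z = rng.standard_normal(text_embedding.shape)
    for t in reversed(range(num_steps)):
        eps = toy_noise_predictor(z, alpha_bars[t], text_embedding)
        mean = (z - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(1.0 - betas[t])
        # No noise is added at the final step; the variance is set to beta_t.
        noise = rng.standard_normal(z.shape) if t > 0 else 0.0
        z = mean + np.sqrt(betas[t]) * noise
    return z

text_embedding = np.array([0.5, -1.0, 2.0, 0.0])   # hypothetical text embedding
latent = sample_latent(text_embedding)
```

In a latent-diffusion TTA system, the sampled latent would then be decoded into a spectrogram and a waveform by separate modules, which this sketch omits.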

13 of 23

Example of text-to-audio (TTA) synthesis

AudioLDM [Liu+ 2023]

Demo page


14 of 23

Onomatopoeia-to-audio synthesis

Onomatopoeia allows fine control of the synthesized sound

Duration (beep vs beeeeeeeeeep)
Pitch (peep vs beep)
Timbre (beep vs pirororororo)

Onoma-to-wave [Okamoto+ 2021]

Transformer encoder/decoder + Griffin-Lim phase reconstruction
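
The Griffin-Lim step recovers a waveform from a magnitude spectrogram by iteratively estimating the missing phase. Below is a minimal NumPy sketch of that algorithm. The FFT size, hop size, Hann window, and iteration count are assumptions, not the settings of Onoma-to-wave, and a sine wave stands in for the spectrogram that the Transformer decoder would predict.

```python
import numpy as np

def stft(x, n_fft=256, hop=64):
    """Short-time Fourier transform, shape (n_frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec, n_fft=256, hop=64):
    """Least-squares inverse STFT via windowed overlap-add."""
    window = np.hanning(n_fft)
    n_frames = spec.shape[0]
    length = n_fft + hop * (n_frames - 1)
    x = np.zeros(length)
    norm = np.zeros(length)
    frames = np.fft.irfft(spec, n=n_fft, axis=1)
    for i in range(n_frames):
        x[i * hop:i * hop + n_fft] += frames[i] * window
        norm[i * hop:i * hop + n_fft] += window ** 2
    return x / np.maximum(norm, 1e-8)

def griffin_lim(magnitude, n_iter=50, n_fft=256, hop=64, seed=0):
    """Estimate a waveform whose STFT magnitude matches `magnitude`."""
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))   # random initial phase
    for _ in range(n_iter):
        signal = istft(magnitude * phase, n_fft, hop)
        rebuilt = stft(signal, n_fft, hop)
        phase = np.exp(1j * np.angle(rebuilt))   # keep the phase, discard the magnitude
    return istft(magnitude * phase, n_fft, hop)

# Usage: reconstruct a 440 Hz tone from its magnitude spectrogram only.
t = np.arange(4096) / 16000.0
target_magnitude = np.abs(stft(np.sin(2 * np.pi * 440.0 * t)))
waveform = griffin_lim(target_magnitude)
```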

Example inputs: peep, beeeeeep

15 of 23

Example of synthesized sounds

Input onomatopoeic word (seq2seq, Transformer)

/z u z a:/ (ズザー)

Input sound event (WaveNet)

Tearing paper


16 of 23

Variations of synthesized sounds with Onoma-to-wave

Synthesized cup clinking sounds with various onomatopoeic words

Synthesized sound by /ch i N q/ (チンッ)
Synthesized sound by /ch i: N q/ (チィンッ)
Synthesized sound by /p i N q/ (ピンッ)
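
To see how these three words differ as model input, the sketch below turns each phoneme string into a sequence of integer IDs, as a seq2seq encoder would consume them. It assumes space-separated phoneme notation as written above. The vocabulary, the `<pad>` and `<eos>` symbols, and the ID assignment are hypothetical and not taken from Onoma-to-wave.

```python
def build_vocab(words):
    """Assign an integer ID to every phoneme symbol seen in `words`."""
    symbols = sorted({p for w in words for p in w.split()})
    vocab = {"<pad>": 0, "<eos>": 1}
    for i, s in enumerate(symbols):
        vocab[s] = i + 2
    return vocab

def encode(word, vocab):
    """Convert a space-separated phoneme string into IDs, ending with <eos>."""
    return [vocab[p] for p in word.split()] + [vocab["<eos>"]]

words = ["ch i N q", "ch i: N q", "p i N q"]
vocab = build_vocab(words)
for w in words:
    print(f"/{w}/ ->", encode(w, vocab))
```

Each pair of words differs in exactly one token (`i` vs `i:`, or `ch` vs `p`), which is the kind of small input change that onomatopoeia-to-audio synthesis maps to a change in duration or timbre.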

17 of 23

voice-to-audio conversion

Vocal imitation can represent the fine structure of sounds

Duration (beep vs beeeeeeeeeep)
Pitch (peep vs beep)
Timbre (beep vs pirororororo)

Voice-to-Foley [Okamoto+ 2023]

Quantizer: BYOL-A [Niizumi+ 2021] + k-means
Decoder: Tacotron2 [Shen+ 2018] (attention + 2 LSTM layers)
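
The quantizer stage can be sketched as k-means clustering over frame-level features, where each frame is replaced by the index of its nearest centroid. In this sketch, random 2-D points stand in for BYOL-A embeddings of a vocal imitation. The feature dimension, the number of clusters, and the iteration count are assumptions, not the Voice-to-Foley settings.

```python
import numpy as np

def quantize(features, centroids):
    """Return the index of the nearest centroid for every feature frame."""
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def kmeans(features, k, n_iter=20, seed=0):
    """Plain k-means (Lloyd's algorithm) with random initialization from the data."""
    rng = np.random.default_rng(seed)
    centroids = features[rng.choice(len(features), size=k, replace=False)].copy()
    for _ in range(n_iter):
        tokens = quantize(features, centroids)
        for j in range(k):
            members = features[tokens == j]
            if len(members) > 0:            # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids

# Hypothetical stand-in for frame-level embeddings of a vocal imitation.
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
features = np.concatenate([c + 0.3 * rng.standard_normal((40, 2)) for c in centers])

centroids = kmeans(features, k=3)
tokens = quantize(features, centroids)      # discrete token sequence, one per frame
```

The resulting token sequence would then play the role that text plays in Tacotron2, serving as the discrete input to the decoder.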

Example inputs: peep, beeeeeep

18 of 23

Example of synthesized sounds from vocal imitation


19 of 23

Pitch- and rhythm-changed input vocal imitation

Voice-to-audio can synthesize sounds that reflect the pitch and rhythm of the input vocal imitation


20 of 23

Image-to-speech/audio synthesis

Visual-text-to-speech (vTTS) [Nakano+ 2022]
Visual-onomatopoeia-to-audio (visual Onoma-to-wave) [Ohnaka+ 2023]


21 of 23

Example of synthesized sound by visual Onoma-to-wave

Synthesized sound from visual onomatopoeia with repetitions

Input image: /pi q/ → [audio]
Input image: /pi q pi q pi q/ → [audio]
Input image: /pi q pi q pi q pi q pi q/ → [audio]

22 of 23

Example of synthesized sound by visual Onoma-to-wave

Synthesized sound from visual onomatopoeia with repetitions of different width

Input image: [image] → [audio]

23 of 23

Conclusion

Generative AI in the audio domain is moving towards synthesis of general sounds, not just speech or music

Important to consider inputs other than text descriptions

Difficult to verbalize all characteristics of general sounds
Need to consider system inputs that do not have to be verbalized

x-to-audio system

text-to-audio
onomatopoeia-to-audio
voice-to-audio
image-to-audio
