1 of 23

x-to-audio: General Audio Synthesis From Various Input Prompt

Keisuke Imoto

Doshisha University

Jun. 27, 2024

2 of 23

What is generative AI?

  • ChatGPT (4o) answers

  • Claude (3.5 Sonnet) answers

3 of 23

What is generative AI?

  • Can also be regarded as an input-to-output conversion system

Figure: text-to-speech, text-to-text, speech-to-speech, text-to-image

4 of 23

What is generative AI?

  • Can also be regarded as an input-to-output conversion system
  • How can we obtain the desired output?
    • Model structure?
    • Training dataset?
    • Input prompt/query/feature?

Figure: text-to-speech, text-to-text, speech-to-speech, text-to-image

5 of 23

x-to-audio synthesis (xTA)

  • Speech synthesis
    • text-to-speech (TTS) synthesis
    • voice conversion (speech-to-speech conversion)

  • Singing voice/music synthesis
    • singing voice synthesis (musical notes/text/musical context-to-voice)
    • music synthesis (musical notes/instrument label/context-to-music)

  • General audio synthesis
    • environmental sound synthesis (event label/text/onomatopoeia-to-audio)
    • Foley sound synthesis (event label/text/onomatopoeia-to-audio)

6 of 23

Applications of general audio synthesis

  • Automatic generation of movie and game content
  • Reproduction of environmental sounds in VR/AR spaces
  • Control of animal behavior

7 of 23

General audio synthesis using event label (label-to-audio)

  • label-to-audio based on conditional WaveNet [Okamoto+ 2019]
  • label-to-audio based on SampleRNN [Kong+ 2019]
    • Generate a waveform with a synthesis model conditioned on the event label

Figure: Waveform generation using WaveNet

※ https://deepmind.com/blog/article/wavenet-generative-model-raw-audio
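
As a rough illustration of what "conditioned on the event label" means, the sketch below generates samples one at a time, each depending on the previous samples and on label-specific parameters. It is not WaveNet or SampleRNN: a two-tap linear filter stands in for the neural network, and `LABEL_PARAMS`, `synthesize`, and all numeric values are hypothetical.

```python
import math
import random

# Hypothetical per-label parameters: (resonance frequency in Hz, pole radius).
# In WaveNet or SampleRNN a neural network plays this role; these values are
# made up purely for illustration.
LABEL_PARAMS = {
    "alarm_clock": (2000.0, 0.999),
    "maracas": (6000.0, 0.90),
    "coffee_grinder": (300.0, 0.98),
}


def synthesize(label, num_samples, sample_rate=16000, seed=0):
    """Generate samples one by one, each conditioned on past samples and the label."""
    freq, radius = LABEL_PARAMS[label]
    theta = 2.0 * math.pi * freq / sample_rate
    a1 = 2.0 * radius * math.cos(theta)  # weight on x[t-1]
    a2 = -radius * radius                # weight on x[t-2]
    rng = random.Random(seed)
    x = [0.0, 0.0]                       # initial context
    for _ in range(num_samples):
        noise = rng.gauss(0.0, 0.01)     # random excitation
        x.append(a1 * x[-1] + a2 * x[-2] + noise)
    return x[2:]


alarm = synthesize("alarm_clock", 16000)
maracas = synthesize("maracas", 16000)
```

The same random excitation yields different waveforms for different labels because only the label-dependent weights change, which mirrors how the label steers an autoregressive synthesizer.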

8 of 23

Quiz

  • Which is the synthesized sound?

Alarm clock

Maracas

Coffee grinder

9 of 23

Quiz

  • Which is the synthesized sound?

Alarm clock

Maracas

Coffee grinder

10 of 23

Quiz

  • Which is the synthesized sound?

Alarm clock

Maracas

Coffee grinder

11 of 23

Challenges in label-to-audio synthesis

  • Event labels do not carry enough information to control the fine structure of sounds
    • A label-to-audio system cannot synthesize varied sounds from a single label

Figure: Synthesized sounds 1 to 4

12 of 23

text-to-audio (TTA) synthesis

  • DiffSound [Yang+ 2022]
  • AudioLDM [Liu+ 2023]
  • Tango [Ghosal+ 2023]
    • Diffusion model-based TTA
    • Use a text description as input

Figure: Training and sampling process of AudioLDM
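
Diffusion-based synthesis builds on a forward process that gradually corrupts clean data with Gaussian noise; a network is then trained to undo that corruption, guided here by the text description. The sketch below shows only the standard closed-form forward step of a generic Gaussian diffusion model, not the exact formulation of any of the cited systems. The function names, the linear schedule, and its endpoints are assumptions.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear noise schedule (assumed)."""
    alpha_bars, prod = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t from q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I)."""
    scale = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [scale * v + sigma * e for v, e in zip(x0, noise)]
    return xt, noise


alpha_bars = make_alpha_bars(1000)
clean = [0.5, -0.25, 0.0, 0.75]          # stand-in for audio features
noisy, eps = add_noise(clean, 500, alpha_bars, random.Random(0))
```

At training time a denoiser would receive `noisy`, the step index, and a text-derived embedding, and would be trained to predict `eps`; sampling then runs the learned denoiser from pure noise back toward step 0.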

13 of 23

Example of text-to-audio (TTA) synthesis

  • AudioLDM [Liu+ 2023]
    • Demo page

14 of 23

Onomatopoeia-to-audio synthesis

  • Onomatopoeia enables fine control of the synthesized sound
    • Duration (beep vs beeeeeeeeeep)
    • Pitch (peep vs beep)
    • Timbre (beep vs pirororororo)
  • Onoma-to-wave [Okamoto+ 2021]
    • Transformer encoder/decoder + Griffin-Lim phase reconstruction

Figure: peep, beeeeeep

15 of 23

Example of synthesized sounds

  • Input onomatopoeic word (seq2seq, Transformer)
    • / z u z a: / (ズザー)
  • Input sound event (WaveNet)
    • Tearing paper

16 of 23

Variations of synthesized sounds with Onoma-to-wave

  • Synthesized cup clinking sounds with various onomatopoeic words

Synthesized sound by /ch i N q/ (チンッ)

Synthesized sound by /ch i: N q/ (チィンッ)

Synthesized sound by /p i N q/ (ピンッ)
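
Before a sequence-to-sequence model can consume phoneme strings like the ones above, they need to be turned into token sequences. The slides do not specify the input encoding, so the sketch below is only one plausible assumption: a whitespace split followed by a lookup in a small, hypothetical phoneme inventory (`PHONEMES`, `encode`, and the special tokens are illustrative names).

```python
# Hypothetical phoneme inventory; a real system would define its own symbol set.
PHONEMES = ["<pad>", "<bos>", "<eos>", "ch", "p", "z", "i", "i:", "u", "a:", "N", "q"]
PHONEME_TO_ID = {p: i for i, p in enumerate(PHONEMES)}


def encode(onomatopoeia):
    """Turn a phoneme string such as '/ch i N q/' into a list of token IDs."""
    symbols = onomatopoeia.strip().strip("/").split()
    ids = [PHONEME_TO_ID["<bos>"]]
    ids += [PHONEME_TO_ID[s] for s in symbols]  # raises KeyError on unknown symbols
    ids.append(PHONEME_TO_ID["<eos>"])
    return ids


short_clink = encode("/ch i N q/")   # [1, 3, 6, 10, 11, 2]
long_clink = encode("/ch i: N q/")   # [1, 3, 7, 10, 11, 2]
```

Under this encoding /ch i N q/ and /ch i: N q/ differ in a single token (`i` versus `i:`), which is the only cue the model would receive for the longer vowel.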

17 of 23

voice-to-audio conversion

  • Vocal imitation can represent the fine structure of sounds
    • Duration (beep vs beeeeeeeeeep)
    • Pitch (peep vs beep)
    • Timbre (beep vs pirororororo)

  • Voice-to-Foley [Okamoto+ 2023]
    • Quantizer: BYOL-A [Niizumi+ 2021] + k-means
    • Decoder: Tacotron2 [Shen+ 2018]
      • Attention + 2 LSTM layers

Figure: peep, beeeeeep
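
The quantizer listed above can be pictured as the assignment step of k-means: every feature frame is replaced by the index of its nearest centroid, presumably giving the decoder a discrete token sequence. The sketch assumes the centroids have already been fitted and uses made-up 2-D vectors in place of BYOL-A features; `nearest_centroid` and `quantize` are hypothetical names.

```python
def nearest_centroid(frame, centroids):
    """Index of the centroid with the smallest squared Euclidean distance to the frame."""
    best_index, best_dist = 0, float("inf")
    for index, centroid in enumerate(centroids):
        dist = sum((f - c) ** 2 for f, c in zip(frame, centroid))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def quantize(frames, centroids):
    """Map each feature frame to a discrete token (its nearest centroid index)."""
    return [nearest_centroid(frame, centroids) for frame in frames]


# Toy data: three centroids and four frames (illustrative values only).
centroids = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [0.2, 1.9], [0.4, 0.4]]
tokens = quantize(frames, centroids)  # [0, 1, 2, 0]
```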

18 of 23

Example of synthesized sounds from vocal imitation

19 of 23

Pitch- and rhythm-changed input vocal imitation

  • Voice-to-audio can synthesize sounds that reflect the pitch and rhythm of the input vocal imitation

20 of 23

Image-to-speech/audio synthesis

  • Visual-text-to-speech (vTTS) [Nakano+ 2022]
  • Visual-onomatopoeia-to-audio (visual Onoma-to-wave) [Ohnaka+ 2023]

21 of 23

Example of synthesized sound by visual Onoma-to-wave

  • Synthesized sound from visual onomatopoeia with repetitions
    • Input image: /pi q/
    • Input image: /pi q pi q pi q/
    • Input image: /pi q pi q pi q pi q pi q/

22 of 23

Example of synthesized sound by visual Onoma-to-wave

  • Synthesized sound from visual onomatopoeia with repetitions of different width
    • Input image: →

23 of 23

Conclusion

  • Generative AI in the audio domain is moving toward the synthesis of general sounds, not just speech or music

  • Important to consider inputs other than text descriptions
    • Difficult to verbalize all characteristics of general sounds
    • Need to consider system inputs that do not have to be verbalized

  • x-to-audio system
    • text-to-audio
    • onomatopoeia-to-audio
    • voice-to-audio
    • image-to-audio
