1 of 38

DATA-DRIVEN SOUND AND INTERACTION DESIGN

Lonce Wyse

2 of 38

OVERVIEW

  • Transition to Audio and research papers
    • Final project options
  • Self introduction
  • Motivating modeling TEXTURES
    • Musically and technically
  • Introduce various model components
      • GANs, RNNs, self-organizing maps, data sets and representation
  • Put it all together in one big hybrid system

Intro

3 of 38

NEXT 5 WEEKS

  • “Architecture-first” approach
    • GAN (e.g. Sound Model Factory)
    • Audio Representation (Codecs)
    • VAE (e.g. Rave)
    • DDSP (e.g. DDSP)
    • Text to audio (briefly)
    • Transformers (Vampnet, DacSynthFormer)

4 of 38

Eyes-free

Games

Sound Modeling

Musical expectation

Modeling Gamakas in Carnatic Music

Anticipatory

Improvisation

Voice-controlled synthesis

Vibrotactile Musical Experience for the Deaf

Mobile platform for audience engagement

w/ Suranga Nanayakkara

w/ Srikumar Subramanian

w/ Trevor Penney, Annett Schirmer

w/ Stefano Fasciani

Arts and Creativity Lab

National University of Singapore

Sonic Bard

5 of 38

A CENTURY AGO

Edgard Varèse: “Liberation of sound”, and music as “organized sound”

Percy Grainger, “Free Music Machines” [link]

Luigi Russolo, Art of Noises manifesto (1913) [link]

Nikolai Kulbin, 1909ish

Francis Dhomont, Points de fuite (1982)

Analog, recording, & then computers….

6 of 38

MUSIC AND SOUND DESIGN

  • Since early 20th century, music admits “all sound”
  • Music from Brian O’Reilly (LASALLE College of the Arts)
    • https://vimeo.com/dendriform
    • Analog circuits
      • For audio and video

  • Media production audio
    • Ambience, sound effects

7 of 38

AUDIO VS SYMBOLIC

  • Discussion
  • Models and modeling

8 of 38

SOUND MODELS

  • What it is:
    1. A space of sounds
    2. A means of navigation
  • What it is not:
    • Universal
    • Beyond the scope:
      • Mapping between human gesture and navigation

But programming sound is hard….

9 of 38

OBJECTIVE: SOUND-TO-MODEL (“S2M”)

  • Define a synthesizer with a collection of sounds.
    • Look, ma – no programming!
  • We want a sound model *maker* - a factory.
  • The maker is universal, the golems (sound models) are specific and expressive.
    • Synth models: interactive, real-time, play forever, generalize
  • S2M is also a good term for search …..

Data-driven modeling

10 of 38

NOVELTY

  • How can we generate audio that
    • We’ve never heard before?
    • We don’t have data for?
    • We can’t possibly get data for?

“Out of domain”

11 of 38

MORPHING

  • Many paths, different descriptions
  • Defining spaces
    • “world models”
    • Musical listening

“Generalization”, Tweening sounds and Interpolation

Manually:

  • Karlheinz Stockhausen

Gesang der Jünglinge (1956) for electronics, tape manipulation

  • Trevor Wishart

Redbird: A Political Prisoner's Dream (1973-77)

link1, link2, link3

The issue of time

Slaney (1993): identify time/frequency “correspondences”

Compare to image morphing

Sounding object characteristics

Pruvost, L., Scherrer, B., Aramaki, M., Ystad, S., & Kronland-Martinet, R. (2015). Perception-based interactive sound synthesis of morphing solids' interactions. In SIGGRAPH Asia 2015 Technical Briefs (pp. 1-4).

12 of 38

MODELING & PLAYABILITY

  • The space of sounds
    • Create the latent space that organizes the discrete dataset points
    • Generate novel sounds, “interpolating” between dataset points
  • Parameters
    • Reasonably smooth and consistent
    • Model must be RESPONSIVE, with an immediate “parameter response time”
  • Play forever

13 of 38

GANSYNTH

  • (Was) State of the art for instrument tone generation
  • Excellent interpolation

Trained on 2-D, 2-channel spectrograms:

magnitude & “instantaneous frequency” (IF: the time derivative of phase)

Engel, Jesse, et al. “GANSynth: Adversarial neural audio synthesis.” arXiv preprint arXiv:1902.08710 (2019).

But musical instruments ….


14 of 38

GANS: A CLOSER LOOK

  • “GANs are unstable?”
  • Workflow
    • Gaussian noise + conditional params
    • Upscale (e.g. convolutional layers)
    • Output to Discriminator
    • Discriminator downscales until output
  • Progressive GANs
    • Add layers during training (downscaling reals, too!)
  • From discriminator to critic
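The upscale/downscale workflow above can be sketched without any learned weights. The following is a toy numpy stand-in, not the GANSynth architecture: the "layers" are fixed upsampling and pooling operations, and the layer count and latent size are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def upscale(x, factor=2):
    # Nearest-neighbour upsample + small smoothing kernel, standing in
    # for a learned transposed-convolution layer
    x = np.repeat(x, factor)
    return np.convolve(x, [0.25, 0.5, 0.25], mode="same")

def generator(z, layers=4):
    # Noise (+ conditional params) is repeatedly upscaled to audio length
    x = z
    for _ in range(layers):
        x = upscale(x)
    return x

def discriminator(x, layers=4):
    # Downscale by average pooling, then squash to a [0, 1] "realness" score
    for _ in range(layers):
        x = x.reshape(-1, 2).mean(axis=1)
    return float(1 / (1 + np.exp(-x.mean())))

z = rng.standard_normal(64)   # Gaussian noise (+ conditional params)
audio = generator(z)          # 64 -> 1024 samples after 4 doublings
score = discriminator(audio)
```

In a real GAN the two networks are trained adversarially; here the point is only the shape of the pipeline: params in, a whole audio clip out, one scalar judgment back.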

[Diagram: params + noise → Generator G → fake data; Discriminator D sees real and fake data and outputs real/fake]

Generates n seconds of sound for each parameter configuration

15 of 38

OBJECTIVE FUNCTIONS

  • Kullback-Leibler Divergence

  • Make it symmetric via the mixture M = (P + Q)/2: JSD(P || Q) = (KL(P || M) + KL(Q || M)) / 2

“Jensen-Shannon Divergence”

  • Binary classification
    • Output values in [0,1]
    • D() – the probability that the data is real
    • binary cross-entropy
    • Minimizes JSD

  • Discriminator as critic
    • Output values in [-inf, +inf]
    • Wasserstein Distance (minimum cost over all ways of transporting Pr to Pg)
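The KL and JSD objectives above are a few lines of numpy for discrete distributions; the example distributions are arbitrary.

```python
import numpy as np

def kl(p, q):
    # Kullback-Leibler divergence KL(P || Q) for discrete distributions
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

def jsd(p, q):
    # Jensen-Shannon divergence: average KL of each side against the mixture M
    p, q = np.asarray(p, float), np.asarray(q, float)
    mix = 0.5 * (p + q)
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)

p, q = [0.7, 0.2, 0.1], [0.1, 0.3, 0.6]
```

Unlike KL, JSD is symmetric in its arguments and always finite, which is why it appears as the implicit objective of the binary-cross-entropy discriminator.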

16 of 38

GAN “LATENT SPACE”

  • What is a latent space, and in particular what is it for the GAN?
  • The concept of “responsive” real time

[Diagram: params + noise → Generator G → fake data; Discriminator D judges real vs. fake]

Generates n seconds of sound for each parameter configuration

17 of 38

RNN: TRAINING DATA

  • NSynth database samples
    • Parameters:
      • Instrument ID: Trumpet and Clarinet
      • Pitch: 12 chromatic tones for each instrument, [E4–E5]

Can it generalize? ….

Wyse, L. Real-valued parametric conditioning of an RNN for interactive sound synthesis. In Proceedings of the 6th International Workshop on Musical Metacreation, ACM Conference on Computational Creativity. Salamanca, Spain, June, 2018.

[Diagram: audio input x1 and conditioning params p1, p2, … pn feed stacked GRU layers, producing output y1]
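The conditioning scheme in the diagram (the previous audio sample concatenated with real-valued parameters, fed through GRUs, projected to the next sample) can be sketched with untrained weights. Hidden size, parameter count, and the output projection here are invented for illustration, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
H, P = 16, 2        # hidden size; number of conditioning params (both invented)
D = 1 + P           # input = previous audio sample + params

W = {g: rng.standard_normal((D, H)) * 0.1 for g in "zrn"}  # input weights
U = {g: rng.standard_normal((H, H)) * 0.1 for g in "zrn"}  # recurrent weights
b = {g: np.zeros(H) for g in "zrn"}
Wy = rng.standard_normal((H, 1)) * 0.1                     # output projection

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def gru_step(x, h):
    # Standard GRU cell; the conditioning params ride along inside x
    z = sigmoid(x @ W["z"] + h @ U["z"] + b["z"])  # update gate
    r = sigmoid(x @ W["r"] + h @ U["r"] + b["r"])  # reset gate
    n = np.tanh(x @ W["n"] + (r * h) @ U["n"] + b["n"])
    return (1 - z) * h + z * n

def generate(params, n_samples):
    # Autoregressive rollout: each predicted sample is fed back in,
    # concatenated with the (possibly time-varying) parameters
    h, y, out = np.zeros(H), 0.0, []
    for _ in range(n_samples):
        x = np.concatenate(([y], params))
        h = gru_step(x, h)
        y = float(np.tanh(h @ Wy))
        out.append(y)
    return np.array(out)

audio = generate(np.array([0.5, 0.25]), 100)  # e.g. inst=0.5, normalized pitch
```

Because the parameters enter at every time step, they can be swept continuously at generation time, which is exactly what the next slides exploit.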

18 of 38

GENERALIZATION (RNN)

Train: Trumpet and Clarinet, 12 pitches spanning the octave E4–E5

Generate: Clarinet, continuous sweep spanning the octave E4–E5

note playability

19 of 38

TRUMPINET

Generate: “Trumpinet” (mid-point instrument, inst=0.5), continuous sweep spanning the octave E4–E5

Train: Trumpet (inst=0) and Clarinet (inst=1), 12 pitches spanning the octave E4–E5

Where could we get *that* data?

20 of 38

IN THE DIGITAL LUTHIER’S TOOL SET

  • Recurrent Neural Network (RNN): the “Performer”
    • Generates 1 sample at a time
    • Problem: interpolation

  • Generative Adversarial Network (GAN): the “Interpolator” (~128-D latent)
    • Generates n seconds of sound for each parameter configuration
    • Problem: fundamentally not real-time

[Diagram: GAN (params → G → fake; D judges real vs. fake) alongside RNN (x + params → y)]

Sound complexity?

21 of 38

WHAT KIND OF SOUND?

  • Texture
    • Existing generative modeling work on speech and music (traditional note-based)
    • Existing classification of “environmental” (any and all) sounds
    • Limited work on generative textures
  • What is a texture?
    • A sound we can describe as being in a “steady state”
    • There exists a scale (window size) such that the signal has a high probability of matching the same description no matter where the window is placed.
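The window-size definition above can be made operational. This sketch scores stationarity as the coefficient of variation of windowed RMS — one illustrative choice of statistic among many — and compares a noise-like texture against an evolving event; both test signals are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

def stationarity(x, win=1024):
    # Coefficient of variation of windowed RMS: small when the signal
    # "looks the same" wherever the analysis window lands
    n = len(x) // win
    rms = np.sqrt((x[:n * win].reshape(n, win) ** 2).mean(axis=1))
    return float(rms.std() / rms.mean())

texture = rng.standard_normal(2 ** 16)                             # noise-like texture
event = rng.standard_normal(2 ** 16) * np.linspace(0, 1, 2 ** 16)  # evolving event
```

The texture scores near zero at this window size; the swelling event does not, because its description depends on where the window lands.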

Examples

Dripping

Engine

Wind

Fire

Clarinet

Pops

Dripping2

Thus we are talking about a huge space (compared to speech or musical notes), but with some constraints amenable to modeling.

22 of 38

SYNTHETIC AUDIO TEXTURE DATASETS

  • Nested Hierarchy supporting multiple overlapping time scales
  • Each DSSynth has the same (abstract) interface and produces sound.
  • Designed to create texture for NN training
    • Explicit time scales that interact
    • Parameters are labels
    • Seedable for reproducibility
    • (not real-time)

[Diagram: SynTex nested hierarchy: parameters drive applause → clapper → clap]

“Argh! Why am I still coding synths!?”

SynTex: parametric audio texture datasets for conditional training of instrumental interfaces. L. Wyse, P. T. Ravikumar (New Interfaces for Musical Expression, NIME 2022)
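The clap → clapper → applause nesting might be sketched as below. This is not the actual SynTex code: the function names, envelope shapes, rates, and jitter are all invented; only the ideas — interacting time scales, parameters as labels, seedable reproducibility — come from the slide.

```python
import numpy as np

SR = 16000

def clap(rng):
    # Finest time scale: one clap, a short burst of decaying noise
    n = SR // 50
    return rng.standard_normal(n) * np.exp(-np.linspace(0, 8, n))

def clapper(rng, rate, dur):
    # Middle time scale: one clapper, claps at jittered intervals (per second)
    out = np.zeros(int(dur * SR))
    t = 0.0
    while t < dur:
        c = clap(rng)
        i = int(t * SR)
        out[i:i + len(c)] += c[: len(out) - i]
        t += max((1 + 0.3 * rng.standard_normal()) / rate, 0.01)
    return out

def applause(seed, n_clappers=10, rate=4.0, dur=2.0):
    # Coarsest time scale: many clappers summed; the seed (plus the
    # parameters, which double as labels) makes the texture reproducible
    rng = np.random.default_rng(seed)
    return sum(clapper(rng, rate, dur) for _ in range(n_clappers))

a = applause(seed=42)
```

Because every level draws from the same seeded generator, re-running with the same seed and parameters reproduces the dataset item exactly.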

23 of 38

PURE “STYLE” -- VARIATION AND EXTENSION

Original short clip

Continuation

Continuation

Continuation

24 of 38

WHAT ABOUT “DYNAMIC” TEXTURES?

  • The “problem” of time, and the texture solution

  • The structure of the description remains the same, while the values of the parameterization change
    • We (designers, composers) decide what is content and what is style.

Water fill

But are images good representations for textures?

Content?

25 of 38

NATURE OF THE GAN LATENT SPACE

  • Lumpy
    • Some neighborhoods barely change the sound at all; in other regions the sound changes quickly.
    • Possibility that “you cain’t get thar from heya”
      • 1-D straight trajectories may pass through regions that sound nothing like the start and destination points.
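Lumpiness is easy to demonstrate with a toy stand-in for the generator: walk a straight line between two latent points and measure how much the output moves at each step. The saturating 1-D "generator" here is invented purely to show the effect.

```python
import numpy as np

def g(z):
    # Toy stand-in for a GAN generator: deliberately saturating, so equal
    # steps in latent space give very unequal steps in output space
    return np.tanh(5 * z)

z_a, z_b = -1.0, 1.0
ts = np.linspace(0, 1, 11)
path = g((1 - ts) * z_a + ts * z_b)   # straight-line latent trajectory
steps = np.abs(np.diff(path))         # per-step change in the "sound"
```

The steps near the middle of the path dwarf those near the endpoints: equal latent distances, wildly unequal perceptual distances — the "lumpy" behavior described above.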

26 of 38

GAN SUBSPACE SELECTION AND SMOOTHING

Sound Model Factory: An Integrated System Architecture for Generative Audio Modelling, L. Wyse, C. Gupta, P. Kamath (EvoMusArt, 2022)

OK – Let’s put it all together!

Teuvo Kohonen’s Self-Organizing Map (“SOM”)
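A minimal Kohonen SOM, the smoothing ingredient named above. This is a generic 1-D map sketch with arbitrarily chosen grid size, decay schedules, and toy 2-D data, not the configuration used in the Sound Model Factory.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_som(data, grid=8, iters=2000, lr0=0.5, sigma0=3.0):
    # 1-D Kohonen SOM: pull the best-matching unit (and, more weakly,
    # its grid neighbours) toward each sample, with decaying learning
    # rate and neighbourhood radius
    w = rng.random((grid, data.shape[1]))
    idx = np.arange(grid)
    for t in range(iters):
        x = data[rng.integers(len(data))]
        bmu = np.argmin(((w - x) ** 2).sum(axis=1))   # best-matching unit
        lr = lr0 * (1 - t / iters)
        sigma = max(sigma0 * (1 - t / iters), 0.5)
        h = np.exp(-((idx - bmu) ** 2) / (2 * sigma ** 2))
        w += lr * h[:, None] * (x - w)
    return w

data = rng.random((500, 2))   # toy points in the unit square
w = train_som(data)
```

The neighbourhood term is what makes adjacent grid units end up with similar weight vectors — the same property that lets a SOM lay a smoother ordering over selected GAN latent points.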

27 of 38

THE BIG PICTURE

“Interpolator”

“Performer”

28 of 38

RNN TEXTURE MODELING

  • Trained on segments with constant rates in [2,32]/sec
  • Synthesized by varying the conditioning parameter

Regular

Random

[Diagram: stacked GRU layers; audio input and conditioning params produce the output sample, as before]

Recorded data, too: “unfilling”, “recorded”

29 of 38

DESIGNING NAVIGATION

GAN Generator

Designer’s choice!

Parameters from labeled data (pitch, roughness)

128-D “latent” space

30 of 38

BOREILLY TEXTURE EXAMPLE

  • Original sounds from different synthesizers

  • GAN trained unconditionally
    • Generate lots of random sounds from the 128D latent space
    • Choose 4 to make a 2D subspace

4 points define a 2D space.

128 D “latent” space
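Choosing 4 latent points to span a 2-D sheet is just bilinear interpolation. The 4-corner setup and 128-D latent size follow the slide; the random corner vectors below are placeholders for the designer's chosen sounds.

```python
import numpy as np

rng = np.random.default_rng(1)
z00, z10, z01, z11 = rng.standard_normal((4, 128))  # four chosen latent corners

def subspace(u, v):
    # Bilinear interpolation: (u, v) in [0, 1]^2 picks a point on the 2-D
    # sheet spanned by the four corner vectors in the 128-D latent space
    return ((1 - u) * (1 - v) * z00 + u * (1 - v) * z10
            + (1 - u) * v * z01 + u * v * z11)

mid = subspace(0.5, 0.5)   # the "center" of the chosen subspace
```

A 2-D controller (e.g. a touch surface) mapped to (u, v) then navigates only the designer-selected corner sounds and everything in between.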

31 of 38

NAVIGATION

32 of 38

FAKE TWEEN DATA FOR TRAINING MORPHS

With no tween data for training, the RNN fails to morph.

With FAKE tween data for training, the RNN can learn.

Real time continuous tweening of pitch and timbre

Pitch conditioned, instruments located in z-space

33 of 38

BREAK

34 of 38

OTHER NAVIGATION STRATEGIES

  • Can we create a latent space with smoother “in between” regions?
  • Can we locate sounds hidden in the latent space?
  • Can we find semantic directions in the latent space?

35 of 38

RECENT EXPLORATIONS

Gupta, C., Kamath, P., Wei, Y. Z., Li, Z., Nanayakkara, S., Wyse, L. (2023). Towards Controllable Audio Texture Morphing. International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

36 of 38

QUERY BY SYNTHESIS

Kamath, P., Li, Z., Gupta, C., Jaidka, K., Nanayakkara, S., & Wyse, L. (2023). An Example-Based Framework for Perceptually Guided Audio Texture Generation. ACM Multimedia

37 of 38

Also good for how to get control of the latent space in a GAN

38 of 38

DATA-DRIVEN SOUND MODELING

  • Data driven synthesizer design
    • Sound2Model
    • Designer’s role
      • Dataset, navigation strategies

  • Concepts
    • Latent parametric space
    • Framing “content” and “style”
    • Textures and morphability
    • Universal Factory -> expressive models

Sound

Model

Factory

NUS colleagues: Chitralekha Gupta, Purnima Kamath