1 of 38

DATA-DRIVEN SOUND AND INTERACTION DESIGN

Lonce Wyse

2 of 38

OVERVIEW

  • Transition to Audio and research papers
    • Final project options
  • Self introduction
  • Motivating modeling TEXTURES
    • Musically and technically
  • Introduce various model components
      • GANs, RNNs, self-organizing maps, data sets and representation
  • Put it all together in one big hybrid system

Intro

3 of 38

NEXT 5 WEEKS

  • “Architecture-first” approach
    • GAN (e.g. Sound Model Factory)
    • Audio Representation (Codecs)
    • VAE (e.g. Rave)
    • DDSP (e.g. DDSP)
    • Text to audio (briefly)
    • Transformers (Vampnet, DacSynthFormer)

4 of 38

Eyes-free

Games

Sound Modeling

Musical expectation

Modeling Gamakas in Carnatic Music

Anticipatory

Improvisation

Voice-controlled synthesis

Vibrotactile Musical Experience for the Deaf

Mobile platform for audience engagement

w/ Suranga Nanayakkara

w/ Srikumar Subramanian

w/ Trevor Penney, Annett Schirmer

w/ Stefano Fasciani

Arts and Creativity Lab

National University of Singapore

Sonic Bard

5 of 38

A CENTURY AGO

Edgard Varèse: “Liberation of sound”, and music as “organized sound”

Percy Grainger, “Free Music Machines” [link]

Luigi Russolo, Art of Noises manifesto (1913) [link]

Nikolai Kulbin, 1909ish

Francis Dhomont, Points de fuite (1982)

Analog, recording, & then computers….

6 of 38

MUSIC AND SOUND DESIGN

  • Since early 20th century, music admits “all sound”
  • Music from Brian O’Reilly (LASALLE College of the Arts)
    • https://vimeo.com/dendriform
    • Analog circuits
      • For audio and video

  • Media production audio
    • Ambience, sound effects

7 of 38

AUDIO VS SYMBOLIC

  • Discussion
  • Models and modeling

8 of 38

SOUND MODELS

  • What it is:
    1. A space of sounds
    2. A means of navigation
  • What it is not:
    • Universal
    • Beyond the scope:
      • Mapping between human gesture and navigation

But programming sound is hard….

9 of 38

OBJECTIVE: SOUND-TO-MODEL (“S2M”)

  • Define a synthesizer with a collection of sounds.
    • Look, ma – no programming!
  • We want a sound model *maker* - a factory.
  • The maker is universal, the golems (sound models) are specific and expressive.
    • Synth models: interactive, real-time, play forever, generalize
  • S2M is also a good term for search …..

Data-driven modeling

10 of 38

NOVELTY

  • How can we generate audio that
    • We’ve never heard before?
    • We don’t have data for?
    • We can’t possibly get data for?

“Out of domain”

11 of 38

MORPHING

  • Many paths, different descriptions
  • Defining spaces
    • “world models”
    • Musical listening

“Generalization”, Tweening sounds and Interpolation

Manually:

  • Karlheinz Stockhausen

Gesang der Jünglinge (1956) for electronics, tape manipulation

  • Trevor Wishart

Redbird: A Political Prisoner's Dream (1973-77)

link1, link2, link3

The issue of time

Slaney (1993): identify time/frequency “correspondences”

Compare to image morphing

Sounding object characteristics

Pruvost, L., Scherrer, B., Aramaki, M., Ystad, S., & Kronland-Martinet, R. (2015). Perception-based interactive sound synthesis of morphing solids' interactions. In SIGGRAPH Asia 2015 Technical Briefs (pp. 1-4).

12 of 38

MODELING & PLAYABILITY

  • The space of sounds
    • Create the latent space that organizes the discrete dataset points
    • Generate novel sounds, “interpolating” between dataset points
  • Parameters
    • Reasonably smooth and consistent
    • Model must be RESPONSIVE, with an immediate “parameter response time”
  • Play forever

13 of 38

GANSYNTH

  • (Was) State of the art for instrument tone generation
  • Excellent interpolation

Trained on 2-D, 2-channel spectrograms:

magnitude & “instantaneous frequency” (IF: the time derivative of phase)

Engel, Jesse, et al. “GANSynth: Adversarial neural audio synthesis.” arXiv preprint arXiv:1902.08710 (2019).

But musical instruments ….


14 of 38

GANS: A CLOSER LOOK

  • “GANs are unstable?”
  • Workflow
    • Gaussian noise + conditional params
    • Upscale (e.g. convolutional layers)
    • Output to Discriminator
    • Discriminator downscales until output
  • Progressive GANs
    • Add layers during training (downscaling reals, too!)
  • From discriminator to critic
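The upscale/downscale workflow above can be sketched without any learned weights. The following is a toy numpy stand-in, not the GANSynth architecture: the "layers" are fixed upsampling and pooling operations, and the layer count and latent size are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def upscale(x, factor=2):
    # Nearest-neighbour upsample + small smoothing kernel, standing in
    # for a learned transposed-convolution layer
    x = np.repeat(x, factor)
    return np.convolve(x, [0.25, 0.5, 0.25], mode="same")

def generator(z, layers=4):
    # Noise (+ conditional params) is repeatedly upscaled to audio length
    x = z
    for _ in range(layers):
        x = upscale(x)
    return x

def discriminator(x, layers=4):
    # Downscale by average pooling, then squash to a [0, 1] "realness" score
    for _ in range(layers):
        x = x.reshape(-1, 2).mean(axis=1)
    return float(1 / (1 + np.exp(-x.mean())))

z = rng.standard_normal(64)   # Gaussian noise (+ conditional params)
audio = generator(z)          # 64 -> 1024 samples after 4 doublings
score = discriminator(audio)
```

In a real GAN the two networks are trained adversarially; here the point is only the shape of the pipeline: params in, a whole audio clip out, one scalar judgment back.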

[Diagram: params + noise → Generator G → fake data; Discriminator D sees real and fake data and outputs real/fake]

Generates n seconds of sound for each parameter configuration

15 of 38

OBJECTIVE FUNCTIONS

  • Kullback-Leibler Divergence

  • Make it symmetric via the mixture M = (P + Q)/2: JSD(P || Q) = (KL(P || M) + KL(Q || M)) / 2

“Jensen-Shannon Divergence”

  • Binary classification
    • Output values in [0,1]
    • D() – the probability that the data is real
    • binary cross-entropy
    • Minimizes JSD

  • Discriminator as critic
    • Output values in [-inf, +inf]
    • Wasserstein Distance (minimum cost over all ways of transporting Pr to Pg)
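The KL and JSD objectives above are a few lines of numpy for discrete distributions; the example distributions are arbitrary.

```python
import numpy as np

def kl(p, q):
    # Kullback-Leibler divergence KL(P || Q) for discrete distributions
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

def jsd(p, q):
    # Jensen-Shannon divergence: average KL of each side against the mixture M
    p, q = np.asarray(p, float), np.asarray(q, float)
    mix = 0.5 * (p + q)
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)

p, q = [0.7, 0.2, 0.1], [0.1, 0.3, 0.6]
```

Unlike KL, JSD is symmetric in its arguments and always finite, which is why it appears as the implicit objective of the binary-cross-entropy discriminator.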

16 of 38

GAN “LATENT SPACE”

  • What is a latent space, and in particular what is it for the GAN?
  • The concept of “responsive” real time

[Diagram: params + noise → Generator G → fake data; Discriminator D judges real vs. fake]

Generates n seconds of sound for each parameter configuration

17 of 38

RNN: TRAINING DATA

  • NSynth database samples
    • Parameters:
      • Instrument ID: Trumpet and Clarinet
      • Pitch: 12 chromatic tones for each instrument, [E4–E5]

Can it generalize? ….

Wyse, L. Real-valued parametric conditioning of an RNN for interactive sound synthesis. In Proceedings of the 6th International Workshop on Musical Metacreation, ACM Conference on Computational Creativity. Salamanca, Spain, June, 2018.

[Diagram: audio input x1 and conditioning params p1, p2, … pn feed stacked GRU layers, producing output y1]
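The conditioning scheme in the diagram (the previous audio sample concatenated with real-valued parameters, fed through GRUs, projected to the next sample) can be sketched with untrained weights. Hidden size, parameter count, and the output projection here are invented for illustration, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
H, P = 16, 2        # hidden size; number of conditioning params (both invented)
D = 1 + P           # input = previous audio sample + params

W = {g: rng.standard_normal((D, H)) * 0.1 for g in "zrn"}  # input weights
U = {g: rng.standard_normal((H, H)) * 0.1 for g in "zrn"}  # recurrent weights
b = {g: np.zeros(H) for g in "zrn"}
Wy = rng.standard_normal((H, 1)) * 0.1                     # output projection

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def gru_step(x, h):
    # Standard GRU cell; the conditioning params ride along inside x
    z = sigmoid(x @ W["z"] + h @ U["z"] + b["z"])  # update gate
    r = sigmoid(x @ W["r"] + h @ U["r"] + b["r"])  # reset gate
    n = np.tanh(x @ W["n"] + (r * h) @ U["n"] + b["n"])
    return (1 - z) * h + z * n

def generate(params, n_samples):
    # Autoregressive rollout: each predicted sample is fed back in,
    # concatenated with the (possibly time-varying) parameters
    h, y, out = np.zeros(H), 0.0, []
    for _ in range(n_samples):
        x = np.concatenate(([y], params))
        h = gru_step(x, h)
        y = float(np.tanh(h @ Wy))
        out.append(y)
    return np.array(out)

audio = generate(np.array([0.5, 0.25]), 100)  # e.g. inst=0.5, normalized pitch
```

Because the parameters enter at every time step, they can be swept continuously at generation time, which is exactly what the next slides exploit.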

18 of 38

GENERALIZATION (RNN)

Train: Trumpet and Clarinet, 12 pitches spanning the octave E4–E5

Generate: Clarinet, continuous sweep spanning the octave E4–E5

note playability

19 of 38

TRUMPINET

Generate: “Trumpinet” (mid-point instrument, inst=0.5), continuous sweep spanning the octave E4–E5

Train: Trumpet (inst=0) and Clarinet (inst=1), 12 pitches spanning the octave E4–E5

Where could we get *that* data?

20 of 38

IN THE DIGITAL LUTHIER’S TOOL SET

  • Recurrent Neural Network (RNN): the “Performer”
    • Generates 1 sample at a time
    • Problem: interpolation

  • Generative Adversarial Network (GAN): the “Interpolator” (~128-D latent)
    • Generates n seconds of sound for each parameter configuration
    • Problem: fundamentally not real-time

[Diagram: GAN (params → G → fake; D judges real vs. fake) alongside RNN (x + params → y)]

Sound complexity?

21 of 38

WHAT KIND OF SOUND?

  • Texture
    • Existing generative modeling work on speech and music (traditional note-based)
    • Existing classification of “environmental” (any and all) sounds
    • Limited work on generative textures
  • What is a texture?
    • A sound we can describe as being in a “steady state”
    • There exists a scale (window size) such that the signal has a high probability of matching the same description no matter where the window is placed.
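The window-size definition above can be made operational. This sketch scores stationarity as the coefficient of variation of windowed RMS — one illustrative choice of statistic among many — and compares a noise-like texture against an evolving event; both test signals are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

def stationarity(x, win=1024):
    # Coefficient of variation of windowed RMS: small when the signal
    # "looks the same" wherever the analysis window lands
    n = len(x) // win
    rms = np.sqrt((x[:n * win].reshape(n, win) ** 2).mean(axis=1))
    return float(rms.std() / rms.mean())

texture = rng.standard_normal(2 ** 16)                             # noise-like texture
event = rng.standard_normal(2 ** 16) * np.linspace(0, 1, 2 ** 16)  # evolving event
```

The texture scores near zero at this window size; the swelling event does not, because its description depends on where the window lands.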

Examples

Dripping

Engine

Wind

Fire

Clarinet

Pops

Dripping2

Thus we are talking about a huge space (compared to speech or musical notes), but with some constraints amenable to modeling.

22 of 38

SYNTHETIC AUDIO TEXTURE DATASETS

  • Nested Hierarchy supporting multiple overlapping time scales
  • Each DSSynth has the same (abstract) interface and produces sound.
  • Designed to create texture for NN training
    • Explicit time scales that interact
    • Parameters are labels
    • Seedable for reproducibility
    • (not real-time)

[Diagram: SynTex nested hierarchy: parameters drive applause → clapper → clap]

“Argh! Why am I still coding synths!?”

SynTex: parametric audio texture datasets for conditional training of instrumental interfaces. L. Wyse, P. T. Ravikumar (New Interfaces for Musical Expression, NIME 2022)
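The clap → clapper → applause nesting might be sketched as below. This is not the actual SynTex code: the function names, envelope shapes, rates, and jitter are all invented; only the ideas — interacting time scales, parameters as labels, seedable reproducibility — come from the slide.

```python
import numpy as np

SR = 16000

def clap(rng):
    # Finest time scale: one clap, a short burst of decaying noise
    n = SR // 50
    return rng.standard_normal(n) * np.exp(-np.linspace(0, 8, n))

def clapper(rng, rate, dur):
    # Middle time scale: one clapper, claps at jittered intervals (per second)
    out = np.zeros(int(dur * SR))
    t = 0.0
    while t < dur:
        c = clap(rng)
        i = int(t * SR)
        out[i:i + len(c)] += c[: len(out) - i]
        t += max((1 + 0.3 * rng.standard_normal()) / rate, 0.01)
    return out

def applause(seed, n_clappers=10, rate=4.0, dur=2.0):
    # Coarsest time scale: many clappers summed; the seed (plus the
    # parameters, which double as labels) makes the texture reproducible
    rng = np.random.default_rng(seed)
    return sum(clapper(rng, rate, dur) for _ in range(n_clappers))

a = applause(seed=42)
```

Because every level draws from the same seeded generator, re-running with the same seed and parameters reproduces the dataset item exactly.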

23 of 38

PURE “STYLE” -- VARIATION AND EXTENSION

Original short clip

Continuation

Continuation

Continuation

24 of 38

WHAT ABOUT “DYNAMIC” TEXTURES?

  • The “problem” of time, and the texture solution

  • The structure of the description remains the same, while the values of the parameterization change
    • We (designers, composers) decide what is content and what is style.

Water fill

But are images good representations for textures?

Content?

25 of 38

NATURE OF THE GAN LATENT SPACE

  • Lumpy
    • Some neighborhoods barely change the sound at all; in other regions the sound changes quickly.
    • Possibility that “you cain’t get thar from heya”
      • 1-D straight trajectories may pass through regions that sound nothing like the start and destination points.
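Lumpiness is easy to demonstrate with a toy stand-in for the generator: walk a straight line between two latent points and measure how much the output moves at each step. The saturating 1-D "generator" here is invented purely to show the effect.

```python
import numpy as np

def g(z):
    # Toy stand-in for a GAN generator: deliberately saturating, so equal
    # steps in latent space give very unequal steps in output space
    return np.tanh(5 * z)

z_a, z_b = -1.0, 1.0
ts = np.linspace(0, 1, 11)
path = g((1 - ts) * z_a + ts * z_b)   # straight-line latent trajectory
steps = np.abs(np.diff(path))         # per-step change in the "sound"
```

The steps near the middle of the path dwarf those near the endpoints: equal latent distances, wildly unequal perceptual distances — the "lumpy" behavior described above.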

26 of 38

GAN SUBSPACE SELECTION AND SMOOTHING

Sound Model Factory: An Integrated System Architecture for Generative Audio Modelling, L. Wyse, C. Gupta, P. Kamath (EvoMusArt, 2022)

OK – Let’s put it all together!

Teuvo Kohonen’s Self-Organizing Map (“SOM”)
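A minimal Kohonen SOM, the smoothing ingredient named above. This is a generic 1-D map sketch with arbitrarily chosen grid size, decay schedules, and toy 2-D data, not the configuration used in the Sound Model Factory.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_som(data, grid=8, iters=2000, lr0=0.5, sigma0=3.0):
    # 1-D Kohonen SOM: pull the best-matching unit (and, more weakly,
    # its grid neighbours) toward each sample, with decaying learning
    # rate and neighbourhood radius
    w = rng.random((grid, data.shape[1]))
    idx = np.arange(grid)
    for t in range(iters):
        x = data[rng.integers(len(data))]
        bmu = np.argmin(((w - x) ** 2).sum(axis=1))   # best-matching unit
        lr = lr0 * (1 - t / iters)
        sigma = max(sigma0 * (1 - t / iters), 0.5)
        h = np.exp(-((idx - bmu) ** 2) / (2 * sigma ** 2))
        w += lr * h[:, None] * (x - w)
    return w

data = rng.random((500, 2))   # toy points in the unit square
w = train_som(data)
```

The neighbourhood term is what makes adjacent grid units end up with similar weight vectors — the same property that lets a SOM lay a smoother ordering over selected GAN latent points.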

27 of 38

THE BIG PICTURE

“Interpolator”

“Performer”

28 of 38

RNN TEXTURE MODELING

  • Trained on segments with constant rates in [2,32]/sec
  • Synthesized by varying the conditioning parameter

Regular

Random

[Diagram: stacked GRU layers; audio input and conditioning params produce the output sample, as before]

Recorded data, too: “unfilling”, “recorded”

29 of 38

DESIGNING NAVIGATION

GAN Generator

Designer’s choice!

Parameters from labeled data (pitch, roughness)

128-D “latent” space

30 of 38

BOREILLY TEXTURE EXAMPLE

  • Original sounds from different synthesizers

  • GAN trained unconditionally
    • Generate lots of random sounds from the 128D latent space
    • Choose 4 to make a 2D subspace

4 points define a 2D space.

128 D “latent” space
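Choosing 4 latent points to span a 2-D sheet is just bilinear interpolation. The 4-corner setup and 128-D latent size follow the slide; the random corner vectors below are placeholders for the designer's chosen sounds.

```python
import numpy as np

rng = np.random.default_rng(1)
z00, z10, z01, z11 = rng.standard_normal((4, 128))  # four chosen latent corners

def subspace(u, v):
    # Bilinear interpolation: (u, v) in [0, 1]^2 picks a point on the 2-D
    # sheet spanned by the four corner vectors in the 128-D latent space
    return ((1 - u) * (1 - v) * z00 + u * (1 - v) * z10
            + (1 - u) * v * z01 + u * v * z11)

mid = subspace(0.5, 0.5)   # the "center" of the chosen subspace
```

A 2-D controller (e.g. a touch surface) mapped to (u, v) then navigates only the designer-selected corner sounds and everything in between.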

31 of 38

NAVIGATION

32 of 38

FAKE TWEEN DATA FOR TRAINING MORPHS

With no tween data for training, the RNN fails to morph.

With FAKE tween data for training, the RNN can learn.

Real time continuous tweening of pitch and timbre

Pitch conditioned, instruments located in z-space

33 of 38

BREAK

34 of 38

OTHER NAVIGATION STRATEGIES

  • Can we create a latent space with smoother “in between” regions?
  • Can we locate sounds hidden in the latent space?
  • Can we find semantic directions in the latent space?

35 of 38

RECENT EXPLORATIONS

Gupta, C., Kamath, P., Wei, Y. Z., Li, Z., Nanayakkara, S., Wyse, L. (2023). Towards Controllable Audio Texture Morphing. International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

36 of 38

QUERY BY SYNTHESIS

Kamath, P., Li, Z., Gupta, C., Jaidka, K., Nanayakkara, S., & Wyse, L. (2023). An Example-Based Framework for Perceptually Guided Audio Texture Generation. ACM Multimedia

37 of 38

Also good for how to get control of the latent space in a GAN

38 of 38

DATA-DRIVEN SOUND MODELING

  • Data driven synthesizer design
    • Sound2Model
    • Designer’s role
      • Dataset, navigation strategies

  • Concepts
    • Latent parametric space
    • Framing “content” and “style”
    • Textures and morphability
    • Universal Factory -> expressive models

Sound

Model

Factory

NUS colleagues: Chitralekha Gupta, Purnima Kamath