2 of 30

Recap: Real-time(-ish) Sequence-to-Sequence Symbolic Music Generation

with Rule-Augmented Edit-Based Models

and Knowledge Distillation

Or, Faster PolyDis,

and how I haven’t got it to work properly

Yichen “William” Huang

2

3 of 30

TL;DR

  • Idea: Edit model for symbolic music
  • Task
    • Current experiments: Knowledge distillation from a harmonic style transfer (HST) model
    • Potentially: splicing, inpainting, etc.
  • Differences from edit models in NLP
    • For our tasks, the input sequence does not easily approximate the desired output (“Keep” operations are scarce)
      • Remedy: rule-based transformations
    • Polyphonic music is not sequential, so Levenshtein distance/edits do not apply
      • Remedy: defining a new set of distances/edit operations



6 of 30

Motivation: PolyDis as an Example

  • We have cool models for seq2seq tasks
    • E.g. Harmonic style transfer as conditional seq2seq generation

  • We want close-to-real-time inference
    • E.g. NIME, running in a DAW, etc.

  • But most models are autoregressive (AR) and take forever to run


7 of 30

Motivation: PolyDis as an Example

  • PolyDis inference on an A100 (40GB)
  • Single-sequence inference (incl. pre-/post-processing): ~1.45 s
  • Trace visualization: recurring patterns reflect AR decoding


9 of 30

Motivation: PolyDis as an Example

  • HST: PolyDis vs. nearest chord tones
  • I - IV - V - I


10 of 30

Motivation: PolyDis as an Example

  • Can we go faster?
    • Non-autoregressive (NAR) architectures
    • What should we leverage to make the NAR task easier (mode collapse, etc.)?

  • HST, splicing, etc. can be formulated as rule-based transformations + minimal edits.
    • How do we model edits?


13 of 30

Rel. Work: Edit-Based Models in NLP

  • Applications: Grammar correction, simplification, style transfer, etc.
  • Two components:
  • Edit model
    • Sequence to edit operations
    • E.g. keep, delete, replace (small vocabulary), move to pointer, etc.
  • Insertion model
    • AR, NAR, Semi-AR
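
As a minimal illustration of the edit-model interface described above (names and tag set are hypothetical, not from any cited system), an edit model emits one operation per input token, and an insertion model would later fill any slots marked for insertion:

```python
# Illustrative sketch: applying a predicted tag sequence of
# KEEP / DELETE / ('REPLACE', token) operations to an input token list.
# (Hypothetical interface, for exposition only.)

def apply_edits(tokens, ops):
    """ops[i] is 'KEEP', 'DELETE', or a tuple ('REPLACE', new_token)."""
    out = []
    for tok, op in zip(tokens, ops):
        if op == "KEEP":
            out.append(tok)          # copy the token unchanged
        elif op == "DELETE":
            continue                 # drop the token
        else:                        # ('REPLACE', new_token)
            out.append(op[1])
    return out
```

For example, `apply_edits(["the", "cat", "sat"], ["KEEP", ("REPLACE", "dog"), "KEEP"])` yields `["the", "dog", "sat"]`; the appeal is that most tags are cheap KEEPs, which is exactly what fails for music (previous slide).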


14 of 30

Rel. Work: Edit-Based Models in NLP

  • Example: EdiT5: Semi-AR with pointers
    • F1 69.39 → 68.4, 57× speedup, 1.3 ms latency


16 of 30

Method: Formulation

  • How do we apply this?
  • E.g. Edit-based model for disentanglement-based HST
    • Reformulate PolyDis training for editing (feasible, probably)
    • Behaviour cloning from PolyDis
      • Flexible. Can clone any seq2seq with a few tweaks on transformation/edit rules (next slide)
      • Not to be confused with distillation
      • Downside: hard labels (the teacher’s soft label distribution is unused)
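
The soft- vs. hard-label distinction can be made concrete with a toy sketch (hypothetical numbers, not from the experiments): behaviour cloning trains against the teacher's sampled output (a one-hot target), while distillation would match the teacher's full output distribution.

```python
import math

# Toy sketch of hard- vs. soft-label cross-entropy.
# p: the student's predicted distribution over a small vocabulary.
# (Illustrative values only; not results from this work.)

def ce_hard(p, target_idx):
    """Cross-entropy against a one-hot (hard) label: -log p[target]."""
    return -math.log(p[target_idx])

def ce_soft(p, q):
    """Cross-entropy against a soft teacher distribution q."""
    return -sum(qi * math.log(pi) for qi, pi in zip(q, p))
```

With a one-hot `q`, `ce_soft` reduces exactly to `ce_hard`; the extra information in behaviour cloning is lost precisely when `q` is degenerate.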


17 of 30

Method: Pipeline Overview

  • Transformation Rules
    • Transform the raw input to approximate the desired output sequence
  • Edit Rules
    • Derive ground-truth (GT) edit operations to supervise the edit model
  • Edit Model
  • Insertion Model


18 of 30

Method: Transformation Rules

  • Currently: Nearest chord tone

  • Future work: Rule discovery
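
The nearest-chord-tone rule can be sketched as follows (an illustrative assumption about the exact rule, not the deck's implementation; `chord_pcs` holds the chord's pitch classes, 0-11):

```python
# Sketch: snap a MIDI pitch to the nearest chord tone, searching the
# octaves around it. Assumes chord tones are given as pitch classes.

def nearest_chord_tone(pitch, chord_pcs):
    """Return the MIDI pitch of the chord tone closest to `pitch`."""
    best = None
    for pc in chord_pcs:
        base = (pitch // 12) * 12 + pc   # chord tone in pitch's octave
        for cand in (base - 12, base, base + 12):
            if best is None or abs(cand - pitch) < abs(best - pitch):
                best = cand
    return best
```

E.g. over a C major triad (`[0, 4, 7]`), D4 (62) snaps to C4 (60) and F4 (65) snaps to E4 (64); applying this note-by-note gives a rule baseline whose residual errors the edit model must then clean up.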


20 of 30

Method: Edit Rules

  • Previously
    • Pitch change: soft labels for notes with the same timings
    • Deletion: everything else
    • Insertion: the edit model predicts the number of insertions at each onset
      • No ambiguity: relative pitches are represented in the input
    • Obviously this was a bad idea: Onset/duration changes are frequent and should be handled by the edit model.
  • Currently
    • Distance metric: Note shifts and deletion
      • Onset [-1, 1]
      • Pitch [-3, 3]
      • Duration [-3, 3]
      • Changing one step on each dimension adds a cost of 1; deletion costs 100; out-of-range operations cost infinity
    • Optimization: MFMC (deterministic)
    • Insertion: Same as above
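
The cost structure above can be sketched directly (a toy stand-in for the MFMC optimization: a tiny exhaustive matcher, for illustration only; the ranges and costs are the ones on this slide):

```python
import itertools

# Sketch of the edit distance used to derive GT edit labels.
# Notes are (onset, pitch, duration) tuples; allowed shift ranges are
# onset +/-1, pitch +/-3, duration +/-3; each unit step costs 1,
# deletion costs 100, out-of-range shifts are forbidden (infinite cost).

BIG = float("inf")

def shift_cost(src, tgt):
    d_on, d_pi, d_du = (abs(a - b) for a, b in zip(src, tgt))
    if d_on > 1 or d_pi > 3 or d_du > 3:
        return BIG                 # out-of-range: cannot be a shift
    return d_on + d_pi + d_du      # one unit of cost per step per dim

def min_edit_cost(srcs, tgts):
    """Cheapest way to shift each source note to a distinct target
    or delete it (cost 100). Exhaustive; fine for toy inputs only --
    the real pipeline would solve this as a min-cost flow."""
    n = len(tgts)
    best = BIG
    for perm in itertools.permutations(range(n + len(srcs))):
        cost = 0
        for i, j in zip(range(len(srcs)), perm):
            cost += shift_cost(srcs[i], tgts[j]) if j < n else 100
        best = min(best, cost)
    return best
```

E.g. shifting a note up two semitones costs 2, while a note more than 3 semitones away from every target can only be deleted (cost 100), which is what makes the matching deterministic once costs are fixed.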


21 of 30

Method: Edit Rules: MFMC: Note to self


22 of 30

Method: Edit Rules

  • Future work:
    • Offloading more work to the edit model
      • Insertion (NAR): the edit model predicts the inserted pitch (soft labels) for each onset
    • Semi-AR insertion
    • Rule discovery for an efficient edit operation space


23 of 30

Method: Edit Model

  • Pretrained MuseBERT
  • Inputs
    • Altered atr-mat, original rel-mat (as used in MuseBERT unsupervised HST)
    • Chord condition: prepended embedding from a pre-trained PolyDis chord encoder
    • Insertion: prepended special tokens
  • Outputs
    • #Insertions / edit id classification, CE loss


24 of 30

Method: Insertion Model

  • 2-layer MuseBERT, from scratch
  • Inputs
    • atr-mat, rel-mat: transformed from the edit outputs
    • Chord condition: prepended embedding from PolyDis chord encoder
    • Insertion: appended masked notes
  • Outputs
    • Masked note prediction, CE loss


25 of 30

Experiments: Setup

  • Data: POP909 textures

  • Chords: drawn from 2-bar sequences from POP909


26 of 30

Experiments: Is it fast?

  • Yes
  • 351 ms (vs. 1454 ms for PolyDis)


27 of 30

Experiments: Is it fast?

Future Work

  • Optimizing CPU load: rule application/conversion
  • Distilling edit-MuseBERT: soft labels


28 of 30

Experiments: Does it work?

  • No
  • Evaluation: tick-level F1 (dev set)
    • Edit-based: F1 0.663, precision 0.710, recall 0.622
    • Rule baseline: F1 0.611, precision 0.611, recall 0.610
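
A tick-level F1 of this kind can be sketched as set overlap on a quantized piano roll (an assumption about the exact definition used here: active (tick, pitch) cells compared between prediction and ground truth):

```python
# Sketch: tick-level precision/recall/F1 over piano-roll cells.
# pred_cells / gt_cells are iterables of (tick, pitch) pairs.

def tick_f1(pred_cells, gt_cells):
    pred, gt = set(pred_cells), set(gt_cells)
    tp = len(pred & gt)                              # cells both agree on
    prec = tp / len(pred) if pred else 0.0
    rec = tp / len(gt) if gt else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return f1, prec, rec
```

Note that under this metric the rule baseline is nearly symmetric (precision ≈ recall), while the edit model trades recall for precision, consistent with it deleting notes too aggressively.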


29 of 30

Experiments: Does it work?

  • What went wrong?
    • Pending error analysis
    • Behaviour cloning: Hard labels
    • Training efficiency issue: 0.2 epochs per 12 hours. Need to optimize the CPU load.


30 of 30

To-Do

  • Add encoder-decoder attention for the insertion model.
    • This should have been obvious, and somehow I missed it…
  • But the main problem is with the edit model loss:
    • Edit/n_inserts loss goes from 1.36 / 0.29 to 1.05 / 0.22.
    • A CE edit loss of 1.05 still sounds too high. What went wrong?
