2 of 30

Recap: Real-time(-ish) Sequence-to-Sequence Symbolic Music Generation

with Rule-Augmented Edit-Based Models

and Knowledge Distillation

Or, Faster PolyDis,

and how I haven’t got it to work properly

Yichen “William” Huang

2

3 of 30

TL;DR

  • Idea: Edit model for symbolic music
  • Task
    • Current experiments: Knowledge distillation from a harmonic style transfer (HST) model
    • Potentially: splicing, inpainting, etc.
  • Differences from edit models in NLP
    • For our tasks, the input sequence does not easily approximate the desired output (“Keep” operations are scarce)
      • Remedy: rule-based transformations
    • Polyphonic music is not sequential, so Levenshtein distance/edits do not apply
      • Remedy: defining a new set of distances/edit operations



6 of 30

Motivation: PolyDis as an Example

  • We have cool models for seq2seq tasks
    • E.g. Harmonic style transfer as conditional seq2seq generation

  • We want close-to-real-time inference
    • E.g. NIME, running in a DAW, etc.

  • But most models are autoregressive (AR) and take forever to run


7 of 30

Motivation: PolyDis as an Example

  • PolyDis inference on an A100 (40GB)
  • Single-sequence inference (incl. pre-/post-processing): ~1.45 s
  • Trace visualization: recurring patterns reflect AR decoding


9 of 30

Motivation: PolyDis as an Example

  • HST: PolyDis vs. nearest chord tones
  • I - IV - V - I


10 of 30

Motivation: PolyDis as an Example

  • Can we go faster?
    • Non-autoregressive (NAR) architectures
    • What should we leverage to make the NAR task easier (mode collapse, etc.)?

  • HST, splicing, etc. can be formulated as rule-based transformations + minimal edits.
    • How do we model edits?


13 of 30

Rel. Work: Edit-Based Models in NLP

  • Applications: Grammar correction, simplification, style transfer, etc.
  • Two components:
  • Edit model
    • Sequence to edit operations
    • E.g. keep, delete, replace (small vocabulary), move to pointer, etc.
  • Insertion model
    • AR, NAR, Semi-AR
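
As a minimal illustration of the edit-model interface described above (names and tag set are hypothetical, not from any cited system), an edit model emits one operation per input token, and an insertion model would later fill any slots marked for insertion:

```python
# Illustrative sketch: applying a predicted tag sequence of
# KEEP / DELETE / ('REPLACE', token) operations to an input token list.
# (Hypothetical interface, for exposition only.)

def apply_edits(tokens, ops):
    """ops[i] is 'KEEP', 'DELETE', or a tuple ('REPLACE', new_token)."""
    out = []
    for tok, op in zip(tokens, ops):
        if op == "KEEP":
            out.append(tok)          # copy the token unchanged
        elif op == "DELETE":
            continue                 # drop the token
        else:                        # ('REPLACE', new_token)
            out.append(op[1])
    return out
```

For example, `apply_edits(["the", "cat", "sat"], ["KEEP", ("REPLACE", "dog"), "KEEP"])` yields `["the", "dog", "sat"]`; the appeal is that most tags are cheap KEEPs, which is exactly what fails for music (previous slide).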


14 of 30

Rel. Work: Edit-Based Models in NLP

  • Example: EdiT5: Semi-AR with pointers
    • F1 69.39 → 68.4, 57× speedup, 1.3 ms latency


16 of 30

Method: Formulation

  • How do we apply this?
  • E.g. Edit-based model for disentanglement-based HST
    • Reformulate PolyDis training for editing (feasible, probably)
    • Behaviour cloning from PolyDis
      • Flexible. Can clone any seq2seq with a few tweaks on transformation/edit rules (next slide)
      • Not to be confused with distillation
      • Downside: hard labels (the teacher’s soft label distribution is unused)
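
The soft- vs. hard-label distinction can be made concrete with a toy sketch (hypothetical numbers, not from the experiments): behaviour cloning trains against the teacher's sampled output (a one-hot target), while distillation would match the teacher's full output distribution.

```python
import math

# Toy sketch of hard- vs. soft-label cross-entropy.
# p: the student's predicted distribution over a small vocabulary.
# (Illustrative values only; not results from this work.)

def ce_hard(p, target_idx):
    """Cross-entropy against a one-hot (hard) label: -log p[target]."""
    return -math.log(p[target_idx])

def ce_soft(p, q):
    """Cross-entropy against a soft teacher distribution q."""
    return -sum(qi * math.log(pi) for qi, pi in zip(q, p))
```

With a one-hot `q`, `ce_soft` reduces exactly to `ce_hard`; the extra information in behaviour cloning is lost precisely when `q` is degenerate.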


17 of 30

Method: Pipeline Overview

  • Transformation Rules
    • Transform the raw input to approximate the desired output sequence
  • Edit Rules
    • Derive ground-truth (GT) edit operations to supervise the edit model
  • Edit Model
  • Insertion Model


18 of 30

Method: Transformation Rules

  • Currently: Nearest chord tone

  • Future work: Rule discovery
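
The nearest-chord-tone rule can be sketched as follows (an illustrative assumption about the exact rule, not the deck's implementation; `chord_pcs` holds the chord's pitch classes, 0-11):

```python
# Sketch: snap a MIDI pitch to the nearest chord tone, searching the
# octaves around it. Assumes chord tones are given as pitch classes.

def nearest_chord_tone(pitch, chord_pcs):
    """Return the MIDI pitch of the chord tone closest to `pitch`."""
    best = None
    for pc in chord_pcs:
        base = (pitch // 12) * 12 + pc   # chord tone in pitch's octave
        for cand in (base - 12, base, base + 12):
            if best is None or abs(cand - pitch) < abs(best - pitch):
                best = cand
    return best
```

E.g. over a C major triad (`[0, 4, 7]`), D4 (62) snaps to C4 (60) and F4 (65) snaps to E4 (64); applying this note-by-note gives a rule baseline whose residual errors the edit model must then clean up.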


20 of 30

Method: Edit Rules

  • Previously
    • Pitch change: soft labels for notes with the same timings
    • Deletion: everything else
    • Insertion: the edit model predicts the number of insertions at each onset
      • No ambiguity: relative pitches are represented in the input
    • Obviously this was a bad idea: Onset/duration changes are frequent and should be handled by the edit model.
  • Currently
    • Distance metric: Note shifts and deletion
      • Onset [-1, 1]
      • Pitch [-3, 3]
      • Duration [-3, 3]
      • Changing one step on each dimension adds a cost of 1; deletion costs 100; out-of-range operations cost infinity
    • Optimization: MFMC (deterministic)
    • Insertion: Same as above
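
The cost structure above can be sketched directly (a toy stand-in for the MFMC optimization: a tiny exhaustive matcher, for illustration only; the ranges and costs are the ones on this slide):

```python
import itertools

# Sketch of the edit distance used to derive GT edit labels.
# Notes are (onset, pitch, duration) tuples; allowed shift ranges are
# onset +/-1, pitch +/-3, duration +/-3; each unit step costs 1,
# deletion costs 100, out-of-range shifts are forbidden (infinite cost).

BIG = float("inf")

def shift_cost(src, tgt):
    d_on, d_pi, d_du = (abs(a - b) for a, b in zip(src, tgt))
    if d_on > 1 or d_pi > 3 or d_du > 3:
        return BIG                 # out-of-range: cannot be a shift
    return d_on + d_pi + d_du      # one unit of cost per step per dim

def min_edit_cost(srcs, tgts):
    """Cheapest way to shift each source note to a distinct target
    or delete it (cost 100). Exhaustive; fine for toy inputs only --
    the real pipeline would solve this as a min-cost flow."""
    n = len(tgts)
    best = BIG
    for perm in itertools.permutations(range(n + len(srcs))):
        cost = 0
        for i, j in zip(range(len(srcs)), perm):
            cost += shift_cost(srcs[i], tgts[j]) if j < n else 100
        best = min(best, cost)
    return best
```

E.g. shifting a note up two semitones costs 2, while a note more than 3 semitones away from every target can only be deleted (cost 100), which is what makes the matching deterministic once costs are fixed.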


21 of 30

Method: Edit Rules: MFMC: Note to self


22 of 30

Method: Edit Rules

  • Future work:
    • Offloading more work to the edit model
      • Insertion (NAR): the edit model predicts the inserted pitch (soft labels) for each onset
    • Semi-AR insertion
    • Rule discovery for an efficient edit operation space


23 of 30

Method: Edit Model

  • Pretrained MuseBERT
  • Inputs
    • Altered atr-mat, original rel-mat (as used in MuseBERT unsupervised HST)
    • Chord condition: prepended embedding from a pre-trained PolyDis chord encoder
    • Insertion: prepended special tokens
  • Outputs
    • #Insertions / edit id classification, CE loss


24 of 30

Method: Insertion Model

  • 2-layer MuseBERT, from scratch
  • Inputs
    • atr-mat, rel-mat: transformed from the edit outputs
    • Chord condition: prepended embedding from PolyDis chord encoder
    • Insertion: appended masked notes
  • Outputs
    • Masked note prediction, CE loss


25 of 30

Experiments: Setup

  • Data: POP909 textures

  • Chords: drawn from 2-bar sequences from POP909


26 of 30

Experiments: Is it fast?

  • Yes
  • 351 ms (vs. 1454 ms for PolyDis)


27 of 30

Experiments: Is it fast?

Future Work

  • Optimizing CPU load: rule application/conversion
  • Distilling edit-MuseBERT: soft labels


28 of 30

Experiments: Does it work?

  • No
  • Evaluation: tick-level F1 (dev set)
    • Edit-based: F1 0.663, precision 0.710, recall 0.622
    • Rule baseline: F1 0.611, precision 0.611, recall 0.610
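
A tick-level F1 of this kind can be sketched as set overlap on a quantized piano roll (an assumption about the exact definition used here: active (tick, pitch) cells compared between prediction and ground truth):

```python
# Sketch: tick-level precision/recall/F1 over piano-roll cells.
# pred_cells / gt_cells are iterables of (tick, pitch) pairs.

def tick_f1(pred_cells, gt_cells):
    pred, gt = set(pred_cells), set(gt_cells)
    tp = len(pred & gt)                              # cells both agree on
    prec = tp / len(pred) if pred else 0.0
    rec = tp / len(gt) if gt else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return f1, prec, rec
```

Note that under this metric the rule baseline is nearly symmetric (precision ≈ recall), while the edit model trades recall for precision, consistent with it deleting notes too aggressively.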


29 of 30

Experiments: Does it work?

  • What went wrong?
    • Pending error analysis
    • Behaviour cloning: Hard labels
    • Training efficiency issue: 0.2 epochs per 12 hours. Need to optimize the CPU load.


30 of 30

To-Do

  • Add encoder-decoder attention for the insertion model.
    • This should have been obvious, and somehow I missed it…
  • But the main problem is with the edit model loss:
    • Edit/n_inserts loss goes from 1.36 / 0.29 to 1.05 / 0.22.
    • A CE edit loss of 1.05 still sounds too high. What went wrong?
