1 of 50

Music Composition Using Neural Networks

Shen Ting Ang
Data Science SG, 19 Oct 2017

2 of 50

Brief Autobio

© Shen Ting Ang 2017

3 of 50

Collaborators!

My teammates for UCSD CSE253 Final Project (Mar 2016):

Patrick Hsu, Anand Desai, Feichao Qian, Olga Souverneva

4 of 50

Popular Topic?

23 Groups in class, 6 (including us) chose this topic back in Mar 2016.

Spoiler: Ours scored the highest.

Other popular topics: Image Captioning, Image Classification, etc.

5 of 50

Aims

  1. How to model music?
  2. LSTM vs RNN?
  3. If LSTM, which gating unit to use?
  4. Can we achieve polyphonic generation?

6 of 50

Aims

  • How to model music? Good question...
  • LSTM vs RNN? Hypothesis: LSTM but let’s test
  • If LSTM, which gating unit to use? Test/Compare
  • Can we achieve polyphonic generation? Maybe?

7 of 50

Big Question 1: How to model music?

Two main types of music representation:

  1. Audio - Waveforms/compressed waveforms (MP3, FLAC, etc.)
  2. Notational - MIDI/ABC

Pros and cons of each?

8 of 50

Waveform Representation

Raw Waveforms are hard to use for modelling!

Common feature representation: Mel-Frequency Cepstral Coefficients (MFCCs)

End result: each window of audio is represented by a vector of coefficients (usually about 10-50 per window)
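
To make the "one vector per window" idea concrete, here is a minimal sketch of the bookkeeping only: how many windows a clip yields, and the commonly used mel-scale mapping that MFCCs are built on. The sample rate, window size, hop size and coefficient count are illustrative assumptions, not values from the project, and the actual MFCC pipeline (FFT, mel filterbank, log, DCT) is left out.

```python
import math

def hz_to_mel(freq_hz):
    """Common mel-scale formula (one of several variants in use)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def num_frames(n_samples, window, hop):
    """Number of full windows that fit when sliding forward by `hop` samples."""
    if n_samples < window:
        return 0
    return 1 + (n_samples - window) // hop

# Assumed values for illustration only (not from the talk).
SAMPLE_RATE = 16000   # samples per second
WINDOW = 400          # 25 ms at 16 kHz
HOP = 160             # 10 ms at 16 kHz
N_COEFFS = 13         # within the 10-50 range mentioned above

frames = num_frames(SAMPLE_RATE * 1, WINDOW, HOP)   # one second of audio
feature_shape = (frames, N_COEFFS)                  # one coefficient vector per window
```

Under these assumed settings a single second of audio already becomes a sequence of roughly a hundred vectors, which hints at why window and hop sizes are parameters that need deciding.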

9 of 50

Waveform Representation (MFCC)

10 of 50

Waveform Representation (Attempt)

Data: Bach Goldberg Variations - Failed!

  • Loss did not converge even after 2000 iterations
  • Tuned various parameters - number of layers, number of nodes, unit types etc.
  • Poor processing/wrong parameters maybe...

11 of 50

Sheet Music

12 of 50

Notation Representation (MIDI)

Messaging Protocol - “Note on, note off”

Covers:

  • Pitch
  • Volume
  • Time
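
The "note on, note off" idea can be sketched as a stream of events that is paired up into notes. This is a simplified stand-in, not a MIDI parser: the `Event` and `Note` types are hypothetical, and times are absolute ticks here, whereas real MIDI files store delta times between messages.

```python
from collections import namedtuple

# Hypothetical simplified messages: time in ticks, pitch and velocity in 0-127.
Event = namedtuple("Event", "kind time pitch velocity")
Note = namedtuple("Note", "pitch start duration velocity")

def events_to_notes(events):
    """Pair each note_on with the next note_off of the same pitch."""
    active = {}   # pitch -> (start_time, velocity)
    notes = []
    for ev in sorted(events, key=lambda e: e.time):
        if ev.kind == "note_on" and ev.velocity > 0:
            active[ev.pitch] = (ev.time, ev.velocity)
        elif ev.pitch in active:
            # note_off, or note_on with velocity 0 (treated as note_off by convention)
            start, vel = active.pop(ev.pitch)
            notes.append(Note(ev.pitch, start, ev.time - start, vel))
    return notes

events = [
    Event("note_on", 0, 60, 90),     # middle C starts
    Event("note_on", 0, 64, 90),     # E starts at the same time (two voices)
    Event("note_off", 480, 60, 0),   # C ends
    Event("note_off", 960, 64, 0),   # E ends later
]
notes = events_to_notes(events)
```

Each resulting note carries the three properties listed above: pitch, volume (velocity) and time (start and duration).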

13 of 50

Notation Representation (ABC)

“Textual” representation of MIDI

  • Easy conversion ABC <-> MIDI
  • Easily converted to sheet music
  • Notation is easily understood by musicians
  • Compact representation of music
  • Supports polyphony/multiple voices

14 of 50

ABC Notation Example

T:291. Was frag’ ich nach der Welt
A|F E/D/ A A|HB3 B|E E A G|F E HD A|B B A G|
HF3 E|F ^G A B/c/4d/4|c B/A/ HA A|A A d =c|HB3 B|
B B e d|Hc3 A|B A B c|Hd3 A|A G/F/ E/F/4G/4 E|HD3|]

15 of 50

Data Set (ABC Notation)

  1. Scottish Folk songs by James Aird - Single voice (Monophony)
  2. Nottingham Music Database - Polyphony
  3. JS Bach Chorales - Contrapuntal Polyphony

16 of 50

Bach Chorale Example

17 of 50

Why use ABC Notation?

  • Simplifies to text generation problem
    • “Well-studied” problem - e.g. Bible, Project Gutenberg
    • Easier than waveform representation
  • Good source of data
  • Easier data processing
    • Waveforms are “harder” to generate well
    • Avoid issues with deciding on MFCC window size, etc.
    • Data can be used “as-is” without much further work
  • Polyphony support

18 of 50

Character-Level RNN Text Generation

Training:

  • Network reads training set in fixed lengths of text

Generating:

  • Start with seed text
  • Network will generate output character by character
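
The training and generation steps above can be sketched without any deep learning library. The function names, the window length and step, and the `uniform_model` placeholder are assumptions for illustration; in the real setup a trained network would supply the next-character probabilities.

```python
import random

def build_vocab(text):
    """Sorted list of distinct characters plus a character-to-index map."""
    chars = sorted(set(text))
    return chars, {c: i for i, c in enumerate(chars)}

def make_windows(text, maxlen, step):
    """Cut text into fixed-length inputs, each paired with the character that follows."""
    inputs, targets = [], []
    for i in range(0, len(text) - maxlen, step):
        inputs.append(text[i:i + maxlen])
        targets.append(text[i + maxlen])
    return inputs, targets

def one_hot(window, char_to_idx):
    """Encode a window as one 0/1 row per character."""
    rows = []
    for c in window:
        row = [0] * len(char_to_idx)
        row[char_to_idx[c]] = 1
        rows.append(row)
    return rows

def generate(predict_probs, seed, chars, length, rng):
    """Extend `seed` one character at a time by sampling from the model's output."""
    out = seed
    for _ in range(length):
        probs = predict_probs(out[-len(seed):])   # model sees the latest window
        out += rng.choices(chars, weights=probs, k=1)[0]
    return out

# Toy corpus: the opening bars of an ABC tune.
corpus = "A|F E/D/ A A|HB3 B|E E A G|F E HD A|"
chars, char_to_idx = build_vocab(corpus)
inputs, targets = make_windows(corpus, maxlen=8, step=3)
encoded = one_hot(inputs[0], char_to_idx)

# Placeholder for a trained network: uniform distribution over the vocabulary.
def uniform_model(window):
    return [1.0 / len(chars)] * len(chars)

sample = generate(uniform_model, seed=corpus[:8], chars=chars, length=20,
                  rng=random.Random(0))
```

With the placeholder model the output is random ABC-like characters; the structure of the loop is the point, not the quality of the sample.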

19 of 50

Character-Level RNN Text Generation

Source: Andrej Karpathy

20 of 50

RNN vs LSTM vs GRU

  • RNN - Extension of feed-forward neural network
    • Can handle variable length input
  • LSTM - Repeating module has more complicated structure
    • “Memory unit”
    • Avoids the error decay (vanishing gradient) problem of plain RNNs
    • Gates to remove/add information - input, output, forget gates
  • Gated Recurrent Unit (GRU)
    • Update and Reset gates
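
As a rough illustration of what the update and reset gates do, here is a single GRU step for one scalar input and one hidden unit. This is a sketch under assumptions: the weights are arbitrary rather than trained, and the direction of the update gate follows one common convention (some libraries define it the other way round).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h_prev, p):
    """One GRU step for a single scalar input and a single hidden unit."""
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev + p["bz"])   # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev + p["br"])   # reset gate
    # Candidate state: the reset gate scales how much of the old state is used.
    h_cand = math.tanh(p["wh"] * x + p["uh"] * (r * h_prev) + p["bh"])
    # Convention used here: z near 0 keeps the old state, z near 1 overwrites it.
    return (1.0 - z) * h_prev + z * h_cand

# Arbitrary illustrative weights (not trained values).
params = dict(wz=0.0, uz=0.0, bz=-20.0,   # update gate almost closed
              wr=1.0, ur=1.0, br=0.0,
              wh=1.0, uh=1.0, bh=0.0)

h = 0.5
for x in [1.0, -1.0, 0.3]:
    h = gru_step(x, h, params)   # state is carried across steps almost unchanged
```

With the update gate nearly closed the hidden state survives several inputs intact, which is the "memory" behaviour that gated units add over a plain RNN.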

21 of 50

Easy Implementation with Python/Keras

  • Python 2 with Keras (v0.3 at that time)
  • Modified from example script for text generation in Keras - see References for link
  • ~100 lines of code

22 of 50

Simple Network Structure

  • Example Base Model (on the right)
  • Vary:
    • # of Layers (1 vs 3)
    • Type of Layer (RNN, LSTM, GRU)
    • Type of Activation (tanh, ReLU)
    • Type of Optimizer (RMSProp, SGD, AdaGrad, AdaDelta)
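
The options listed above define a small search space. The sketch below enumerates every combination; the dictionary keys and the lowercase option names are hypothetical, and the slides do not say that every combination was actually trained, so treat the full grid as an upper bound.

```python
from itertools import product

# Options taken from the list above; key names are made up for this sketch.
search_space = {
    "num_layers": [1, 3],
    "layer_type": ["RNN", "LSTM", "GRU"],
    "activation": ["tanh", "relu"],
    "optimizer": ["rmsprop", "sgd", "adagrad", "adadelta"],
}

def enumerate_configs(space):
    """Return one dict per combination of option values."""
    keys = list(space)
    return [dict(zip(keys, values))
            for values in product(*(space[k] for k in keys))]

configs = enumerate_configs(search_space)
n_configs = len(configs)   # 2 * 3 * 2 * 4 combinations
```

Varying one factor at a time from a base model, rather than training the full grid, keeps the number of runs much smaller.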

23 of 50

Low Footprint Execution

  • Lenovo Y50 laptop (GeForce GTX 960M, Intel i7-4720HQ) - US$1000 in Nov 2015
  • Average training time of under 4 hours
  • Good convergence within 100-200 iterations

24 of 50

Evaluation and Data Sets (Recap)

25 of 50

LSTM has a lower loss than RNN

[Figure: loss plots, one panel each for RNN and LSTM]

26 of 50

Loss Decreases with Deeper Architectures

27 of 50

RMSprop gives fastest convergence for loss

28 of 50

Big Question 2: How to Evaluate?

Objective Measures?

No AUC, Accuracy, etc.

What about Loss?

Does it make sense to evaluate music on objective measures? Do these even exist?

29 of 50

An Objective Evaluation Method

Euler’s Gradus Suavitatis measure of melodiousness (1739) - higher is better

Data / Model                Score
Aird (Original)             7.059
Aird (RNN)                  6.191
Aird (LSTM)                 5.743
Aird (GRU)                  5.890
Bach Chorales (Original)    6.821
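
For reference, Euler's underlying function on a single interval can be sketched as follows. For an integer n with prime factorisation p1^a1 * p2^a2 * ..., the gradus is 1 + sum of a_i * (p_i - 1), and an interval with frequency ratio a:b is scored by applying this to the least common multiple of the reduced terms. In Euler's own formulation smaller values correspond to simpler ratios. The slides do not state how the per-piece scores above were derived from it, so this sketch covers only the interval-level function, and the function names are my own.

```python
from math import gcd

def prime_factors(n):
    """Return {prime: exponent} for a positive integer n."""
    factors = {}
    p = 2
    while p * p <= n:
        while n % p == 0:
            factors[p] = factors.get(p, 0) + 1
            n //= p
        p += 1
    if n > 1:
        factors[n] = factors.get(n, 0) + 1
    return factors

def gradus(n):
    """Euler's gradus function: 1 + sum of exponent * (prime - 1)."""
    return 1 + sum(a * (p - 1) for p, a in prime_factors(n).items())

def interval_gradus(a, b):
    """Gradus of the frequency ratio a:b, using the lcm of the reduced terms."""
    g = gcd(a, b)
    a, b = a // g, b // g
    return gradus(a * b)   # the lcm of coprime a and b is a * b

octave = interval_gradus(1, 2)        # ratio 1:2
fifth = interval_gradus(2, 3)         # ratio 2:3
major_third = interval_gradus(4, 5)   # ratio 4:5
```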

30 of 50

Subjective Evaluation by Humans

Used extensively in Text-To-Speech Generation studies!

Using a similar idea, ask participants to rate on 1-10 for:

  • Musicality
  • Style

Also: do you think this was “composed” by a computer? (yes/no)

31 of 50

Setting up Subjective Evaluation

25 Volunteers were presented with 10 samples:

  • Melody only: original music from the Aird dataset; LSTM output (2 different outputs); RNN output; GRU output
  • Polyphonic: original music from the Nottingham dataset; LSTM outputs trained with RMSProp, AdaGrad and SGD; LSTM output using 256 units per layer

32 of 50

Range of Human Subjects

Mean Age: 28.56
Median Age: 26
Std Dev of Age: 8.84

Music Professionals: 4
Some Musical Background: 15
No Musical Background: 6

33 of 50

Human Evaluation

34 of 50

Human Evaluation

35 of 50

“Best” Network for task?

Appears to be:

  • LSTM (better than GRU/RNN)
  • 3 layers
  • RMSProp (Faster convergence than AdaDelta)

36 of 50

Demo Music (Used in Evaluation)

37 of 50

Answers

Monophonic 1: Original

Monophonic 2: LSTM

Polyphonic 1: LSTM

Polyphonic 2: Original

38 of 50

Sheet Music of Polyphonic Output (Nottingham)

39 of 50

Sheet Music of Polyphonic Output (Bach)

40 of 50

Defects of Generated Music

  • Does not follow Time Signature
  • Poor Endings
  • Doesn’t always learn “correct” harmonies

41 of 50

Discussion - WaveNets

https://deepmind.com/blog/wavenet-generative-model-raw-audio/

  • Works on raw audio - exciting new development!
  • CNN with “dilation factors”
  • PixelCNN generates images pixel by pixel; WaveNet similarly generates raw audio sample by sample
  • High-dimensional/high bandwidth
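
One way to see why dilation matters for raw audio is to count how many past samples a single output can "see". The sketch below assumes a stack of causal convolutions with kernel size 2 and dilations that double at each layer; these settings are chosen for illustration and are not taken from the slides.

```python
def receptive_field(kernel_size, dilations):
    """Input samples visible to one output of a stack of dilated causal convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# Assumed configuration for illustration: kernel size 2, dilation doubling per layer.
dilations = [2 ** i for i in range(10)]     # 1, 2, 4, ..., 512
rf_dilated = receptive_field(2, dilations)  # ten dilated layers
rf_plain = receptive_field(2, [1] * 10)     # same depth without dilation
```

Ten dilated layers cover about a hundred times more context than ten plain layers of the same kernel size, which helps when every second of audio contains thousands of samples.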

42 of 50

Discussion - Composition vs Arrangement

  • Composition (making music from scratch) vs Arrangement (adding harmonies to existing music/change of style)
  • More like a “classifier”? Or still a generation problem?
  • Similar ideas to speech problems - e.g. change of speaker vs change of style

43 of 50

Discussion - Human vs Machine?

Is machine music composition adversarial to human composers?

  • Human generates motif seed, machine composes the rest
  • Machine arrangement to change style?
  • New musical genres?
  • Complementary partners rather than adversaries

44 of 50

Conclusions

  • Various ways of modelling music and implementing music generation
  • ABC notation + LSTM provides a lightweight solution with relatively good results
  • More complex architectures such as WaveNets will allow for generation using raw audio
  • Exciting new possibilities in terms of new forms of music

45 of 50

Acknowledgements

Teammates: Patrick, Anand, Feichao, Olga

CSE253 Teaching Staff (Prof Cottrell and TAs)

Friends and Family who volunteered their time to be evaluators

Accenture (for today’s venue) and DSSG (for the invite)

46 of 50

References

MFCCs:

Gradus Suavitatis:

  • http://www.mathematik.com/Piano/ (Contributed by Yitch)

47 of 50

References

Text Generation on RNN:

ABC Notation

  • http://abcnotation.com/

48 of 50

References

Music Composition with RNNs:

Google WaveNets:

49 of 50

References (From Yitch)

Adobe Voco - “Photoshop of Voice”

SongSim

50 of 50

Further Questions?
