1 of 50

Music Composition Using Neural Networks

Shen Ting Ang
Data Science SG, 19 Oct 2017

2 of 50

Brief Autobio

3 of 50

Collaborators!

My teammates for UCSD CSE253 Final Project (Mar 2016):

Patrick Hsu, Anand Desai, Feichao Qian, Olga Souverneva

4 of 50

5 of 50

Aims

How to model music?
LSTM vs RNN?
If LSTM, which gating unit to use?
Can we achieve polyphonic generation?

6 of 50

Aims

How to model music? Good question...
LSTM vs RNN? Hypothesis: LSTM but let’s test
If LSTM, which gating unit to use? Test/Compare
Can we achieve polyphonic generation? Maybe?

7 of 50

Big Question 1: How to model music?

Two main types of music representation:

Audio - Waveforms/compressed waveforms (MP3, FLAC, etc.)
Notational - MIDI/ABC

Pros and cons of each?

8 of 50

Waveform Representation

Raw Waveforms are hard to use for modelling!

Common feature representation: Mel-Frequency Cepstral Coefficients (MFCCs)

End result: each window of audio is represented by a vector of coefficients (usually about 10-50 values)
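
The frequency warping behind MFCCs can be sketched with the commonly used conversion m = 2595 * log10(1 + f / 700). This is a minimal illustration under that assumption, with hypothetical helper names. It covers only the mel scale and the number of windows per clip, not the full pipeline (FFT, filterbank, log, DCT) and not the processing used in the project.

```python
import math

def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels (common 2595*log10 variant)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_center_frequencies(low_hz, high_hz, n_filters):
    """Center frequencies (Hz) of n_filters bands spaced evenly in mels."""
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (n_filters + 1)
    return [mel_to_hz(low_mel + step * (i + 1)) for i in range(n_filters)]

def num_frames(n_samples, frame_len, hop_len):
    """Number of full analysis windows that fit in a signal."""
    if n_samples < frame_len:
        return 0
    return 1 + (n_samples - frame_len) // hop_len
```

Each of the `num_frames(...)` windows would end up as one coefficient vector, which is why choices such as window size and hop length matter for this representation.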

9 of 50

Waveform Representation (MFCC)

Source: http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/

10 of 50

Waveform Representation (Attempt)

Data: Bach Goldberg Variations - Failed!

Loss did not converge even after 2000 iterations
Tuned various parameters - number of layers, number of nodes, unit types etc.
Poor processing/wrong parameters maybe...

11 of 50

Sheet Music

12 of 50

Notation Representation (MIDI)

Messaging Protocol - “Note on, note off”

Covers:

Pitch
Volume
Time
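
The "note on, note off" idea can be sketched as pairing events into notes. This is a simplified illustration, not a real MIDI parser: the `Event` and `Note` tuples are hypothetical, times are in arbitrary ticks, and channels are ignored.

```python
from collections import namedtuple

# Hypothetical simplified event: kind is "on" or "off", time is in ticks.
Event = namedtuple("Event", ["kind", "pitch", "velocity", "time"])
Note = namedtuple("Note", ["pitch", "velocity", "start", "duration"])

def events_to_notes(events):
    """Pair each note-on with the next note-off of the same pitch."""
    active = {}   # pitch -> (velocity, start time) of the sounding note
    notes = []
    for ev in sorted(events, key=lambda e: e.time):
        if ev.kind == "on" and ev.velocity > 0:
            active[ev.pitch] = (ev.velocity, ev.time)
        elif ev.pitch in active:
            # A note-off, or a note-on with velocity 0, ends the note.
            velocity, start = active.pop(ev.pitch)
            notes.append(Note(ev.pitch, velocity, start, ev.time - start))
    return notes

events = [
    Event("on", 60, 90, 0),     # middle C starts
    Event("on", 64, 80, 0),     # E starts at the same time (polyphony)
    Event("off", 60, 0, 480),
    Event("off", 64, 0, 960),
]
notes = events_to_notes(events)
```

Pitch, volume (velocity) and time all appear explicitly in each resulting `Note`.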

13 of 50

Notation Representation (ABC)

“Textual” representation of MIDI

Easy conversion ABC <-> MIDI
Easily converted to sheet music
Notation is easily understood by musicians
Compact representation of music
Supports polyphony/multiple voices
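
To show why ABC is convenient as plain text, here is a small sketch that splits a tune into header fields and music lines. The tune is a made-up example, not taken from the data sets, and `split_abc` is a hypothetical helper that handles only the simple case shown.

```python
ABC_TUNE = """X:1
T:Example Tune
M:4/4
L:1/8
K:G
GABc d2 d2|e2 d2 B2 G2|
"""

def split_abc(tune_text):
    """Split an ABC tune into header fields and the music body."""
    header, body = {}, []
    in_body = False
    for line in tune_text.splitlines():
        line = line.strip()
        if not line:
            continue
        is_field = len(line) > 1 and line[1] == ":" and line[0].isalpha()
        if not in_body and is_field:
            header[line[0]] = line[2:].strip()
            # In ABC the K: (key) field ends the tune header.
            if line[0] == "K":
                in_body = True
        else:
            body.append(line)
    return header, body

header, body = split_abc(ABC_TUNE)
```

Here `header["M"]` holds the meter, `header["L"]` the default note length, and `body` the note text that a character-level model would read as-is.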

14 of 50

ABC Notation Example

15 of 50

Data Set (ABC Notation)

Scottish Folk songs by James Aird - Single voice (Monophony)
Nottingham Music Database - Polyphony
JS Bach Chorales - Contrapuntal Polyphony

16 of 50

Bach Chorale Example

17 of 50

Why use ABC Notation?

Simplifies to text generation problem

“Well-studied” problem - e.g. Bible, Project Gutenberg
Easier than waveform representation

Good source of data
Easier data processing

Waveforms are “harder” to generate well
Avoid issues with deciding on MFCC window size, etc.
Data can be used “as-is” without much further work

Polyphony support

18 of 50

Character-Level RNN Text Generation

Training:

Network reads training set in fixed lengths of text

Generating:

Start with seed text
Network will generate output character by character
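
The training and generation steps above can be sketched without any deep learning library. This is an illustration only: the function names are hypothetical, `next_char_probs` stands in for a trained network, and temperature sampling is one common choice rather than something stated on the slide.

```python
import math
import random

def make_windows(text, maxlen, step):
    """Cut text into fixed-length inputs, each paired with the next character."""
    inputs, targets = [], []
    for i in range(0, len(text) - maxlen, step):
        inputs.append(text[i:i + maxlen])
        targets.append(text[i + maxlen])
    return inputs, targets

def sample_index(probs, temperature=1.0, rng=random):
    """Draw an index from probs after temperature rescaling."""
    logits = [math.log(p + 1e-12) / temperature for p in probs]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]

def generate(seed, next_char_probs, vocab, length, temperature=1.0, rng=random):
    """Extend seed one character at a time using a model's predictions."""
    out = seed
    for _ in range(length):
        probs = next_char_probs(out)      # stand-in for the trained network
        out += vocab[sample_index(probs, temperature, rng)]
    return out
```

For example, `make_windows("abcdefgh", 3, 2)` yields the inputs `abc`, `cde`, `efg` with targets `d`, `f`, `h`. Lower temperatures make the sampling more conservative.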

19 of 50

Character-Level RNN Text Generation

Source: Andrej Karpathy

20 of 50

RNN vs LSTM vs GRU

RNN - Extension of feed-forward neural network

Can handle variable length input

LSTM - Repeating module has more complicated structure

“Memory unit”
Avoids the error decay (vanishing gradient) problem of plain RNNs
Gates to remove/add information - input, output, forget gates

Gated Recurrent Unit (GRU)

Update and Reset gates
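
For reference, the gates can be written out in one standard formulation. The symbols are assumptions for illustration (x_t input, h_t hidden state, c_t memory cell, sigma the logistic function, odot the elementwise product); library implementations differ in details such as the sign convention for the GRU update gate.

```latex
% LSTM: input, forget and output gates around a memory cell
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}

% GRU: update and reset gates, no separate memory cell
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) \\
\tilde{h}_t &= \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```

The additive update of c_t is what lets error signals pass through many time steps without decaying as quickly as in a plain RNN.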

21 of 50

Easy Implementation with Python/Keras

Python 2 with Keras (v0.3 at the time)
Modified from example script for text generation in Keras - see References for link
~100 lines of code

22 of 50

Simple Network Structure

Example Base Model (on the right)
Vary:

# of Layers (1 vs 3)
Type of Layer (RNN, LSTM, GRU)
Type of Activation (tanh, ReLU)
Type of Optimizer (RMSProp, SGD, AdaGrad, AdaDelta)
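
The options above can be enumerated as a configuration grid. This is a sketch only: the dictionary keys and lowercase option names are hypothetical, and the slides do not say that every combination was actually trained.

```python
from itertools import product

# Options listed on the slide; the dictionary keys are hypothetical names.
SEARCH_SPACE = {
    "num_layers": [1, 3],
    "layer_type": ["RNN", "LSTM", "GRU"],
    "activation": ["tanh", "relu"],
    "optimizer": ["rmsprop", "sgd", "adagrad", "adadelta"],
}

def enumerate_configs(space):
    """Yield one dict per combination of the listed options."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(enumerate_configs(SEARCH_SPACE))
```

The full grid has 2 x 3 x 2 x 4 = 48 combinations, which is why varying one factor at a time against a base model is a practical alternative.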

23 of 50

Low Footprint Execution

Lenovo Y50 Laptop (GeForce GTX 960M, Intel i7-4720HQ) - US$1000 in Nov 2015
Average training time of under 4 hours
Good convergence within 100-200 iterations

24 of 50

Evaluation and Data Sets (Recap)

Scottish Folk songs by James Aird (Aird): Base set - single melody line (monophonic) generation
Nottingham Music Database (Nottingham): Polyphonic Generation
JS Bach Chorales (Bach): Polyphonic Generation - more complex

25 of 50

LSTM has a lower loss than RNN

RNN

LSTM

26 of 50

Loss Decreases with Deeper Architectures

27 of 50

RMSProp gives the fastest loss convergence

28 of 50

Big Question 2: How to Evaluate?

Objective Measures?

No AUC, Accuracy, etc.

What about Loss?

Does it make sense to evaluate music on objective measures? Do these even exist?

29 of 50

An Objective Evaluation Method

Euler’s Gradus Suavitatis measure of melodiousness (1739) - higher is better

Aird (Original)	7.059
Aird (RNN)	6.191
Aird (LSTM)	5.743
Aird (GRU)	5.890
Bach Chorales (Original)	6.821
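
Euler's underlying gradus function can be sketched as follows: for n with prime factorisation p1^a1 * ... * pk^ak, the gradus is 1 + sum of ai * (pi - 1), and an interval with frequency ratio a:b is scored on the least common multiple of a and b. The helper names are hypothetical. Under this definition simpler ratios get lower values; how the melodiousness scores in the table are derived from it is not described in the slides.

```python
from math import gcd

def prime_factors(n):
    """Return {prime: exponent} for a positive integer n."""
    factors, p = {}, 2
    while p * p <= n:
        while n % p == 0:
            factors[p] = factors.get(p, 0) + 1
            n //= p
        p += 1
    if n > 1:
        factors[n] = factors.get(n, 0) + 1
    return factors

def gradus(n):
    """Euler's gradus function: 1 + sum of exponent * (prime - 1)."""
    return 1 + sum(a * (p - 1) for p, a in prime_factors(n).items())

def interval_gradus(a, b):
    """Gradus of a frequency ratio a:b, reduced to lowest terms first."""
    g = gcd(a, b)
    a, b = a // g, b // g
    return gradus(a * b)   # a and b are coprime, so lcm(a, b) == a * b
```

For example, the octave (1:2) scores 2, the perfect fifth (2:3) scores 4, and the major third (4:5) scores 7.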

30 of 50

Subjective Evaluation by Humans

Used extensively in Text-To-Speech Generation studies!

Using a similar idea, ask participants to rate on 1-10 for:

Musicality
Style

Also: do you think this was “composed” by a computer? (yes/no)

31 of 50

Setting up Subjective Evaluation

25 Volunteers were presented with 10 samples:

Melody only: Original music from Aird Dataset, LSTM output (2 different outputs), RNN output, GRU output
Polyphonic: Original music from Nottingham Dataset, LSTM output trained with RMSProp, AdaGrad and SGD, LSTM output using 256 units per layer

32 of 50

Range of Human Subjects

Mean Age	28.56
Median Age	26
Std Dev of Age	8.84
Music Professionals	4
Some Musical Background	15
No Musical Background	6

33 of 50

Human Evaluation

34 of 50

Human Evaluation

35 of 50

“Best” Network for task?

Appears to be:

LSTM (better than GRU/RNN)
3 layers
RMSProp (Faster convergence than AdaDelta)

36 of 50

Demo Music (Used in Evaluation)

Monophonic 1: https://www.youtube.com/watch?v=-Fvt2lzLEGo

Monophonic 2: https://www.youtube.com/watch?v=GHOWlDg_bM4

Polyphonic 1: https://www.youtube.com/watch?v=0AAT83i3op0

Polyphonic 2: https://www.youtube.com/watch?v=rOJxTYwWRUc

37 of 50

Answers

Monophonic 1: Original

Monophonic 2: LSTM

Polyphonic 1: LSTM

Polyphonic 2: Original

38 of 50

Sheet Music of Polyphonic Output (Nottingham)

39 of 50

Sheet Music of Polyphonic Output (Bach)

40 of 50

Defects of Generated Music

Does not follow Time Signature
Poor Endings
Doesn’t always learn “correct” harmonies

41 of 50

Discussion - WaveNets

https://deepmind.com/blog/wavenet-generative-model-raw-audio/

Works on raw audio - exciting new development!
CNN with “dilation factors”
Like PixelCNN, which generates images pixel by pixel, WaveNet generates raw audio sample by sample
High-dimensional/high bandwidth
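
The effect of "dilation factors" can be illustrated by computing how many past samples one output can see. This is a minimal sketch assuming stacked causal convolutions with kernel size 2; `receptive_field` is a hypothetical helper, not part of any WaveNet code.

```python
def receptive_field(dilations, kernel_size=2):
    """Input samples seen by one output of stacked dilated causal convolutions."""
    return 1 + sum(d * (kernel_size - 1) for d in dilations)

# One block of doubling dilations, as described in the WaveNet paper.
block = [2 ** i for i in range(10)]   # 1, 2, 4, ..., 512

dilated = receptive_field(block)            # ten layers with doubling dilation
undilated = receptive_field([1] * 10)       # ten layers without dilation
```

Ten dilated layers cover 1024 samples, against 11 for ten undilated layers, which is what makes sample-by-sample audio modelling feasible at all.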

42 of 50

Discussion - Composition vs Arrangement

Composition (making music from scratch) vs Arrangement (adding harmonies to existing music/change of style)
More like a “classifier”? Or still a generation problem?
Similar ideas to speech problems - e.g. change of speaker vs change of style

43 of 50

Discussion - Human vs Machine?

Is machine music composition adversarial to human composers?

Human generates motif seed, machine composes the rest
Machine arrangement to change style?
New musical genres?
Complementary partners rather than adversaries

44 of 50

Conclusions

Various ways of modelling music and implementing music generation
ABC Notation + LSTM provide a lightweight solution with relatively good results
More complex architectures such as WaveNets will allow for generation using raw audio
Exciting new possibilities in terms of new forms of music

45 of 50

Acknowledgements

Teammates: Patrick, Anand, Feichao, Olga

CSE253 Teaching Staff (Prof Cottrell and TAs)

Friends and Family who volunteered their time to be evaluators

Accenture (for today’s venue) and DSSG (for the invite)

46 of 50

References

MFCCs:

Gradus Suavitatis:

http://www.mathematik.com/Piano/ (Contributed by Yitch)

47 of 50

References

Text Generation on RNN:

http://karpathy.github.io/2015/05/21/rnn-effectiveness/
https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py (This is the base code we used)

ABC Notation

http://abcnotation.com/

48 of 50

References

Music Composition with RNNs:

Google WaveNets:

https://arxiv.org/pdf/1609.03499.pdf

49 of 50

References (From Yitch)

Adobe Voco - “Photoshop of Voice”

https://en.wikipedia.org/wiki/Adobe_Voco

SongSim

https://colinmorris.github.io/SongSim/#/oops

50 of 50

Further Questions?

LinkedIn: https://www.linkedin.com/in/angshenting/

Facebook: https://www.facebook.com/shenting