1 of 22

MLDS HW1-2

TAs

ntu.mldsta@gmail.com

2 of 22

Outline

  • Timeline
  • Task Descriptions
  • Q&A

3 of 22

Timeline

4 of 22

Three Parts in HW1

  • (1-1) Deep vs Shallow:
    • Simulate a function.
    • Train on an actual task using shallow and deep models.
  • (1-2) Optimization
    • Visualize the optimization process.
    • Observe gradient norm during training.
    • What happens when the gradient is almost zero?
  • (1-3) Generalization

5 of 22

Schedule

  • 3/9 :
    • Release HW1-1
  • 3/16 :
    • Release HW1-2
  • 3/23:
    • Deadline to team up by yourselves
    • Release HW1-3
  • 3/30:
    • Deadline to be teamed up by TAs
  • 4/6:
    • All HW1 due (including HW1-1, HW1-2 and HW1-3)

6 of 22

Task Descriptions

7 of 22

HW1-2: Optimization

  • Three subtasks
    • Visualize the optimization process.
    • Observe gradient norm during training.
    • What happens when the gradient is almost zero?
  • Train on a designed function, MNIST, CIFAR-10, ...

8 of 22

Visualize the Optimization Process 1/3

  • Requirement
    • Collect weights of the model every n epochs.
    • Also collect the weights of the model across different training events.
    • Record the accuracy (loss) corresponding to the collected parameters.
    • Plot the above results on a figure.

9 of 22

Visualize the Optimization Process 2/3

[Figure: the model has layers l1, l2, ..., lk with weight matrices of sizes m1×n1, m2×n2, ..., mk×nk]

  • Collect the parameters of the model: for every recorded checkpoint (1st event epoch 0, 3, 6, ...; ...; ith event epoch 0, 3, 6, ...), flatten and concatenate all the weights into one vector of dimension m1n1 + m2n2 + ...... + mknk.
  • Reduce the dimension of the collected vectors (see the sketch below).
10 of 22

Visualize the Optimization Process 3/3

  • A DNN trained on MNIST.
  • Collect the weights every 3 epochs, and train 8 times. Reduce the dimension of the weights to 2 by PCA.

[Figures: parameters projected to 2-D, for layer 1 and for the whole model]

11 of 22

Observe Gradient Norm During Training 1/2

  • Requirement
    • Record the gradient norm and the loss during training.
    • Plot them on one figure.
  • p-norm
    • In PyTorch: see the sketch below.
    • Other packages: similar code applies.
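A minimal sketch of recording the gradient 2-norm in PyTorch, assuming a standard nn.Module whose gradients have just been filled in by loss.backward(); gradient_norm is an illustrative helper name:

    def gradient_norm(model):
        # Sum the squared 2-norms of every parameter's gradient,
        # then take the square root: the 2-norm of the whole gradient.
        grad_sq = 0.0
        for p in model.parameters():
            if p.grad is not None:
                grad_sq += float(p.grad.norm(2)) ** 2
        return grad_sq ** 0.5

Call it once per iteration (after backward, before optimizer.step()) and store the value together with the loss for plotting.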

12 of 22

Observe Gradient Norm During Training 2/2

[Figure: gradient norm and loss versus iterations, trained on MNIST]

13 of 22

What Happens When the Gradient Is Almost Zero 1/3

  • Requirement
    • Try to find the weights of the model where the gradient norm is zero (or as small as possible).
    • Compute the "minimal ratio" of the weights: how likely the weights are to be a local minimum.
    • Plot minimal ratio against the loss at the points where the gradient is almost zero.
  • Tips
    • Train on a small network.

14 of 22

What Happens When the Gradient Is Almost Zero 2/3

  1. How to reach a point where the gradient norm is zero?
      • First, train the network with the original loss function.
      • Then change the objective function to the gradient norm and keep training.
      • Or use a second-order optimization method, such as Newton's method or the Levenberg-Marquardt algorithm (more stable).
  2. How to compute the minimal ratio? (See the sketch after this list.)
      • Compute the Hessian matrix ∇²L(θ) and find its eigenvalues. The proportion of eigenvalues greater than zero is the minimal ratio.
      • Or sample many weights θ' around θ and compute L(θ'). The minimal ratio is the proportion with L(θ') > L(θ).
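A minimal sketch of both steps, assuming a very small network (so the full Hessian fits in memory) and a PyTorch version recent enough to provide torch.linalg.eigvalsh; the function names are illustrative:

    import torch

    def grad_norm_objective(loss, params):
        # Squared gradient norm of the loss; minimizing this instead of the
        # loss itself drives the gradient toward zero (step 1 above).
        grads = torch.autograd.grad(loss, params, create_graph=True)
        return sum((g ** 2).sum() for g in grads)

    def minimal_ratio(loss, params):
        # Build the full Hessian row by row via double backpropagation,
        # then return the proportion of eigenvalues greater than zero.
        grads = torch.autograd.grad(loss, params, create_graph=True)
        flat = torch.cat([g.reshape(-1) for g in grads])
        n = flat.numel()
        hessian = torch.zeros(n, n)
        for i in range(n):
            row = torch.autograd.grad(flat[i], params,
                                      retain_graph=True, allow_unused=True)
            row = [torch.zeros_like(p) if r is None else r
                   for r, p in zip(row, params)]
            hessian[i] = torch.cat([r.reshape(-1) for r in row])
        eigvals = torch.linalg.eigvalsh(hessian)  # symmetric => real eigenvalues
        return (eigvals > 0).float().mean().item()

Here params is the list of the model's parameters (e.g. list(model.parameters())) and loss is a scalar computed from them.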

15 of 22

What Happens When the Gradient Is Almost Zero 3/3

    • Train 100 times.
    • Reach a gradient norm of (almost) zero by changing the objective function.
    • Minimal ratio is defined as the proportion of eigenvalues greater than zero.

16 of 22

HW1-2 Report Questions (10%)

  • Visualize the optimization process.
    • Describe your experiment settings (the cycle at which you record the model parameters, the optimizer, the dimension reduction method, etc.). (1%)
    • Train the model 8 times; select the parameters of any one layer and of the whole model, and plot them in separate figures. (1%)
    • Comment on your result. (1%)
  • Observe gradient norm during training.
    • Plot one figure containing both the gradient norm versus iterations and the loss versus iterations. (1%)
    • Comment on your result. (1%)
  • What happens when gradient is almost zero?
    • State how you find the weights where the gradient norm is zero and how you define the minimal ratio. (2%)
    • Train the model 100 times. Plot minimal ratio versus loss. (2%)
    • Comment on your result. (1%)
  • Bonus (1%)
    • Use any method to visualize the error surface.
    • Concretely describe your method and comment your result.

17 of 22

Example of Bonus 1/3

  • Similar method as on p. 10, but use t-SNE to reduce the dimensionality (see the sketch after this list).
  • First train with gradient descent, then switch to a second-order optimization method, and finally train for about 10 more epochs with the second-order method.
  • During those 10 epochs, randomly sample and plot nearby parameters.
  • Tips:
    • Use a small model (fewer than 50 parameters) on a small task (function simulation).
    • Fix the number of possible inputs (so the number of possible outputs is also fixed).
    • Scale the range of randomness according to each parameter's rate of descent.
    • Non-linear coloring.
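A minimal sketch of the t-SNE projection with non-linear (log) coloring, assuming weight_matrix is the (num_snapshots × num_parameters) array of collected parameters and losses holds the corresponding loss values; these names and the output file are illustrative:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    def plot_tsne(weight_matrix, losses, out_path="tsne_vis.png"):
        # Project the parameter snapshots to 2-D with t-SNE.
        points = TSNE(n_components=2, perplexity=5).fit_transform(weight_matrix)
        # Non-linear (log) coloring keeps small loss differences visible.
        plt.scatter(points[:, 0], points[:, 1], c=np.log(losses), cmap="viridis")
        plt.colorbar(label="log loss")
        plt.savefig(out_path)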

18 of 22

Example of Bonus 2/3

  • Perturb each parameter randomly within a small range and plot the resulting loss (see the sketch below).
  • Differences in scale between parameters are significant.
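A minimal sketch of this perturbation idea, assuming model, loss_fn, x and y already exist; perturbed_losses and the sampling scale are illustrative:

    import copy
    import torch

    def perturbed_losses(model, loss_fn, x, y, n_samples=200, scale=1e-2):
        # Randomly jitter every parameter within a small range (proportional
        # to its own magnitude, since parameter scales differ) and record
        # the resulting loss for plotting.
        losses = []
        for _ in range(n_samples):
            noisy = copy.deepcopy(model)
            with torch.no_grad():
                for p in noisy.parameters():
                    p.add_(scale * p.abs() * torch.randn_like(p))
            losses.append(float(loss_fn(noisy(x), y)))
        return losses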

19 of 22

Example of Bonus 3/3

    • Plot the error surface between the start and end points (see the sketch below).
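One common way to realize this (an assumption, not prescribed by the slide) is to linearly interpolate between the start and end parameter vectors and evaluate the loss along the path; loss_along_path is an illustrative name:

    import torch
    from torch.nn.utils import vector_to_parameters

    def loss_along_path(model, loss_fn, x, y, theta_start, theta_end, steps=50):
        # theta_start / theta_end: flat parameter vectors (e.g. saved with
        # torch.nn.utils.parameters_to_vector) at the start and end of training.
        losses = []
        for alpha in torch.linspace(0.0, 1.0, steps):
            vector_to_parameters((1 - alpha) * theta_start + alpha * theta_end,
                                 model.parameters())
            with torch.no_grad():
                losses.append(float(loss_fn(model(x), y)))
        return losses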

20 of 22

Allowed Packages

  • python 3.6
  • TensorFlow r1.6
  • PyTorch 0.3 / torchvision
  • Keras 2.0.7 (TensorFlow backend only)
  • MXNet 1.1.0
  • CNTK 2.4
  • matplotlib
  • scikit-learn 0.19.1
  • Python Standard Library
  • If you want to use other packages, please ask TAs for permission first!

21 of 22

Submission

  • Deadline: 2018/4/6 23:59 (GMT+8)
  • Answer the questions of HW1-1, HW1-2 and HW1-3 in one report.
  • Write in Chinese unless you are not familiar with Chinese.
  • At most 10 pages for HW1-1, HW1-2 and HW1-3.
  • Your GitHub repository must contain the following files under the directory hw1/:
    • Readme.*
    • Report.pdf
    • other code
  • In your Readme, state clearly how to run your program to generate the results in your report.
  • Files for training are required.

22 of 22

Q&A

ntu.mldsta@gmail.com