DL in NLP 2020. Spring. Quiz 3
Neural Networks. Part 2
Some questions may not be mentioned explicitly in the lecture, but you can still use logic and Google.
Credit for some questions:
cs231n.stanford.edu
* Required
Email address
*
Your email
GitHub account
*
Your answer
What is the derivative of a sigmoid function?
sigmoid(x) * (1 - sigmoid(x))
sigmoid(x)
x^2 * sigmoid(x)
sigmoid(x)^2 - sigmoid(x)
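A quick way to check the correct option yourself (a sketch in plain Python, not from the lecture) is to compare the analytic derivative against a finite-difference estimate:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Analytic derivative: sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1.0 - s)

# Check against a central finite difference at a few points
eps = 1e-6
for x in [-2.0, 0.0, 1.5]:
    numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
    assert abs(numeric - sigmoid_grad(x)) < 1e-6
```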
How do we compute gradients in backpropagation algorithm?
They are estimated with finite differences
They are computed symbolically and represented in a closed form
They are computed with the rule of derivative of the composition of functions
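The rule in question is the chain rule: backpropagation multiplies local derivatives along the composition of functions. A minimal sketch (my own toy example, not from the lecture):

```python
import math

# Composition: L(x) = sin(x)^2. By the chain rule:
# dL/dx = dL/d(sin) * d(sin)/dx = 2*sin(x) * cos(x)
def L(x):
    return math.sin(x) ** 2

def dL_dx(x):
    # Backprop computes this by multiplying the local gradients
    # of each function in the composition.
    return 2 * math.sin(x) * math.cos(x)

# Sanity-check against a central finite difference
eps = 1e-6
x = 0.7
numeric = (L(x + eps) - L(x - eps)) / (2 * eps)
assert abs(numeric - dL_dx(x)) < 1e-6
```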
Default choice of nonlinearity
Sigmoid
Tanh
ReLU
Maxout
ELU
Why?
Fast computation
It is OX-symmetrical
It is OY-symmetrical
It produces more complex functions with less layers
It does not saturate in the positive region
It converges faster (in practice)
What's the main drawback of the ReLU activation function?
it's not symmetric around 0
it's not smooth and cannot be differentiated
it can zero out all the gradients from some point in the training process
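The last option describes the "dying ReLU" problem: once a unit's pre-activation is negative for every input, its local gradient is zero and no update can revive it. A small NumPy illustration (my own sketch, not from the lecture):

```python
import numpy as np

# If the pre-activation z is negative for every input, ReLU outputs 0
# and its local gradient (z > 0) is 0, so no gradient reaches the
# weights below: the unit is "dead" and can never recover.
z = np.array([-3.0, -0.5, -10.0])   # all-negative pre-activations
relu_out = np.maximum(0, z)
local_grad = (z > 0).astype(float)  # derivative of ReLU
dout = np.array([1.0, 1.0, 1.0])    # upstream gradient
dz = dout * local_grad
assert np.all(dz == 0)              # all gradients are zeroed out
```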
y = max(0, x @ W + b), where dout is the downstream gradient, @ is matrix multiplication, and all other operations are elementwise. What is d(loss)/dW?
x.T @ (dout * (y > 0)) + dy / db
x.T @ (dout * (y > 0))
W @ max(0, y) * dout + dy / db
W @ max(0, y) * dout
x @ W
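The candidate gradient can be verified numerically. This sketch (my own, assuming the usual `(y > 0)` subgradient convention for ReLU) compares the analytic `dW` against a finite-difference estimate for one entry:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 5))
b = rng.normal(size=(5,))
dout = rng.normal(size=(4, 5))  # downstream gradient d(loss)/dy

y = np.maximum(0, x @ W + b)
# Analytic gradient: dW = x.T @ (dout * (y > 0))
dW = x.T @ (dout * (y > 0))

# Check one entry against a finite difference of the surrogate
# loss = sum(y * dout), whose gradient w.r.t. y is exactly dout.
eps = 1e-5
i, j = 1, 2
Wp, Wm = W.copy(), W.copy()
Wp[i, j] += eps
Wm[i, j] -= eps
lp = np.sum(np.maximum(0, x @ Wp + b) * dout)
lm = np.sum(np.maximum(0, x @ Wm + b) * dout)
numeric = (lp - lm) / (2 * eps)
assert abs(numeric - dW[i, j]) < 1e-4
```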
Gradients with respect to x, y, z, w. Green numbers are the forward pass. The red number is the downstream gradient. Format your answer according to the pattern: x.xx, y.yy, z.zz, w.ww. Example answer: 3.00, 9.60, 18.66, 1.00
Your answer
What is a good way of weights initialization?
All 0's
Small random numbers
Normal distribution
All = constant > 0
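The problem with all-zero (or any all-constant) initialization is symmetry: every hidden unit computes the same thing and receives the same gradient, so the units never differentiate. A toy demonstration (my own sketch):

```python
import numpy as np

rng = np.random.default_rng(1)

# All-zero init: every hidden unit computes the same (zero) output,
# so every unit gets the same gradient and the symmetry never breaks.
W_zero = np.zeros((4, 8))
# Small random init (e.g. a scaled Gaussian) breaks the symmetry.
W_rand = rng.normal(scale=0.01, size=(4, 8))

x = rng.normal(size=(2, 4))
h_zero = np.tanh(x @ W_zero)
h_rand = np.tanh(x @ W_rand)

assert np.all(h_zero == 0)         # all units identical
assert len(np.unique(h_rand)) > 1  # units start out different
```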
Where is the place of the BatchNorm layer in the FFNN architecture?
Before the activation function
After the activation function
Before the input layer
The BatchNorm layer normalizes data over which axis?
Over instance axis
Over feature axis
Both
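Concretely, for a `(batch, features)` input, BatchNorm computes one mean and one variance per feature by averaging over the instance (batch) axis, i.e. `axis=0` in NumPy terms. A minimal sketch of the normalization step (without the learnable gamma/beta):

```python
import numpy as np

# One mean/variance per FEATURE, computed over the instance axis.
x = np.random.default_rng(2).normal(loc=3.0, scale=2.0, size=(64, 10))
mean = x.mean(axis=0)   # shape (10,): one mean per feature
var = x.var(axis=0)     # shape (10,): one variance per feature
x_hat = (x - mean) / np.sqrt(var + 1e-5)

# After normalization each feature column has ~zero mean, unit variance.
assert np.allclose(x_hat.mean(axis=0), 0.0, atol=1e-7)
assert np.allclose(x_hat.var(axis=0), 1.0, atol=1e-3)
```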
What is a better way of searching a neural net's hyperparameters?
Grid search
Random search
Gradient descent
Why?
Good combinations of hyperparameters are not probable
Random search produces more diverse sets of hyperparameters
Gradient search allows the algorithm to converge faster
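The intuition: with N trials, grid search reuses the same few values per hyperparameter, while random search draws a fresh value each trial, so the important hyperparameter gets probed at N distinct points. A sampling sketch (the hyperparameter names and ranges here are illustrative, not from the lecture):

```python
import random

random.seed(0)

# Random search: each trial draws fresh values for every hyperparameter,
# typically log-uniform for scale-like parameters such as learning rate.
def sample_config():
    return {
        "lr": 10 ** random.uniform(-5, -1),   # log-uniform in [1e-5, 1e-1]
        "dropout": random.uniform(0.0, 0.5),
    }

trials = [sample_config() for _ in range(9)]
lrs = {t["lr"] for t in trials}
assert len(lrs) == 9  # 9 trials probe 9 distinct learning rates
```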
Can we use different learning rates at different layers of a neural network?
Yes
No
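Nothing in gradient descent requires one global step size; each layer (or even each parameter) can have its own. A bare-bones SGD sketch with per-layer learning rates (layer names and values are illustrative):

```python
import numpy as np

# SGD update where each layer has its own learning rate.
weights = {"layer1": np.ones((2, 2)), "layer2": np.ones((2, 2))}
grads = {"layer1": np.full((2, 2), 0.5), "layer2": np.full((2, 2), 0.5)}
lrs = {"layer1": 0.1, "layer2": 0.01}  # e.g. smaller lr for a later layer

for name in weights:
    weights[name] -= lrs[name] * grads[name]

assert np.allclose(weights["layer1"], 0.95)   # 1 - 0.1 * 0.5
assert np.allclose(weights["layer2"], 0.995)  # 1 - 0.01 * 0.5
```

Deep learning frameworks expose this directly, e.g. PyTorch optimizers accept per-parameter-group learning rates.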
When should neural network training be stopped?
When train loss becomes constant
When train loss is zero
When validation loss starts to increase
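This is early stopping: monitor validation loss and stop (keeping the best checkpoint) once it starts rising, usually with a patience window. A minimal sketch (my own, not from the lecture):

```python
# Early stopping: stop when validation loss stops improving for
# `patience` consecutive epochs, and return the best epoch seen.
def early_stop_epoch(val_losses, patience=2):
    best, best_epoch, bad = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, bad = loss, epoch, 0
        else:
            bad += 1
            if bad >= patience:
                return best_epoch  # roll back to the best checkpoint
    return best_epoch

# Validation loss falls, then rises: stop and keep epoch 3.
assert early_stop_epoch([1.0, 0.8, 0.6, 0.5, 0.55, 0.6, 0.7]) == 3
```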
What should be done if train loss is much less than validation loss?
Probably nothing at all; the model has learned everything it can, and the gap may just be noise in the dataset
Increase regularization
Collect more data
Check that your train one-hot encoding is consistent with your validation one-hot encoding
Check for a data leak
Make a hyperparameter search
Reduce model capacity
Check that all labels are in the training data
Try to change learning rate or to schedule it differently
Check your data preprocessing algorithm
Your questions about the lecture (if any)
Your answer