SDS4299
Natural Language Processing 2
Going Deeper with Neural Nets
Any Burning Questions for us?
Slido Code: #4299
Our Team
Keith
Y2 BZA
Bharath Shankar
Y2 DSA
Vinod
Y3 BZA
Recap
01
Last Time, In SDS1101
We covered:
Text Preprocessing
Naive Bayes Classifier
We got a good result from Naive Bayes…
… But we can do better!
NLP is Hard…
Machines can’t understand language, but we can!
It’s all thanks to this!
So, why don’t we try to mimic it?
Introduction to Neural Networks
Neural networks were inspired by Neurons
[Figure: signals flow from the input layer through hidden layer 1 and hidden layer 2 to the output layer]
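Perceptron
Inputs: $x_1$, $x_2$, plus a bias input $x_0 = 1$
Weights: $w_1$, $w_2$, plus the bias weight $w_0$
Output: 1 or 0
1. Compute the weighted sum of the inputs
$\sum x \cdot w = w_0 + x_1 \cdot w_1 + x_2 \cdot w_2$
2. Apply the activation function to the weighted sum
$\hat{y} = g\left(\sum x \cdot w\right)$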
Activation functions
What is an activation function?
In this example, the neuron would activate and output 1 if $w_0 + x_1 \cdot w_1 + x_2 \cdot w_2 \geq 0$
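A minimal Python sketch of the perceptron described above: a weighted sum followed by a step activation. The weights are hypothetical values picked by hand so that the perceptron behaves like a logical AND; they are not taken from the workshop.

```python
def perceptron(x1, x2, w0, w1, w2):
    """Return 1 if the weighted sum is >= 0, otherwise 0 (step activation)."""
    # Step 1: weighted sum, with the bias input x0 fixed at 1
    weighted_sum = w0 * 1 + x1 * w1 + x2 * w2
    # Step 2: apply the step activation function g
    return 1 if weighted_sum >= 0 else 0


# Hypothetical hand-picked weights that make the perceptron act like AND
w0, w1, w2 = -1.5, 1.0, 1.0
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, "->", perceptron(x1, x2, w0, w1, w2))
```

Only the input (1, 1) pushes the weighted sum above zero with these weights, so it is the only input that activates the neuron.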
Non-linear activation functions
Sigmoid or Logistic Activation Function
Output is bounded between 0 and 1 and can be used to predict a probability
Tanh or Hyperbolic Tangent Activation Function
Similar to sigmoid, but ranges from -1 to 1
ReLU (Rectified Linear Unit) Activation Function
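The three activation functions can be compared side by side with a short sketch using only the standard library. ReLU is taken here in its usual form, max(0, z); the sample inputs are arbitrary.

```python
import math


def sigmoid(z):
    # Squashes any real number into the range between 0 and 1
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    # Similar S-shape to sigmoid, but ranges from -1 to 1
    return math.tanh(z)


def relu(z):
    # Passes positive values through unchanged and clips negatives to 0
    return max(0.0, z)


for z in (-2.0, 0.0, 2.0):
    print(f"z={z:+.1f}  sigmoid={sigmoid(z):.3f}  tanh={tanh(z):+.3f}  relu={relu(z):.1f}")
```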
Quantifying Loss
How do we do this?
| Regression | Classification |
| Mean Squared Error (MSE) | Binary Cross Entropy |
| Mean Absolute Error (MAE) | Hinge Loss |
For the rest of the workshop, we will assume a classification task with binary cross entropy as our cost function
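As a rough illustration of binary cross entropy, the sketch below averages the loss over a small batch. The labels and predicted probabilities are made up for the example, and the clipping constant `eps` is an assumption added to avoid taking log(0).

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross entropy over a batch of predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Made-up labels and predictions, for illustration only
y_true = [1, 0, 1, 0]
mostly_right = [0.9, 0.1, 0.8, 0.2]
mostly_wrong = [0.1, 0.9, 0.2, 0.8]
print(binary_cross_entropy(y_true, mostly_right))
print(binary_cross_entropy(y_true, mostly_wrong))
```

Confident wrong predictions are punished much more heavily than confident right ones are rewarded, which is what makes this loss useful for classification.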
Gradient Descent
An iterative solution to optimize the loss
We cannot jump to the value, so let’s walk there instead.
Let us say that the direction of steepest descent is represented by a vector v.
Then, we have to move some distance α in direction v
Once we finish the step, let us again check for direction of steepest descent!
We stop when we reach a minimum: our goal!
Choosing the learning rate α
| α too small: takes too much time | α just right | α too large: can diverge |
Weight update
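Let us consolidate every individual weight in the network into a huge column vector: $\Theta$
Then, $\Theta_{\text{updated}} = \Theta + \alpha v$, where $v$ is the direction of steepest descent (the negative of the gradient of the cost function)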
If α is set right, we should be able to reach a minimum!
We should be able to tweak our weights to minimise the cost function now!
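A small sketch of the update loop on a toy loss. The loss function, starting point, and learning rates are all hypothetical and chosen only to make the behaviour easy to see. Here v is taken to be the negative of the gradient, so each step is written as θ + αv.

```python
def loss(theta):
    # Toy bowl-shaped loss with its minimum at (3, -1)
    return (theta[0] - 3.0) ** 2 + (theta[1] + 1.0) ** 2


def gradient(theta):
    # Partial derivatives of the toy loss
    return [2.0 * (theta[0] - 3.0), 2.0 * (theta[1] + 1.0)]


def gradient_descent(theta, alpha=0.1, steps=100):
    for _ in range(steps):
        v = [-g for g in gradient(theta)]                    # direction of steepest descent
        theta = [t + alpha * vi for t, vi in zip(theta, v)]  # move a distance alpha along v
    return theta


theta = gradient_descent([0.0, 0.0])
print(theta, loss(theta))

# A learning rate that is too large overshoots further on every step
diverged = gradient_descent([0.0, 0.0], alpha=1.1, steps=20)
print(diverged, loss(diverged))
```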
Backpropagation
We need the ability to calculate the direction of steepest descent!
But how?
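Derivative: the slope at a point (for a 2D curve)
Gradient: the derivative generalized to n dimensions. It points in the direction of steepest ascent, so the direction of steepest descent is its negative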
Backpropagation
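[Figure: computation graph x → z1 → ŷ → L(W), with weights w1 and w2]
How does a small change in one weight (w1) affect the final loss L(W)?
$$\frac{\partial L(W)}{\partial w_1} = \frac{\partial L(W)}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1}$$
Repeat this for every weight in the network using gradients from later layers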
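The chain rule can be checked numerically on a tiny network. This sketch assumes a specific form that the slide does not state: z1 = w1·x with no activation, ŷ = sigmoid(w2·z1), and binary cross entropy as the loss. The input and weight values are arbitrary.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w1, w2, y):
    z1 = w1 * x                   # hidden value (assumed linear)
    y_hat = sigmoid(w2 * z1)      # prediction
    loss = -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))
    return z1, y_hat, loss


def grad_w1(x, w1, w2, y):
    z1, y_hat, _ = forward(x, w1, w2, y)
    dL_dyhat = -(y / y_hat) + (1 - y) / (1 - y_hat)   # dL/dy_hat
    dyhat_dz1 = y_hat * (1 - y_hat) * w2              # dy_hat/dz1
    dz1_dw1 = x                                       # dz1/dw1
    return dL_dyhat * dyhat_dz1 * dz1_dw1             # chain rule product


x, w1, w2, y = 1.5, 0.4, -0.7, 1
analytic = grad_w1(x, w1, w2, y)

# Finite-difference check: nudge w1 slightly and watch the loss change
h = 1e-6
numeric = (forward(x, w1 + h, w2, y)[2] - forward(x, w1 - h, w2, y)[2]) / (2 * h)
print(analytic, numeric)
```

The two numbers should agree closely, which is a handy sanity check whenever gradients are written by hand.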
Flow of information
Backpropagation algorithm:
We just need to run the backpropagation algorithm for each weight in the neural net to get our gradient
We use this gradient to take a step, improving our net slightly
We finally stop when our cost function is minimized!
Dropout
At every training iteration, dropout randomly selects some neurons and removes them along with all of their incoming & outgoing connections
The probability that a neuron is selected is the dropout rate (also a hyperparameter)
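A rough sketch of dropout applied to one layer's activations. The rescaling of the surviving activations ("inverted dropout") is an assumption based on common practice and is not stated in the slides; the activation values are made up.

```python
import random


def dropout(activations, rate, rng):
    """Zero each activation with probability `rate` (used during training only)."""
    keep_prob = 1.0 - rate
    out = []
    for a in activations:
        if rng.random() < rate:
            out.append(0.0)            # neuron dropped for this iteration
        else:
            out.append(a / keep_prob)  # assumed inverted dropout: rescale survivors
    return out


rng = random.Random(0)
hidden = [0.5, 1.2, -0.3, 0.8, 2.0, -1.1]  # made-up activations
print(dropout(hidden, rate=0.5, rng=rng))
```

A different random subset is dropped on every call, so the network cannot rely on any single neuron.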
Embeddings
02
What are Embeddings? Why do we need them?
What?
Low dimensional continuous vector representations
Why?
[Figure: the input ‘Hello!’ is mapped to an output vector of numbers, a machine readable format]
Can I use one-hot encoding to ‘embed’ my words?
One-hot Encoding
1 Step: Create a vector for each unique word
E.g.
| | I | love | data | science |
| I | 1 | 0 | 0 | 0 |
| love | 0 | 1 | 0 | 0 |
| data | 0 | 0 | 1 | 0 |
| science | 0 | 0 | 0 | 1 |
Is this machine readable? YES!
Is this a good embedding? NO :(
Why not?
1000 unique words in our corpus = a 1000-dimensional vector for every word
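A minimal sketch of one-hot encoding in plain Python, using the "I love data science" example. The helper name `build_one_hot` is made up for illustration.

```python
def build_one_hot(words):
    """Map each unique word to a vector with a single 1 at its own index."""
    vocab = list(dict.fromkeys(words))  # unique words, in first-seen order
    index = {w: i for i, w in enumerate(vocab)}
    vectors = {}
    for w in vocab:
        vec = [0] * len(vocab)          # vector length equals vocabulary size
        vec[index[w]] = 1
        vectors[w] = vec
    return vectors


vectors = build_one_hot("I love data science".split())
for word, vec in vectors.items():
    print(f"{word:8s} {vec}")
```

Each vector is as long as the vocabulary, so the representation grows with every new word in the corpus.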
Semantic Similarity
Ideally, the vectors used should ‘capture the meaning’ of our words
Words for man-made equipment are close together
Words for animals are close together
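One common way to measure how close two word vectors are is cosine similarity. In this sketch the 3-dimensional embeddings are hypothetical numbers invented for illustration, not output from any trained model.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# One-hot vectors: any two different words have similarity 0
print(cosine_similarity([1, 0, 0, 0], [0, 1, 0, 0]))

# Hypothetical dense embeddings, invented for illustration
toy = {
    "cat":    [0.9, 0.8, 0.1],
    "dog":    [0.8, 0.9, 0.2],
    "hammer": [0.1, 0.2, 0.9],
}
print(cosine_similarity(toy["cat"], toy["dog"]))
print(cosine_similarity(toy["cat"], toy["hammer"]))
```

With one-hot vectors every pair of words looks equally unrelated, while dense vectors can place related words near each other.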
Types of Embeddings available
Custom Embeddings (Note: one-hot encoding is also in this category!)
Pre-trained Embeddings (Examples: GloVe, Word2Vec)
What is Word2Vec?
A neural network based embedding model with a single hidden layer, created in 2013 by researchers at Google
CBOW
In Continuous Bag of Words, the model predicts a target word from its surrounding context words
| I | really | love | studying | data | science |
Skip-gram
In Skip-gram, the model predicts the surrounding context words from a target word
| I | really | love | studying | data | science |
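The two Word2Vec variants differ in how training examples are built from a sentence: CBOW predicts a target word from its context, while Skip-gram predicts context words from the target. The sketch below builds both kinds of examples; the window size of 2 and the helper names are assumptions for illustration.

```python
def context_windows(tokens, window=2):
    """Yield (target, context_words) for every position in the sentence."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right


def cbow_examples(tokens, window=2):
    # CBOW: the context words are the input, the target word is the label
    return [(context, target) for target, context in context_windows(tokens, window)]


def skipgram_examples(tokens, window=2):
    # Skip-gram: the target word is the input, each context word is a label
    return [(target, c)
            for target, context in context_windows(tokens, window)
            for c in context]


tokens = "I really love studying data science".split()
print(cbow_examples(tokens)[2])
print(skipgram_examples(tokens)[:2])
```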
How do we generate vectors?
In both models, we are interested in the weights of the hidden layer!
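To see why the hidden layer weights are the word vectors, note that multiplying a one-hot input by the weight matrix simply picks out one row. This NumPy sketch uses random numbers as a stand-in for trained weights, and the embedding size of 3 is a hypothetical choice.

```python
import numpy as np

vocab = ["I", "really", "love", "studying", "data", "science"]
embedding_dim = 3  # hypothetical size, kept tiny for readability

rng = np.random.default_rng(0)
# Stand-in for trained input-to-hidden weights: one row per word
W_hidden = rng.normal(size=(len(vocab), embedding_dim))


def one_hot(word):
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec


def embed(word):
    # Looking up a row is equivalent to one_hot(word) @ W_hidden
    return W_hidden[vocab.index(word)]


print(one_hot("love") @ W_hidden)
print(embed("love"))
```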
What is GloVe?
In addition to semantic similarity, it takes into account how frequently words co-occur across the whole corpus
Created in 2014 by researchers at Stanford
Let’s work it out in our Colab file:
tinyurl.com/nlp2colab
RNN
03
Recurrent Neural Networks
Recurrent Neural Networks (RNN)
[Figure: an RNN cell takes input xt, produces output yt, and feeds its hidden state ht back into itself]
Unfolding an RNN
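[Figure: the RNN unrolled over time steps t-1, t and t+1. Each copy takes input x and produces output y, and the hidden states ht-1 and ht are passed on to the next step]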
Recurrence relation
A recurrence relation is applied at every time step to process a sequence: the new hidden state ht is computed from the state at the previous time step, ht-1, and the current input xt
The same function and set of parameters are used at every time step
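A sketch of one common form of the recurrence, h_t = tanh(W_hh·h_{t-1} + W_xh·x_t + b_h), written with NumPy. The tanh activation, the weight names, and the random weights are assumptions for illustration; the Colab may use a different formulation.

```python
import numpy as np


def rnn_step(h_prev, x_t, W_hh, W_xh, b_h):
    # New state from the previous state and the current input
    return np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)


def run_rnn(xs, hidden_size, seed=0):
    rng = np.random.default_rng(seed)
    input_size = xs.shape[1]
    W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
    W_xh = rng.normal(scale=0.5, size=(hidden_size, input_size))
    b_h = np.zeros(hidden_size)
    h = np.zeros(hidden_size)          # initial state
    states = []
    for x_t in xs:                     # the same weights are reused at every time step
        h = rnn_step(h, x_t, W_hh, W_xh, b_h)
        states.append(h)
    return np.stack(states)


xs = np.random.default_rng(1).normal(size=(5, 3))  # 5 time steps, 3 features each
states = run_rnn(xs, hidden_size=4)
print(states.shape)
```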
Backpropagation Through Time (BPTT)
[Figure: the unrolled RNN with a loss at each time step (L0, L1, L2) combined into the total loss L. Gradients flow backwards through every time step]
Let’s go back to the Colab and see how a simple RNN model is implemented!
LSTM
04
Long Short Term Memory
Problems with RNNs
Backpropagation Through Time (BPTT)
Gradient Slowly Disappears!!
So what?
Neural Nets learn using the gradient
So, what happens if the gradient vanishes??
The Network Doesn’t Learn!
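The vanishing gradient can be seen in a one-dimensional toy RNN, h_t = tanh(w·h_{t-1} + x_t). Backpropagating through each step multiplies the gradient by one more factor, w·(1 − h_t²). The weight of 0.9 and the constant inputs are hypothetical values chosen for illustration.

```python
import math


def gradient_through_time(w, inputs):
    """Return |d h_t / d h_0| after each step of h_t = tanh(w * h_{t-1} + x_t)."""
    h = 0.0
    grad = 1.0
    history = []
    for x_t in inputs:
        h = math.tanh(w * h + x_t)
        grad *= w * (1.0 - h * h)  # chain rule: one extra factor per time step
        history.append(abs(grad))
    return history


history = gradient_through_time(w=0.9, inputs=[0.5] * 50)
for t in (1, 10, 25, 50):
    print(f"after {t:2d} steps: {history[t - 1]:.2e}")
```

Because every factor here is smaller than 1, the influence of the earliest inputs on the final state shrinks towards zero.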
Specifically, the network becomes focused on the last few inputs
[Figure: the network remembers the last few inputs and forgets the earlier ones]
The network has an extremely short term memory!
How do we fix this?
We have to add some memory!
“Forget” the less useful stuff, and implement some memory!
RNNs
[Figure: hidden states h1 → h2 → h3]
Information carried along by the hidden state
Hidden state is dominated by recency bias
Implementing Memory
Key Idea
While passing along the hidden state, pass along a “cell state” as well!
h (hidden state): carries local info
c (cell state): carries important info from the past!
Anatomy of an LSTM Cell
Seems complicated!
Let’s break it down.
3 parts to it!
Forget
Input
Output
Forget Gate
Inputs: $h_{i-1}$ and $x_i$, combined as $h_{i-1} + x_i$
$h_{i-1} + x_i$ → Sigmoid → $f_i$ (forget vector)
The forget vector encodes what previous info we need to forget!
Input Gate
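Inputs: $h_{i-1}$ and $x_i$, combined as $h_{i-1} + x_i$
$h_{i-1} + x_i$ → Sigmoid → $i_i$ (input vector)
$h_{i-1} + x_i$ → tanh → $c'_i$ (candidate vector)
$i_i * c'_i$: the update applied to the cell state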
Applying the Update
$c_i = f_i * c_{i-1} + i_i * c'_i$ (new cell state)
Output Gate
$h_{i-1} + x_i$ → Sigmoid
$c_i$ → tanh
Multiply the two results to get $h_i$
Putting it all together
[Figure: the full LSTM cell. The inputs ci-1, hi-1 and xi pass through the forget gate, input gate and output gate to produce the new cell state ci and the new hidden state hi]
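The three gates can be put together in a short NumPy sketch of a single LSTM step. Two things here are assumptions rather than something the slides state: the combination of the previous hidden state and the input is treated as a concatenation, and each gate has its own learned weight matrix and bias (randomly initialised here, with made-up names such as `W_f`).

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(c_prev, h_prev, x, params):
    combined = np.concatenate([h_prev, x])                      # previous hidden state + input
    f = sigmoid(params["W_f"] @ combined + params["b_f"])       # forget vector
    i = sigmoid(params["W_i"] @ combined + params["b_i"])       # input vector
    c_cand = np.tanh(params["W_c"] @ combined + params["b_c"])  # candidate vector
    c = f * c_prev + i * c_cand                                 # new cell state
    o = sigmoid(params["W_o"] @ combined + params["b_o"])       # output gate
    h = o * np.tanh(c)                                          # new hidden state
    return c, h


def init_params(hidden_size, input_size, seed=0):
    rng = np.random.default_rng(seed)
    params = {}
    for gate in "fico":
        shape = (hidden_size, hidden_size + input_size)
        params[f"W_{gate}"] = rng.normal(scale=0.1, size=shape)
        params[f"b_{gate}"] = np.zeros(hidden_size)
    return params


hidden_size, input_size = 4, 3
params = init_params(hidden_size, input_size)
c = np.zeros(hidden_size)
h = np.zeros(hidden_size)
for x in np.random.default_rng(1).normal(size=(5, input_size)):
    c, h = lstm_step(c, h, x, params)
print(c.shape, h.shape)
```

The cell state is only ever scaled and added to, never squashed through an activation on its way forward, which is what lets it carry information across many steps.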
Time for a practical demo!
Transformers
05
Pay Attention!
LSTMs are great, but…
They take ages to train
Still have some issues in training longer sequences
They still use ALL the info of the sequence!
Wait what?
Consider the following:
Input: I am a human → Machine Translation → Output: Eu sou um humano
Position matters!
For a sequence to sequence mapping, the order and position matter!
[Figure: for a given output word, some words of the input ‘I am a human’ matter more than others]
We need to account for each individual input along with the context!
Encoder - Decoder Models Explained
Key Idea: Use 2 LSTMs!
Encoder LSTM: converts the input to a fixed length representation
Decoder LSTM: converts the fixed length representation to the output
Great for sequence to sequence tasks!
A 2-part model
Input → Encoder → Context (+ all hidden states!) → Decoder → Output
Why transformers are so good
Local and Global!
Idea: Use context vector + hidden state at each timestep in the decoder!
[Figure: the decoder hidden state is used together with the hidden states at every time step (h1, h2, h3) to compute the context]
How Do We Know What’s Relevant?
We get inspired by search engines!
[Figure: a Query (from the decoder) is mapped onto Keys (from the encoder). Each match gets a score, and the scores are multiplied with the Values to get the best matches]
Computing an Output
For a particular output:
1. Assign a score to each hidden state!
2. Softmax the scores
3. Scale the hidden states by the softmaxed scores
4. Add up the scaled hidden states!
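The scoring steps above can be sketched in a few lines of NumPy. The dot product is used as the scoring function here, which is one common choice and an assumption on our part; the encoder and decoder states are made-up numbers.

```python
import numpy as np


def softmax(scores):
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


def attention_context(decoder_state, encoder_states):
    scores = encoder_states @ decoder_state     # score each encoder hidden state
    weights = softmax(scores)                   # softmax the scores
    scaled = weights[:, None] * encoder_states  # scale each hidden state by its weight
    context = scaled.sum(axis=0)                # add up the scaled hidden states
    return context, weights


# Made-up encoder hidden states (one row per input position) and decoder state
encoder_states = np.array([[1.0, 0.0],
                           [0.0, 1.0],
                           [1.0, 1.0]])
decoder_state = np.array([2.0, 0.0])
context, weights = attention_context(decoder_state, encoder_states)
print(weights, context)
```

Hidden states that line up with the decoder state receive larger weights, so they contribute more to the context vector.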
Making a prediction!
Hidden State (contains global info) + Context Vector (contains local info) → Output
Let’s Move to the Colab
Current State of Transformer Models
Transformers are the current state-of-the-art models for NLP
Further research is still ongoing!
All of these models use the attention mechanism!
Conclusion
06
Summary of Results
Simple RNN
LSTM
Basic NN
78%
57%
87%
Keep in mind that all these networks are shallow!
Transformers
93%
Quick Summary
Recap on Deep Learning & NLP 1 Workshop
Embeddings
RNNs
LSTMs
What’s next?
Recap on Deep Learning & NLP 1 Workshop
Embeddings
RNNs
LSTMs
Own NLP Project?
Any Further Questions for us?
Slido Code: #4299