1 of 22

Lecture 3
Recurrent Neural Networks – Part 1
Sookyung Kim
sookim@ewha.ac.kr


2 of 22

Types of Neural Networks

  • One-to-one: vanilla neural net (e.g., image classification)
  • Many-to-one: action recognition (frames → class)
  • Many-to-many: frame classification (frames → classes)


3 of 22

Types of Neural Networks

  • One-to-many: image captioning (image → words)
  • Many-to-many: video captioning (video → words)


4 of 22

Recurrent Neural Networks


5 of 22

RNN Basics

RNNs have an internal state that is updated as a sequence is processed.

  • Through a feedback loop, the new state value is determined by both its previous value and the current input.

[Figure: an RNN block with input x and output y; the hidden state feeds back into the block.]


6 of 22

RNN Basics

Expanded view of an RNN:

  • Each RNN cell takes an input xi, updates its hidden state from hi-1 to hi, and then (optionally) returns an output yi.
  • The hidden state needs to be initialized somehow (h0).

[Figure: unrolled RNN; starting from h0, each cell takes xt and ht-1, produces yt, and passes ht on to the next step, for t = 1, …, T.]


7 of 22

RNN Basics

  • A sequence of vectors {x1, x2, …, xT} is processed by applying a recurrence formula at every time step:
      ht = fW(ht-1, xt)
    where ht is the new hidden state, ht-1 the previous hidden state, xt the input at step t, and fW a function with parameters W.
  • It is important to note that the same function fW and the same set of parameters W are used at every time step.

[Figure: RNN block with input x, output y, and a recurrent hidden state.]


8 of 22

Vanilla RNN

  • Then what should we do inside the RNN cell?
    • One simple way is to take linear transformations (with Whh and Wxh) of the two inputs, the previous hidden state ht-1 and the current input xt,
    • then apply a nonlinearity (tanh) to obtain the new state:  ht = tanh(Whh ht-1 + Wxh xt).
    • For the output, we may put another linear transformation on top of ht:  yt = Why ht (see the sketch after the diagram below).

[Figure: one vanilla RNN cell fW; inputs xt and ht-1, weights Whh, Wxh, Why, new state ht, output yt.]
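As an illustration, here is a minimal NumPy sketch of a single vanilla RNN step. The function name, the toy shapes, and the bias terms bh and by are my own additions; the slide only names Whh, Wxh, and Why.

import numpy as np

def vanilla_rnn_step(x_t, h_prev, Wxh, Whh, Why, bh, by):
    # Linear transformations of the current input and the previous hidden state,
    # followed by a tanh nonlinearity, produce the new hidden state ht.
    h_t = np.tanh(Whh @ h_prev + Wxh @ x_t + bh)
    # Another linear transformation of ht produces the output yt.
    y_t = Why @ h_t + by
    return h_t, y_t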


9 of 22

Vanilla RNN

  • Again, all weights are shared across time!
    • During the forward pass, the same weights Whh, Wxh, and Why are used repeatedly at every step, as in the sketch below.

[Figure: unrolled vanilla RNN over t = 1, 2, 3; the same Whh, Wxh, and Why are used at every step.]
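A minimal sketch of the forward pass over a short sequence (toy dimensions of my own choosing); note that the single set of weights is reused at every time step:

import numpy as np

T, input_dim, hidden_dim, out_dim = 3, 8, 4, 5
xs = np.random.randn(T, input_dim)            # the input sequence x1, x2, x3

# One set of weights, shared across all time steps.
Wxh = np.random.randn(hidden_dim, input_dim) * 0.1
Whh = np.random.randn(hidden_dim, hidden_dim) * 0.1
Why = np.random.randn(out_dim, hidden_dim) * 0.1

h = np.zeros(hidden_dim)                      # h0: initial hidden state
ys = []
for x_t in xs:                                # t = 1, 2, 3
    h = np.tanh(Whh @ h + Wxh @ x_t)          # same Whh and Wxh at every step
    ys.append(Why @ h)                        # same Why at every step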


10 of 22

Vanilla RNN

  • Then, where do we compare with the ground truth?
    • At the outputs {y1, y2, …}!
    • The per-step losses at each output are combined (e.g., summed) to compute the overall loss, as sketched below.

[Figure: unrolled RNN; each output y1, y2, y3 is compared with its ground truth, and the per-step losses are combined into the total loss.]
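For example (my own sketch; the slides do not fix a particular loss), a softmax cross-entropy can be computed at each output and summed over the sequence to give the overall loss:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy per-step outputs y1, y2, y3 and integer class targets, random for illustration only.
T, num_classes = 3, 5
ys = [np.random.randn(num_classes) for _ in range(T)]
targets = [np.random.randint(num_classes) for _ in range(T)]

# Cross-entropy loss at each step, summed into the overall training loss.
total_loss = sum(-np.log(softmax(y_t)[target_t])
                 for y_t, target_t in zip(ys, targets))
print(total_loss)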


11 of 22

Vanilla RNN

  • What if our problem is many-to-one?
    • We output only once at the end of the sequence.
    • Often, intermediate hidden states are used as well when the output is determined.

[Figure: many-to-one RNN; inputs x1, x2, x3 update the hidden state and a single output yT is produced at the last step.]
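A minimal Keras sketch of the many-to-one case (my own example; the layer sizes are arbitrary). With the default return_sequences=False, only the final hidden state reaches the classifier:

import tensorflow as tf

inputs = tf.keras.Input(shape=(10, 8))                       # (seq_len, input_dim)
h_last = tf.keras.layers.SimpleRNN(16)(inputs)               # only the last hidden state hT
y = tf.keras.layers.Dense(3, activation='softmax')(h_last)   # one prediction yT per sequence
model = tf.keras.Model(inputs, y)

To also use the intermediate hidden states, one could instead set return_sequences=True and pool the per-step states (for instance with GlobalAveragePooling1D) before the Dense layer.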


12 of 22

Vanilla RNN

  • What about one-to-many?
    • We still must input something at each step, given the recurrence ht = fW(ht-1, xt).
    • Autoregressive input: for time-series data, the lagged (autoregressive) values of the series, i.e., the outputs already generated, are fed back as the inputs of the following steps (see the sketch below).

[Figure: one-to-many RNN; only x1 is given, and each output y1, y2, … is fed back as the next step's input.]
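A minimal NumPy sketch of autoregressive generation (toy dimensions of my own choosing): only x1 is given, and each output is fed back as the next step's input:

import numpy as np

hidden_dim, dim = 4, 3
Whh = np.random.randn(hidden_dim, hidden_dim) * 0.1
Wxh = np.random.randn(hidden_dim, dim) * 0.1
Why = np.random.randn(dim, hidden_dim) * 0.1

h = np.zeros(hidden_dim)          # h0
x = np.random.randn(dim)          # x1: the only given input
ys = []
for t in range(5):                # generate a length-5 output sequence
    h = np.tanh(Whh @ h + Wxh @ x)
    y = Why @ h
    ys.append(y)
    x = y                         # autoregressive: the output becomes the next input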


13 of 22

Vanilla RNN

  • Lastly, how do we implement many-to-many?
    • Use many-to-one as an encoder, then one-to-many as a decoder.
    • The input sequence is encoded into a single vector at the end of the encoder.
    • From this single vector, the decoder generates the output sequence.
    • This is called sequence-to-sequence, or seq2seq.

[Figure: seq2seq; an encoder RNN reads x1, x2, x3 into a single vector (h3), which initializes the decoder RNN (states s1, s2, s3) that generates y1, y2, y3.]
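A minimal Keras sketch of seq2seq (my own example; shapes and sizes are arbitrary). The encoder's final hidden state is the single vector, and it initializes the decoder's state:

import tensorflow as tf

# Encoder: many-to-one; its final hidden state summarizes the input sequence.
enc_inputs = tf.keras.Input(shape=(None, 8))
enc_state = tf.keras.layers.SimpleRNN(16)(enc_inputs)

# Decoder: one-to-many; it starts from the encoder state and emits one output per step.
dec_inputs = tf.keras.Input(shape=(None, 8))           # e.g., shifted target sequence
dec_seq = tf.keras.layers.SimpleRNN(16, return_sequences=True)(
    dec_inputs, initial_state=enc_state)
dec_outputs = tf.keras.layers.Dense(10, activation='softmax')(dec_seq)

model = tf.keras.Model([enc_inputs, dec_inputs], dec_outputs)

At inference time the decoder would be run autoregressively, feeding each generated output back in as the next decoder input, as on the previous slide.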


14 of 22

TensorFlow API: Vanilla RNN

The first argument, units (4 in the example below), sets the dimensionality of the hidden state.

>> import numpy as np
>> import tensorflow as tf
>> input_shape = [32, 10, 8]  # (batch_size, seq_len, dim)
>> inputs = np.random.random(input_shape).astype(np.float32)
>> simple_rnn = tf.keras.layers.SimpleRNN(4, return_sequences=True, return_state=True)
>> output_seq, final_state = simple_rnn(inputs)
>> print(output_seq.shape)   # result: [32, 10, 4]
>> print(final_state.shape)  # result: [32, 4]
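Continuing the snippet above (my own addition, not on the slide): with the default flags the layer returns only the output of the last time step.

>> simple_rnn_last = tf.keras.layers.SimpleRNN(4)  # return_sequences=False, return_state=False
>> last_output = simple_rnn_last(inputs)
>> print(last_output.shape)  # result: [32, 4]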

tf.keras.layers.SimpleRNN(
    units, activation='tanh', use_bias=True,
    kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal',
    bias_initializer='zeros', kernel_regularizer=None,
    recurrent_regularizer=None, bias_regularizer=None, activity_regularizer=None,
    kernel_constraint=None, recurrent_constraint=None, bias_constraint=None,
    dropout=0.0, recurrent_dropout=0.0,
    return_sequences=False, return_state=False,
    go_backwards=False, stateful=False, unroll=False, **kwargs
)


15 of 22

RNN Trade-offs

  • RNN Advantages:
    • Can process inputs of any length.
    • Model size doesn't increase for longer inputs (see the sketch after this list).
    • Computation at step t can (in theory) use information from many steps back.
    • The same weights are applied at every timestep, so there is symmetry in how inputs are processed.
  • RNN Disadvantages:
    • Recurrent computation is slow.
    • Inference over a sequence is hard to parallelize.
    • The vanilla RNN suffers from vanishing gradients during training.
    • The vanilla RNN often fails to model long-range dependencies in a sequence.
    • In practice, it is difficult to access information from many steps back.
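As a quick check of the weight-sharing point above (my own sketch; for SimpleRNN the trainable parameter count is input_dim × units + units × units + units), the number of parameters does not depend on the sequence length:

import numpy as np
import tensorflow as tf

rnn_short = tf.keras.layers.SimpleRNN(4)
_ = rnn_short(np.zeros((1, 10, 8), dtype=np.float32))     # seq_len = 10
print(rnn_short.count_params())                           # 8*4 + 4*4 + 4 = 52

rnn_long = tf.keras.layers.SimpleRNN(4)
_ = rnn_long(np.zeros((1, 1000, 8), dtype=np.float32))    # seq_len = 1000
print(rnn_long.count_params())                            # still 52: the same weights are reused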


16 of 22

Multi-layer RNN

  • We may stack more than one hidden layer (a sketch follows the diagram below).

[Figure: three-layer RNN unrolled over t = 1, 2, 3; each layer l has its own hidden states hl,t, and the top layer produces y1, y2, y3.]
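A minimal Keras sketch of a stacked RNN (my own example; sizes are arbitrary). Every layer except possibly the last must return the full sequence so that the next layer receives an input at every time step:

import tensorflow as tf

inputs = tf.keras.Input(shape=(10, 8))
h1 = tf.keras.layers.SimpleRNN(16, return_sequences=True)(inputs)   # layer 1 states h1,t
h2 = tf.keras.layers.SimpleRNN(16, return_sequences=True)(h1)       # layer 2 states h2,t
h3 = tf.keras.layers.SimpleRNN(16, return_sequences=True)(h2)       # layer 3 states h3,t
y = tf.keras.layers.Dense(5)(h3)                                     # per-step outputs yt
model = tf.keras.Model(inputs, y)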


17 of 22

So, where can we use RNN?


18 of 22

Image Captioning


19 of 22

Visual Question & Answering (VQA)


20 of 22

Visual Dialog (Conversation about an Image)


21 of 22

Visual Language Navigation


22 of 22

Towards Modeling Longer Dependence
