1 of 22

Lecture 3
Recurrent Neural Networks – Part 1
Sookyung Kim
sookim@ewha.ac.kr


2 of 22

Types of Neural Networks

  • One-to-one: vanilla neural net (e.g., image classification)
  • Many-to-one: action recognition (frames → class)
  • Many-to-many: frame classification (frames → classes)


3 of 22

Types of Neural Networks

  • One-to-many: image captioning (image → words)
  • Many-to-many: video captioning (video → words)


4 of 22

Recurrent Neural Networks


5 of 22

RNN Basics

RNNs have an internal state that is updated as a sequence is processed.

  • Through a feedback loop, the new state value is determined by both its previous value and the current input.

[Figure: an RNN block with input x and output y; the hidden state feeds back into the block.]


6 of 22

RNN Basics

Expanded view of an RNN:

  • Each RNN cell takes an input xi, updates its hidden state from hi-1 to hi, and then (optionally) returns an output yi.
  • The hidden state needs to be initialized somehow (h0).

[Figure: unrolled RNN; starting from h0, each cell takes xt and ht-1, produces yt, and passes ht on to the next step, for t = 1, …, T.]


7 of 22

RNN Basics

  • A sequence of vectors {x1, x2, …, xT} is processed by applying a recurrence formula at every time step:
      ht = fW(ht-1, xt)
    where ht is the new hidden state, ht-1 the previous hidden state, xt the input at step t, and fW a function with parameters W.
  • It is important to note that the same function fW and the same set of parameters W are used at every time step.

[Figure: RNN block with input x, output y, and a recurrent hidden state.]


8 of 22

Vanilla RNN

  • Then what should we do inside the RNN cell?
    • One simple way is to take linear transformations (with Whh and Wxh) of the two inputs, the previous hidden state ht-1 and the current input xt,
    • then apply a nonlinearity (tanh) to obtain the new state:  ht = tanh(Whh ht-1 + Wxh xt).
    • For the output, we may put another linear transformation on top of ht:  yt = Why ht (see the sketch after the diagram below).

[Figure: one vanilla RNN cell fW; inputs xt and ht-1, weights Whh, Wxh, Why, new state ht, output yt.]
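As an illustration, here is a minimal NumPy sketch of a single vanilla RNN step. The function name, the toy shapes, and the bias terms bh and by are my own additions; the slide only names Whh, Wxh, and Why.

import numpy as np

def vanilla_rnn_step(x_t, h_prev, Wxh, Whh, Why, bh, by):
    # Linear transformations of the current input and the previous hidden state,
    # followed by a tanh nonlinearity, produce the new hidden state ht.
    h_t = np.tanh(Whh @ h_prev + Wxh @ x_t + bh)
    # Another linear transformation of ht produces the output yt.
    y_t = Why @ h_t + by
    return h_t, y_t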


9 of 22

Vanilla RNN

  • Again, all weights are shared across time!
    • During the forward pass, the same weights Whh, Wxh, and Why are used repeatedly at every step, as in the sketch below.

[Figure: unrolled vanilla RNN over t = 1, 2, 3; the same Whh, Wxh, and Why are used at every step.]
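A minimal sketch of the forward pass over a short sequence (toy dimensions of my own choosing); note that the single set of weights is reused at every time step:

import numpy as np

T, input_dim, hidden_dim, out_dim = 3, 8, 4, 5
xs = np.random.randn(T, input_dim)            # the input sequence x1, x2, x3

# One set of weights, shared across all time steps.
Wxh = np.random.randn(hidden_dim, input_dim) * 0.1
Whh = np.random.randn(hidden_dim, hidden_dim) * 0.1
Why = np.random.randn(out_dim, hidden_dim) * 0.1

h = np.zeros(hidden_dim)                      # h0: initial hidden state
ys = []
for x_t in xs:                                # t = 1, 2, 3
    h = np.tanh(Whh @ h + Wxh @ x_t)          # same Whh and Wxh at every step
    ys.append(Why @ h)                        # same Why at every step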


10 of 22

Vanilla RNN

  • Then, where do we compare with the ground truth?
    • At the outputs {y1, y2, …}!
    • The per-step losses at each output are combined (e.g., summed) to compute the overall loss, as sketched below.

[Figure: unrolled RNN; each output y1, y2, y3 is compared with its ground truth, and the per-step losses are combined into the total loss.]
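For example (my own sketch; the slides do not fix a particular loss), a softmax cross-entropy can be computed at each output and summed over the sequence to give the overall loss:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy per-step outputs y1, y2, y3 and integer class targets, random for illustration only.
T, num_classes = 3, 5
ys = [np.random.randn(num_classes) for _ in range(T)]
targets = [np.random.randint(num_classes) for _ in range(T)]

# Cross-entropy loss at each step, summed into the overall training loss.
total_loss = sum(-np.log(softmax(y_t)[target_t])
                 for y_t, target_t in zip(ys, targets))
print(total_loss)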


11 of 22

Vanilla RNN

  • What if our problem is many-to-one?
    • We output only once at the end of the sequence.
    • Often, intermediate hidden states are used as well when the output is determined.

[Figure: many-to-one RNN; inputs x1, x2, x3 update the hidden state and a single output yT is produced at the last step.]
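A minimal Keras sketch of the many-to-one case (my own example; the layer sizes are arbitrary). With the default return_sequences=False, only the final hidden state reaches the classifier:

import tensorflow as tf

inputs = tf.keras.Input(shape=(10, 8))                       # (seq_len, input_dim)
h_last = tf.keras.layers.SimpleRNN(16)(inputs)               # only the last hidden state hT
y = tf.keras.layers.Dense(3, activation='softmax')(h_last)   # one prediction yT per sequence
model = tf.keras.Model(inputs, y)

To also use the intermediate hidden states, one could instead set return_sequences=True and pool the per-step states (for instance with GlobalAveragePooling1D) before the Dense layer.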


12 of 22

Vanilla RNN

  • What about one-to-many?
    • We still must input something at each step, given the recurrence ht = fW(ht-1, xt).
    • Autoregressive input: for time-series data, the lagged (autoregressive) values of the series, i.e., the outputs already generated, are fed back as the inputs of the following steps (see the sketch below).

[Figure: one-to-many RNN; only x1 is given, and each output y1, y2, … is fed back as the next step's input.]
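A minimal NumPy sketch of autoregressive generation (toy dimensions of my own choosing): only x1 is given, and each output is fed back as the next step's input:

import numpy as np

hidden_dim, dim = 4, 3
Whh = np.random.randn(hidden_dim, hidden_dim) * 0.1
Wxh = np.random.randn(hidden_dim, dim) * 0.1
Why = np.random.randn(dim, hidden_dim) * 0.1

h = np.zeros(hidden_dim)          # h0
x = np.random.randn(dim)          # x1: the only given input
ys = []
for t in range(5):                # generate a length-5 output sequence
    h = np.tanh(Whh @ h + Wxh @ x)
    y = Why @ h
    ys.append(y)
    x = y                         # autoregressive: the output becomes the next input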


13 of 22

Vanilla RNN

  • Lastly, how do we implement many-to-many?
    • Use many-to-one as an encoder, then one-to-many as a decoder.
    • The input sequence is encoded into a single vector at the end of the encoder.
    • From this single vector, the decoder generates the output sequence.
    • This is called sequence-to-sequence, or seq2seq.

[Figure: seq2seq; an encoder RNN reads x1, x2, x3 into a single vector (h3), which initializes the decoder RNN (states s1, s2, s3) that generates y1, y2, y3.]
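A minimal Keras sketch of seq2seq (my own example; shapes and sizes are arbitrary). The encoder's final hidden state is the single vector, and it initializes the decoder's state:

import tensorflow as tf

# Encoder: many-to-one; its final hidden state summarizes the input sequence.
enc_inputs = tf.keras.Input(shape=(None, 8))
enc_state = tf.keras.layers.SimpleRNN(16)(enc_inputs)

# Decoder: one-to-many; it starts from the encoder state and emits one output per step.
dec_inputs = tf.keras.Input(shape=(None, 8))           # e.g., shifted target sequence
dec_seq = tf.keras.layers.SimpleRNN(16, return_sequences=True)(
    dec_inputs, initial_state=enc_state)
dec_outputs = tf.keras.layers.Dense(10, activation='softmax')(dec_seq)

model = tf.keras.Model([enc_inputs, dec_inputs], dec_outputs)

At inference time the decoder would be run autoregressively, feeding each generated output back in as the next decoder input, as on the previous slide.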


14 of 22

TensorFlow API: Vanilla RNN

The first argument, units (4 in the example below), sets the dimensionality of the hidden state.

>> import numpy as np
>> import tensorflow as tf
>> input_shape = [32, 10, 8]  # (batch_size, seq_len, dim)
>> inputs = np.random.random(input_shape).astype(np.float32)
>> simple_rnn = tf.keras.layers.SimpleRNN(4, return_sequences=True, return_state=True)
>> output_seq, final_state = simple_rnn(inputs)
>> print(output_seq.shape)   # result: [32, 10, 4]
>> print(final_state.shape)  # result: [32, 4]
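Continuing the snippet above (my own addition, not on the slide): with the default flags the layer returns only the output of the last time step.

>> simple_rnn_last = tf.keras.layers.SimpleRNN(4)  # return_sequences=False, return_state=False
>> last_output = simple_rnn_last(inputs)
>> print(last_output.shape)  # result: [32, 4]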

tf.keras.layers.SimpleRNN(
    units, activation='tanh', use_bias=True,
    kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal',
    bias_initializer='zeros', kernel_regularizer=None,
    recurrent_regularizer=None, bias_regularizer=None, activity_regularizer=None,
    kernel_constraint=None, recurrent_constraint=None, bias_constraint=None,
    dropout=0.0, recurrent_dropout=0.0,
    return_sequences=False, return_state=False,
    go_backwards=False, stateful=False, unroll=False, **kwargs
)


15 of 22

RNN Trade-offs

  • RNN Advantages:
    • Can process inputs of any length.
    • Model size doesn't increase for longer inputs (see the sketch after this list).
    • Computation at step t can (in theory) use information from many steps back.
    • The same weights are applied at every timestep, so there is symmetry in how inputs are processed.
  • RNN Disadvantages:
    • Recurrent computation is slow.
    • Inference over a sequence is hard to parallelize.
    • The vanilla RNN suffers from vanishing gradients during training.
    • The vanilla RNN often fails to model long-range dependencies in a sequence.
    • In practice, it is difficult to access information from many steps back.
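As a quick check of the weight-sharing point above (my own sketch; for SimpleRNN the trainable parameter count is input_dim × units + units × units + units), the number of parameters does not depend on the sequence length:

import numpy as np
import tensorflow as tf

rnn_short = tf.keras.layers.SimpleRNN(4)
_ = rnn_short(np.zeros((1, 10, 8), dtype=np.float32))     # seq_len = 10
print(rnn_short.count_params())                           # 8*4 + 4*4 + 4 = 52

rnn_long = tf.keras.layers.SimpleRNN(4)
_ = rnn_long(np.zeros((1, 1000, 8), dtype=np.float32))    # seq_len = 1000
print(rnn_long.count_params())                            # still 52: the same weights are reused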


16 of 22

Multi-layer RNN

  • We may stack more than one hidden layer (a sketch follows the diagram below).

[Figure: three-layer RNN unrolled over t = 1, 2, 3; each layer l has its own hidden states hl,t, and the top layer produces y1, y2, y3.]
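A minimal Keras sketch of a stacked RNN (my own example; sizes are arbitrary). Every layer except possibly the last must return the full sequence so that the next layer receives an input at every time step:

import tensorflow as tf

inputs = tf.keras.Input(shape=(10, 8))
h1 = tf.keras.layers.SimpleRNN(16, return_sequences=True)(inputs)   # layer 1 states h1,t
h2 = tf.keras.layers.SimpleRNN(16, return_sequences=True)(h1)       # layer 2 states h2,t
h3 = tf.keras.layers.SimpleRNN(16, return_sequences=True)(h2)       # layer 3 states h3,t
y = tf.keras.layers.Dense(5)(h3)                                     # per-step outputs yt
model = tf.keras.Model(inputs, y)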


17 of 22

So, where can we use RNN?


18 of 22

Image Captioning


19 of 22

Visual Question & Answering (VQA)


20 of 22

Visual Dialog (Conversation about an Image)


21 of 22

Visual Language Navigation


22 of 22

Towards Modeling Longer Dependence
