1 of 18

Neural Turing Machines (NTM)

2 of 18

Copy Task

  • Input: a random sequence of k-bit vectors, terminated by an EOF delimiter
  • Output: the same sequence as the input (without the EOF delimiter)
  • No input is presented while the network is producing the output!
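As a concrete illustration (our own sketch, not from the slides; the paper encodes EOF on a separate delimiter channel, here it is an all-ones placeholder vector), copy-task data can be generated like this:

```python
import random

def make_copy_example(k=8, max_len=10):
    """Generate one copy-task example: a random sequence of k-bit
    vectors plus an EOF marker; the target is the sequence itself."""
    length = random.randint(1, max_len)
    seq = [[random.randint(0, 1) for _ in range(k)] for _ in range(length)]
    eof = [1] * k  # placeholder EOF marker (assumption, not the paper's encoding)
    inputs = seq + [eof]
    target = seq  # output equals the input, without EOF
    return inputs, target
```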

3 of 18

How does LSTM perform?

  • Badly
  • Using a 3-layer LSTM with 128 hidden units per layer
  • 8-bit random vectors, with random sequence length from 1 to 10
  • Loss function: the cross entropy between the input (target) and the output
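The per-bit cross-entropy loss above can be sketched as follows (a minimal illustration; the function name is ours, not from the slides):

```python
import math

def copy_task_loss(target_bits, output_probs):
    """Binary cross entropy summed over all bits of the sequence.

    target_bits: list of 0/1 vectors (the input sequence, which is the target).
    output_probs: matching list of predicted per-bit probabilities.
    """
    loss = 0.0
    for t_vec, p_vec in zip(target_bits, output_probs):
        for t, p in zip(t_vec, p_vec):
            loss -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return loss
```

Perfect predictions drive this loss toward zero, which is what the NTM achieves on the copy task (slide 17).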

4 of 18

Why is NTM stronger than LSTM?

  • Such tasks (e.g. copying/reversing sequences) can be decomposed into two parts:
    • 1. operations on the input sequence (e.g. write/erase/shift/sharpen ...)
    • 2. memorizing the whole input sequence

  • A standard LSTM (RNN) has to do both components at the same time:
    • Store the input sequence.
      • LSTM has limited capacity to memorize long input sequences.
    • Memorize the operations.
      • Some operations (e.g. shift/sharpen) are hard to implement in an LSTM.

5 of 18

Advantages of NTM over LSTM

  • In NTM, an external memory is introduced to store the input sequence.
    • This releases the LSTM from storing the long input sequence itself.
  • Dedicated operations (e.g. shift/sharpen ...) are designed to interact with the external memory.
  • The LSTM controller in the NTM only needs to memorize how to operate on the stored input sequence.

6 of 18

Pipeline

7 of 18

Pipeline

8 of 18

Addressing

9 of 18

Content Addressing
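The content-addressing equation was not preserved in this export; reconstructed from the original NTM paper's notation: the controller emits a key $\mathbf{k}_t$ and key strength $\beta_t$, and each memory row $\mathbf{M}_t(i)$ is scored by cosine similarity, then normalized by a softmax:

```latex
w^c_t(i) = \frac{\exp\!\big(\beta_t \, K[\mathbf{k}_t, \mathbf{M}_t(i)]\big)}
                {\sum_j \exp\!\big(\beta_t \, K[\mathbf{k}_t, \mathbf{M}_t(j)]\big)},
\qquad
K[\mathbf{u},\mathbf{v}] = \frac{\mathbf{u}\cdot\mathbf{v}}{\lVert\mathbf{u}\rVert\,\lVert\mathbf{v}\rVert}
```

A larger $\beta_t$ makes the focus sharper around the best-matching memory row.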

10 of 18

Gate Interpolation
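The interpolation equation (lost in this export; reconstructed from the NTM paper) blends the content-based weights with the previous step's weights via a scalar gate $g_t \in (0,1)$:

```latex
\mathbf{w}^g_t = g_t \, \mathbf{w}^c_t + (1 - g_t)\, \mathbf{w}_{t-1}
```

With $g_t = 0$ the content addressing is ignored entirely and the head keeps its previous location.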

11 of 18

Convolutional Shift
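The shift equation (lost in this export; reconstructed from the NTM paper) rotates the gated weights by a circular convolution with a shift distribution $\mathbf{s}_t$ over the $N$ memory locations:

```latex
\tilde{w}_t(i) = \sum_{j=0}^{N-1} w^g_t(j)\, s_t\big((i - j) \bmod N\big)
```

This is what lets the head move to adjacent memory locations, e.g. advancing one slot per time step during copying.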

12 of 18

Sharpening
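The sharpening equation (lost in this export; reconstructed from the NTM paper) counteracts the blurring introduced by the convolutional shift, using an exponent $\gamma_t \geq 1$:

```latex
w_t(i) = \frac{\tilde{w}_t(i)^{\gamma_t}}{\sum_j \tilde{w}_t(j)^{\gamma_t}}
```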

13 of 18

Pipeline
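The whole addressing pipeline (content addressing, gate interpolation, convolutional shift, sharpening) can be sketched in NumPy as follows; this is our own illustration following the NTM paper's equations, and all names are ours:

```python
import numpy as np

def ntm_address(memory, w_prev, key, beta, g, shift, gamma):
    """One NTM addressing step: content -> interpolate -> shift -> sharpen.

    memory: (N, M) array of N memory rows; w_prev: (N,) previous weights;
    key: (M,) lookup key; beta: key strength; g: interpolation gate in (0, 1);
    shift: (N,) shift distribution; gamma: sharpening exponent (>= 1).
    """
    # 1. Content addressing: softmax over scaled cosine similarities.
    sim = memory @ key / (
        np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    w_c = np.exp(beta * sim)
    w_c /= w_c.sum()
    # 2. Gate interpolation with the previous weights.
    w_g = g * w_c + (1 - g) * w_prev
    # 3. Circular convolution with the shift distribution.
    n = len(w_g)
    w_s = np.array([sum(w_g[j] * shift[(i - j) % n] for j in range(n))
                    for i in range(n)])
    # 4. Sharpening: renormalized power.
    w = w_s ** gamma
    return w / w.sum()
```

Each step preserves the property that the weights form a distribution over the N memory locations.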

14 of 18

Read
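The read equation (lost in this export; reconstructed from the NTM paper) returns a convex combination of the memory rows under the head's weights:

```latex
\mathbf{r}_t = \sum_i w_t(i)\, \mathbf{M}_t(i)
```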

15 of 18

Write (Erase/Add)
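The write equations (lost in this export; reconstructed from the NTM paper) apply an erase vector $\mathbf{e}_t \in (0,1)^M$ followed by an add vector $\mathbf{a}_t$, both scaled by the head's weight at each location:

```latex
\tilde{\mathbf{M}}_t(i) = \mathbf{M}_{t-1}(i)\big[\mathbf{1} - w_t(i)\,\mathbf{e}_t\big],
\qquad
\mathbf{M}_t(i) = \tilde{\mathbf{M}}_t(i) + w_t(i)\,\mathbf{a}_t
```

Erase happens before add, so a fully weighted head with $\mathbf{e}_t = \mathbf{1}$ can overwrite a row completely.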

16 of 18

Results of NTM

  • Parameters
    • Controller: 1-layer RNN with 64/128 hidden units
    • Memory size: 20×8 / 128×20
    • 1 read head & 1 write head
    • 8-bit vectors & sequence length 1–10 / 1–15
  • NTM can learn the “algorithm” of copying
    • First stage: read the input and write it to the external memory sequentially
    • Second stage: output the stored information from the external memory sequentially
  • It learns to copy the sequence, no matter what the sequence is.

[Figure: read/write head location w over time during the copy task]

17 of 18

Results of NTM

  • Very effective and accurate on the copy task
    • Loss is nearly zero
    • Converged in ~4000 batches (10 sequences per batch)

[Figure: copy-task learning curves, LSTM vs. NTM]

18 of 18