1 of 75

#Tensorflow @martin_gorner

deep Science !

deep Code ...

>TensorFlow, deep learning and \

recurrent neural networks

without a PhD_


2 of 75

The superpower: batch normalisation


3 of 75

Data “whitening”

Data: large values, different scales, skewed, correlated


4 of 75

Data “whitening”

Modified data: centered around zero, rescaled...

Subtract average

Divide by std dev


5 of 75

Data “whitening”

Modified data: … and decorrelated by projecting onto the new axes (A+B)/2 and A-B (that was almost a Principal Component Analysis).

6 of 75

Data “whitening”

[new A, new B] = [A, B] x W + b

with, for example, W = [[0.05, 0.12], [0.61, -1.23]] (scale & rotate) and b = [-1.45, 0.12] (shift).

W ? b ? A network layer can do this !
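In numpy terms this is a single affine transform, which is exactly what a dense layer computes. A minimal sketch (the raw data is made up; the matrix and offset reuse the example numbers above):

import numpy as np

# raw 2-column data (A, B): large values, different scales (made-up example)
data = np.random.randn(1000, 2) * [10.0, 3.0] + [50.0, -7.0]

W = np.array([[0.05, 0.12],
              [0.61, -1.23]])   # scale & rotate
b = np.array([-1.45, 0.12])     # shift

whitened = data @ W + b         # [new A, new B] = [A, B] x W + b, row by row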

7 of 75

Fully connected network

Diagram: fully connected network for MNIST, input 784 pixels, layers of 200, 100, 60, 30 and 10 neurons, softmax readout over the digits 0 … 9. With sigmoid activations the first layer looks OK, the deeper layers less and less so (OK ? … OK ???).

8 of 75

Without batch normalisation

Figure: the sigmoid activation function, with "my distribution of inputs" sitting in its saturated tails. Boo-hoo.

9 of 75

Batch normalisation

Center and re-scale logits before the activation function (decorrelate ? no, too complex):

  • compute the average and variance on the mini-batch
  • add a learnable scale α and offset β for each logit so as to restore expressiveness

"logit" = weighted sum + bias; one α and one β per neuron.

Try α = stdev(x) and β = avg(x) and you have BN(x) = x
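A standard way to write the transform (not verbatim from the slide): for a logit x, with mini-batch statistics

$$\mu = \operatorname{avg}_{batch}(x),\quad \sigma^2 = \operatorname{var}_{batch}(x),\qquad \mathrm{BN}(x) = \alpha\,\frac{x-\mu}{\sqrt{\sigma^2+\epsilon}} + \beta$$

so α = stdev(x) and β = avg(x) indeed give BN(x) ≈ x (ε is a small constant for numerical stability).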


10 of 75

Batch normalisation

The mini-batch mean and variance depend on the weights, biases and images, and there is only one set of weights and biases per mini-batch, so BN is differentiable with respect to the weights, the biases, α and β. It can therefore be used as a layer in the network and gradient calculations will still work.

Per neuron: x = weighted sum + bias -> batch-norm (α, β) -> activation fn

11 of 75

With batch normalisation (sigmoid)

Figure: sigmoid activation with batch norm; the distribution of neuron outputs now stays in the sigmoid's useful range.

12 of 75

With batch normalisation (RELU)

Figure: the RELU activation function and "my distribution of inputs" with batch normalisation.

13 of 75

Batch normalisation done right

Per neuron: x = weighted sum (no bias) -> batch-norm (α, β) -> activation fn

Biases are no longer useful: the batch-norm offset β plays their role.
When the activation fn is RELU, the scale α is not useful either: it does not modify the output distribution.

Per neuron:      without BN    with BN
  relu           bias          β
  sigmoid        bias          α, β

+ You can go faster: use a higher learning rate
+ BN also regularises: lower or remove dropout

14 of 75

Convolutional batch normalisation

Convolutional layers (weights e.g. W1[4, 4, 3], W2[4, 4, 3]): each neuron (patch) has a value

  • per image in the batch
  • per x position
  • per y position

=> compute avg and stdev across all batchsize x width x height values.
Still, one bias, scale or offset per neuron (b1, α1, β1 for the first filter; b2, α2, β2 for the second).

15 of 75

Batch normalisation at test time

Stats on what ?

  • Last batch: no
  • all images: yes (but not practical)
  • => Exponential moving average during training
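A small sketch of the exponential moving average update, i.e. the rule that tf.train.ExponentialMovingAverage applies on the next slide (the decay value is the one used there; the numeric values are assumptions):

def ema_update(moving_value, batch_value, decay=0.9999):
    # lean heavily on the accumulated value, a little on the current mini-batch
    return decay * moving_value + (1.0 - decay) * batch_value

moving_mean, moving_variance = 0.0, 1.0   # assumed initial values
batch_mean, batch_variance = 0.2, 0.8     # assumed statistics of the current mini-batch
moving_mean = ema_update(moving_mean, batch_mean)
moving_variance = ema_update(moving_variance, batch_variance)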


16 of 75

Batch normalisation with Tensorflow

def batchnorm_layer(Ylogits, is_test, Offset, Scale, iteration, convolutional=False):
    exp_moving_avg = tf.train.ExponentialMovingAverage(0.9999, iteration)
    if convolutional:  # avg across batch, width, height
        mean, variance = tf.nn.moments(Ylogits, [0, 1, 2])
    else:
        mean, variance = tf.nn.moments(Ylogits, [0])
    update_moving_averages = exp_moving_avg.apply([mean, variance])
    m = tf.cond(is_test, lambda: exp_moving_avg.average(mean), lambda: mean)
    v = tf.cond(is_test, lambda: exp_moving_avg.average(variance), lambda: variance)
    Ybn = tf.nn.batch_normalization(Ylogits, m, v, Offset, Scale, variance_epsilon=1e-5)
    return Ybn, update_moving_averages

Define one offset and/or scale per neuron

apply activation fn on Ybn

don’t forget to execute this (sess.run)

The code is on GitHub: goo.gl/DEOe7Z
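A hypothetical usage of the layer above for one 200-neuron fully connected layer (a sketch: the placeholder and variable names below are assumptions in the deck's style, not the repo's exact code):

import tensorflow as tf

X1 = tf.placeholder(tf.float32, [None, 784])                 # assumed input
W1 = tf.Variable(tf.truncated_normal([784, 200], stddev=0.1))
Ylogits1 = tf.matmul(X1, W1)                                  # weighted sum, no bias needed with BN
offset1 = tf.Variable(tf.zeros([200]))                        # β, one per neuron
scale1 = tf.Variable(tf.ones([200]))                          # α (can be omitted with RELU)
tst = tf.placeholder(tf.bool)                                 # is_test flag
itr = tf.placeholder(tf.int32)                                # iteration counter for the moving average
Y1bn, update1 = batchnorm_layer(Ylogits1, tst, offset1, scale1, itr, convolutional=False)
Y1 = tf.nn.relu(Y1bn)                                         # apply the activation fn on Ybn
# remember to sess.run(update1) alongside the training step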


17 of 75

Demo


18 of 75

99.5%


19 of 75

More superpowers

high level API


20 of 75

Layers

import tensorflow as tf
from tensorflow.contrib import layers

# this
Y = layers.relu(X, 200)

# instead of this
W = tf.Variable(tf.zeros([784, 200]))  # (in practice, initialise weights randomly, not with zeros)
b = tf.Variable(tf.zeros([200]))
Y = tf.nn.relu(tf.matmul(X, W) + b)


21 of 75

Model function

from tensorflow.contrib import learn, layers, metrics, framework

def model_fn(X, Y_, mode):   # X, Y_: "features" and "targets"; mode: TRAIN, EVAL or INFER
    Yn = …  # model layers
    prob = tf.nn.softmax(Yn)
    digi = tf.argmax(prob, 1)
    predictions = {"probabilities": prob, "digits": digi}    # free-form
    evaluations = {'accuracy': metrics.accuracy(digi, Y_)}   # free-form
    loss = tf.nn.softmax_cross_entropy_with_logits(…)
    train = layers.optimize_loss(loss, framework.get_global_step(), 0.003, "Adam")  # 0.003: learning rate
    return learn.ModelFnOps(mode, predictions, loss, train, evaluations)


22 of 75

Estimator

estimator = learn.Estimator(model_fn=model_fn)

estimator.fit(input_fn=, steps=10000)

estimator.evaluate(input_fn=, steps=1)

# => {'accuracy': … }

estimator.predict(input_fn=)

# => {"probabilities":…, "digits":…}

# input_fn: feeds in batches of features and targets
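A minimal sketch of such an input_fn, assuming MNIST data already loaded in memory (the same pattern appears later in the deck for distributed training; names are assumptions):

def train_input_fn():
    images = tf.constant(mnist.train.images)   # features
    labels = tf.constant(mnist.train.labels)   # targets
    return tf.train.shuffle_batch([images, labels], 100,   # batch size
                                  capacity=1100, min_after_dequeue=1000,
                                  enqueue_many=True)

estimator.fit(input_fn=train_input_fn, steps=10000)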


23 of 75

Convolutional network

def conv_model(X, Y_, mode):
    XX = tf.reshape(X, [-1, 28, 28, 1])
    Y1 = layers.conv2d(XX, num_outputs=6, kernel_size=[6, 6])
    Y2 = layers.conv2d(Y1, num_outputs=12, kernel_size=[5, 5], stride=2)
    Y3 = layers.conv2d(Y2, num_outputs=24, kernel_size=[4, 4], stride=2)
    Y4 = layers.flatten(Y3)
    Y5 = layers.relu(Y4, 200)
    Ylogits = layers.linear(Y5, 10)
    prob = tf.nn.softmax(Ylogits)
    digi = tf.cast(tf.argmax(prob, 1), tf.uint8)
    predictions = {"probabilities": prob, "digits": digi}    # free-form
    evaluations = {'accuracy': metrics.accuracy(digi, Y_)}   # free-form
    loss = tf.nn.softmax_cross_entropy_with_logits(Ylogits, tf.one_hot(Y_, 10))
    train = layers.optimize_loss(loss, framework.get_global_step(), 0.003, "Adam")
    return learn.ModelFnOps(mode, predictions, loss, train, evaluations)

estimator = learn.Estimator(model_fn=conv_model)

24 of 75

Recurrent Neural Networks


25 of 75

#Tensorflow @martin_gorner

deep Science !

deep Code ...

>TensorFlow, Keras and \

recurrent neural networks

without a PhD_


bit.ly/keras-rnn-codelab


26 of 75

Neural network 101 (reminder)

Diagram: a 20x20x3 input image (1200 values) fed through fully connected layers of 200, 20 and 2 neurons.

27 of 75

Activation functions (reminder)

Diagram: a neuron computes a weighted sum of its inputs plus a bias, optionally a normalisation step, then an activation function. Typical activations: relu, sigmoid, tanh. On the last layer: softmax for classification, nothing for regression.

28 of 75

RNN

Diagram: the RNN cell. X: inputs, Y: outputs, H: internal state, N: internal size. At each step the cell combines Xt with the previous state, applies tanh to produce the new internal state H, and a softmax readout produces Yt.

29 of 75

RNN

X = Xt | Ht-1            (| = concatenation)
Ht = tanh(X.WH + bH)
Yt = softmax(Ht.W + b)
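These three lines translate almost directly into code. A minimal numpy sketch of one RNN step (sizes and initialisation are assumptions; the deck's actual models use TensorFlow cells):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

p, n, m = 8, 16, 10                    # input size, internal size N, output size (assumed)
WH = np.random.randn(p + n, n) * 0.1   # weights for the internal state
bH = np.zeros(n)
W = np.random.randn(n, m) * 0.1        # weights for the softmax readout
b = np.zeros(m)

def rnn_step(Xt, Ht_1):
    X = np.concatenate([Xt, Ht_1])     # X = Xt | Ht-1 (concatenation)
    Ht = np.tanh(X @ WH + bH)          # Ht = tanh(X.WH + bH)
    Yt = softmax(Ht @ W + b)           # Yt = softmax(Ht.W + b)
    return Yt, Ht

Yt, Ht = rnn_step(np.zeros(p), np.zeros(n))   # one step from a zero state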

30 of 75

RNN training

Diagram: the cell unrolled over time. Starting from state H-1, inputs X0 … X5 produce states H0 … H5 and outputs Y0 … Y5. The same weights and biases are shared across iterations.

31 of 75

Deep RNN

Diagram: stacked ("deep") RNN. At each time step the input Xt goes through several layers of cells; the first layer's states H0 … H5 feed the second layer, whose states H'0 … H'5 produce the outputs Y0 … Y5. L: number of layers.

32 of 75

Michel C. was born in Paris, France. He is married and has three children. He received a M.S. in neurosciences from the University Pierre & Marie Curie and the Ecole Normale Supérieure in 1987, and then spent most of his career in Switzerland, at the Ecole Polytechnique de Lausanne. He specialized in child and adolescent psychiatry and his first field of research was severe mood disorders in adolescents, topic of his PhD in neurosciences (2002). His mother tongue is ? ? ? ? ?

Long term dependencies: a problem

Short context: looking only at the last few words ("… His mother tongue is"), the next word could be English, German, Russian, French …
Long context: the clue ("born in Paris, France") is hundreds of words back, so the state Hn has to carry it across the whole sequence, and that is a problem for a simple RNN cell.

Diagram: the cell unrolled over "Michel C. was born in …", eventually having to output "French".

33 of 75

RNN cell types

Diagram: three cell types, each taking Xt and the previous state Ht-1 and producing Ht and Yt:

  • Simple RNN cell: a single tanh layer
  • GRU cell ("Gated Recurrent Unit"): two sigmoid gates combined with a tanh layer
  • LSTM cell ("Long Short Term Memory"): three sigmoid gates, tanh layers and an additional cell state Ct

34 of 75

LSTM

LSTM = Long Short Term Memory

Diagram: the LSTM cell takes Xt, Ht-1 and the cell state Ct-1; sigmoid and tanh neural-net layers plus element-wise operations produce Ct, Ht and Yt.

X = Xt | Ht-1            (| = concatenation)
f = σ(X.Wf + bf)
u = σ(X.Wu + bu)
r = σ(X.Wr + br)
X = tanh(X.Wc + bc)
Ct = f * Ct-1 + u * X
Ht = r * tanh(Ct)
Yt = softmax(Ht.W + b)

35 of 75

LSTM

Vector sizes (p: input size, n: internal size, m: output size):

concatenate :  X = Xt | Ht-1              p+n
forget gate :  f = σ(X.Wf + bf)           n
update gate :  u = σ(X.Wu + bu)           n
result gate :  r = σ(X.Wr + br)           n
input :        X = tanh(X.Wc + bc)        n
new C :        Ct = f * Ct-1 + u * X      n
new H :        Ht = r * tanh(Ct)          n
output :       Yt = softmax(Ht.W + b)     m

36 of 75

Gru !


37 of 75

GRU

GRU = Gated Recurrent Unit: 2 gates instead of 3 => cheaper

Vector sizes (p: input size, n: internal size, m: output size):

X = Xt | Ht-1               p+n
z = σ(X.Wz + bz)            n
r = σ(X.Wr + br)            n
X = Xt | r * Ht-1           p+n
X = tanh(X.Wc + bc)         n
Ht = (1-z) * Ht-1 + z * X   n
Yt = softmax(Ht.W + b)      m
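As a cross-check of the equations, a minimal numpy sketch of one GRU step (sizes and initialisation are assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

p, n, m = 8, 16, 10                         # input, internal and output sizes (assumed)
Wz = np.random.randn(p + n, n) * 0.1
Wr = np.random.randn(p + n, n) * 0.1
Wc = np.random.randn(p + n, n) * 0.1
bz, br, bc = np.zeros(n), np.zeros(n), np.zeros(n)
W, b = np.random.randn(n, m) * 0.1, np.zeros(m)

def gru_step(Xt, Ht_1):
    X = np.concatenate([Xt, Ht_1])          # X = Xt | Ht-1
    z = sigmoid(X @ Wz + bz)                # update gate
    r = sigmoid(X @ Wr + br)                # reset gate
    Xc = np.concatenate([Xt, r * Ht_1])     # X = Xt | r * Ht-1
    Xc = np.tanh(Xc @ Wc + bc)              # candidate state
    Ht = (1 - z) * Ht_1 + z * Xc            # Ht = (1-z) * Ht-1 + z * X
    Yt = softmax(Ht @ W + b)                # Yt = softmax(Ht.W + b)
    return Yt, Ht

Yt, Ht = gru_step(np.zeros(p), np.zeros(n))   # one step from a zero state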

38 of 75

Language model in Tensorflow

Diagram: character-based model, characters one-hot encoded. The input sequence "S t _ J o h" is trained to predict the same text shifted by one character, "t _ J o h n"; the internal state starts at 0 and ends at H5.

39 of 75

Language model in Tensorflow

Diagram: NLAYERS layers of GRU cells unrolled over the sequence; Hin is the input state, H the output state, and Hr collects the top layer's output at every step.

cells = [tf.nn.rnn_cell.GRUCell(CELLSIZE) for i in range(NLAYERS)]
mcell = tf.nn.rnn_cell.MultiRNNCell(cells, state_is_tuple=False)
Hr, H = tf.nn.dynamic_rnn(mcell, X, initial_state=Hin)   # defines weights and biases internally

ALPHASIZE = 98
CELLSIZE = 512
NLAYERS = 3
SEQLEN = 30

40 of 75

Softmax readout layer

Tip: handle sequence and batch elements the same.

Hf = tf.reshape(Hr, [-1, CELLSIZE])       # Hr: [ BATCHSIZE, SEQLEN, CELLSIZE ] -> Hf: [ BATCHSIZE x SEQLEN, CELLSIZE ]
Ylogits = tf.layers.dense(Hf, ALPHASIZE)  # [ BATCHSIZE x SEQLEN, ALPHASIZE ]
Y = tf.nn.softmax(Ylogits)                # [ BATCHSIZE x SEQLEN, ALPHASIZE ]
loss = tf.nn.softmax_cross_entropy_with_logits(logits=Ylogits, labels=Y_)

ALPHASIZE = 98
CELLSIZE = 512
NLAYERS = 3
SEQLEN = 30

41 of 75

Inputs and outputs

Diagram: the training inputs are the characters "S t _ A n d r e" and the targets are the same text shifted by one character, "t _ A n d r e w".

ALPHASIZE = 98
CELLSIZE = 512
NLAYERS = 3
SEQLEN = 30

Inputs:  [ BATCHSIZE, SEQLEN ]
Outputs: [ BATCHSIZE, SEQLEN, ALPHASIZE ]
State H: [ BATCHSIZE, CELLSIZE x NLAYERS ]

42 of 75

Placeholders, and the rest...

ALPHASIZE = 98
CELLSIZE = 512
NLAYERS = 3
SEQLEN = 30

Xd = tf.placeholder(tf.uint8, [None, None])    # [ BATCHSIZE, SEQLEN ]
X = tf.one_hot(Xd, ALPHASIZE, 1.0, 0.0)        # [ BATCHSIZE, SEQLEN, ALPHASIZE ]
Yd_ = tf.placeholder(tf.uint8, [None, None])   # [ BATCHSIZE, SEQLEN ]
Y_ = tf.one_hot(Yd_, ALPHASIZE, 1.0, 0.0)      # [ BATCHSIZE, SEQLEN, ALPHASIZE ]
Hin = tf.placeholder(tf.float32, [None, CELLSIZE*NLAYERS])   # [ BATCHSIZE, CELLSIZE x NLAYERS ]

# Y, loss, Hout = my_model(X, Y_, Hin)          # Y: [ BATCHSIZE x SEQLEN, ALPHASIZE ]

predictions = tf.argmax(Y, 1)                             # [ BATCHSIZE x SEQLEN ]
predictions = tf.reshape(predictions, [batchsize, -1])    # [ BATCHSIZE, SEQLEN ]
train_step = tf.train.AdamOptimizer(1e-3).minimize(loss)

43 of 75

Bitchin’ batchin’

Diagram: the text is cut into BATCHSIZE parallel streams ("The quic | k brown | fox jump…", "seventh | heaven o | f typogr…", "Mr. Herm | ann Zapf | was the…") so that batch 2 starts exactly where batch 1 ended. The output states Ht of one batch can then be fed back as the input states of the next batch.

for x, y_ in utils.rnn_minibatch_sequencer(codetext, BATCHSIZE, SEQLEN, nb_epochs=10):
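A rough sketch of what such a sequencer does (simplified; the real utils.rnn_minibatch_sequencer in the GitHub repo also handles epochs and leftover characters):

import numpy as np

def minibatch_sequencer(text, batch_size, seqlen):
    data = np.array(list(text))
    nb_batches = (len(data) - 1) // (batch_size * seqlen)
    rounded = nb_batches * batch_size * seqlen
    xdata = data[:rounded].reshape([batch_size, nb_batches * seqlen])
    ydata = data[1:rounded + 1].reshape([batch_size, nb_batches * seqlen])
    for b in range(nb_batches):
        x = xdata[:, b * seqlen:(b + 1) * seqlen]   # [BATCHSIZE, SEQLEN] inputs
        y = ydata[:, b * seqlen:(b + 1) * seqlen]   # same text shifted by one character
        yield x, y                                  # batch b+1 continues where batch b stopped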

44 of 75

Language model in Tensorflow

ALPHASIZE = 98
CELLSIZE = 512
NLAYERS = 3
SEQLEN = 30

Xd = tf.placeholder(tf.uint8, [None, None])
X = tf.one_hot(Xd, ALPHASIZE, 1.0, 0.0)
Yd_ = tf.placeholder(tf.uint8, [None, None])
Y_ = tf.one_hot(Yd_, ALPHASIZE, 1.0, 0.0)
Hin = tf.placeholder(tf.float32, [None, CELLSIZE*NLAYERS])

# the model
cells = [tf.nn.rnn_cell.GRUCell(CELLSIZE) for i in range(NLAYERS)]
mcell = tf.nn.rnn_cell.MultiRNNCell(cells, state_is_tuple=False)
Hr, H = tf.nn.dynamic_rnn(mcell, X, initial_state=Hin)

# softmax output layer
Hf = tf.reshape(Hr, [-1, CELLSIZE])
Ylogits = layers.linear(Hf, ALPHASIZE)
Y = tf.nn.softmax(Ylogits)
Yp = tf.argmax(Y, 1)
Yp = tf.reshape(Yp, [batchsize, -1])

# loss and training step (optimizer)
loss = tf.nn.softmax_cross_entropy_with_logits(logits=Ylogits, labels=Y_)
train_step = tf.train.AdamOptimizer(1e-3).minimize(loss)

# training loop
inH = np.zeros([BATCHSIZE, CELLSIZE*NLAYERS])   # initial zero state
for x, y_ in utils.rnn_minibatch_sequencer(codetext, BATCHSIZE, SEQLEN, nb_epochs=30):
    dic = {Xd: x, Yd_: y_, Hin: inH}
    _, y, outH = sess.run([train_step, Yp, H], feed_dict=dic)
    inH = outH   # carry the state over to the next minibatch

45 of 75

ee o no nonnaoter s ee seih iae r t i r io i ro s sierota tsohoreroneo rsa esia anehereeo hensh�rho etnrhhs iti saoitns t et rsearh tshseoeh ta oirhroren e eaetetnesnareeeoaraihss nshtano eter �e oooaoaeee nonn is heh easren ieson httn nihensont t e n a ooe oerhi neaeehteriseat tiet i i ntsh�orhi e ohhsiea e aht ohr er ra eeo oeeitrot hethisesaaei o saeii straieiteoeresorh e ooeri � e ninesh sort a es h rs hattnteseato sonoanr sniaase s rshninsasi na sntennn oti r etnsnrse oh n� r e tiathhnaeeano trrr hhohooon rrt eernre e rnoh

Shakespeare

0.03

epochs

C1


46 of 75

Shakespeare

II WERENI� Are I I wos the wheer boaer.� Tin thim mh cals sate bauut site tar oue tinl an bsisonetoal yer an fimireeren.��L[IO SI Hns oret bsllssts aaau ton hete me toer frurtor sheus aed trat�� A faler bis tote oadt tou than male, tel mou ce an cime. ais fauto ws cien whus yas. Ande fert te a�ut wond aal sinr be at saar

0.1

epochs

C3


47 of 75

BERENS Hall hat in she the hir meres.��Perstr in ame not of heard, me thin hild of shear and� ant on of mare. I lore wes lour.��DOCHES The chaster'd on not fenst� The laldoos more.

� [Ixeln thrish]

And tho priines sith of hamdeling the san wind

Shakespeare

0.2

epochs

C5

Stage directions ?


48 of 75

KING LEAR Alas, I am not forsworn both to bod!� And let the firm I have to'st trainoured.��KING HENRY VIII I love not my father.��PORDIA He tash you will have it.��HENRY BLUTIUS Work, thou lovest my son here, thy father's fath!��CLIOND Why, then, would say, the beasts are

Shakespeare

1

epoch

C6

Invented names !


49 of 75

Shakespeare

30

epochs

TITUS ANDRONICUS��ACT I��SCENE III An ante-chamber. The COUNT's palace.�� [Enter CLEOMENES, with the Lord SAY]��Chamberlain Let me see your worshing in my hands.

�LUCETTA I am a sign of me, and sorrow sounds it.

B10


50 of 75

Shakespeare

30

epochs

And sorrow far into the stars of men,� Without a second tears to seek the best and bed,� With a strange service, and the foul prince of Rome�� [Exeunt MARK ANTONY and LEPIDUS]�� Well said, my lord,--��MENENIUS I do not say so.� Well, I will not have no better ways;� But not a woman's misery, and yonder to her

B10


51 of 75

diassts_= =tlns==eti.s=tessn_((

sie_s_nts_ens= dondtnenroe dnar taonte srst anttntoilonttiteaen

detrtstinsenoaolsesnesoairt(

arssserleeeerltrdlesssoeeslslrlslie(e

drnnaleeretteaelreesioe niennoarens dssnstssaorns sreeoeslrteasntotnnai(ar dsopelntederlalesdanserl

lts(sitae(e)

Python code

0.03

epochs

A1


52 of 75

with self.essors_sigeater(output_dits_allss,

self._train.

for sampated to than ubtexsormations.

expeddions = np.randim(natched_collection, ranger, mang_ops, samplering)

def assestErrorume_gens(assignex) as and(sampled_veases):

eved.

Python code

0.1

epochs

A2

Python

keywords


53 of 75

def testGiddenSelfBeShareMecress(self):

with self.test_session() as sess:

tat = tf.contrib.matrix.cast_column_variable([1, 1], [0, 1, 1], [1, 7]],

[[1, 1, 1]].file(file, line_state_will_file))

with self.test_session():

self.assertAllEqual(1, l.ex6)

self.assertEqual(output_graph_def is_output_tensors_op(

tf.pro_context_name.sqrt(sess)

def test_shape(self):

res = values=value_rns[0].eval())

def tempDimpleSeriesGredicsIothasedWouthAverageData(self):

self._testDirector(self):

self._test_inv3_size = 5

with tf.train.ConvolutioBailLors_startswith("save_dir_context.PutIsprint().eval())

return tf.contrib.learn.RUCISLCCS:

# Check the orfloating so that the nimesting object mumputable othersifier.

# dense_keys.tokens_prefix/statch_size of the input1 tensors.

@property

Python code

0.4

epochs

A3

Wrong ([]) nesting

Correct use of colons:

Hallucinated function names


54 of 75

# Copyright 2015 The TensorFlow Authors. All Rights Reserved.

#

# Licensed under the Apache License, Version 2.0 (the "License");

# you may not use this file except in compliance with the License.

# You may obtain a copy of the License at

#

# http://www.apache.org/licenses/LICENSE-2.0

#

# Unless required by applicable law or agreed to in [0.1, 2.0, 3.0]]

def __init__(self, expected):

return np.array([[0, 0, 0], [0, 0, 0]])

self.assertAllEqual(tf.placeholder(tf.float32, shape=(3, 3)),(shape, prior.pack(),

tf.float32))

for keys in tensor_list:

return np.array([[0, 0, 0]]).astype(np.float32)

# Check that we have both scalar tensor for being invalid to a vector of 1 indicating

# the total loss of the same shape as the shape of the tensor.

sharded_weights = [[0.0, 1.0]]

# Create the string op to apply gradient terms that also batch.

# The original any operation as a code when we should alw infer to the session case.

Python code

12

epochs

B10

Correct triple ([]) nesting

Recites Apache license

Tensorflow tips!


55 of 75

...and more


56 of 75

Tensorflow: save, restore

saver = tf.train.Saver(keep_checkpoint_every_n_hours=0.1, max_to_keep=5)

with tf.Session() as sess:
    # ... training loop ...
    saver.save(sess, 'file_', global_step=iter)
# => saves the variables in file_200 and the graph in file_200.meta

with tf.Session() as sess:
    resto = tf.train.import_meta_graph('file_200.meta')
    resto.restore(sess, 'file_200')
# => restores the graph and the variable values

Must name variables explicitly !!!

# when saving
X = tf.placeholder(tf.uint8, name='X')
Y = tf.nn.softmax(Ylogits, name='Y')

# when using the restored graph
y, h = sess.run(['Y:0', 'H:0'], feed_dict={'X:0': x})

57 of 75

Shakespeare generation

with tf.Session() as sess:
    resto = tf.train.import_meta_graph('shake_200.meta')
    resto.restore(sess, 'shake_200')

    # initial values
    x = np.array([[0]])  # [BATCHSIZE, SEQLEN] with BATCHSIZE=1 and SEQLEN=1
    h = np.zeros([1, INTERNALSIZE * NLAYERS], dtype=np.float32)

    for i in range(100000):
        dic = {'X:0': x, 'Hin:0': h, 'batchsize:0': 1}
        y, h = sess.run(['Y:0', 'H:0'], feed_dict=dic)
        c = my_txtutils.sample_from_probabilities(y, topn=5)
        x = np.array([[c]])  # shape [BATCHSIZE, SEQLEN] with BATCHSIZE=1 and SEQLEN=1
        print(chr(my_txtutils.convert_to_ascii(c)), end="")

One char at a time: each predicted character is fed back in as the next input.
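A rough sketch of the top-n sampling that my_txtutils.sample_from_probabilities performs (a simplified reconstruction, not the repo's exact code):

import numpy as np

def sample_from_probabilities(probabilities, topn=5):
    p = np.squeeze(probabilities).copy()   # [ALPHASIZE] probabilities for the next character
    p[np.argsort(p)[:-topn]] = 0.0         # keep only the topn most likely characters
    p = p / np.sum(p)                      # re-normalise
    return np.random.choice(len(p), p=p)   # sample one character code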

58 of 75

Tensorboard

summary_writer = tf.train.SummaryWriter("log/train_" + time)
loss_summary = tf.scalar_summary("batch_loss", loss)
summaries = tf.merge_all_summaries()   # gather all defined summaries

# in training loop:
smm = sess.run(summaries, feed_dict=dic)
summary_writer.add_summary(smm, iteration)

Tip: use time in logdir name

Tip: use a second SummaryWriter for validation results


59 of 75

RNN shapes

Diagram (recap): character-based model, characters one-hot encoded; the input "S t _ J o h" predicts "t _ J o h n", and H5 is the state after the sequence.

60 of 75

RNN shapes

Text classification: the words "The USA and China have agreed …" are fed one per time step, encoded as vectors ("embeddings"), and the final state is used to classify the text (here: "geopolitics").

embeddings = tf.Variable(tf.random_uniform([vocab_size, embed_size]))

X = tf.nn.embedding_lookup(embeddings, train_inputs)

Tensorflow sample: goo.gl/m41mNp

Or constant => see Word2Vec
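For the shapes involved, a hypothetical illustration (the vocab_size and embed_size values are assumptions):

import tensorflow as tf

vocab_size, embed_size = 50000, 128                     # assumed values
train_inputs = tf.placeholder(tf.int32, [None, None])   # [BATCHSIZE, SEQLEN] of word ids
embeddings = tf.Variable(tf.random_uniform([vocab_size, embed_size]))
X = tf.nn.embedding_lookup(embeddings, train_inputs)    # [BATCHSIZE, SEQLEN, embed_size]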


61 of 75

Bitchin’ batchin’

Example batch (sequences of different lengths, with their true lengths):

  China and the USA have agreed to a new round of talks   12
  The quick brown fox jumps over the lazy dog .           10
  Boys will be boys .                                       5
  Tom , get your coat . We are going out .                 11
  Math rules the world . Men rule math .                    9

Hr, H = tf.nn.dynamic_rnn(mcell, X, initial_state=Hin, sequence_length=slen)

With sequence_length, dynamic_rnn stops updating the state at each sequence's true length, so the final state Hn used for classification (e.g. "geopolitics") corresponds to the last real word of each sentence.

62 of 75

RNN shapes

Text translation (words encoded as vectors): the encoder reads "The red cat ate the mouse"; the decoder then generates "Le chat rouge a mangé la souris", each produced word being fed back as the next decoder input.

Tensorflow sample: goo.gl/KyKLDv

tf.nn.sampled_softmax_loss(…): a full softmax over the whole vocabulary is slow; the sampled softmax loss is fast.

63 of 75

RNN shapes

Image captioning (simplified): the image is encoded as a vector (for example the output of a convolutional network or auto-encoder) and fed into the RNN, which generates "A man on a beach flying a kite", each produced word being fed back as the next input.

Google's neural net for image captioning: goo.gl/VgZUQZ

64 of 75

Image captioning

Google’s neural net for image captioning: goo.gl/VgZUQZ

A person riding a motorcycle on a dirt road.

A herd of elephants walking across a dry grass field.


65 of 75

Image captioning

Google’s neural net for image captioning: goo.gl/VgZUQZ

A refrigerator filled with lots of food and drinks.

A yellow school bus parked in a parking lot.


66 of 75

Cloud Machine Learning Engine


67 of 75

Data-parallel distributed training

Diagram: parameter servers hold the weights; model replicas each train on a slice of the data and send their updates W' = W + ∆W back asynchronously. The asynchronous updates add a little noise.

68 of 75

TF high level API

from tensorflow.contrib import learn

def model_fn(X, Y_, mode):   # X, Y_: "features" and "targets"
    Yn = …  # model layers
    predictions = {"probabilities": …, "digits": …}    # free-form
    evaluations = {'accuracy': metrics.accuracy(…)}    # free-form
    loss = …
    train = layers.optimize_loss(loss, …)
    return learn.ModelFnOps(mode, predictions, loss, train, evaluations)

69 of 75

Estimator, Experiment, learn_runner

from tensorflow.contrib.learn.python.learn.utils import saved_model_export_utils

def experiment_fn(job_dir):
    return learn.Experiment(
        estimator=learn.Estimator(model_fn, model_dir=job_dir,
                                  config=learn.RunConfig(save_checkpoints_secs=None,
                                                         save_checkpoints_steps=1000)),
        train_input_fn=…,  # data feed
        eval_input_fn=…,   # data feed
        train_steps=10000,
        eval_steps=1,
        export_strategies=make_export_strategy(export_input_fn=serving_input_fn))

def main(argv=None):
    job_dir = …  # parse argument --job-dir
    learn_runner.run(experiment_fn, job_dir)

if __name__ == '__main__': main()

Free stuff !!! Tensorboard graphs, resume on fail, parallel data feeds, serving model export, distributed training.

trainingInput:
  scaleTier: STANDARD_1

70 of 75

Data queues for distributed training

# dummy implementation for data that fits in memory
def train_data_input_fn(mnist):
    images = tf.constant(mnist.train.images)
    labels = tf.constant(mnist.train.labels)
    return tf.train.shuffle_batch([images, labels],
                                  100,  # batch size
                                  1100, 1000, enqueue_many=True)

# dummy implementation for data that fits in memory
def eval_data_input_fn(mnist):
    return tf.constant(mnist.test.images), tf.constant(mnist.test.labels)

shuffle_batch inserts queue nodes into the TF graph. For practical data queuing use the TF Records format.

trainingInput:
  scaleTier: STANDARD_1

71 of 75

Serving input function

# Online predictions on Cloud ML Engine
def serving_input_fn():
    # Placeholder for data deserialised from JSON (a batch of images, for MNIST)
    inputs = {'A': tf.placeholder(tf.uint8, [None, 28, 28])}
    # Transform the data as needed
    features = [tf.cast(inputs['A'], tf.float32)]
    return input_fn_utils.InputFnOps(features, None, inputs)

trainingInput:
  scaleTier: STANDARD_1

72 of 75

Run it

gcloud ml-engine jobs submit training job22 \
    --job-dir=gs://mybucket/job22 \
    --package-path=trainer \
    --module-name=trainer.task \
    --config=config.yaml \
    -- \
    --<custom model arguments here>

The --job-dir receives the model checkpoints and the tensorboard summaries.

Deploy the trained model to prod = click click click (autoscaled serving)

gcloud ml-engine predict \
    --model <model_name> \
    --json-instances mydigits.json

trainingInput:
  scaleTier: STANDARD_1

73 of 75

Demo: aucnet

Retrain Inception yourself: goo.gl/Z9eNek


74 of 75

Have fun !

Cloud ML Engine: your TensorFlow models trained in Google's cloud.

Pre-trained models:
  • Cloud Vision API
  • Cloud Speech API
  • Google Translate API
  • Natural Language API
  • Video Intelligence API
  • Cloud Jobs API (private beta)
  • Cloud AutoML Vision (alpha): just bring your data
  • Cloud TPU (beta): ML supercomputing

That's all folks...

Martin Görner
Google Developer relations
@martin_gorner

75 of 75

Tensorflow and deep learning without a PhD

@martin_gorner