Danbury AI

@AndrewJRibeiro | AndrewRib.com

Art has always been cherished as the most expressive and most human of productions. The idea that a computer, a logical machine, can create such quintessentially human objects strikes some as preposterous. As anyone who has engaged in the artistic process will tell you, much of art is driven by emotion, not logical rules. In this talk we will trace the connectionist history leading to convolutional networks and their application in style transfer. I hope the topics herein demonstrate that machine learning is a dramatic departure from rule-based computing and that it does mimic intelligent behavior.


Overview

  • Machine Learning Background
    • Origin of Neural Networks
    • Neural Network Basics
    • Convolutions
    • Convolutional Neural Networks ( CNN )
    • Very Deep Convolutional Networks ( VGG )
  • Style Transfer Predecessors
    • Non-Photorealistic Rendering
    • Texture Transfer
  • A Neural Algorithm of Artistic Style
    • Introduction
    • The Underlying Loss Function
    • Algorithm Overview
    • Methods
    • Results
  • Implementation
    • Direct TensorFlow implementation
    • Magenta: Multistyle Pastiche Generator


Origin of Neural Networks

The Connectionist Timeline

  • 1943: Threshold Logic ( McCulloch and Pitts )
  • 1954: Hebbian Networks ( Wesley A. Clark )
  • 1958: Perceptrons ( Frank Rosenblatt )
  • 1969: Perceptrons critique ( Minsky and Papert ) helps trigger the AI winter
  • 1974: Multi-Layer Perceptrons and Backpropagation ( Werbos )
  • 1990: Convolutional Neural Networks ( LeCun's first runaway success )
  • 1997: Long Short-Term Memory Networks ( Hochreiter & Schmidhuber )
  • 2014: Generative Adversarial Networks ( Goodfellow et al. )

“Either the universe is composable or God exists.”

- As I heard Yann LeCun paraphrase it

*An incomplete history


[Images: the Mark 1 Perceptron and Frank Rosenblatt; the biological neuron as inspiration for the artificial one; harbingers of the AI Winter]


Neural Network Basics

  • Multi-Layer Perceptrons are universal function approximators ( the universal approximation theorem )
  • Feed-forward: processing flows in one direction (input -> hidden layers -> output)
  • Inputs/Features: the selected properties of a problem that carry enough information to produce separable classes.
  • Hidden Layer: neurons that take a weighted sum of their inputs and produce an activation based on some threshold.
  • Output: interpreted in most cases as a classification label.
  • Objective Function: usually interpreted as a cost function, used to evaluate how well our network has learned from the data.
  • Learning: finding the edge weights of a network such that the activations produce outputs in line with observed cases, i.e. an optimization of the objective function.
    • Typically done with gradient descent ( a minimal sketch follows ).
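To make these pieces concrete, here is a minimal numpy sketch ( illustrative only, not from the talk ): a one-hidden-layer MLP trained by gradient descent on a toy binary classification task.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                        # inputs/features
y = (X[:, 0] * X[:, 1] > 0).astype(float)[:, None]  # XOR-like labels

W1, b1 = rng.normal(size=(2, 8)) * 0.5, np.zeros(8)  # hidden layer weights
W2, b2 = rng.normal(size=(8, 1)) * 0.5, np.zeros(1)  # output layer weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for step in range(2000):
    # Feed-forward: input -> hidden layer -> output
    H = np.tanh(X @ W1 + b1)
    p = sigmoid(H @ W2 + b2)
    # Objective function: mean cross-entropy cost
    cost = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    # Backpropagation: gradients of the cost w.r.t. each weight
    dz2 = (p - y) / len(X)
    dW2, db2 = H.T @ dz2, dz2.sum(0)
    dH = dz2 @ W2.T * (1 - H**2)          # tanh'(z) = 1 - tanh(z)^2
    dW1, db1 = X.T @ dH, dH.sum(0)
    # Gradient descent: step each weight against its gradient
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
```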


Learning as an Optimization Problem

K classes - multi-class classification; essentially multinomial logistic regression.

  • Hypothesis function: a linear combination of the bias and the weights on the features, parameterized by Theta ( the feature weights ): $h_\theta(x) = \theta^\top x$, pushed through a softmax over the K classes.
  • Cost function with regularization, where $\lambda$ is the regularization coefficient ( reconstructed here in the standard form the slide's labels point to ):

$$J(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} 1\{y_i = k\} \log \frac{e^{\theta_k^\top x_i}}{\sum_{j=1}^{K} e^{\theta_j^\top x_i}} + \frac{\lambda}{2} \sum_{k,d} \theta_{k,d}^2$$

  • Learning as an optimization problem: minimize $J(\theta)$ over $\theta$.
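A minimal numpy sketch of these equations ( an illustration, not the talk's code; the name `lam` is mine, and X is assumed to carry a leading bias column of ones ):

```python
import numpy as np

def hypothesis(Theta, X):
    # Hypothesis: linear combination of bias and feature weights,
    # pushed through a softmax over the K classes.
    z = X @ Theta                          # (n, K) class scores
    z -= z.max(axis=1, keepdims=True)      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def J(Theta, X, y, lam):
    # Cross-entropy cost plus L2 regularization; lam is the
    # regularization coefficient. y holds integer class labels.
    p = hypothesis(Theta, X)
    n = len(X)
    data_term = -np.log(p[np.arange(n), y]).mean()
    reg_term = (lam / (2 * n)) * np.sum(Theta[1:] ** 2)  # don't penalize the bias row
    return data_term + reg_term
```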


Convolutions

Key Question: Why do ConvNets give us better accuracy for visual object recognition than the standard MLP?

Key Point: Convolutions, in the form we are interested in for ConvNets, compute new values of a matrix based on surrounding values. This gives us a means of producing higher abstractions of local structure.

  • The convolutional layers of a ConvNet are often interpreted as learned “filters” or kernels.
  • The first layer of a ConvNet is usually associated with edge filters.

“In order to make decisions, or make sense of things, we must reduce the input feature space to a lower dimension.”

Imagine trying to perform image recognition with the atoms of the objects as the features.
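A minimal numpy sketch of the key point above ( illustrative, not from the talk ): each output value is a weighted sum over a neighborhood of the input. Strictly, this computes cross-correlation, the unflipped form of convolution that ConvNets conventionally use.

```python
import numpy as np

def conv2d(A, K):
    # Slide a kernel K over matrix A; each output entry is computed
    # from the surrounding values of the corresponding input region.
    kh, kw = K.shape
    out = np.zeros((A.shape[0] - kh + 1, A.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(A[i:i+kh, j:j+kw] * K)  # weighted neighborhood
    return out
```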


Sobel Operator

3×3 kernel convolutions that approximate the derivatives, i.e. the changes in x and y. Let A be the source image and Gx, Gy the approximations of the x and y derivatives.
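A sketch applying the textbook Sobel kernels with scipy ( the kernels below are the standard ones; the image is a random stand-in ):

```python
import numpy as np
from scipy.signal import convolve2d

# Standard Sobel kernels: Kx approximates the horizontal derivative,
# and its transpose approximates the vertical one.
Kx = np.array([[-1, 0, 1],
               [-2, 0, 2],
               [-1, 0, 1]])
Ky = Kx.T

A = np.random.rand(64, 64)            # stand-in for a grayscale image
Gx = convolve2d(A, Kx, mode="same")   # approximate d/dx
Gy = convolve2d(A, Ky, mode="same")   # approximate d/dy
G = np.sqrt(Gx**2 + Gy**2)            # gradient magnitude: large at edges
```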


Convolutional Neural Networks ( CNN )

  • Good for processing data that has a grid-like topology. ( Where local relationships matter! )
    • Time Series
    • Visual Objects
  • Convolutional networks are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers.
  • Pooling
    • Summarizes the responses over a whole neighborhood.
    • The use of pooling can be viewed as adding an infinitely strong prior that the function the layer learns must be invariant to small translations.
    • Max pooling takes the max of a region and is commonly used for downsampling ( see the sketch below ).

A lot of stuff on this slide was ripped from the deep learning book: http://www.deeplearningbook.org/
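A minimal numpy sketch of 2×2 max pooling ( illustrative, not from the book or the talk ):

```python
import numpy as np

def max_pool_2x2(F):
    # Each output value summarizes (takes the max over) a 2x2 neighborhood,
    # downsampling the feature map by a factor of 2 in each dimension.
    h, w = F.shape[0] // 2 * 2, F.shape[1] // 2 * 2
    blocks = F[:h, :w].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

F = np.arange(16.0).reshape(4, 4)
print(max_pool_2x2(F))  # [[ 5.  7.] [13. 15.]] -- small shifts of F barely change this,
                        # which is the translation invariance the prior encodes
```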


Pooling and Feature Maps


Very Deep Convolutional Networks ( VGG )

  • VGG refers to a particular CNN configuration that performed outstandingly in the ImageNet ILSVRC-2014 competition.
  • ImageNet challenges
    • Object localization for 1000 categories.
    • Object detection for 200 fully labeled categories.
    • Object detection from video for 30 fully labeled categories.
    • Scene classification for 365 scene categories (Joint with MIT Places team) on Places2 Database http://places2.csail.mit.edu.
    • Scene parsing for 150 stuff and discrete object categories (Joint with MIT Places team).
  • Previous CNNs were nowhere near as deep.
  • In the style transfer algorithm introduced later, we use a VGG trained on object localization for 1,000 categories.
  • VGG is named after the Visual Geometry Group which produced the configuration.
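For reference, Keras ships a pretrained VGG19, so loading the network used later takes a few lines ( a sketch; the Keras weights come from ImageNet classification, and we drop the fully connected layers since style transfer only needs the convolutional features ):

```python
import tensorflow as tf

# Load the 19-layer VGG pretrained on ImageNet, without the
# fully connected classification head.
vgg = tf.keras.applications.VGG19(weights="imagenet", include_top=False)
vgg.trainable = False

for layer in vgg.layers:
    print(layer.name)  # block1_conv1 ... block5_conv4, plus the pooling layers
```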


STYLE VS. CONTENT

THE BIG IDEA

Figure out how to do this for different domains, and you will be the king/queen of ML.


Style Transfer Predecessors

  • There have been a few prior attempts, with varying success.
  • Non-Photorealistic Rendering
    • The first attempt at copying style. ( Not learned - hardcoded. )
    • Very limited: usually relegated to some form of shading or simple filters ( like Photoshop filters, which are quite linear ).
  • Texture Transfer
    • In 2001, the paper Image Quilting for Texture Synthesis and Transfer introduced the method of image quilting for texture transfer. The results, shown on the next slide, are impressive, but they come nowhere near the nonlinearity the neural model produces.
  • Previous approaches mainly relied on non-parametric techniques that directly manipulate the pixel representation of an image. In contrast, by using deep neural networks trained on object recognition, we carry out manipulations in feature spaces that explicitly represent the high-level content of an image.


Image Quilting


Image Quilting Vs. Neural Style Transfer

[Images: style, content, image quilting result, and neural style transfer result]

Note: I took a screenshot from the paper to get the style, content, and image quilting result images. They were probably scaled down in the paper, so we didn't get great results, but it's still illustrative.

content loss: 1.22706e+06

style loss: .659507

total loss: 1.89246e+06


A Neural Algorithm of Artistic Style

  • Published September 2015
  • As of December 5, 2016, it has around 120 citations in academic publications.
  • Uses a trained VGG.
  • Does dramatically better than previous attempts at style transfer.
  • Commercial products have already been created around it.


This is the composite loss function, consisting of the loss functions for style and content, which we minimise to get our wonderful style transfer result.
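Written out, the composite loss from the paper is:

$$\mathcal{L}_{total}(\vec{p}, \vec{a}, \vec{x}) = \alpha\, \mathcal{L}_{content}(\vec{p}, \vec{x}) + \beta\, \mathcal{L}_{style}(\vec{a}, \vec{x})$$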

Where:

  • Alpha: is the content weight factor
  • Beta: is the style weight factor
  • A: image style to be transferred
  • P: image content to be stylized
  • X: the output image which is initially random.

I posit: creativity is optimization with competing constraints


Vocabulary and Concepts

  • Content Representation: higher layers in a ConvNet trained on object recognition capture the high-level content of the input image in terms of objects and their arrangement, but are abstracted from the actual pixel values.
  • To capture a representation of style, a feature space originally designed to capture texture information is used.
    • Built upon the filter responses in each layer of the network.
    • Computes feature correlations between different filter responses over the feature maps.
    • End result: a stationary, multi-scale representation of the input image which captures texture but not the global arrangement.
  • Style Representation: in order to gain an understanding of style, we compute correlations between different features in different layers of the CNN ( discussed later ).
    • Style features produce texturised versions of the input image that capture its general appearance in terms of colour and localised structures.

“Extracting correlations between neurons is a biologically plausible computation that is, for example, implemented by so-called complex cells in the primary visual system”


We can visualise the information at different processing stages in the CNN by reconstructing the input image from only the network's responses in a particular layer. We reconstruct the input image from layers 'conv1_1' (a), 'conv2_1' (b), 'conv3_1' (c), 'conv4_1' (d) and 'conv5_1' (e) of the original VGG network.


Algorithm Overview

  • Images are synthesised by finding an image that simultaneously matches the content representation of the photograph and the style representation of the respective piece of art.
  • While the global arrangement of the original photograph is preserved, the colours and local structures that compose the global scenery are provided by the artwork.
  • Style representation: a multi-scale representation that includes multiple layers of the neural network
  • A network trained to perform one of the core computational tasks of biological vision automatically learns image representations that allow the separation of image content from style.


Mixing Style and Content

  • Rows: matching the style representation on increasing subsets of the CNN layers.
  • The local image structures captured by the style representation increase in size and complexity when style features from higher layers of the network are included.
    • This is due to the increasing receptive field sizes and feature complexity along the network's processing hierarchy.
  • Columns: different relative weightings between the content and style reconstruction ( content weight / style weight ).


Methods

  • Uses the feature space provided by the 16 convolutional and 5 pooling layers of the 19-layer VGG network.
    • Does not use any of the fully connected layers.
    • Max pooling is replaced by average pooling, which improves gradient flow and produces slightly more appealing results.
  • Each layer in the network defines a non-linear filter bank which increases in complexity further down the network.
  • Now let’s jump into the equations.


Formulae at a Glance


Key Definitions

  • We take an input image p and feed it forward through the trained VGG. This encodes the image in each layer of the VGG by the filter responses to p.
  • A layer with $N_l$ distinct filters has $N_l$ feature maps, each of size $M_l$, where $M_l$ is the height times the width of the feature map.
  • The responses in a layer l are stored in a matrix $F^l \in \mathbb{R}^{N_l \times M_l}$, where $F^l_{ij}$ is the activation of the i-th filter at position j in layer l.
  • To visualise the image information that is encoded at different layers of the hierarchy ( Fig 1, content reconstructions ), we perform gradient descent on a white-noise image to find another image that matches the feature responses of the original image.
  • p : the original image.
  • x : the image that is generated.
  • $P^l$ : the feature representation of p in layer l.
  • $F^l$ : the feature representation of x in layer l.
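A sketch of how these responses can be extracted with Keras ( layer names follow Keras's VGG19 naming, an assumption about tooling rather than anything from the paper ):

```python
import tensorflow as tf

# Build a model that returns the filter responses of selected VGG layers.
vgg = tf.keras.applications.VGG19(weights="imagenet", include_top=False)
layer_names = ["block1_conv1", "block4_conv2"]       # example layers
outputs = [vgg.get_layer(n).output for n in layer_names]
extractor = tf.keras.Model(inputs=vgg.input, outputs=outputs)

p = tf.random.uniform((1, 224, 224, 3))              # stand-in for a real image
features = extractor(tf.keras.applications.vgg19.preprocess_input(p * 255.0))
for name, F in zip(layer_names, features):
    # Each response (1, H, W, N_l) flattens to the F^l matrix of shape (N_l, M_l),
    # where M_l = H * W.
    Fl = tf.transpose(tf.reshape(F, (-1, F.shape[-1])))
    print(name, Fl.shape)
```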


Content Representation

  • To measure how much of the content of p is preserved in x, we define a squared-error loss that compares, layer by layer, the activations the network produces for p with those it produces for x.
  • We take the derivative of this loss with respect to the activations in layer l, from which the gradient with respect to the image x can be computed via back-propagation.
  • Thus we can change the initially random image x until it generates the same response in a certain layer of the CNN as the original image p.
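The content loss and its derivative with respect to the activations in layer l, as given in the paper:

$$\mathcal{L}_{content}(\vec{p}, \vec{x}, l) = \frac{1}{2} \sum_{i,j} \left( F^l_{ij} - P^l_{ij} \right)^2$$

$$\frac{\partial \mathcal{L}_{content}}{\partial F^l_{ij}} = \begin{cases} \left( F^l - P^l \right)_{ij} & \text{if } F^l_{ij} > 0 \\ 0 & \text{if } F^l_{ij} < 0 \end{cases}$$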


Style Representation

  • On top of the CNN responses in each layer of the network we build a style representation that computes the correlations between the different filter responses.
  • These feature correlations are given by the Gram matrix $G^l \in \mathbb{R}^{N_l \times N_l}$, where $G^l_{ij}$ is the inner product between the vectorised feature maps i and j in layer l.
  • To generate a texture that matches the style of a given image, we use gradient descent from a white noise image to find another image that matches the style representation of the original image.
    • Done by minimising the mean-squared distance between the entries of the Gram matrix ( G ) from the original image and the Gram matrix of the image to be generated.

a is the original ( style ) image and x is the image to be generated; $A^l$ and $G^l$ are their respective style representations in layer l.
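Written out, the Gram matrix and the per-layer style loss from the paper are:

$$G^l_{ij} = \sum_k F^l_{ik} F^l_{jk}$$

$$E_l = \frac{1}{4 N_l^2 M_l^2} \sum_{i,j} \left( G^l_{ij} - A^l_{ij} \right)^2$$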


Style Representation ( 2 )

  • The style loss builds on the previous slide: each layer contributes a loss $E_l$, and the derivative of $E_l$ with respect to the activations in layer l has a closed form.
  • The gradients of $E_l$ with respect to the activations in lower layers of the network can be computed with back-propagation.
  • $w_l$ are the weighting factors of each layer's contribution to the total loss.

See the paper for a discussion of which values of the weighting factors $w_l$ they experimentally found to work well for each convolutional layer.
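The total style loss is the weighted sum of the per-layer losses, $\mathcal{L}_{style}(\vec{a}, \vec{x}) = \sum_{l} w_l E_l$. A minimal numpy sketch of one layer's contribution ( illustrative, not the paper's code ):

```python
import numpy as np

def gram(F):
    # F: (N_l, M_l) matrix of vectorised filter responses -> (N_l, N_l) Gram matrix
    return F @ F.T

def layer_style_loss(F, A_feats):
    # F, A_feats: layer-l responses of the generated image and the style image.
    Nl, Ml = F.shape
    G, A = gram(F), gram(A_feats)
    return np.sum((G - A) ** 2) / (4.0 * Nl**2 * Ml**2)

# Total style loss: weighted sum over the chosen layers, e.g. w_l = 1/5 each:
# style_loss = sum(0.2 * layer_style_loss(F_l, A_l) for F_l, A_l in zip(Fs, As))
```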


Style Representation ( 3 )


Mixing Style and Content

  • To generate a mix of content and style, we jointly minimise the distance of a white-noise image from the content representation of the photograph in one layer of the network and from the style representation of the painting ( the image style source ) in a number of layers of the CNN.
  • The layer choices below were found experimentally.
  • Matched content representation layer: conv4_2.
  • Matched style representation layers: conv1_1, conv2_1, conv3_1, conv4_1, conv5_1 ( with a weighting wl = 1/5 in those layers and wl = 0 in all other layers ).
    • Different configurations give different results.

Where:

  • Alpha: is the content weight factor
  • Beta: is the style weight factor
  • A: image style to be transferred
  • P: image content to be stylized
  • X: the output image which is initially random.


Implementations

  • Magenta, a Google Brain project, is heavily engaged in exploring how machine learning can be used in art.
    • They have an environment you can install which gives you access to all the tools you need to perform style transfer and even some musical things ( like learning musical style ).
  • Products which have arisen from this algorithm:
    • DeepArt.io
    • Adobe Stylit
  • See the Resources and Sources section at the end of this presentation for a list of various implementations and variations of the algorithm.
  • There's a nice implementation in TensorFlow which I used to create some assets for this presentation; a condensed sketch of the approach follows.
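A condensed TensorFlow sketch of the whole algorithm ( my own illustration, not the implementation referenced above; the layer names follow Keras's VGG19, and Adam with pixel clipping is a choice common in open-source ports, whereas the paper uses L-BFGS ):

```python
import tensorflow as tf

CONTENT_LAYER = "block4_conv2"
STYLE_LAYERS = ["block1_conv1", "block2_conv1", "block3_conv1",
                "block4_conv1", "block5_conv1"]

vgg = tf.keras.applications.VGG19(weights="imagenet", include_top=False)
vgg.trainable = False
extractor = tf.keras.Model(
    vgg.input, [vgg.get_layer(n).output for n in STYLE_LAYERS + [CONTENT_LAYER]])

def gram(F):
    # (1, H, W, C) feature map -> (C, C) Gram matrix, normalized by H*W
    # (a common normalization choice in ports).
    F = tf.reshape(F, (-1, F.shape[-1]))
    n = tf.cast(tf.shape(F)[0], tf.float32)
    return tf.matmul(F, F, transpose_a=True) / n

def responses(img):
    feats = extractor(tf.keras.applications.vgg19.preprocess_input(img * 255.0))
    return [gram(F) for F in feats[:-1]], feats[-1]

content_img = tf.random.uniform((1, 224, 224, 3))  # stand-ins; load real images here
style_img = tf.random.uniform((1, 224, 224, 3))
style_targets, _ = responses(style_img)
_, content_target = responses(content_img)

x = tf.Variable(tf.random.uniform((1, 224, 224, 3)))  # output image: initially noise
opt = tf.keras.optimizers.Adam(learning_rate=0.02)
alpha, beta = 1e-3, 1.0                                # illustrative weights

for step in range(1000):
    with tf.GradientTape() as tape:
        style_out, content_out = responses(x)
        content_loss = 0.5 * tf.reduce_sum((content_out - content_target) ** 2)
        style_loss = tf.add_n([tf.reduce_mean((g - t) ** 2)
                               for g, t in zip(style_out, style_targets)]) / len(STYLE_LAYERS)
        loss = alpha * content_loss + beta * style_loss
    grads = tape.gradient(loss, x)
    opt.apply_gradients([(grads, x)])
    x.assign(tf.clip_by_value(x, 0.0, 1.0))  # keep pixels in a valid range
```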


RESULTS

Look at the links


Discussion & Wrap Up

Questions for you:

  • Is anything presented here relevant to your work? Do you see yourself using any of the projects or methods discussed in your current or future projects?
  • Do you use deep learning? Which NN architecture are you most interested in: CNN, RNN, MLP, GAN?
  • What frameworks and systems do you use for machine learning?

Questions for me?

Thanks for coming!

By: Andrew Ribeiro of Knowledge-Exploration Systems

@kexpsocial

https://github.com/k-exp

WWW.KEXP.IO

Andrewnetwork@gmail.com

@AndrewJRibeiro

https://github.com/Andrewnetwork

AndrewRib.com


I REALLY WANT A TALK ON

Generative Adversarial Networks

Interested in giving one? Or co-authoring one with us?


Resources and Sources

Papers

  • VGG
    • https://www.youtube.com/watch?v=j1jIoHN3m0s&index=1&t=192s&list=FLMc_J9IiEHk1rFi-Sa2IFn
  • http://image-net.org/challenges/LSVRC/2016/


BEST Machine Learning Book:

http://www.deeplearningbook.org/


Unused Slides


Google Deep Dream


The Connectionist Ideas of Leibniz

TBA


Wavenet

  • Visual art isn't the only field being investigated by machine learning researchers.