An introduction to convolutional neural networks

This week's individual assignment will make use of the interactive website https://poloclub.github.io/cnn-explainer/. This website interactively visualizes a simple, already trained convolutional neural network (CNN) and explains convolution in detail. In this assignment’s exercises, we will refer to some parts of the text below the visualization; you only have to read and understand these parts.

First of all, go to https://poloclub.github.io/cnn-explainer/. This should load a page with a pre-configured deep convolutional neural network. Now, watch the video below to get an idea of what you can do with this interactive application:

https://www.youtube.com/watch?v=HnWIHWFbuUQ&feature=youtu.be

Last week, you tinkered with a (deep) “fully-connected” neural network (FCNN). This week, we discuss a popular and more complex type of neural network: the convolutional neural network. CNNs are more complex and more difficult to understand due to the extra operations they include (such as convolution and max pooling). However, it is important to realize that their objective is the same as that of FCNNs: they take in data (inputs/independent variables) and they output either a categorical prediction (in case of a classification problem) or a continuous prediction (in case of a regression problem). CNNs are especially appropriate when there is some “structure” in the input data, for reasons we’ll discuss later. Examples of such structured input data are images (spatial structure) and speech (temporal structure).

Give an example of a classification problem and a regression problem on which CNNs would probably outperform non-convolutional models (so two separate examples).

As discussed, inputs to CNNs can be any type of (structured) data, but CNNs are most often used for images (like the example CNN on the website). Images can be seen as a rectangular grid of input variables, where each pixel represents a different variable. The input data (e.g., images) are often referred to as a layer: the input layer.

Importantly, images can be grayscale or in color. A grayscale image can be represented as a 2D (width x height) grid (or matrix) of variables. When in color, however, the pixel information of images is usually split into three “channels”: one for the amount of red in the image, one for the amount of green in the image, and one for the amount of blue in the image. Hence the often used term “RGB images” for color images. As such, color images are represented as a 3D (width x height x channels) grid in the input layer of a CNN. (Technically, this 3D version of a matrix is called a “tensor”.) For example, a full-HD color image would be represented as a 1920x1080x3 grid. Note that, in the CNN visualization on the website, the different channels (red, green, blue) are shown as separate 2D grids. To see clearly how a color image is represented in the three channels, change the default image in the network (a coffee cup) to the red bell pepper; you’ll clearly see how the image is represented as the combination of three different color channels.
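To make the grid representation concrete, here is a minimal NumPy sketch of how a full-HD color image could be stored as a 3D tensor. Note that this is just an illustration of the idea, not how the website's CNN is implemented; the (height, width, channels) axis order is one common convention, and some libraries use (channels, height, width) instead.

```python
import numpy as np

# Hypothetical full-HD color image as a 3D array (tensor).
# Axis order here: (height, width, channels).
image = np.zeros((1080, 1920, 3), dtype=np.uint8)

# Each channel is itself a 2D grid, like the separate red/green/blue
# maps shown in the input layer of the visualization:
red_channel = image[:, :, 0]
green_channel = image[:, :, 1]
blue_channel = image[:, :, 2]

print(image.shape)        # (1080, 1920, 3)
print(red_channel.shape)  # (1080, 1920)
```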

Take a look at the input layer of the CNN on the website. With how many variables are the images represented in the input layer?

The outputs of CNNs are, like those of FCNNs, either a single continuous value (in case of regression) or a categorical value (in case of classification). The exact output variable depends on the problem that the CNN is trying to solve. On the website, the CNN is trying to solve a typical object recognition problem: for a given image (input), what is the object category (output)? Note that an output variable may have any number of categories! The predicted output is often represented as the final layer (output layer) of the network, which contains a single unit for every category. We will discuss how to interpret the output layer later.

Check the output layer of the CNN on the website. How many categories (and thus units) does it contain?

Now, let’s discuss the main feature of CNNs: convolution! Convolution is a somewhat tricky concept to understand, so we’ll go through it step by step. The most important thing you need to know is that convolution is conceptually the same as the “multiplication-and-summing” operation we discussed with FCNNs. As a side note, this “multiplication-and-summing” operation has a special name in linear algebra: the dot product. For brevity, we will use this term from now on to refer to this multiplication-and-summing process! Just like this operation in FCNNs, convolution in CNNs involves the dot product between the units’ activations and their corresponding weights. As in FCNNs, the outputs of the dot products performed during convolution represent the elements of the next layer. Importantly, the shape of the data is largely retained across layers in CNNs. So, if the input layer consists of rectangular images, the subsequent layers will also consist of rectangular grids of units (although they usually become smaller from one layer to the next, which is explained later). You can see this clearly in the visualization of the network on the website! Each “convolutional layer” (i.e., the units resulting from the convolution process), such as “conv_1_1” and “conv_1_2”, is still represented as a square grid!

Although the result from the first series of convolutions, i.e. the “conv_1_1” layer, is still a square grid, it is slightly smaller than the input dimensions. What are the dimensions (width x height) of the first convolutional layer? (You may ignore the third dimension, which we’ll discuss later.)

Despite the (conceptual) similarities between the dot product of units and weights in FCNNs and CNNs, there is one important difference: the way the weights are organized. As you know, in FCNNs, each unit in the current layer has a weight for every unit in the next layer. For most modern convolutional neural networks, however, this fully-connected “wiring” is computationally infeasible.

Suppose that an FCNN takes grayscale images of size 256x256 (like the famous AlexNet) and suppose that the first hidden layer retains the input size (i.e., the hidden layer is also of size 256x256). How many weights would it require to wire the input layer to the first hidden layer in a fully-connected way (excluding the bias)?

As you can see, the number of weights needed to fully connect the input layer to the hidden layer is enormous! Convolution is a solution to this problem, as it introduces “weight sharing”. Instead of having separate weights for each input unit, it uses a small set of weights, often called a “kernel”, that is shared across the entire input! Usually, these kernels are small 2D rectangular grids of values. Take a look at Figure 2 on the website (note: it’s a GIF, not an interactive figure). On the left, you see a small 3x3 pixel grid moving around -- this is the kernel! The nine values of the kernel represent that kernel’s weights. One way to interpret kernels is as “feature detectors”. More about this interpretation later.
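The sliding-kernel idea can be sketched in a few lines of NumPy. This is a deliberately naive, loop-based illustration (real libraries use much faster implementations); the function name and the toy 4x4 input are made up for this example, with stride 1 and no padding:

```python
import numpy as np

def convolve2d(layer, kernel):
    """Slide a 2D kernel over a 2D layer (stride 1, no padding):
    at every location, take the dot product between the kernel and
    the patch of the layer underneath it, and store the result."""
    kh, kw = kernel.shape
    out_h = layer.shape[0] - kh + 1
    out_w = layer.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = layer[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)  # the dot product
    return out

layer = np.arange(16, dtype=float).reshape(4, 4)  # toy 4x4 "image"
kernel = np.ones((3, 3)) / 9.0                    # a simple averaging kernel
print(convolve2d(layer, kernel).shape)  # (2, 2)
```

Note how the output grid (2x2) is slightly smaller than the input (4x4), just like “conv_1_1” is slightly smaller than the input layer on the website.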

Importantly, in case of multiple channels (such as with color images in the input layer), there is a separate kernel for each channel. So, for color images in the input layer, you need three different 2D kernels. The stack of 2D kernels for a multi-channel image is often referred to as a (3D) “filter”. Remember the relation between kernels and filters - this will become important in a short while!

Suppose that the first convolutional layer of a given CNN uses a 5x5 kernel to process color (RGB) images of size 256x256. How many weights does a given filter in this layer contain in total?

Following the previous question, how many weights would the same filter contain if the color images were of size 512x512?

As discussed, a single filter in CNNs usually only contains a small fraction of the weights of a fully-connected layer. As shown in Figure 2 of the website, the “weight sharing” property we discussed earlier is implemented by sliding the kernel across the layer. At every location in the layer, each element of the kernel is multiplied with the corresponding element in the layer, and the resulting products are summed (i.e., the dot product). In other words, the dot product is repeated for every location in the layer, which represents the essence of convolution. Each time the dot product is performed, its output (i.e., a single value) is stored (as shown in the rightmost panel of Figure 2). In the interpretation of kernels as feature detectors, you can think of this sliding dot product as the process in which the feature detection process is repeated at all locations in the layer!

Importantly, this convolution operation is repeated for each kernel and its corresponding channel. For example, in the first convolutional layer, convolution is performed three times (with three different kernels): once for each color channel in the input layer. This results in three intermediate maps, which are subsequently summed together elementwise (to which finally the bias term is added) to produce the next layer of units! The resulting 2D map from this process is sometimes called an “activation map” or “feature map”.
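The per-channel convolutions, elementwise summing, and bias can be sketched as follows. Again, this is an illustrative toy example (random 8x8 “image”, made-up bias value), not the website’s actual network:

```python
import numpy as np

def convolve2d(channel, kernel):
    # Sliding dot product over one 2D channel (stride 1, no padding).
    kh, kw = kernel.shape
    out = np.zeros((channel.shape[0] - kh + 1, channel.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(channel[i:i + kh, j:j + kw] * kernel)
    return out

rgb = np.random.rand(8, 8, 3)   # toy color "image" (height x width x 3)
filt = np.random.rand(3, 3, 3)  # one 3D filter = three 2D kernels
bias = 0.5                      # hypothetical bias term

# One intermediate map per (channel, kernel) pair...
maps = [convolve2d(rgb[:, :, c], filt[:, :, c]) for c in range(3)]

# ...summed together elementwise, plus the bias: one 2D activation map.
activation_map = sum(maps) + bias
print(activation_map.shape)  # (6, 6)
```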

We know, this is complicated stuff. Check out Figure 1 on the website to see this process interactively and try to distinguish the different steps in the process before moving on.

Now, to make the process even more complicated, convolutional layers in CNNs often contain multiple 3D filters! A set of 3D filters is sometimes called a “filter bank”. By including multiple 3D filters, the network can extract different features from the previous layer, to be processed and combined in the following layers. As each 3D filter results in only a single 2D activation map, when multiple 3D filters are used, the 2D outputs of the filters are stacked such that the resulting layer becomes not a 2D activation map, but a 3D activation tensor! In other words, each 3D filter creates a new channel in the next layer! For example, if the first convolutional layer in a network uses six 3D filters, the resulting layer will have six channels (i.e., it will be a tensor with dimensions: width x height x 6). The use of “channel” here is the same as the three channels in the input layer when dealing with color images. Technically, “channel” is just used to refer to the third dimension of tensors in the context of CNNs.
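The filter bank idea can be sketched by applying several 3D filters to the same input and stacking their 2D outputs. As before, the shapes, filter count, and helper name are made up for illustration:

```python
import numpy as np

def apply_filter(image, filt, bias):
    """Apply one 3D filter to a multi-channel image: per-channel dot
    products are summed over all channels, then the bias is added,
    yielding a single 2D activation map."""
    kh, kw, _ = filt.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw, :] * filt)
    return out + bias

rgb = np.random.rand(8, 8, 3)                       # toy color "image"
filter_bank = [np.random.rand(3, 3, 3) for _ in range(6)]  # six 3D filters
biases = np.zeros(6)

# Each 3D filter yields one 2D map; stacking them along a new third
# axis turns the result into a 3D tensor: one channel per filter.
next_layer = np.stack(
    [apply_filter(rgb, f, b) for f, b in zip(filter_bank, biases)], axis=-1)
print(next_layer.shape)  # (6, 6, 6): height x width x number of filters
```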

How many 3D filters were used to create the first convolutional layer (“conv_1_1”) in the network from the website? (Hint: remember that the website chose to visualize the channels separately!)

How many different 2D kernels does the first convolutional layer use?

And how many weights does the entire filter bank from the first convolutional layer contain?

While we discussed the general concept of convolution in detail, there are actually many settings (or “hyperparameters”) that can be tweaked in the process. This is not strictly necessary to get a high-level understanding of CNNs, so this section and the associated (ungraded) questions are optional.

Important convolutional hyperparameters include the padding used, kernel size, and the stride. The text underneath the header “Understanding Hyperparameters” on the website explains these hyperparameters and their effect on CNNs quite well, so read it before going on.

Explain the relationship between each hyperparameter (padding, kernel size, and stride) and the output size (e.g., “with increasing {hyperparameter}, the output size becomes smaller”).

[Answer: with increasing padding, the output size becomes larger; other way around for kernel size and stride]

What is a practical reason to use relatively large strides and kernel sizes?

Beware: this is a difficult question! Assuming a stride of 1, what is the relationship between the number of padding pixels I need to use (P) and the kernel size (K) if I want my output size to be the same as the input size? Try to express this mathematically, e.g., P = (some expression using K). Hint: you should see that it is not always possible to maintain the same output size with these constraints!

There are many variations and properties of convolutions that we haven’t discussed yet, but that would warrant a separate tutorial. So let’s move on to activation functions! Like in FCNNs, activation functions are always included in CNNs to add non-linearity to the network, often right after the convolutional layer. In the FCNN tutorial, you’ve encountered different activation functions, which are all equally applicable to CNNs, but nowadays, the most often used activation function is the “rectified linear unit”, or ReLU. The website actually contains a clear explanation on the ReLU, so read this now! After you’re done reading that section, click on one of the activation maps on the website from the “ReLU_1_1” layer to view how the ReLU function works interactively!
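As the website's explanation describes, the ReLU simply clips negative values to zero and passes non-negative values through unchanged. A one-line NumPy sketch:

```python
import numpy as np

def relu(x):
    """Rectified linear unit: negative values become 0,
    non-negative values pass through unchanged."""
    return np.maximum(0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 0.3, 1.7])))
# [0.  0.  0.  0.3 1.7]
```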

Consider the following vector of activations from a particular activation map: [-0.05, 0.42, 0.57, -0.35, -0.81, 0.00, 0.63]. After passing each value of this vector through a ReLU activation function, what does the resulting vector look like? Format your answer as follows:

[number, number, number, number, number, number, number]

Another operation that is often included in CNNs is pooling. Pooling is sometimes added after a convolutional layer (and the corresponding activation function) to reduce the size (width and height) of the layer, which not only reduces the computational cost of training the network, but also allows the network to focus on more coarse-grained and global features.

Pooling is an operation that is similar to convolution in the sense that it performs a particular operation within a rectangular grid (often, confusingly, called a [pooling] kernel) which is also slid across a layer. However, instead of performing a dot product, like in convolution, pooling performs another operation, usually computing the mean (mean pooling) or more commonly computing the maximum (max pooling). Like convolution, the output of this operation (a single number) represents a unit of the next layer. The website actually shows this really well: click on one of the activation maps underneath the “max_pool_1” layer for an interactive visualization of how pooling works!
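Max pooling can be sketched much like convolution, but with a maximum instead of a dot product inside the sliding window. The function name and the toy 4x4 layer below are made up for illustration (2x2 window, stride 2, no padding):

```python
import numpy as np

def max_pool(layer, size=2, stride=2):
    """Slide a (size x size) pooling window over the layer and keep
    only the maximum value within each window (max pooling)."""
    out_h = (layer.shape[0] - size) // stride + 1
    out_w = (layer.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = layer[i * stride:i * stride + size,
                           j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

layer = np.array([[1., 3., 2., 0.],
                  [4., 2., 1., 5.],
                  [0., 1., 3., 2.],
                  [2., 2., 0., 1.]])
print(max_pool(layer))
# [[4. 5.]
#  [2. 3.]]
```

Swapping `window.max()` for `window.mean()` would give mean pooling instead.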

The hyperparameters discussed in the context of convolutional kernels (padding, size of the filter/grid, and stride) are also relevant for pooling. Given the size of the layers of the CNN on the website before and after pooling, what is the stride used (assuming no padding and a filter size of 2x2)?

If you understand convolution, activation functions, and pooling, you already understand 90% of most CNNs! These operations are often repeated across different layers (e.g., CONV -> RELU -> CONV -> RELU -> POOL -> CONV -> RELU -> CONV -> RELU -> POOL, etc.) and make up the majority of the computations within most CNNs. The only really important step we still need to discuss is how to go from the penultimate (second-to-last) layer to the output layer. This is usually done in three steps. First, the penultimate layer is flattened, which refers to the process of transforming a multidimensional tensor or matrix into a 1D vector. For example, flattening the 2D matrix outlined below would result in the following vector:

[[1, 2]

[3, 4]] -> [1, 2, 3, 4]
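The flattening step above corresponds directly to NumPy's `flatten` method, and it works the same way for 3D tensors (the 4x4x2 tensor below is just an illustrative example):

```python
import numpy as np

matrix = np.array([[1, 2],
                   [3, 4]])
print(matrix.flatten())  # [1 2 3 4]

# The same works for a 3D (height x width x channels) tensor:
tensor = np.zeros((4, 4, 2))
print(tensor.flatten().shape)  # (32,)
```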

Note that flattened layers in CNNs are equivalent to regular (hidden) layers in FCNNs! The second step is a fully-connected wiring between each unit in the flattened layer and the units in the output layer (remember: the output layer of classification CNNs consists of one neuron for each output category). Essentially, this is thus simply a fully-connected layer like we have seen in FCNNs! This is why the penultimate (flattened) layer is often called the fully-connected (FC) layer. Sometimes, CNNs contain even more fully-connected layers after flattening (such as the famous AlexNet), but we’ll ignore that for now. For an interactive visualization of the flattening process and the fully-connected layer, click on one of the output units (e.g., the “bell pepper” unit) of the CNN on the website!

Suppose that the penultimate layer in a CNN is a tensor with dimensions 24 (height) x 24 (width) x 5 (channels). How many units are there in the flattened version of this layer?

Suppose the hypothetical CNN from the previous question tries to solve an object recognition problem with 50 different classes (car vs. plane vs. tree vs. chair vs. guitar, etc. etc.). How many weights does the fully-connected layer contain (excluding the bias term)?

The third and last step is the application of a specific activation function, the softmax function, to each unit in the output layer. This function normalizes the activity of each output unit (on the website, this is called the unit’s “logit”) to be between 0 and 1 and makes sure that the sum of the normalized activities of all output units equals 1. This way, the (softmax-normalized) activity in each output unit can be interpreted as the probability for the corresponding class. For example, if the softmax-normalized activity of the output unit corresponding to the “ladybug” class is, let’s say, 0.1, then that means the network thinks that there is a 10% chance that the currently presented image is of the class “ladybug”. As such, most CNNs make probabilistic predictions (which can, of course, easily be discretized by defining the prediction to be the class with the largest probability).

Importantly, while the softmax function is applied to the activity (“logit”) of each output unit separately, it uses the (not-yet-softmax-normalized) activity of all the other output units as well! In this respect, the softmax activation function differs from the previously discussed activation functions (such as the ReLU), which do not use information from other units. To view how the softmax is computed interactively, click on one of the output units and subsequently click on the “softmax” box. Note that, for this course, it is not important to know how the softmax function works exactly.
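You don't need to know the formula for this course, but for the curious, here is a minimal sketch of softmax (the example logit values are made up). Note the sum in the denominator: this is exactly why each normalized output depends on all the other logits, unlike the ReLU:

```python
import numpy as np

def softmax(logits):
    """Normalize a vector of logits so each value lies between 0 and 1
    and all values sum to 1. Each output depends on ALL logits via
    the sum in the denominator."""
    exps = np.exp(logits - np.max(logits))  # shift for numerical stability
    return exps / np.sum(exps)

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(np.round(probs.sum(), 6))  # 1.0
print(probs.argmax())            # 0 -> the class with the largest probability
```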

So far, we discussed CNNs solely in the context of classification problems. What would change in the CNN architecture in the context of regression problems? Name one element that should be omitted and explain why.

You managed to finish this tutorial -- well done! You now know the basics of CNNs. Of course, there is much more to learn about CNNs and CNN-related concepts (such as batch normalization, learning rate schedules, skip connections, and so on). One important topic that we skipped is how CNNs are trained, which is the same way as FCNNs: using backpropagation. Discussing backpropagation quickly becomes very mathematical, though, so it’s beyond the scope of this course. If you’d like to learn more about CNNs (including backpropagation), we highly recommend the deep learning specialization on Coursera (https://www.coursera.org/specializations/deep-learning)!