1 of 22

Machine Learning

for FRC

Edmond Lee

Programming Mentor

2 of 22

Overview

  • “Morning” and “Afternoon” discussion
    • Morning: theory-oriented
    • Afternoon: learn about the tools I developed for FRC vision
  • Machine learning normally requires a background in statistics, but for our purposes basic algebra is enough; the rest will be explained as needed

3 of 22

What is Machine Learning?

  • “Making computers learn”? That is the old, informal definition
  • Modern definition (Tom Mitchell’s): "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."
  • T is the task, P is the performance measure (how well the program does T), and E is the experience gained from learning
  • Supervised Learning
    • Regression - Approximating a continuous output
    • Classification - Approximating a discrete output (e.g. yes or no)

4 of 22

General Practice

Training Data → Learning Algorithm → Model

Test Data → Model → Prediction

5 of 22

Applying Machine Learning in FRC?

Let’s think briefly about how machine learning can be used in FRC Vision:

  • Is object detection a regression or classification problem?
  • What about using object detection to help align the robot to the object?

6 of 22

Linear Regression

  • Predicts a real-valued output based on one or more input values, or features
  • One input gives a 2D function; two inputs give a 3D function
  • The model has one or more weights, which act like constants (there is no restriction on the form or polynomial degree of the function)
  • A specific input/output pair from the training data is called an example
  • Given a model for our hypothesis, how can we measure its accuracy?
    • h(x) = Theta_0 + Theta_1 * x, where Theta_0 is 100000 and Theta_1 is 100, x is the square footage of a house, and y is the price of the house
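The hypothesis above can be sketched directly in code. This is a minimal illustration using the slide’s example weights, written in Python rather than the Octave used later in the workshop:

```python
# The linear-regression hypothesis from the slide:
# h(x) = theta_0 + theta_1 * x, with theta_0 = 100000 and theta_1 = 100.
def hypothesis(theta_0, theta_1, x):
    """Predict a house price (y) from its square footage (x)."""
    return theta_0 + theta_1 * x

# A 1500 sq ft house under this model:
price = hypothesis(100000, 100, 1500)  # 100000 + 100 * 1500 = 250000
```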

7 of 22

Modeling the Cost Function

  • The cost function measures the cost, or error, of the model; we want to be able to find its global minimum
  • The function we will use for this is the squared-error function. When its values are graphed over our model weights, it is a convex function
  • An important property of convex functions is that they have only one local minimum, which is therefore the global minimum
  • How do we optimize the model? By choosing weights that minimize the cost function
  • So it’s basically an optimization problem from AP Calculus AB? Not quite: the cost isn’t a simple quadratic in x. It is a function of the weights that wraps around h(x)
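The squared-error cost can be made concrete with a short sketch. This assumes the standard Coursera form, J(theta) = 1/(2m) · Σ(h(x_i) − y_i)², and is illustrative rather than the workshop’s Octave code:

```python
# Squared-error cost for one-variable linear regression.
# Note how the hypothesis h(x) is "wrapped" inside the cost function.
def compute_cost(theta_0, theta_1, xs, ys):
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        h = theta_0 + theta_1 * x   # hypothesis for this example
        total += (h - y) ** 2       # squared error for this example
    return total / (2 * m)

# A model that fits the data perfectly has zero cost:
perfect = compute_cost(0.0, 2.0, [1, 2, 3], [2, 4, 6])  # -> 0.0
```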

8 of 22

Gradient Descent

  • Initialize the weights, then repeat until convergence: for every weight, take the partial derivative of the cost function with respect to that weight, multiply it by the learning rate, and subtract the result from the old weight value.
  • This is batch gradient descent, which goes through the whole training set on every update and can be expensive. Stochastic gradient descent instead updates the weights using one randomly selected training example at a time. It is faster but subject to fluctuations.
  • If the cost function is convex and the learning rate is small enough, gradient descent converges without the learning rate having to be decreased over time. Too high a learning rate, however, can cause divergence.
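The update rule above can be sketched as batch gradient descent for one-variable linear regression. The learning rate and iteration count here are illustrative choices, not values from the slides:

```python
# Batch gradient descent for h(x) = theta_0 + theta_1 * x with squared-error cost.
def gradient_descent(xs, ys, alpha=0.01, iterations=1000):
    theta_0, theta_1 = 0.0, 0.0          # initialize the weights
    m = len(xs)
    for _ in range(iterations):
        # Partial derivatives of the cost J with respect to each weight,
        # averaged over the whole training set (hence "batch")
        grad_0 = sum((theta_0 + theta_1 * x - y) for x, y in zip(xs, ys)) / m
        grad_1 = sum((theta_0 + theta_1 * x - y) * x for x, y in zip(xs, ys)) / m
        # Simultaneous update: subtract learning rate times the gradient
        theta_0 -= alpha * grad_0
        theta_1 -= alpha * grad_1
    return theta_0, theta_1

# Data generated by y = 2x, so theta_1 should approach 2 and theta_0 approach 0
t0, t1 = gradient_descent([1, 2, 3, 4], [2, 4, 6, 8], alpha=0.05, iterations=2000)
```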

9 of 22

Workshop

  • Now we’re going to use computers with MATLAB/Octave to implement linear regression
  • Because we use numerical computing, our code will closely resemble the formulas from the previous slides
  • This is Programming Exercise 1 from Coursera’s Machine Learning course; we will do sections 1 and 2
  • Read ex1.pdf sections 1 and 2.1, then fill in warmUpExercise and plotData
  • Read the rest of section 2, then fill in computeCost and gradientDescent after I go over the files

10 of 22

Hints

Row vector (1xN matrix): [1 2 3]; column vector (Nx1 matrix): [1; 2; 3]

Operators preceded by a dot (e.g. .*) perform element-wise operations on matrices; otherwise they are matrix operations.

Many errors are due to operations on matrices with incompatible dimensions; use size() to inspect dimensions.

Useful functions include plot() and sum(); be sure to look at the documentation.

In code the hypothesis is theta * x, not theta transposed times x as in the PDF, because theta is already a column vector here, so the dimensions work out.

Why does X contain a column of 1s as well as the original input? The 1 serves as a bias value that gets multiplied with theta_0, allowing the model to be shifted. What happens if we use a column of zeros instead?
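The role of the bias column can be seen in a small sketch. This is an illustrative Python version (names are hypothetical), not the exercise’s Octave code:

```python
# The prediction for one augmented example is the dot product of the
# weight vector with [1, x1, x2, ...]; the leading 1 multiplies theta_0.
def predict(theta, x_row):
    return sum(t * x for t, x in zip(theta, x_row))

theta = [100000, 100]     # [theta_0, theta_1] from the earlier house example
x_with_bias = [1, 1500]   # bias entry 1, then the original feature
with_intercept = predict(theta, x_with_bias)     # theta_0 shifts the line up

# With a 0 instead of a 1, theta_0 is multiplied by zero: the intercept
# vanishes and the line is forced through the origin.
without_intercept = predict(theta, [0, 1500])
```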

11 of 22

Overfitting and Underfitting

  • Bias = underfitting: the model is too simple and maps poorly to the trend in the data
  • Variance = overfitting: the model is too complex and generalizes poorly
  • How to avoid these problems?
    • Train on most, but not all, of your available data (keep some for testing)
    • Have a large enough variety of data
    • Use fewer features, or regularize them
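Regularization can be sketched as adding a penalty on the weight magnitudes to the squared-error cost from earlier. This is a minimal L2 (ridge) illustration, assuming the usual convention that the bias weight theta_0 is not penalized:

```python
# L2-regularized squared-error cost: J = (sum of squared errors
# + lambda * sum of squared non-bias weights) / (2m).
def regularized_cost(theta, xs_aug, ys, lam):
    m = len(xs_aug)
    sq_err = sum(
        (sum(t * x for t, x in zip(theta, row)) - y) ** 2
        for row, y in zip(xs_aug, ys)
    )
    penalty = lam * sum(t ** 2 for t in theta[1:])  # skip the bias weight
    return (sq_err + penalty) / (2 * m)

# With lam = 0 this is the plain cost; a larger lam punishes large weights,
# discouraging the overly complex models that overfit.
low = regularized_cost([0.0, 2.0], [[1, 1], [1, 2]], [2, 4], lam=0.0)
high = regularized_cost([0.0, 2.0], [[1, 1], [1, 2]], [2, 4], lam=10.0)
```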

12 of 22

Neural Networks

  • Essentially a model of neurons and the brain, based on the idea that neurons are general purpose and can learn anything
  • Contains input, hidden, output layers
    • Dendrites (inputs) channel electrical signals to axons (outputs)
  • Connections from one node in a layer to a node in another layer are called weights
  • Good for non-linear hypotheses; even a small image can yield millions of features

13 of 22

Optimizing a neural network

  • Forward and backward propagation
  • Forward - Given the input, calculate each layer’s output using the activation function
  • Backward - After forward prop, compute the cost to get the loss under the current weights, then propagate it backward to update the weights of the model
  • Some similarities to gradient descent

14 of 22

CNNs

  • Convolutional Neural Networks - designed to take images (or image-like grids) as input
  • What is convolution, and how does it work for images?
    • Convolution is the * operation on two functions: one function is flipped and then “slides” across the other while the integral of their product is computed at each shift, so it is an integral transform.
    • In CNNs, filters are convolved with the image: they slide across it while computing dot products. This produces a convolution layer (or feature map), which the next layer’s filters convolve through in turn until all layers are processed.
    • CNNs can learn the filters used for image classification from whatever features are present in the images. Instead of dealing with millions of possible features in a shallow network, convolution reduces the number of parameters while making the network deeper.
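The “slide and compute dot products” operation can be sketched directly. Strictly speaking this is cross-correlation (deep-learning frameworks usually skip the flip), shown here as a minimal “valid”-mode example:

```python
# Slide a kernel over an image; at each position, take the dot product of
# the kernel with the image patch under it. Output is the feature map.
def convolve2d_valid(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            row.append(sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw)
            ))
        out.append(row)
    return out

# A 3x3 image with a 2x2 filter yields a 2x2 feature map:
feature_map = convolve2d_valid(
    [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]],
    [[1, 0],
     [0, 1]],
)  # -> [[6, 8], [12, 14]]
```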

15 of 22

SSD Caffe

  • Caffe optimizes a model by calling forward prop to calculate the output and loss, then doing backward prop to compute a gradient used to update the weights.
  • Traditional methods such as Faster R-CNN use a sliding window: the contents of the window are classified and given a score, then the window "slides" a small amount and the process repeats. Resampling also occurs, which is computationally expensive.
  • Single Shot Detection (SSD) instead starts from a fixed set of default bounding boxes, applies small convolutional filters to feature maps to predict scores and box offsets, and adjusts the boxes based on confidence levels.
  • We have our own fork of SSD Caffe, thanks to yours truly

16 of 22

Setting up the repo

  1. Install Ubuntu 17.10, install CUDA 9.1, install GCC 6, install necessary packages
  2. Clone the repo
  3. Make necessary fixes, specified in setup folder and instructions
  4. Build
  5. Download the latest version of OpenCV source code and build it
  6. (Jetson only) Build all other libraries/frameworks needed to control camera and network

17 of 22

Gathering data

  • First we need to gather photos and annotate them
    • Record video, then use Photoshop CC’s “Render Video” at 20-24 FPS to export the frames as images
    • Install LabelImg to create the annotations
    • Annotation takes hours, but can be sped up if others work on it too
  • Follow the directions in the caffe-data repo to prepare the data (this automatically creates lists of images for training and testing)

18 of 22

Training a model

  • Open ssd_train_ml and adjust the following parameters:
    • base_lr - Training fails if this is too high (the loss becomes infinite)
    • batch_size - The GPU runs out of memory if this is too high; if it is set to 1, lower base_lr to ensure training still works
    • accum_batch_size - Inversely related to batch_size; if you lower batch_size, raise this
  • Train the model, and watch the output at the beginning before leaving the computer
    • Training takes several hours for 10000 iterations, which is an adequate amount
    • The learning rate determines how much each iteration changes the weights, not how long each iteration takes
  • Use the OpenCV script and ssd_score_ml.py to test out the model

19 of 22

Centering the robot

The problem: We want the robot to move its preloaded gear onto the pipe. The pipe sits between two pieces of reflective tape.

  • Detection information: the bounding box’s top-left and bottom-right corner pixel coordinates (four values), plus a confidence level, for every object found
  • What can we infer from the bounding box?
  • How can these inferences be related?
  • Edge cases: What happens if a bounding box touches the edge of the frame? What if we get a false positive?
  • Wait a minute, how can we do this if we have to collect data? We make our decisions as the data comes in, in real time, so it is online learning.
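One inference from a bounding box is a steering signal: compare the box’s horizontal center to the image’s center. This is a hypothetical sketch; the function name and the 640-pixel frame width are assumptions, not values from the detector:

```python
FRAME_WIDTH = 640  # assumed camera resolution, not from the slides

def horizontal_error(x_min, y_min, x_max, y_max):
    """Signed pixel offset of the box center from the frame center.

    Negative -> target is left of center, positive -> right of center.
    A drive loop could turn the robot until this approaches zero.
    """
    box_center_x = (x_min + x_max) / 2.0
    return box_center_x - FRAME_WIDTH / 2.0

# Using the first detection from the next slide (370 225 466 405):
err = horizontal_error(370, 225, 466, 405)  # box center 418, so err = 98.0
```

With two pieces of tape detected, the same idea applies to the midpoint between the two box centers, which should sit over the pipe.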

20 of 22

Example detector output (x_min y_min x_max y_max) from three frames:

Frame 1 - Detection 0: 370 225 466 405, Detection 1: 6 211 93 395
Frame 2 - Detection 0: 289 270 332 352, Detection 1: 445 274 479 354
Frame 3 - Detection 0: 289 262 313 311, Detection 1: 204 255 225 303

21 of 22

Future Work

  • SSD Caffe is usable and awesome, but there’s still a lot to do before we can integrate vision with autonomous.
  • Figure out how to get SSD Caffe to work on a Jetson or Android (detection)
  • Figure out how to get a low-latency, high quality video stream
  • Figure out an algorithm to get gradient descent to work for centering the robot
  • Figure out how to send the information from the coprocessor to the robot
  • Learn about Computer Vision and OpenCV to get the same information in different ways (which can work with more platforms)

22 of 22

Attributions

  • SSD Caffe may be found on GitHub. https://github.com/weiliu89/caffe/tree/ssd/
  • Equations obtained from Andrew Ng’s Coursera Machine Learning course.