1 of 22

Machine Learning

for FRC

Edmond Lee

Programming Mentor

2 of 22

Overview

  • “Morning” and “Afternoon” discussion
    • Morning: theory-oriented
    • Afternoon: learn about the tools I developed for FRC vision
  • Machine learning normally requires a background in statistics, but for our purposes basic algebra is enough; the rest will be explained as needed

3 of 22

What is Machine Learning?

  • “Making computers learn”? That is the old, informal definition
  • Modern definition (Tom Mitchell’s): "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."
  • T is the task, P is the performance measure (how well the program does T), and E is the experience gained from learning
  • Supervised Learning
    • Regression - Approximating a continuous output
    • Classification - Approximating a discrete output (e.g. yes or no)

4 of 22

General Practice

Training Data → Learning Algorithm → Model

Test Data → Model → Prediction

5 of 22

Applying Machine Learning in FRC?

Let’s think briefly about how machine learning can be used in FRC Vision:

  • Is object detection a regression or classification problem?
  • What about using object detection to help align the robot to the object?

6 of 22

Linear Regression

  • Predicts a real-valued output based on one or more input values, or features
  • One input gives a 2D function; two inputs give a 3D function
  • The model has one or more weights, which act like constants (there is no restriction on the form or polynomial degree of the function)
  • A specific input/output pair from the training data is called an example
  • Given a model for our hypothesis, how can we measure its accuracy?
    • h(x) = Theta_0 + Theta_1 * x, where Theta_0 is 100000 and Theta_1 is 100, x is the square footage of a house, and y is the price of the house
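The hypothesis above can be sketched directly in code. This is a minimal illustration using the slide’s example weights, written in Python rather than the Octave used later in the workshop:

```python
# The linear-regression hypothesis from the slide:
# h(x) = theta_0 + theta_1 * x, with theta_0 = 100000 and theta_1 = 100.
def hypothesis(theta_0, theta_1, x):
    """Predict a house price (y) from its square footage (x)."""
    return theta_0 + theta_1 * x

# A 1500 sq ft house under this model:
price = hypothesis(100000, 100, 1500)  # 100000 + 100 * 1500 = 250000
```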

7 of 22

Modeling the Cost Function

  • The cost function measures the cost, or error, of the model; we want to be able to find its global minimum
  • The function we will use for this is the squared-error function. When its values are graphed over our model weights, it is a convex function
  • An important property of convex functions is that they have only one local minimum, which is therefore the global minimum
  • How do we optimize the model? By choosing weights that minimize the cost function
  • So it’s basically an optimization problem from AP Calculus AB? Not quite: the cost isn’t a simple quadratic in x. It is a function of the weights that wraps around h(x)
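The squared-error cost can be made concrete with a short sketch. This assumes the standard Coursera form, J(theta) = 1/(2m) · Σ(h(x_i) − y_i)², and is illustrative rather than the workshop’s Octave code:

```python
# Squared-error cost for one-variable linear regression.
# Note how the hypothesis h(x) is "wrapped" inside the cost function.
def compute_cost(theta_0, theta_1, xs, ys):
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        h = theta_0 + theta_1 * x   # hypothesis for this example
        total += (h - y) ** 2       # squared error for this example
    return total / (2 * m)

# A model that fits the data perfectly has zero cost:
perfect = compute_cost(0.0, 2.0, [1, 2, 3], [2, 4, 6])  # -> 0.0
```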

8 of 22

Gradient Descent

  • Initialize the weights, then repeat until convergence: for every weight, take the partial derivative of the cost function with respect to that weight, multiply it by the learning rate, and subtract the result from the old weight value.
  • This is batch gradient descent, which goes through the whole training set on every update and can be expensive. Stochastic gradient descent instead updates the weights using one randomly selected training example at a time. It is faster but subject to fluctuations.
  • If the cost function is convex and the learning rate is small enough, gradient descent converges without the learning rate having to be decreased over time. Too high a learning rate, however, can cause divergence.
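The update rule above can be sketched as batch gradient descent for one-variable linear regression. The learning rate and iteration count here are illustrative choices, not values from the slides:

```python
# Batch gradient descent for h(x) = theta_0 + theta_1 * x with squared-error cost.
def gradient_descent(xs, ys, alpha=0.01, iterations=1000):
    theta_0, theta_1 = 0.0, 0.0          # initialize the weights
    m = len(xs)
    for _ in range(iterations):
        # Partial derivatives of the cost J with respect to each weight,
        # averaged over the whole training set (hence "batch")
        grad_0 = sum((theta_0 + theta_1 * x - y) for x, y in zip(xs, ys)) / m
        grad_1 = sum((theta_0 + theta_1 * x - y) * x for x, y in zip(xs, ys)) / m
        # Simultaneous update: subtract learning rate times the gradient
        theta_0 -= alpha * grad_0
        theta_1 -= alpha * grad_1
    return theta_0, theta_1

# Data generated by y = 2x, so theta_1 should approach 2 and theta_0 approach 0
t0, t1 = gradient_descent([1, 2, 3, 4], [2, 4, 6, 8], alpha=0.05, iterations=2000)
```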

9 of 22

Workshop

  • Now we’re going to use computers with MATLAB/Octave to implement linear regression
  • Because we use numerical computing, our code will closely resemble the formulas from the previous slides
  • This is Programming Exercise 1 from Coursera’s Machine Learning course; we will do sections 1 and 2
  • Read ex1.pdf sections 1 and 2.1, then fill in warmUpExercise and plotData
  • Read the rest of section 2, then fill in computeCost and gradientDescent after I go over the files

10 of 22

Hints

Row vector (1xN matrix): [1 2 3]; column vector (Nx1 matrix): [1; 2; 3]

Operators preceded by a dot (e.g. .*) perform element-wise operations on matrices; otherwise they are matrix operations.

Many errors are due to operations on matrices with incompatible dimensions; use size() to inspect dimensions.

Useful functions include plot() and sum(); be sure to look at the documentation.

In code the hypothesis is theta * x, not theta transposed times x as in the PDF, because theta is already a column vector here, so the dimensions work out.

Why does X contain a column of 1s as well as the original input? The 1 serves as a bias value that gets multiplied with theta_0, allowing the model to be shifted. What happens if we use a column of zeros instead?
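The role of the bias column can be seen in a small sketch. This is an illustrative Python version (names are hypothetical), not the exercise’s Octave code:

```python
# The prediction for one augmented example is the dot product of the
# weight vector with [1, x1, x2, ...]; the leading 1 multiplies theta_0.
def predict(theta, x_row):
    return sum(t * x for t, x in zip(theta, x_row))

theta = [100000, 100]     # [theta_0, theta_1] from the earlier house example
x_with_bias = [1, 1500]   # bias entry 1, then the original feature
with_intercept = predict(theta, x_with_bias)     # theta_0 shifts the line up

# With a 0 instead of a 1, theta_0 is multiplied by zero: the intercept
# vanishes and the line is forced through the origin.
without_intercept = predict(theta, [0, 1500])
```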

11 of 22

Overfitting and Underfitting

  • Bias = underfitting: the model is too simple and maps poorly to the trend in the data
  • Variance = overfitting: the model is too complex and generalizes poorly
  • How to avoid these problems?
    • Train on most, but not all, of your available data (keep some for testing)
    • Have a large enough variety of data
    • Use fewer features, or regularize them
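Regularization can be sketched as adding a penalty on the weight magnitudes to the squared-error cost from earlier. This is a minimal L2 (ridge) illustration, assuming the usual convention that the bias weight theta_0 is not penalized:

```python
# L2-regularized squared-error cost: J = (sum of squared errors
# + lambda * sum of squared non-bias weights) / (2m).
def regularized_cost(theta, xs_aug, ys, lam):
    m = len(xs_aug)
    sq_err = sum(
        (sum(t * x for t, x in zip(theta, row)) - y) ** 2
        for row, y in zip(xs_aug, ys)
    )
    penalty = lam * sum(t ** 2 for t in theta[1:])  # skip the bias weight
    return (sq_err + penalty) / (2 * m)

# With lam = 0 this is the plain cost; a larger lam punishes large weights,
# discouraging the overly complex models that overfit.
low = regularized_cost([0.0, 2.0], [[1, 1], [1, 2]], [2, 4], lam=0.0)
high = regularized_cost([0.0, 2.0], [[1, 1], [1, 2]], [2, 4], lam=10.0)
```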

12 of 22

Neural Networks

  • Essentially a model of neurons and the brain, based on the idea that neurons are general purpose and can learn anything
  • Contains input, hidden, output layers
    • Dendrites (inputs) channel electrical signals to axons (outputs)
  • Connections from one node in a layer to a node in another layer are called weights
  • Good for non-linear hypotheses; even a small image can yield millions of features

13 of 22

Optimizing a neural network

  • Forward and backward propagation
  • Forward - Given the input, calculate each layer’s output using the activation function
  • Backward - After forward prop, compute the cost to get the loss under the current weights, then propagate it backward to update the weights of the model
  • Some similarities to gradient descent

14 of 22

CNNs

  • Convolutional Neural Networks - designed to take images (or image-like grids) as input
  • What is convolution, and how does it work for images?
    • Convolution is the * operation on two functions: one function is flipped and then “slides” across the other while the integral of their product is computed at each shift, so it is an integral transform.
    • In CNNs, filters are convolved with the image: they slide across it while computing dot products. This produces a convolution layer (or feature map), which the next layer’s filters convolve through in turn until all layers are processed.
    • CNNs can learn the filters used for image classification from whatever features are present in the images. Instead of dealing with millions of possible features in a shallow network, convolution reduces the number of parameters while making the network deeper.
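The “slide and compute dot products” operation can be sketched directly. Strictly speaking this is cross-correlation (deep-learning frameworks usually skip the flip), shown here as a minimal “valid”-mode example:

```python
# Slide a kernel over an image; at each position, take the dot product of
# the kernel with the image patch under it. Output is the feature map.
def convolve2d_valid(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            row.append(sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw)
            ))
        out.append(row)
    return out

# A 3x3 image with a 2x2 filter yields a 2x2 feature map:
feature_map = convolve2d_valid(
    [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]],
    [[1, 0],
     [0, 1]],
)  # -> [[6, 8], [12, 14]]
```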

15 of 22

SSD Caffe

  • Caffe optimizes a model by calling forward prop to calculate the output and loss, then doing backward prop to compute a gradient used to update the weights.
  • Traditional methods such as Faster R-CNN use a sliding window: the contents of the window are classified and given a score, then the window "slides" a small amount and the process repeats. Resampling also occurs, which is computationally expensive.
  • Single Shot Detection (SSD) instead starts from a fixed set of default bounding boxes, applies small convolutional filters to feature maps to predict scores and box offsets, and adjusts the boxes based on confidence levels.
  • We have our own fork of SSD Caffe, thanks to yours truly

16 of 22

Setting up the repo

  1. Install Ubuntu 17.10, install CUDA 9.1, install GCC 6, install necessary packages
  2. Clone the repo
  3. Make necessary fixes, specified in setup folder and instructions
  4. Build
  5. Download the latest version of OpenCV source code and build it
  6. (Jetson only) Build all other libraries/frameworks needed to control camera and network

17 of 22

Gathering data

  • First we need to gather photos and annotate them
    • Record video, then use Photoshop CC’s “Render Video” at 20-24 FPS to export the frames as images
    • Install LabelImg to create the annotations
    • Annotation takes hours, but can be sped up if others work on it too
  • Follow the directions in the caffe-data repo to prepare the data (this automatically creates lists of images for training and testing)

18 of 22

Training a model

  • Open ssd_train_ml and adjust the following parameters:
    • base_lr - Training fails if this is too high (the loss becomes infinite)
    • batch_size - The GPU runs out of memory if this is too high; if it is set to 1, lower base_lr to ensure training still works
    • accum_batch_size - Inversely related to batch_size; if you lower batch_size, raise this
  • Train the model, and watch the output at the beginning before leaving the computer
    • Training takes several hours for 10000 iterations, which is an adequate amount
    • The learning rate determines how much each iteration changes the weights, not how long each iteration takes
  • Use the OpenCV script and ssd_score_ml.py to test out the model

19 of 22

Centering the robot

The problem: We want the robot to move its preloaded gear onto the pipe. The pipe sits between two pieces of reflective tape.

  • Detection information: the bounding box’s top-left and bottom-right corner pixel coordinates (four values), plus a confidence level, for every object found
  • What can we infer from the bounding box?
  • How can these inferences be related?
  • Edge cases: What happens if a bounding box touches the edge of the frame? What if we get a false positive?
  • Wait a minute, how can we do this if we have to collect data? We make our decisions as the data comes in, in real time, so it is online learning.
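One inference from a bounding box is a steering signal: compare the box’s horizontal center to the image’s center. This is a hypothetical sketch; the function name and the 640-pixel frame width are assumptions, not values from the detector:

```python
FRAME_WIDTH = 640  # assumed camera resolution, not from the slides

def horizontal_error(x_min, y_min, x_max, y_max):
    """Signed pixel offset of the box center from the frame center.

    Negative -> target is left of center, positive -> right of center.
    A drive loop could turn the robot until this approaches zero.
    """
    box_center_x = (x_min + x_max) / 2.0
    return box_center_x - FRAME_WIDTH / 2.0

# Using the first detection from the next slide (370 225 466 405):
err = horizontal_error(370, 225, 466, 405)  # box center 418, so err = 98.0
```

With two pieces of tape detected, the same idea applies to the midpoint between the two box centers, which should sit over the pipe.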

20 of 22

Example detector output (x_min y_min x_max y_max) from three frames:

Frame 1 - Detection 0: 370 225 466 405, Detection 1: 6 211 93 395
Frame 2 - Detection 0: 289 270 332 352, Detection 1: 445 274 479 354
Frame 3 - Detection 0: 289 262 313 311, Detection 1: 204 255 225 303

21 of 22

Future Work

  • SSD Caffe is usable and awesome, but there’s still a lot to do before we can integrate vision with autonomous.
  • Figure out how to get SSD Caffe to work on a Jetson or Android (detection)
  • Figure out how to get a low-latency, high quality video stream
  • Figure out an algorithm to get gradient descent to work for centering the robot
  • Figure out how to send the information from the coprocessor to the robot
  • Learn about Computer Vision and OpenCV to get the same information in different ways (which can work with more platforms)

22 of 22

Attributions

  • SSD Caffe may be found on GitHub. https://github.com/weiliu89/caffe/tree/ssd/
  • Equations obtained from Andrew Ng’s Coursera Machine Learning course.