1 of 38

Portions of these slides are from Penn Course CIS680

F1TENTH Autonomous Racing

Vision II : Deep Learning Methods

Zirui Zang and the F1TENTH Team

Contact: Rahul Mangharam <rahulm@seas.upenn.edu>

2 of 38

Vision Module Overview

Lecture 1 : Classical Methods

  • Vision Hardware
  • Accessing Camera on Linux
  • Camera Model & Pose Estimation
  • Useful OpenCV Functions
  • Visual SLAM

Lecture II : DL Methods

  • Deep Learning Basics
  • Object Detection w/ Image - YOLO
  • Object Detection w/ Pointcloud - Pointpillars
  • Recent Trends in CV
  • Network Deployment

3 of 38

Deep Learning Basics

4 of 38

Deep Learning in Autonomous Driving

Optical Flow (Yang Wang et al.)

Semantic Segmentation / Drivable Surface

3D Object Detection from Monocular Camera

Lidar point cloud detection

5 of 38

Deep Learning in Autonomous Driving

Night to Day (ForkGAN)

Dehaze (Cameron Hodges et al.)

Depth from Monocular Camera

Monodepth2

6 of 38

From Classical to DL

  • It all comes down to feature extraction from the data.
  • We can use different kernels to detect different features.
  • Instead of corners and edges, can we work with higher-level features?
  • Do we have to design the kernels by hand?
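
As a minimal sketch of what a hand-designed kernel looks like, the snippet below slides a Sobel kernel over a toy image to pick out a vertical edge. The helper `conv2d`, the toy `IMAGE`, and the choice of the Sobel kernel are illustrative assumptions, not part of the original slides.

```python
def conv2d(image, kernel):
    """Valid-mode 2D cross-correlation (what DL libraries call convolution)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out


# Hand-designed Sobel kernel that responds to vertical edges.
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]

# Toy 4x6 image: dark (0) on the left half, bright (1) on the right half.
IMAGE = [[0, 0, 0, 1, 1, 1] for _ in range(4)]

# The response is nonzero only where the window straddles the edge.
EDGES = conv2d(IMAGE, SOBEL_X)
```

A CNN keeps the same sliding-window operation but learns the kernel values from data instead of fixing them by hand.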

7 of 38

Feature Extraction - CNN

  • Use a hierarchy of convolution kernels to detect features at different levels.
  • Use nonlinear activation functions at each layer to capture nonlinear information.
  • Use machine learning methods to train the network to learn useful kernels.
  • At the end of the network, use a classifier to map features to outputs.

CNNs

Fully-connected (MLP)

8 of 38

Common Detection Structure

  • The structure of the neural network determines its function.
  • For object detection and many other tasks, Backbone + Detection Heads is a common structure.

Data Preprocess:

  • Resize
  • Normalize
  • etc.

Feature Extraction (Backbone):

  • CNN
  • Skip Connections
  • Bottlenecks
  • etc.

Detection Heads:

  • Object Detection
  • Keypoint Detection
  • 3D Bounding Box
  • etc.

Result Decode & Post Processing:

  • Confidence Threshold
  • Non-Maximum Suppression
  • etc.

9 of 38

Neural Network Training Pipeline

  • Prepare labeled data: the image/point cloud and the ground truth. The ground truth can be the object class, object existence (for detection), size, or anything else you want the network to learn.
  • Initialization: Start with random weights or a pretrained network.
  • Forward Propagation: Pass the input through the network with the current kernels until the last layer. Compare the output to the ground truth to compute the loss.
  • Backward Propagation: Compute the derivative of the loss with respect to every parameter in the network and adjust the parameters based on those gradients.
  • Batch Normalization: Normalize each layer's activations over the current batch of training data. Together with averaging the parameter update over each batch, this is an important step for your network to actually learn something.
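
The loop above can be sketched on a toy problem. This is a minimal illustration, assuming a one-parameter-pair linear model in place of a real network and plain mini-batch gradient descent with a squared-error loss; the names `DATA`, `train`, and `mse` and all hyper-parameter values are hypothetical.

```python
# Toy labeled data: inputs x in [-1, 1], ground truth y = 2x + 1.
DATA = [(x / 10.0, 2.0 * (x / 10.0) + 1.0) for x in range(-10, 11)]


def mse(data, w, b):
    """Mean squared error of the model w*x + b over the data set."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def train(data, epochs=200, batch_size=7, lr=0.1):
    w, b = 0.0, 0.0  # initialization (could also be pretrained values)
    for _ in range(epochs):
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            grad_w = grad_b = 0.0
            for x, y in batch:
                pred = w * x + b   # forward propagation
                err = pred - y     # derivative of 0.5 * err^2 w.r.t. pred
                grad_w += err * x  # backward propagation: dL/dw
                grad_b += err      # backward propagation: dL/db
            # Network update, averaged over the batch.
            w -= lr * grad_w / len(batch)
            b -= lr * grad_b / len(batch)
    return w, b


W, B = train(DATA)
```

After training, `W` and `B` should be close to the ground-truth slope and intercept, and `mse(DATA, W, B)` should be close to zero.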

10 of 38

Neural Network Training Pipeline

Data Preparation

Initialization

Forward Propagation

Backward Propagation

Network Update

  • The training step is often repeated along with:
  • Network design experiments
    • Add or remove a layer.
    • Change number of channels in a certain layer.
    • Add a dropout or pooling layer.
    • Add skip connections.
  • Hyper-parameter tuning
    • Learning rate
    • Convolution kernel sizes, strides, etc.

11 of 38

Neural Network Training Pipeline

Training

Data Collection & Labeling & Augmentation

Network Design or Selection from Existing Designs

Network Deployment

Lifelong Updates

  • Training is only one part of the network’s life.
  • Data collection and labeling are the time-consuming part.
  • Network design is the technical part.
  • Network deployment is the consequential part.
  • Lifelong updates are the legal part.

12 of 38

Object Detection w/ Image

13 of 38

YOLO

  • YOLO is a single-stage object detection architecture published in 2015.
  • YOLO is fast, fairly accurate, and structurally very simple.
  • But it has issues, such as too many false positives, and it requires heavy post-processing.
  • It is old, but it is a good way to learn deep learning for computer vision.

14 of 38

YOLO Structure

Feature Extraction “Backbone”

“Detection Head”

7x7x30

7x7 windows, each window has 30 channels.

7x7 is just the output dimension from the convolutions.

Each window proposes 2 objects, each described by (x, y, w, h, confidence).

Then each window has 20 values for object classes.

So 2x5+20 = 30 channels.
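
The channel arithmetic above can be checked with a few lines. This is only a sketch; the constant names and the helper `split_cell` are illustrative, and the layout (boxes first, then class scores) is an assumption made for the example.

```python
GRID_SIZE = 7          # 7x7 windows from the convolutions
BOXES_PER_CELL = 2     # each window proposes 2 objects
VALUES_PER_BOX = 5     # x, y, w, h, confidence
NUM_CLASSES = 20       # class scores per window

CHANNELS = BOXES_PER_CELL * VALUES_PER_BOX + NUM_CLASSES
TOTAL_OUTPUTS = GRID_SIZE * GRID_SIZE * CHANNELS


def split_cell(cell):
    """Split one window's channel vector into its boxes and class scores."""
    boxes = [cell[i * VALUES_PER_BOX:(i + 1) * VALUES_PER_BOX]
             for i in range(BOXES_PER_CELL)]
    class_scores = cell[BOXES_PER_CELL * VALUES_PER_BOX:]
    return boxes, class_scores


EXAMPLE_BOXES, EXAMPLE_CLASSES = split_cell(list(range(CHANNELS)))
```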

15 of 38

Loss Function

YOLO loss function

(x, y) coordinates error

(w, h) coordinates error

Class error, if an object of this class exists in this grid cell

Confidence error, if no object exists in this grid cell

Confidence error, if an object exists in this grid cell
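
For reference, the loss in the original YOLO paper is usually written as the sum of squared errors below. The notation is an assumption carried over from that paper rather than from these slides: S = 7 is the grid size, B = 2 is the number of boxes per cell, hats mark the other side of each prediction/label pair, the indicator is 1 when box j in cell i is responsible for an object, and the paper sets the weights to 5 for the coordinate terms and 0.5 for the no-object term. The order of the terms follows the paper and may differ from the order of the labels above.

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
  \mathbb{1}_{ij}^{\text{obj}}
  \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
  \mathbb{1}_{ij}^{\text{obj}}
  \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2
       + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B}
  \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
  \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}}
  \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```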

16 of 38

How YOLO Detects

  • The total output is a 7x7x30 tensor.
    • 7x7 grid cells from the convolutions
    • 2 boxes of [x, y, w, h, confidence] per grid cell
    • 20 classes
  • Detection is simple:
    • In each grid cell, if the confidence > threshold, pick that [x, y, w, h, confidence] and the highest-scoring class.
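
The detection rule can be sketched as a loop over the grid. This is a simplified illustration, assuming the output is a nested list with each cell laid out as two [x, y, w, h, confidence] boxes followed by 20 class scores; the function `decode`, the threshold value, and the toy `GRID` are hypothetical.

```python
def decode(output, threshold=0.5, num_boxes=2, num_classes=20):
    """Keep every box whose confidence exceeds the threshold.

    Each kept detection is (x, y, w, h, confidence, class_index), where
    class_index is the highest-scoring class of that grid cell.
    """
    detections = []
    for row in output:
        for cell in row:
            class_scores = cell[num_boxes * 5:]
            best_class = max(range(num_classes), key=lambda k: class_scores[k])
            for b in range(num_boxes):
                x, y, w, h, conf = cell[b * 5:(b + 1) * 5]
                if conf > threshold:
                    detections.append((x, y, w, h, conf, best_class))
    return detections


# Toy 7x7x30 output: all zeros except one confident box of class 7.
GRID = [[[0.0] * 30 for _ in range(7)] for _ in range(7)]
GRID[3][4][0:5] = [0.5, 0.5, 0.2, 0.3, 0.9]
GRID[3][4][10 + 7] = 1.0

DETECTIONS = decode(GRID)
```

Note that both boxes in a cell share one set of class scores, so a single cell cannot report two objects of different classes.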

What are the limitations?

17 of 38

Non-maximum Suppression

  • YOLO relies heavily on NMS.
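
A minimal sketch of greedy NMS, assuming boxes are given as (x1, y1, x2, y2, score) corner coordinates; the function names, the box format, and the 0.5 IoU threshold are illustrative choices rather than values from the slides.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2, ...) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop boxes that overlap it, repeat."""
    remaining = sorted(boxes, key=lambda box: box[4], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [box for box in remaining
                     if iou(best, box) <= iou_threshold]
    return kept


# Two overlapping boxes on the same object plus one separate box.
CANDIDATES = [
    (1, 1, 11, 11, 0.8),
    (0, 0, 10, 10, 0.9),
    (20, 20, 30, 30, 0.7),
]
KEPT = nms(CANDIDATES)
```

In practice NMS is usually run per class, so that overlapping objects of different classes do not suppress each other.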

18 of 38

Development of Object Detection

https://paperswithcode.com/sota/object-detection-on-coco

19 of 38

Object Detection w/ Pointcloud

20 of 38

Pointpillars

  • Object detection on point cloud data.
  • Point cloud data is an unordered list of (x, y, z) coordinates.

PointPillars: Fast Encoders for Object Detection from Point Clouds

21 of 38

Point Cloud Object Detection

  • A lidar point cloud is very sparse.
  • Can we simply bin the points into 3D voxels and apply 3D convolutions, much like the 2D convolutions in YOLO?

Voxelization

Challenges

22 of 38

Pointpillars Structure

  • Group the sparse point cloud into vertical pillars.
  • Arrange the pillars into a pseudo image.
  • Then use a 2D CNN to extract features and a detection head to learn the detections.
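
The pillar idea can be sketched by binning points on the x-y plane only. This is a heavily simplified illustration: PointPillars learns the per-pillar features with a small network, whereas this sketch assumes two hand-picked features (point count and maximum height); the function `pillarize`, the ranges, and the cell size are hypothetical.

```python
def pillarize(points, x_range=(0.0, 4.0), y_range=(0.0, 4.0), cell=1.0):
    """Bin (x, y, z) points into vertical pillars on the x-y plane.

    Returns two 2D grids that together form a 2-channel pseudo image:
    the number of points per pillar and the maximum z per pillar.
    """
    nx = int((x_range[1] - x_range[0]) / cell)
    ny = int((y_range[1] - y_range[0]) / cell)
    count = [[0] * nx for _ in range(ny)]
    max_z = [[0.0] * nx for _ in range(ny)]
    for x, y, z in points:
        in_x = x_range[0] <= x < x_range[1]
        in_y = y_range[0] <= y < y_range[1]
        if not (in_x and in_y):
            continue  # drop points outside the detection range
        col = int((x - x_range[0]) / cell)
        row = int((y - y_range[0]) / cell)
        if count[row][col] == 0 or z > max_z[row][col]:
            max_z[row][col] = z
        count[row][col] += 1
    return count, max_z


POINTS = [(0.5, 0.5, 0.1), (0.6, 0.4, 0.3), (3.2, 1.7, 1.0), (9.0, 9.0, 0.0)]
COUNT, MAX_Z = pillarize(POINTS)
```

Because the result is a dense 2D grid, it can be fed to an ordinary 2D CNN backbone even though most pillars are empty.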

23 of 38

Recent Trends in CV

24 of 38

GAN-based Methods

  • Generative Adversarial Nets
  • Training a Generator and a Discriminator at the same time.
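
The simultaneous training of the two networks is commonly written as the minimax objective from the original GAN paper. The symbols are assumptions taken from that paper, not from these slides: G is the generator, D the discriminator, x a real sample, and z a latent vector drawn from a prior.

```latex
\min_{G} \max_{D} \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)} \left[ \log D(x) \right]
  + \mathbb{E}_{z \sim p_{z}(z)} \left[ \log \left( 1 - D(G(z)) \right) \right]
```

The discriminator tries to maximize V by telling real samples from generated ones, while the generator tries to minimize it by fooling the discriminator.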

Latent Space

25 of 38

ForkGAN: Seeing into the Rainy Night (ECCV 2020)

E: encoder

G: generator

D: discriminator

L: loss function

z: latent space

26 of 38

ForkGAN: Seeing into the Rainy Night

27 of 38

Transformers

Attention Calculation (Repeat for every patch)

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ICLR 2021)

DETR: End-to-End Object Detection with Transformers (ECCV 2020)

  • Transformers can outperform CNNs, particularly when trained on large datasets.
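
The attention calculation that is repeated for every patch can be sketched as scaled dot-product attention. This is a minimal illustration, assuming the patch embeddings are used directly as queries, keys, and values; a real transformer first applies learned linear projections, which are omitted here. The function names and the toy `PATCHES` are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(keys[0])
    outputs = []
    for q in queries:  # repeat for every patch
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs


# Three toy patch embeddings of dimension 2.
PATCHES = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ATTENDED = attention(PATCHES, PATCHES, PATCHES)
```

Each output row is a weighted average of all patch values, so every patch can draw on information from the whole image in a single layer.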

28 of 38

Multi-camera Fusion

Tesla AI Day 2021

29 of 38

Network Deployment

30 of 38

Network Deployment

  • When deploying the network, we want to optimize its size and speed.
  • Common techniques include:
    • Pruning
    • Quantization
    • Platform Optimizations
  • However, robustness and safety may be concerns.

Pruning does not necessarily reduce accuracy

FP32 vs. INT8

31 of 38

Network Pruning

  • Pruning removes redundant neurons.
  • Static pruning: performed offline, prior to inference.
  • Dynamic pruning: performed at runtime.
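
One simple form of static pruning can be sketched as magnitude pruning, which zeroes the weights with the smallest absolute values. This is an assumed example technique, not necessarily the method the slides have in mind; the function `magnitude_prune`, the toy `WEIGHTS`, and the 50% sparsity target are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out roughly the given fraction of smallest-magnitude weights.

    Ties at the threshold are all pruned, so slightly more than the
    requested fraction may be removed.
    """
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(len(flat) * sparsity)
    if k == 0:
        return [row[:] for row in weights]
    threshold = flat[k - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row]
            for row in weights]


# Toy 2x3 weight matrix of one layer.
WEIGHTS = [[0.9, -0.05, 0.3],
           [0.01, -0.7, 0.2]]

PRUNED = magnitude_prune(WEIGHTS, sparsity=0.5)
```

The zeros only save time and memory if the deployment platform can exploit sparsity, which is why pruning whole neurons or channels is often preferred.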

32 of 38

Network Quantization

Quantization

8-bit signed integer quantization of a floating-point tensor
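
A minimal sketch of 8-bit signed quantization, assuming a symmetric per-tensor scheme in which the largest absolute value maps to 127. Real toolchains may also use zero points, per-channel scales, and calibration data; the function names and the toy `TENSOR` are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to the int8 range."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map int8 values back to approximate floats."""
    return [q * scale for q in quantized]


TENSOR = [-1.0, -0.4, 0.0, 0.3, 1.0]
QUANTIZED, SCALE = quantize_int8(TENSOR)
RESTORED = dequantize(QUANTIZED, SCALE)
```

The round trip is lossy: each restored value can differ from the original by up to half a quantization step, which is the accuracy cost paid for the smaller and faster INT8 representation.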

33 of 38

Deployment Platforms

CPUs

GPUs

Field Programmable Gate Arrays (FPGAs)

Mobile SoCs or other ASICs

34 of 38

Deployment Platforms

35 of 38

Platform Optimizations

  • Lookup Table
  • Memory Optimization
    • Some chips have very limited memory bandwidth, which can become a performance bottleneck.
  • Special Hardware Optimization Libraries (e.g. cuDNN, OpenVINO)
    • These libraries may use special instructions on the chip, so inference can be much faster.
  • Open Neural Network Exchange (ONNX) is an open format for representing AI models, so that models written in a variety of frameworks can be exchanged between tools.

36 of 38

TensorRT

TensorRT on Jetson TX1

  • NVIDIA TensorRT is an SDK for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime.
  • A black-box optimizer, but it works great.
  • Easy to use.
  • Only on NVIDIA platforms.

37 of 38

TensorRT Engine Generation

TensorRT Optimizer

TensorRT Runtime

  • The serialized engine is platform-specific. You can’t use the generated engine on different hardware.
  • Supports fp16 and int8 quantization.
  • Supports changing the input dimensions at runtime, called dynamic shapes.

38 of 38

References