1 of 38

Portions of these slides are from Penn Course CIS680

F1TENTH Autonomous Racing

Vision II: Deep Learning Methods

Zirui Zang and the F1TENTH Team

Contact: Rahul Mangharam <rahulm@seas.upenn.edu>

2 of 38

Vision Module Overview

Lecture 1 : Classical Methods

Vision Hardware
Accessing Camera on Linux
Camera Model & Pose Estimation
Useful OpenCV Functions
Visual SLAM

Lecture 2 : DL Methods

Deep Learning Basics
Object Detection w/ Image - YOLO
Object Detection w/ Pointcloud - PointPillars
Recent Trend in CV
Network Deployment

3 of 38

Deep Learning Basics

4 of 38

Deep Learning in Autonomous Driving

Optical Flow (Yang Wang et al.)

Semantic Segmentation / Drivable Surface

3D Object Detection from Monocular Camera

Lidar point cloud detection

5 of 38

Deep Learning in Autonomous Driving

Night to Day (ForkGAN)

Dehaze (Cameron Hodges et al.)

Depth from Monocular Camera

Monodepth2

6 of 38

From Classical to DL

It all comes down to feature extraction from the data.
We can use different kernels to detect different features.
Instead of corners and edges, can we work with higher-level features?
Do we have to design the kernels by hand?
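As a reminder of what a hand-designed kernel looks like, here is a minimal NumPy sketch that applies the classical Sobel kernels to a toy image. The helper name `conv2d` and the toy image are illustrative assumptions, not part of the lecture code.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D cross-correlation (what DL libraries call convolution)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Hand-designed Sobel kernels: responses to horizontal / vertical intensity change.
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
sobel_y = sobel_x.T

# Toy image: dark left half, bright right half (one vertical edge).
image = np.zeros((6, 6))
image[:, 3:] = 1.0

edges_x = conv2d(image, sobel_x)  # strong response along the vertical edge
edges_y = conv2d(image, sobel_y)  # no response: the image has no horizontal edge
```

A CNN keeps this sliding-window operation but learns the kernel values from data instead of fixing them by hand.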

7 of 38

Feature Extraction - CNN

Use a hierarchy of convolution kernels to detect features at different levels.
Use nonlinear activation functions at each layer to capture nonlinear information.
Use machine learning methods to train the network for useful kernels.
At the end of the network, use a classifier to map features to outputs.
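The four points above can be sketched as a tiny forward pass: two convolution levels, a ReLU nonlinearity after each, and a linear classifier at the end. This is only an illustration under assumed sizes (14x14 input, 3x3 kernels, 3 classes) with random, untrained kernels; all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(x, k):
    """Valid-mode 2D cross-correlation of a single-channel image."""
    kh, kw = k.shape
    h, w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    return np.array([[np.sum(x[i:i + kh, j:j + kw] * k) for j in range(w)]
                     for i in range(h)])

def relu(x):
    return np.maximum(x, 0.0)

def max_pool2x2(x):
    """2x2 max pooling with stride 2 (drops an odd last row/column)."""
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    return x[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

image = rng.random((14, 14))
k1 = rng.standard_normal((3, 3))   # low-level kernel (random here; learned in training)
k2 = rng.standard_normal((3, 3))   # higher-level kernel acting on level-1 features
W = rng.standard_normal((3, 4))    # classifier: 4 features -> 3 class scores

level1 = max_pool2x2(relu(conv2d(image, k1)))    # 14x14 -> 12x12 -> 6x6
level2 = max_pool2x2(relu(conv2d(level1, k2)))   # 6x6 -> 4x4 -> 2x2
probs = softmax(W @ level2.ravel())              # class probabilities
```

Real networks use many kernels per layer and multi-channel inputs; the single-channel version here only shows how the stages chain together.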

CNNs

Fully-connected (MLP)

8 of 38

Common Detection Structure

The structure of the neural network determines its function.
For object detection and many other tasks, Backbone + Detection Heads is a common structure.

Data Preprocess:

Resize
Normalize
etc.

Feature Extraction (Backbone):

CNN
Skip Connections
Bottlenecks
etc.

Detection Heads:

Object Detection
Keypoint Detection
3D Bounding Box
etc.

Result Decode & Post Processing:

Confidence Threshold
Non-Maximum Suppression
etc.

9 of 38

Neural Network Training Pipeline

Prepare labeled data: the image/point cloud and the ground truth. The ground truth can be object class, object existence (for detection), size, or anything you want the network to learn.
Initialization: Start with random weights or a pretrained network.
Forward Propagation: Pass the input through the network with the current kernels, up to the last layer. Compare the result to the ground truth to calculate the loss.
Backward Propagation: Calculate the derivative of the loss with respect to every parameter in the network and adjust the parameters based on that.
Batch Normalization: Layer activations are normalized within each batch of training data, which keeps the adjustments made during back propagation well scaled. An important step for your network to actually learn something.
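The loop below is a minimal sketch of these steps on a toy problem. It assumes a logistic-regression classifier on synthetic 2D points instead of a CNN, uses mini-batch gradient descent, and leaves out batch normalization; every name and hyper-parameter here is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Prepare labeled data: 2D points, ground truth is 1 if x0 + x1 > 0.
X = rng.standard_normal((200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# Initialization: small random weights.
w = rng.standard_normal(2) * 0.01
b = 0.0
lr, batch_size = 0.5, 20

def forward(X, w, b):
    """Forward propagation: sigmoid of a linear layer."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

def loss_fn(p, y):
    """Binary cross-entropy between predictions and ground truth."""
    eps = 1e-12
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

initial_loss = loss_fn(forward(X, w, b), y)

for epoch in range(50):
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        p = forward(X[idx], w, b)                 # forward propagation
        grad_logit = (p - y[idx]) / len(idx)      # d(loss)/d(logit) for sigmoid + BCE
        grad_w = X[idx].T @ grad_logit            # backward propagation
        grad_b = grad_logit.sum()
        w -= lr * grad_w                          # network update
        b -= lr * grad_b

final_loss = loss_fn(forward(X, w, b), y)
accuracy = np.mean((forward(X, w, b) > 0.5) == (y == 1))
```

A deep network follows the same forward / loss / backward / update cycle; frameworks such as PyTorch compute the gradients automatically.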

10 of 38

Neural Network Training Pipeline

Data Preparation

Initialization

Forward Propagation

Backward Propagation

Network Update

Training steps are often repeated with:
Network design experiments

Add or remove a layer.
Change number of channels in a certain layer.
Add dropout or pooling layer.
Add skip connections.

Hyper-parameter tuning

Learning rate
Convolution kernel sizes, strides, etc.

11 of 38

Neural Network Training Pipeline

Training

Data Collection & Labeling & Augmentation

Network Design or Selection from Existing Designs

Network Deployment

Lifelong Updates

Training is only one part of the network’s life.
Data collection and labeling are the time-consuming part.
Network design is the technical part.
Network deployment is the consequential part.
Lifelong updates are the legal part.

12 of 38

Object Detection w/ Image

13 of 38

YOLO

YOLO is a single-stage object detection architecture published in 2015.
YOLO is fast, fairly accurate, and structurally very simple.
But it has issues: it produces many false positives and requires heavy post-processing.
It is old, but it is a good way to learn deep learning for computer vision.

14 of 38

YOLO Structure

Feature Extraction “Backbone”

“Detection Head”

7x7x30

7x7 windows, each window has 30 channels.

7x7 is just the output dimension from the convolutions.

Each window proposes 2 boxes, each with (x, y, w, h, confidence).

Then each window has 20 values for object classes.

So 2x5+20 = 30 channels.

15 of 38

Loss Function

YOLO loss function

(x, y) coordinate error

(w, h) size error

Class error, if an object exists in this grid cell

Confidence error, if no object exists in this grid cell

Confidence error, if an object exists in this grid cell
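For reference, the terms listed above correspond to the sum-squared loss of the original YOLO paper (Redmon et al.), which can be written as follows. The symbols are taken from that paper rather than from this slide: S = 7 is the grid size, B = 2 the boxes per cell, C the confidence, p(c) the class probabilities, the indicator 1^obj_ij selects the box responsible for an object, and the paper uses λ_coord = 5 and λ_noobj = 0.5.

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
      \mathbb{1}_{ij}^{\text{obj}}
      \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
  &+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
      \mathbb{1}_{ij}^{\text{obj}}
      \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2
           + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
  &+ \sum_{i=0}^{S^2} \sum_{j=0}^{B}
      \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
  &+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
      \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
  &+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}}
      \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```

The square roots on w and h make the same absolute size error count more for small boxes than for large ones.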

16 of 38

How YOLO detects.

Total is 7x7x30 tensor output.

7x7 grids from convolutions
2 of the [x, y, w, h, confidence] per grid.
20 classes

Detection is simple:

In each grid cell, if confidence > threshold, pick that [x, y, w, h, confidence] and the highest-scoring class.
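The detection rule can be sketched in a few lines of NumPy. The channel layout below (two boxes of 5 values first, then 20 class scores) follows the description on this slide; actual implementations may order the channels differently, and the function name `decode` and the threshold value are assumptions.

```python
import numpy as np

S, B, C = 7, 2, 20  # grid size, boxes per cell, number of classes

def decode(output, threshold=0.5):
    """Return (row, col, x, y, w, h, confidence, class_id) for confident boxes."""
    assert output.shape == (S, S, B * 5 + C)
    detections = []
    for row in range(S):
        for col in range(S):
            cell = output[row, col]
            # One set of class scores is shared by both boxes in the cell.
            class_id = int(np.argmax(cell[B * 5:]))
            for b in range(B):
                x, y, w, h, conf = cell[b * 5:(b + 1) * 5]
                if conf > threshold:
                    detections.append((row, col, float(x), float(y),
                                       float(w), float(h), float(conf), class_id))
    return detections

# Synthetic network output: one confident box in cell (3, 4) with class 7.
output = np.zeros((S, S, B * 5 + C))
output[3, 4, 0:5] = [0.5, 0.5, 0.2, 0.3, 0.9]
output[3, 4, B * 5 + 7] = 1.0

dets = decode(output)
```

Because the class scores are per cell rather than per box, two objects of different classes whose centers fall in the same cell cannot both be labeled correctly.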

What are the limitations?

17 of 38

Non-maximum Suppression

YOLO relies heavily on NMS.
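A minimal sketch of greedy non-maximum suppression is shown below, assuming boxes given as [x1, y1, x2, y2] corners and a single class. The function names and the 0.5 IoU threshold are illustrative choices.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop boxes that overlap it, repeat."""
    order = np.argsort(scores)[::-1]   # indices sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        if rest.size == 0:
            break
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)  # the second box overlaps the first and is suppressed
```

For multi-class detection, NMS is typically run separately per class so that overlapping objects of different classes are not suppressed.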

18 of 38

Development of Object Detection

https://paperswithcode.com/sota/object-detection-on-coco

19 of 38

Object Detection w/ Pointcloud

20 of 38

PointPillars

Object detection on point cloud data.
Point cloud data is an unordered list of (x, y, z) coordinates.

PointPillars: Fast Encoders for Object Detection from Point Clouds

21 of 38

Point Cloud Object Detection

Lidar point clouds are very sparse.
Can we just bin the points into 3D voxels and perform 3D convolutions, analogous to the 2D convolutions in YOLO?

Voxelization

Challenges

22 of 38

PointPillars Structure

Group the sparse point cloud into pillars.
Arrange the pillars into a pseudo image.
Still use a 2D CNN to extract features and a detection head to learn the detections.
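The first two steps can be sketched as follows. This is a simplification: PointPillars learns per-pillar features with a small PointNet-style encoder, whereas the sketch below substitutes three hand-picked statistics (point count, mean z, max z). The grid ranges, pillar size, and function name are assumptions for illustration.

```python
import numpy as np

def pillarize(points, x_range=(0.0, 8.0), y_range=(-4.0, 4.0), pillar_size=1.0):
    """Bin (x, y, z) points into vertical pillars on an x-y grid.

    Returns a pseudo image of shape (3, ny, nx) with channels
    [point count, mean z, max z]; empty pillars stay zero.
    """
    nx = int(round((x_range[1] - x_range[0]) / pillar_size))
    ny = int(round((y_range[1] - y_range[0]) / pillar_size))

    # Drop points outside the grid.
    mask = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    pts = points[mask]

    # Pillar index of every remaining point.
    ix = np.minimum(((pts[:, 0] - x_range[0]) / pillar_size).astype(int), nx - 1)
    iy = np.minimum(((pts[:, 1] - y_range[0]) / pillar_size).astype(int), ny - 1)

    pseudo_image = np.zeros((3, ny, nx))
    z_sum = np.zeros((ny, nx))
    z_max = np.full((ny, nx), -np.inf)
    np.add.at(pseudo_image[0], (iy, ix), 1.0)     # points per pillar
    np.add.at(z_sum, (iy, ix), pts[:, 2])
    np.maximum.at(z_max, (iy, ix), pts[:, 2])

    occupied = pseudo_image[0] > 0
    pseudo_image[1][occupied] = z_sum[occupied] / pseudo_image[0][occupied]
    pseudo_image[2][occupied] = z_max[occupied]
    return pseudo_image

points = np.array([[0.5, -3.5, 0.2],
                   [0.6, -3.4, 0.4],
                   [7.5, 3.5, 1.0],
                   [100.0, 0.0, 0.0]])   # last point is out of range
pseudo_image = pillarize(points)
```

The resulting (channels, height, width) array has the same layout as an image, which is why an ordinary 2D CNN backbone can process it.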

23 of 38

Recent Trend in CV

24 of 38

GAN-based Methods

Generative Adversarial Nets
Training a Generator and a Discriminator at the same time.
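The simultaneous training can be stated as the two-player minimax objective from the original GAN paper (Goodfellow et al., 2014). The notation is from that paper, not from this slide: D(x) is the discriminator's estimate that x is real, G(z) is the generator's output for a latent sample z, and p_data and p_z are the data and latent distributions.

```latex
\min_{G} \max_{D} \; V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}(x)} \left[ \log D(x) \right]
  + \mathbb{E}_{z \sim p_{z}(z)} \left[ \log \left( 1 - D(G(z)) \right) \right]
```

The discriminator is updated to increase V while the generator is updated to decrease it, usually in alternating gradient steps.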

Latent Space

25 of 38

ForkGAN: Seeing into the Rainy Night (ECCV 2020)

E: encoder

G: generator

D: discriminator

L: loss function

z: latent space

26 of 38

ForkGAN: Seeing into the Rainy Night

E: encoder

G: generator

D: discriminator

L: loss function

z: latent space

27 of 38

Transformers

Attention Calculation (Repeat for every patch)
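A minimal sketch of the attention calculation over image patches, assuming single-head scaled dot-product self-attention as used in Transformer models. The patch count, embedding sizes, and random projection matrices are placeholders; in a trained model the projections are learned.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(patches, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    patches: (num_patches, embed_dim) patch embeddings.
    Returns the attended features and the attention weights.
    """
    Q, K, V = patches @ Wq, patches @ Wk, patches @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # similarity of every patch pair
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
num_patches, embed_dim, head_dim = 16, 32, 8
patches = rng.standard_normal((num_patches, embed_dim))
Wq, Wk, Wv = (rng.standard_normal((embed_dim, head_dim)) for _ in range(3))

attended, weights = self_attention(patches, Wq, Wk, Wv)
```

The matrix form computes the attention for every patch at once; row i of `weights` is the calculation repeated for patch i.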

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ICLR 2021)

DETR: End-to-End Object Detection with Transformers (ECCV 2020)

Transformers can outperform CNNs, particularly when pretrained on large datasets.

28 of 38

Multi-camera Fusion

Tesla AI Day 2021

29 of 38

Network Deployment

30 of 38

Network Deployment

When deploying the network, we want to optimize its size and speed.
Common techniques include:

Pruning
Quantization
Platform Optimizations

However, robustness and safety may be concerns.

Pruning does not necessarily reduce accuracy

FP32 vs. INT8

31 of 38

Network Pruning

Pruning removes redundant neurons or weights.
Static pruning: performed offline, prior to inference.
Dynamic pruning: performed at runtime.
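As one simple example of static pruning, the sketch below zeroes the smallest-magnitude weights of a layer offline. Magnitude is only one possible redundancy criterion and is an assumption here, as are the function name and the 50% sparsity target.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out roughly the `sparsity` fraction of weights with smallest |value|.

    Returns the pruned weights and the boolean keep-mask.
    Ties at the threshold are pruned too, so slightly more may be removed.
    """
    k = int(round(sparsity * weights.size))
    if k == 0:
        return weights.copy(), np.ones(weights.shape, dtype=bool)
    threshold = np.sort(np.abs(weights), axis=None)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask, mask

weights = np.array([[0.5, -0.05, 0.3],
                    [0.01, -0.8, 0.02]])
pruned, mask = magnitude_prune(weights, sparsity=0.5)
```

In practice the pruned network is usually fine-tuned for a few epochs with the mask held fixed, which is one reason accuracy often does not drop much.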

32 of 38

Network Quantization

Quantization

8-bit signed integer quantization of a floating-point tensor
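A minimal sketch of this idea, assuming symmetric per-tensor quantization with a single scale factor and no zero point. Deployment toolkits may use asymmetric or per-channel schemes instead; the function names are illustrative.

```python
import numpy as np

def quantize_int8(x):
    """Map a float tensor to int8 using one symmetric scale for the whole tensor."""
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float tensor from the int8 values."""
    return q.astype(np.float32) * scale

x = np.array([-1.0, -0.5, 0.0, 0.25, 1.0], dtype=np.float32)
q, scale = quantize_int8(x)
x_hat = dequantize(q, scale)
max_error = float(np.max(np.abs(x - x_hat)))  # bounded by about half a quantization step
```

Each value now takes 1 byte instead of 4, at the cost of a rounding error of at most about `scale / 2` per element.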

33 of 38

Deployment Platforms

CPUs

GPUs

Field Programmable Gate Arrays (FPGAs)

Mobile SoCs or other ASICs

34 of 38

Deployment Platforms

35 of 38

Platform Optimizations

Lookup Table
Memory Optimization

Some chips have very limited memory bandwidth, which can become a performance bottleneck.

Special Hardware Optimization Libraries (e.g. cuDNN, OpenVINO)

These libraries may use special instructions on the chip, so inference can be much faster.

Open Neural Network Exchange (ONNX) is an open-source format for representing AI models, so that models written in a variety of frameworks can be exchanged and parsed by other tools.

36 of 38

TensorRT

TensorRT on Jetson TX1

NVIDIA TensorRT is an SDK for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime.
Black box optimizer but works great.
Easy to use.
Only on NVIDIA platforms.

37 of 38

TensorRT Engine Generation

TensorRT Optimizer

TensorRT Runtime

The serialized engine is platform-specific. You can’t use the generated engine on different hardware.
Supports FP16 and INT8 quantization.
Supports changing the input dimensions at runtime, called dynamic shapes.
