1 of 45

Zirui Zang and the F1TENTH Team

Contact: Rahul Mangharam <rahulm@seas.upenn.edu>

Portions of these slides are from Penn Course MEAM620, CIS680

F1TENTH Autonomous Racing

Vision I : Classical Methods

2 of 45

Vision Module Overview

Lecture 1 : Classical Methods

  • Vision Hardware
  • Accessing Camera on Linux
  • Camera Model & Distance Estimation
  • Useful OpenCV Functions
  • Visual SLAM

Lecture II : DL Methods

  • Deep Learning Basics
  • Object Detection w/ Image
  • Object Detection w/ Pointcloud
  • Recent Trends in DL
  • Network Deployment

3 of 45

Vision Hardware

4 of 45

Vision Sensors - Cameras

  • RGB Camera
  • Event-based Camera
  • Depth Camera
    • Stereo Camera
    • Structured Light Camera
    • Active Stereo (Combining both)

5 of 45

Vision Sensors - Lidar

  • Dense 3D Lidar can be used as a vision sensor
    • xyz coordinates + intensity
  • Mechanical Lidar vs. Solid State Lidar
    • Moving parts, Scanning speed, Beam density

“Image” from one dense 3D lidar scan

6 of 45

Intel Realsense D435i

  • Active depth up to 3m.
  • Output image, depth map and point cloud.
  • Supports most libraries, including OpenCV and ROS2.

7 of 45

Let’s read its spec sheet.

And see why they provide these specs.

8 of 45

9 of 45

Camera Spec - Shutter Type

  • Shutter Type
    • Rolling: Cheaper
    • Global: Better for high-speed objects

Rolling

Global

10 of 45

11 of 45

Camera Spec - Sensor Size & Pixel Size

  • Sensor Size & Pixel Size
    • Bigger is better, more light gathering, more information.

iPhone 12 Pro Max marketing, but it’s true

12 of 45

13 of 45

Camera Spec - FOV

  • Field of View (FOV)
    • FOV = 2 × arctan(sensor_size / (2 × focal_length))

14 of 45

15 of 45

Camera Spec - Connector

  • Connector
    • A normal webcam uses USB 2.0, but why does the Realsense use USB 3.1?

Camera Serial Interface (CSI) Connector

  • 10 Gbps / 4 lanes
  • Dedicated bandwidth
  • Direct connection to Image Signal Processor (ISP)

Ethernet

  • Able to use Ethernet protocol
  • Usually only 1Gbps

USB

  • Easy to use
  • Shared bandwidth of 5Gbps (USB 3.1)
  • No direct connection to Image Signal Processor (ISP)

CSI Camera

16 of 45

Image Signal Processor (ISP)

  • A specialized chip for image processing, and an increasingly important part of an SoC.
  • An ISP can handle image denoising, resizing, transforms, filtering, auto-exposure, auto-white-balance, etc. faster than a CPU or GPU.
  • As for our Realsense, there is an onboard SoC.

The new Snapdragon 8 Gen 1 chipset offers an 18-bit triple-ISP design.

17 of 45

Accessing Camera on Linux

18 of 45

ls /dev/video*

  • In Linux, every device is exposed as a file, and cameras are no exception.
  • After you connect your camera, you can use ls /dev/video* to check whether the system has recognized it.
  • For example, here is the Realsense D435i recognized by a Jetson NX:

(Screenshot: the Realsense appears as several /dev/video nodes: the 16-bit depth map, IR image, and RGB image streams.)

19 of 45

v4l2src

  • Video4Linux (v4l2) is the API for accessing cameras on Linux.
  • v4l2src is a GStreamer source element that captures video from v4l2 devices.
  • You can give the resulting pipeline to an OpenCV VideoCapture object.
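A minimal sketch of that pipeline handed to OpenCV. The device node /dev/video2 and the 960x540 @ 60 fps mode are assumptions, and this requires an OpenCV build with GStreamer support:

import cv2

# v4l2src grabs frames from the device; appsink hands them to OpenCV as BGR images.
pipeline = (
    "v4l2src device=/dev/video2 ! "
    "video/x-raw, width=960, height=540, framerate=60/1 ! "
    "videoconvert ! video/x-raw, format=BGR ! appsink"
)
cap = cv2.VideoCapture(pipeline, cv2.CAP_GSTREAMER)

while cap.isOpened():
    ok, frame = cap.read()          # frame is a numpy array (H x W x 3, BGR)
    if not ok:
        break
    cv2.imshow("v4l2src", frame)
    if cv2.waitKey(1) == ord("q"):
        break
cap.release()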

20 of 45

v4l2-ctl -d /dev/video2 --list-ctrls-menus

  • v4l2-ctl can be used to control the camera on Linux.
  • OpenCV also has an API for controlling the camera. Most cameras support either v4l2 or OpenCV control, if not both.
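A small sketch of the OpenCV route, using VideoCapture properties. Which controls are actually honored depends on the camera driver; the device index and values here are assumptions:

import cv2

cap = cv2.VideoCapture(2)                       # e.g. /dev/video2
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 960)          # request a 960x540 stream
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 540)
cap.set(cv2.CAP_PROP_FPS, 60)
cap.set(cv2.CAP_PROP_AUTO_EXPOSURE, 1)          # driver-specific: often 1 = manual mode
cap.set(cv2.CAP_PROP_EXPOSURE, 100)             # raw driver units, not milliseconds
print(cap.get(cv2.CAP_PROP_FPS))                # read back what the driver accepted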

21 of 45

v4l2-ctl -d /dev/video2 --list-formats-ext

  • --list-formats-ext shows all the operating modes of the camera.
  • Note that if we want the Realsense camera to operate at 60 Hz, it only supports 960x540.

22 of 45

Camera Model & Calibration

23 of 45

Image - Camera - World

  • Goal: We need to find the transformations between image pixel coordinates, camera coordinates, and world coordinates.
    • Camera intrinsics: image pixel coordinates ↔ camera coordinates (up to a scale factor)
    • Camera extrinsics: camera coordinates ↔ world coordinates
  • We will get the full projection from world coordinates to image pixels:
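For reference, in the usual pinhole-camera notation (intrinsic matrix K, extrinsics [R | t], scale factor s), the combined world-to-pixel projection the slide builds up to is:

$$ s\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K\,[\,R \mid t\,]\begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix} $$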

24 of 45

Camera Intrinsics

  • From trigonometry (similar triangles) we can find the pinhole projection.
  • Write it in matrix form and add an image-center shift term, because the image center may not lie on the principal axis:

Focal length (pixels)

Image Center Shift (pixels)
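In the standard form this slide refers to, with focal lengths f_x, f_y and image-center shifts c_x, c_y, all in pixels:

$$ s\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \underbrace{\begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}}_{K}\begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} $$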

25 of 45

Camera Extrinsics

  • The camera extrinsics are a homogeneous transform between the camera frame and the world frame, both of which are 3D coordinate systems.
  • They consist of a rotation and a translation.
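Written out, using the common convention that maps world coordinates into the camera frame with rotation R and translation t:

$$ \begin{bmatrix} X_c \\ Y_c \\ Z_c \\ 1 \end{bmatrix} = \begin{bmatrix} R & t \\ \mathbf{0}^\top & 1 \end{bmatrix}\begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix} $$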

26 of 45

Intrinsics & Extrinsics

  • But how do we calculate them for an actual camera?

Here X and Y are flipped, but it’s equivalent.

27 of 45

Get Intrinsics - Camera Calibration

  • We can use the OpenCV functions findChessboardCorners and calibrateCamera to do the calibration.
  • It will calculate not only the camera matrix K, but also the distortion coefficients used to undistort the image.
  • Tutorial, OpenCV documentation.

Camera Matrix
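A condensed sketch of that calibration loop. The chessboard size and the image folder are assumptions to adapt:

import glob
import cv2
import numpy as np

pattern = (9, 6)                                   # assumed inner-corner count of the board
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)   # board coords, square units

obj_points, img_points, img_size = [], [], None
for fname in glob.glob("calib/*.png"):             # assumed folder of checkerboard photos
    gray = cv2.cvtColor(cv2.imread(fname), cv2.COLOR_BGR2GRAY)
    img_size = gray.shape[::-1]
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# K is the camera matrix; dist holds the distortion coefficients.
ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(obj_points, img_points, img_size, None, None)
undistorted = cv2.undistort(cv2.imread(fname), K, dist)   # correct one image for distortion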

28 of 45

Get Extrinsics - Computing a Transformation

  • The extrinsics can be calculated with:
    • Find enough corresponding points in camera and world frames;
    • Solve the linear equation for the homography;
    • Separate out rotation and translation using singular value decomposition.
  • However, we don’t race in the world frame. We race in the car’s frame.
  • We can directly work with the camera frame.
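If the corresponding points are already in hand, OpenCV's solvePnP wraps the same rotation/translation recovery in one call. A sketch with placeholder intrinsics and correspondences:

import cv2
import numpy as np

# Placeholder intrinsics (from calibration) and four assumed 3D-2D correspondences.
K = np.array([[600., 0., 480.], [0., 600., 270.], [0., 0., 1.]])
dist = np.zeros(5)                                   # assume negligible distortion here
world_pts = np.array([[0., 0., 0.], [1., 0., 0.], [1., 1., 0.], [0., 1., 0.]])
pixel_pts = np.array([[310., 400.], [620., 405.], [600., 250.], [330., 245.]])

ok, rvec, tvec = cv2.solvePnP(world_pts, pixel_pts, K, dist)
R, _ = cv2.Rodrigues(rvec)                            # rotation matrix; tvec is the translation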

Camera - Image Frame

29 of 45

Distance Estimation

  • We can get the camera matrix from calibration.
  • Then, looking at the third row of the projection equation, we can see that the scale factor is simply the depth of the point along the camera's optical axis.

(Figure: car-frame axes Xcar, Ycar, Zcar drawn relative to the camera.)

  • If we mount the camera on the car without any roll-pitch-yaw, then the axes of the camera and car frames are aligned.

30 of 45

Distance Estimation

  • What if there are rotations in the camera mounting?
    • We can add more corresponding points to solve for any missing variables.
  • Hence, if the object is on the ground (its height above the ground is zero and the camera's mounting height is known), we only need one pair of corresponding points to find the remaining unknown.
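A minimal sketch of this idea, assuming a level camera (no roll-pitch-yaw) mounted at a known height above the ground and intrinsics from calibration; all numbers are placeholders:

import numpy as np

# Placeholder intrinsics (fx, fy, cx, cy from calibrateCamera) and mounting height in meters.
fx, fy, cx, cy = 600.0, 600.0, 480.0, 270.0
H = 0.12   # assumed camera height above the ground

def ground_point_in_car_frame(u, v):
    """Back-project a pixel of a point on the ground into the car frame.

    Camera frame: x right, y down, z forward. Car frame: x forward, y left."""
    z_cam = fy * H / (v - cy)          # depth from the third-row / ground-plane constraint
    x_cam = (u - cx) * z_cam / fx      # lateral offset in the camera frame
    return np.array([z_cam, -x_cam])   # (x_car forward, y_car left)

print(ground_point_in_car_frame(500, 350))   # e.g. a pixel below the image center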

31 of 45

Detection with OpenCV

Useful for doing the vision lab.

32 of 45

Edge & Blob Detection

cv2.Canny

detector = cv2.SimpleBlobDetector_create()

keypoints = detector.detect(img)

Parameters for the SimpleBlobDetector
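A hedged sketch of both calls together; the input file, thresholds, and blob parameters are placeholder values to tune:

import cv2

img = cv2.imread("frame.png")                      # placeholder input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Edge detection: the two values are the hysteresis thresholds.
edges = cv2.Canny(gray, 100, 200)

# Blob detection: filter blobs by area via the params struct.
params = cv2.SimpleBlobDetector_Params()
params.filterByArea = True
params.minArea = 50
detector = cv2.SimpleBlobDetector_create(params)
keypoints = detector.detect(gray)
vis = cv2.drawKeypoints(gray, keypoints, None, (0, 0, 255),
                        cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)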

33 of 45

HSV color space

  • Convert (red, green, blue)
  • To (hue, saturation, value)
  • value is also called intensity or brightness.
  • HSV is sometimes more useful than RGB when doing color-based filtering.
  • We can use cv2.cvtColor to convert between color spaces.

HSV color space

Hue value

34 of 45

Example: lawn detection

  • Suppose you are an engineer in an autonomous lawn mower startup.
  • The investor is coming to see the prototype tomorrow. There is no time to train a neural network.
  • You are given the task of detecting the boundary of a lawn with classical methods.
  • It seems difficult to do in RGB space, so we can try the HSV color space.

35 of 45

Used functions, one group per processing step:

  • cv2.GaussianBlur
  • cv2.cvtColor, cv2.threshold
  • cv2.connectedComponentsWithStats
  • cv2.dilate, cv2.erode
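A minimal sketch of such a pipeline, assuming a green lawn; the file name and HSV bounds are placeholders to tune, and cv2.inRange stands in here for the threshold step as its HSV-band equivalent:

import cv2
import numpy as np

img = cv2.imread("lawn.png")                        # placeholder input
blur = cv2.GaussianBlur(img, (5, 5), 0)             # suppress pixel noise
hsv = cv2.cvtColor(blur, cv2.COLOR_BGR2HSV)         # work in HSV

# Keep pixels whose hue falls in an assumed "grass green" band.
mask = cv2.inRange(hsv, (35, 40, 40), (85, 255, 255))

# Clean up the mask: close small holes, remove speckles.
kernel = np.ones((5, 5), np.uint8)
mask = cv2.dilate(mask, kernel)
mask = cv2.erode(mask, kernel)

# Keep only the largest connected component as the lawn.
n, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
largest = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])    # skip background label 0
lawn = (labels == largest).astype(np.uint8) * 255

# The lawn boundary is the contour of that component.
contours, _ = cv2.findContours(lawn, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)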

36 of 45

Feature Extraction

  • To detect an object in an image, we always need to find features.
  • What is a feature?
  • AB: flat surfaces in some channels.
  • CD: certain edges
  • EF: certain corners or curves.
  • We need special filters to respond to them.

37 of 45

2D Convolution

  • The weights g are often called a kernel or a filter.
  • Kernels with different weights will respond differently to features.
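A small sketch of 2D convolution with a hand-made kernel; the Sobel-like vertical-edge filter and the input file are assumed examples:

import cv2
import numpy as np

img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)   # placeholder input

# A 3x3 kernel that responds to vertical edges (Sobel-like weights).
g = np.array([[-1, 0, 1],
              [-2, 0, 2],
              [-1, 0, 1]], dtype=np.float32)

response = cv2.filter2D(img, cv2.CV_32F, g)   # float output keeps negative responses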

38 of 45

Some Classical Feature Extraction Methods

Shi-Tomasi Corner Detector:

cv2.goodFeaturesToTrack()

SIFT (Scale-Invariant Feature Transform):

sift = cv2.SIFT_create()

kp = sift.detect(img, None)

ORB (Oriented FAST and Rotated BRIEF):

orb = cv2.ORB_create()

kp = orb.detect(img, None)

kp, des = orb.compute(img, kp)

By matching these features between frames, we can track objects classically.
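A short sketch of matching ORB features between two frames, the building block of that classical tracking; the frame file names are placeholders:

import cv2

img1 = cv2.imread("frame_t0.png", cv2.IMREAD_GRAYSCALE)   # placeholder frames
img2 = cv2.imread("frame_t1.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create()
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Hamming distance suits ORB's binary descriptors; crossCheck keeps mutual best matches.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

vis = cv2.drawMatches(img1, kp1, img2, kp2, matches[:30], None)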

39 of 45

Put it together

  • We can find unique features in images and match them between frames.
  • And we can find their coordinates in the world frame with the camera model.
  • Can we use them as landmarks in SLAM?

https://introlab.github.io/rtabmap/

40 of 45

Visual SLAM

41 of 45

Visual SLAM Framework

  • Feature Matching: Extract and match visual features between camera frames.
  • Visual Odometry: Calculate the motion between consecutive camera frames.
  • Bundle Adjustment: Correct the transformations with historical data.
  • Map Building: Build the map based on the measurements and corrections.
  • Loop Closure: Detect when the path returns to a previously visited point and correct the accumulated drift error.

ORB-SLAM2

42 of 45

Bundle Adjustment

  • Optimizing the camera projections by considering all points and cameras.
  • Starting from an initial guess.
  • Treating the camera at different times as different cameras.
  • It is a non-linear optimization over rotations, distortion correction, etc.
  • Statistically optimal, but expensive.

Consider many points, then many points and many cameras at once. The cost being minimized is

$$ \min_{\{C_j\},\,\{X_i\}} \; \sum_{i}\sum_{j} w_{ij}\,\big\lVert P(C_j, X_i) - x_{ij} \big\rVert^2 $$

where $x_{ij}$ is the observed pixel location of point $i$ in camera $j$, the projection $P$ is arbitrary, and $w_{ij}$ is 0 when point $i$ is not seen by camera $j$.

43 of 45

Challenges in Visual SLAM

  • Exponentially increasing number of feature points to match.
  • Bundle adjustment is especially expensive.
  • Can it work on a monocular camera without an IMU?
  • Can the algorithm handle drifts?
  • Can it close the loop?

44 of 45

ORB-SLAM

  • 3 parallel threads: Tracking, Local Mapping and Loop Closing.
  • Keyframe selection, with culling policies for keyframes and map points.
  • Separation of local and global information with the Covisibility Graph and the Essential Graph.
  • Uses Bag of Words for place recognition in loop closure.

45 of 45

References