1 of 45

Zirui Zang and the F1TENTH Team

Contact: Rahul Mangharam <rahulm@seas.upenn.edu>

Portions of these slides are from Penn Course MEAM620, CIS680

F1TENTH Autonomous Racing

Vision I : Classical Methods

2 of 45

Vision Module Overview

Lecture 1 : Classical Methods

  • Vision Hardware
  • Accessing Camera on Linux
  • Camera Model & Distance Estimation
  • Useful OpenCV Functions
  • Visual SLAM

Lecture II : DL Methods

  • Deep Learning Basics
  • Object Detection w/ Image
  • Object Detection w/ Pointcloud
  • Recent Trends in DL
  • Network Deployment

3 of 45

Vision Hardware

4 of 45

Vision Sensors - Cameras

  • RGB Camera
  • Event-based Camera
  • Depth Camera
    • Stereo Camera
    • Structured Light Camera
    • Active Stereo (Combining both)

5 of 45

Vision Sensors - Lidar

  • Dense 3D Lidar can be used as a vision sensor
    • xyz coordinates + intensity
  • Mechanical Lidar vs. Solid State Lidar
    • Moving parts, Scanning speed, Beam density

“Image” from one dense 3D lidar scan

6 of 45

Intel Realsense D435i

  • Active depth up to 3m.
  • Output image, depth map and point cloud.
  • Supports most libraries, including OpenCV and ROS2.

7 of 45

Let’s read its spec sheet.

And see why they provide these specs.

8 of 45

9 of 45

Camera Spec - Shutter Type

  • Shutter Type
    • Rolling: Cheaper
    • Global: Better for high-speed objects

Rolling

Global

10 of 45

11 of 45

Camera Spec - Sensor Size & Pixel Size

  • Sensor Size & Pixel Size
    • Bigger is better, more light gathering, more information.

iPhone 12 Pro Max marketing, but it’s true

12 of 45

13 of 45

Camera Spec - FOV

  • Field of View (FOV)
    • FOV = 2 × arctan(sensor_size / (2 × focal_length))

14 of 45

15 of 45

Camera Spec - Connector

  • Connector
    • A normal webcam uses USB 2.0, but why does the Realsense use USB 3.1?

Camera Serial Interface (CSI) Connector

  • 10 Gbps / 4 lanes
  • Dedicated bandwidth
  • Direct connection to Image Signal Processor (ISP)

Ethernet

  • Able to use Ethernet protocol
  • Usually only 1Gbps

USB

  • Easy to use
  • Shared bandwidth of 5Gbps (USB 3.1)
  • No direct connection to Image Signal Processor (ISP)

CSI Camera

16 of 45

Image Signal Processor (ISP)

  • A specialized chip for image processing, and an increasingly important part of an SoC.
  • An ISP can handle image denoising, resizing, transforms, filtering, auto-exposure, auto-white-balance, etc. faster than a CPU or GPU.
  • As for our Realsense, there is an onboard SoC.

The new Snapdragon 8 Gen 1 chipset offers an 18-bit triple-ISP design.

17 of 45

Accessing Camera on Linux

18 of 45

ls /dev/video*

  • In Linux, every device is exposed as a file, and cameras are no exception.
  • After you connect your camera, you can use ls /dev/video* to check whether the system has recognized it.
  • For example, here is the Realsense D435i recognized by a Jetson NX:

(Screenshot: the Realsense appears as several /dev/video nodes: the 16-bit depth map, IR image, and RGB image streams.)

19 of 45

v4l2src

  • Video4Linux (v4l2) is the API for accessing cameras on Linux.
  • v4l2src is a GStreamer source element that captures video from v4l2 devices.
  • You can give the resulting pipeline to an OpenCV VideoCapture object.
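A minimal sketch of that pipeline handed to OpenCV. The device node /dev/video2 and the 960x540 @ 60 fps mode are assumptions, and this requires an OpenCV build with GStreamer support:

import cv2

# v4l2src grabs frames from the device; appsink hands them to OpenCV as BGR images.
pipeline = (
    "v4l2src device=/dev/video2 ! "
    "video/x-raw, width=960, height=540, framerate=60/1 ! "
    "videoconvert ! video/x-raw, format=BGR ! appsink"
)
cap = cv2.VideoCapture(pipeline, cv2.CAP_GSTREAMER)

while cap.isOpened():
    ok, frame = cap.read()          # frame is a numpy array (H x W x 3, BGR)
    if not ok:
        break
    cv2.imshow("v4l2src", frame)
    if cv2.waitKey(1) == ord("q"):
        break
cap.release()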

20 of 45

v4l2-ctl -d /dev/video2 --list-ctrls-menus

  • v4l2-ctl can be used to control the camera on Linux.
  • OpenCV also has an API for controlling the camera. Most cameras support either v4l2 or OpenCV control, if not both.
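A small sketch of the OpenCV route, using VideoCapture properties. Which controls are actually honored depends on the camera driver; the device index and values here are assumptions:

import cv2

cap = cv2.VideoCapture(2)                       # e.g. /dev/video2
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 960)          # request a 960x540 stream
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 540)
cap.set(cv2.CAP_PROP_FPS, 60)
cap.set(cv2.CAP_PROP_AUTO_EXPOSURE, 1)          # driver-specific: often 1 = manual mode
cap.set(cv2.CAP_PROP_EXPOSURE, 100)             # raw driver units, not milliseconds
print(cap.get(cv2.CAP_PROP_FPS))                # read back what the driver accepted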

21 of 45

v4l2-ctl -d /dev/video2 --list-formats-ext

  • --list-formats-ext shows all the operating modes of the camera.
  • Note that if we want the Realsense camera to operate at 60 Hz, it only supports 960x540.

22 of 45

Camera Model & Calibration

23 of 45

Image - Camera - World

  • Goal: We need to find the transformations between image pixel coordinates, camera coordinates, and world coordinates.
    • Camera intrinsics: image pixel coordinates ↔ camera coordinates (up to a scale factor)
    • Camera extrinsics: camera coordinates ↔ world coordinates
  • We will get the full projection from world coordinates to image pixels:
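For reference, in the usual pinhole-camera notation (intrinsic matrix K, extrinsics [R | t], scale factor s), the combined world-to-pixel projection the slide builds up to is:

$$ s\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K\,[\,R \mid t\,]\begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix} $$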

24 of 45

Camera Intrinsics

  • From trigonometry (similar triangles) we can find the pinhole projection.
  • Write it in matrix form and add an image-center shift term, because the image center may not lie on the principal axis:

Focal length (pixels)

Image Center Shift (pixels)
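In the standard form this slide refers to, with focal lengths f_x, f_y and image-center shifts c_x, c_y, all in pixels:

$$ s\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \underbrace{\begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}}_{K}\begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} $$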

25 of 45

Camera Extrinsics

  • The camera extrinsics are a homogeneous transform between the camera frame and the world frame, both of which are 3D coordinate systems.
  • They consist of a rotation and a translation.
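Written out, using the common convention that maps world coordinates into the camera frame with rotation R and translation t:

$$ \begin{bmatrix} X_c \\ Y_c \\ Z_c \\ 1 \end{bmatrix} = \begin{bmatrix} R & t \\ \mathbf{0}^\top & 1 \end{bmatrix}\begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix} $$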

26 of 45

Intrinsics & Extrinsics

  • But how do we calculate them for an actual camera?

Here X and Y are flipped, but it’s equivalent.

27 of 45

Get Intrinsics - Camera Calibration

  • We can use the OpenCV functions findChessboardCorners and calibrateCamera to do the calibration.
  • It will calculate not only the camera matrix K, but also the distortion coefficients used to undistort the image.
  • Tutorial, OpenCV documentation.

Camera Matrix
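A condensed sketch of that calibration loop. The chessboard size and the image folder are assumptions to adapt:

import glob
import cv2
import numpy as np

pattern = (9, 6)                                   # assumed inner-corner count of the board
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)   # board coords, square units

obj_points, img_points, img_size = [], [], None
for fname in glob.glob("calib/*.png"):             # assumed folder of checkerboard photos
    gray = cv2.cvtColor(cv2.imread(fname), cv2.COLOR_BGR2GRAY)
    img_size = gray.shape[::-1]
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# K is the camera matrix; dist holds the distortion coefficients.
ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(obj_points, img_points, img_size, None, None)
undistorted = cv2.undistort(cv2.imread(fname), K, dist)   # correct one image for distortion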

28 of 45

Get Extrinsics - Computing a Transformation

  • The extrinsics can be calculated with:
    • Find enough corresponding points in camera and world frames;
    • Solve the linear equation for the homography;
    • Separate out rotation and translation using singular value decomposition.
  • However, we don’t race in the world frame. We race in the car’s frame.
  • We can directly work with the camera frame.
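If the corresponding points are already in hand, OpenCV's solvePnP wraps the same rotation/translation recovery in one call. A sketch with placeholder intrinsics and correspondences:

import cv2
import numpy as np

# Placeholder intrinsics (from calibration) and four assumed 3D-2D correspondences.
K = np.array([[600., 0., 480.], [0., 600., 270.], [0., 0., 1.]])
dist = np.zeros(5)                                   # assume negligible distortion here
world_pts = np.array([[0., 0., 0.], [1., 0., 0.], [1., 1., 0.], [0., 1., 0.]])
pixel_pts = np.array([[310., 400.], [620., 405.], [600., 250.], [330., 245.]])

ok, rvec, tvec = cv2.solvePnP(world_pts, pixel_pts, K, dist)
R, _ = cv2.Rodrigues(rvec)                            # rotation matrix; tvec is the translation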

Camera - Image Frame

29 of 45

Distance Estimation

  • We can get the camera matrix from calibration.
  • Then, looking at the third row of the projection equation, we can see that the scale factor is simply the depth of the point along the camera's optical axis.

(Figure: car-frame axes Xcar, Ycar, Zcar drawn relative to the camera.)

  • If we mount the camera on the car without any roll-pitch-yaw, then the axes of the camera and car frames are aligned.

30 of 45

Distance Estimation

  • What if there are rotations in the camera mounting?
    • We can add more corresponding points to solve for any missing variables.
  • Hence, if the object is on the ground (its height above the ground is zero and the camera's mounting height is known), we only need one pair of corresponding points to find the remaining unknown.
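A minimal sketch of this idea, assuming a level camera (no roll-pitch-yaw) mounted at a known height above the ground and intrinsics from calibration; all numbers are placeholders:

import numpy as np

# Placeholder intrinsics (fx, fy, cx, cy from calibrateCamera) and mounting height in meters.
fx, fy, cx, cy = 600.0, 600.0, 480.0, 270.0
H = 0.12   # assumed camera height above the ground

def ground_point_in_car_frame(u, v):
    """Back-project a pixel of a point on the ground into the car frame.

    Camera frame: x right, y down, z forward. Car frame: x forward, y left."""
    z_cam = fy * H / (v - cy)          # depth from the third-row / ground-plane constraint
    x_cam = (u - cx) * z_cam / fx      # lateral offset in the camera frame
    return np.array([z_cam, -x_cam])   # (x_car forward, y_car left)

print(ground_point_in_car_frame(500, 350))   # e.g. a pixel below the image center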

31 of 45

Detection with OpenCV

Useful for doing the vision lab.

32 of 45

Edge & Blob Detection

cv2.Canny

detector = cv2.SimpleBlobDetector_create()

keypoints = detector.detect(img)

Parameters for the SimpleBlobDetector
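A hedged sketch of both calls together; the input file, thresholds, and blob parameters are placeholder values to tune:

import cv2

img = cv2.imread("frame.png")                      # placeholder input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Edge detection: the two values are the hysteresis thresholds.
edges = cv2.Canny(gray, 100, 200)

# Blob detection: filter blobs by area via the params struct.
params = cv2.SimpleBlobDetector_Params()
params.filterByArea = True
params.minArea = 50
detector = cv2.SimpleBlobDetector_create(params)
keypoints = detector.detect(gray)
vis = cv2.drawKeypoints(gray, keypoints, None, (0, 0, 255),
                        cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)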

33 of 45

HSV color space

  • Convert (red, green, blue)
  • To (hue, saturation, value)
  • value is also called intensity or brightness.
  • HSV is sometimes more useful than RGB when doing color-based filtering.
  • We can use cv2.cvtColor to convert between color spaces.

HSV color space

Hue value

34 of 45

Example: lawn detection

  • Suppose you are an engineer in an autonomous lawn mower startup.
  • The investor is coming to see the prototype tomorrow. There is no time to train a neural network.
  • You are given the task of detecting the boundary of a lawn with classical methods.
  • It seems difficult to do in RGB space, so we can try the HSV color space.

35 of 45

Used functions, one group per processing step:

  • cv2.GaussianBlur
  • cv2.cvtColor, cv2.threshold
  • cv2.connectedComponentsWithStats
  • cv2.dilate, cv2.erode
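A minimal sketch of such a pipeline, assuming a green lawn; the file name and HSV bounds are placeholders to tune, and cv2.inRange stands in here for the threshold step as its HSV-band equivalent:

import cv2
import numpy as np

img = cv2.imread("lawn.png")                        # placeholder input
blur = cv2.GaussianBlur(img, (5, 5), 0)             # suppress pixel noise
hsv = cv2.cvtColor(blur, cv2.COLOR_BGR2HSV)         # work in HSV

# Keep pixels whose hue falls in an assumed "grass green" band.
mask = cv2.inRange(hsv, (35, 40, 40), (85, 255, 255))

# Clean up the mask: close small holes, remove speckles.
kernel = np.ones((5, 5), np.uint8)
mask = cv2.dilate(mask, kernel)
mask = cv2.erode(mask, kernel)

# Keep only the largest connected component as the lawn.
n, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
largest = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])    # skip background label 0
lawn = (labels == largest).astype(np.uint8) * 255

# The lawn boundary is the contour of that component.
contours, _ = cv2.findContours(lawn, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)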

36 of 45

Feature Extraction

  • To detect an object in an image, we always need to find features.
  • What is a feature?
  • AB: flat surfaces in some channels.
  • CD: certain edges
  • EF: certain corners or curves.
  • We need special filters to respond to them.

37 of 45

2D Convolution

  • The weights g are often called a kernel or a filter.
  • Kernels with different weights will respond differently to features.
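A small sketch of 2D convolution with a hand-made kernel; the Sobel-like vertical-edge filter and the input file are assumed examples:

import cv2
import numpy as np

img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)   # placeholder input

# A 3x3 kernel that responds to vertical edges (Sobel-like weights).
g = np.array([[-1, 0, 1],
              [-2, 0, 2],
              [-1, 0, 1]], dtype=np.float32)

response = cv2.filter2D(img, cv2.CV_32F, g)   # float output keeps negative responses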

38 of 45

Some Classical Feature Extraction Methods

Shi-Tomasi Corner Detector:

cv2.goodFeaturesToTrack()

SIFT (Scale-Invariant Feature Transform):

sift = cv2.SIFT_create()

kp = sift.detect(img, None)

ORB (Oriented FAST and Rotated BRIEF):

orb = cv2.ORB_create()

kp = orb.detect(img, None)

kp, des = orb.compute(img, kp)

By matching these features between frames, we can track objects classically.
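A short sketch of matching ORB features between two frames, the building block of that classical tracking; the frame file names are placeholders:

import cv2

img1 = cv2.imread("frame_t0.png", cv2.IMREAD_GRAYSCALE)   # placeholder frames
img2 = cv2.imread("frame_t1.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create()
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Hamming distance suits ORB's binary descriptors; crossCheck keeps mutual best matches.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

vis = cv2.drawMatches(img1, kp1, img2, kp2, matches[:30], None)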

39 of 45

Put it together

  • We can find unique features in images and match them between frames.
  • And we can find their coordinates in the world frame with the camera model.
  • Can we use them as landmarks in SLAM?

https://introlab.github.io/rtabmap/

40 of 45

Visual SLAM

41 of 45

Visual SLAM Framework

  • Feature Matching: Extract and match visual features between camera frames.
  • Visual Odometry: Calculate the motion between consecutive camera frames.
  • Bundle Adjustment: Correct the transformations with historical data.
  • Map Building: Build the map based on the measurements and corrections.
  • Loop Closure: Detect when the path returns to a previously visited point and correct the accumulated drift error.

ORB-SLAM2

42 of 45

Bundle Adjustment

  • Optimizing the camera projections by considering all points and cameras.
  • Starting from an initial guess.
  • Treating the camera at different times as different cameras.
  • It is a non-linear optimization over rotations, distortion correction, etc.
  • Statistically optimal, but expensive.

Consider many points, then many points and many cameras at once. The cost being minimized is

$$ \min_{\{C_j\},\,\{X_i\}} \; \sum_{i}\sum_{j} w_{ij}\,\big\lVert P(C_j, X_i) - x_{ij} \big\rVert^2 $$

where $x_{ij}$ is the observed pixel location of point $i$ in camera $j$, the projection $P$ is arbitrary, and $w_{ij}$ is 0 when point $i$ is not seen by camera $j$.

43 of 45

Challenges in Visual SLAM

  • Exponentially increasing number of feature points to match.
  • Bundle adjustment is especially expensive.
  • Can it work on a monocular camera without an IMU?
  • Can the algorithm handle drifts?
  • Can it close the loop?

44 of 45

ORB-SLAM

  • 3 parallel threads: Tracking, Local Mapping and Loop Closing.
  • Keyframe selection, with culling policies for keyframes and map points.
  • Separation of local and global information with the Covisibility Graph and the Essential Graph.
  • Uses Bag of Words for place recognition in loop closure.

45 of 45

References