1 of 87

End-to-end Driving

Tambet Matiisen

7.09.2022

2 of 87

Modular Approach

Sensors → (camera image) → Perception → (detected objects) → Planning → (trajectory) → Control → (steering, gas and brake) → Actuators

3 of 87

End-to-End Approach

Sensors → (camera image) → Neural Network → (steering, gas and brake) → Actuators

4 of 87

Two approaches

Modular Approach

Pros:

  • Easier to test and analyze
  • Driverless solutions emerging

Cons:

  • Needs costly sensors (lidar) and HD maps
  • Impossible to write rules for all situations

End-to-End Approach

Pros:

  • Conceptually simpler
  • Low-cost sensors

Cons:

  • Hard to test and analyze
  • Needs a lot of training data and compute
  • Mostly highway lane following

Tesla

comma.ai

5 of 87

Agenda

  • Training methods
  • Network architectures
  • Evaluation methods
  • Interpretability
  • Safety
  • Companies

6 of 87

Agenda

  • Training methods
    • Imitation learning
    • Reinforcement learning
  • Network architectures
  • Evaluation methods
  • Interpretability
  • Safety
  • Companies

7 of 87

Agenda

  • Training methods
    • Imitation learning
    • Reinforcement learning
  • Network architectures
  • Evaluation methods
  • Interpretability
  • Safety
  • Companies

8 of 87

Imitation learning

  • Record a big dataset of camera images and driving commands
    • Driving commands: steering wheel angles & car speed
  • Train a neural network to predict driving commands from camera images
    • Normal supervised learning can be used, nothing fancy or complicated (a minimal training sketch follows below)
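
A minimal sketch of this supervised setup in PyTorch; the toy network and the (image, command) data loader are illustrative assumptions, not the exact pipeline of any system discussed here:

```python
import torch
import torch.nn as nn

# Toy convolutional policy: camera image in, [steering, speed] out.
class DrivingNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 24, 5, stride=2), nn.ReLU(),
            nn.Conv2d(24, 36, 5, stride=2), nn.ReLU(),
            nn.Conv2d(36, 48, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(48, 2)  # steering angle and speed

    def forward(self, x):
        return self.head(self.features(x))

def train(model, loader, epochs=10):
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.MSELoss()  # plain regression on the recorded commands
    for _ in range(epochs):
        for images, commands in loader:   # commands: [steering, speed]
            opt.zero_grad()
            loss = loss_fn(model(images), commands)
            loss.backward()
            opt.step()
```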

9 of 87

NVIDIA DAVE-2

10 of 87

Problem: distribution drift

  • A human driver always keeps the car in the center of the lane.
  • When the model drives the car, it
    • accumulates small prediction errors
    • and drifts away from the center.
  • When the car has drifted away from the center,
    • the network sees data it has never seen during training
    • and can behave unpredictably.

11 of 87

Distribution drift example

12 of 87

Common solutions to distribution drift

  • Use three cameras and adapt driving commands
  • Artificially generate off-the-track images
  • DAtaset AGGregation (DAGGER)

13 of 87

Common solutions to distribution drift

  • Use three cameras and adapt driving commands
  • Artificially generate off-the-track images
  • DAtaset AGGregation (DAGGER)
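
A sketch of the three-camera trick: labels for the side cameras are adjusted as if the car were displaced towards that side. The correction constant is an assumed value that would have to be tuned for the actual camera geometry:

```python
# For the side cameras, pretend the car is displaced towards that side and
# record a steering label that would bring it back to the lane center.
STEERING_CORRECTION = 0.2  # assumed constant, tuned per camera geometry

def make_training_samples(center_img, left_img, right_img, steering):
    return [
        (center_img, steering),
        (left_img,  steering + STEERING_CORRECTION),   # steer back to the right
        (right_img, steering - STEERING_CORRECTION),   # steer back to the left
    ]
```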

14 of 87

Common solutions to distribution drift

  • Use three cameras and adapt driving commands
  • Artificially generate off-the-track images
  • DAtaset AGGregation (DAGGER)

15 of 87

Common solutions to distribution drift

  • Use three cameras and adapt driving commands
  • Artificially generate off-the-track images
  • DAtaset AGGregation (DAGGER)

Shift and rotate
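
A sketch of such viewpoint augmentation with OpenCV; the steering correction gains are illustrative assumptions:

```python
import cv2

def shift_rotate_augment(image, steering, shift_px, angle_deg,
                         steer_per_px=0.004, steer_per_deg=0.01):
    """Synthesize an off-center viewpoint and correct the steering label.
    The two gain constants are illustrative assumptions."""
    h, w = image.shape[:2]
    # Translate horizontally and rotate around the image center.
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle_deg, 1.0)
    m[0, 2] += shift_px
    augmented = cv2.warpAffine(image, m, (w, h))
    # Steer back towards the lane center / original heading.
    corrected = steering + shift_px * steer_per_px + angle_deg * steer_per_deg
    return augmented, corrected
```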

16 of 87

Common solutions to distribution drift

  • Use three cameras and adapt driving commands
  • Artificially generate off-the-track images
  • DAtaset AGGregation (DAGGER)

DAGGER algorithm:

  • Train model from collected data.
  • Run model to generate new data.
  • Ask expert to label correct actions for this data.
  • Aggregate previously collected and newly labeled data and start again.
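
A sketch of the DAGGER loop described above; train_model, run_policy and expert_label stand in for project-specific pieces:

```python
def dagger(train_model, run_policy, expert_label, initial_data, iterations=5):
    """Sketch of DAGGER; the three callables are placeholders."""
    data = list(initial_data)               # (observation, action) pairs
    model = train_model(data)               # 1. train on collected data
    for _ in range(iterations):
        observations = run_policy(model)    # 2. let the model drive
        labeled = [(obs, expert_label(obs))  # 3. expert labels correct actions
                   for obs in observations]
        data.extend(labeled)                # 4. aggregate and retrain
        model = train_model(data)
    return model
```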

17 of 87

Common solutions to distribution drift

  • Use three cameras and adapt driving commands
  • Artificially generate off-the-track images
  • DAtaset AGGregation (DAGGER)

Practical DAGGER algorithm:

  • Train model from collected data.
  • Let the model drive the car while a safety driver observes it.
  • If the model does something wrong, the safety driver intervenes and their actions are recorded until control is handed back to the model.
  • Aggregate previously collected and newly recorded data and start again.

18 of 87

DAGGER

19 of 87

Imitation learning summary

  • Imitation learning trains a model to predict human driving commands.
  • Naive imitation learning suffers from the distribution shift problem.
  • The distribution shift problem can be alleviated using data augmentation and by alternating between model and human driving (DAGGER).

20 of 87

Agenda

  • Training methods
    • Imitation learning
    • Reinforcement learning
  • Network architectures
  • Evaluation methods
  • Interpretability
  • Safety
  • Companies

21 of 87

Reinforcement learning

  • The car observes the environment, takes actions and receives rewards.
  • The car is given a positive reward when it does something right.
    • Examples: staying within the lane, moving forward, arriving at the destination, etc.
  • The car is given a negative reward when it does something wrong.
    • Examples: crashing, drifting away from the trajectory, etc. (A toy reward sketch follows below.)
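
A toy reward function along these lines; the state fields, weights and thresholds are made-up illustrations:

```python
def reward(state):
    """Toy per-step reward for lane keeping; all weights are illustrative."""
    r = 0.0
    r += 0.1 * state["speed"]                 # encourage moving forward
    r -= 1.0 * abs(state["lane_offset"])      # penalize drifting from the center
    if state["collision"]:
        r -= 100.0                            # big penalty for crashing
    if state["reached_destination"]:
        r += 100.0                            # bonus for arriving
    return r
```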

22 of 87

Learning to drive in a day

23 of 87

Simulations

Reinforcement learning is usually done in simulations, because who would want to crash the car in the real world?

Microsoft

AirSim

24 of 87

CARLA

25 of 87

Distribution mismatch

Simulations do not really match the real world:

  • Lighting and textures are not perfect.
  • Limited number of human figures and car models.
  • Behavior of other traffic participants is much more predictable.

26 of 87

Unsupervised domain adaptation

  • Convert simulation images to real-world images or vice versa.
    • No correspondences between images needed! (That’s the unsupervised part!)
  • Train the driving model on simulation images or on converted-to-real-world images.
    • Training on converted-to-real-world images is more efficient, because no conversion is needed at run-time! (A usage sketch follows below.)
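
A sketch of how a sim-to-real generator (e.g. a CycleGAN-style model, here just a placeholder callable) might be applied offline, so the driving model trains on real-looking images and no conversion is needed at run-time:

```python
import torch

@torch.no_grad()
def convert_dataset(sim_loader, sim2real_generator):
    """Offline conversion: translate every simulator frame to real-world style
    once, then train the driving model on the converted frames."""
    converted = []
    for sim_images, commands in sim_loader:
        real_style = sim2real_generator(sim_images)   # assumed generator network
        converted.append((real_style.cpu(), commands))
    return converted

# At run-time the real camera image is fed to the driving model directly;
# the generator is only needed while preparing the training data.
```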

27 of 87

Wayve sim2real

28 of 87

Reinforcement learning summary

  • When using reinforcement learning, the car is given a positive reward for doing the right thing and a negative reward for doing the wrong thing.
  • Reinforcement learning is challenging to train in the real world, therefore simulations are often used.
  • Simulations do not perfectly match the real world, therefore distribution mismatch occurs.
  • Unsupervised domain adaptation is one approach to overcome the distribution mismatch problem with simulations.

29 of 87

Agenda

  • Training methods
  • Network architectures
    • Network inputs
    • Network outputs
  • Evaluation methods
  • Interpretability
  • Safety
  • Companies

30 of 87

Agenda

  • Training methods
  • Network architectures
    • Network inputs
    • Network outputs
  • Evaluation methods
  • Interpretability
  • Safety
  • Companies

31 of 87

Network inputs

  • Camera
  • Stereo camera
  • 360-degree camera(s)
  • Semantic representation
  • Vehicle state (e.g. speed)
  • Navigational input
    • Behavioral commands
    • Trajectory points
    • Route planner image
  • LiDAR
  • High-definition maps

32 of 87

Network inputs

  • Camera
  • Stereo camera
  • 360-degree camera(s)
  • Semantic representation
  • Vehicle state (e.g. speed)
  • Navigational input
    • Behavioral commands
    • Trajectory points
    • Route planner image
  • LiDAR
  • High-definition maps

ALVINN (1995)

33 of 87

Network inputs

  • Camera
  • Stereo camera
  • 360-degree camera(s)
  • Semantic representation
  • Vehicle state (e.g. speed)
  • Navigational input
    • Behavioral commands
    • Trajectory points
    • Route planner image
  • LiDAR
  • High-definition maps

ALVINN (1995)

34 of 87

Network inputs

  • Camera
  • Stereo camera
  • 360-degree camera(s)
  • Semantic representation
  • Vehicle state (e.g. speed)
  • Navigational input
    • Behavioral commands
    • Trajectory points
    • Route planner image
  • LiDAR
  • High-definition maps

NVIDIA DAVE-2 (2016)

35 of 87

Network inputs

  • Camera
  • Stereo camera
  • 360-degree camera(s)
  • Semantic representation
  • Vehicle state (e.g. speed)
  • Navigational input
    • Behavioral commands
    • Trajectory points
    • Route planner image
  • LiDAR
  • High-definition maps

NVIDIA DAVE-2 (2016)

36 of 87

Network inputs

  • Camera
  • Stereo camera
  • 360-degree camera(s)
  • Semantic representation
  • Vehicle state (e.g. speed)
  • Navigational input
    • Behavioral commands
    • Trajectory points
    • Route planner image
  • LiDAR
  • High-definition maps

DAVE (2005)

37 of 87

Network inputs

  • Camera
  • Stereo camera
  • 360-degree camera(s)
  • Semantic representation
  • Vehicle state (e.g. speed)
  • Navigational input
    • Behavioral commands
    • Trajectory points
    • Route planner image
  • LiDAR
  • High-definition maps

Hecker et al. (2018)

38 of 87

Network inputs

  • Camera
  • Stereo camera
  • 360-degree camera(s)
  • Semantic representation
  • Vehicle state (e.g. speed)
  • Navigational input
    • Behavioral commands
    • Trajectory points
    • Route planner image
  • LiDAR
  • High-definition maps

Müller et al. (2018)

39 of 87

Network inputs

  • Camera
  • Stereo camera
  • 360-degree camera(s)
  • Semantic representation
  • Vehicle state (e.g. speed)
  • Navigational input
    • Behavioral commands
    • Trajectory points
    • Route planner image
  • LiDAR
  • High-definition maps

Codevilla et al. (2019)

40 of 87

Network inputs

  • Camera
  • Stereo camera
  • 360-degree camera(s)
  • Semantic representation
  • Vehicle state (e.g. speed)
  • Navigational input
    • Behavioral commands
    • Trajectory points
    • Route planner image
  • LiDAR
  • High-definition maps

Codevilla et al. (2017)

41 of 87

Network inputs

  • Camera
  • Stereo camera
  • 360-degree camera(s)
  • Semantic representation
  • Vehicle state (e.g. speed)
  • Navigational input
    • Behavioral commands
    • Trajectory points
    • Route planner image
  • LiDAR
  • High-definition maps

Codevilla et al. (2017)

42 of 87

Network inputs

  • Camera
  • Stereo camera
  • 360-degree camera(s)
  • Semantic representation
  • Vehicle state (e.g. speed)
  • Navigational input
    • Behavioral commands
    • Trajectory points
    • Route planner image
  • LiDAR
  • High-definition maps

Codevilla et al. (2017)

43 of 87

Network inputs

  • Camera
  • Stereo camera
  • 360-degree camera(s)
  • Semantic representation
  • Vehicle state (e.g. speed)
  • Navigational input
    • Behavioral commands
    • Trajectory points
    • Route planner image
  • LiDAR
  • High-definition maps

Hecker et al. (2019)

44 of 87

Network inputs

  • Camera
  • Stereo camera
  • 360-degree camera(s)
  • Semantic representation
  • Vehicle state (e.g. speed)
  • Navigational input
    • Behavioral commands
    • Trajectory points
    • Route planner image
  • LiDAR
  • High-definition maps

Hecker et al. (2019)

45 of 87

Network inputs

  • Camera
  • Stereo camera
  • 360-degree camera(s)
  • Semantic representation
  • Vehicle state (e.g. speed)
  • Navigational input
    • Behavioral commands
    • Trajectory points
    • Route planner image
  • LiDAR
  • High-definition maps

Zeng et al. (2019)

46 of 87

Network inputs

  • Camera
  • Stereo camera
  • 360-degree camera(s)
  • Semantic representation
  • Vehicle state (e.g. speed)
  • Navigational input
    • Behavioral commands
    • Trajectory points
    • Route planner image
  • LiDAR
  • High-definition maps

Bansal et al. (2018)

47 of 87

Network inputs

  • Camera
  • Stereo camera
  • 360-degree camera(s)
  • Semantic representation
  • Vehicle state (e.g. speed)
  • Navigational input
    • Behavioral commands
    • Trajectory points
    • Route planner image
  • LiDAR
  • High-definition maps

48 of 87

Multi-modal fusion

  • Usually inputs from multiple sensors are simply concatenated.
    • Multiplication is also an option.
  • Data can be fused at early, middle or late layers.
    • Middle fusion is the most common, as each sensor needs its own preprocessing network (see the sketch below).
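
A sketch of middle fusion in PyTorch; the encoders, feature sizes and the assumption of a camera image, a lidar raster and vehicle speed as inputs are illustrative:

```python
import torch
import torch.nn as nn

class MiddleFusionNet(nn.Module):
    """Middle fusion: each sensor gets its own encoder, the features are
    concatenated and shared layers predict the driving commands."""
    def __init__(self, image_feat=128, lidar_feat=64, state_feat=16):
        super().__init__()
        self.image_encoder = nn.Sequential(      # camera-specific preprocessing
            nn.Conv2d(3, 32, 5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, image_feat))
        self.lidar_encoder = nn.Sequential(      # e.g. a bird's-eye-view lidar raster
            nn.Conv2d(1, 16, 5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, lidar_feat))
        self.state_encoder = nn.Linear(1, state_feat)   # vehicle speed, shape (batch, 1)
        self.head = nn.Sequential(
            nn.Linear(image_feat + lidar_feat + state_feat, 64), nn.ReLU(),
            nn.Linear(64, 2))                    # steering and speed

    def forward(self, image, lidar, speed):
        fused = torch.cat([self.image_encoder(image),
                           self.lidar_encoder(lidar),
                           self.state_encoder(speed)], dim=1)
        return self.head(fused)
```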

49 of 87

Multiple timesteps

  • A fixed number of previous frames can be fed to a convolutional neural network (CNN).
  • Alternatively, a recurrent neural network (RNN) can be used to take into account an arbitrary number of frames (a CNN + LSTM sketch follows below).
  • A transformer architecture can be used to selectively attend to a subset of previous frames.
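
A sketch of the CNN + RNN combination; layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class CnnLstmDriver(nn.Module):
    """A CNN encodes each frame, an LSTM aggregates the frame sequence."""
    def __init__(self, feat=128, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat))
        self.lstm = nn.LSTM(feat, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)    # steering and speed

    def forward(self, frames):              # frames: (batch, time, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(b, t, -1)
        out, _ = self.lstm(feats)
        return self.head(out[:, -1])        # prediction from the last timestep
```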

50 of 87

Network inputs summary

  • Cameras are the most commonly used sensors in end-to-end driving.
  • A navigational command is important to disambiguate intersections.
  • Semantic inputs can improve the generalization of the network.
  • Middle fusion is used, as different sensors need different pre-processing.
  • A combination of CNNs and RNNs is used to take into account multiple frames.

51 of 87

Agenda

  • Training methods
  • Network architectures
    • Network inputs
    • Network outputs
  • Evaluation methods
  • Interpretability
  • Safety
  • Companies

52 of 87

Network outputs

  • Steering wheel angle + pedals
  • Frontal wheel angle + speed
  • Curvature
  • Trajectory
  • Costmap
  • Affordances

53 of 87

Network outputs

  • Steering wheel angle + pedals
  • Frontal wheel angle + speed
  • Curvature
  • Trajectory
  • Costmap
  • Affordances

Pros:

  • Easy to apply

Cons:

  • Depends on car model, model not transferable to other cars

54 of 87

Network outputs

  • Steering wheel angle + pedals
  • Frontal wheel angle + speed
  • Curvature
  • Trajectory
  • Costmap
  • Affordances

Pros:

  • Easily interpretable
  • Better transferable to other cars

Cons:

  • Turning radius still depends on wheelbase in addition to angle
  • Controller (PID) needed to convert speed to low-level pedal presses

55 of 87

Network outputs

  • Steering wheel angle + pedals
  • Frontal wheel angle + speed
  • Curvature
  • Trajectory
  • Costmap
  • Affordances

Curvature is the inverse of turning radius.
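
A sketch of converting a predicted curvature into a steering wheel angle with a kinematic bicycle model; the wheelbase and steering ratio are assumed example values:

```python
import math

def curvature_to_steering_wheel(curvature, wheelbase=2.7, steering_ratio=15.0):
    """Convert path curvature (1/m) into a steering wheel angle (radians)
    using a kinematic bicycle model; constants are illustrative."""
    front_wheel_angle = math.atan(wheelbase * curvature)
    return front_wheel_angle * steering_ratio

# Driving straight: curvature 0 -> steering 0 (no infinite turning radius).
# A 20 m radius turn corresponds to curvature 1/20 = 0.05.
```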

56 of 87

Network outputs

  • Steering wheel angle + pedals
  • Frontal wheel angle + speed
  • Curvature
  • Trajectory
  • Costmap
  • Affordances

Pros:

  • Transferable to other cars
  • Zero when driving straight (turning radius would have been infinity)

Cons:

  • Does not force the network to plan ahead

57 of 87

Network outputs

  • Steering torque + pedal presses
  • Steering wheel angle + speed
  • Curvature
  • Trajectory
  • Costmap
  • Affordances
  • The network predicts a fixed number of future trajectory points.
  • Optionally, a curve is fitted through those points to smooth out prediction noise.
  • A classical controller, e.g. pure pursuit, is used to follow the trajectory (a sketch follows below).
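
A sketch of pure pursuit on predicted waypoints; the lookahead distance, wheelbase and coordinate convention (car frame, x forward, y left) are assumptions:

```python
import math

def pure_pursuit(waypoints, lookahead=5.0, wheelbase=2.7):
    """Pick the first waypoint at least `lookahead` meters away and compute
    the front wheel angle that steers the car onto it."""
    target = next((p for p in waypoints
                   if math.hypot(p[0], p[1]) >= lookahead), waypoints[-1])
    d2 = target[0] ** 2 + target[1] ** 2
    curvature = 2.0 * target[1] / d2      # standard pure pursuit formula
    return math.atan(wheelbase * curvature)
```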

58 of 87

Network outputs

  • Steering torque + pedal presses
  • Steering wheel angle + speed
  • Curvature
  • Trajectory
  • Costmap
  • Affordances

Pros:

  • Transferable to other cars
  • Forces network to plan ahead

Cons:

  • Needs classical controller to follow the trajectory
  • Cannot handle multiple equally good trajectories

59 of 87

Network outputs

  • Steering torque + pedal presses
  • Steering wheel angle + speed
  • Curvature
  • Trajectory
  • Costmap
  • Affordances

Multiple equally good trajectories:

60 of 87

Network outputs

  • Steering torque + pedal presses
  • Steering wheel angle + speed
  • Curvature
  • Trajectory
  • Costmap
  • Affordances
  • The network is trained to output a costmap.
  • The cost of human-chosen trajectories is trained to be lower than the cost of random trajectories (a loss sketch follows below).
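
A sketch of the max-margin idea: the human trajectory should cost less than randomly sampled ones. The costmap indexing scheme is an assumption:

```python
import torch

def trajectory_cost(costmap, trajectory):
    """Sum the cost of the cells a trajectory passes through.
    costmap: (H, W) tensor; trajectory: list of (row, col) indices."""
    return sum(costmap[r, c] for r, c in trajectory)

def costmap_loss(costmap, human_traj, sampled_trajs, margin=1.0):
    """Max-margin loss: the human trajectory should cost at least `margin`
    less than every randomly sampled trajectory."""
    human_cost = trajectory_cost(costmap, human_traj)
    losses = [torch.clamp(margin + human_cost - trajectory_cost(costmap, t),
                          min=0.0)
              for t in sampled_trajs]
    return torch.stack(losses).mean()
```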

61 of 87

Network outputs

  • Steering torque + pedal presses
  • Steering wheel angle + speed
  • Curvature
  • Trajectory
  • Costmap
  • Affordances

Pros:

  • Handles multiple equally good trajectories

Cons:

  • Harder to train, because no direct supervision

62 of 87

Network outputs

  • Steering torque + pedal presses
  • Steering wheel angle + speed
  • Curvature
  • Trajectory
  • Costmap
  • Affordances

Affordances represent semantic information used by a simple planner to plan the trajectory (a toy controller sketch follows after the list), for example:

  • need to stop (due to obstacle)
  • traffic light status
  • speed limit
  • distance to vehicle in front
  • relative angle to desired trajectory
  • distance to centerline
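
A sketch of a simple planner on top of predicted affordances; the affordance names and control gains are illustrative assumptions:

```python
def affordance_controller(aff, k_angle=1.0, k_offset=0.5, time_gap=2.0):
    """Toy planner on top of predicted affordances; all gains are made up."""
    # Steer to reduce the heading error and the offset from the centerline.
    steering = -k_angle * aff["angle_to_trajectory"] \
               - k_offset * aff["distance_to_centerline"]
    # Drive at the speed limit, keep a time gap to the vehicle in front,
    # and stop for obstacles and red lights.
    speed = aff["speed_limit"]
    if aff["distance_to_lead_vehicle"] is not None:
        speed = min(speed, aff["distance_to_lead_vehicle"] / time_gap)
    if aff["need_to_stop"] or aff["traffic_light"] == "red":
        speed = 0.0
    return steering, speed
```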

63 of 87

Network outputs

  • Steering torque + pedal presses
  • Steering wheel angle + speed
  • Curvature
  • Trajectory
  • Costmap
  • Affordances

Pros:

  • Easily interpretable

Cons:

  • No automatic labeling

64 of 87

Network outputs summary

  • Steering wheel angle and car speed are the most commonly used outputs.
  • Curvature is better than steering wheel angle, because it is not specific to a car model.
  • Trajectory is even better, because it forces the network to plan ahead.
  • Costmap can handle multiple equally good trajectories, but is harder to train.
  • Affordances are a practical solution for simpler tasks like lane following.

65 of 87

Agenda

  • Training methods
  • Network architectures
  • Evaluation methods
  • Interpretability
  • Safety
  • Companies

66 of 87

Evaluation methods

Open-loop evaluation

Model predictions are compared with human driver ground truth values, e.g. steering wheel angle and speed.

Verdict: BAD! But it can be used in the model architecture search phase.

Closed-loop evaluation

The model drives the car and some metric is used to measure its driving ability, e.g. kilometers driven without intervention.

Verdict: GOOD! But beware of differing driving conditions!

67 of 87

Closed-loop evaluation metrics

  • number of infractions on given route
    • e.g. collisions, missed turns, going off the road, etc.
  • average distance between infractions or disengagements
  • percentage of successful trials
  • percentage of autonomy
    • i.e. percentage of time car is controlled by the model, not safety driver
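
A sketch of how these metrics might be computed from drive logs; the log format is an assumed illustration:

```python
def closed_loop_metrics(log):
    """log: list of (seconds, meters, controlled_by_model, disengaged) segments.
    The log format is an illustrative assumption."""
    total_time = sum(seg[0] for seg in log)
    model_time = sum(seg[0] for seg in log if seg[2])
    distance = sum(seg[1] for seg in log)
    disengagements = sum(1 for seg in log if seg[3])
    return {
        "autonomy_pct": 100.0 * model_time / total_time,
        "km_per_disengagement": (distance / 1000.0) / max(disengagements, 1),
    }
```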

68 of 87

State of California disengagement report 2021

69 of 87

Evaluation summary

  • Open-loop testing measures how well the network can imitate the human driver.
  • Closed-loop testing measures how well the network can actually drive.
  • Closed-loop testing should be preferred, while open-loop testing can be used for architecture search.
  • Closed-loop testing depends on the environment where the test is performed, so different testing results might not be comparable.

70 of 87

Agenda

  • Training methods
  • Network architectures
  • Evaluation methods
  • Interpretability
  • Safety
  • Companies

71 of 87

Interpretability

Neural networks are generally considered black boxes. If the network makes an error, it may be hard to understand why it happened and how to fix it.

72 of 87

Highlighting salient areas on the image

Several methods can be used to highlight the areas of the image that were used to make the decision (a gradient-based sketch follows below):

  • gradient analysis
  • activation analysis
  • self-attention
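
A sketch of the simplest of these, gradient analysis: the magnitude of the input gradient shows which pixels most affect the predicted command:

```python
import torch

def gradient_saliency(model, image, output_index=0):
    """Visual saliency via input gradients: how much each pixel affects
    the chosen output (e.g. steering)."""
    image = image.clone().requires_grad_(True)
    output = model(image.unsqueeze(0))[0, output_index]
    output.backward()
    # Max over color channels gives one saliency value per pixel.
    return image.grad.abs().max(dim=0).values
```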

73 of 87

Auxiliary outputs

Forcing the network to predict additional outputs, e.g. semantic segmentation, can both speed up training and make the model generalize better, due to more supervision signal. These outputs can be used at prediction time for debugging.
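
A sketch of such a multi-task loss; the auxiliary weight is an illustrative choice:

```python
import torch.nn.functional as F

def multi_task_loss(pred_commands, true_commands,
                    pred_segmentation, true_segmentation, aux_weight=0.5):
    """Driving command loss plus an auxiliary semantic segmentation loss."""
    driving = F.mse_loss(pred_commands, true_commands)
    auxiliary = F.cross_entropy(pred_segmentation, true_segmentation)
    return driving + aux_weight * auxiliary
```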

74 of 87

Auxiliary outputs

75 of 87

Interpretability summary

  • Neural networks are often considered black boxes: it can be hard to explain why they made a wrong decision and how to fix it.
  • Visual saliency can be used to highlight the areas that the neural network used to make the decision.
  • Another option is to force the network to predict auxiliary outputs besides driving commands and use those outputs to debug the network.

76 of 87

Agenda

  • Training methods
  • Network architectures
  • Evaluation methods
  • Interpretability
  • Safety
  • Companies

77 of 87

Is End-to-End the Future?

78 of 87

Putting stickers on the road misled Tesla Autopilot

79 of 87

Showing this image to a Tesla activated the windscreen wipers

80 of 87

Adding stickers to a stop sign caused it to be misclassified

81 of 87

Safety summary

  • End-to-end promises to make self-driving cheaper and easier to implement.
  • Testing and debugging is challenging with neural networks.
  • Adversarial examples need to be understood before we can let neural networks drive cars all by themselves.

82 of 87

Agenda

  • Training methods
  • Network architectures
  • Evaluation methods
  • Interpretability
  • Safety
  • Companies

83 of 87

Company: Tesla

Network inputs:

  • frontal camera(s)

Network outputs:

  • lanes
  • lane markers
  • drivable area
  • stop lines
  • obstacles (car, truck, pedestrian) and their distances
  • trajectory for the car

84 of 87

Company: comma.ai

Network inputs:

  • single frontal camera

Network outputs:

  • left and right lane markers
  • distance to the leading car, its relative velocity and acceleration
  • trajectory for the car

85 of 87

Company: Wayve

Network inputs:

  • three frontal cameras
  • optical flow for center camera
  • route command: straight, left, right

Network outputs:

  • steering wheel angle and change rate for N timesteps
  • car speed and acceleration for N timesteps

86 of 87

Companies summary

  • Most practical driving networks fall under the end-to-mid approach, where the network predicts useful affordances that are used by a classical controller.
  • Driving networks currently achieve level 2 autonomy at most, but they are often seen as the only path to level 5 autonomy in the future.
  • Tesla is in the best position to achieve a breakthrough in the use of neural networks for driving, due to their fleet and data collection infrastructure.
  • Wayve is the most interesting startup in autonomous driving, because they are the only ones pushing full end-to-end driving.

87 of 87

Thank you!

tambet.matiisen@ut.ee