1 of 87

End-to-end Driving

Tambet Matiisen

7.09.2022

2 of 87

Modular Approach

Sensors → (camera image) → Perception → (detected objects) → Planning → (trajectory) → Control → (steering, gas and brake) → Actuators

3 of 87

End-to-End Approach

Sensors → (camera image) → Neural Network → (steering, gas and brake) → Actuators

4 of 87

Two approaches

Modular Approach

Pros:

  • Easier to test and analyze
  • Driverless solutions emerging

Cons:

  • Needs costly sensors (lidar) and HD maps
  • Impossible to write rules for all situations

End-to-End Approach

Pros:

  • Conceptually simpler
  • Low-cost sensors

Cons:

  • Hard to test and analyze
  • Needs a lot of training data and compute
  • Mostly highway lane following

Tesla

comma.ai

5 of 87

Agenda

  • Training methods
  • Network architectures
  • Evaluation methods
  • Interpretability
  • Safety
  • Companies

6 of 87

Agenda

  • Training methods
    • Imitation learning
    • Reinforcement learning
  • Network architectures
  • Evaluation methods
  • Interpretability
  • Safety
  • Companies

7 of 87

Agenda

  • Training methods
    • Imitation learning
    • Reinforcement learning
  • Network architectures
  • Evaluation methods
  • Interpretability
  • Safety
  • Companies

8 of 87

Imitation learning

  • Record a big dataset of camera images and driving commands
    • Driving commands: steering wheel angles & car speed
  • Train a neural network to predict driving commands from camera images
    • Normal supervised learning can be used, nothing fancy or complicated (a minimal training sketch follows below)
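
A minimal sketch of this supervised setup in PyTorch; the toy network and the (image, command) data loader are illustrative assumptions, not the exact pipeline of any system discussed here:

```python
import torch
import torch.nn as nn

# Toy convolutional policy: camera image in, [steering, speed] out.
class DrivingNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 24, 5, stride=2), nn.ReLU(),
            nn.Conv2d(24, 36, 5, stride=2), nn.ReLU(),
            nn.Conv2d(36, 48, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(48, 2)  # steering angle and speed

    def forward(self, x):
        return self.head(self.features(x))

def train(model, loader, epochs=10):
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.MSELoss()  # plain regression on the recorded commands
    for _ in range(epochs):
        for images, commands in loader:   # commands: [steering, speed]
            opt.zero_grad()
            loss = loss_fn(model(images), commands)
            loss.backward()
            opt.step()
```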

9 of 87

NVIDIA DAVE-2

10 of 87

Problem: distribution drift

  • A human driver always keeps the car in the center of the lane.
  • When the model drives the car, it
    • accumulates small prediction errors
    • and drifts away from the center.
  • When the car has drifted away from the center,
    • the network sees data it has never seen during training
    • and can behave unpredictably.

11 of 87

Distribution drift example

12 of 87

Common solutions to distribution drift

  • Use three cameras and adapt driving commands
  • Artificially generate off-the-track images
  • DAtaset AGGregation (DAGGER)

13 of 87

Common solutions to distribution drift

  • Use three cameras and adapt driving commands
  • Artificially generate off-the-track images
  • DAtaset AGGregation (DAGGER)
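
A sketch of the three-camera trick: labels for the side cameras are adjusted as if the car were displaced towards that side. The correction constant is an assumed value that would have to be tuned for the actual camera geometry:

```python
# For the side cameras, pretend the car is displaced towards that side and
# record a steering label that would bring it back to the lane center.
STEERING_CORRECTION = 0.2  # assumed constant, tuned per camera geometry

def make_training_samples(center_img, left_img, right_img, steering):
    return [
        (center_img, steering),
        (left_img,  steering + STEERING_CORRECTION),   # steer back to the right
        (right_img, steering - STEERING_CORRECTION),   # steer back to the left
    ]
```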

14 of 87

Common solutions to distribution drift

  • Use three cameras and adapt driving commands
  • Artificially generate off-the-track images
  • DAtaset AGGregation (DAGGER)

15 of 87

Common solutions to distribution drift

  • Use three cameras and adapt driving commands
  • Artificially generate off-the-track images
  • DAtaset AGGregation (DAGGER)

Shift and rotate
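
A sketch of such viewpoint augmentation with OpenCV; the steering correction gains are illustrative assumptions:

```python
import cv2

def shift_rotate_augment(image, steering, shift_px, angle_deg,
                         steer_per_px=0.004, steer_per_deg=0.01):
    """Synthesize an off-center viewpoint and correct the steering label.
    The two gain constants are illustrative assumptions."""
    h, w = image.shape[:2]
    # Translate horizontally and rotate around the image center.
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle_deg, 1.0)
    m[0, 2] += shift_px
    augmented = cv2.warpAffine(image, m, (w, h))
    # Steer back towards the lane center / original heading.
    corrected = steering + shift_px * steer_per_px + angle_deg * steer_per_deg
    return augmented, corrected
```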

16 of 87

Common solutions to distribution drift

  • Use three cameras and adapt driving commands
  • Artificially generate off-the-track images
  • DAtaset AGGregation (DAGGER)

DAGGER algorithm:

  • Train model from collected data.
  • Run model to generate new data.
  • Ask expert to label correct actions for this data.
  • Aggregate previously collected and newly labeled data and start again.
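
A sketch of the DAGGER loop described above; train_model, run_policy and expert_label stand in for project-specific pieces:

```python
def dagger(train_model, run_policy, expert_label, initial_data, iterations=5):
    """Sketch of DAGGER; the three callables are placeholders."""
    data = list(initial_data)               # (observation, action) pairs
    model = train_model(data)               # 1. train on collected data
    for _ in range(iterations):
        observations = run_policy(model)    # 2. let the model drive
        labeled = [(obs, expert_label(obs))  # 3. expert labels correct actions
                   for obs in observations]
        data.extend(labeled)                # 4. aggregate and retrain
        model = train_model(data)
    return model
```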

17 of 87

Common solutions to distribution drift

  • Use three cameras and adapt driving commands
  • Artificially generate off-the-track images
  • DAtaset AGGregation (DAGGER)

Practical DAGGER algorithm:

  • Train model from collected data.
  • Let the model drive the car while a safety driver observes it.
  • If the model does something wrong, the safety driver intervenes and their actions are recorded until control is handed back to the model.
  • Aggregate previously collected and newly recorded data and start again.

18 of 87

DAGGER

19 of 87

Imitation learning summary

  • Imitation learning trains a model to predict human driving commands.
  • Naive imitation learning suffers from the distribution shift problem.
  • The distribution shift problem can be alleviated using data augmentation and by alternating between model and human driving (DAGGER).

20 of 87

Agenda

  • Training methods
    • Imitation learning
    • Reinforcement learning
  • Network architectures
  • Evaluation methods
  • Interpretability
  • Safety
  • Companies

21 of 87

Reinforcement learning

  • The car observes the environment, takes actions and receives rewards.
  • The car is given a positive reward when it does something right.
    • Examples: staying within the lane, moving forward, arriving at the destination, etc.
  • The car is given a negative reward when it does something wrong.
    • Examples: crashing, drifting away from the trajectory, etc. (A toy reward sketch follows below.)
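
A toy reward function along these lines; the state fields, weights and thresholds are made-up illustrations:

```python
def reward(state):
    """Toy per-step reward for lane keeping; all weights are illustrative."""
    r = 0.0
    r += 0.1 * state["speed"]                 # encourage moving forward
    r -= 1.0 * abs(state["lane_offset"])      # penalize drifting from the center
    if state["collision"]:
        r -= 100.0                            # big penalty for crashing
    if state["reached_destination"]:
        r += 100.0                            # bonus for arriving
    return r
```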

22 of 87

Learning to drive in a day

23 of 87

Simulations

Reinforcement learning is usually done in simulations, because who would want to crash the car in the real world?

Microsoft

AirSim

24 of 87

CARLA

25 of 87

Distribution mismatch

Simulations do not really match the real world:

  • Lighting and textures are not perfect.
  • Limited number of human figures and car models.
  • Behavior of other traffic participants is much more predictable.

26 of 87

Unsupervised domain adaptation

  • Convert simulation images to real-world images or vice versa.
    • No correspondences between images needed! (That’s the unsupervised part!)
  • Train the driving model on simulation images or on converted-to-real-world images.
    • Training on converted-to-real-world images is more efficient, because no conversion is needed at run-time! (A usage sketch follows below.)
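
A sketch of how a sim-to-real generator (e.g. a CycleGAN-style model, here just a placeholder callable) might be applied offline, so the driving model trains on real-looking images and no conversion is needed at run-time:

```python
import torch

@torch.no_grad()
def convert_dataset(sim_loader, sim2real_generator):
    """Offline conversion: translate every simulator frame to real-world style
    once, then train the driving model on the converted frames."""
    converted = []
    for sim_images, commands in sim_loader:
        real_style = sim2real_generator(sim_images)   # assumed generator network
        converted.append((real_style.cpu(), commands))
    return converted

# At run-time the real camera image is fed to the driving model directly;
# the generator is only needed while preparing the training data.
```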

27 of 87

Wayve sim2real

28 of 87

Reinforcement learning summary

  • When using reinforcement learning, the car is given a positive reward for doing the right thing and a negative reward for doing the wrong thing.
  • Reinforcement learning is challenging to train in the real world, therefore simulations are often used.
  • Simulations do not perfectly match the real world, therefore distribution mismatch occurs.
  • Unsupervised domain adaptation is one approach to overcome the distribution mismatch problem with simulations.

29 of 87

Agenda

  • Training methods
  • Network architectures
    • Network inputs
    • Network outputs
  • Evaluation methods
  • Interpretability
  • Safety
  • Companies

30 of 87

Agenda

  • Training methods
  • Network architectures
    • Network inputs
    • Network outputs
  • Evaluation methods
  • Interpretability
  • Safety
  • Companies

31 of 87

Network inputs

  • Camera
  • Stereo camera
  • 360-degree camera(s)
  • Semantic representation
  • Vehicle state (e.g. speed)
  • Navigational input
    • Behavioral commands
    • Trajectory points
    • Route planner image
  • LiDAR
  • High-definition maps

32 of 87

Network inputs

  • Camera
  • Stereo camera
  • 360-degree camera(s)
  • Semantic representation
  • Vehicle state (e.g. speed)
  • Navigational input
    • Behavioral commands
    • Trajectory points
    • Route planner image
  • LiDAR
  • High-definition maps

ALVINN (1995)

33 of 87

Network inputs

  • Camera
  • Stereo camera
  • 360-degree camera(s)
  • Semantic representation
  • Vehicle state (e.g. speed)
  • Navigational input
    • Behavioral commands
    • Trajectory points
    • Route planner image
  • LiDAR
  • High-definition maps

ALVINN (1995)

34 of 87

Network inputs

  • Camera
  • Stereo camera
  • 360-degree camera(s)
  • Semantic representation
  • Vehicle state (e.g. speed)
  • Navigational input
    • Behavioral commands
    • Trajectory points
    • Route planner image
  • LiDAR
  • High-definition maps

NVIDIA DAVE-2 (2016)

35 of 87

Network inputs

  • Camera
  • Stereo camera
  • 360-degree camera(s)
  • Semantic representation
  • Vehicle state (e.g. speed)
  • Navigational input
    • Behavioral commands
    • Trajectory points
    • Route planner image
  • LiDAR
  • High-definition maps

NVIDIA DAVE-2 (2016)

36 of 87

Network inputs

  • Camera
  • Stereo camera
  • 360-degree camera(s)
  • Semantic representation
  • Vehicle state (e.g. speed)
  • Navigational input
    • Behavioral commands
    • Trajectory points
    • Route planner image
  • LiDAR
  • High-definition maps

DAVE (2005)

37 of 87

Network inputs

  • Camera
  • Stereo camera
  • 360-degree camera(s)
  • Semantic representation
  • Vehicle state (e.g. speed)
  • Navigational input
    • Behavioral commands
    • Trajectory points
    • Route planner image
  • LiDAR
  • High-definition maps

Hecker et al. (2018)

38 of 87

Network inputs

  • Camera
  • Stereo camera
  • 360-degree camera(s)
  • Semantic representation
  • Vehicle state (e.g. speed)
  • Navigational input
    • Behavioral commands
    • Trajectory points
    • Route planner image
  • LiDAR
  • High-definition maps

Müller et al. (2018)

39 of 87

Network inputs

  • Camera
  • Stereo camera
  • 360-degree camera(s)
  • Semantic representation
  • Vehicle state (e.g. speed)
  • Navigational input
    • Behavioral commands
    • Trajectory points
    • Route planner image
  • LiDAR
  • High-definition maps

Codevilla et al. (2019)

40 of 87

Network inputs

  • Camera
  • Stereo camera
  • 360-degree camera(s)
  • Semantic representation
  • Vehicle state (e.g. speed)
  • Navigational input
    • Behavioral commands
    • Trajectory points
    • Route planner image
  • LiDAR
  • High-definition maps

Codevilla et al. (2017)

41 of 87

Network inputs

  • Camera
  • Stereo camera
  • 360-degree camera(s)
  • Semantic representation
  • Vehicle state (e.g. speed)
  • Navigational input
    • Behavioral commands
    • Trajectory points
    • Route planner image
  • LiDAR
  • High-definition maps

Codevilla et al. (2017)

42 of 87

Network inputs

  • Camera
  • Stereo camera
  • 360-degree camera(s)
  • Semantic representation
  • Vehicle state (e.g. speed)
  • Navigational input
    • Behavioral commands
    • Trajectory points
    • Route planner image
  • LiDAR
  • High-definition maps

Codevilla et al. (2017)

43 of 87

Network inputs

  • Camera
  • Stereo camera
  • 360-degree camera(s)
  • Semantic representation
  • Vehicle state (e.g. speed)
  • Navigational input
    • Behavioral commands
    • Trajectory points
    • Route planner image
  • LiDAR
  • High-definition maps

Hecker et al. (2019)

44 of 87

Network inputs

  • Camera
  • Stereo camera
  • 360-degree camera(s)
  • Semantic representation
  • Vehicle state (e.g. speed)
  • Navigational input
    • Behavioral commands
    • Trajectory points
    • Route planner image
  • LiDAR
  • High-definition maps

Hecker et al. (2019)

45 of 87

Network inputs

  • Camera
  • Stereo camera
  • 360-degree camera(s)
  • Semantic representation
  • Vehicle state (e.g. speed)
  • Navigational input
    • Behavioral commands
    • Trajectory points
    • Route planner image
  • LiDAR
  • High-definition maps

Zeng et al. (2019)

46 of 87

Network inputs

  • Camera
  • Stereo camera
  • 360-degree camera(s)
  • Semantic representation
  • Vehicle state (e.g. speed)
  • Navigational input
    • Behavioral commands
    • Trajectory points
    • Route planner image
  • LiDAR
  • High-definition maps

Bansal et al. (2018)

47 of 87

Network inputs

  • Camera
  • Stereo camera
  • 360-degree camera(s)
  • Semantic representation
  • Vehicle state (e.g. speed)
  • Navigational input
    • Behavioral commands
    • Trajectory points
    • Route planner image
  • LiDAR
  • High-definition maps

48 of 87

Multi-modal fusion

  • Usually inputs from multiple sensors are simply concatenated.
    • Multiplication is also an option.
  • Data can be fused at early, middle or late layers.
    • Middle fusion is the most common, as each sensor needs its own preprocessing network (see the sketch below).
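
A sketch of middle fusion in PyTorch; the encoders, feature sizes and the assumption of a camera image, a lidar raster and vehicle speed as inputs are illustrative:

```python
import torch
import torch.nn as nn

class MiddleFusionNet(nn.Module):
    """Middle fusion: each sensor gets its own encoder, the features are
    concatenated and shared layers predict the driving commands."""
    def __init__(self, image_feat=128, lidar_feat=64, state_feat=16):
        super().__init__()
        self.image_encoder = nn.Sequential(      # camera-specific preprocessing
            nn.Conv2d(3, 32, 5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, image_feat))
        self.lidar_encoder = nn.Sequential(      # e.g. a bird's-eye-view lidar raster
            nn.Conv2d(1, 16, 5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, lidar_feat))
        self.state_encoder = nn.Linear(1, state_feat)   # vehicle speed, shape (batch, 1)
        self.head = nn.Sequential(
            nn.Linear(image_feat + lidar_feat + state_feat, 64), nn.ReLU(),
            nn.Linear(64, 2))                    # steering and speed

    def forward(self, image, lidar, speed):
        fused = torch.cat([self.image_encoder(image),
                           self.lidar_encoder(lidar),
                           self.state_encoder(speed)], dim=1)
        return self.head(fused)
```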

49 of 87

Multiple timesteps

  • A fixed number of previous frames can be fed to a convolutional neural network (CNN).
  • Alternatively, a recurrent neural network (RNN) can be used to take into account an arbitrary number of frames (a CNN + LSTM sketch follows below).
  • A transformer architecture can be used to selectively attend to a subset of previous frames.
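
A sketch of the CNN + RNN combination; layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class CnnLstmDriver(nn.Module):
    """A CNN encodes each frame, an LSTM aggregates the frame sequence."""
    def __init__(self, feat=128, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat))
        self.lstm = nn.LSTM(feat, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)    # steering and speed

    def forward(self, frames):              # frames: (batch, time, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(b, t, -1)
        out, _ = self.lstm(feats)
        return self.head(out[:, -1])        # prediction from the last timestep
```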

50 of 87

Network inputs summary

  • Cameras are the most commonly used sensors in end-to-end driving.
  • A navigational command is important to disambiguate intersections.
  • Semantic inputs can improve the generalization of the network.
  • Middle fusion is used, as different sensors need different pre-processing.
  • A combination of CNNs and RNNs is used to take into account multiple frames.

51 of 87

Agenda

  • Training methods
  • Network architectures
    • Network inputs
    • Network outputs
  • Evaluation methods
  • Interpretability
  • Safety
  • Companies

52 of 87

Network outputs

  • Steering wheel angle + pedals
  • Frontal wheel angle + speed
  • Curvature
  • Trajectory
  • Costmap
  • Affordances

53 of 87

Network outputs

  • Steering wheel angle + pedals
  • Frontal wheel angle + speed
  • Curvature
  • Trajectory
  • Costmap
  • Affordances

Pros:

  • Easy to apply

Cons:

  • Depends on car model, model not transferable to other cars

54 of 87

Network outputs

  • Steering wheel angle + pedals
  • Frontal wheel angle + speed
  • Curvature
  • Trajectory
  • Costmap
  • Affordances

Pros:

  • Easily interpretable
  • Better transferable to other cars

Cons:

  • Turning radius still depends on wheelbase in addition to angle
  • Controller (PID) needed to convert speed to low-level pedal presses

55 of 87

Network outputs

  • Steering wheel angle + pedals
  • Frontal wheel angle + speed
  • Curvature
  • Trajectory
  • Costmap
  • Affordances

Curvature is the inverse of turning radius.
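
A sketch of converting a predicted curvature into a steering wheel angle with a kinematic bicycle model; the wheelbase and steering ratio are assumed example values:

```python
import math

def curvature_to_steering_wheel(curvature, wheelbase=2.7, steering_ratio=15.0):
    """Convert path curvature (1/m) into a steering wheel angle (radians)
    using a kinematic bicycle model; constants are illustrative."""
    front_wheel_angle = math.atan(wheelbase * curvature)
    return front_wheel_angle * steering_ratio

# Driving straight: curvature 0 -> steering 0 (no infinite turning radius).
# A 20 m radius turn corresponds to curvature 1/20 = 0.05.
```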

56 of 87

Network outputs

  • Steering wheel angle + pedals
  • Frontal wheel angle + speed
  • Curvature
  • Trajectory
  • Costmap
  • Affordances

Pros:

  • Transferable to other cars
  • Zero when driving straight (turning radius would have been infinity)

Cons:

  • Does not force the network to plan ahead

57 of 87

Network outputs

  • Steering torque + pedal presses
  • Steering wheel angle + speed
  • Curvature
  • Trajectory
  • Costmap
  • Affordances
  • The network predicts a fixed number of future trajectory points.
  • Optionally, a curve is fitted through those points to smooth out prediction noise.
  • A classical controller, e.g. pure pursuit, is used to follow the trajectory (a sketch follows below).
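
A sketch of pure pursuit on predicted waypoints; the lookahead distance, wheelbase and coordinate convention (car frame, x forward, y left) are assumptions:

```python
import math

def pure_pursuit(waypoints, lookahead=5.0, wheelbase=2.7):
    """Pick the first waypoint at least `lookahead` meters away and compute
    the front wheel angle that steers the car onto it."""
    target = next((p for p in waypoints
                   if math.hypot(p[0], p[1]) >= lookahead), waypoints[-1])
    d2 = target[0] ** 2 + target[1] ** 2
    curvature = 2.0 * target[1] / d2      # standard pure pursuit formula
    return math.atan(wheelbase * curvature)
```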

58 of 87

Network outputs

  • Steering torque + pedal presses
  • Steering wheel angle + speed
  • Curvature
  • Trajectory
  • Costmap
  • Affordances

Pros:

  • Transferable to other cars
  • Forces network to plan ahead

Cons:

  • Needs classical controller to follow the trajectory
  • Cannot handle multiple equally good trajectories

59 of 87

Network outputs

  • Steering torque + pedal presses
  • Steering wheel angle + speed
  • Curvature
  • Trajectory
  • Costmap
  • Affordances

Multiple equally good trajectories:

60 of 87

Network outputs

  • Steering torque + pedal presses
  • Steering wheel angle + speed
  • Curvature
  • Trajectory
  • Costmap
  • Affordances
  • The network is trained to output a costmap.
  • The cost of human-chosen trajectories is trained to be lower than the cost of random trajectories (a loss sketch follows below).
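
A sketch of the max-margin idea: the human trajectory should cost less than randomly sampled ones. The costmap indexing scheme is an assumption:

```python
import torch

def trajectory_cost(costmap, trajectory):
    """Sum the cost of the cells a trajectory passes through.
    costmap: (H, W) tensor; trajectory: list of (row, col) indices."""
    return sum(costmap[r, c] for r, c in trajectory)

def costmap_loss(costmap, human_traj, sampled_trajs, margin=1.0):
    """Max-margin loss: the human trajectory should cost at least `margin`
    less than every randomly sampled trajectory."""
    human_cost = trajectory_cost(costmap, human_traj)
    losses = [torch.clamp(margin + human_cost - trajectory_cost(costmap, t),
                          min=0.0)
              for t in sampled_trajs]
    return torch.stack(losses).mean()
```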

61 of 87

Network outputs

  • Steering torque + pedal presses
  • Steering wheel angle + speed
  • Curvature
  • Trajectory
  • Costmap
  • Affordances

Pros:

  • Handles multiple equally good trajectories

Cons:

  • Harder to train, because no direct supervision

62 of 87

Network outputs

  • Steering torque + pedal presses
  • Steering wheel angle + speed
  • Curvature
  • Trajectory
  • Costmap
  • Affordances

Affordances represent semantic information used by a simple planner to plan the trajectory (a toy controller sketch follows after the list), for example:

  • need to stop (due to obstacle)
  • traffic light status
  • speed limit
  • distance to vehicle in front
  • relative angle to desired trajectory
  • distance to centerline
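
A sketch of a simple planner on top of predicted affordances; the affordance names and control gains are illustrative assumptions:

```python
def affordance_controller(aff, k_angle=1.0, k_offset=0.5, time_gap=2.0):
    """Toy planner on top of predicted affordances; all gains are made up."""
    # Steer to reduce the heading error and the offset from the centerline.
    steering = -k_angle * aff["angle_to_trajectory"] \
               - k_offset * aff["distance_to_centerline"]
    # Drive at the speed limit, keep a time gap to the vehicle in front,
    # and stop for obstacles and red lights.
    speed = aff["speed_limit"]
    if aff["distance_to_lead_vehicle"] is not None:
        speed = min(speed, aff["distance_to_lead_vehicle"] / time_gap)
    if aff["need_to_stop"] or aff["traffic_light"] == "red":
        speed = 0.0
    return steering, speed
```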

63 of 87

Network outputs

  • Steering torque + pedal presses
  • Steering wheel angle + speed
  • Curvature
  • Trajectory
  • Costmap
  • Affordances

Pros:

  • Easily interpretable

Cons:

  • No automatic labeling

64 of 87

Network outputs summary

  • Steering wheel angle and car speed are the most commonly used outputs.
  • Curvature is better than steering wheel angle, because it is not specific to a car model.
  • Trajectory is even better, because it forces the network to plan ahead.
  • Costmap can handle multiple equally good trajectories, but is harder to train.
  • Affordances are a practical solution for simpler tasks like lane following.

65 of 87

Agenda

  • Training methods
  • Network architectures
  • Evaluation methods
  • Interpretability
  • Safety
  • Companies

66 of 87

Evaluation methods

Open-loop evaluation

Model predictions are compared with human driver ground truth values, e.g. steering wheel angle and speed.

Verdict: BAD! But it can be used in the model architecture search phase.

Closed-loop evaluation

The model drives the car and some metric is used to measure its driving ability, e.g. kilometers driven without intervention.

Verdict: GOOD! But beware of differing driving conditions!

67 of 87

Closed-loop evaluation metrics

  • number of infractions on given route
    • e.g. collisions, missed turns, going off the road, etc.
  • average distance between infractions or disengagements
  • percentage of successful trials
  • percentage of autonomy
    • i.e. percentage of time car is controlled by the model, not safety driver
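
A sketch of how these metrics might be computed from drive logs; the log format is an assumed illustration:

```python
def closed_loop_metrics(log):
    """log: list of (seconds, meters, controlled_by_model, disengaged) segments.
    The log format is an illustrative assumption."""
    total_time = sum(seg[0] for seg in log)
    model_time = sum(seg[0] for seg in log if seg[2])
    distance = sum(seg[1] for seg in log)
    disengagements = sum(1 for seg in log if seg[3])
    return {
        "autonomy_pct": 100.0 * model_time / total_time,
        "km_per_disengagement": (distance / 1000.0) / max(disengagements, 1),
    }
```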

68 of 87

State of California disengagement report 2021

69 of 87

Evaluation summary

  • Open-loop testing measures how well the network can imitate the human driver.
  • Closed-loop testing measures how well the network can actually drive.
  • Closed-loop testing should be preferred, while open-loop testing can be used for architecture search.
  • Closed-loop testing depends on the environment where the test is performed, so different testing results might not be comparable.

70 of 87

Agenda

  • Training methods
  • Network architectures
  • Evaluation methods
  • Interpretability
  • Safety
  • Companies

71 of 87

Interpretability

Neural networks are generally considered black boxes. If the network makes an error, it may be hard to understand why it happened and how to fix it.

72 of 87

Highlighting salient areas on the image

Several methods can be used to highlight the areas of the image that were used to make the decision (a gradient-based sketch follows below):

  • gradient analysis
  • activation analysis
  • self-attention
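
A sketch of the simplest of these, gradient analysis: the magnitude of the input gradient shows which pixels most affect the predicted command:

```python
import torch

def gradient_saliency(model, image, output_index=0):
    """Visual saliency via input gradients: how much each pixel affects
    the chosen output (e.g. steering)."""
    image = image.clone().requires_grad_(True)
    output = model(image.unsqueeze(0))[0, output_index]
    output.backward()
    # Max over color channels gives one saliency value per pixel.
    return image.grad.abs().max(dim=0).values
```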

73 of 87

Auxiliary outputs

Forcing the network to predict additional outputs, e.g. semantic segmentation, can both speed up training and make the model generalize better, due to more supervision signal. These outputs can be used at prediction time for debugging.
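
A sketch of such a multi-task loss; the auxiliary weight is an illustrative choice:

```python
import torch.nn.functional as F

def multi_task_loss(pred_commands, true_commands,
                    pred_segmentation, true_segmentation, aux_weight=0.5):
    """Driving command loss plus an auxiliary semantic segmentation loss."""
    driving = F.mse_loss(pred_commands, true_commands)
    auxiliary = F.cross_entropy(pred_segmentation, true_segmentation)
    return driving + aux_weight * auxiliary
```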

74 of 87

Auxiliary outputs

75 of 87

Interpretability summary

  • Neural networks are often considered black boxes: it can be hard to explain why they made a wrong decision and how to fix it.
  • Visual saliency can be used to highlight the areas that the neural network used to make the decision.
  • Another option is to force the network to predict auxiliary outputs besides driving commands and use those outputs to debug the network.

76 of 87

Agenda

  • Training methods
  • Network architectures
  • Evaluation methods
  • Interpretability
  • Safety
  • Companies

77 of 87

Is End-to-End the Future?

78 of 87

Putting stickers on the road misled Tesla Autopilot

79 of 87

Showing this image to a Tesla activated the windscreen wipers

80 of 87

Adding stickers to a stop sign caused it to be misclassified

81 of 87

Safety summary

  • End-to-end promises to make self-driving cheaper and easier to implement.
  • Testing and debugging is challenging with neural networks.
  • Adversarial examples need to be understood before we can let neural networks drive cars all by themselves.

82 of 87

Agenda

  • Training methods
  • Network architectures
  • Evaluation methods
  • Interpretability
  • Safety
  • Companies

83 of 87

Company: Tesla

Network inputs:

  • frontal camera(s)

Network outputs:

  • lanes
  • lane markers
  • drivable area
  • stop lines
  • obstacles (car, truck, pedestrian) and their distances
  • trajectory for the car

84 of 87

Company: comma.ai

Network inputs:

  • single frontal camera

Network outputs:

  • left and right lane markers
  • distance to the leading car, its relative velocity and acceleration
  • trajectory for the car

85 of 87

Company: Wayve

Network inputs:

  • three frontal cameras
  • optical flow for center camera
  • route command: straight, left, right

Network outputs:

  • steering wheel angle and change rate for N timesteps
  • car speed and acceleration for N timesteps

86 of 87

Companies summary

  • Most practical driving networks fall under the end-to-mid approach, where the network predicts useful affordances that are used by a classical controller.
  • Driving networks currently achieve level 2 autonomy at most, but they are often seen as the only path to level 5 autonomy in the future.
  • Tesla is in the best position to achieve a breakthrough in the use of neural networks for driving, due to their fleet and data collection infrastructure.
  • Wayve is the most interesting startup in autonomous driving, because they are the only ones pushing full end-to-end driving.

87 of 87

Thank you!

tambet.matiisen@ut.ee