How To See With An Event Camera
Cedric Scheerlinck
Supervisors: Prof. Robert Mahony, A/Prof. Nick Barnes, Prof. Davide Scaramuzza, Prof. Tom Drummond
Contents
History Of Photography
1827: First photo. View from the Window at Le Gras.
1861: First color photo. Tartan ribbon.
1887: One of the first videos. The Horse in Motion.
1957: First digital image. Russell Kirsch's son.
1964: Bullet through apple. Harold E. Edgerton.
2011: One trillion frames per second. MIT Media Lab.
Frame-based Video Cameras
[Figure: frame-based cameras today. Timeline: shutter open, collecting light, shutter close, producing Image 1, 2, 3. Smartphone camera anatomy: lenses, light sensor, pixel array, image formation.]
Frame-based Video Cameras: Drawbacks
Event Cameras: An Overview
Inspired by biological eyes.
Key properties: asynchronous per-pixel output (events), microsecond timestamps, high dynamic range (~120 dB), low power, low data rate.
An Event Camera Pixel
A pixel (top) from an event camera mimics biological cells (bottom).
Event camera pixel | Biological eye
Posch et al, PROC 2014
An Event Camera Pixel
Timestamp (µs) | Pixel (x, y) | Polarity (ON, OFF) |
[Figure: DAVIS USB camera (lens, chip, pixel); pixel input signal vs. output events, with the contrast threshold marked.]
DAVIS Event Camera Output
t | x | y | p |
0.003432 | 13 | 35 | 0 |
0.003464 | 4 | 24 | 0 |
0.005203 | 213 | 2 | 1 |
0.005242 | 5 | 75 | 0 |
0.006072 | 64 | 9 | 1 |
0.010764 | 36 | 126 | 1 |
0.010798 | 98 | 4 | 0 |
Image frames (30 fps)
Events
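As a concrete illustration (not part of the original slides), a minimal Python sketch of loading a DAVIS-style event stream into a NumPy structured array; the file name and the plain-text "t x y p" row format are assumptions that mirror the table above.

```python
import numpy as np

# Hypothetical text file with one event per line: "t x y p"
# (timestamp in seconds, pixel coordinates, polarity 0/1), as in the table above.
events = np.loadtxt("events.txt",
                    dtype=[("t", "f8"), ("x", "i4"), ("y", "i4"), ("p", "i1")])

# Convert polarity from {0, 1} to {-1, +1} for signed accumulation.
polarity = 2 * events["p"].astype(np.int8) - 1

print(f"{len(events)} events spanning {events['t'][-1] - events['t'][0]:.6f} s")
```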
What Do Events Look Like?
Mueggler et al, IJRR 2017
Bardow et al, CVPR 2016
Comparison Of Image Sensors
Phantom v2640 | Nikon D850 | Human eye | Event camera
Comparison Of Image Sensors
| | Ultrahigh-speed camera (Phantom v2640) | High-end DSLR camera (Nikon D850) | Human eye | Event camera |
| Equivalent framerate (fps) | 12,500 | 120 | 50 | 100,000 |
| Dynamic range (dB) | 64 | 45 | 30-40 | 120 |
| Power consumption (W) | 280 | 8 | 0.01 | 0.1 |
| Data rate (MB/s) | 800 | 8 | - | 0-1 |
| Output | Images | Images | Nerve impulses | Events |
Part I: Continuous-time Vision With Event Cameras
Related Work
Brandli et al. propose adding events to a (log) DAVIS image frame to update the image.
Brandli et al, ISCAS 2014
Frame | Frame + events
Related Work
Events are accumulated while the image is periodically regularised (smoothed).
Joint optimization of image and optic flow over a batch of events.
Reinbacher et al, BMVC 2016
Bardow et al, CVPR 2016
Motivation
The DAVIS event camera outputs both image frames (~30 fps) and asynchronous events.
Aim: Reconstruct super high speed, high dynamic range video with low latency.
C. Scheerlinck, N. Barnes, R. Mahony, "Continuous-time Intensity Estimation Using Event Cameras", ACCV, 2018
Approach
Instead of a sequence of temporally sparse image frames, we propose to estimate a continuous-time image state.
C. Scheerlinck, N. Barnes, R. Mahony, "Continuous-time Intensity Estimation Using Event Cameras", ACCV, 2018
[Figure: a frame-based camera outputs Frames 1, 2, 3 at discrete times; the event camera supports a continuous-time image state.]
Approach
To be useful in practical applications, e.g., real-time robotics, we would like our method to be computationally efficient and low-latency.
C. Scheerlinck, N. Barnes, R. Mahony, "Continuous-time Intensity Estimation Using Event Cameras", ACCV, 2018
Crazyflie Nano
Mathematical Notation
Events: $E(t) = \sum_i \sigma_i\, c\, \delta(t - t_i)$, where $t_i$ is the event timestamp, $\sigma_i \in \{-1, +1\}$ the polarity, and $c$ the contrast threshold.
Integrator: $L(t) = L(0) + \int_0^t E(\tau)\, d\tau$ (equivalent to $\frac{1}{s}$ in the frequency domain).
C. Scheerlinck, N. Barnes, R. Mahony, "Continuous-time Intensity Estimation Using Event Cameras", ACCV, 2018
Naive Approach: Direct Integration
Problem: low temporal frequency noise accumulates, degrading the estimate over time.
$\hat{L}(t) = \hat{L}(0) + \int_0^t E(\tau)\, d\tau$
C. Scheerlinck, N. Barnes, R. Mahony, "Continuous-time Intensity Estimation Using Event Cameras", ACCV, 2018
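A minimal per-pixel direct-integration sketch (my illustration, not the paper's code), assuming signed events and a known contrast threshold c; it makes the noise-accumulation problem described above easy to reproduce.

```python
import numpy as np

def integrate_events(events, height, width, c=0.1):
    """Naive reconstruction: add +/- c to a pixel for every event.

    `events` is an iterable of (t, x, y, polarity) with polarity in {-1, +1}.
    Noise (missed or spurious events, threshold mismatch) accumulates over time.
    """
    log_image = np.zeros((height, width), dtype=np.float32)
    for t, x, y, p in events:
        log_image[y, x] += p * c
    return log_image
```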
Approach: High-pass Filter
High-pass filters attenuate (reduce) low frequency components of the signal while allowing high frequency components to pass.
[Block diagram: events → integrator (∫ dt) → high-pass filter → reconstruction]
C. Scheerlinck, N. Barnes, R. Mahony, "Continuous-time Intensity Estimation Using Event Cameras", ACCV, 2018
Approach: High-pass Filter
Problem: Low temporal frequency information is lost (static background).
Conventional camera
High-pass filtered events
C. Scheerlinck, N. Barnes, R. Mahony, "Continuous-time Intensity Estimation Using Event Cameras", ACCV, 2018
Approach: Sensor Fusion
Can we fuse low-frequency information from frames with high-frequency information from events?
C. Scheerlinck, N. Barnes, R. Mahony, "Continuous-time Intensity Estimation Using Event Cameras", ACCV, 2018
Conventional camera
High-pass filtered events
Approach: Complementary Filter
[Block diagram: conventional camera frames → low-pass filter; event camera events → integrator (∫ dt) → high-pass filter; the two branches are summed to form the reconstruction.]
C. Scheerlinck, N. Barnes, R. Mahony, "Continuous-time Intensity Estimation Using Event Cameras", ACCV, 2018
Approach: Complementary Filter
[Block diagram: low-pass filtered frames (conventional camera) + high-pass filtered, integrated events (event camera) = all-pass reconstruction. Filter responses plotted on log-scale axes.]
C. Scheerlinck, N. Barnes, R. Mahony, "Continuous-time Intensity Estimation Using Event Cameras", ACCV, 2018
Reminder: Mathematical Notation
Events: $E(t) = \sum_i \sigma_i\, c\, \delta(t - t_i)$, with contrast threshold $c$ and polarity $\sigma_i \in \{-1, +1\}$.
Integrator: $L(t) = L(0) + \int_0^t E(\tau)\, d\tau$ (equivalent to $\frac{1}{s}$ in the frequency domain).
C. Scheerlinck, N. Barnes, R. Mahony, "Continuous-time Intensity Estimation Using Event Cameras", ACCV, 2018
Approach: Complementary Filter
Our complementary filter combines (temporally) low-pass filtered frames with high-pass filtered events.
[Block diagram: the log intensity estimate is formed by low-pass filtering the log frames and high-pass filtering the integrated events.]
Filter ODE (per pixel): $\dot{\hat{L}}(t) = E(t) - \alpha\big(\hat{L}(t) - L^F(t)\big)$, where $L^F$ is the log frame and $\alpha$ the crossover gain.
C. Scheerlinck, N. Barnes, R. Mahony, "Continuous-time Intensity Estimation Using Event Cameras", ACCV, 2018
Approach: Complementary Filter
The continuous-time ODE and solution can be obtained analytically.
[Derivation: frequency-domain filter → inverse Laplace transform → time-domain ODE]
C. Scheerlinck, N. Barnes, R. Mahony, "Continuous-time Intensity Estimation Using Event Cameras", ACCV, 2018
Approach: Complementary Filter
We solve the ODE in two regimes: between events, and at events, for every pixel.
If E = 0, i.e., between two events: exponential decay toward the frame.
If E ≠ 0, i.e., an event occurs (Dirac delta): jump by the contrast threshold.
Update whenever an event is received.
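For reference, a sketch of the two per-pixel update regimes (my reconstruction, following the filter ODE above and assuming gain α, contrast threshold c, and a frame value L^F held constant between frame updates):

```latex
% Per-pixel solution of the complementary filter (reconstruction sketch).
\begin{align*}
  E = 0 \ \text{(between events)}: \quad
    & \hat{L}(t) = L^{F} + e^{-\alpha (t - t_0)} \big( \hat{L}(t_0) - L^{F} \big) \\
  E = \sigma_i c\, \delta(t - t_i) \ \text{(at an event)}: \quad
    & \hat{L}(t_i^{+}) = \hat{L}(t_i^{-}) + \sigma_i c
\end{align*}
```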
C. Scheerlinck, N. Barnes, R. Mahony, "Continuous-time Intensity Estimation Using Event Cameras", ACCV, 2018
Computationally efficient: ~20M events per second on an i7 CPU.
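A per-event implementation sketch of this update (my own illustration under the same assumptions as above); each event costs O(1) work per pixel, which is what makes throughputs of millions of events per second plausible on a CPU.

```python
import numpy as np

class ComplementaryFilter:
    """Asynchronous per-event update of a log-intensity state (illustrative sketch)."""

    def __init__(self, height, width, alpha=2.0 * np.pi, c=0.1):
        self.alpha = alpha   # crossover gain (rad/s), assumed value
        self.c = c           # contrast threshold, assumed value
        self.log_frame = np.zeros((height, width), np.float32)  # latest log frame L^F
        self.state = np.zeros((height, width), np.float32)      # estimate \hat{L}
        self.last_t = np.zeros((height, width), np.float32)     # last update time per pixel

    def update_frame(self, log_frame):
        self.log_frame = log_frame.astype(np.float32)

    def update_event(self, t, x, y, polarity):
        # Decay toward the frame since this pixel's last update, then apply the jump.
        decay = np.exp(-self.alpha * (t - self.last_t[y, x]))
        self.state[y, x] = self.log_frame[y, x] + decay * (self.state[y, x] - self.log_frame[y, x])
        self.state[y, x] += polarity * self.c
        self.last_t[y, x] = t
```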
Approach: Complementary Filter
C. Scheerlinck, N. Barnes, R. Mahony, "Continuous-time Intensity Estimation Using Event Cameras", ACCV, 2018
Results
C. Scheerlinck, N. Barnes, R. Mahony, "Continuous-time Intensity Estimation Using Event Cameras", ACCV, 2018
Results
C. Scheerlinck, N. Barnes, R. Mahony, "Continuous-time Intensity Estimation Using Event Cameras", ACCV, 2018
Contributions
C. Scheerlinck, N. Barnes, R. Mahony, "Continuous-time Intensity Estimation Using Event Cameras", ACCV, 2018
Asynchronous Convolutions
Motivation: Spatial convolution is a fundamental image operator used for gradient computation, convolutional neural networks and much more, yet there is no natural spatial convolution operator for event cameras.
Idea: extend the continuous-time image framework to spatial image convolutions for event cameras.
C. Scheerlinck, N. Barnes, R. Mahony, “Asynchronous Spatial Image Convolutions for Event Cameras”, IEEE Robotics and Automation Letters (RAL), 2019.
Basics: Image Convolutions
C. Scheerlinck, N. Barnes, R. Mahony, “Asynchronous Spatial Image Convolutions for Event Cameras”, IEEE Robotics and Automation Letters (RAL), 2019.
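As a reminder of the operation itself (my sketch, not from the slides), a discrete 2D convolution of an image with a Sobel gradient kernel using SciPy; the image here is a placeholder.

```python
import numpy as np
from scipy.ndimage import convolve

# Sobel kernel for horizontal gradients.
sobel_x = np.array([[1, 0, -1],
                    [2, 0, -2],
                    [1, 0, -1]], dtype=np.float32)

image = np.random.rand(180, 240).astype(np.float32)   # placeholder image
grad_x = convolve(image, sobel_x, mode="nearest")      # image * kernel
```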
Approach: Asynchronous Convolutions
[Figure: events * kernel = ?]
C. Scheerlinck, N. Barnes, R. Mahony, “Asynchronous Spatial Image Convolutions for Event Cameras”, IEEE Robotics and Automation Letters (RAL), 2019.
Approach: Asynchronous Convolutions
Consider one event
[timestamp, x, y, ±1]
C. Scheerlinck, N. Barnes, R. Mahony, “Asynchronous Spatial Image Convolutions for Event Cameras”, IEEE Robotics and Automation Letters (RAL), 2019.
Approach: Asynchronous Convolutions
0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | -1 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 |
Event image
C. Scheerlinck, N. Barnes, R. Mahony, “Asynchronous Spatial Image Convolutions for Event Cameras”, IEEE Robotics and Automation Letters (RAL), 2019.
Approach: Asynchronous Convolutions
The event image is convolved with a 3×3 kernel (here, a Sobel-style gradient kernel).
C. Scheerlinck, N. Barnes, R. Mahony, “Asynchronous Spatial Image Convolutions for Event Cameras”, IEEE Robotics and Automation Letters (RAL), 2019.
Approach: Asynchronous Convolutions
Event image * kernel =
0 | 0 | 0 | 0 | 0 | 0 |
0 | 1 | 0 | -1 | 0 | 0 |
0 | 2 | 0 | -2 | 0 | 0 |
0 | 1 | 0 | -1 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 |
C. Scheerlinck, N. Barnes, R. Mahony, “Asynchronous Spatial Image Convolutions for Event Cameras”, IEEE Robotics and Automation Letters (RAL), 2019.
Approach: Asynchronous Convolutions
Six virtual events, or a single "convolved event", can be generated.
[Plot: the virtual events over time, with ON and OFF polarities]
C. Scheerlinck, N. Barnes, R. Mahony, “Asynchronous Spatial Image Convolutions for Event Cameras”, IEEE Robotics and Automation Letters (RAL), 2019.
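A sketch of the per-event rule described above (my illustration of the idea, not the authors' code): convolving a single event with a kernel spawns a small patch of "virtual events" whose values are the kernel weights scaled by the event polarity.

```python
import numpy as np

def convolve_event(x, y, polarity, kernel, height, width):
    """Return (xs, ys, values) of the virtual events produced by one input event."""
    k = kernel.shape[0] // 2                  # assume an odd, square kernel
    xs, ys, values = [], [], []
    for dy in range(-k, k + 1):
        for dx in range(-k, k + 1):
            u, v = x + dx, y + dy
            if 0 <= u < width and 0 <= v < height:
                w = polarity * kernel[dy + k, dx + k]
                if w != 0:
                    xs.append(u)
                    ys.append(v)
                    values.append(w)
    return xs, ys, values
```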
Approach: Asynchronous Convolutions
Convolved events can be used as input to an event processing algorithm, e.g., complementary filter:
[Block diagram: convolved events → complementary filter → image estimate]
C. Scheerlinck, N. Barnes, R. Mahony, “Asynchronous Spatial Image Convolutions for Event Cameras”, IEEE Robotics and Automation Letters (RAL), 2019.
Approach: Asynchronous Convolutions
[Block diagram: events → convolved events (gradient kernel) → filter → gradient estimate]
C. Scheerlinck, N. Barnes, R. Mahony, “Asynchronous Spatial Image Convolutions for Event Cameras”, IEEE Robotics and Automation Letters (RAL), 2019.
Results
C. Scheerlinck, N. Barnes, R. Mahony, “Asynchronous Spatial Image Convolutions for Event Cameras”, IEEE Robotics and Automation Letters (RAL), 2019.
Results: Gradient
Events | Gradient | Poisson integration
C. Scheerlinck, N. Barnes, R. Mahony, “Asynchronous Spatial Image Convolutions for Event Cameras”, IEEE Robotics and Automation Letters (RAL), 2019.
Results: Corner Detection
Gradient can be used as input to a corner detection algorithm.
When an event arrives, the Harris response is updated only in a local neighbourhood.
Gradient | Harris response
C. Scheerlinck, N. Barnes, R. Mahony, “Asynchronous Spatial Image Convolutions for Event Cameras”, IEEE Robotics and Automation Letters (RAL), 2019.
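A sketch of such a local update (my illustration, with assumed parameter names): given gradient-state images Gx and Gy, the Harris response is recomputed only in a small patch around the event location.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def update_harris_local(Gx, Gy, R, x, y, radius=4, window=3, kappa=0.04):
    """Recompute the Harris response R only near (x, y) (illustrative sketch).

    Gx, Gy: current gradient-state images; R: Harris response image (updated in place).
    """
    h, w = Gx.shape
    y0, y1 = max(0, y - radius), min(h, y + radius + 1)
    x0, x1 = max(0, x - radius), min(w, x + radius + 1)
    gx, gy = Gx[y0:y1, x0:x1], Gy[y0:y1, x0:x1]

    # Structure tensor entries, smoothed over a local window.
    Ixx = uniform_filter(gx * gx, size=window)
    Iyy = uniform_filter(gy * gy, size=window)
    Ixy = uniform_filter(gx * gy, size=window)

    det = Ixx * Iyy - Ixy * Ixy
    trace = Ixx + Iyy
    R[y0:y1, x0:x1] = det - kappa * trace * trace
```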
Results: Corner Detection
Gradient | Harris response | Corners | Frame-based corners
C. Scheerlinck, N. Barnes, R. Mahony, “Asynchronous Spatial Image Convolutions for Event Cameras”, IEEE Robotics and Automation Letters (RAL), 2019.
Results: Corner Detection
eHarris (Vasco '16) | FAST (Mueggler '17) | ARC (Ignacio '18) | Ours | Frame Harris
Local non-maximum suppression can be applied to our continuous-time Harris response state to get clean corners.
C. Scheerlinck, N. Barnes, R. Mahony, “Asynchronous Spatial Image Convolutions for Event Cameras”, IEEE Robotics and Automation Letters (RAL), 2019.
Part II: Convolutional Neural Networks For Event Cameras
Motivation
Convolutional neural networks (CNNs) are a powerful image-processing tool that yields state-of-the-art results across a range of computer vision tasks, including optic flow, classification, and segmentation.
Can CNNs be used with event cameras, e.g., to reconstruct high quality images?
Related Works
Events are not naturally suited to convolutional neural networks (CNNs).
Converting them to 3D space-time voxel grids yields state-of-the-art results in image reconstruction, optic flow and classification.
Gehrig et al, ICCV 2019
[Figure: events over time are binned into a voxel grid (bins 1-4)]
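A sketch of one common way to build such a voxel grid (my code, with assumed array shapes and the structured event array used earlier): each event's polarity is split between the two nearest temporal bins by bilinear interpolation in time, as in the works cited above.

```python
import numpy as np

def events_to_voxel_grid(events, num_bins, height, width):
    """Accumulate events (t, x, y, p) into a (num_bins, H, W) voxel grid."""
    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    t = events["t"].astype(np.float64)
    # Normalise timestamps to [0, num_bins - 1].
    t = (t - t[0]) / max(t[-1] - t[0], 1e-9) * (num_bins - 1)
    pol = 2 * events["p"].astype(np.float32) - 1

    left = np.floor(t).astype(int)
    right = np.clip(left + 1, 0, num_bins - 1)
    frac = (t - left).astype(np.float32)

    # Bilinear split of each event's polarity between neighbouring time bins.
    np.add.at(grid, (left, events["y"], events["x"]), pol * (1 - frac))
    np.add.at(grid, (right, events["y"], events["x"]), pol * frac)
    return grid
```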
Related Works
Rebecq et al. achieve state-of-the-art video reconstruction using a recurrent variant of UNet, trained with simulated data.
Rebecq et al., CVPR 2019; TPAMI 2020
Related Works
Synthetic training data for E2VID.
CoRL 2018
Limitations Of E2VID
[Chart: compute time per image (ms) on a Titan Xp GPU]
Fast Image Reconstruction With An Event Camera
Aim: achieve similar image quality as E2VID while drastically improving computational efficiency.
Idea: Starting from E2VID, remove components one-by-one while maintaining prediction accuracy.
C. Scheerlinck, H. Rebecq, D. Gehrig, N. Barnes, R. Mahony, D. Scaramuzza, “Fast Image Reconstruction with an Event Camera”, Winter Conference on Applications of Computer Vision (WACV), 2020.
Fast Image Reconstruction With An Event Camera
Result: our network runs 3-4x faster than E2VID, requires 10x fewer FLOPs, and is 99.6% smaller.
C. Scheerlinck, H. Rebecq, D. Gehrig, N. Barnes, R. Mahony, D. Scaramuzza, “Fast Image Reconstruction with an Event Camera”, Winter Conference on Applications of Computer Vision (WACV), 2020.
What about accuracy?
Fast Image Reconstruction With An Event Camera
Result: our network achieves similar accuracy to E2VID on the IJRR’17 dataset.
C. Scheerlinck, H. Rebecq, D. Gehrig, N. Barnes, R. Mahony, D. Scaramuzza, “Fast Image Reconstruction with an Event Camera”, Winter Conference on Applications of Computer Vision (WACV), 2020.
Recurrent Unit Ablation
C. Scheerlinck, H. Rebecq, D. Gehrig, N. Barnes, R. Mahony, D. Scaramuzza, “Fast Image Reconstruction with an Event Camera”, Winter Conference on Applications of Computer Vision (WACV), 2020.
Limitations
FireNet is slower to initialise (left) and has more smearing on fast motions (right).
C. Scheerlinck, H. Rebecq, D. Gehrig, N. Barnes, R. Mahony, D. Scaramuzza, “Fast Image Reconstruction with an Event Camera”, Winter Conference on Applications of Computer Vision (WACV), 2020.
Fast Image Reconstruction With An Event Camera
C. Scheerlinck, H. Rebecq, D. Gehrig, N. Barnes, R. Mahony, D. Scaramuzza, “Fast Image Reconstruction with an Event Camera”, Winter Conference on Applications of Computer Vision (WACV), 2020.
Work In Progress (Unpublished)
Aim: determine important factors for training image reconstruction and optic flow networks, such as sequence length, contrast thresholds, noise, and motion, to outperform the state-of-the-art and guide future research in this direction.
Limitations Of E2VID
[Chart: compute time per image (ms) on a Titan Xp GPU]
Sequence Length
To train a recurrent neural network, a (temporal) sequence of data is forward-passed through the network and the loss at each step is computed.
A single backpropagation step is performed at the end, updating the network weights based on the gradient of the accumulated loss with respect to the weights.
[Diagram: events over time are processed in Steps 1-4, producing Losses 1-4; Loss = Loss 1 + Loss 2 + Loss 3 + Loss 4, followed by a single network update (backprop).]
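A PyTorch-style sketch of this training scheme (my illustration; the network, data, and loss function are placeholders, and the `model(voxels, state)` signature is an assumption): losses from every step of the sequence are summed and a single backward pass updates the weights.

```python
import torch

def train_sequence(model, optimizer, loss_fn, voxel_sequence, target_sequence):
    """One update over a temporal sequence: forward every step, backprop once."""
    model.train()
    optimizer.zero_grad()
    state = None          # recurrent hidden state, carried across steps
    total_loss = 0.0
    for voxels, target in zip(voxel_sequence, target_sequence):
        prediction, state = model(voxels, state)   # assumed model signature
        total_loss = total_loss + loss_fn(prediction, target)
    total_loss.backward()  # single backprop through the whole sequence
    optimizer.step()
    return float(total_loss)
```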
Sequence Length
A shorter sequence length requires fewer computational steps (forward-passes) per update (backprop), thus may train faster.
A longer sequence may endow the network with a longer temporal ‘memory’.
| Training data | E2VID | Ours |
| Sequence length | 40 steps | 135 steps |
| Camera motions | Medium-fast | Slow-fast |
| Noise | Clean (no noise) | Noise added |
| Scenes | Planar, static | Planar, static |
E2VID Vs. Ours: Fading
E2VID fades rapidly while ours maintains temporal persistence.
Conclusion: long temporal sequences are key to improving temporal memory.
E2VID | Ours
Limitations Of E2VID
[Chart: compute time per image (ms) on a Titan Xp GPU]
Contrast Thresholds
How do you measure the contrast thresholds (CT) of an event camera?
Heuristic: measure the rate of events/(pixel*second).
High CT will produce fewer events; low CT will produce more.
Synthetic data | Real data
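A sketch of the heuristic (my code; the field names match the event table shown earlier): the mean event rate per pixel per second, which drops as the contrast threshold rises.

```python
import numpy as np

def events_per_pixel_per_second(events, height, width):
    """Heuristic proxy for the contrast threshold: mean event rate per pixel."""
    duration = events["t"][-1] - events["t"][0]
    return len(events) / (height * width * max(duration, 1e-9))
```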
| Training data | E2VID | Ours |
| Contrast thresholds | Drawn from a Gaussian distribution (mean = 0.18, std = 0.03) | Range from 0.2 to 1.0 |
E2VID Vs. Ours: MVSEC Dataset
E2VID simply breaks on event data from the MVSEC dataset, while ours produces a reasonable video.
Conclusion: a wide range of contrast thresholds in the training data improves generalizability to other datasets.
E2VID | Ours
Image + Flow Network
We trained a combined network to output image and flow simultaneously.
Same-size network, so it remains computationally efficient.
But so far the combined network performs worse than an image-only or flow-only network.
Our Event Convolutional Neural Network
Outperforms the state-of-the-art (E2VID) by 15-30% on major event camera datasets.
| | E2VID | Ours |
| Contrast thresholds | Gaussian; mean = 0.18, std = 0.03 | Range from 0.2 to 1.0 |
| Motion | Medium-fast, planar, static scenes | Slow-fast, multiple 2D objects flying across a moving background |
| Noise | Clean | Noise added dynamically at train time |
| Loss | LPIPS: VGG pretrained weights | LPIPS: AlexNet pretrained weights |
| Sequence length | 40 images | 120 images |
| Optic flow | No | Yes |
Conclusion
Event cameras are:
Pros: fast, high dynamic range, low power sensors.
Cons: noisy, difficult to process using conventional computer vision (no images).
Part I: complementary filtering can be used for computationally efficient real-time image reconstruction and convolution.
Part II: CNNs are currently state-of-the-art in image reconstruction, optic flow (and more) for event cameras. Training data including a range of contrast thresholds, motion types, noise, and long temporal sequences are key to improving results.
An Event Camera Pixel
The key difference between a frame-based camera and an event camera is the pixel circuitry.
[Figure: DAVIS USB camera: lens, chip, pixel]
Gallego et al, arXiv 2019
Results
C. Scheerlinck, N. Barnes, R. Mahony, “Asynchronous Spatial Image Convolutions for Event Cameras”, IEEE Robotics and Automation Letters (RAL), 2019.
Related Works
Zhu et al. show depth, ego-motion and state-of-the-art optic flow using a convolutional UNet architecture, trained on 11 minutes of driving data from their MVSEC dataset.
They were the first to propose the voxel-grid representation for events.
Voxel grid
RSS 2018
CVPR 2019
[Results: optic flow and depth; panels show events, prediction, and groundtruth]
Zhu et al, CVPR 2019
LPIPS distance (Learned Perceptual Image Patch Similarity), Zhang et al., CVPR 2018
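For context, LPIPS can be computed with the reference `lpips` Python package (Zhang et al.); a minimal usage sketch, assuming images as 1x3xHxW tensors scaled to [-1, 1], with the AlexNet backbone matching the "Ours" row of the training table.

```python
import torch
import lpips

loss_fn = lpips.LPIPS(net="alex")            # AlexNet backbone
img0 = torch.rand(1, 3, 180, 240) * 2 - 1    # placeholder images in [-1, 1]
img1 = torch.rand(1, 3, 180, 240) * 2 - 1
distance = loss_fn(img0, img1)               # perceptual distance (lower = more similar)
```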
“VGG” ICLR 2015 (30k citations)
Temporal consistency (TC) loss, ECCV 2018
Training E2VID
[Diagram: ESIM generates events and groundtruth images over time (steps 1-4); events are binned into voxel grids and passed through a recurrent UNet; training uses LPIPS and temporal consistency (TC) losses, with a differentiable path from prediction to loss.]