1 of 84

How To See With An Event Camera

Cedric Scheerlinck

Supervisors: Prof. Robert Mahony, A/Prof. Nick Barnes, Prof. Davide Scaramuzza, Prof. Tom Drummond


2 of 84

Contents

  1. Introduction
    1. Frame-based Cameras
    2. Event Cameras
  2. Continuous-time Vision With Event Cameras
  3. Convolutional Neural Networks For Event Cameras
  4. Conclusion


3 of 84

History Of Photography

1827: First photo. View from the Window at Le Gras.

1861: First color photo. Tartan ribbon.

1887: One of the first videos. The Horse in Motion.

1957: First digital image. Russell Kirsch's son.

1964: Bullet through Apple. Harold E. Edgerton.

2011: One trillion frames per second. MIT Media Lab.

4 of 84

Frame-based Video Cameras

[Figure: image formation in a frame-based video camera. A smartphone camera focuses light through lenses onto a light sensor (pixel array); between shutter open and shutter close each pixel collects light, producing Image 1, Image 2, Image 3, ... over time.]

5 of 84

Frame-based Video Cameras: Drawbacks

  1. Redundant sampling: e.g., a static scene is captured repeatedly.
    1. High data rate (1-10Mb/s)
  2. Low intra-scene dynamic range, cannot see bright and dark at the same time.
    • Over/under-exposure
    • Auto-exposure artifacts (e.g., sun causes image to darken)
  3. Motion blur.


6 of 84

Event Cameras: An Overview

Inspired by biological eyes.

Key properties:

  1. Asynchronous
    1. Independent pixels
    2. Only report changes in brightness
    3. Low bandwidth (0-1Mb/s)
  2. Low latency (0.5ms)
  3. High dynamic range (120dB)
  4. Low power (0.1W)
  5. No motion blur.


7 of 84

An Event Camera Pixel

A pixel (top) from an event camera mimics biological cells (bottom).

  1. Incoming light hits a sensor, generating photocurrent signal.
  2. The signal is amplified and compared to a reference level.
  3. If the change is positive, an ON event is triggered; if negative, an OFF event is triggered.
  4. No change = no event.
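A minimal sketch of this trigger logic in Python (an idealised pixel with a single fixed contrast threshold; function and variable names are illustrative, not the DAVIS circuit):

    import numpy as np

    def pixel_events(timestamps, log_intensity, contrast_threshold=0.2):
        """Idealised event-pixel model: emit ON (+1) / OFF (-1) events whenever the
        log intensity moves more than one contrast threshold from the reference level."""
        events = []
        reference = log_intensity[0]                      # level memorised at the last event
        for t, L in zip(timestamps, log_intensity):
            while L - reference >= contrast_threshold:    # brightness increased enough
                reference += contrast_threshold
                events.append((t, +1))                    # ON event
            while reference - L >= contrast_threshold:    # brightness decreased enough
                reference -= contrast_threshold
                events.append((t, -1))                    # OFF event
        return events

    # Example: a step increase in brightness produces a burst of ON events.
    t = np.linspace(0.0, 1.0, 100)
    L = np.where(t < 0.5, 0.0, 1.0)
    print(pixel_events(t, L)[:3])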

[Figure: event camera pixel (top) and biological eye (bottom). Posch et al, PROC 2014]

8 of 84

An Event Camera Pixel

  • Each event contains: a timestamp (µs), a pixel location (x, y), and a polarity (ON or OFF).

[Figure: DAVIS USB camera, chip, pixel and lens; input signal vs. output events with the contrast threshold.]

9 of 84

DAVIS Event Camera Output

t (s)       x     y     p
0.003432    13    35    0
0.003464    4     24    0
0.005203    213   2     1
0.005242    5     75    0
0.006072    64    9     1
0.010764    36    126   1
0.010798    98    4     0
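Each row of the table above is one event. A sketch of how such a stream is typically loaded for processing (the file name and column order are assumptions matching the table):

    import numpy as np

    # Text file with one event per row: t [s], x, y, p (0 = OFF, 1 = ON).
    events = np.loadtxt("events.txt")
    t = events[:, 0]
    x = events[:, 1].astype(int)
    y = events[:, 2].astype(int)
    polarity = np.where(events[:, 3] > 0, 1, -1)   # map {1, 0} -> {+1, -1}
    print(f"{len(events)} events spanning {t[-1] - t[0]:.6f} s")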

[Figure: the DAVIS outputs image frames at 30 fps alongside the asynchronous event stream.]

10 of 84

What Do Events Look Like?


Mueggler et al, IJRR 2017


Bardow et al, CVPR 2016

11 of 84

Comparison Of Image Sensors

[Images: Phantom v2640, Nikon D850, human eye, event camera.]

12 of 84

Comparison Of Image Sensors

                            Ultrahigh-speed camera   High-end DSLR camera   Human eye        Event camera
                            (Phantom v2640)          (Nikon D850)
Equivalent framerate (fps)  12,500                   120                    50               100,000
Dynamic range (dB)          64                       45                     30-40            120
Power consumption (W)       280                      8                      0.01             0.1
Data rate (MB/s)            800                      8                      -                0 - 1
Output                      Images                   Images                 Nerve impulses   Events

13 of 84

Part I: Continuous-time Vision With Event Cameras


14 of 84

Related Work

Brandli et al. propose adding events to a (log) DAVIS image frame to update the image.


Brandli et al, ISCAS 2014

[Figure: frame vs. frame + events.]

15 of 84

Related Work

Reinbacher et al: events are accumulated while the image is periodically regularised (smoothed).

Bardow et al: joint optimization of image and optic flow over a batch of events.


Reinbacher et al, BMVC 2016

Bardow et al, CVPR 2016

16 of 84

Motivation

The DAVIS event camera outputs:

  1. low frequency, low dynamic range, motion blurred image frames
  2. high frequency, high dynamic range events.

Aim: Reconstruct super high speed, high dynamic range video with low latency.


C. Scheerlinck, N. Barnes, R. Mahony, "Continuous-time Intensity Estimation Using Event Cameras", ACCV, 2018

17 of 84

Approach

Instead of a sequence of temporally sparse image frames, we propose to estimate a continuous-time image state.

C. Scheerlinck, N. Barnes, R. Mahony, "Continuous-time Intensity Estimation Using Event Cameras", ACCV, 2018

[Figure: a frame-based camera outputs discrete frames 1, 2, 3 over time; an event camera supports a continuous-time image state that is updated between frames.]

18 of 84

Approach

To be useful in practical applications, e.g., real-time robotics, we would like our method to be:

  1. Computationally efficient
  2. Low latency
  3. Update on a per-event basis (no batching)


C. Scheerlinck, N. Barnes, R. Mahony, "Continuous-time Intensity Estimation Using Event Cameras", ACCV, 2018

[Image: Crazyflie Nano quadrotor.]

19 of 84

Mathematical Notation

Per pixel, the event stream is modelled as a continuous-time signal.

Events: E(t) = Σ_i σ_i c δ(t − t_i), where σ_i = ±1 is the polarity, c the contrast threshold and t_i the event timestamps.

Integrator 1/s (equivalent to ∫ · dt): integrating E(t) gives the accumulated change in log intensity.

C. Scheerlinck, N. Barnes, R. Mahony, "Continuous-time Intensity Estimation Using Event Cameras", ACCV, 2018

20 of 84

Naive Approach: Direct Integration

Problem: low temporal frequency noise accumulates, degrading the estimate over time.

[Block diagram: events E(t) → integrator (∫ dt) → log-intensity estimate.]
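A sketch of this naive integration per pixel (assuming events as (t, x, y, p) tuples and a single global contrast threshold; names are illustrative):

    import numpy as np

    def integrate_events(events, height, width, contrast_threshold=0.2):
        """Naive direct integration: add +/- one contrast threshold per event.
        Sensor noise and per-pixel threshold mismatch accumulate, so the estimate drifts."""
        log_image = np.zeros((height, width), dtype=np.float32)
        for t, x, y, p in events:
            log_image[int(y), int(x)] += contrast_threshold if p > 0 else -contrast_threshold
        return log_image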

C. Scheerlinck, N. Barnes, R. Mahony, "Continuous-time Intensity Estimation Using Event Cameras", ACCV, 2018

21 of 84

Approach: High-pass Filter

High-pass filters attenuate (reduce) low frequency components of the signal while allowing high frequency components to pass.

[Block diagram: events E(t) → integrator (∫ dt) → high-pass filter → log-intensity estimate.]
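A per-event sketch of the high-pass filtered reconstruction (the cut-off gain alpha and event handling are assumptions consistent with the filter described above): between events each pixel decays exponentially towards zero; at an event it jumps by plus or minus the contrast threshold.

    import numpy as np

    def highpass_reconstruction(events, height, width, alpha=2 * np.pi, c=0.2):
        """High-pass filtered event integration (sketch).
        alpha [rad/s] sets the cut-off: low temporal frequencies decay away."""
        L = np.zeros((height, width), dtype=np.float32)        # log-intensity estimate
        last_t = np.zeros((height, width), dtype=np.float64)   # per-pixel time of last update
        for t, x, y, p in events:
            x, y = int(x), int(y)
            L[y, x] *= np.exp(-alpha * (t - last_t[y, x]))     # decay between events
            L[y, x] += c if p > 0 else -c                      # jump at the event
            last_t[y, x] = t
        return L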

C. Scheerlinck, N. Barnes, R. Mahony, "Continuous-time Intensity Estimation Using Event Cameras", ACCV, 2018

22 of 84

Approach: High-pass Filter

Problem: Low temporal frequency information is lost (static background).

[Figure: conventional camera image vs. high-pass filtered events; the static background is lost in the event-only reconstruction.]

C. Scheerlinck, N. Barnes, R. Mahony, "Continuous-time Intensity Estimation Using Event Cameras", ACCV, 2018

23 of 84

Approach: Sensor Fusion

Can we fuse low-frequency information from frames with high-frequency information from events?


C. Scheerlinck, N. Barnes, R. Mahony, "Continuous-time Intensity Estimation Using Event Cameras", ACCV, 2018


24 of 84

Approach: Complementary Filter

[Block diagram: conventional camera frames → low-pass filter; event camera events → integrator (∫ dt) → high-pass filter; the two branches are summed (+) to form the estimate.]

C. Scheerlinck, N. Barnes, R. Mahony, "Continuous-time Intensity Estimation Using Event Cameras", ACCV, 2018

25 of 84

Approach: Complementary Filter

[Figure (axes plotted in log scale): the low-pass response applied to conventional camera frames and the high-pass response applied to integrated events (∫ dt) sum to an approximately all-pass reconstruction.]

C. Scheerlinck, N. Barnes, R. Mahony, "Continuous-time Intensity Estimation Using Event Cameras", ACCV, 2018

26 of 84

Reminder: Mathematical Notation

Events: E(t) = Σ_i σ_i c δ(t − t_i).

Integrator 1/s (equivalent to ∫ · dt).

Contrast threshold: c.

C. Scheerlinck, N. Barnes, R. Mahony, "Continuous-time Intensity Estimation Using Event Cameras", ACCV, 2018

27 of 84

Approach: Complementary Filter

Our complementary filter combines (temporally) low-pass filtered frames with high-pass filtered events.

[Block diagram: log frames → low-pass filter; events → integrator → high-pass filter; the two are summed to give the log intensity estimate.]

C. Scheerlinck, N. Barnes, R. Mahony, "Continuous-time Intensity Estimation Using Event Cameras", ACCV, 2018

28 of 84

Approach: Complementary Filter

The continuous-time ODE and solution can be obtained analytically.

Frequency domain: L̂(s) = E(s) / (s + α) + α L_F(s) / (s + α), where α is the crossover frequency between events and frames.

Time domain (via the inverse Laplace transform): dL̂(t)/dt = E(t) − α ( L̂(t) − L_F(t) ).

C. Scheerlinck, N. Barnes, R. Mahony, "Continuous-time Intensity Estimation Using Event Cameras", ACCV, 2018

29 of 84

Approach: Complementary Filter

We solve the ODE in two regimes: between events, and at events, for every pixel.

If E = 0, i.e. between two events: the estimate decays exponentially towards the latest frame, L̂(t) = L_F + e^(−α(t − t_i)) ( L̂(t_i) − L_F ).

If E ≠ 0, i.e. an event occurs (Dirac delta): the estimate jumps by one contrast threshold, L̂(t_i⁺) = L̂(t_i⁻) ± c.

Update whenever an event is received.

C. Scheerlinck, N. Barnes, R. Mahony, "Continuous-time Intensity Estimation Using Event Cameras", ACCV, 2018

Computationally efficient: 20M events / second on an i7 CPU.
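A per-event sketch of this update (frame handling, variable names and the zero-order hold on frames are assumptions; the decay/jump structure follows the two regimes above):

    import numpy as np

    def complementary_filter(events, log_frames, frame_times, alpha=2 * np.pi, c=0.2):
        """Asynchronous complementary filter (sketch): between events each pixel decays
        towards the latest log frame; at an event it jumps by +/- the contrast threshold."""
        L_hat = log_frames[0].astype(np.float32).copy()     # log-intensity estimate
        L_frame = log_frames[0].astype(np.float32)          # most recent log frame
        last_t = np.full(L_hat.shape, frame_times[0], dtype=np.float64)
        frame_idx = 0
        for t, x, y, p in events:
            # Zero-order hold: latch the newest frame that is not in the future.
            while frame_idx + 1 < len(log_frames) and frame_times[frame_idx + 1] <= t:
                frame_idx += 1
                L_frame = log_frames[frame_idx].astype(np.float32)
            x, y = int(x), int(y)
            decay = np.exp(-alpha * (t - last_t[y, x]))
            L_hat[y, x] = L_frame[y, x] + decay * (L_hat[y, x] - L_frame[y, x])  # between events
            L_hat[y, x] += c if p > 0 else -c                                    # at the event
            last_t[y, x] = t
        return L_hat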

30 of 84

Approach: Complementary Filter


C. Scheerlinck, N. Barnes, R. Mahony, "Continuous-time Intensity Estimation Using Event Cameras", ACCV, 2018

31 of 84


Results

C. Scheerlinck, N. Barnes, R. Mahony, "Continuous-time Intensity Estimation Using Event Cameras", ACCV, 2018

32 of 84


Results

C. Scheerlinck, N. Barnes, R. Mahony, "Continuous-time Intensity Estimation Using Event Cameras", ACCV, 2018

33 of 84

Contributions

  1. Continuous-time formulation of complementary filtering for image reconstruction with event cameras.
  2. Asynchronous, per-event update scheme.
  3. Real-time implementation that outperformed state-of-the-art at the time.
  4. Fastest event camera image reconstruction algorithm to date.
  5. Open source code - github.com/cedric-scheerlinck/dvs_image_reconstruction


C. Scheerlinck, N. Barnes, R. Mahony, "Continuous-time Intensity Estimation Using Event Cameras", ACCV, 2018

34 of 84

Asynchronous Convolutions

Motivation: Spatial convolution is a fundamental image operator used for gradient computation, convolutional neural networks and much more, yet there is no natural spatial convolution operator for event cameras.

Idea: extend the continuous-time image framework to spatial image convolutions for event cameras.


C. Scheerlinck, N. Barnes, R. Mahony, “Asynchronous Spatial Image Convolutions for Event Cameras”, IEEE Robotics and Automation Letters (RAL), 2019.

35 of 84

Basics: Image Convolutions
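As a refresher, a sketch of a plain frame-based 2D convolution (assuming scipy is available; the image here is a placeholder and the Sobel kernel is the same one used in the asynchronous examples later):

    import numpy as np
    from scipy.ndimage import convolve

    # Sobel kernel for horizontal gradients.
    SOBEL_X = np.array([[-1, 0, 1],
                        [-2, 0, 2],
                        [-1, 0, 1]], dtype=np.float32)

    image = np.random.rand(180, 240).astype(np.float32)    # placeholder image
    gradient_x = convolve(image, SOBEL_X, mode="nearest")   # dense, frame-based convolution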


C. Scheerlinck, N. Barnes, R. Mahony, “Asynchronous Spatial Image Convolutions for Event Cameras”, IEEE Robotics and Automation Letters (RAL), 2019.

36 of 84

Approach: Asynchronous Convolutions


C. Scheerlinck, N. Barnes, R. Mahony, “Asynchronous Spatial Image Convolutions for Event Cameras”, IEEE Robotics and Automation Letters (RAL), 2019.

37 of 84

Approach: Asynchronous Convolutions


Consider one event

[timestamp, x, y, ±1]

C. Scheerlinck, N. Barnes, R. Mahony, “Asynchronous Spatial Image Convolutions for Event Cameras”, IEEE Robotics and Automation Letters (RAL), 2019.

38 of 84

Approach: Asynchronous Convolutions

Consider one event: [timestamp, x, y, ±1].

Event image: the corresponding image is zero everywhere except ±1 at the event pixel (here, a -1 for an OFF event).

C. Scheerlinck, N. Barnes, R. Mahony, “Asynchronous Spatial Image Convolutions for Event Cameras”, IEEE Robotics and Automation Letters (RAL), 2019.

39 of 84

Approach: Asynchronous Convolutions

Kernel * event image: the single-event image (zero everywhere except -1 at the event pixel) is convolved with a kernel, here a 3x3 Sobel kernel.

C. Scheerlinck, N. Barnes, R. Mahony, “Asynchronous Spatial Image Convolutions for Event Cameras”, IEEE Robotics and Automation Letters (RAL), 2019.

40 of 84

Approach: Asynchronous Convolutions

Kernel * event image = the kernel, scaled by the event polarity, stamped around the event pixel. For the 3x3 Sobel kernel and an OFF event this gives the patch

 1   0  -1
 2   0  -2
 1   0  -1

centred on the event pixel; all other pixels remain zero.

C. Scheerlinck, N. Barnes, R. Mahony, “Asynchronous Spatial Image Convolutions for Event Cameras”, IEEE Robotics and Automation Letters (RAL), 2019.

41 of 84

Approach: Asynchronous Convolutions

From the nonzero entries of this patch, six virtual events (ON and OFF, all sharing the original timestamp), or a single convolved event, can be generated.

C. Scheerlinck, N. Barnes, R. Mahony, “Asynchronous Spatial Image Convolutions for Event Cameras”, IEEE Robotics and Automation Letters (RAL), 2019.
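A sketch of this single-event stamping (the kernel and event format are assumptions consistent with the example above; convolving a one-event image with a kernel simply reproduces the kernel, scaled by the polarity, around the event pixel):

    import numpy as np

    SOBEL_X = np.array([[-1, 0, 1],
                        [-2, 0, 2],
                        [-1, 0, 1]], dtype=np.float32)

    def convolved_events(event, kernel=SOBEL_X):
        """Turn one input event into weighted 'virtual' (convolved) events."""
        t, x, y, polarity = event                       # polarity in {+1, -1}
        half_h, half_w = kernel.shape[0] // 2, kernel.shape[1] // 2
        virtual = []
        for dy in range(-half_h, half_h + 1):
            for dx in range(-half_w, half_w + 1):
                w = polarity * kernel[dy + half_h, dx + half_w]
                if w != 0:
                    virtual.append((t, x + dx, y + dy, w))   # virtual event of weight w
        return virtual

    # One OFF event produces six weighted virtual events for a 3x3 Sobel kernel.
    print(convolved_events((0.0052, 10, 12, -1)))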

42 of 84

Approach: Asynchronous Convolutions

Convolved events can be used as input to an event processing algorithm, e.g., complementary filter:

[Block diagram: events → convolution → complementary filter (frame input set to 0) → image estimate.]

C. Scheerlinck, N. Barnes, R. Mahony, “Asynchronous Spatial Image Convolutions for Event Cameras”, IEEE Robotics and Automation Letters (RAL), 2019.

43 of 84

Approach: Asynchronous Convolutions

[Block diagram: events → Sobel convolution → convolved events → filter (frame input set to 0) → gradient estimate.]

C. Scheerlinck, N. Barnes, R. Mahony, “Asynchronous Spatial Image Convolutions for Event Cameras”, IEEE Robotics and Automation Letters (RAL), 2019.

44 of 84

Results


C. Scheerlinck, N. Barnes, R. Mahony, “Asynchronous Spatial Image Convolutions for Event Cameras”, IEEE Robotics and Automation Letters (RAL), 2019.

45 of 84

Results: Gradient

[Figure panels: events; gradient estimate; Poisson integration of the gradient.]

C. Scheerlinck, N. Barnes, R. Mahony, “Asynchronous Spatial Image Convolutions for Event Cameras”, IEEE Robotics and Automation Letters (RAL), 2019.

46 of 84

Results: Corner Detection

Gradient can be used as input to a corner detection algorithm.

When an event arrives, the Harris response is only updated in a local neighbourhood.
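For reference, a dense sketch of the Harris response computed from gradient images (standard formulation; the per-event version only recomputes it in a window around the event, which is not shown here):

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def harris_response(grad_x, grad_y, sigma=1.0, k=0.04):
        """Harris corner response from image gradients (dense sketch)."""
        Ixx = gaussian_filter(grad_x * grad_x, sigma)   # smoothed structure-tensor entries
        Ixy = gaussian_filter(grad_x * grad_y, sigma)
        Iyy = gaussian_filter(grad_y * grad_y, sigma)
        det = Ixx * Iyy - Ixy ** 2
        trace = Ixx + Iyy
        return det - k * trace ** 2                     # large positive response at corners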

[Figure panels: gradient; Harris response.]

C. Scheerlinck, N. Barnes, R. Mahony, “Asynchronous Spatial Image Convolutions for Event Cameras”, IEEE Robotics and Automation Letters (RAL), 2019.

47 of 84

Results: Corner Detection

[Figure panels: gradient; Harris response; corners; frame-based corners.]

C. Scheerlinck, N. Barnes, R. Mahony, “Asynchronous Spatial Image Convolutions for Event Cameras”, IEEE Robotics and Automation Letters (RAL), 2019.

48 of 84

Results: Corner Detection

[Figure panels: eHarris (Vasco '16); FAST (Mueggler '17); ARC (Alzugaray '18); ours; frame-based Harris.]

Local non-maximum suppression can be applied to our continuous-time Harris response state to get clean corners.

C. Scheerlinck, N. Barnes, R. Mahony, “Asynchronous Spatial Image Convolutions for Event Cameras”, IEEE Robotics and Automation Letters (RAL), 2019.

49 of 84

Part II: Convolutional Neural Networks For Event Cameras


50 of 84

Motivation

Convolutional neural networks (CNNs) are a powerful image processing tool that yields state-of-the-art results across a range of computer vision tasks, including optic flow, classification and segmentation.

Can CNNs be used with event cameras, e.g., to reconstruct high quality images?


51 of 84

Related Works

Events are not naturally suited to convolutional neural networks (CNNs).

Converting them to 3D space-time voxel grids yields state-of-the-art results in image reconstruction, optic flow and classification.


Gehrig et al, ICCV 2019

[Figure: events over time are binned into a voxel grid with temporal channels 1, 2, 3, 4.]
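A sketch of the voxel-grid conversion (the bin count and the bilinear weighting in time are the common choices; exact details vary between papers):

    import numpy as np

    def events_to_voxel_grid(events, num_bins, height, width):
        """Bin events into a (num_bins, H, W) space-time voxel grid (sketch).
        events: (N, 4) array of [t, x, y, polarity], polarity in {+1, -1}, sorted by t."""
        voxel = np.zeros((num_bins, height, width), dtype=np.float32)
        t = events[:, 0]
        x = events[:, 1].astype(int)
        y = events[:, 2].astype(int)
        p = events[:, 3].astype(np.float32)
        # Normalise timestamps to [0, num_bins - 1] and split each event between
        # its two neighbouring temporal bins (bilinear interpolation in time).
        t_norm = (num_bins - 1) * (t - t[0]) / max(t[-1] - t[0], 1e-9)
        lower = np.floor(t_norm).astype(int)
        frac = (t_norm - lower).astype(np.float32)
        upper = np.clip(lower + 1, 0, num_bins - 1)
        np.add.at(voxel, (lower, y, x), p * (1.0 - frac))
        np.add.at(voxel, (upper, y, x), p * frac)
        return voxel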

52 of 84

Related Works

Rebecq et al. achieve state-of-the-art video reconstruction using a recurrent variant of UNet, trained with simulated data.


Rebecq et al, CVPR 2019; TPAMI 2020

53 of 84

Related Works

Synthetic training data for E2VID.


ESIM simulator: Rebecq et al, CoRL 2018

54 of 84


55 of 84

Limitations Of E2VID

  1. Computational cost
  2. Doesn’t generalize to MVSEC dataset
  3. Fades rapidly when event rate drops


[Chart: compute time per image (ms), Titan Xp GPU.]


57 of 84

Fast Image Reconstruction With An Event Camera

Aim: achieve similar image quality as E2VID while drastically improving computational efficiency.

Idea: Starting from E2VID, remove components one-by-one while maintaining prediction accuracy.


C. Scheerlinck, H. Rebecq, D. Gehrig, N. Barnes, R. Mahony, D. Scaramuzza, “Fast Image Reconstruction with an Event Camera”, Winter Conference on Applications of Computer Vision (WACV), 2020.

58 of 84

Fast Image Reconstruction With An Event Camera

Result: our network runs 3-4x faster than E2VID, requires 10x fewer FLOPs, and is 99.6% smaller.


C. Scheerlinck, H. Rebecq, D. Gehrig, N. Barnes, R. Mahony, D. Scaramuzza, “Fast Image Reconstruction with an Event Camera”, Winter Conference on Applications of Computer Vision (WACV), 2020.

What about accuracy?

59 of 84

Fast Image Reconstruction With An Event Camera

Result: our network achieves similar accuracy to E2VID on the IJRR’17 dataset.


C. Scheerlinck, H. Rebecq, D. Gehrig, N. Barnes, R. Mahony, D. Scaramuzza, “Fast Image Reconstruction with an Event Camera”, Winter Conference on Applications of Computer Vision (WACV), 2020.

60 of 84

Recurrent Unit Ablation


C. Scheerlinck, H. Rebecq, D. Gehrig, N. Barnes, R. Mahony, D. Scaramuzza, “Fast Image Reconstruction with an Event Camera”, Winter Conference on Applications of Computer Vision (WACV), 2020.

61 of 84

Limitations

FireNet is slower to initialise (left) and has more smearing on fast motions (right).


C. Scheerlinck, H. Rebecq, D. Gehrig, N. Barnes, R. Mahony, D. Scaramuzza, “Fast Image Reconstruction with an Event Camera”, Winter Conference on Applications of Computer Vision (WACV), 2020.

62 of 84

Fast Image Reconstruction With An Event Camera


C. Scheerlinck, H. Rebecq, D. Gehrig, N. Barnes, R. Mahony, D. Scaramuzza, “Fast Image Reconstruction with an Event Camera”, Winter Conference on Applications of Computer Vision (WACV), 2020.

63 of 84

Work In Progress (Unpublished)

Aim: determine important factors for training image reconstruction and optic flow networks, such as:

  • simulation parameters
  • training parameters
  • data augmentation
  • loss functions
  • network architecture

in order to outperform the state of the art and guide future research in this direction.


64 of 84

Limitations Of E2VID

  • Computational cost
  • Doesn’t generalize to MVSEC dataset
  • Fades rapidly when event rate drops


65 of 84

Sequence Length

To train a recurrent neural network, a (temporal) sequence of data is forward-passed through the network and the loss at each step is computed.

A single backpropagation step is performed at the end, updating the network weights based on the gradient of the accumulated loss with respect to the weights.
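A sketch of one such update in PyTorch (the recurrent model interface, loss function and variable names are assumptions): the losses from every step are summed and a single backward pass updates the weights.

    import torch

    def train_on_sequence(model, optimizer, loss_fn, sequence):
        """One training update over a temporal sequence (sketch).
        sequence: list of (input_voxel, target_image) pairs;
        model(input, state) is assumed to return (prediction, state)."""
        optimizer.zero_grad()
        state = None
        total_loss = 0.0
        for voxel, target in sequence:                               # Step 1 ... Step N
            prediction, state = model(voxel, state)
            total_loss = total_loss + loss_fn(prediction, target)    # Loss 1 + ... + Loss N
        total_loss.backward()                                        # single backprop for the whole sequence
        optimizer.step()
        return float(total_loss.detach())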

[Diagram: events over time are processed in steps 1-4; each step produces a loss; Loss = Loss 1 + Loss 2 + Loss 3 + Loss 4; a single network update (backprop) follows.]

66 of 84

Sequence Length

A shorter sequence length requires fewer computational steps (forward passes) per update (backprop), and thus may train faster.

A longer sequence may endow the network with a longer temporal ‘memory’.


67 of 84

E2VID training data:
- sequence length = 40 steps
- medium-fast camera motions
- clean (no noise)
- planar, static scenes

Ours training data:
- sequence length = 135 steps
- slow-fast camera motions
- noise added
- planar, static scenes

68 of 84

E2VID Vs. Ours: Fading

E2VID fades rapidly while ours maintains temporal persistence.

Conclusion: long temporal sequences are key to improving temporal memory.


69 of 84

Limitations Of E2VID

  • Computational cost
  • Doesn’t generalize to MVSEC dataset
  • Fades rapidly when event rate drops


70 of 84

Contrast Thresholds

How do you measure the contrast thresholds (CT) of an event camera?

Heuristic: measure the rate of events/(pixel*second).

A high CT will produce fewer events; a low CT will produce more.
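A sketch of this heuristic (array layout as in the earlier loading example; names are illustrative):

    import numpy as np

    def events_per_pixel_per_second(events, height, width):
        """Event-rate statistic used to compare simulated and real data.
        events: (N, 4) array of [t, x, y, p], sorted by time."""
        duration = events[-1, 0] - events[0, 0]
        return len(events) / (height * width * max(duration, 1e-9))

Matching this statistic between simulation and a real camera is one way to choose the simulated contrast thresholds.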


[Figure: synthetic data vs. real data.]

71 of 84

E2VID training data: contrast thresholds drawn from a Gaussian distribution with mean = 0.18, std = 0.03.

Ours training data: contrast thresholds range from 0.2 to 1.0.

72 of 84

E2VID Vs. Ours: MVSEC Dataset

E2VID simply breaks on event data from the MVSEC dataset, while ours produces a reasonable video.

Conclusion: a wide range of contrast thresholds in the training data improves generalizability to other datasets.


73 of 84

Image + Flow Network

We trained a combined network to output image and flow simultaneously.

The combined network is the same size, so it is computationally efficient.

But so far the combined network performs worse than an image-only or flow-only network.


74 of 84

Our Event Convolutional Neural Network

Outperforms state-of-the-art (E2VID) by 15-30% on major event camera datasets.

                     E2VID                                Ours
Contrast thresholds  Gaussian; mean=0.18, std=0.03        Range from 0.2-1.0
Motion               Medium-fast, planar, static scenes   Slow-fast, multiple 2D objects flying across a moving background
Noise                Clean                                Noise added dynamically at train time
Loss                 LPIPS: VGG pretrained weights        LPIPS: AlexNet pretrained weights
Sequence length      40 images                            120 images
Optic flow           No                                   Yes

75 of 84

Conclusion

Event cameras:

Pros: fast, high dynamic range, low-power sensors.

Cons: noisy, and difficult to process using conventional computer vision (no images).

Part I: complementary filtering can be used for computationally efficient real-time image reconstruction and convolution.

Part II: CNNs are currently state-of-the-art in image reconstruction, optic flow (and more) for event cameras. Training data with a range of contrast thresholds, motion types and noise, together with long temporal sequences, is key to improving results.


76 of 84

An Event Camera Pixel

The key difference between a frame-based camera and an event camera is the pixel circuitry.


Gallego et al, arXiv 2019

77 of 84

Results


C. Scheerlinck, N. Barnes, R. Mahony, “Asynchronous Spatial Image Convolutions for Event Cameras”, IEEE Robotics and Automation Letters (RAL), 2019.

78 of 84

Related Works

Zhu et al. show depth, ego-motion and state-of-the-art optic flow using a convolutional UNet architecture, trained on 11 minutes of driving data from their MVSEC dataset.

They were the first to propose the voxel-grid representation for events.


Voxel grid

Zhu et al, RSS 2018; CVPR 2019

79 of 84

[Figure panels: optic flow (events, prediction, ground truth) and depth (events, prediction, ground truth).]

Zhu et al, CVPR 2019

80 of 84


81 of 84

LPIPS: Learned Perceptual Image Patch Similarity (Zhang et al, CVPR 2018).

82 of 84

LPIPS distance


"VGG" network: Simonyan & Zisserman, ICLR 2015 (30k+ citations).

83 of 84

TC: temporal consistency loss (ECCV 2018).

84 of 84

Training E2VID

[Diagram: ESIM simulates events and ground-truth images; the events over time (steps 1, 2, 3, 4) are converted to voxel grids and passed through the recurrent UNet; predictions are compared to the ground-truth images with LPIPS and TC losses; the pipeline is differentiable.]