1 of 21

Video surveillance for road traffic monitoring with computer vision techniques

Team 2

Guillem Delgado, Jordi Gené, Francisco Roldan, Victor Segura

2 of 21

Motivation

ROAD SAFETY

3 of 21

Outline

  • Introduction
  • State of the Art
  • Speed estimator using traditional CV techniques
    • Video stabilization
    • Background subtraction
    • Tracking
    • Speed estimation
    • Anomaly detection application
  • Speed estimator with Deep Learning
  • Datasets
  • Results
  • Conclusions

4 of 21

1. Introduction


Objective: track the speed of one or several cars and detect anomalies or speeding on the road

Track speed of vehicles

Track speed from POV

5 of 21

2. State of the art

Approaches: LIDAR [1], color cameras [2], deep learning [3]

[1] Chen, Xin, et al. "Next generation map making: geo-referenced ground-level LIDAR point clouds for automatic retro-reflective road feature extraction." Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. ACM, 2009.

[2] Yabo, Agustín, et al. "Vehicle classification and speed estimation using computer vision techniques." XXV Congreso Argentino de Control Automático (AADECA 2016)(Buenos Aires, 2016). 2016.

[3] Braham, Marc, and Marc Van Droogenbroeck. "Deep background subtraction with scene-specific convolutional neural networks." Systems, Signals and Image Processing (IWSSIP), 2016 International Conference on. IEEE, 2016.

6 of 21

3. Speed estimator using traditional CV techniques

Pipeline


Video stabilization → Background subtraction → Car tracking → Speed estimation → Video recording

7 of 21

3.1. Video Stabilization

Demo: original vs. BM (block matching) stabilization

  • The optical flow was computed for each frame using backward block matching, with a block size of 16×16 pixels and a search area of 32 pixels.

  • The global movement was computed as the mode (the most frequent value) of the optical flow vectors obtained at every pixel.

  • We only allowed translations of the images; rotations and affine transformations, for example, were not considered. Nevertheless, this method suits our dataset well because the camera jitter produces only translations (see the sketch below).
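A minimal sketch of this stabilization step. The block size and search area match the slide; the coarse search stride and the helper names are our own simplifications, not the project's verbatim code:

```python
import numpy as np
import cv2

def global_translation(prev, curr, block=16, search=32):
    """Global motion as the mode of per-block backward matches."""
    votes = {}
    h, w = curr.shape
    for y in range(0, h - block, block):
        for x in range(0, w - block, block):
            patch = curr[y:y + block, x:x + block].astype(int)
            best, best_err = (0, 0), np.inf
            # Coarse stride of 4 to keep the sketch fast; the real
            # search visits every displacement in the 32-pixel area.
            for dy in range(-search, search + 1, 4):
                for dx in range(-search, search + 1, 4):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy and yy + block <= h and 0 <= xx and xx + block <= w:
                        err = np.sum(np.abs(prev[yy:yy + block, xx:xx + block].astype(int) - patch))
                        if err < best_err:
                            best_err, best = err, (dx, dy)
            votes[best] = votes.get(best, 0) + 1
    return max(votes, key=votes.get)  # mode of the block displacements

def stabilize(prev_bgr, curr_bgr):
    prev = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    curr = cv2.cvtColor(curr_bgr, cv2.COLOR_BGR2GRAY)
    dx, dy = global_translation(prev, curr)
    # Translation-only compensation: shift the frame back by the
    # dominant backward displacement to cancel the jitter.
    M = np.float32([[1, 0, dx], [0, 1, dy]])
    return cv2.warpAffine(curr_bgr, M, (curr_bgr.shape[1], curr_bgr.shape[0]))
```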

8 of 21

3.2. Background subtraction


Dataset | Color space | A_filt | Connectivity | SE1 | SE2 | Alpha
Highway | YCrCb       | 160    | 8            | 11  | 9   | 0.23
Fall    | YCrCb       | 160    | 8            | 3   | 9   | 0.5
Traffic | RGB         | 160    | 8            | 3   | 13  | 1.5

Pipeline: Gaussian modelling → Area filtering → Dilation (SE1) → Hole filling (connectivity) → Erosion (SE1) → Opening (SE2)

*SE1 and SE2 are the structuring element sizes used in the morphological operators.
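A hedged sketch of this chain, shown with the Traffic parameters from the table above. The exact Gaussian foreground rule and the function name are assumptions, not the project's verbatim code:

```python
import numpy as np
import cv2
from scipy import ndimage

def foreground_mask(frame, mu, sigma, alpha=1.5, a_filt=160, conn=8, se1=3, se2=13):
    """mu/sigma: per-pixel mean and std of the background, fitted on
    the first part of the sequence (the Gaussian model)."""
    # Gaussian modelling: a pixel is foreground when it deviates from
    # the model (the "+2" smoothing term is an assumption).
    fg = (np.abs(frame.astype(float) - mu) >= alpha * (sigma + 2)).any(axis=2)
    mask = fg.astype(np.uint8)

    # Dilation (SE1), then hole filling with the requested connectivity.
    k1 = cv2.getStructuringElement(cv2.MORPH_RECT, (se1, se1))
    mask = cv2.dilate(mask, k1)
    structure = np.ones((3, 3)) if conn == 8 else ndimage.generate_binary_structure(2, 1)
    mask = ndimage.binary_fill_holes(mask, structure=structure).astype(np.uint8)

    # Erosion (SE1) undoes the dilation; opening (SE2) removes small blobs.
    mask = cv2.erode(mask, k1)
    k2 = cv2.getStructuringElement(cv2.MORPH_RECT, (se2, se2))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, k2)

    # Area filtering: drop connected components smaller than A_filt pixels.
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=conn)
    for i in range(1, n):
        if stats[i, cv2.CC_STAT_AREA] < a_filt:
            mask[labels == i] = 0
    return mask
```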

9 of 21

3.3. Tracking

Custom SORT: simple, online, and real-time tracking of multiple objects in a video sequence

Detection

Estimation model

Data association

Background subtraction: filter by area, then apply mathematical morphology (erosions and dilations with specific structuring elements and connectivity) to remove background noise

From each detection we take the centroid of the bounding box and compute the scale/area and aspect ratio to obtain the target state

[x1, y1, x2, y2] → [x, y, s, r]

Each detected vehicle is associated with a specific tracker by computing the IoU distance between the detection and all predicted bounding boxes from the existing targets

A Kalman filter predicts the bounding-box positions of the targets and tracks the detected vehicles

where x, y are the centroid coordinates, s is the scale/area, and r is the aspect ratio.
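A sketch of the state conversion and the IoU cost used for data association, following the conventions of the SORT paper (function names are illustrative):

```python
import numpy as np

def bbox_to_state(bbox):
    """[x1, y1, x2, y2] -> [x, y, s, r]: centroid, scale/area, aspect ratio."""
    x1, y1, x2, y2 = bbox
    w, h = x2 - x1, y2 - y1
    return np.array([x1 + w / 2.0, y1 + h / 2.0, w * h, w / float(h)])

def state_to_bbox(state):
    """Inverse mapping, applied after the Kalman prediction step."""
    x, y, s, r = state
    w = np.sqrt(s * r)   # from s = w*h and r = w/h
    h = s / w
    return np.array([x - w / 2.0, y - h / 2.0, x + w / 2.0, y + h / 2.0])

def iou(a, b):
    """IoU between two [x1, y1, x2, y2] boxes, used as association cost."""
    xx1, yy1 = max(a[0], b[0]), max(a[1], b[1])
    xx2, yy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, xx2 - xx1) * max(0.0, yy2 - yy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union
```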

10 of 21

3.3. Tracking

Other methods: Meanshift. Key idea: locate the maxima of a density function given discrete sampled data

Detection

Set up detections

Predict positions

Background subtraction: filter by area, then apply mathematical morphology (erosions and dilations with specific structuring elements and connectivity) to remove background noise

Get the bounding-box location of each object detected as a car

Transform the detection from RGB to HSV color space

Compute its normalized histogram

Compute the back projection using the histogram from the set-up stage, then run the meanshift algorithm to predict the new location
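A minimal OpenCV sketch of the set-up and prediction stages. The hue-only histogram, the bbox = (x, y, w, h) convention, and the function names are assumptions:

```python
import cv2

def setup_tracker(frame, bbox):
    """Set-up stage: hue histogram of the detected car."""
    x, y, w, h = bbox
    hsv_roi = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv_roi], [0], None, [180], [0, 180])
    cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
    return hist

def predict_position(frame, bbox, hist):
    """Prediction stage: back projection + meanshift relocates the window."""
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    back_proj = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
    term = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
    _, new_bbox = cv2.meanShift(back_proj, bbox, term)
    return new_bbox
```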

Pros:

  • Suitable for real-time applications
  • Needs few parameters
  • Can handle different feature spaces

Cons:

  • Difficult to select an optimal window size
  • An inappropriate window size causes false positives
  • The shape of the detection may change over time

11 of 21

3.4. Speed estimation


Traffic stabilized

During the tracking we can calculate the displacement D of an object between two different frames. This displacement D is measured in pixels.

To obtain the velocity, we need the correspondence between pixels and real-world distance. In the Traffic dataset we have assumed that 36 pixels in the image correspond to 1 m. Given the frame rate (fr = 20 Hz), the pixel-distance correspondence (pxD = 36 pixels/m), and the displacement D [pixels] of an object between two frames, the velocity of the object can be computed as:

v [km/h] = (D / pxD) · fr · 3.6

where 3.6 is the conversion factor from m/s to km/h.
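A small sketch of this computation under the slide's assumptions (fr = 20 Hz, pxD = 36 pixels/m; the helper name is illustrative):

```python
FR = 20.0    # frame rate of the Traffic sequence [Hz]
PXD = 36.0   # assumed pixel-to-metre correspondence [pixels/m]

def speed_kmh(prev_centroid, curr_centroid, frames_apart=1):
    """Speed of a tracked object from its centroid displacement.

    D [pixels] / PXD -> metres; * FR / frames_apart -> m/s; * 3.6 -> km/h.
    """
    dx = curr_centroid[0] - prev_centroid[0]
    dy = curr_centroid[1] - prev_centroid[1]
    d_pixels = (dx ** 2 + dy ** 2) ** 0.5
    return (d_pixels / PXD) * (FR / frames_apart) * 3.6

# e.g. an 18-pixel displacement between consecutive frames:
# (18 / 36) * 20 * 3.6 = 36 km/h
print(speed_kmh((100, 50), (118, 50)))  # -> 36.0
```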

12 of 21

3.4. Speed estimation


In our implementation we have assumed that there is no distortion due to the projection. To take this distortion into account, we should apply a homography before the speed estimation (idea from Team2 Class2016).

Assumptions:

width and height of the road

pixels to meters

Planar homography:

projection of the image into a reference plane.

Slide credit: Team2 Class2016
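A hedged OpenCV sketch of the idea. The four correspondences and the file name are hypothetical, chosen from the assumed road width/height and the pixels-to-metres ratio:

```python
import cv2
import numpy as np

frame = cv2.imread("traffic_frame.png")  # any Traffic frame (hypothetical file)

# Four image points on the road and the rectangle they map to in the
# reference plane, sized from the assumed road dimensions.
src = np.float32([[118, 35], [210, 35], [320, 240], [0, 240]])   # image [px]
dst = np.float32([[0, 0], [280, 0], [280, 500], [0, 500]])       # plane [px]

H = cv2.getPerspectiveTransform(src, dst)            # 3x3 planar homography
top_view = cv2.warpPerspective(frame, H, (280, 500))
# Displacements measured on top_view now have a uniform pixel-to-metre
# ratio, so the speed estimation of section 3.4 applies without bias.
```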

13 of 21

3.5. Anomaly detection application


Hypothesis: when there is a road anomaly, the speed of the cars will not remain constant, so the standard deviation of each car's speed measurements will be higher.
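A minimal sketch of this rule; the threshold value and function name are hypothetical tuning choices, not values from the project:

```python
import numpy as np

def is_anomalous(per_car_speeds, std_threshold=10.0):
    """Flag an anomaly when tracked cars do not keep a steady speed.

    per_car_speeds: one list of speed samples [km/h] per tracked car.
    std_threshold is a hypothetical tuning parameter.
    """
    stds = [np.std(speeds) for speeds in per_car_speeds if len(speeds) > 1]
    return bool(stds) and float(np.mean(stds)) > std_threshold

# Steady traffic vs. cars braking hard around an obstacle:
print(is_anomalous([[88, 90, 89], [91, 90, 92]]))   # False
print(is_anomalous([[90, 60, 25], [95, 70, 30]]))   # True
```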

14 of 21

4. Speed estimation using Deep Learning


Pipeline: Data preparation → Data preprocessing → CNN → Speed prediction

Pairs of consecutive frames are created and shuffled for training purposes.

  • The same saturation value is used for both frames.
  • The upper and bottom parts of the image are cropped, then the image is resized.
  • Farneback's dense optical flow encodes the motion between the pair.

NVIDIA model for end-to-end autonomous driving.
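A hedged sketch of the preprocessing for one training pair. The crop rows, output size, and Farneback parameters are illustrative values, not the project's exact settings:

```python
import cv2

def preprocess_pair(f1, f2, out_size=(512, 512)):
    """Build one network input from a pair of consecutive frames."""
    # Crop sky and dashboard (row limits are hypothetical), then resize.
    f1, f2 = f1[100:400], f2[100:400]
    f1 = cv2.resize(f1, out_size)
    f2 = cv2.resize(f2, out_size)

    # Copy the saturation channel so both frames share the same values.
    hsv1 = cv2.cvtColor(f1, cv2.COLOR_BGR2HSV)
    hsv2 = cv2.cvtColor(f2, cv2.COLOR_BGR2HSV)
    hsv2[..., 1] = hsv1[..., 1]
    f2 = cv2.cvtColor(hsv2, cv2.COLOR_HSV2BGR)

    # Farneback's dense optical flow as the motion representation.
    g1 = cv2.cvtColor(f1, cv2.COLOR_BGR2GRAY)
    g2 = cv2.cvtColor(f2, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(g1, g2, None, 0.5, 3, 15, 3, 5, 1.2, 0)
    return flow   # (H, W, 2) array fed to the CNN
```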

15 of 21

4. Speed estimation using Deep Learning


Demo: NvidiaModel speed prediction (~10 m/s).

16 of 21

4. Speed estimation using Deep Learning


  • 5 convolutional layers, 4 fully connected layers, and an output layer (see the sketch below)
  • No learning involved in the normalization layer
  • Adam optimizer with a learning rate of 1e-4
  • MSE as the loss function

Bojarski, M., Del Testa, D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P., ... & Zhang, X. (2016). End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316.
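A hedged Keras sketch of this architecture adapted to single-value speed regression. Layer sizes follow Bojarski et al.; the ELU activations and the 66×200 input shape are assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

def nvidia_model(input_shape=(66, 200, 3)):
    """Sketch of the NVIDIA architecture for speed regression."""
    model = keras.Sequential([
        # Normalization involves no learning (fixed scaling).
        layers.Lambda(lambda x: x / 127.5 - 1.0, input_shape=input_shape),
        layers.Conv2D(24, 5, strides=2, activation="elu"),
        layers.Conv2D(36, 5, strides=2, activation="elu"),
        layers.Conv2D(48, 5, strides=2, activation="elu"),
        layers.Conv2D(64, 3, activation="elu"),
        layers.Conv2D(64, 3, activation="elu"),
        layers.Flatten(),
        layers.Dense(1164, activation="elu"),   # 4 fully connected layers...
        layers.Dense(100, activation="elu"),
        layers.Dense(50, activation="elu"),
        layers.Dense(10, activation="elu"),
        layers.Dense(1),                        # ...plus the output layer
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
                  loss="mse")  # Adam at 1e-4 and MSE, as on the slide
    return model
```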

17 of 21

5. Datasets


Driving Point of View

Traffic Dataset

Our own dataset, recorded from the driver's POV at 1080p@30fps and reduced to 512×512@20fps for the neural network

18 of 21

6. Results


Qualitative results: Highway, Fall, Traffic, and Traffic-stabilized sequences.

The anomaly detector did not work as intended on our own dataset, so we cannot provide qualitative results for it.

19 of 21

6. Results


Generally, the network detects when the car is increasing or decreasing its speed. However, it struggles with repetitive scenes such as highways, where many pixels remain the same across frames. Moreover, when the car is stationary the model is sensitive to the movement of other vehicles. The different recording hardware and the camera placement are probably the determining factors behind the performance gap between one sample and the other.

Jovan Sardinha validation video

20 of 21

7. Conclusions


  • The application did not work as intended. New methods or new datasets would be needed to make it work.
  • The parameters used for background subtraction depend heavily on the dataset, so each application must be an ad hoc system.
  • With a simple CNN we are able to predict speed from the driver's POV as a regression task. However, we do not use raw data as input to the network.
  • The deep learning approach struggled with our own data, possibly because of the camera positioning and the recording hardware.

21 of 21