1 of 30

Path Prediction in Crowded Environments using Computer Vision and DL

30/06/2020

Shyamsundar P I

BITS Pilani

2 of 30

Task at hand

Trajectory prediction of dynamic objects is a widely studied topic in artificial intelligence. Especially with the introduction of modern technologies such as self-driving cars, efficient path prediction models that allow these automated machines to avoid collisions with pedestrians and other entities have been drawing the attention of researchers.

This thesis proposes an end-to-end deep learning model to learn and predict the motion patterns of pedestrians, aided by object detection and tracking models.

The study also discusses the results by comparing them to the available standard benchmarks, and outlines the future scope of what could be done further with this model.

3 of 30

Problem Setup

  • We intend to solve path prediction for humans (later extendable to other objects of interest) at a traffic signal, in the context of a self-driving car.
  • Existing techniques make use of several cameras on the car, including very costly LIDAR and RADAR systems.
  • We intend to achieve state-of-the-art performance with feeds from just two cameras placed perpendicular to each other: one on the car and the other on a traffic signal pole (somewhat similar to the illustration).

[Illustration: placement of Cam - 1 and Cam - 2]

4 of 30

Standard workflow involved in solving our problem

Object Detection

Given video feeds from the installed cameras, it is essential to detect humans in each frame and draw bounding boxes around them, hence localising their positions. Multiple camera feeds are used for this step.

Object Tracking

Since multiple humans appear in the video, it is important to identify a person as the same person from the previous frame (and across the other cameras' footage), in order to know the track they have traversed so far.

Path Prediction

Once the previous tracks of a person have been obtained using object tracking methods, these can be fed into machine learning / probabilistic learning algorithms, which learn the tracks as well as other features from the footage, in order to predict the path taken by a person at a future time instance.
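The three-stage workflow above can be condensed into a skeleton pipeline. This is only an illustrative sketch: every function body here is a placeholder for the detection, tracking and prediction models described in the later slides.

```python
# Minimal sketch of the detection -> tracking -> prediction pipeline.
# All three stage functions are placeholders for the models described later.

def detect(frame):
    """Return bounding boxes [x, y, w, h] of humans in one frame."""
    return []  # placeholder for the object detection model

def assign_ids(boxes, history):
    """Match each box to a known identity (or create a new one)."""
    return {i: box for i, box in enumerate(boxes)}  # placeholder tracker

def predict_paths(tracks):
    """Given per-identity track histories, predict future positions."""
    return {pid: trail[-1] for pid, trail in tracks.items()}  # placeholder

def run_pipeline(frames):
    tracks = {}  # identity -> list of boxes over time
    for frame in frames:
        boxes = detect(frame)
        identified = assign_ids(boxes, tracks)
        for pid, box in identified.items():
            tracks.setdefault(pid, []).append(box)
    return predict_paths(tracks)
```

Each stage feeds the next: detections become identity-tagged tracks, and the accumulated tracks become the input to the predictor.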

5 of 30

“Here’s the catch”

Though many applications of CV can be solved by developing simple mathematical models, that is not always the case.

A simple example is object detection: given an image of a dog, for the computer to recognise that the image is of a dog, it must have prior knowledge of what a dog looks like (similar to the cognitive skills of the human brain). This is where the problem of 'learning' comes in, referred to as machine learning in the context of computers; in theory, these are complex statistical algorithms.

This prompted us to decide exactly which methods to use, and why.

6 of 30

Why deep learning for our problem?

All the state-of-the-art techniques for our problem are based on some form of DL architecture.

This can be explained by the fact that all three steps involved in our model are highly complex and require a large number of parameters to capture non-linearities and process the results properly.

DL also greatly simplifies the task at hand.

For example, object detection with traditional CV/ML algorithms requires a prior feature extraction step. This step is highly complex, since images containing different classes of objects may have different distinctive features, which need to be handpicked; this requires expertise. DL completely overrides this step by autonomously choosing and learning the prominent features.

7 of 30

Work Environment

  • All coding for our problem is done in Python 3.7. Several dependencies had to be downloaded and installed, such as numpy, tensorflow-gpu, keras, conda, git etc.
  • The initial plan was to use local computation power to run all the code, but owing to several version mismatches between the required dependencies, the work environment was shifted to Google Colab, which comes with all dependencies and packages pre-installed.
  • Colab is a free GPU work environment provided by Google, on which simulations can be run for up to 12 hours.
  • TensorFlow 2.x with the Keras framework has been used with CUDA 10.
  • Video reading and writing is done with OpenCV2.
  • Visualisation of predictions is done using PIL.

8 of 30

Object Detection

  • Object detection using deep learning is a highly researched field. Several DNN architectures with extremely high accuracy are already in use today.
  • For our problem, we use one of TensorFlow's pre-trained object detection models, trained on the COCO dataset and downloaded from GitHub.
  • A pre-trained object detection model is used because training is very computationally expensive, and the architectures are already so refined that there is little to no room for improvement.

9 of 30

Deep-learning based object detection algorithms

  • R-CNN (Region-Based Convolutional Neural Network)
  • SPP (Spatial Pyramid Pooling)
  • Fast R-CNN
  • Faster R-CNN
  • Feature Pyramid Networks
  • RetinaNet (Focal Loss)
  • YOLO (You Only Look Once) framework: YOLOv1, YOLOv2, YOLOv3
  • SSD (Single Shot Detector)

10 of 30

Object Detection Model and Architecture

  • ResNet V2 has been chosen as the backbone architecture for the object detection step.
  • This backbone, using the Faster R-CNN algorithm, was chosen owing to its nominal speed and its highest classification accuracy on medium-sized objects (humans, in our case).
  • The faster_rcnn_resnet101_coco model was chosen from the list of pre-trained models.

11 of 30

  • The above object detection model returns a Python list containing the prediction confidence, bounding box coordinates and predicted class for each detection.
  • Only the class values pertaining to humans were filtered out, and bounding boxes were drawn around those detections.
  • A variation of the model (Mask R-CNN) was also implemented, which gives pixel-wise predictions, enabling a mask to be created around predictions of interest.
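The human-filtering step above can be sketched as follows, assuming the normalized (boxes, classes, scores) output format of TensorFlow's pre-trained detection models and class id 1 for 'person' in the COCO label map; the helper names are illustrative, not the thesis code.

```python
import numpy as np

PERSON_CLASS_ID = 1  # 'person' in the COCO label map used by TF detection models

def filter_person_boxes(boxes, classes, scores, score_thresh=0.5):
    """Keep only confident 'person' detections.

    boxes: (N, 4) rows of [ymin, xmin, ymax, xmax] in normalized coordinates,
    the format returned by TensorFlow's pre-trained detection models.
    """
    boxes, classes, scores = map(np.asarray, (boxes, classes, scores))
    keep = (classes == PERSON_CLASS_ID) & (scores >= score_thresh)
    return boxes[keep], scores[keep]

def to_pixel_corners(box, frame_w, frame_h):
    """Convert one normalized box to the two pixel corners cv2.rectangle expects."""
    ymin, xmin, ymax, xmax = box
    return (int(xmin * frame_w), int(ymin * frame_h)), \
           (int(xmax * frame_w), int(ymax * frame_h))
```

The pixel corners can then be passed straight to OpenCV's rectangle drawing when rendering the detections on each frame.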

[Illustrations: outputs of MODEL 1 and MODEL 2]

12 of 30

A word on object detection with masks model

This model can be very useful if one intends to make path predictions for pedestrians according to their body pose (e.g. face and feet orientation), since the masks make it possible to localise the orientation of different parts of the human body. This is one of the future scopes of the masking model in path prediction.

13 of 30

Object Tracking using Deep Learning

In the model pipeline, the object tracking model takes its input from the object detection model in the form of object detections, their bounding box coordinates and their prediction confidences over all frames of the video. There are three major approaches to object tracking using DL:

  • DL as Auto-encoders: This proposes a network of autoencoders stacked in layers that are used to refine visual features extracted from natural scenes. After the extraction step, affinity computation is performed.
  • CNNs as visual feature extractors: The most widely used methods for feature extraction are based on subtle modifications of convolutional neural networks. The extracted features are later passed to other networks for further processing and final predictions for re-identification are made after computing and comparing various distance metrics.
  • Siamese Networks: These networks use the idea of training CNNs with loss functions that combine information from different images, in order to learn the set of features that best differentiates examples of different objects.

A very detailed study of the above methods, and many more, is given in the following review paper, which summarises the findings of around 174 papers: "Deep learning in video multi-object tracking: A survey".

14 of 30

Object Tracking Algorithm

Two short-term-memory based algorithms, in combination with a re-identification framework, have been used for object tracking:

  • Short Term Memory Tracking (STMT): A database of the unique people appearing in the video is maintained. Every time a new person is detected, the localised image of that person, along with the images of all prior people stored in the database, is fed to the re-identification network, which gives the probability of person A being the same as one of the people X, Y, Z stored in the database. To reduce the computation involved, if a person stored in the database does not appear in the video for more than 10 seconds, they are deleted from the database; if they reappear after 10 seconds, they are counted as a new person. If the inference from the re-ID network gives a probability below a given threshold, the detection is treated as a new person and assigned a new identity α.

Pro: Can be extremely accurate in most scenarios, depending on the accuracy of the re-ID network used.

Con: Extremely computationally expensive. Inference with DNNs is costly: every time a person is detected in a new frame or camera feed, their image has to be compared against the images of all N people in the database, so every detection requires N inferences. In a traffic scene with many people, this approach is extremely inefficient.
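The STMT loop can be sketched as below. `reid_similarity` is a placeholder for the re-identification network inference, and the threshold value and class names are illustrative assumptions.

```python
import time

# Sketch of the Short Term Memory Tracking (STMT) loop described above.
# `reid_similarity` stands in for the re-identification network; it should
# return a probability that the two image crops show the same person.

def reid_similarity(crop_a, crop_b):
    return 0.0  # placeholder for the re-ID network inference

class ShortTermMemoryTracker:
    def __init__(self, match_thresh=0.7, forget_after=10.0):
        self.match_thresh = match_thresh
        self.forget_after = forget_after   # seconds before a person is dropped
        self.gallery = {}                  # identity -> (crop, last_seen_time)
        self._next_id = 0

    def assign(self, crop, now=None):
        now = time.time() if now is None else now
        # Forget people not seen for more than `forget_after` seconds.
        self.gallery = {pid: (c, t) for pid, (c, t) in self.gallery.items()
                        if now - t <= self.forget_after}
        # One re-ID inference per stored person: the expensive step.
        scores = {pid: reid_similarity(crop, c)
                  for pid, (c, t) in self.gallery.items()}
        if scores and max(scores.values()) >= self.match_thresh:
            pid = max(scores, key=scores.get)   # best match above threshold
        else:
            pid = self._next_id                 # below threshold: new identity
            self._next_id += 1
        self.gallery[pid] = (crop, now)
        return pid
```

The per-detection cost is visible in the `scores` dictionary: one network inference per stored identity, which is exactly the N-inference bottleneck noted above.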

15 of 30

  • Intersection over Union Tracking (IoUT): This builds upon the previous algorithm with the aim of improving its efficiency. The concept of IoU is introduced here:

IoU = (Intersection Area)/(Union Area)

In this approach there are two databases. Database 1 stores the bounding box coordinates of the detected people over the past k (user-specified) frames. Every time a new person is detected, their box coordinates are taken and the IoU is computed against all k frames of the N stored people. The person x (out of the N people) with the frame of highest IoU against the new detection is assigned to it. If none of these IoUs is above a threshold value, the new detection is assigned a new place in the database as some person y. To cross-check the IoU result, feature matching is applied between the images in the two compared boxes with the highest IoU; the matching score should be high, under the assumption that the object cannot have moved much within at most k frames. If this score is below a given threshold, approach 1 is followed to assign an identity.

Pro: Relaxes the computational expense, since computing IoU and feature matching coefficients is much cheaper than DNN inference.

Con: IoU fails miserably in scenes with a lot of occlusion. Since bounding boxes overlap heavily in very crowded scenes, the IoU computation becomes meaningless, and overall tracking accuracy in such situations ends up relying on the accuracy of the re-ID network.
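The IoU matching at the heart of IoUT can be sketched as follows; `best_match` and its threshold are illustrative assumptions, not the thesis implementation.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle (zero width/height if the boxes do not intersect).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def best_match(new_box, database, iou_thresh=0.3):
    """Return the identity whose recent boxes best overlap `new_box`,
    or None if no stored identity clears the IoU threshold.
    database: identity -> list of that person's boxes over the last k frames.
    """
    best_pid, best_score = None, iou_thresh
    for pid, boxes in database.items():
        score = max(iou(new_box, b) for b in boxes)
        if score > best_score:
            best_pid, best_score = pid, score
    return best_pid
```

A `None` result corresponds to the fallback path above: either a new identity is created or the re-ID network is consulted.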

16 of 30

The Problem of Re-Identification in tracking

The tracking algorithms above help the model choose the best identity for a specific object of interest given prior detections, so they cannot work on their own; they operate in combination with re-identification networks. A re-ID network takes two images as input arrays and runs them through a neural network to output its confidence that they show the same person. Two re-ID networks have been modelled and implemented:

  • Siamese Neural Network
  • Deep Sort Network

17 of 30

Siamese Neural Network

  • Input is fed into the network as a series of [2 x H x W x 3] arrays, instead of the traditional two individual images.
  • The use of residual connections, in the form of DenseNet121, greatly increases the learning capacity of this model.
  • The proposed model consists of two DenseNet121 blocks sharing their weights; this weight sharing is where the model learns to distinguish whether two given images show the same person or not.
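A minimal Keras (TF 2.x) sketch of such a shared-weight DenseNet121 pair is given below. The input size, the concatenation head and the untrained weights (weights=None) are illustrative assumptions; the thesis model's exact head may differ.

```python
# Sketch of the shared-weight Siamese re-ID network (TF 2.x / Keras).
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_siamese(h=64, w=64):
    # One DenseNet121 backbone instance; calling it on both crops is what
    # makes the two branches share their weights.
    backbone = tf.keras.applications.DenseNet121(
        include_top=False, weights=None, input_shape=(h, w, 3), pooling="avg")
    pair = layers.Input(shape=(2, h, w, 3))   # the [2, H, W, 3] input pair
    emb_a = backbone(pair[:, 0])              # embedding of the first crop
    emb_b = backbone(pair[:, 1])              # embedding of the second crop
    merged = layers.Concatenate()([emb_a, emb_b])
    same = layers.Dense(1, activation="sigmoid")(merged)  # P(same person)
    return Model(pair, same)
```

Because both crops pass through the same backbone object, every gradient update moves both branches together, which is the weight-sharing property described above.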

18 of 30

Deep Sort Network

  • This model utilises tied convolutions and cross-input neighbourhood differences to learn the differentiating features.
  • The shallow layers of the CNN tend to learn low-level features such as colour mapping, while the deeper layers learn feature differentiation, as depicted in the learnt weight maps.
  • The main difference from the previous network is the presence of cross-input neighbourhood differences, which learn the distinguishing features better than the DenseNet.

19 of 30

Path Prediction

Path prediction is a fairly new field within deep learning, so there is no formal broad classification of the available DL algorithms. In general, however, path prediction algorithms can be grouped into three broad categories:

  • Physics-based approaches: suitable in situations where the effect of other agents or of the static environment, and the agent's motion dynamics, can be modelled by an explicit transition function.
  • Pattern-based approaches: suitable for environments with complex unknown dynamics (e.g. public areas with rich semantics), and able to cope with comparatively large prediction horizons. However, they require ample training data collected in the particular type of location or scenario.
  • Planning-based approaches: work well if the goals that the agents try to accomplish can be explicitly defined and a map of the environment is available. In such cases, planning-based approaches tend to generate better long-term predictions than physics-based techniques, and generalise to new environments better than pattern-based approaches.

I propose two models to solve the path prediction problem, the architectures and algorithms of which, will be shown in the following slides.

20 of 30

Linear Model

The first model proposed, implemented and tested is a linear, non-learning model using a physics-based approach. It does not account for the past motion of the object of interest over a period of time; rather, it computes only the immediate future position of each object of interest (its position in the very next frame) by finding a velocity vector between the two previous frames. The mathematical model is discussed below:

Y_{n-1} is the image center of the object of interest in the (n-1)th frame, with image coordinates (x_{n-1}, y_{n-1}). It moves to position Y_n in the nth frame, with coordinates (x_n, y_n). The position of the object of interest in the (n+1)th frame according to the linear model is computed as follows:

21 of 30

The velocity vector V of the object of interest is:

V = (x_n − x_{n-1}) · FPS · u + (y_n − y_{n-1}) · FPS · v, where

FPS = frames-per-second value of the video

u = unit vector along the horizontal image coordinate axis

v = unit vector along the vertical image coordinate axis

The predicted coordinates of the object of interest in the (n+1)th frame are given by:

(x_{n+1}, y_{n+1}) = (x_n + (u·V)/FPS, y_n + (v·V)/FPS)
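Since the FPS factor cancels, the update above reduces to adding the last inter-frame displacement to the current position; a minimal sketch:

```python
# Linear (physics-based) predictor: constant-velocity extrapolation from the
# last two observed image-plane positions, as in the equations above.

def predict_next(prev, curr, fps=30.0):
    """Given centers (x, y) in frames n-1 and n, predict the center in frame n+1."""
    vx = (curr[0] - prev[0]) * fps   # pixels / second along u
    vy = (curr[1] - prev[1]) * fps   # pixels / second along v
    # Dividing the velocity back by FPS recovers the per-frame displacement.
    return (curr[0] + vx / fps, curr[1] + vy / fps)
```

As the equations imply, the prediction is simply curr + (curr − prev): the object is assumed to repeat its last displacement.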

22 of 30

Bi-Directional Long Short Term Memory (LSTM) Model

A brief overview of LSTMs and Bi-Directional LSTMs:

LSTM networks are a special type of recurrent neural network (RNN) capable of learning long-term dependencies. All RNNs have a chain of recurring modules with a simple structure.

LSTMs, by contrast, consist of recurrent modules containing four layers.

What mainly distinguishes an LSTM from a plain RNN is the presence of a cell state, depicted by the horizontal line running across the recurrent module.

Cell states are responsible for carrying previously learned information (long-term memory) through the rest of the model. The σ layers depict what are called gates: functions with parameters that control the flow of information from one module to another. These parameters, also called gate coefficients, determine how much information from a particular module is carried on to the further layers.
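Written out explicitly, a standard LSTM cell computes its gates and state updates as follows (the standard formulation, with σ the sigmoid gate function and ⊙ element-wise multiplication):

```latex
\begin{aligned}
f_t &= \sigma\!\left(W_f [h_{t-1}, x_t] + b_f\right) && \text{(forget gate)}\\
i_t &= \sigma\!\left(W_i [h_{t-1}, x_t] + b_i\right) && \text{(input gate)}\\
\tilde{c}_t &= \tanh\!\left(W_c [h_{t-1}, x_t] + b_c\right) && \text{(candidate cell state)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)}\\
o_t &= \sigma\!\left(W_o [h_{t-1}, x_t] + b_o\right) && \text{(output gate)}\\
h_t &= o_t \odot \tanh(c_t) && \text{(new hidden state)}
\end{aligned}
```

The gate outputs f_t, i_t and o_t are the gate coefficients mentioned above: values near 0 block information, values near 1 let it flow on to the next module.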

23 of 30

Bi-directional LSTMs are architecturally the same as uni-directional LSTMs, except that they consist of two independent LSTMs running in opposite directions.

This enables a bi-directional LSTM to remember and process information from both past and future states, which helps it model complex non-linear motion.

24 of 30

Model Architecture and working

The architecture of the proposed bi-directional LSTM model for our trajectory prediction problem is shown on the left-hand side. It uses a pattern-based approach.

However, the actual model is not linearly connected as shown in the flowchart; it is connected via layers called time-distributed layers. The time-distributed layers apply every bi-directional LSTM layer to every time slice, instead of having time as a dimension itself, thereby cutting out a fifth dimension and greatly reducing the computational cost.

The input layer receives an array of dimensions (N x 4 x 15), where the parameters mean the following:

● N - number of unique identity tracks detected in a frame

● 4 - leftmost corner value, topmost corner value, width and height of the bounding box of the unique identity track

● 15 - the 15 prior frames in which the unique tracks were detected

The model outputs an array of dimensions (N x 4 x 30): it gives predictions in the same format for 30 future frames from an input of just 15 frames.
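A minimal Keras (TF 2.x) encoder-decoder sketch with this input/output shape is given below; each track is fed as 15 timesteps of 4 box values and decoded into 30 future timesteps. The hidden size and the RepeatVector decoder layout are illustrative assumptions, not the exact thesis architecture.

```python
# Sketch of the Bi-Directional LSTM predictor (TF 2.x / Keras).
from tensorflow.keras import layers, Model

IN_STEPS, OUT_STEPS, BOX_DIMS = 15, 30, 4

def build_predictor(hidden=64):
    past = layers.Input(shape=(IN_STEPS, BOX_DIMS))        # 15 past boxes
    enc = layers.Bidirectional(layers.LSTM(hidden))(past)  # summarise the track
    rep = layers.RepeatVector(OUT_STEPS)(enc)              # one copy per future step
    dec = layers.Bidirectional(
        layers.LSTM(hidden, return_sequences=True))(rep)
    # TimeDistributed applies the same box regressor at every future timestep.
    out = layers.TimeDistributed(layers.Dense(BOX_DIMS))(dec)
    return Model(past, out)
```

The TimeDistributed wrapper is what lets one small Dense regressor serve all 30 output timesteps instead of adding another dimension to the model.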

25 of 30

Evaluation Metrics

  • In the case of the object tracking model, the Multi-Object Tracking Accuracy (MOTA) metric has been used:

MOTA = 1 − ( Σ_t (FN_t + FP_t + IDSW_t) ) / ( Σ_t GT_t )

where t is the frame index and the other parameters are: FN - false negatives, FP - false positives, IDSW - the number of mismatches/identity switches, and GT - the number of ground truth objects, each summed over all frames. The higher the score, the better the model.
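The MOTA computation can be sketched directly from per-frame counts (a minimal illustration):

```python
def mota(fn, fp, idsw, gt):
    """Multi-Object Tracking Accuracy from per-frame counts.

    Each argument is a sequence with one count per frame: false negatives,
    false positives, identity switches, and ground-truth objects.
    """
    errors = sum(fn) + sum(fp) + sum(idsw)   # Σ_t (FN_t + FP_t + IDSW_t)
    return 1.0 - errors / sum(gt)            # 1 − errors / Σ_t GT_t
```

A perfect tracker scores 1.0; every missed, spurious or mis-identified object pushes the score down.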

  • In the case of the path prediction model, the root mean squared error (RMSE) has been used:

The object of interest is at position Y_n at time T = t of the video being processed, and moves to position Y_{n+1} at time T = t+1. Our model predicts that it moves to position Y'_{n+1} at time T = t+1. The error of our model for this single instance between frames n and n+1 is ‖Y_{n+1} − Y'_{n+1}‖. Computed over the whole duration of the video:

RMSE = √( (1/N) Σ_n ‖Y_{n+1} − Y'_{n+1}‖² )

where N is the total number of frames in the processed video. The lower the score, the better the model.
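The RMSE computation can be sketched as follows (a minimal illustration over per-frame (x, y) positions):

```python
import math

def rmse(predicted, actual):
    """Root mean squared displacement error between predicted and ground-truth
    positions, each given as a list of (x, y) per frame."""
    sq = [(px - ax) ** 2 + (py - ay) ** 2
          for (px, py), (ax, ay) in zip(predicted, actual)]
    return math.sqrt(sum(sq) / len(sq))
```

Each term is the squared Euclidean distance between the predicted and actual position in one frame, averaged over all frames before taking the square root.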

26 of 30

Results in terms of Evaluation Metrics

OBJECT TRACKING (evaluated on the MOT16 benchmark dataset):

  • State-of-the-art technique - MOTA score: 68.7
  • Siamese Neural Network - MOTA score: 22.7
  • Deep Sort Network - MOTA score: 44.2

We can see that the Siamese neural network performs poorly compared to the state-of-the-art technique. This can be attributed to the rather simple nature of the model, which prevents it from learning higher-level distinguishing features between targets of interest. It suffers from a very high number of identity switches (mismatches), inflating the IDSW term of the MOTA score, because it depends entirely on the features of the input images and nothing more.

27 of 30

PATH PREDICTION (evaluated on the ETH BIWI Walking Pedestrians dataset):

  • State-of-the-art technique - RMSE score: 0.39
  • Linear Model - RMSE score: 10.782
  • Bi-Directional LSTM Model - RMSE score: 0.784

The linear model gives a very mediocre score compared with the state-of-the-art technique on this dataset, which achieves a much lower RMSE of 0.39. This is due to the linear model's inability to learn or infer non-linear motion or interactions between objects of interest (while almost all real-life motion is highly non-linear); it can only make linear path predictions by computing object velocities.

28 of 30

The bi-directional LSTM model is a major improvement over the linear model, but it still has limitations, as its RMSE score remains worse than that of the state-of-the-art techniques. This is due to the following reasons:

1. The accuracy of the base tracking system: the Deep Sort network was used as the tracking system for this model. Since its accuracy is lower than that of the state-of-the-art method, errors in path prediction increase as well.

2. Social interactions are accounted for only implicitly: though we supply all the detections in a frame as input, the model does not have an explicit architecture for modelling the interactions between different objects of interest.

The success of the model, however, lies in modelling complex non-linear motion over a period longer than the input period (it gives 30 frames as output from just 15 input frames).

29 of 30

Conclusions and Future Scope

The path prediction pipeline demonstrated above provides a sophisticated use of bi-directional LSTMs, which are conventionally used for multi-object tracking problems, to solve the path prediction problem. It does a good job of predicting 30 frames into the future by taking the past 15 frames of the object of interest into consideration. The bi-directional LSTM model improves upon the conventionally used uni-directional LSTMs by taking both past and future trajectories into account while training.

However, the model still struggles in scenes containing heavy occlusion. This can be attributed partly to the inefficiency of the object tracking network used, and partly to the model's inability to explicitly learn inter-object interactions. Real-life crossroad scenes are often heavily occluded because they are heavily crowded. Moreover, in the context of self-driving cars, predicting future motion only 30 frames ahead (just 1 second for a 30 fps video) is of limited use.

Hence, in the future one can improve this model by:

  • Using a better base object tracking model.
  • Incorporating social LSTMs alongside the bi-directional LSTMs, to explicitly model the interactions between the objects of interest.
  • Increasing the model's capacity so that it predicts further into the future than 30 frames, which would be more meaningful in the context of self-driving cars.
  • Utilising the non-learning tracking model mentioned in the text, which makes use of multi-camera feeds to improve tracking accuracy.

30 of 30

THANK YOU

HAVE A NICE DAY