Path Prediction in Crowded Environments using Computer Vision and DL
30/06/2020
Shyamsundar P I
BITS Pilani
Task at hand
Trajectory prediction of dynamic objects is a widely studied topic in artificial intelligence. With the introduction of technologies such as self-driving cars, efficient path prediction models that let these automated machines avoid collisions with pedestrians and other entities have been drawing researchers' attention.
This thesis proposes an end-to-end deep learning model that learns and predicts the motion patterns of pedestrians, aided by object detection and tracking models.
It also discusses the results by comparing them with the standard benchmarks available, and outlines the future scope of this model.
Problem Setup
[Figure: camera views labelled Cam 1 and Cam 2]
Standard workflow involved in solving our problem
Object Detection
Given video feeds from the installed cameras, it is essential to detect humans in each frame and draw bounding boxes around them, thereby localising their positions. Multiple cameras are used.
Object Tracking
Since multiple people appear in the video, it is important to identify each person as the same person seen in the previous frame (and in the other cameras' footage), so that the track they have traversed so far is known.
Path Prediction
Once a person's previous tracks have been obtained using object tracking methods, they can be fed into machine learning / probabilistic learning algorithms, which learn the tracks as well as other features from the footage in order to predict the path the person will take at a future time instant.
“Here’s the catch”
Many CV applications can be solved easily with simple mathematical models, but that's not always the case.
A simple example is object detection: given an image of a dog, the computer can recognise it as a dog only if it has prior knowledge of what a dog looks like. Similar to the cognitive skills of the human brain? This is where the problem of ‘Learning’ comes in, referred to as Machine Learning in the context of computers, which in theory consists of complex statistical algorithms.
This prompted us to decide exactly which methods to use, and why.
Why deep learning for our problem?
All the state-of-the-art techniques for our problem are based on some form of DL architecture.
The reason is that all three steps in our model are highly complex and require a large number of parameters to capture non-linearities and process the results properly.
DL also simplifies the task at hand considerably.
For example, object detection with traditional CV/ML algorithms requires a prior feature extraction step. This step is highly complex, since images containing different classes of objects may have different distinctive features, which need to be handpicked. This requires expertise. DL makes it possible to bypass this step entirely by automatically choosing and learning the prominent features.
Work Environment
Object Detection
Deep-learning based object-detection algorithms.
Object Detection Model and Architecture
MODEL 1
MODEL 2
A word on object detection with masks model
This model can be very useful if one intends to predict pedestrians' paths from their body pose (e.g. face orientation and feet orientation), since the masks make it possible to localise the orientation of different parts of the human body. This is one possible future use of the masking model in path prediction.
Object Tracking using Deep Learning
In the model pipeline, the object tracking model takes its input from the object detection model in the form of object detections, their bounding box coordinates and their prediction confidences over all the frames of the video. There are 3 major approaches to object tracking using DL, as follows:
A very detailed study of the above methods, and many more, is given in the following review paper, which summarises and reviews the findings of around 174 papers: Deep learning in video multi-object tracking: A survey
Object Tracking Algorithm
Two short term memory based algorithms in combination with a re-identification framework have been used for object tracking:
Pro: Can be extremely accurate in most scenarios depending on the accuracy of the re-ID network used.
Con: Extremely computationally expensive. Inference with DNNs is costly. Every time a person is detected in a new frame or camera feed, their image has to be compared with the images of all N people in the database, so every detection requires N inferences. In a traffic scene with many people, this approach is extremely inefficient.
IoU = (Intersection Area)/(Union Area)
In this approach, there are two databases. Database 1 stores the box coordinates of the detected people for the past k (user-specified) frames. Every time a new person is detected, their box coordinates are taken and IoU is computed against all k frames of each of the N people. The person x, out of the N people, who has the frame with the highest IoU with the new detection is assigned to it. If none of these IoUs is above a threshold value, the new detection is stored in the database as a new person y. To cross-check the IoU result, a feature matching transform is applied between the images in the two compared boxes with the highest IoU; its score should be high, on the assumption that the object cannot have moved much over at most k frames. If the score is below a given threshold, approach 1 is followed to assign an identity.
Pro: Reduces the computational expense, since computing IoU and feature matching coefficients is much cheaper.
Con: IoU fails miserably in scenes with a lot of occlusion. Since bounding boxes overlap heavily in very crowded scenes, the computed IoU becomes meaningless. In such situations, overall tracking accuracy ends up depending on the accuracy of the re-ID network.
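The IoU matching step described above can be sketched as follows. This is a minimal illustration, not the implementation used in this work: the (left, top, width, height) box layout is borrowed from the model input described later, the names `iou` and `assign_identity` are hypothetical, and the threshold of 0.3 is an arbitrary placeholder. The feature matching cross-check is left out.

```python
from typing import Dict, List, Optional, Tuple

# Box layout assumed here: (left, top, width, height) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ax1, ay1, aw, ah = a
    bx1, by1, bw, bh = b
    ax2, ay2 = ax1 + aw, ay1 + ah
    bx2, by2 = bx1 + bw, by1 + bh
    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def assign_identity(new_box: Box,
                    history: Dict[int, List[Box]],
                    threshold: float = 0.3) -> Optional[int]:
    """Return the stored identity whose boxes from the last k frames overlap
    new_box the most, or None if no overlap reaches the (hypothetical) threshold."""
    best_id, best_iou = None, 0.0
    for person_id, boxes in history.items():
        for box in boxes:
            score = iou(new_box, box)
            if score > best_iou:
                best_id, best_iou = person_id, score
    return best_id if best_iou >= threshold else None
```

A return value of `None` corresponds to the case where the detection is stored as a new person y.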
The Problem of Re-Identification in tracking
The tracking algorithms above help a model choose the best identity for a specific object of interest given prior detections, but they cannot work on their own; they work in combination with re-identification (re-ID) networks. A re-ID network takes two images as input arrays and runs them through a neural network to estimate the confidence that they show the same person. Two re-ID networks have been modeled and implemented:
Siamese Neural Network
Deep Sort Network
Path Prediction
Path prediction is a fairly new field for deep learning, so there is no formal classification of the DL algorithms available. In general, however, path prediction algorithms can be classified into three broad categories:
I propose two models to solve the path prediction problem; their architectures and algorithms are shown in the following slides.
Linear Model
The first model proposed, built and tested was a linear, non-learning model that uses a physics-based approach. It does not account for the object's past motion over a period of time. Rather, it computes only the immediate future position of every object of interest (its position in the very next frame) from a velocity vector between the two most recent frames. The mathematical model is described below:
$Y_{n-1}$ is the image center of the object of interest in the $(n-1)$th frame, with image coordinates $(x_{n-1}, y_{n-1})$. It moves to a position $Y_n$ in the $n$th frame, with coordinates $(x_n, y_n)$. The position of the object of interest in the $(n+1)$th frame according to the linear model is computed as follows:
The velocity vector $V$ of the object of interest is:
$V = (x_n - x_{n-1}) \cdot \mathrm{FPS} \; u + (y_n - y_{n-1}) \cdot \mathrm{FPS} \; v$, where
FPS = frames-per-second value of the video
$u$ = unit vector along the horizontal image coordinate axis
$v$ = unit vector along the vertical image coordinate axis
The predicted coordinates of the object of interest in the $(n+1)$th frame are given by:
$(x_{n+1}, y_{n+1}) = \left(x_n + \frac{u \cdot V}{\mathrm{FPS}},\; y_n + \frac{v \cdot V}{\mathrm{FPS}}\right)$
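The linear model above translates almost directly into code. The sketch below is illustrative only; the function name `predict_next` and the default of 30 FPS are assumptions, not part of the original work.

```python
from typing import Tuple

Point = Tuple[float, float]  # image coordinates (x, y) of a box center


def predict_next(prev: Point, curr: Point, fps: float = 30.0) -> Point:
    """Predict the center in frame n+1 from the centers in frames n-1 and n."""
    # Velocity components in pixels per second along the u and v axes.
    vx = (curr[0] - prev[0]) * fps
    vy = (curr[1] - prev[1]) * fps
    # Advance by one frame, i.e. 1/fps seconds.
    return (curr[0] + vx / fps, curr[1] + vy / fps)
```

Note that FPS cancels out for a one-frame step, so the prediction reduces to adding the last displacement to the current position.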
Bi-Directional Long Short Term Memory (LSTM) Model
A brief overview of LSTMs and Bi-Directional LSTMs:
LSTM networks are a special type of Recurrent Neural Network (RNN) capable of learning long-term dependencies. All RNNs consist of a chain of repeating modules; in a standard RNN, each module has a simple structure.
In LSTMs, by contrast, each repeating module consists of 4 layers.
What mainly distinguishes an LSTM from a standard RNN is the presence of the cell state, depicted by the horizontal line running across the recurrent module.
The cell state carries previously learned information (long-term memory) through the rest of the model. The σ layers depict what are called gates: functions with parameters that control the flow of information from one module to the next. These parameters are also called the gate coefficients, and they determine how much information from a given module is carried on to the following ones.
Bi-Directional LSTMs have essentially the same architecture as Uni-Directional LSTMs, except that they have two independent LSTMs running in opposite directions.
This enables Bi-Directional LSTMs to remember and process information from both past and future states, which helps them model complex non-linear motion.
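For reference, the four layers and the cell state update can be written out in the standard textbook LSTM formulation. The symbols below ($W$, $b$, $h_t$, $C_t$, $x_t$) are the usual ones and are not taken from the slides; combining the two directions by concatenation is a common choice and is assumed here.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f [h_{t-1}, x_t] + b_f\right) && \text{forget gate} \\
i_t &= \sigma\left(W_i [h_{t-1}, x_t] + b_i\right) && \text{input gate} \\
\tilde{C}_t &= \tanh\left(W_C [h_{t-1}, x_t] + b_C\right) && \text{candidate cell state} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{cell state update} \\
o_t &= \sigma\left(W_o [h_{t-1}, x_t] + b_o\right) && \text{output gate} \\
h_t &= o_t \odot \tanh(C_t) && \text{hidden state} \\
h_t^{\mathrm{bi}} &= \left[\overrightarrow{h}_t \,;\, \overleftarrow{h}_t\right] && \text{bi-directional output (concatenation assumed)}
\end{aligned}
```

The three σ gates and the tanh layer are the 4 layers of the recurrent module, and $C_t$ is the cell state.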
Model Architecture and working
The architecture of the proposed Bi-Directional LSTM model for our trajectory prediction problem is shown on the left. It uses a pattern-based approach.
However, the actual model is not linearly connected as shown in the flowchart. It is connected via time-distributed layers, which apply every bi-directional LSTM layer to each time slice instead of treating time as a dimension of its own. This removes the need for a 5th dimension and greatly reduces the computational cost.
The input layer receives an array of dimensions (N x 4 x 15), where each parameter means the following:
● N - Number of unique identity tracks detected in a frame
● 4 - x-coordinate of the left edge, y-coordinate of the top edge, width and height of the bounding box of the unique identity track.
● 15 - The 15 prior frames in which the unique track was detected.
The model outputs an array of dimensions (N x 4 x 30): from an input of 15 frames, it gives predictions for the next 30 frames in the same layout as above.
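To make the input and output shapes concrete, the sketch below slices one identity track into (4 x 15) input and (4 x 30) target windows. How the training samples were actually built is not stated in the slides, so the helper `make_windows` and its sliding-window scheme are assumptions for illustration only.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (left, top, width, height)
Window = List[List[float]]               # 4 rows, one per box parameter


def make_windows(track: List[Box],
                 n_in: int = 15,
                 n_out: int = 30) -> List[Tuple[Window, Window]]:
    """Slide over one track and return (input, target) pairs with shapes
    (4 x n_in) and (4 x n_out)."""
    samples = []
    for start in range(len(track) - n_in - n_out + 1):
        past = track[start:start + n_in]
        future = track[start + n_in:start + n_in + n_out]
        # Transpose from frames x 4 to 4 x frames.
        x = [list(row) for row in zip(*past)]
        y = [list(row) for row in zip(*future)]
        samples.append((x, y))
    return samples
```

Stacking the windows of the N tracks present in a frame gives arrays of shape (N x 4 x 15) and (N x 4 x 30). Tracks shorter than 45 frames yield no samples under this scheme.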
Evaluation Metrics
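Tracking is evaluated with the Multiple Object Tracking Accuracy (MOTA) score:
$$\mathrm{MOTA} = 1 - \frac{\sum_t \left(\mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t\right)}{\sum_t \mathrm{GT}_t}$$
where t is the frame index and the other parameters are as follows:
FN - False Negatives, FP - False Positives, IDSW - The number of mismatches/identity switches, GT - The number of ground truth objects, each summed over all frames. The higher the score, the better the model.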
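The tracking score described above can be computed from per-frame counts as in the sketch below. It assumes the standard MOTA definition, 1 - (sum of FN + FP + IDSW) / (sum of GT); the function name and list-based interface are illustrative.

```python
from typing import Sequence


def mota(fn: Sequence[int], fp: Sequence[int],
         idsw: Sequence[int], gt: Sequence[int]) -> float:
    """MOTA from per-frame counts; each sequence has one entry per frame t."""
    total_gt = sum(gt)
    if total_gt == 0:
        raise ValueError("MOTA is undefined when there are no ground truth objects")
    errors = sum(fn) + sum(fp) + sum(idsw)
    return 1.0 - errors / total_gt
```

Because the error count can exceed the number of ground truth objects, the score can be negative for a poor tracker.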
The object of interest is at a position $Y_n$ at time T = t of the video being processed. It moves to a position $Y_{n+1}$ at time T = t+1. Our model predicts its motion and infers that it moves to a position $Y'_{n+1}$ at time T = t+1. The root mean squared error of our model for this particular instance, between the $n$th and $(n+1)$th frames, is given above. When computed over the whole duration of the video, RMSE is given by:
where N is the total number of frames in the video processed. The lower the score, the better the model.
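A minimal sketch of the video-level RMSE follows. It assumes the usual definition, the square root of the mean squared Euclidean distance between predicted and actual positions over N frames; the exact form used on the original slide is not reproduced in the text, so this is an assumption.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def rmse(predicted: Sequence[Point], actual: Sequence[Point]) -> float:
    """Root mean squared error between predicted and actual positions."""
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(px - ax) ** 2 + (py - ay) ** 2
               for (px, py), (ax, ay) in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))
```

The result is in the same units as the coordinates, so scores are only comparable when predictions and ground truth share a coordinate system.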
Results in terms of Evaluation Metrics
OBJECT TRACKING (done on MOT16 benchmark dataset):
We can see that the Siamese Neural Network performs poorly compared to the state-of-the-art technique. The likely reason is that the rather simple nature of the model does not let it learn higher-level distinguishing features between targets of interest. It suffers from a very high number of identity switches (mismatches), which increases the IDSW term in the MOTA score, because it depends entirely on the features of the input images and nothing more.
PATH PREDICTION (done on ETH BIWI Walking Pedestrians dataset ):
The linear model gives a mediocre score compared to the state-of-the-art technique on this dataset, which has a much lower RMSE of 0.39. This is due to the inability of the linear model to learn or infer any non-linear motion or interactions between objects of interest (while almost all motion in real life is highly non-linear). The model can only make linear path predictions by computing the velocities of objects of interest.
The Bi-Directional LSTM model is a substantial improvement over the linear model but still has limitations, since its RMSE is worse than that of the state-of-the-art techniques. This is due to the following reasons:
1. The accuracy of the base tracking system used - the Deep Sort Network was used as the tracking system for this model. Since it is not as accurate as the state-of-the-art method, it increases the errors in path prediction as well.
2. It accounts for social interactions only implicitly. Although we supply all the detections in a frame as input, the model has no explicit architecture for modelling interactions between different objects of interest.
However, the success of the model lies in modelling complex non-linear motion over a time period longer than the input period (it outputs 30 frames from an input of just 15 frames).
Conclusions and Future Scope
The path prediction pipeline demonstrated above applies Bi-Directional LSTMs, which are conventionally used for Multi-Object Tracking problems, to the path prediction problem. It does a good job of predicting 30 frames into the future from the past 15 frames of the object of interest. The bi-directional LSTM model improves upon the conventionally used uni-directional LSTMs by taking both future and past trajectories into account during training.
However, the model still struggles in scenes with heavy occlusion. This can be attributed partly to the limitations of the object tracking network used and partly to the inability of the model to explicitly learn inter-object interactions. Real-life crossroad scenes are often highly occluded because they are heavily crowded. Also, in the context of self-driving cars, predicting future motion only up to 30 frames ahead (just 1 second for a 30 fps video) is of limited use.
Hence, in the future one can improve this model by:
THANK YOU
HAVE A NICE DAY