1 of 24

TechTalks: Person Re-Identification

By Manideep Kolla

2 of 24

Introduction

The Questions:

How can we associate people across unobserved regions?

OR

How can we identify a person at different times recorded on multiple cameras at different angles and places?

3 of 24

Person Re-Identification

Person Re-identification is the task of re-identifying a target person across multiple non-overlapping cameras.

4 of 24

Key Challenges in Person Re-Identification

  • Viewpoint change: the camera viewpoint differs from camera to camera.
  • Illumination variation: the lighting changes across cameras and scenes.
  • Partial occlusion: objects block part of the person in the image view.

5 of 24

Current Approaches

Feature map Similarity Estimation - A Siamese network

  • A Siamese network of a few convolution layers with shared weights takes the reference image and the checking image as inputs.
  • One set of feature maps is generated from each input.
  • A similarity estimation layer, such as a cross-correlation layer, Euclidean distance, or neighbourhood difference layer, is applied between the two sets of feature maps.
  • After a couple of further convolution layers and a flatten layer, a softmax layer outputs a binary prediction: 1 for the same person, 0 for a different person. A minimal sketch follows this list.
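Below is a minimal sketch of this pipeline, assuming Keras and illustrative layer sizes (two 5x5 convolution + 2x2 pooling stages on a 160x60 input, which happen to yield the (37, 12, 25) feature maps mentioned in the challenges slide); a squared difference stands in for the similarity estimation layer, with cross-correlation or neighbourhood difference as drop-in alternatives.

from tensorflow.keras import layers, Model

def feature_extractor(input_shape=(160, 60, 3)):
    """Shared convolutional trunk applied to both input images."""
    inp = layers.Input(shape=input_shape)
    x = layers.Conv2D(20, 5, activation="relu")(inp)   # illustrative sizes
    x = layers.MaxPooling2D(2)(x)
    x = layers.Conv2D(25, 5, activation="relu")(x)
    x = layers.MaxPooling2D(2)(x)                      # -> (37, 12, 25)
    return Model(inp, x, name="shared_trunk")

trunk = feature_extractor()                            # weights shared below
ref = layers.Input(shape=(160, 60, 3), name="reference")
chk = layers.Input(shape=(160, 60, 3), name="checking")
f_ref, f_chk = trunk(ref), trunk(chk)

# Similarity estimation layer: an elementwise squared difference here.
diff = layers.Subtract()([f_ref, f_chk])
sim = layers.Multiply()([diff, diff])

# A couple of convolutions, a flatten layer, then a binary softmax.
x = layers.Conv2D(25, 3, activation="relu")(sim)
x = layers.Flatten()(x)
x = layers.Dense(500, activation="relu")(x)
out = layers.Dense(2, activation="softmax")(x)         # 1 = same, 0 = different

model = Model([ref, chk], out)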

6 of 24

Current Approaches

Feature map Similarity Estimation - A Siamese network

7 of 24

Current Approaches

Re-Identification through Pose Estimation

  • Heat maps of the persons are generated based on their pose.
  • Fourteen keypoints, representing different parts of the body, are considered for every person.
  • The heat maps (or feature maps) come from training a pose estimation network and are then incorporated into a second network that performs the re-identification task.
  • The network outputs an n-dimensional vector or a feature map for each of the reference and checking images, and a distance metric is used to compare them. A sketch of one possible wiring follows this list.
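A hedged sketch of one possible wiring, assuming the simplest fusion (stacking the 14 keypoint heat maps with the image as extra input channels) and an illustrative embedding size; Spindle Net and the cited papers use considerably more elaborate fusion schemes.

import tensorflow as tf
from tensorflow.keras import layers, Model

NUM_KEYPOINTS = 14  # one heat map per body keypoint, as on this slide

def reid_embedder(input_shape=(160, 60, 3 + NUM_KEYPOINTS)):
    """Re-id trunk over the image stacked with its pose heat maps."""
    inp = layers.Input(shape=input_shape)
    x = layers.Conv2D(32, 3, activation="relu")(inp)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Conv2D(64, 3, activation="relu")(x)
    x = layers.GlobalAveragePooling2D()(x)
    emb = layers.Dense(128)(x)      # illustrative n-dimensional embedding
    return Model(inp, emb)

embedder = reid_embedder()

def distance(img_a, heat_a, img_b, heat_b):
    """Euclidean distance between embeddings; smaller = more similar."""
    ea = embedder(tf.concat([img_a, heat_a], axis=-1))
    eb = embedder(tf.concat([img_b, heat_b], axis=-1))
    return tf.norm(ea - eb, axis=-1)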

8 of 24

Current Approaches

Re-Identification through Pose Estimation

Spindle Net

9 of 24

Current Approaches

Re-Identification through Pose Estimation

10 of 24

What I have been doing...

11 of 24

The Architecture

  • Used a Siamese network with shared weights.
  • The output feature maps are passed into a similarity estimation layer, in this case a cross-input neighbourhood difference layer (sketched below).
  • Three further convolution layers then lead to a binary softmax prediction.
  • The motivation behind taking differences within a neighbourhood is robustness to positional differences between the corresponding features of the two input images.
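A small numpy sketch of that neighbourhood difference; the 5x5 window follows reference 1 ("An Improved Deep Learning Architecture for Person Re-Identification"), and operating on a single unbatched pair of maps is a simplification.

import numpy as np

def neighborhood_difference(f, g, k=5):
    """f, g: (H, W, C) feature maps from the two branches.
    Returns (H, W, k, k, C): f at each location minus the k x k
    neighbourhood of g around it, so small spatial misalignments
    between the two images are tolerated."""
    h, w, c = f.shape
    pad = k // 2
    g_pad = np.pad(g, ((pad, pad), (pad, pad), (0, 0)))
    out = np.empty((h, w, k, k, c), dtype=f.dtype)
    for y in range(h):
        for x in range(w):
            # broadcast f's single pixel against g's neighbourhood
            out[y, x] = f[y, x] - g_pad[y:y + k, x:x + k]
    return out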

Training Scheme:

  • The network is trained on pairs of images, selected so that for every positive pair a random negative pair is also chosen (see the sketch below).
  • The Adam optimizer is used with the sparse categorical cross-entropy loss.
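A sketch of this pair selection and compile step, assuming a dict images_by_id (identity -> list of image arrays; a hypothetical structure) and the two-input softmax model from the earlier sketch.

import random

def make_pairs(images_by_id):
    """Yield ((img_a, img_b), label) with label 1 = same, 0 = different."""
    ids = list(images_by_id)
    for pid, imgs in images_by_id.items():
        for a, b in zip(imgs, imgs[1:]):            # positive pair
            yield (a, b), 1
            neg_id = random.choice([q for q in ids if q != pid])
            yield (a, random.choice(images_by_id[neg_id])), 0  # matched negative

# Optimizer and loss as on this slide.
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])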

12 of 24

The Architecture

13 of 24

The Results

14 of 24

The Results

Confusion Matrices (label 1 = same person, 0 = different person)

Train:
             0        1
    0    31165      630
    1      177    31618

Val:
             0        1
    0     6069      172
    1      620     5621

Test:
             0        1
    0     4530     1296
    1     1277     4549

Phase        Dataset / Identities    Accuracy (%)
Training     CUHK03 / 742            98.6
Validation   CUHK03 / 100            93.7
Testing      CUHK01 / 971            78
Testing      Market-1501 / 750       80.9
Testing      FUJITSU Data / 20       81.3

Testing on CUHK01

15 of 24

The Results - FUJITSU Data

16 of 24

The Results

Loss Plot

Accuracy Plot

17 of 24

The Results - True-Positives

18 of 24

The Results - True-Negatives

19 of 24

The Results - False-Positives

20 of 24

The Results - False-negatives

21 of 24

Challenges I have faced

  • I wasn't able to implement the normalized cross-correlation layer efficiently in plain Python rather than CUDA. The plain-Python implementation took about 0.5 seconds to forward-pass a single pair of feature maps of shape (37, 12, 25), which makes training the model impossible in practice (see the sketch after this list).
  • When using a very deep convolutional network instead of a similarity estimation layer, the learning rate has to be very small for the network to learn; otherwise it predicts every pair as similar.
  • There was a lot of overfitting initially, so I introduced L2 regularizers and dropout in the network.
  • The model I have built is good enough for now, but introducing features extracted from a pose estimation network would considerably improve the results, as discussed earlier. Keeping the project duration in mind, I haven't tried this.
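For reference, a minimal numpy sketch of the per-location normalized cross-correlation such a layer computes, heavily simplified relative to the full layer in reference 2 ("Deep Neural Networks with Inexact Matching for Person Re-Identification"); the Python-level double loop below is exactly the kind of thing that stays slow without a CUDA kernel.

import numpy as np

def ncc_map(f, g, k=5, eps=1e-8):
    """f, g: (H, W, C) feature maps. For each location, the normalized
    cross-correlation between the k x k patch of f and the co-located
    patch of g, per channel. Returns (H, W, C)."""
    h, w, c = f.shape
    pad = k // 2
    fp = np.pad(f, ((pad, pad), (pad, pad), (0, 0)))
    gp = np.pad(g, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros((h, w, c))
    for y in range(h):
        for x in range(w):
            pf = fp[y:y + k, x:x + k] - fp[y:y + k, x:x + k].mean(axis=(0, 1))
            pg = gp[y:y + k, x:x + k] - gp[y:y + k, x:x + k].mean(axis=(0, 1))
            denom = pf.std(axis=(0, 1)) * pg.std(axis=(0, 1)) + eps
            out[y, x] = (pf * pg).mean(axis=(0, 1)) / denom
    return out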

22 of 24

Next steps

  • I have tried hard negative mining, but the results were worse than with normal training; this needs further work (a sketch of one mining strategy follows this list).
  • The model I have built can decide whether a pair of images shows the same person or two different people, but it cannot directly track and re-identify a person across cameras as shown earlier. The network should be remodeled to do live person tracking and re-identification.
  • Pose estimation is not used in the network I have built, but adding it should improve the current results: with pose information the network can learn about misaligned and moving body parts.
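One common mining strategy, sketched under the assumption of the two-input softmax model used above (model and candidates are illustrative names; the exact variant I tried is not shown): for each anchor image, pick the negative the current model scores most confidently as "same person".

import numpy as np

def hardest_negative(model, anchor, candidates):
    """Return the candidate image the model most confuses with the anchor."""
    anchors = np.repeat(anchor[None], len(candidates), axis=0)
    # probability of class 1 ("same person") for each (anchor, candidate) pair
    p_same = model.predict([anchors, np.stack(candidates)], verbose=0)[:, 1]
    return candidates[int(np.argmax(p_same))]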

23 of 24

References

  1. An Improved Deep Learning Architecture for Person Re-Identification
  2. Deep Neural Networks with Inexact Matching for Person Re-Identification
  3. Spindle Net: Person Re-identification with Human Body Region Guided Feature Decomposition and Fusion
  4. Attention-Aware Compositional Network for Person Re-identification
  5. Person Re-identification Datasets (Information about all the datasets)

24 of 24

Thank you.