FERM: A FEature-space Representation Measure for Improved Model Evaluation
Yeu-Shin Fu, Wenbo Ge, Jo Plested
Jo Plested, Associate Lecturer
University of New South Wales, Australia
A Well Formed Feature Space
A 1,500-dimensional feature space is reduced with t-SNE to two normalised dimensions so that it can be visualised.
Each class is numbered and allocated a different colour. Class centroids are labelled.
Inception v4 is pretrained on ImageNet, then fine-tuned on a combination of the Stanford Cars [1] and FGVC Aircraft [2] datasets.
[1] S. Maji, J. Kannala, E. Rahtu, M. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft. Technical report, Toyota Technological Institute at Chicago, 2013.
[2] Y. Cui, F. Zhou, Y. Lin, and S. Belongie. Fine-grained categorization and dataset bootstrapping using deep metric learning with humans in the loop. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1153–1162, 2016.
A well formed feature space has high similarity (tight clustering) within a class (intra-class) and low similarity (sparse clustering) between classes (inter-class).
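The intra-class versus inter-class contrast can be illustrated with a small sketch. It is not the FERM measure itself: the choice of cosine similarity, the function names, and the toy feature vectors are all assumptions made for illustration.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def intra_inter_similarity(features, labels):
    """Mean pairwise similarity within classes and between classes."""
    intra, inter = [], []
    for i, j in combinations(range(len(features)), 2):
        sim = cosine(features[i], features[j])
        if labels[i] == labels[j]:
            intra.append(sim)
        else:
            inter.append(sim)
    return sum(intra) / len(intra), sum(inter) / len(inter)


# Hypothetical 2-D features for two classes (illustrative values only).
features = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
labels = [0, 0, 1, 1]

intra, inter = intra_inter_similarity(features, labels)
print(f"intra-class: {intra:.3f}, inter-class: {inter:.3f}")
```

In a well formed feature space the first number should be clearly higher than the second, as it is for these toy vectors.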
Feature-space Representation Measures
Why Should We Care?
Why Should We Care? – Overfitting poor features
Overfitting when transferring large models to small target datasets. For smaller target datasets, the difference between training loss and validation loss increases as the number of layers transferred increases.
| Layers | Loss | 10 training examples per class | 25 training examples per class | 50 training examples per class | Average training and validation loss difference |
| 4 | Training | 1.55 | 1.22 | 1.35 | 2.91 |
| | Validation | 5.07 | 4.27 | 3.51 | |
| 5 | Training | 0.71 | 0.82 | 1.06 | 3.63 |
| | Validation | 5.38 | 4.43 | 3.67 | |
| 6 | Training | 0.49 | 0.60 | 0.83 | 3.90 |
| | Validation | 5.34 | 4.48 | 3.81 | |
| 7 | Training | 0.39 | 0.48 | 0.71 | 4.72 |
| | Validation | 6.26 | 5.18 | 4.29 | |
http://users.cecs.anu.edu.au/~Tom.Gedeon/pdfs/An%20Analysis%20of%20the%20Interaction%20Between%20Transfer%20Learning%20Protocols%20in%20Deep%20Neural%20Networks.pdf
Transferring from pretraining on one set of 500 ImageNet classes to the other 500 with limited training examples per class in the target dataset
As more layers are transferred, the difference between training and validation loss becomes substantially larger: there is not enough data to change the features much during fine-tuning.
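The averaged column in the table can be reproduced from the per-size losses. The sketch below assumes that the average is the mean of (validation loss minus training loss) over the three dataset sizes; the variable and function names are illustrative.

```python
# Training and validation losses from the table, ordered by
# training examples per class: 10, 25, 50.
losses = {
    4: {"training": [1.55, 1.22, 1.35], "validation": [5.07, 4.27, 3.51]},
    5: {"training": [0.71, 0.82, 1.06], "validation": [5.38, 4.43, 3.67]},
    6: {"training": [0.49, 0.60, 0.83], "validation": [5.34, 4.48, 3.81]},
    7: {"training": [0.39, 0.48, 0.71], "validation": [6.26, 5.18, 4.29]},
}


def mean_gap(training, validation):
    """Mean of (validation - training) across the dataset sizes."""
    gaps = [v - t for t, v in zip(training, validation)]
    return sum(gaps) / len(gaps)


avg_gap = {layers: round(mean_gap(**row), 2) for layers, row in losses.items()}
for layers, gap in avg_gap.items():
    print(f"{layers} layers transferred: average gap {gap:.2f}")
```

Under that assumption the computed gaps match the table (2.91, 3.63, 3.90, 4.72) and grow with each additional transferred layer.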
FERM: FEature-space Representation Measures
FERM: Evaluating fixed features
Target task - number of classes | FF/FT | FERM 1 | FERM 2 | LEEP | OTCE | H-score | Silhouette | Davies Bouldin | Calinski Harabasz | Dunn | RS | Point Biserial | |
Aircraft - 100 | 0.663 | 1.11 | 1.14 | -4.59 | 0.26 | 53.09 | -0.12 | 6.79 | 20.71 | 0.08 | 1.61 | 243.95 | 0.1 |
Cars - 196 | 0.677 | 1.22 | 1.22 | -5.25 | 0.32 | 160.41 | -0.10 | 4.29 | 5.75 | 0.21 | 1.61 | 1146.63 | 0.07 |
DTD - 47 | 0.946 | 1.33 | 1.36 | -3.83 | 0.28 | 41.77 | -0.04 | 4.65 | 9.07 | 0.09 | 1.81 | 248.99 | 0.14 |
Caltech - 256 | 0.987 | 1.61 | 1.61 | -5.49 | 0.31 | 152.64 | 0.09 | 3.10 | 18.22 | 0.12 | 1.61 | 2334.25 | 0.06 |
ImageNet - 1K | 1.0 | 1.82 | 1.81 | -6.84 | 0.00 | 360.00 | 0.11 | 2.88 | 40.68 | 0.00 | 1.55 | 9823.30 | 0.03 |
The ratio of fixed-feature results from the pretrained model to fine-tuned results (FF/FT) serves as a proxy for a well formed feature space. Higher is better.
Measures that rank the target tasks in the correct order: FERM 1, FERM 2, Silhouette. Close to correct: Davies Bouldin.
Transferability measures: LEEP, OTCE, H-score.
Clustering measures: Silhouette, Davies Bouldin, Calinski Harabasz, Dunn, RS, Point Biserial.
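One way to make "in the correct order" concrete is a rank correlation between each measure and FF/FT across the five target tasks. The sketch below uses Spearman's rank correlation, which is an assumption of this example rather than a method stated on the slide, and covers only a subset of the table's columns. The simple formula used here assumes no tied values, which holds for the columns chosen.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation for tie-free data."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Columns from the table, in row order:
# Aircraft, Cars, DTD, Caltech, ImageNet.
ff_ft = [0.663, 0.677, 0.946, 0.987, 1.0]
measures = {
    "FERM 1": [1.11, 1.22, 1.33, 1.61, 1.82],
    "FERM 2": [1.14, 1.22, 1.36, 1.61, 1.81],
    "LEEP": [-4.59, -5.25, -3.83, -5.49, -6.84],
    "Silhouette": [-0.12, -0.10, -0.04, 0.09, 0.11],
    "Davies Bouldin": [6.79, 4.29, 4.65, 3.10, 2.88],
}

rho = {name: spearman(ff_ft, column) for name, column in measures.items()}
for name, value in rho.items():
    print(f"{name}: {value:+.2f}")
```

FERM 1, FERM 2, and Silhouette each give a correlation of +1.0 with FF/FT. Davies Bouldin gives -0.9; since lower Davies Bouldin values indicate better clustering, a value near -1 corresponds to "close to correct".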
FERM: Measuring Corruption in the Dataset
Inception v4 pretrained on ImageNet 1K
Conclusion
Our first two FERM measures have good potential to measure how well the current model and weights represent the feature space for a given dataset. Within the bounds of our experiments they:
The Silhouette clustering measure [1] also has good potential to be used for this purpose as it is monotonically decreasing both:
We expect the combination of our FERM measures and the Silhouette clustering measure to be effective at evaluating when a trained model is not representing a dataset well.
[1] P. J. Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65, 1987.
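For reference, the Silhouette measure [1] can be sketched in a few lines when class labels are treated as cluster assignments. This is a minimal pure-Python version under several assumptions: Euclidean distance, at least two classes, points that are not all identical, and a score of 0 for singleton classes. The toy points are hypothetical, and the experiments may have used a different implementation.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def silhouette_score(points, labels):
    """Mean silhouette coefficient, using class labels as clusters."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # convention for a singleton class
            continue
        # a: mean distance to the other points of the same class.
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the points of any other class.
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical 2-D features: two tight, well separated groups.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]

well_formed = silhouette_score(points, [0, 0, 1, 1])  # labels match groups
poorly_formed = silhouette_score(points, [0, 1, 0, 1])  # labels cut across groups
print(f"well formed: {well_formed:.2f}, poorly formed: {poorly_formed:.2f}")
```

The score approaches 1 when classes form tight, separated groups and turns negative when points sit closer to another class than to their own.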