1 of 8

FERM: A FEature-space Representation Measure for Improved Model Evaluation

Yeu-Shin Fu, Wenbo Ge, Jo Plested

Jo Plested Associate Lecturer

University of New South Wales Australia

2 of 8

A Well Formed Feature Space

A 1,500-dimensional feature space is reduced with t-SNE to two normalised representative dimensions so that it can be visualised.

Each class is numbered and allocated a different colour. Class centroids are labelled.

Inception v4 is pretrained on ImageNet, then fine-tuned on a combination of the Stanford Cars [2] and FGVC Aircraft [1] datasets.

[1] S. Maji, J. Kannala, E. Rahtu, M. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft. Technical report, Toyota Technological Institute at Chicago, 2013.

[2] Yin Cui, Feng Zhou, Yuanqing Lin, and Serge Belongie. Fine-grained categorization and dataset bootstrapping using deep metric learning with humans in the loop. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1153–1162, 2016.

A well-formed feature space has high similarity/tight clustering within a class (intra-class) and low similarity/sparse clustering between classes (inter-class).
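
The intra-class versus inter-class contrast can be illustrated numerically. The following is a minimal sketch, assuming NumPy and cosine similarity as the similarity function; `class_similarity` and the toy data are hypothetical and are not the FERM definition from these slides.

```python
import numpy as np

def class_similarity(features, labels):
    """Return (mean intra-class, mean inter-class) cosine similarity."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    # L2-normalise each row so that a dot product is a cosine similarity.
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = unit @ unit.T
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)  # ignore self-similarity
    intra = sims[same & off_diag].mean()
    inter = sims[~same].mean()
    return intra, inter

# Toy data (an assumption): two tight classes pointing in different directions.
rng = np.random.default_rng(0)
centres = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
labels = np.repeat([0, 1], 20)
features = centres[labels] + 0.05 * rng.normal(size=(40, 3))

intra, inter = class_similarity(features, labels)
print(f"intra-class: {intra:.3f}, inter-class: {inter:.3f}")
```

On data like this, the intra-class value should sit close to 1 and the inter-class value close to 0, which is the pattern described above.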

3 of 8

Feature-space Representation Measures

Why Should We Care?

  • Uncovering and quantifying biases in models that make them unreliable for use on rarer data.
  • Detecting domain shift and predicting the best response in terms of model retraining, for example quantifying how well prediction models based on historical data represent data from the last few years, which has changed due to COVID and other modern challenges.
  • Highlighting when models are performing well on training and test data, but overfitting a poor representation that will not generalise well to new data.
  • Predicting the optimal way to train or retrain a model with limited training examples for a new or changing target dataset.

4 of 8

Why Should We Care? – Overfitting poor features

Overfitting when transferring large models to small target datasets. For smaller target dataset sizes, the difference between training loss and validation loss increases as the number of layers transferred increases.

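| Layers | Loss | 10 training examples per class | 25 training examples per class | 50 training examples per class | Average training and validation loss difference |
|---|---|---|---|---|---|
| 4 | Training | 1.55 | 1.22 | 1.35 | 2.91 |
|   | Validation | 5.07 | 4.27 | 3.51 | |
| 5 | Training | 0.71 | 0.82 | 1.06 | 3.63 |
|   | Validation | 5.38 | 4.43 | 3.67 | |
| 6 | Training | 0.49 | 0.60 | 0.83 | 3.90 |
|   | Validation | 5.34 | 4.48 | 3.81 | |
| 7 | Training | 0.39 | 0.48 | 0.71 | 4.72 |
|   | Validation | 6.26 | 5.18 | 4.29 | |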

http://users.cecs.anu.edu.au/~Tom.Gedeon/pdfs/An%20Analysis%20of%20the%20Interaction%20Between%20Transfer%20Learning%20Protocols%20in%20Deep%20Neural%20Networks.pdf

Transferring from pretraining on one set of 500 ImageNet classes to the other 500, with limited training examples per class in the target dataset.

As more layers are transferred, the difference between training and validation loss becomes significantly larger, because there is not enough data to change the features much during fine-tuning.
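
The average loss difference reported for each number of transferred layers can be recomputed from the training and validation losses in the table. The sketch below is only a worked check of that arithmetic; the variable and function names are illustrative.

```python
# Losses transcribed from the table above, keyed by layers transferred.
# Each list holds the values for 10, 25 and 50 training examples per class.
losses = {
    4: {"train": [1.55, 1.22, 1.35], "val": [5.07, 4.27, 3.51]},
    5: {"train": [0.71, 0.82, 1.06], "val": [5.38, 4.43, 3.67]},
    6: {"train": [0.49, 0.60, 0.83], "val": [5.34, 4.48, 3.81]},
    7: {"train": [0.39, 0.48, 0.71], "val": [6.26, 5.18, 4.29]},
}

def average_gap(train, val):
    """Mean of (validation loss - training loss) across dataset sizes."""
    gaps = [v - t for t, v in zip(train, val)]
    return sum(gaps) / len(gaps)

average_gaps = {
    layers: average_gap(row["train"], row["val"]) for layers, row in losses.items()
}
for layers, gap in average_gaps.items():
    print(f"{layers} layers: average gap {gap:.2f}")
```

The results round to 2.91, 3.63, 3.90 and 4.72 for 4, 5, 6 and 7 layers, matching the reported averages and growing with every additional layer transferred.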

5 of 8

FERM: FEature-space Representation Measures

 

6 of 8

FERM: Evaluating fixed features

| Target task | FF/FT | FERM 1 | FERM 2 | LEEP | OTCE | H-score | Silhouette | Davies Bouldin | Calinski Harabasz | Dunn | RS / Point Biserial (source lists three values) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Aircraft - 100 | 0.663 | 1.11 | 1.14 | -4.59 | 0.26 | 53.09 | -0.12 | 6.79 | 20.71 | 0.08 | 1.61 / 243.95 / 0.1 |
| Cars - 196 | 0.677 | 1.22 | 1.22 | -5.25 | 0.32 | 160.41 | -0.10 | 4.29 | 5.75 | 0.21 | 1.61 / 1146.63 / 0.07 |
| DTD - 47 | 0.946 | 1.33 | 1.36 | -3.83 | 0.28 | 41.77 | -0.04 | 4.65 | 9.07 | 0.09 | 1.81 / 248.99 / 0.14 |
| Caltech - 256 | 0.987 | 1.61 | 1.61 | -5.49 | 0.31 | 152.64 | 0.09 | 3.10 | 18.22 | 0.12 | 1.61 / 2334.25 / 0.06 |
| ImageNet - 1K | 1.0 | 1.82 | 1.81 | -6.84 | 0.00 | 360.00 | 0.11 | 2.88 | 40.68 | 0.00 | 1.55 / 9823.30 / 0.03 |

The ratio of fixed-feature results using the pretrained model to fine-tuned results (FF/FT) is used as a proxy for a well-formed feature space. Higher is better.

  • Comparison of FERM with other potential measures from transferability and clustering analysis on different target tasks.
  • The source task is ImageNet 1K and the model is Inception v4.
  • The number of classes in each target dataset is listed next to its name. Note that some measures are severely affected by this number.

Measures in the correct order: FERM 1, FERM 2, Silhouette. Close to correct: Davies Bouldin.

Transferability measures: LEEP, OTCE, H-score.

Clustering measures: Silhouette, Davies Bouldin, Calinski Harabasz, Dunn, RS, Point Biserial.
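
The "correct order" claim can be checked by counting, for each measure, the pairs of target tasks it ranks differently from FF/FT. The sketch below uses values transcribed from the table above; `discordant_pairs` is a hypothetical helper, and negating Davies Bouldin assumes its usual lower-is-better convention.

```python
# Task order: Aircraft, Cars, DTD, Caltech, ImageNet.
ff_ft = [0.663, 0.677, 0.946, 0.987, 1.0]

measures = {
    "FERM 1": [1.11, 1.22, 1.33, 1.61, 1.82],
    "FERM 2": [1.14, 1.22, 1.36, 1.61, 1.81],
    "LEEP": [-4.59, -5.25, -3.83, -5.49, -6.84],
    "H-score": [53.09, 160.41, 41.77, 152.64, 360.00],
    "Silhouette": [-0.12, -0.10, -0.04, 0.09, 0.11],
    # Lower is better for Davies Bouldin, so the sign is flipped.
    "Davies Bouldin (negated)": [-6.79, -4.29, -4.65, -3.10, -2.88],
}

def discordant_pairs(reference, scores):
    """Count task pairs ordered one way by reference but not by scores."""
    n = len(reference)
    return sum(
        1
        for i in range(n)
        for j in range(n)
        if reference[i] < reference[j] and scores[i] >= scores[j]
    )

inversions = {name: discordant_pairs(ff_ft, s) for name, s in measures.items()}
for name, count in inversions.items():
    print(f"{name}: {count} discordant pair(s)")
```

FERM 1, FERM 2 and Silhouette give zero discordant pairs, and Davies Bouldin gives one (Cars versus DTD), consistent with the summary above.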

7 of 8

FERM: Measuring Corruption in the Dataset

Inception v4 pretrained on ImageNet 1K

  • Starting with a subset of ImageNet 1K, samples from the Aircraft dataset are added to slowly change the ratio of source to target data (% corruption).
  • We are ideally looking for measures that are mostly monotonically increasing or decreasing from left to right.
    • Silhouette is mostly monotonically decreasing and drops off significantly when there is close to 50% corruption.
    • The FERM (Cosine) measures are interesting because they are monotonically decreasing except when there is just 1 training example per class of either ImageNet or Aircraft. This case is useful to detect: with 1 example per class there is no way to measure within-class similarity, so it is essentially adding pure noise.
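
One simple way to quantify "mostly monotonically decreasing" is the fraction of consecutive steps that go down. The sketch below is an illustration only: the helper `fraction_decreasing` is hypothetical and the score series is made-up example data, not values read from the slide's plots.

```python
def fraction_decreasing(values):
    """Fraction of consecutive steps where the value goes down."""
    steps = list(zip(values, values[1:]))
    return sum(b < a for a, b in steps) / len(steps)

# Hypothetical measure values at increasing % corruption (not from the slides).
corruption = [0, 10, 20, 30, 40, 50]
scores = [0.30, 0.27, 0.28, 0.22, 0.15, 0.02]

frac = fraction_decreasing(scores)

# Index of the step with the largest drop, and the corruption level it ends at.
largest_drop = max(range(len(scores) - 1), key=lambda i: scores[i] - scores[i + 1])
drop_ends_at = corruption[largest_drop + 1]

print(f"decreasing steps: {frac:.0%}, largest drop ends at {drop_ends_at}% corruption")
```

A value of 1.0 would mean strictly monotonically decreasing; values just below 1.0 correspond to the "mostly monotonic" behaviour described above.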

8 of 8

Conclusion

Our first two FERM measures have good potential to measure how well the current model and weights represent the feature space for a given dataset. Within the bounds of our experiments they:

  • decrease monotonically as the fixed features become less suited to the current target dataset and task.
  • are extremely sensitive to noise added to corrupt the dataset.
  • are marginally sensitive to more structured corruption in the form of new classes not currently modelled.

The Silhouette clustering measure [1] also has good potential to be used for this purpose as it is monotonically decreasing both:

  • as the fixed features become less suited to the current target dataset and task, and
  • as structured corruption, in the form of new classes not currently modelled, is added to an existing dataset.

We expect the combination of our FERM measures and the Silhouette clustering measure to be effective at evaluating when a trained model is not representing a dataset well.
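
For reference, the Silhouette measure [1] can be computed directly from features and class labels. The sketch below is a minimal NumPy version of the standard definition, assuming Euclidean distance; `mean_silhouette` and the toy data are illustrative and are not the code used for the experiments.

```python
import numpy as np

def mean_silhouette(features, labels):
    """Mean silhouette width over all samples, using Euclidean distance."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    diffs = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))  # pairwise distances
    classes = np.unique(labels)
    scores = np.zeros(len(labels))
    for i in range(len(labels)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # a singleton class gets s(i) = 0 by convention
        # a: mean distance to the other members of the same class
        # (the self-distance is 0, so dividing by n_own - 1 excludes it).
        a = dist[i, own].sum() / (n_own - 1)
        # b: mean distance to the nearest other class.
        b = min(dist[i, labels == c].mean() for c in classes if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

# Toy data (an assumption): two well-separated classes in 2-D.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 15)
features = np.where(labels[:, None] == 0, 0.0, 5.0) + rng.normal(size=(30, 2))

clean = mean_silhouette(features, labels)
shuffled = mean_silhouette(features, rng.permutation(labels))
print(f"clean labels: {clean:.2f}, shuffled labels: {shuffled:.2f}")
```

With well-separated classes the score should be clearly positive, and it should fall once the labels no longer match the structure of the features. Classes with a single example contribute a score of 0, so they carry no within-class information.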

[1] Peter J. Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65, 1987.