2 of 8

A Well Formed Feature Space

A 1,500-dimensional feature space is reduced with t-SNE to two normalised dimensions so that it can be visualised.

Each class is numbered and allocated a different colour. Class centroids are labelled.

Inception v4 is pretrained on ImageNet, then fine-tuned on a combination of the FGVC Aircraft [1] and Stanford Cars [2] datasets.

[1] S. Maji, J. Kannala, E. Rahtu, M. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft. Technical report, Toyota Technological Institute at Chicago, 2013.

[2] Yin Cui, Feng Zhou, Yuanqing Lin, and Serge Belongie. Fine-grained categorization and dataset bootstrapping using deep metric learning with humans in the loop. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1153–1162, 2016.

A well formed feature space has high similarity/tight clustering within a class (intra-class) and low similarity/sparse clustering between classes (inter-class).
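The intra-class versus inter-class contrast can be made concrete with a small sketch. This is an illustration only, not the definition of any measure in this deck: the helper names `cosine` and `intra_inter_similarity`, the use of cosine similarity, and the toy feature values are all assumptions.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def intra_inter_similarity(features, labels):
    """Return (mean intra-class, mean inter-class) cosine similarity over all pairs.

    Assumes at least one same-class pair and one different-class pair exist.
    """
    intra, inter = [], []
    for (fi, li), (fj, lj) in combinations(zip(features, labels), 2):
        (intra if li == lj else inter).append(cosine(fi, fj))
    return sum(intra) / len(intra), sum(inter) / len(inter)

# Toy 2-D features (hypothetical values): two tight, well separated classes.
features = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9)]
labels = [0, 0, 1, 1]
intra, inter = intra_inter_similarity(features, labels)
```

On this toy data the intra-class mean is much higher than the inter-class mean, which is the pattern a well formed feature space is expected to show.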

3 of 8

Feature-space Representation Measures

Why Should We Care?

Uncovering and quantifying biases in models that make them unreliable for use on rarer data.
Detecting domain shift and predicting the best response in terms of model retraining, for example quantifying how well prediction models based on historical data represent data from the last few years, which has changed due to COVID and other modern challenges.
Highlighting when models are performing well on training and test data, but overfitting a poor representation that will not generalise well to new data.
Predicting the optimal way to train or retrain a model with limited training examples for a new or changing target dataset.

4 of 8

Why Should We Care? – Overfitting poor features

Overfitting when transferring large models to small target datasets. For smaller target dataset sizes, the difference between training loss and validation loss increases as the number of layers transferred increases.

Layers	Loss	10 training examples per class	25 training examples per class	50 training examples per class	Average training and validation loss difference
4	Training	1.55	1.22	1.35	2.91
	Validation	5.07	4.27	3.51
5	Training	0.71	0.82	1.06	3.63
	Validation	5.38	4.43	3.67
6	Training	0.49	0.60	0.83	3.90
	Validation	5.34	4.48	3.81
7	Training	0.39	0.48	0.71	4.72
	Validation	6.26	5.18	4.29

http://users.cecs.anu.edu.au/~Tom.Gedeon/pdfs/An%20Analysis%20of%20the%20Interaction%20Between%20Transfer%20Learning%20Protocols%20in%20Deep%20Neural%20Networks.pdf

Transferring from pretraining on one set of 500 ImageNet classes to the other 500, with limited training examples per class in the target dataset.

As more layers are transferred, the difference between training and validation loss becomes significantly larger: there is not enough data to change the features much during fine-tuning.
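The final column of the table can be reproduced from the per-size losses. The sketch below assumes that the "average difference" is the mean of validation loss minus training loss over the three dataset sizes; the names `losses` and `average_gap` are illustrative.

```python
# Loss values copied from the table: layers transferred -> (training, validation)
# losses for 10, 25 and 50 training examples per class.
losses = {
    4: ((1.55, 1.22, 1.35), (5.07, 4.27, 3.51)),
    5: ((0.71, 0.82, 1.06), (5.38, 4.43, 3.67)),
    6: ((0.49, 0.60, 0.83), (5.34, 4.48, 3.81)),
    7: ((0.39, 0.48, 0.71), (6.26, 5.18, 4.29)),
}

def average_gap(training, validation):
    """Mean of (validation - training) loss across the dataset sizes."""
    gaps = [v - t for t, v in zip(training, validation)]
    return sum(gaps) / len(gaps)

average_gaps = {layers: round(average_gap(t, v), 2) for layers, (t, v) in losses.items()}
# average_gaps -> {4: 2.91, 5: 3.63, 6: 3.9, 7: 4.72}
```

Under that assumption the computed values match the table, and the gap grows with every additional layer transferred.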

5 of 8

FERM: FEature-space Representation Measures

6 of 8

FERM: Evaluating fixed features

Target task	FF/FT	FERM 1	FERM 2	LEEP	OTCE	H-score	Silhouette	Davies Bouldin	Calinski Harabasz	Dunn	RS	Point Biserial
Aircraft - 100	0.663	1.11	1.14	-4.59	0.26	53.09	-0.12	6.79	20.71	0.08	1.61	243.95	0.1
Cars - 196	0.677	1.22	1.22	-5.25	0.32	160.41	-0.10	4.29	5.75	0.21	1.61	1146.63	0.07
DTD - 47	0.946	1.33	1.36	-3.83	0.28	41.77	-0.04	4.65	9.07	0.09	1.81	248.99	0.14
Caltech - 256	0.987	1.61	1.61	-5.49	0.31	152.64	0.09	3.10	18.22	0.12	1.61	2334.25	0.06
ImageNet - 1K	1.0	1.82	1.81	-6.84	0.00	360.00	0.11	2.88	40.68	0.00	1.55	9823.30	0.03

The ratio of fixed-feature results (using the pretrained model) to fine-tuned results (FF/FT) is used as a proxy for a well formed feature space. Higher is better.

Comparison of FERM with other potential measures from transferability and clustering analysis on different target tasks.
The source task is ImageNet 1K and the model is Inception v4.
The number of classes in each target dataset is listed next to its name. Note that some measures are severely affected by this number.

Measures in the correct order: FERM 1, FERM 2, Silhouette. Close to correct: Davies Bouldin.
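One way to check the ordering claim is a rank correlation between FF/FT and each measure. The sketch below uses a plain Spearman correlation, which is an assumption about how "correct order" might be quantified rather than the method used for the slide. The values are copied from the table, and the helper names are illustrative.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(x, y):
    """Spearman rank correlation for tie-free sequences of equal length."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Rows: Aircraft, Cars, DTD, Caltech, ImageNet (values from the table).
ff_ft = [0.663, 0.677, 0.946, 0.987, 1.0]
measures = {
    "FERM 1": [1.11, 1.22, 1.33, 1.61, 1.82],
    "FERM 2": [1.14, 1.22, 1.36, 1.61, 1.81],
    "LEEP": [-4.59, -5.25, -3.83, -5.49, -6.84],
    "Silhouette": [-0.12, -0.10, -0.04, 0.09, 0.11],
    "Davies Bouldin": [6.79, 4.29, 4.65, 3.10, 2.88],
}
correlations = {name: spearman(ff_ft, values) for name, values in measures.items()}
```

FERM 1, FERM 2 and Silhouette give a correlation of 1 with FF/FT. Davies Bouldin gives about -0.9; since it is a lower-is-better score, that corresponds to a nearly correct ordering.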

Of the comparison measures, LEEP, OTCE and H-score are transferability measures; the remaining columns, from Silhouette onwards, are clustering measures.

7 of 8

FERM: Measuring Corruption in the Dataset

Inception v4 pretrained on ImageNet 1K

Starting with a subset of ImageNet 1K, samples from the Aircraft dataset are added to slowly change the ratio of source to target data (% corruption).
We are ideally looking for measures that are mostly monotonically increasing or decreasing from left to right.

Silhouette is mostly monotonically decreasing and drops off significantly when corruption is close to 50%.
The FERM (Cosine) measures are interesting because they are monotonically decreasing except when there is just 1 training example per class of either ImageNet or Aircraft. It is useful to detect this case: with 1 example per class there is no way to measure within-class similarity, so those examples are essentially pure noise.
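The corruption protocol can be sketched as a sampling routine. This is a guess at one reasonable setup, not the experiment's actual code: it assumes the total sample count is held fixed while the share of target samples varies, and the function name `mix_datasets`, its parameters, and the placeholder samples are all hypothetical.

```python
import random

def mix_datasets(source, target, corruption_pct, total, seed=0):
    """Draw `total` samples, with `corruption_pct` percent taken from `target`.

    Assumes both pools are large enough to sample without replacement.
    """
    rng = random.Random(seed)
    n_target = round(total * corruption_pct / 100)
    n_source = total - n_target
    return rng.sample(source, n_source) + rng.sample(target, n_target)

# Hypothetical stand-ins for real samples: (dataset name, sample index).
source = [("imagenet", i) for i in range(100)]
target = [("aircraft", i) for i in range(100)]

mixed = mix_datasets(source, target, corruption_pct=30, total=50)
n_corrupt = sum(1 for name, _ in mixed if name == "aircraft")
```

Sweeping `corruption_pct` from 0 to 100 and recomputing each measure on the mixed set would give one curve per measure, which can then be inspected for monotonic behaviour.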

8 of 8

Conclusion

Our first two FERM measures have good potential to measure how well the current model and weights represent the feature space for a given dataset. Within the bounds of our experiments they:

decrease monotonically as the fixed features become less suited to the current target dataset and task;
are extremely sensitive to noise added to corrupt the dataset;
are marginally sensitive to more structured corruption in the form of new classes not currently modelled.

The Silhouette clustering measure [1] also has good potential to be used for this purpose as it is monotonically decreasing both:

as the fixed features become less suited to the current target dataset and task, and
as structured corruption, in the form of new classes not currently modelled, is added to an existing dataset.

We expect the combination of our FERM measures and the Silhouette clustering measure to be effective at evaluating when a trained model is not representing a dataset well.
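For reference, a minimal pure-Python sketch of the Silhouette coefficient follows. It uses Euclidean distance and scores singleton classes as 0, following the convention in [1]; returning 0 when only one class is present is an extra assumption made here so that the function always returns a number. The function name and toy points are illustrative.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient using Euclidean distance."""
    scores = []
    for i, (p, label) in enumerate(zip(points, labels)):
        # Distances from point i to every other point, grouped by class.
        by_class = {}
        for j, (q, other) in enumerate(zip(points, labels)):
            if i != j:
                by_class.setdefault(other, []).append(math.dist(p, q))
        own = by_class.pop(label, [])
        if not own or not by_class:
            scores.append(0.0)  # singleton class, or no other class to compare with
            continue
        a = sum(own) / len(own)                               # mean intra-class distance
        b = min(sum(d) / len(d) for d in by_class.values())   # nearest other class
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy points (hypothetical): the same four points with good and bad labellings.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
well_formed = silhouette_score(points, [0, 0, 1, 1])
poorly_formed = silhouette_score(points, [0, 1, 0, 1])
```

The well separated labelling scores close to 1 and the mixed labelling scores below 0, which is the direction of change the measure is expected to show as features become less suited to a dataset.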

[1] Peter J. Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65, 1987.