1 of 6

Brain-based image filtering

MedARC 3/29/2024

2 of 6

What is “image filtering”?

It is hard to overstate the importance of good data in training good image models (including generative models like Stable Diffusion and contrastive models like CLIP)

There is a vast literature surrounding dataset filtering: how to take a massive dataset and remove bad samples (e.g., poor-quality samples, duplicated samples)

A model trained on fewer, higher-quality samples can outperform a model trained on more samples with greater variation in quality

The aim of this project is to investigate a novel approach to image filtering that makes use of fMRI data collected while humans viewed images inside the scanner

3 of 6

How do ML researchers typically do image filtering?

  1. Deduplication (e.g., hashing)
  2. Semantic clustering (e.g., SemDeDup) and removing overlapping samples within the same cluster
  3. Cosine similarity thresholding with CLIP scores
    1. LAION-400M was built by computing CLIP image/text embeddings for scraped pairs and dropping those with cosine similarity below 0.3
  4. And more: https://haoliyin.substack.com/p/data-pruning-at-scale
  5. The DataComp benchmark and workshop: https://www.datacomp.ai/workshop.html#first
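As an illustration of step 3 above, a minimal sketch of CLIP-score thresholding, assuming precomputed and aligned image/text embeddings; the function name and toy data are hypothetical:

```python
import numpy as np

def filter_by_clip_similarity(image_embs, text_embs, threshold=0.3):
    """Keep image/text pairs whose CLIP cosine similarity meets the threshold.

    image_embs, text_embs: (N, D) arrays of precomputed CLIP embeddings.
    Returns indices of the pairs to keep.
    """
    # Normalize so the per-pair dot product equals cosine similarity.
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = np.sum(img * txt, axis=1)  # per-pair cosine similarity
    return np.where(sims >= threshold)[0]

# Toy usage: two well-aligned pairs and one mismatched caption.
rng = np.random.default_rng(0)
imgs = rng.normal(size=(3, 512))
txts = imgs.copy()
txts[2] = rng.normal(size=512)  # third caption does not match its image
kept = filter_by_clip_similarity(imgs, txts, threshold=0.3)
```

In practice the embeddings would come from a pretrained CLIP model rather than random vectors, but the thresholding step is the same.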

4 of 6

Why would the brain be at all relevant?

  • Ultimately ML models are meant for human use, and thus need to be aligned with how humans cognitively represent images; brain data is as direct a method as you can get for measuring such representations
  • There is precedent that fMRI encoding models (input image → pretrained model latent → brain prediction) get better by using better-quality model latents (this should be pretty obvious). Less obvious is that this improvement seems to specifically track the diversity of the underlying dataset used to train that model, not other factors like effective dimensionality, model architecture, SSL vs. contrastive training, size of the underlying dataset, etc. (Conwell et al. 2022). This finding hints at the potential of brain encoding models to measure dataset diversity for image filtering.
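As a sketch of the encoding-model pipeline mentioned above (image → model latent → brain prediction), a closed-form ridge regression from latents to voxel responses; the sizes, names, and synthetic data are illustrative and not taken from any cited work:

```python
import numpy as np

def fit_encoding_model(latents, voxels, alpha=1.0):
    """Ridge regression from model latents (N, D) to voxel responses (N, V).

    Returns the (D, V) weight matrix mapping latent space to brain space.
    """
    d = latents.shape[1]
    # Closed-form ridge solution: (X^T X + alpha*I)^-1 X^T Y
    return np.linalg.solve(latents.T @ latents + alpha * np.eye(d),
                           latents.T @ voxels)

# Synthetic example: 200 images, 64-d latents, 1000 voxels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))          # stand-in for model latents
true_W = rng.normal(size=(64, 1000))
Y = X @ true_W + 0.1 * rng.normal(size=(200, 1000))  # noisy "voxel" data
W = fit_encoding_model(X, Y)

pred = X @ W
# Per-voxel R^2 is a common encoding-performance metric.
r2 = 1 - ((Y - pred) ** 2).sum(0) / ((Y - Y.mean(0)) ** 2).sum(0)
```

With real data, the latents would come from a pretrained vision model and the voxel responses from fMRI, with R^2 evaluated on held-out images.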

5 of 6

How would this actually work?

Imagine we are training a CLIP model on a full, unfiltered dataset. We also have a separate train/val set of image/brain paired samples (from Allen et al., 2021).

We train the CLIP model in batches of images. After every batch, we freeze the CLIP model and train a simple linear regression encoding model that goes from image → CLIP latent → brain. This brain prediction is the basis for a separate “fMRI loss” metric. We can then inspect the gradient to determine which images in the present batch were most responsible for improving or worsening the fMRI loss. We then remove the k most harmful images in the batch from the entire unfiltered dataset and continue training, until we have effectively pruned the dataset down to a target number of samples.
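One way the per-batch loop above might be sketched. Everything here is a simplification: a linear stand-in for the CLIP image encoder, an MSE stand-in for the contrastive loss, and a first-order gradient-alignment score in place of a full gradient attribution; all names and data are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
D_in, D_lat, V = 16, 8, 32          # toy sizes: input, latent, voxel dims

# Hypothetical stand-ins: a linear "CLIP" encoder, one training batch with
# target latents, and a held-out image/brain paired set.
W_clip = rng.normal(size=(D_in, D_lat)) * 0.1
X_batch = rng.normal(size=(10, D_in))    # current training batch
T_batch = rng.normal(size=(10, D_lat))   # stand-in training targets
X_brain = rng.normal(size=(50, D_in))    # images shown in the scanner
Y_brain = rng.normal(size=(50, V))       # measured voxel responses

# 1. Freeze the encoder and fit a linear encoding model latent -> brain.
L_brain = X_brain @ W_clip
W_enc = np.linalg.solve(L_brain.T @ L_brain + np.eye(D_lat),
                        L_brain.T @ Y_brain)

# 2. Gradient of the fMRI loss (MSE of brain predictions) w.r.t. the
#    encoder weights, with the encoding model held fixed.
resid = L_brain @ W_enc - Y_brain
g_fmri = (X_brain.T @ (resid @ W_enc.T)) * 2 / len(X_brain)

# 3. Per-image training-loss gradients, scored by alignment with g_fmri.
scores = np.empty(len(X_batch))
for i, (x, t) in enumerate(zip(X_batch, T_batch)):
    g_i = np.outer(x, x @ W_clip - t) * 2
    # A step along -g_i changes the fMRI loss by roughly -lr * <g_i, g_fmri>,
    # so a negative dot product flags an image that worsens the fMRI loss.
    scores[i] = np.sum(g_i * g_fmri)

k = 3
to_drop = np.argsort(scores)[:k]    # the k most harmful images in this batch
```

A real implementation would use autograd (e.g., per-sample gradients in PyTorch) on the actual contrastive loss rather than these closed-form linear surrogates.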

6 of 6

Practicalities

Collaborating over Discord and GitHub (our repo: https://github.com/MedARC-AI/brain-image-filtering)