1 of 38

Enhancing Medical Image Classification with Multi-Modal Inputs

Alain Fidahoussen

Danil Garmaev

2 of 38

1. Problem Statement and Approach
2. Results
3. Challenges and Next steps

3 of 38

Problem Statement


  • Motivation:

    • Single-view X-ray models miss complementary information.
    • Textual data (e.g., radiology reports) can provide context for better diagnosis.

  • Goal: Improve diagnostic accuracy in medical imaging by combining multi-modal inputs (frontal + lateral X-rays + text) using Visual Language Models.

Frontal and lateral X-rays of a patient showing cardiomegaly (enlarged heart size)

Frontal and lateral X-rays of a patient showing pleural effusion (fluid is present in the space around the lung)

4 of 38

Problem Statement


  • Cardiomegaly: the heart looks larger than normal

  • Edema: extra fluid in the lungs

  • Consolidation: a region of the lung is filled with fluid instead of air

  • Atelectasis: part of the lung appears collapsed

  • Pleural Effusion: fluid is present in the space around the lung

5 of 38

Related work


  • Model: CNN (DenseNet-121)

  • Inputs: frontal and lateral views

  • Outputs: 14 conditions (multi-label)

  • Dataset: MIMIC-CXR

    • ~340K frontal images

    • ~116K lateral images

The DualNet CNN architecture accepts a pair of frontal (PA or AP) and lateral input images.

6 of 38

Related work


  • Improvement for 10 of the 14 classes.

  • Metric: Area Under the Curve (AUC)

7 of 38

Related work


  • Model: CLIP (image encoder: ViT-B/16; text encoder: PubMedBERT)

  • Dataset: PMC-15M

    • ~15M high-quality image–text pairs from 4.4M scientific papers

    • 2 orders of magnitude larger than MIMIC-CXR

8 of 38

Related work


  • Model: LLaVa (Mistral-7B-Instruct)

  • Dataset:

    • PMC-15M (Stage 1)

    • GPT-4 Instructions (Stage 2)

9 of 38

Approach


  • Hypothesis: Combining frontal and lateral X-rays with textual data improves diagnostic accuracy compared to using single-view X-rays.

  • Dataset: CheXpert, with frontal and lateral X-rays labeled for 14 different conditions.

  • Baseline: Dual CNN trained on CheXpert.

  • Models:
    • BioMedCLIP
    • LLaVa-Med

  • Methods:
    • Zero-shot classification
    • Fine-tuning
    • Open-ended questions

10 of 38

Approach


  • BioMedCLIP
    • Freeze the CLIP parameters, or fine-tune the image/text encoders
    • Add a multi-label classification head

  • LLaVA-Med
    • Adapt the model to accept dual-view images
    • Fine-tune the model on the CheXpert dataset
    • Evaluate zero-shot classification with closed questions first, then attempt to extend the model to answer open-ended questions

  • Evaluation process: train and evaluate each model
    • On frontal X-rays only
    • On both frontal and lateral X-rays

11 of 38

1. Problem Statement and Approach
2. Results
3. Challenges and Next steps

13 of 38

Challenges


  • LLaVa-Med

    • Poor zero-shot results

    • Difficulty finding a good prompt to generate the Q/A pairs from the reports

    • Difficulty finding a good model to generate the Q/A pairs from the reports
      • LLaMA 3.2 ⇒ local and fast, but not always accurate
      • DeepSeek R1 ⇒ local, but slow inference
      • GPT ⇒ not free

    • Fine-tuning
      • Computationally expensive (~30 h/epoch with LoRA)
      • Imbalanced dataset

14 of 38

Next steps


  • BiomedCLIP ✅

    • Frontal and lateral > Frontal
    • CLIP-based > CNN

  • LLaVa-Med ❌

    • Generate better Q/A
    • Weighted loss
    • Under-/over-sampling methods
    • Text-based augmentation
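One of the remediations listed above, a weighted loss, could look like the following sketch. Plain NumPy stands in for the training framework, and the `pos_weight` value is illustrative, not a tuned project parameter:

```python
import numpy as np

def weighted_bce(probs, labels, pos_weight=1.0):
    """Binary cross-entropy where positive examples are up-weighted,
    so rare positive findings contribute more to the loss."""
    probs = np.clip(probs, 1e-7, 1 - 1e-7)  # avoid log(0)
    loss = -(pos_weight * labels * np.log(probs)
             + (1 - labels) * np.log(1 - probs))
    return loss.mean()

# One positive among four samples; pos_weight ~ (# negatives) / (# positives).
labels = np.array([1.0, 0.0, 0.0, 0.0])
probs = np.array([0.3, 0.2, 0.1, 0.2])
plain = weighted_bce(probs, labels, pos_weight=1.0)
weighted = weighted_bce(probs, labels, pos_weight=3.0)
print(weighted > plain)  # the under-predicted positive now dominates the loss
```

In practice the weight would be computed per condition from the label counts in the training split.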

16 of 38

Experimental setup


Dataset

  • CheXpert

    • Focus on the 5 most important clinical conditions

    • Filter patients with both frontal and lateral views

      • Training: 25,188 patients
      • Validation: 6,297 patients
      • Test: 143 patients
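The view-filtering step above could be sketched as follows; the toy records and field names are hypothetical stand-ins for the CheXpert metadata rows:

```python
# Toy rows standing in for CheXpert metadata; field names are hypothetical.
records = [
    {"patient": "p1", "view": "frontal"},
    {"patient": "p1", "view": "lateral"},
    {"patient": "p2", "view": "frontal"},   # no lateral study, so filtered out
]

def patients_with_both_views(rows):
    """Keep only patients with at least one frontal AND one lateral study."""
    views = {}
    for row in rows:
        views.setdefault(row["patient"], set()).add(row["view"])
    return sorted(p for p, v in views.items() if {"frontal", "lateral"} <= v)

print(patients_with_both_views(records))  # → ['p1']
```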

19 of 38

Experimental setup


Metric

  • Area Under the Curve (AUC)

  • Not sensitive to moderate class imbalance

  • Random classifier: AUC = 0.5 (ranking no better than chance)

  • Perfect classifier: AUC = 1.0 (a threshold exists with FP = FN = 0)
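AUC can be computed directly from its ranking definition, which also makes the imbalance point concrete: duplicating the negatives leaves the score unchanged. A minimal sketch:

```python
def auc_score(labels, scores):
    """AUC via the Mann-Whitney formulation: the probability that a
    randomly chosen positive is scored above a randomly chosen negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # → 0.75
print(auc_score([0, 0, 0, 0, 1, 1],
                [0.1, 0.4, 0.1, 0.4, 0.35, 0.8]))      # → 0.75 (negatives doubled)
```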

23 of 38

Preliminary Results


BioMedCLIP

  • Zero-shot performance

    • For each condition, discriminate between having (+) and not having (-) the condition

(+) ⇒ "This chest X-ray suggests an overall enlargement of the cardiac silhouette, consistent with cardiomegaly.",

(-) ⇒ "This chest X-ray shows a normal-sized cardiac silhouette, with no indication of cardiomegaly."

    • Compute the cosine similarity between the visual features and each of the (+) and (-) textual features, and choose the larger score

    • Frontal and Lateral visual features are averaged
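The decision rule above can be sketched as follows, with toy NumPy vectors standing in for the actual BioMedCLIP embeddings (the values are illustrative only):

```python
import numpy as np

def l2_normalize(v):
    return v / np.linalg.norm(v)

def zero_shot_predict(frontal, lateral, pos_text, neg_text):
    """Average the two normalized view embeddings, then pick whichever
    prompt, (+) or (-), has the higher cosine similarity."""
    visual = l2_normalize(l2_normalize(frontal) + l2_normalize(lateral))
    pos_sim = float(visual @ l2_normalize(pos_text))  # similarity with the (+) prompt
    neg_sim = float(visual @ l2_normalize(neg_text))  # similarity with the (-) prompt
    return "+" if pos_sim > neg_sim else "-"

# Toy 4-d embeddings in which the views lean toward the (+) prompt direction.
frontal = np.array([1.0, 0.2, 0.0, 0.0])
lateral = np.array([0.9, 0.3, 0.1, 0.0])
pos_text = np.array([1.0, 0.0, 0.0, 0.0])
neg_text = np.array([0.0, 0.0, 1.0, 0.0])
print(zero_shot_predict(frontal, lateral, pos_text, neg_text))  # → +
```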

24 of 38

Preliminary Results


BioMedCLIP

  • Zero-shot performance (AUC)

    • Not great (0.5 = random classifier)

    • Slight improvement using both Frontal and Lateral views

29 of 38

Preliminary Results


BioMedCLIP

  • Fine-tune a multi-label classification head

    • CLIP frozen

    • Frontal and Lateral visual features are averaged

    • Concatenate the visual and textual features

    • For each condition, predict a score between 0 and 1 (sigmoid)
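A minimal sketch of that head, with NumPy standing in for the actual framework; the 512-d embedding size and the random, untrained weights are assumptions for illustration:

```python
import numpy as np

EMBED_DIM = 512      # assumed BioMedCLIP projection size
NUM_CONDITIONS = 5   # the five CheXpert conditions kept in this project

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.02, size=(2 * EMBED_DIM, NUM_CONDITIONS))  # stand-in weights
b = np.zeros(NUM_CONDITIONS)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict(frontal_feat, lateral_feat, text_feat):
    """Average the frozen CLIP view features, concatenate the text features,
    and emit one independent probability per condition (multi-label)."""
    visual = (frontal_feat + lateral_feat) / 2    # fuse frontal + lateral
    fused = np.concatenate([visual, text_feat])   # shape: (2 * EMBED_DIM,)
    return sigmoid(fused @ W + b)                 # shape: (NUM_CONDITIONS,)

probs = predict(rng.normal(size=EMBED_DIM),
                rng.normal(size=EMBED_DIM),
                rng.normal(size=EMBED_DIM))
print(probs.shape)  # → (5,)
```

During training, only `W` and `b` are updated in the frozen-CLIP setting; the encoders stay fixed.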

31 of 38

Preliminary Results


BioMedCLIP

  • Fine-tune a multi-label classification head (AUC)

    • Significant improvement over zero-shot (~20%)

    • Slight improvement using both Frontal and Lateral views

33 of 38

Preliminary Results


BioMedCLIP

  • Fine-tune CLIP and a multi-label classification head

    • CLIP is fine-tuned for a few epochs

35 of 38

Preliminary Results


BioMedCLIP

  • Fine-tune CLIP and the multi-label classification head

    • Significant improvement over frozen CLIP (~10%)

    • Slight improvement using both Frontal and Lateral views

36 of 38

Timeline


Week 1 (Feb. 17 – 23): Literature review

Week 2 (Feb. 24 – March 2): Data preparation, model selection

Weeks 3 – 4 (March 3 – 16): Baselines, BioMedCLIP

Weeks 5 – 6 (March 17 – 30): LLaVa-Med

Week 7 (March 31 – April 6): Results analysis

Week 8 (April 7 – 15): Finalization and poster presentation

37 of 38

Q & A


38 of 38

MIMIC-CXR
