1 of 33

EGDCL (Evidence-Guided Dual-Curriculum Learning)

An Adaptive Curriculum Learning Framework for Unbiased Glaucoma Diagnosis

Shirshak Acharya

RA, NAAMII

2 of 33

Method

Reweighted Loss Function

3 of 33

Student (Self Attention CNN network)

4 of 33

Spatial Attention

  • Use a CNN to transform the input image into downsampled feature maps
  • In general, bilinear sampling is used in the later stages of a spatial attention network to generate an attention score for the whole image

  • In our case, attention is calculated for the feature map, so no bilinear sampling is needed

5 of 33

Spatial Attention

  • Let Ms(F) be the spatial attention for the image feature map F

  • Spatial attention across a C × H × W feature map:
    • An H-dimensional vector for each of the W columns of the feature map, i.e. an H × W map

6 of 33

Other techniques for Spatial Attention

  • Input tensor (C × H × W) is reduced to (2 × H × W)
    • 1st channel = max pooling across channels
    • 2nd channel = average pooling across channels

  • Then: convolution + batch norm + ReLU (optional)

  • Passed to a sigmoid layer, which gives the importance of each pixel
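
The pooling, convolution and sigmoid steps above can be sketched in NumPy as follows. This is only an illustrative sketch: the 7 × 7 kernel size, the random weights and the helper names (`conv2d_same`, `spatial_attention`) are assumptions, and batch norm and the optional ReLU are left out for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv2d_same(x, kernel):
    """Naive 'same' convolution of a (Cin, H, W) input with one (Cin, k, k) kernel.

    Returns a single (H, W) map. Assumes an odd kernel size k.
    """
    _, h, w = x.shape
    k = kernel.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(xp[:, i:i + k, j:j + k] * kernel)
    return out

def spatial_attention(feature_map, kernel):
    """feature_map: (C, H, W) -> attention map (H, W) with values in (0, 1)."""
    max_pool = feature_map.max(axis=0)       # max pooling across channels
    avg_pool = feature_map.mean(axis=0)      # average pooling across channels
    pooled = np.stack([max_pool, avg_pool])  # (2, H, W)
    return sigmoid(conv2d_same(pooled, kernel))

rng = np.random.default_rng(0)
F = rng.normal(size=(8, 5, 5))               # toy feature map, C=8, H=W=5
kernel = rng.normal(size=(2, 7, 7)) * 0.1    # hypothetical (untrained) weights
Ms = spatial_attention(F, kernel)            # one importance score per pixel
```

Each entry of `Ms` is the importance assigned to one spatial position of the feature map.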

7 of 33

Channel Attention

8 of 33

Squeeze and Excitation Network

  • Global average pooling reduces an H × W × C input to 1 × 1 × C
  • Passed to an MLP whose hidden layer has C/r nodes; the reduction ratio r is a hyperparameter
  • MLP has the same number of output nodes as input channels (C)
  • Sigmoid activation
  • Residual connection
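
A minimal NumPy sketch of the squeeze and excitation steps listed above. The weight shapes, the reduction ratio r = 4 and the function name `se_block` are illustrative assumptions; the residual addition is assumed to be handled by the surrounding network and is not shown.

```python
import numpy as np

def se_block(x, w1, w2):
    """Squeeze-and-excitation on a (C, H, W) feature map.

    w1: (C // r, C) weights of the reduction layer (hypothetical, untrained)
    w2: (C, C // r) weights of the expansion layer (hypothetical, untrained)
    Returns the rescaled feature map and the per-channel scale vector.
    """
    squeezed = x.mean(axis=(1, 2))                 # global average pooling -> (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)        # reduce to C // r nodes, ReLU
    scale = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # back to C nodes, sigmoid
    return x * scale[:, None, None], scale

rng = np.random.default_rng(0)
C, r = 8, 4
x = rng.normal(size=(C, 6, 6))
w1 = rng.normal(size=(C // r, C))
w2 = rng.normal(size=(C, C // r))
rescaled, scale = se_block(x, w1, w2)
```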

9 of 33

Channel Attention

  • Same as the SE network
  • New: max pooling is also used, to preserve edge information of the image
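
One common way to add max pooling to an SE-style block is to pass the average-pooled and max-pooled descriptors through a shared MLP and sum the results before the sigmoid. The sketch below assumes that design; the slide itself does not spell out how the two pooled vectors are merged, and the weights and the name `channel_attention` are hypothetical.

```python
import numpy as np

def channel_attention(x, w1, w2):
    """x: (C, H, W) -> channel weights (C,), using average and max pooling."""
    avg = x.mean(axis=(1, 2))   # average-pooled descriptor, (C,)
    mx = x.max(axis=(1, 2))     # max-pooled descriptor, (C,)

    def mlp(v):
        # Shared two-layer MLP: C -> C // r -> C (assumed structure)
        return w2 @ np.maximum(w1 @ v, 0.0)

    return 1.0 / (1.0 + np.exp(-(mlp(avg) + mlp(mx))))

rng = np.random.default_rng(1)
x = rng.normal(size=(8, 6, 6))
w1 = rng.normal(size=(2, 8))    # reduction ratio r = 4 (assumption)
w2 = rng.normal(size=(8, 2))
Mc = channel_attention(x, w1, w2)
```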

10 of 33

Channel Attention

  • Let Mc(F) be the channel attention for the image feature map F

  • Channel attention across a C × H × W feature map:
    • A C-dimensional vector that weights each channel according to its relevance to glaucoma classification

11 of 33

Student (Self Attention CNN network)

3D attention map across C × H × W

What does this mean?

12 of 33

Student (Self Attention CNN network)

Overall Attention Map

Refined Feature Map, F of input image then becomes
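
The formulas for the overall attention map and the refined feature map are not reproduced in this text. The sketch below shows one plausible way a channel vector and a spatial map can be merged into a 3D (C × H × W) attention map by broadcasting, followed by a residual refinement. Both the sigmoid-of-sum combination and the `F + F * M` refinement are assumptions, not necessarily the formulas used on the slide.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def combine_attention(feature_map, channel_logits, spatial_logits):
    """Assumed combination: M = sigmoid(Mc + Ms), F' = F + F * M.

    feature_map:    (C, H, W)
    channel_logits: (C,)    pre-sigmoid channel attention
    spatial_logits: (H, W)  pre-sigmoid spatial attention
    """
    # Broadcasting (C, 1, 1) + (1, H, W) gives a full (C, H, W) attention map
    attention = sigmoid(channel_logits[:, None, None] + spatial_logits[None, :, :])
    refined = feature_map + feature_map * attention
    return attention, refined

F = np.ones((3, 4, 5))
M, F_refined = combine_attention(F, np.zeros(3), np.zeros((4, 5)))
```

With all logits at zero, every attention value is 0.5, so each feature is scaled by 1.5 in this toy case.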

13 of 33

Example of attention map calculation

14 of 33

Example of attention map calculation

15 of 33

Student (Evidence Identification Algorithm)

3D attention map across C × H × W

  • A relevance matrix E is computed, which gives the importance of every feature

16 of 33

Student (Evidence Identification Algorithm)

  • Suppose label c ∈ {0, 1}; 0 = Normal, 1 = Glaucoma
  • p(c|F) = probability of class c given all features
  • p(c|F\i) = probability of class c given all features except Fi

17 of 33

Student (Evidence Identification Algorithm)

  • The difference p(c|F) − p(c|F\i) shows how the prediction changes for each feature i of the feature map

18 of 33

Student (Evidence Identification Algorithm)

  • The classifier cannot treat a feature Fi as unknown, so we can't simply remove Fi from the features F and compute p(c|F\i) directly

  • So we marginalize Fi out of the joint distribution:

p(Fi | F\i) = probability of observing feature Fi given the other features F\i

p(c | Fi, F\i) = probability of observing label c given feature Fi and F\i
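
For reference, the two annotated terms combine through the law of total probability. Written in LaTeX, with F_{\setminus i} standing for all features except F_i and the sum running over the possible values of F_i:

```latex
p(c \mid F_{\setminus i})
  = \sum_{F_i} p(F_i \mid F_{\setminus i}) \, p(c \mid F_i, F_{\setminus i})
```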

19 of 33

Student (Evidence Identification Algorithm)


20 of 33

Student (Evidence Identification Algorithm)

  • p(Fi | F\i) · p(c | Fi, F\i) = joint probability of observing feature Fi and class c, given the other features F\i

  • Marginalize over Fi to get the probability of observing label c given F\i

21 of 33

Student (Evidence Identification Algorithm)

Assumption made by the authors:

  • The probability of feature Fi occurring is independent of the features at neighboring pixels of the image

Finally, we obtain the overall evidence matrix, equal in size to the input image
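
Under the independence assumption, p(Fi | F\i) can be replaced by p(Fi), so p(c|F\i) can be approximated by averaging the classifier output over values of Fi drawn from its marginal distribution. The sketch below illustrates this on a toy logistic classifier. The classifier, its weights, the sample values and the function name `evidence_map` are all hypothetical, and the plain probability difference is used as the evidence score. Here the evidence map has the size of the feature array passed in.

```python
import numpy as np

def evidence_map(features, predict_proba, marginal_samples):
    """Estimate E[i] = p(c|F) - p(c|F\\i) for every feature i.

    predict_proba:    function mapping a feature array to p(c = 1 | F)
    marginal_samples: values assumed to be drawn from p(Fi), used to
                      marginalize Fi out (independence assumption)
    """
    base = predict_proba(features)
    evidence = np.zeros(features.shape)
    for idx in np.ndindex(features.shape):
        perturbed = features.copy()
        probs = []
        for value in marginal_samples:
            perturbed[idx] = value           # replace Fi by a sampled value
            probs.append(predict_proba(perturbed))
        evidence[idx] = base - np.mean(probs)  # p(c|F) - p(c|F\i)
    return evidence

# Toy classifier: logistic regression with fixed, made-up weights
weights = np.array([[2.0, 0.0], [0.0, -2.0]])

def toy_predict(f):
    return 1.0 / (1.0 + np.exp(-np.sum(weights * f)))

features = np.ones((2, 2))
E = evidence_map(features, toy_predict, marginal_samples=np.array([0.0, 0.5]))
```

In this toy case the feature with a positive weight gets positive evidence, the feature with a negative weight gets negative evidence, and features with zero weight get zero evidence.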

22 of 33

Overall Student Network

23 of 33

Method

Reweighted Loss Function

24 of 33

Dual Curriculum Generation

Curriculum learning [1] :

  • Orders the training data the way humans learn: from simple to complex samples
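
As a minimal illustration of the simple-to-complex ordering, the sketch below sorts samples by a difficulty score. The sample names and scores are made up, and how difficulty is measured is left open here.

```python
def curriculum_order(samples, difficulty):
    """Return samples sorted from easiest (lowest score) to hardest."""
    paired = sorted(zip(difficulty, samples), key=lambda pair: pair[0])
    return [sample for _, sample in paired]

# Hypothetical samples with hypothetical difficulty scores
ordered = curriculum_order(["img_a", "img_b", "img_c"], [0.9, 0.1, 0.5])
```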

25 of 33

Dual Curriculum

26 of 33

Sample Curriculum

The weight α is altered not just by the evidence map but also by the teacher network's prediction

27 of 33

Sample Curriculum

  • Takes into account:
    • Teacher model's estimated probability, only for the positive (disease) label
    • Evidence map's estimated probability, only for the positive (disease) label
  • Note:
    • The evidence map Ei for sample xi is passed to a compact "correct sub-network" to obtain the evidence map's estimated probability

  • Gives a weight value for each sample: less weight if the model classifies it correctly, more weight if it does not
  • Details in the next section
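
To show how a per-sample weight enters the loss, here is a sketch of a weighted binary cross-entropy. The weights `alpha` are simply given as input; the formula that produces them from the teacher and evidence-map probabilities is not reproduced here, and averaging over samples is an assumption.

```python
import numpy as np

def reweighted_bce(probs, labels, alpha, eps=1e-7):
    """Binary cross-entropy with one weight per sample.

    probs:  predicted probability of the positive (glaucoma) class
    labels: ground-truth labels in {0, 1}
    alpha:  per-sample weights (larger = sample counts more)
    """
    probs = np.clip(probs, eps, 1.0 - eps)   # avoid log(0)
    per_sample = -(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))
    return float(np.mean(alpha * per_sample))

probs = np.array([0.9, 0.2])
labels = np.array([1.0, 1.0])                # second sample is misclassified
uniform = reweighted_bce(probs, labels, np.array([1.0, 1.0]))
upweighted = reweighted_bce(probs, labels, np.array([1.0, 3.0]))
```

Raising the weight of the misclassified sample increases its share of the total loss, which is the effect the sample curriculum relies on.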

28 of 33

Sample Curriculum

The weight value (α) will be:

piT and piE denote the model's estimated probability for the class with label y = 1, based on the teacher network and the evidence map respectively

booli = weight factor is only affected by the evidence map if its prediction is wrong

29 of 33

Sample Curriculum

Weight altered by evidence maps

Weight altered by teacher network

30 of 33

Properties of Weight (α)

  • The above formula only applies if booli = 1, i.e. the evidence map doesn't classify the sample correctly (hard samples)

  • As piE gets closer to 0.5 and the sample is misclassified, the weighting factor αi becomes larger and the loss is up-weighted

Weight altered by evidence maps

31 of 33

Properties of Weight (α)

  • The teacher also focuses on hard samples as piT gets closer to 0.5

  • The weighting factor αi becomes larger and the loss is up-weighted

Weight altered by teacher model

32 of 33

Properties of Weight (α) - Summary

33 of 33

Feature Curriculum