1 of 45

Fully Automated Multi-heartbeat Echocardiography Video Segmentation and Motion Tracking

Yida Chen, Xiaoyan Zhang, Christopher M. Haggerty, Joshua V. Stough

2 of 45

Outline

  • Echocardiography
  • Video-based Segmentation
  • CLAS-FV
  • R2+1D Convolution
  • EchoNet-Dynamic Dataset
  • Test-time EF Estimation
  • Results

3 of 45

Echocardiography

  • Non-invasive ultrasonic test
  • Common imaging modality
  • Allows derivation of the left-ventricular ejection fraction (EF)
  • Labor-intensive process
  • Deep learning-based segmentation methods have been extensively studied

Ouyang et al. Video-based AI for beat-to-beat assessment of cardiac function, https://doi.org/10.1038/s41586-020-2145-8
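
For reference, EF is conventionally defined as the fraction of the end-diastolic volume (EDV) ejected by end-systole (ESV). A minimal sketch of that definition follows; the function name and the example volumes are illustrative assumptions, not values from the slides.

```python
def ejection_fraction(edv_ml: float, esv_ml: float) -> float:
    """Left-ventricular ejection fraction in percent: 100 * (EDV - ESV) / EDV."""
    if edv_ml <= 0:
        raise ValueError("EDV must be positive")
    return 100.0 * (edv_ml - esv_ml) / edv_ml


# Illustrative volumes only, not taken from any dataset.
ef_example = ejection_fraction(edv_ml=100.0, esv_ml=40.0)
```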

4 of 45

Sparse Annotation

  • Most deep learning methods rely on frame-based analysis
  • Only ED/ES frames have manual annotations
  • No manual label for other frames
  • CLAS provides a solution for generating pseudo labels on intermediate frames

ED Frame

ES Frame

LeClerc et al., Deep Learning for Segmentation Using an Open Large-Scale Dataset in 2D Echocardiography, https://doi.org/10.1109/TMI.2019.2900516

Wei et al., Temporal-consistent Segmentation of Echocardiography with Co-learning from Appearance and Shape, https://doi.org/10.1007/978-3-030-59713-9_60

5 of 45

CLAS Video Segmentation

  • Takes a sampled 10-frame ED-ES half-heartbeat clip as input
  • A shared 3D-UNet is used for motion tracking and segmentation
  • Outputs frame-level segmentation and pixel-wise bi-directional motion estimation

Wei et al., Temporal-consistent Segmentation of Echocardiography with Co-learning from Appearance and Shape, https://doi.org/10.1007/978-3-030-59713-9_60

6 of 45

Pseudo Labels

  • Use the motion estimates to spatially transform the manual labels.
  • On an ED-ES echo clip:
    • Forward transform the ED label
    • Backward transform the ES label

Forward Deformed Pseudo Labels

Deformed with forward motions from ground truth ED label
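
The label propagation on this slide can be sketched as repeated warping of a binary mask by a per-pixel displacement field. The sketch below is a simplification under stated assumptions: it uses nearest-neighbour sampling, and it assumes the convention that each output pixel reads the label at its own position plus the displacement. The function names and the toy 4x4 mask are hypothetical; the interpolation and flow convention used in CLAS are not given in the slides.

```python
def warp_label(label, flow_y, flow_x):
    """Warp a binary mask with a displacement field (nearest-neighbour sampling).

    Assumed convention: warped[y][x] = label[y + flow_y[y][x]][x + flow_x[y][x]];
    samples that fall outside the image are treated as background (0).
    """
    h, w = len(label), len(label[0])
    warped = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sy = round(y + flow_y[y][x])
            sx = round(x + flow_x[y][x])
            if 0 <= sy < h and 0 <= sx < w:
                warped[y][x] = label[sy][sx]
    return warped


def propagate(label, flows):
    """Chain per-frame warps to obtain one pseudo label per subsequent frame."""
    pseudo_labels, current = [], label
    for flow_y, flow_x in flows:
        current = warp_label(current, flow_y, flow_x)
        pseudo_labels.append(current)
    return pseudo_labels


# Toy example: a 2x2 blob and a constant one-pixel horizontal displacement.
ed_label = [[0, 0, 0, 0],
            [0, 1, 1, 0],
            [0, 1, 1, 0],
            [0, 0, 0, 0]]
zero = [[0] * 4 for _ in range(4)]
shift = [[1] * 4 for _ in range(4)]
pseudo = propagate(ed_label, [(zero, shift), (zero, shift)])
```

With this convention the blob moves one pixel to the left per step, and pixels sampled from outside the image become background.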

7 of 45

Advantages & Limitations

  • Advantages:
    • Temporally consistent segmentation
    • Better EF estimation
  • Limitations:
    • Analysis is confined to systolic (ED-ES) clips
    • Requires humans or other frameworks to identify ED-ES half-heartbeats
    • Applicability is limited as a result

CLAS segmented ED-ES video clip

Chen et al. Assessing the Generalizability of Temporally-Coherent Echocardiography Video Segmentation, https://doi.org/10.1117/12.2580874

9 of 45

CLAS-FV Full Video Segmentation

  • Extend the CLAS framework to full video segmentation of multi-heartbeat echocardiograms
  • Joint video segmentation and motion tracking network

10 of 45

CLAS-FV Full Video Segmentation

  • Trained on clips with one or more cardiac cycles
  • Leverage the efficient R2+1D spatiotemporal convolutions for feature extraction

11 of 45

R2+1D Convolution

  • Decompose a 3D convolution into a spatial 2D convolution followed by a 1D temporal convolution
  • Used as feature encoder for EF regression in Ouyang et al.’s work

t = kernel size on temporal dimension

d = kernel size on spatial dimension

3D conv

R2+1D conv

Tran et al. A Closer Look at Spatiotemporal Convolutions for Action Recognition, https://doi.org/10.1109/CVPR.2018.00675
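
To make the decomposition concrete, the sketch below counts weights for a t x d x d 3D convolution and for its R2+1D factorization, using the intermediate channel count proposed by Tran et al. so that both have roughly the same number of parameters. Biases are ignored, the function names are hypothetical, and whether CLAS-FV uses exactly this intermediate width is an assumption.

```python
def conv3d_params(n_in, n_out, t, d):
    """Weights in a full t x d x d 3D convolution (biases ignored)."""
    return n_in * n_out * t * d * d


def r2plus1d_mid_channels(n_in, n_out, t, d):
    """Intermediate channels chosen to match the 3D parameter count (Tran et al.)."""
    return (t * d * d * n_in * n_out) // (d * d * n_in + t * n_out)


def r2plus1d_params(n_in, n_out, t, d):
    """Weights in a 1 x d x d spatial conv followed by a t x 1 x 1 temporal conv."""
    mid = r2plus1d_mid_channels(n_in, n_out, t, d)
    spatial = n_in * mid * d * d
    temporal = mid * n_out * t
    return spatial + temporal


# Example layer: 64 -> 64 channels with t = d = 3.
full_3d = conv3d_params(64, 64, t=3, d=3)
factored = r2plus1d_params(64, 64, t=3, d=3)
```

For this layer the two counts coincide, so the factorization adds an extra nonlinearity between the spatial and temporal steps without increasing the parameter budget.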

12 of 45

Compare with 3D-UNet

  • 10-Fold CV on CAMUS
  • 450 patients
  • Compare R2+1D ResNet (~32M params) with:
    • 3D-UNet (~19M)
    • Deeper 3D-UNet (~34M)
  • Under same experimental setup

Cicek et al. 3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation, https://doi.org/10.1007/978-3-319-46723-8_49

16 of 45

10-Fold CV Results

  • CLAS with R2+1D ResNet achieves the highest Dice scores and accurate EDV, ESV, and EF estimation
  • Mean Absolute Error
    • EDV: 8.2 ml vs. 8.7 ml
    • ESV: 5.6 ml vs. 6.2 ml
    • EF: 4.1% vs. 4.5%

17 of 45

Full Video Segmentation

  • For full video segmentation, we use the R2+1D ResNet as the feature extractor and train the network on the EchoNet-Dynamic dataset

18 of 45

EchoNet-Dynamic Dataset

  • 10,030 Echocardiograms
  • AP4 View
  • Multi-heartbeat videos
  • LVendo annotated at ED/ES

Ouyang et al. Video-based AI for beat-to-beat assessment of cardiac function, https://doi.org/10.1038/s41586-020-2145-8

19 of 45

Training Setup

  • We train the network on 32-frame video clips that start at a random frame of an echocardiogram but cover the clinically identified ED-ES pair, which carries the manual annotations.

21 of 45

Random 32-Frame Clip

  • For the same echocardiogram, we may find multiple 32-frame clips that cover ED-ES but have different starting points.

22 of 45

Data Augmentation

  • As data augmentation, one of the possible clips is used for training in each epoch.
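
One way to realize this sampling is to enumerate every start frame whose 32-frame window contains both annotated frames and draw one per epoch. This is a sketch under assumptions: frame indices are 0-based, the draw is uniform, and videos too short to hold such a window are simply rejected. The function names are hypothetical and the authors' handling of short videos is not stated.

```python
import random

CLIP_LEN = 32  # clip length stated on the slides


def valid_clip_starts(n_frames, ed_idx, es_idx, clip_len=CLIP_LEN):
    """All start frames whose clip [start, start + clip_len) covers both ED and ES."""
    first, last = min(ed_idx, es_idx), max(ed_idx, es_idx)
    lowest = max(0, last - clip_len + 1)      # clip must still reach the later frame
    highest = min(first, n_frames - clip_len)  # clip must start by the earlier frame
    return list(range(lowest, highest + 1))   # empty if no window fits


def sample_clip(n_frames, ed_idx, es_idx, rng=random, clip_len=CLIP_LEN):
    """Draw one valid clip; calling this once per epoch acts as augmentation."""
    starts = valid_clip_starts(n_frames, ed_idx, es_idx, clip_len)
    if not starts:
        raise ValueError("no clip of this length covers both annotated frames")
    start = rng.choice(starts)
    return start, start + clip_len


# Hypothetical video: 100 frames, ED at frame 40, ES at frame 55.
starts = valid_clip_starts(100, 40, 55)
clip = sample_clip(100, 40, 55, rng=random.Random(0))
```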

23 of 45

Extending from CLAS

  • The network outputs frame-level segmentation and bi-directional motion estimation.
  • Adapting CLAS’s method, we transform the true ED/ES labels fully backward and forward to generate pseudo labels for all other frames.

Forward Deformed Pseudo Labels

Starts from the true ED label and runs to the end of the clip

25 of 45

Test-time Full Video Segmentation

  • Divide multi-heartbeat video into 32-frame non-overlapping consecutive clips

27 of 45

Test-time Full Video Segmentation

  • Analysis is discontinuous at the boundaries between clips
  • Shift the video forward by 1 frame, 4 times
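
The effect of shifting can be illustrated by listing the clip boundaries for each offset. In this sketch the original video plus four one-frame shifts gives five partitions, frames before the offset are left out, and the last clip is allowed to be shorter than 32 frames. All three choices are assumptions, since the slides do not say how leading and trailing frames are handled.

```python
def clip_bounds(n_frames, offset=0, clip_len=32):
    """Consecutive non-overlapping (start, end) clips beginning at `offset`."""
    return [(start, min(start + clip_len, n_frames))
            for start in range(offset, n_frames, clip_len)]


def shifted_partitions(n_frames, n_shifts=4, clip_len=32):
    """Clip boundaries for the unshifted video and for each 1-frame shift."""
    return {k: clip_bounds(n_frames, k, clip_len) for k in range(n_shifts + 1)}


# Hypothetical 100-frame video.
partitions = shifted_partitions(100)
```

A frame that sits on a clip boundary in one partition (for example frame 32 with no shift) lies inside a clip in the shifted partitions, so every frame receives at least one segmentation that is not affected by a boundary.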

28 of 45

Label Fusion

  • Redo the segmentation on each shifted video
  • Apply SIMPLE label fusion to merge the multiple segmentations

Langerak et al. Label Fusion in Atlas-Based Segmentation Using a Selective and Iterative Method for Performance Level Estimation (SIMPLE) https://doi.org/10.1109/TMI.2010.2057442
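
The idea behind SIMPLE can be sketched as an iterative loop: fuse the candidates, score each candidate against the fused result, discard the poorly agreeing ones, and fuse again with performance weights. The code below is a simplified illustration on flattened binary masks, not the reference implementation. The threshold of mean minus alpha times standard deviation, the value of alpha, and the toy masks are assumptions made for this example.

```python
from statistics import mean, pstdev


def dice(a, b):
    """Dice overlap of two flattened binary masks."""
    inter = sum(1 for p, q in zip(a, b) if p and q)
    total = sum(a) + sum(b)
    return 1.0 if total == 0 else 2.0 * inter / total


def weighted_vote(masks, weights):
    """Pixel is foreground if the weighted votes exceed half the total weight."""
    half = sum(weights) / 2.0
    return [1 if sum(w for m, w in zip(masks, weights) if m[i]) > half else 0
            for i in range(len(masks[0]))]


def simple_fusion(masks, alpha=1.0, max_iter=10):
    """Simplified SIMPLE-style fusion: select, re-weight, and re-fuse until stable."""
    kept = list(masks)
    fused = weighted_vote(kept, [1.0] * len(kept))
    for _ in range(max_iter):
        scores = [dice(m, fused) for m in kept]
        threshold = mean(scores) - alpha * pstdev(scores)
        selected = [(m, s) for m, s in zip(kept, scores) if s >= threshold]
        new_kept = [m for m, _ in selected]
        new_fused = weighted_vote(new_kept, [s for _, s in selected])
        converged = len(new_kept) == len(kept) and new_fused == fused
        kept, fused = new_kept, new_fused
        if converged:
            break
    return fused


# Toy candidates for one frame (6 pixels); the last one is an outlier.
candidates = [
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 1],
]
fused_mask = simple_fusion(candidates)
```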

30 of 45

Improved Smoothness

31 of 45

Multi-heartbeat Echocardiogram

  • When inferring EF, we identify all ED-ES half-heartbeats from the sizes of the fused LVendo segmentations.
  • The reported EF is the average of the EFs derived from these half-heartbeats.
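
A minimal sketch of this step treats local maxima of the per-frame LV size curve as ED and the following local minima as ES, then averages the per-beat EFs. It assumes a smooth curve and per-frame volumes that are already available. The peak detection actually used by the authors is not described in the slides, and the function names and the toy curve are hypothetical.

```python
def find_half_heartbeats(sizes):
    """Pair each local maximum (ED) with the next local minimum (ES)."""
    pairs, ed = [], None
    for i in range(1, len(sizes) - 1):
        if sizes[i] > sizes[i - 1] and sizes[i] >= sizes[i + 1]:
            ed = i  # candidate end-diastole
        elif sizes[i] < sizes[i - 1] and sizes[i] <= sizes[i + 1] and ed is not None:
            pairs.append((ed, i))  # end-systole closes the half-heartbeat
            ed = None
    return pairs


def mean_ef(volumes, pairs):
    """Average EF (percent) over all detected ED-ES half-heartbeats."""
    if not pairs:
        raise ValueError("no ED-ES half-heartbeat found")
    efs = [100.0 * (volumes[ed] - volumes[es]) / volumes[ed] for ed, es in pairs]
    return sum(efs) / len(efs)


# Toy volume curve (ml) spanning two beats; values are illustrative only.
curve = [90, 100, 80, 50, 40, 60, 95, 110, 85, 55, 44, 70]
beats = find_half_heartbeats(curve)
ef = mean_ef(curve, beats)
```

Real size curves are noisy, so a practical version would likely smooth the curve or enforce a minimum distance between peaks before pairing.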

32 of 45

Results

  • Trained and tested on EchoNet-Dynamic
  • Use the original train/validation/test split provided by Ouyang et al.
    • 7332 Train
    • 1258 Valid
    • 1276 Test

Ouyang et al. Video-based AI for beat-to-beat assessment of cardiac function, https://doi.org/10.1038/s41586-020-2145-8

34 of 45

Results

  • In 12 test patients, the clinical ES frame precedes the clinical ED frame (no clinical ED-ES clip)
  • CLAS could not report a Dice score for these 12 test patients

36 of 45

Results

  • In 12 test patients, the clinical ES frame precedes the clinical ED frame (no clinical ED-ES clip)
  • Over the remaining 1264 test patients
  • Dice Similarity (LV ED/ES):
    • CLAS-FV: 0.935/0.907
    • CLAS: 0.930/0.903
  • Paired Wilcoxon signed-rank test:
    • ED: p = 1.52e-26
    • ES: p = 8.42e-05
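
For readers unfamiliar with the metric, the Dice similarity between a predicted and a manual mask is twice their overlap divided by the sum of their sizes. The sketch below computes it for 2D binary masks given as nested lists. The function name and the toy masks are illustrative, and treating two empty masks as a perfect match is a convention chosen here.

```python
def dice_similarity(pred, truth):
    """Dice = 2 * |pred AND truth| / (|pred| + |truth|) for 2D binary masks."""
    inter = sum(1 for row_p, row_t in zip(pred, truth)
                for p, t in zip(row_p, row_t) if p and t)
    total = sum(map(sum, pred)) + sum(map(sum, truth))
    return 1.0 if total == 0 else 2.0 * inter / total


# Toy 2x2 masks, illustrative only.
pred_mask = [[1, 1],
             [1, 0]]
true_mask = [[1, 1],
             [0, 0]]
score = dice_similarity(pred_mask, true_mask)
```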

38 of 45

Ejection Fraction

  • EF is derived from the 4CH segmentation using Simpson’s monoplane method
  • No ED-ES half-heartbeat could be identified in 7 of the 1276 test patients when inferring EF
  • Over the remaining 1269 patients
  • Mean Absolute Error (± std.):
    • CLAS-FV: 5.25% ± 4.45%
    • CLAS: 5.82% ± 5.08%
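
Simpson's monoplane method approximates the ventricle as a stack of circular disks along the long axis. A minimal sketch of the volume formula follows; it assumes the disk diameters and the long-axis length have already been measured from the segmentation in centimetres, a step whose details are not given in the slides. The function name and the example measurements are hypothetical.

```python
from math import pi


def simpson_monoplane_volume(diameters_cm, long_axis_cm):
    """Volume in ml as a sum of disks: pi * (d / 2)^2 * (L / n) for each diameter d."""
    n = len(diameters_cm)
    if n == 0:
        raise ValueError("at least one disk diameter is required")
    disk_height = long_axis_cm / n
    return sum(pi * (d / 2.0) ** 2 * disk_height for d in diameters_cm)


# Sanity check on a cylinder: 20 disks of 4 cm diameter along an 8 cm axis.
cylinder_ml = simpson_monoplane_volume([4.0] * 20, 8.0)
```

Applying the same function to the ED and ES segmentations gives the EDV and ESV from which EF is computed.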

39 of 45

Ejection Fraction

  • On the remaining 1269 test patients:
  • Bias ± 1.96 std.:
    • -2.1% ± 12.9%
  • Cross-correlation:
    • 0.85
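
The agreement statistics on this slide can be computed as sketched below: the bias is the mean of the differences between predicted and reference EF, and the limits of agreement are 1.96 standard deviations around it. The sketch assumes the sample standard deviation and interprets the reported correlation as a Pearson coefficient; both are assumptions. The EF values are made up for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def bland_altman(pred, ref):
    """Return (bias, half-width of the 95% limits of agreement)."""
    diffs = [p - r for p, r in zip(pred, ref)]
    return mean(diffs), 1.96 * stdev(diffs)


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Made-up EF values in percent, for illustration only.
predicted_ef = [55.0, 60.0, 48.0, 62.0]
reference_ef = [58.0, 61.0, 52.0, 63.0]
bias, half_width = bland_altman(predicted_ef, reference_ef)
r = pearson_r(predicted_ef, reference_ef)
```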

40 of 45

Conclusion

  • Fully automated framework
  • Achieve comparable segmentation accuracy and EF estimation on a large clinical dataset
  • Provide dense frame-level segmentation for full cardiac cycle
  • Motion outputs could be useful in strain analysis of heart muscles

41 of 45

Thank You

  • Xiaoyan Zhang, PhD
  • Christopher M. Haggerty, PhD
  • Joshua V. Stough, PhD

  • Ciffolillo Healthcare Technology Inventors Program

CAMUS

LeClerc et al., IEEE TMI 2019,

https://doi.org/10.1109/TMI.2019.2900516

EchoNet-Dynamic

Ouyang et al., Nature 2020,

https://doi.org/10.1038/s41586-020-2145-8

42 of 45

Loss Structure

  • (SGA) Segmentation on appearance level

True ED

True ES

Auto ED

Auto ES

43 of 45

Loss Structure

  • (SGA) Segmentation on appearance level
  • (OTA) Object tracking on appearance level

Original Echo Frame

Forward Deformed

Backward Deformed

44 of 45

Loss Structure

  • (SGS) Segmentation on shape level

Auto Segmentation

Forward Deformation

Backward Deformation

45 of 45

Loss Structure

  • (SGS) Segmentation on shape level
  • (OTS) Object tracking on shape level

True ED

True ES

Deformed ED

Deformed ES