1 of 45

Fully Automated Multi-heartbeat Echocardiography Video Segmentation and Motion Tracking

Yida Chen, Xiaoyan Zhang, Christopher M. Haggerty, Joshua V. Stough

2 of 45

Outline

Echocardiography
Video-based Segmentation
CLAS-FV
R2+1D Convolution
EchoNet-Dynamic Dataset
Test-time EF Estimation
Results


3 of 45

Echocardiography

Non-invasive ultrasound test
Common cardiac imaging modality
Allows derivation of the left-ventricular ejection fraction (EF)
Manual analysis is a labor-intensive process
Deep learning-based segmentation methods have been extensively studied
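
For reference, the ejection fraction is the share of the end-diastolic volume (EDV) that has been ejected by end-systole (ESV). A minimal Python sketch of that standard definition follows; the function name and the example volumes are illustrative and do not come from the original work.

```python
def ejection_fraction(edv_ml: float, esv_ml: float) -> float:
    """Return the LV ejection fraction in percent: 100 * (EDV - ESV) / EDV."""
    if edv_ml <= 0:
        raise ValueError("EDV must be positive")
    return 100.0 * (edv_ml - esv_ml) / edv_ml


# Made-up volumes, for illustration only.
ef_percent = ejection_fraction(edv_ml=120.0, esv_ml=48.0)
print(f"EF = {ef_percent:.1f}%")
```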


Ouyang et al. Video-based AI for beat-to-beat assessment of cardiac function, https://doi.org/10.1038/s41586-020-2145-8

4 of 45

Sparse Annotation

Most deep learning methods rely on frame-based analysis
Only ED/ES frames have manual annotations
No manual label for other frames
CLAS provides a solution for generating pseudo labels on intermediate frames

ED Frame

ES Frame


LeClerc et al., Deep Learning for Segmentation Using an Open Large-Scale Dataset in 2D Echocardiography, https://doi.org/10.1109/TMI.2019.2900516

Wei et al., Temporal-consistent Segmentation of Echocardiography with Co-learning from Appearance and Shape, https://doi.org/10.1007/978-3-030-59713-9_60

5 of 45

CLAS Video Segmentation

Takes a 10-frame clip sampled from an ED-ES half-heartbeat as input
A shared 3D-UNet is used for both motion tracking and segmentation
Outputs frame-level segmentation and pixel-wise bi-directional motion estimation


Wei et al., Temporal-consistent Segmentation of Echocardiography with Co-learning from Appearance and Shape, https://doi.org/10.1007/978-3-030-59713-9_60

6 of 45

Pseudo Labels

Use the motion estimation to spatially transform the manual labels.
On an ED-ES echo clip:

Forward transform ED label
Backward transform ES label


Forward Deformed Pseudo Labels

Deformed with forward motions from ground truth ED label
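
The idea of deforming a manual label with estimated motion can be sketched in a few lines of pure Python. This is only an illustration under stated assumptions, not the CLAS implementation: the flow convention (each output pixel samples the input at its own position plus the displacement), the nearest-neighbour sampling, and the names `warp_label` and `propagate` are all hypothetical.

```python
def warp_label(label, flow):
    """Warp a binary mask with a dense displacement field (nearest-neighbour).

    Assumed convention: out[y][x] = label[y + dy][x + dx], (dy, dx) = flow[y][x].
    Samples that fall outside the image are treated as background.
    """
    h, w = len(label), len(label[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dy, dx = flow[y][x]
            sy, sx = round(y + dy), round(x + dx)
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = label[sy][sx]
    return out


def propagate(label, flows):
    """Chain frame-to-frame flows to get a pseudo label for every later frame."""
    labels = [label]
    for flow in flows:
        labels.append(warp_label(labels[-1], flow))
    return labels


# Toy example: a one-pixel "ventricle" that moves one pixel right per frame.
ed_label = [[0, 0, 0, 0],
            [0, 1, 0, 0],
            [0, 0, 0, 0]]
# Each output pixel samples its left neighbour, so the mask shifts right.
shift_right = [[(0, -1)] * 4 for _ in range(3)]
pseudo = propagate(ed_label, [shift_right, shift_right])
for frame in pseudo:
    print(frame[1])
```

The backward pass from the ES label would reuse the same routine with the backward motion fields, applied in reverse frame order.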

7 of 45

Advantages & Limitations

Temporally consistent segmentation
Better EF estimation
Confines analysis to systole videos
Requires humans or other frameworks to identify ED-ES half-heartbeats
Limited applicability


CLAS segmented ED-ES video clip

Chen et al. Assessing the Generalizability of Temporally-Coherent Echocardiography Video Segmentation, https://doi.org/10.1117/12.2580874

9 of 45

CLAS-FV Full Video Segmentation

Extends the CLAS framework to full video segmentation on multi-heartbeat echocardiograms
Joint video segmentation and motion tracking network


10 of 45

CLAS-FV Full Video Segmentation

Trained on clips with one or more cardiac cycles
Leverages efficient R2+1D spatiotemporal convolutions for feature extraction


11 of 45

R2+1D Convolution

Decomposes a 3D convolution into a spatial 2D convolution followed by a 1D temporal convolution
Used as the feature encoder for EF regression in Ouyang et al.’s work
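
To make the decomposition concrete, the sketch below counts weights for a full t x d x d 3D convolution and for the factorized version (a 1 x d x d spatial convolution into M intermediate channels, then a t x 1 x 1 temporal convolution). The choice of M follows the formula given by Tran et al. for matching the 3D parameter budget. The function names and the 64-channel example are illustrative assumptions, and biases are ignored.

```python
def conv3d_params(n_in, n_out, t=3, d=3):
    """Weights of a full t x d x d 3D convolution (bias ignored)."""
    return n_in * n_out * t * d * d


def r2plus1d_mid_channels(n_in, n_out, t=3, d=3):
    """Intermediate channel count M that keeps the budget close to the 3D conv."""
    return (t * d * d * n_in * n_out) // (d * d * n_in + t * n_out)


def r2plus1d_params(n_in, n_out, t=3, d=3):
    """Weights of the spatial conv (n_in -> M) plus the temporal conv (M -> n_out)."""
    m = r2plus1d_mid_channels(n_in, n_out, t, d)
    spatial = n_in * m * d * d
    temporal = m * n_out * t
    return spatial + temporal


# Example layer with 64 input and 64 output channels.
print("3D conv weights:   ", conv3d_params(64, 64))
print("intermediate M:    ", r2plus1d_mid_channels(64, 64))
print("R2+1D conv weights:", r2plus1d_params(64, 64))
```

With M chosen this way the two variants have roughly the same number of weights, while the factorized block gains an extra nonlinearity between the spatial and temporal steps.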

3D conv vs. R2+1D conv (t = kernel size on temporal dimension, d = kernel size on spatial dimension)

Tran et al. A Closer Look at Spatiotemporal Convolutions for Action Recognition, https://doi.org/10.1109/CVPR.2018.00675

12 of 45

Compare with 3D-UNet

10-fold CV on CAMUS (450 patients)
Compare R2+1D ResNet (~32M params) with 3D-UNet (~19M) and deeper 3D-UNet (~34M)
Under the same experimental setup


Cicek et al. 3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation, https://doi.org/10.1007/978-3-319-46723-8_49

16 of 45

10-Fold CV Results

CLAS with R2+1D ResNet achieves the highest Dice scores and accurate EDV, ESV, and EF estimation
Mean Absolute Error:

EDV: 8.2 ml vs. 8.7 ml
ESV: 5.6 ml vs. 6.2 ml
EF: 4.1% vs. 4.5%


17 of 45

Full Video Segmentation

For full video segmentation, we use the R2+1D ResNet as the feature extractor and train the network on the EchoNet-Dynamic dataset


18 of 45

EchoNet-Dynamic Dataset

10,030 echocardiograms
AP4 view
Multi-heartbeat videos
LV_endo annotated in ED/ES


Ouyang et al. Video-based AI for beat-to-beat assessment of cardiac function, https://doi.org/10.1038/s41586-020-2145-8

19 of 45

Training Setup

We train the network on 32-frame video clips. Each clip starts at a random frame of the echocardiogram but covers the clinically identified ED-ES pair, which carries the manual annotations.


21 of 45

Random 32-Frame Clip

For the same echocardiogram, we may find multiple 32-frame clips that cover ED-ES but have different starting points.


22 of 45

Data Augmentation

As data augmentation, one of the possible clips is used for training in each epoch.
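
A minimal sketch of this clip selection, assuming 0-based frame indices and a uniform random choice among all valid starting points. The function names and the example indices are hypothetical, and the original training code may pick clips differently.

```python
import random


def valid_clip_starts(n_frames, ed_idx, es_idx, clip_len=32):
    """All start indices whose clip_len-frame window contains both annotated frames."""
    first, last = min(ed_idx, es_idx), max(ed_idx, es_idx)
    lo = max(0, last - clip_len + 1)      # window must still reach the later frame
    hi = min(first, n_frames - clip_len)  # window must include the earlier frame and fit
    return list(range(lo, hi + 1))


def sample_clip_start(n_frames, ed_idx, es_idx, clip_len=32, rng=random):
    """Draw one valid start, e.g. once per epoch, as a form of augmentation."""
    starts = valid_clip_starts(n_frames, ed_idx, es_idx, clip_len)
    if not starts:
        raise ValueError("no clip of this length covers both annotated frames")
    return rng.choice(starts)


# Illustrative video: 100 frames, ED annotated at frame 40, ES at frame 55.
starts = valid_clip_starts(n_frames=100, ed_idx=40, es_idx=55)
start = sample_clip_start(n_frames=100, ed_idx=40, es_idx=55)
print(f"{len(starts)} valid starts, sampled clip = [{start}, {start + 32})")
```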


23 of 45

Extending from CLAS

The network outputs frame-level segmentation and bi-directional motion estimation.
Adapting CLAS’s method, we transform the true ED/ES labels fully backward and forward to generate pseudo labels for all other frames.


Forward Deformed Pseudo Labels

Propagated from the true ED label to the end of the clip

25 of 45

Test-time Full Video Segmentation

Divide multi-heartbeat video into 32-frame non-overlapping consecutive clips


27 of 45

Test-time Full Video Segmentation

Analysis is discontinuous at the boundaries between clips
Shift the video forward by 1 frame, repeated 4 times
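
The clip partition and its shifted variants can be sketched as index windows. This is an illustration under assumptions: windows are half-open [start, end) ranges, the unshifted video is kept alongside the 4 shifted versions, and trailing frames that do not fill a whole clip are simply left out, since the slides do not say how the remainder is handled. The function names are hypothetical.

```python
def clip_windows(n_frames, clip_len=32, shift=0):
    """Non-overlapping [start, end) windows of clip_len frames, starting at `shift`.

    Assumption: trailing frames that do not fill a whole clip are left out.
    """
    last_start = n_frames - clip_len
    return [(s, s + clip_len) for s in range(shift, last_start + 1, clip_len)]


def shifted_partitions(n_frames, clip_len=32, n_shifts=4):
    """Window lists for the unshifted video plus n_shifts one-frame shifts."""
    return {k: clip_windows(n_frames, clip_len, shift=k)
            for k in range(n_shifts + 1)}


# Illustrative 100-frame video.
parts = shifted_partitions(n_frames=100)
for shift, windows in parts.items():
    print(shift, windows)
```

A clip boundary in one partition (for example frame 32 in the unshifted video) falls inside a clip in every shifted partition, which is what lets the later fusion step smooth over the boundaries.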


28 of 45

Label Fusion

Redo the segmentation on each shifted video
Apply SIMPLE label fusion to merge the multiple segmentations
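
The sketch below is a simplified fusion loop in the spirit of SIMPLE, not the reference algorithm of Langerak et al. and not the authors' code. It fuses candidate masks by weighted vote, scores each candidate by its Dice overlap with the current fusion, drops candidates scoring below mean minus alpha times the standard deviation, and repeats until nothing changes. The threshold rule, `alpha`, the flattened-list mask format, and all names are assumptions.

```python
from statistics import mean, pstdev


def dice(a, b):
    """Dice overlap of two flattened binary masks."""
    inter = sum(x * y for x, y in zip(a, b))
    total = sum(a) + sum(b)
    return 1.0 if total == 0 else 2.0 * inter / total


def weighted_vote(masks, weights):
    """Pixel-wise weighted majority vote over flattened binary masks."""
    half = sum(weights) / 2.0
    return [int(sum(w * m[i] for m, w in zip(masks, weights)) > half)
            for i in range(len(masks[0]))]


def simple_like_fusion(masks, alpha=1.0, max_iter=10):
    """Iteratively fuse, score candidates against the fusion, drop poor ones."""
    kept = list(range(len(masks)))
    fused = weighted_vote(masks, [1.0] * len(masks))
    for _ in range(max_iter):
        scores = {i: dice(masks[i], fused) for i in kept}
        vals = list(scores.values())
        threshold = mean(vals) - alpha * pstdev(vals)
        new_kept = [i for i in kept if scores[i] >= threshold]
        new_fused = weighted_vote([masks[i] for i in new_kept],
                                  [scores[i] for i in new_kept])
        if new_kept == kept and new_fused == fused:
            break
        kept, fused = new_kept, new_fused
    return fused, kept


# Toy example: three agreeing masks, one near miss, one outlier.
candidates = [[0, 1, 1, 0], [0, 1, 1, 0], [0, 1, 1, 0],
              [0, 1, 1, 1], [1, 0, 0, 1]]
fused, kept = simple_like_fusion(candidates)
print(fused, kept)
```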


Langerak et al. Label Fusion in Atlas-Based Segmentation Using a Selective and Iterative Method for Performance Level Estimation (SIMPLE) https://doi.org/10.1109/TMI.2010.2057442

30 of 45

Improved Smoothness


31 of 45

Multi-heartbeat Echocardiogram

When inferring EF, we identify all ED-ES half-heartbeats using the sizes of the fused LV_endo segmentation.
The reported EF is the average of the EFs derived from these half-heartbeats.
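
One way to picture this step: treat the per-frame LV size as a curve, take local maxima as ED candidates and local minima as ES candidates, pair each ED with the next ES, and average the per-beat EFs. The sketch below is an assumption-laden illustration. The windowed extremum criterion, the `window` value, the use of per-frame volumes, and the function names are all hypothetical, and the actual detection rule in the original work may differ.

```python
def find_half_heartbeats(sizes, window=5):
    """Return (ed, es) index pairs: a local maximum followed by the next local minimum.

    A frame is a local extremum if it is the largest (or smallest) value within
    +/- window frames. This criterion is an assumption made for illustration.
    """
    n = len(sizes)

    def is_extremum(i, pick):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        return sizes[i] == pick(sizes[lo:hi])

    peaks = [i for i in range(n) if is_extremum(i, max)]
    troughs = [i for i in range(n) if is_extremum(i, min)]
    pairs = []
    for ed in peaks:
        later = [es for es in troughs if es > ed]
        if later and (not pairs or later[0] != pairs[-1][1]):
            pairs.append((ed, later[0]))
    return pairs


def mean_ef(volumes, pairs):
    """Average the per-beat EF (in percent) over all detected ED-ES pairs."""
    if not pairs:
        raise ValueError("no ED-ES half-heartbeat found")
    efs = [100.0 * (volumes[ed] - volumes[es]) / volumes[ed] for ed, es in pairs]
    return sum(efs) / len(efs)


# Made-up LV volume curve (ml) covering two beats.
volumes = [100, 90, 70, 50, 60, 80, 100, 85, 65, 50, 70, 90]
pairs = find_half_heartbeats(volumes, window=2)
print(pairs, mean_ef(volumes, pairs))
```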


32 of 45

Results

Trained and tested on EchoNet-Dynamic
Use the original train/validation/test split provided by Ouyang et al.

7332 Train
1258 Valid
1276 Test


Ouyang et al. Video-based AI for beat-to-beat assessment of cardiac function, https://doi.org/10.1038/s41586-020-2145-8

34 of 45

Results

12 test patients have the clinical ES frame before the clinical ED frame (no clinical ED-ES clip)
CLAS fails to report a Dice score on these 12 test patients


36 of 45

Results

12 test patients have the clinical ES frame before the clinical ED frame (no clinical ED-ES clip)
Over the remaining 1264 test patients
Dice Similarity (LV ED/ES):

CLAS-FV: 0.935/0.907
CLAS: 0.930/0.903

Paired Wilcoxon signed-rank test:

ED: p = 1.52e-26
ES: p = 8.42e-05


38 of 45

Ejection Fraction

The EF is derived from the 4CH segmentation with Simpson’s monoplane method
The framework fails to identify an ED-ES half-heartbeat in 7 of the 1276 test patients when inferring EF
Over the remaining 1269 patients
Mean Absolute Error (± std.):

CLAS-FV: 5.25% (± 4.45)
CLAS: 5.82% (± 5.08)
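
Simpson's monoplane (single-plane method of disks) approximates the LV as a stack of n circular disks along the long axis, so V = (pi / 4) * sum(d_i^2) * L / n. A minimal sketch, assuming the disk diameters and long-axis length have already been measured from the segmentation in cm. The function name and the example geometry are illustrative, not from the original work.

```python
import math


def simpson_monoplane_volume(diameters_cm, long_axis_cm):
    """Single-plane method of disks: V = (pi / 4) * sum(d_i^2) * L / n.

    With diameters and long-axis length in cm, the result is in ml (cm^3).
    """
    n = len(diameters_cm)
    if n == 0 or long_axis_cm <= 0:
        raise ValueError("need at least one disk and a positive long-axis length")
    return math.pi / 4.0 * sum(d * d for d in diameters_cm) * long_axis_cm / n


# Sanity check on a cylinder: 20 disks of diameter 4 cm along an 8 cm axis.
volume_ml = simpson_monoplane_volume([4.0] * 20, long_axis_cm=8.0)
print(f"{volume_ml:.2f} ml")
```

For a cylinder the formula reduces to pi * r^2 * L, which makes the example easy to check by hand.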


39 of 45

Ejection Fraction

On the remaining 1269 test patients:
Bias ± 1.96 std.: -2.1% ± 12.9
Cross-correlation: 0.85
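
Bias and limits of agreement of this kind are usually read off a Bland-Altman analysis of the paired differences. A minimal sketch follows, assuming differences are taken as predicted minus reference and that the sample standard deviation is used. The function name and the EF values are made up for illustration and are not the study data.

```python
from statistics import mean, stdev


def bland_altman(predicted, reference):
    """Return (bias, lower, upper) with limits at bias +/- 1.96 * std of differences."""
    diffs = [p - r for p, r in zip(predicted, reference)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)  # sample standard deviation (an assumption)
    return bias, bias - spread, bias + spread


# Made-up EF values in percent, for illustration only.
auto_ef = [50.0, 60.0, 55.0, 40.0]
clinical_ef = [52.0, 61.0, 59.0, 41.0]
bias, lower, upper = bland_altman(auto_ef, clinical_ef)
print(f"bias = {bias:.2f}, limits of agreement = [{lower:.2f}, {upper:.2f}]")
```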

40 of 45

Conclusion

Fully automated framework
Achieves comparable segmentation accuracy and EF estimation on a large clinical dataset
Provides dense frame-level segmentation for the full cardiac cycle
Motion outputs could be useful in strain analysis of the heart muscle


41 of 45

Thank You

Xiaoyan Zhang, PhD
Christopher M. Haggerty, PhD
Joshua V. Stough, PhD

Ciffolillo Healthcare Technology Inventors Program

CAMUS: LeClerc et al., IEEE TMI 2019, https://doi.org/10.1109/TMI.2019.2900516

EchoNet-Dynamic: Ouyang et al., Nature 2020, https://doi.org/10.1038/s41586-020-2145-8

42 of 45

Loss Structure

(SGA) Segmentation on appearance level


True ED

True ES

Auto ED

Auto ES

43 of 45

Loss Structure

(SGA) Segmentation on appearance level
(OTA) Object tracking on appearance level


Original Echo Frame

Forward Deformed

Backward Deformed

44 of 45

Loss Structure

(SGS) Segmentation on shape level


Auto Segmentation

Forward Deformation

Backward Deformation

45 of 45

Loss Structure

(SGS) Segmentation on shape level
(OTS) Object tracking on shape level


True ED

True ES

Deformed ED

Deformed ES