1 of 18

Query by Video for Surgical Activities

Members: Gianluca Silva Croso, Felix Yu

Mentors: Tae Soo Kim, Dr. Swaroop Vedula, Dr. Gregory Hager

2 of 18

Project Goal

Design a machine learning pipeline to:

    • Query by video for similar activities in a database
    • Query by video for similar skill level within a specific activity

3 of 18

Relevance and Introduction

Problem:

  • Feedback for surgeons-in-training is often lacking
    • The length of surgeries makes analysis tedious
    • Few experts are available who can give good feedback
  • It is hard to pinpoint specific activities within a surgery for teaching or for finding potential complications

Prior Work:

  • Query-by-example activity detection is available with kinematic data [2]
    • Only for robotic surgeries

4 of 18

Relevance and Introduction

Solution:

  • Analyze videos, which are more accessible, to classify activities and evaluate surgeon skill.
    • This would make it possible to isolate specific activities at specific skill levels for comparison

Impact:

  • Training of novice surgeons could be significantly improved.
  • Experienced surgeons could find where their technique differs from others', and specific surgery phases could be analyzed.

5 of 18

General Background

Real World Example:

  • Capsulorhexis technique during cataract surgery is difficult to perform.

Overarching Project:

  • Multiple other portions of the project:
    • Segmentation of whole surgery video into activity clips.
    • Finding activity clips in database that are similar to the query clip.
    • Encoding surgeon commentary of database videos into features.
    • Constructing new feedback for query video using the features of similar database videos.

6 of 18

Data

Cataract Surgery Data:

  • Whole surgery videos of cataract surgeries.
  • Hand annotations on which frames correspond to which phases in the surgery.
  • The skill level of each video, measured by the surgeon's experience, is provided as well.

Data Preprocessing:

  • Segmented videos based on annotations into activity clips.
  • Divided clips into database clips (training) and query clips (validation).
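As an illustration, the annotation-based segmentation step could be sketched as follows; the frame-label format here is an assumption, not the project's actual annotation schema.

```python
# Sketch of segmenting a whole-surgery video into activity clips using
# frame-level phase annotations (hypothetical annotation format).
def segment_clips(phase_per_frame):
    """Group consecutive frames sharing a phase label into (phase, start, end) clips."""
    clips = []
    start = 0
    for i in range(1, len(phase_per_frame) + 1):
        # Close a clip when the label changes or the video ends.
        if i == len(phase_per_frame) or phase_per_frame[i] != phase_per_frame[start]:
            clips.append((phase_per_frame[start], start, i - 1))
            start = i
    return clips

labels = ["incision"] * 3 + ["capsulorhexis"] * 4 + ["incision"] * 2
print(segment_clips(labels))
# [('incision', 0, 2), ('capsulorhexis', 3, 6), ('incision', 7, 8)]
```

Each resulting clip is then assigned either to the database (training) set or the query (validation) set.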

7 of 18

Technical Summary - Overall approach

8 of 18

Technical Summary - Development steps

  • Implement and train frame-by-frame extractor
    • 3D convolutional neural network implemented in PyTorch
    • Operates on brief segments of the video
    • Trained with triplet loss

Image taken from [6]
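A minimal PyTorch sketch of such a 3D-convolutional extractor trained with triplet loss, in the spirit of [6]; the layer sizes, embedding dimension, and margin are illustrative assumptions, not the project's actual design.

```python
import torch
import torch.nn as nn

class Clip3DEncoder(nn.Module):
    """Toy 3D-conv encoder mapping a short clip to an embedding (sizes are illustrative)."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1),   # (B, 3, T, H, W) -> (B, 16, T, H, W)
            nn.ReLU(),
            nn.MaxPool3d(2),                              # halve T, H, W
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),                      # global pooling over space-time
        )
        self.fc = nn.Linear(32, embed_dim)

    def forward(self, x):
        z = self.fc(self.features(x).flatten(1))
        return nn.functional.normalize(z, dim=1)          # unit-length embeddings

# Triplet loss pulls same-activity clips together and pushes different ones apart.
encoder = Clip3DEncoder()
triplet = nn.TripletMarginLoss(margin=0.2)
anchor = torch.randn(2, 3, 8, 32, 32)    # batch of 2 clips: 8 frames of 32x32 RGB
positive = torch.randn(2, 3, 8, 32, 32)  # same activity as anchor
negative = torch.randn(2, 3, 8, 32, 32)  # different activity
loss = triplet(encoder(anchor), encoder(positive), encoder(negative))
```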

9 of 18

Technical Summary - Development steps

  • Implement and train video descriptor extractor
    • Uses the feature vectors from the previous network to classify the video more accurately
    • Temporal convolutional neural network
    • Trained with triplet loss

Image taken from [4]
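A sketch of a temporal convolutional network that pools a sequence of per-frame features into one video descriptor, in the spirit of [4]; the dimensions and kernel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TemporalConvNet(nn.Module):
    """Toy TCN over per-frame feature vectors (dimensions are illustrative)."""
    def __init__(self, in_dim=128, hidden=64, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel_size=5, padding=2),  # temporal convolution
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),            # pool over time -> one vector per clip
        )
        self.fc = nn.Linear(hidden, out_dim)

    def forward(self, frame_feats):
        # frame_feats: (B, T, in_dim), e.g. the per-frame embeddings from the 3D CNN
        z = self.net(frame_feats.transpose(1, 2)).squeeze(-1)
        return nn.functional.normalize(self.fc(z), dim=1)

tcn = TemporalConvNet()
frame_feats = torch.randn(4, 100, 128)   # 4 clips, 100 frames each
video_desc = tcn(frame_feats)            # one descriptor per clip: (4, 128)
```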

10 of 18

Technical Summary - Development steps

  • Create and test similarity metric
    • Either learned by the network or selected from several candidates (e.g., Euclidean distance or other vector norms)
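The hand-crafted candidates could be compared along these lines; the function names and toy descriptors below are illustrative, not part of the project.

```python
import math

def euclidean_sim(a, b):
    """Negative Euclidean distance: smaller distance -> higher similarity."""
    return -math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def l1_sim(a, b):
    """Negative L1 distance, an alternative vector norm."""
    return -sum(abs(x - y) for x, y in zip(a, b))

def cosine_sim(a, b):
    """Cosine of the angle between descriptors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

query = [1.0, 0.0, 0.0]
same_activity = [0.9, 0.1, 0.0]   # toy descriptor of a same-activity clip
different = [0.0, 0.0, 1.0]       # toy descriptor of a different activity

# A usable metric should rank the same-activity clip above the different one.
assert cosine_sim(query, same_activity) > cosine_sim(query, different)
assert euclidean_sim(query, same_activity) > euclidean_sim(query, different)
```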

11 of 18

Deliverables

Minimum:

  • Create a working pipeline to generate video descriptors with the components described above.
  • Develop a similarity metric that can discriminate between videos of the same and of different activities.
  • Validate our model by analyzing similarity scores between clips in our dataset.

Expected:

  • Adapt the above model to instead discriminate between videos of the same and of different skill levels, and validate the results.

Maximum:

  • Rank a query clip’s skill level relative to the database’s videos of the same activity.

12 of 18

Assigned Responsibilities

Both members will contribute to all parts, but with varying amounts of contribution based on expertise.

  • Implementation of network architecture
  • Video data pre-processing
  • Implementation of loss function
  • Training data augmentation
  • Implementation of training procedure
  • Creating similarity metric and result analysis

(Chart: relative contributions of Felix and Gianluca to each task above.)

13 of 18

Key dates and Milestones

14 of 18

Key dates and Milestones

  • 02/12 - environment setup complete
  • 02/16 - sufficient familiarity with background readings and libraries
  • 02/23 - data pre-processing and training dataset prepared
  • 03/16 - 3D convolutional neural network implemented and trained
    • If accuracy is insufficient, discuss potential changes with mentors
  • 03/30 - Temporal neural network implemented and trained
    • If accuracy is insufficient, discuss potential changes with mentors
  • 04/06 - Define similarity metric, analyze data for pipeline validation
  • 04/27 - Model is modified for skill level prediction
  • Optional (05/10) - Discuss methods and work on ranking a query clip within an existing database of same activities.

15 of 18

Dependencies and solutions

Dependency → Solution

  • MARCC cluster access (GPU processing) → access will be obtained under Dr. Hager's group
  • PyTorch and other Python libraries → all open source and available on UNIX
  • Training dataset (surgical videos with activity and skill annotations) → provided by the Cataract Project group

16 of 18

Management Plan

  • Data storage & processing on MARCC cluster
  • Codebase on private BitBucket git repository
  • Weekly meetings with Dr. Vedula and/or Tae Soo Kim, and attendance at the Cataract Project group's weekly meetings
  • Additional meetings with Dr. Hager and Dr. Taylor scheduled as needed
  • We’ll also meet every weekend to discuss progress and work on the project

17 of 18

Reading List / Bibliography

  • [1] Chopra, S., R. Hadsell, and Y. Lecun. "Learning a Similarity Metric Discriminatively, with Application to Face Verification." 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR05), 2005. doi:10.1109/cvpr.2005.202.
  • [2] Gao, Yixin, S. Swaroop Vedula, Gyusung I. Lee, Mija R. Lee, Sanjeev Khudanpur, and Gregory D. Hager. "Query-by-example surgical activity detection." International Journal of Computer Assisted Radiology and Surgery 11, no. 6 (April 12, 2016): 987-96. doi:10.1007/s11548-016-1386-3.
  • [3] He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Deep Residual Learning for Image Recognition." 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. doi:10.1109/cvpr.2016.90
  • [4] Lea, Colin, Michael D. Flynn, Rene Vidal, Austin Reiter, and Gregory D. Hager. "Temporal Convolutional Networks for Action Segmentation and Detection." 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. doi:10.1109/cvpr.2017.113.

18 of 18

Reading List / Bibliography

  • [5] Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A unified embedding for face recognition and clustering." 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. doi:10.1109/cvpr.2015.7298682.
  • [6] Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning Spatiotemporal Features with 3D Convolutional Networks." 2015 IEEE International Conference on Computer Vision (ICCV), 2015. doi:10.1109/iccv.2015.510.