1 of 18

Query by Video for Surgical Activities

Members: Gianluca Silva Croso, Felix Yu

Mentors: Tae Soo Kim, Dr. Swaroop Vedula, Dr. Gregory Hager

2 of 18

Project Goal

Design a machine learning pipeline to:

    • Query by video for similar activities in a database
    • Query by video for similar skill level within a specific activity

3 of 18

Relevance and Introduction

Problem:

  • Feedback for surgeons-in-training is often lacking
    • The length of surgeries makes analysis tedious
    • Few experts are available who can give good feedback
  • It is hard to pinpoint specific activities within a surgery for teaching or for finding potential complications

Prior Work:

  • Query-by-example activity detection is available with kinematic data [2]
    • Only for robotic surgeries

4 of 18

Relevance and Introduction

Solution:

  • Analyze videos, which are more accessible, to classify activities and evaluate surgeon skill.
    • This would make it possible to isolate specific activities at specific skill levels for comparison

Impact:

  • Training of novice surgeons could be significantly improved.
  • Experienced surgeons could find where their technique differs from others', and specific surgery phases could be analyzed.

5 of 18

General Background

Real World Example:

  • Capsulorhexis technique during cataract surgery is difficult to perform.

Overarching Project:

  • Multiple other portions of the project:
    • Segmentation of whole surgery video into activity clips.
    • Finding activity clips in database that are similar to the query clip.
    • Encoding surgeon commentary of database videos into features.
    • Constructing new feedback for query video using the features of similar database videos.

6 of 18

Data

Cataract Surgery Data:

  • Whole surgery videos of cataract surgeries.
  • Hand annotations on which frames correspond to which phases in the surgery.
  • The skill level of each video, measured by the surgeon's experience, is provided as well.

Data Preprocessing:

  • Segmented videos based on annotations into activity clips.
  • Divided clips into database clips (training) and query clips (validation).
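As an illustration, the annotation-based segmentation step could be sketched as follows; the frame-label format here is an assumption, not the project's actual annotation schema.

```python
# Sketch of segmenting a whole-surgery video into activity clips using
# frame-level phase annotations (hypothetical annotation format).
def segment_clips(phase_per_frame):
    """Group consecutive frames sharing a phase label into (phase, start, end) clips."""
    clips = []
    start = 0
    for i in range(1, len(phase_per_frame) + 1):
        # Close a clip when the label changes or the video ends.
        if i == len(phase_per_frame) or phase_per_frame[i] != phase_per_frame[start]:
            clips.append((phase_per_frame[start], start, i - 1))
            start = i
    return clips

labels = ["incision"] * 3 + ["capsulorhexis"] * 4 + ["incision"] * 2
print(segment_clips(labels))
# [('incision', 0, 2), ('capsulorhexis', 3, 6), ('incision', 7, 8)]
```

Each resulting clip is then assigned either to the database (training) set or the query (validation) set.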

7 of 18

Technical Summary - Overall approach

8 of 18

Technical Summary - Development steps

  • Implement and train frame-by-frame extractor
    • 3D convolutional neural network implemented in PyTorch
    • Operates on brief segments of the video
    • Trained with triplet loss

Image taken from [6]
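A minimal PyTorch sketch of such a 3D-convolutional extractor trained with triplet loss, in the spirit of [6]; the layer sizes, embedding dimension, and margin are illustrative assumptions, not the project's actual design.

```python
import torch
import torch.nn as nn

class Clip3DEncoder(nn.Module):
    """Toy 3D-conv encoder mapping a short clip to an embedding (sizes are illustrative)."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1),   # (B, 3, T, H, W) -> (B, 16, T, H, W)
            nn.ReLU(),
            nn.MaxPool3d(2),                              # halve T, H, W
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),                      # global pooling over space-time
        )
        self.fc = nn.Linear(32, embed_dim)

    def forward(self, x):
        z = self.fc(self.features(x).flatten(1))
        return nn.functional.normalize(z, dim=1)          # unit-length embeddings

# Triplet loss pulls same-activity clips together and pushes different ones apart.
encoder = Clip3DEncoder()
triplet = nn.TripletMarginLoss(margin=0.2)
anchor = torch.randn(2, 3, 8, 32, 32)    # batch of 2 clips: 8 frames of 32x32 RGB
positive = torch.randn(2, 3, 8, 32, 32)  # same activity as anchor
negative = torch.randn(2, 3, 8, 32, 32)  # different activity
loss = triplet(encoder(anchor), encoder(positive), encoder(negative))
```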

9 of 18

Technical Summary - Development steps

  • Implement and train video descriptor extractor
    • Uses the feature vectors from the previous network to classify the video more accurately
    • Temporal convolutional neural network
    • Trained with triplet loss

Image taken from [4]
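A sketch of a temporal convolutional network that pools a sequence of per-frame features into one video descriptor, in the spirit of [4]; the dimensions and kernel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TemporalConvNet(nn.Module):
    """Toy TCN over per-frame feature vectors (dimensions are illustrative)."""
    def __init__(self, in_dim=128, hidden=64, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel_size=5, padding=2),  # temporal convolution
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),            # pool over time -> one vector per clip
        )
        self.fc = nn.Linear(hidden, out_dim)

    def forward(self, frame_feats):
        # frame_feats: (B, T, in_dim), e.g. the per-frame embeddings from the 3D CNN
        z = self.net(frame_feats.transpose(1, 2)).squeeze(-1)
        return nn.functional.normalize(self.fc(z), dim=1)

tcn = TemporalConvNet()
frame_feats = torch.randn(4, 100, 128)   # 4 clips, 100 frames each
video_desc = tcn(frame_feats)            # one descriptor per clip: (4, 128)
```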

10 of 18

Technical Summary - Development steps

  • Create and test similarity metric
    • Either learned by the network or selected from several candidates (e.g., Euclidean distance or other vector norms)
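The hand-crafted candidates could be compared along these lines; the function names and toy descriptors below are illustrative, not part of the project.

```python
import math

def euclidean_sim(a, b):
    """Negative Euclidean distance: smaller distance -> higher similarity."""
    return -math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def l1_sim(a, b):
    """Negative L1 distance, an alternative vector norm."""
    return -sum(abs(x - y) for x, y in zip(a, b))

def cosine_sim(a, b):
    """Cosine of the angle between descriptors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

query = [1.0, 0.0, 0.0]
same_activity = [0.9, 0.1, 0.0]   # toy descriptor of a same-activity clip
different = [0.0, 0.0, 1.0]       # toy descriptor of a different activity

# A usable metric should rank the same-activity clip above the different one.
assert cosine_sim(query, same_activity) > cosine_sim(query, different)
assert euclidean_sim(query, same_activity) > euclidean_sim(query, different)
```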

11 of 18

Deliverables

Minimum:

  • Create a working pipeline to generate video descriptors with the components described above.
  • Develop a similarity metric that can discriminate between videos of the same and of different activities.
  • Validate our model by analyzing similarity scores between clips in our dataset.

Expected:

  • Adapt the above model to instead discriminate between videos of the same and of different skill levels, and validate the results.

Maximum:

  • Rank a query clip’s skill level relative to the database’s videos of the same activity.

12 of 18

Assigned Responsibilities

Both members will contribute to all parts, but with varying amounts of contribution based on expertise.

  • Implementation of network architecture
  • Video data pre-processing
  • Implementation of loss function
  • Training data augmentation
  • Implementation of training procedure
  • Creating similarity metric and result analysis

(Chart: relative contributions of Felix and Gianluca to each task above.)

13 of 18

Key dates and Milestones

14 of 18

Key dates and Milestones

  • 02/12 - environment setup complete
  • 02/16 - sufficient familiarity with background readings and libraries
  • 02/23 - data pre-processing and training dataset prepared
  • 03/16 - 3D convolutional neural network implemented and trained
    • If accuracy is insufficient, discuss potential changes with mentors
  • 03/30 - Temporal neural network implemented and trained
    • If accuracy is insufficient, discuss potential changes with mentors
  • 04/06 - Define similarity metric, analyze data for pipeline validation
  • 04/27 - Model is modified for skill level prediction
  • Optional (05/10) - Discuss methods and work on ranking a query clip within an existing database of same activities.

15 of 18

Dependencies and solutions

Dependency → Solution

  • MARCC cluster access (GPU processing) → access will be obtained under Dr. Hager's group
  • PyTorch and other Python libraries → all open source and available on UNIX
  • Training dataset (surgical videos with activity and skill annotations) → provided by the Cataract Project group

16 of 18

Management Plan

  • Data storage & processing on MARCC cluster
  • Codebase on private BitBucket git repository
  • Weekly meetings with Dr. Vedula and/or Tae Soo Kim, and attendance at the Cataract Project group's weekly meetings
  • Additional meetings with Dr. Hager and Dr. Taylor scheduled as needed
  • We’ll also meet every weekend to discuss progress and work on the project

17 of 18

Reading List / Bibliography

  • [1] Chopra, S., R. Hadsell, and Y. Lecun. "Learning a Similarity Metric Discriminatively, with Application to Face Verification." 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR05), 2005. doi:10.1109/cvpr.2005.202.
  • [2] Gao, Yixin, S. Swaroop Vedula, Gyusung I. Lee, Mija R. Lee, Sanjeev Khudanpur, and Gregory D. Hager. "Query-by-example surgical activity detection." International Journal of Computer Assisted Radiology and Surgery 11, no. 6 (April 12, 2016): 987-96. doi:10.1007/s11548-016-1386-3.
  • [3] He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Deep Residual Learning for Image Recognition." 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. doi:10.1109/cvpr.2016.90
  • [4] Lea, Colin, Michael D. Flynn, Rene Vidal, Austin Reiter, and Gregory D. Hager. "Temporal Convolutional Networks for Action Segmentation and Detection." 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. doi:10.1109/cvpr.2017.113.

18 of 18

Reading List / Bibliography

  • [5] Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A unified embedding for face recognition and clustering." 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. doi:10.1109/cvpr.2015.7298682.
  • [6] Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning Spatiotemporal Features with 3D Convolutional Networks." 2015 IEEE International Conference on Computer Vision (ICCV), 2015. doi:10.1109/iccv.2015.510.