Sanat Ramesh, Diego Dall'Alba, Cristians Gonzalez, Tong Yu, Pietro Mascagni, Didier Mutter, Jacques Marescaux, Paolo Fiorini, Nicolas Padoy.
Abstract
PURPOSE: Automatic segmentation and classification of surgical activity is crucial for providing advanced support in computer-assisted interventions and autonomous functionalities in robot-assisted surgeries. Prior works have focused on recognizing either coarse activities, such as phases, or fine-grained activities, such as gestures. This work aims at jointly recognizing two complementary levels of granularity directly from videos, namely phases and steps.
Keywords: Surgical workflow analysis; deep learning; endoscopic videos; laparoscopic gastric bypass; multi-task learning; temporal modeling
Year: 2021 PMID: 34013464 PMCID: PMC8260406 DOI: 10.1007/s11548-021-02388-z
Source DB: PubMed Journal: Int J Comput Assist Radiol Surg ISSN: 1861-6410 Impact factor: 2.924
Fig. 2: List of all the phases and steps defined in the dataset with their hierarchical relationship. The surgically critical activities are highlighted in red.
Fig. 1: Sample images from the dataset with phase labels in the top-left and step labels in the top-right corner. The labels can be inferred from Fig. 2.
Fig. 3: Average duration of phases and steps across videos in the dataset.
Fig. 4: Overview of our model setup. The multi-task architecture of the ResNet-50 feature extractor backbone is shown on the left, and the multi-task setup of the TCN temporal model on the right.
Fig. 5: Overview of all the models used for evaluation. Models trained in the single-task setup are shown on the left; models trained in the multi-task setup are shown on the right.
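The shared-backbone, two-head design described for Fig. 4 can be sketched as follows. This is a minimal illustration, not the authors' code: the weight scales, the equal loss weighting, and the phase count are assumptions (only the 44-step count appears in the dataset description; 2048 is the standard ResNet-50 feature size).

```python
import numpy as np

rng = np.random.default_rng(0)

# FEAT_DIM = 2048 is the standard ResNet-50 pooled feature size.
# N_PHASES here is hypothetical; N_STEPS = 44 per the dataset description.
N_PHASES, N_STEPS, FEAT_DIM = 11, 44, 2048

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, label):
    return -np.log(probs[label] + 1e-12)

# Shared backbone feature for one frame (stand-in for a ResNet-50 output).
feat = rng.standard_normal(FEAT_DIM)

# Two task-specific linear heads on top of the shared feature.
W_phase = rng.standard_normal((N_PHASES, FEAT_DIM)) * 0.01
W_step = rng.standard_normal((N_STEPS, FEAT_DIM)) * 0.01

phase_probs = softmax(W_phase @ feat)
step_probs = softmax(W_step @ feat)

# Multi-task loss: sum of the two cross-entropies (equal weighting assumed).
loss = cross_entropy(phase_probs, 3) + cross_entropy(step_probs, 17)
```

The same shared-then-branching pattern applies at the temporal level, where the TCN consumes per-frame features and emits one prediction sequence per task.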
Baseline comparison on the dataset for phase recognition. Accuracy (ACC), precision (PR), recall (RE), and F1-score (F1) (%) are reported as mean ± standard deviation over 4-fold cross-validation.

| Stage | Model | ACC | PR | RE | F1 |
|---|---|---|---|---|---|
| No TCN | ResNet | 82.1 ± 3.3 | 73.9 ± 3.3 | 72.2 ± 3.4 | 72.5 ± 3.6 |
| | MT-ResNet | 81.7 ± 2.7 | 73.1 ± 2.8 | 72.1 ± 2.3 | 72.1 ± 2.6 |
| | ResNetLSTM | | | | |
| | MT-ResNetLSTM | 88.6 ± 2.7 | 81.4 ± 3.9 | 81.1 ± 3.5 | 80.7 ± 3.8 |
| Stage I | TeCNO | 89.8 ± 3.5 | 85.4 ± 4.0 | 82.3 ± 4.5 | 83.0 ± 4.1 |
| | MTMS-TCN | | | | |
| Stage II | TeCNO | 89.9 ± 3.3 | 84.4 ± 4.3 | 83.3 ± 3.9 | 83.5 ± 4.0 |
| | MTMS-TCN | | | | |
Bold numbers denote best performance for each metric
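The ACC/PR/RE/F1 columns above are frame-wise classification metrics. A minimal sketch of how such metrics can be computed from label sequences (our own illustration with a hypothetical helper name, not the paper's evaluation code, which may differ in averaging details):

```python
def frame_metrics(y_true, y_pred):
    """Frame-wise accuracy plus macro-averaged precision/recall/F1
    over the classes present in the ground truth."""
    labels = sorted(set(y_true))
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    prs, res, f1s = [], [], []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        pr = tp / (tp + fp) if tp + fp else 0.0
        re = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * pr * re / (pr + re) if pr + re else 0.0
        prs.append(pr); res.append(re); f1s.append(f1)
    n = len(labels)
    return acc, sum(prs) / n, sum(res) / n, sum(f1s) / n

# Toy example: 4 of 6 frames classified correctly.
acc, pr, re, f1 = frame_metrics([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0])
```

In the tables these per-video (or per-fold) values are then averaged, which is why each entry carries a standard deviation.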
Baseline comparison on the dataset for step recognition. Accuracy (ACC), precision (PR), recall (RE), and F1-score (F1) (%) are reported as mean ± standard deviation over 4-fold cross-validation.

| Stage | Model | ACC | PR | RE | F1 |
|---|---|---|---|---|---|
| No TCN | ResNet | 65.5 ± 2.0 | 45.3 ± 3.0 | 43.2 ± 2.7 | 42.6 ± 2.3 |
| | MT-ResNet | 66.6 ± 2.4 | 46.0 ± 3.1 | 44.7 ± 3.1 | 43.8 ± 2.9 |
| | ResNetLSTM | 71.3 ± 2.3 | 47.8 ± 4.1 | 47.7 ± 2.8 | 45.8 ± 2.7 |
| | MT-ResNetLSTM | | | | |
| Stage I | TeCNO | 75.1 ± 2.4 | 54.7 ± 2.6 | 50.9 ± 2.4 | 49.9 ± 1.8 |
| | MTMS-TCN | | | | |
| Stage II | TeCNO | 74.8 ± 2.5 | 53.2 ± 2.5 | 50.8 ± 3.3 | 49.9 ± 3.7 |
| | MTMS-TCN | | | | |
Bold numbers denote best performance for each metric
Baseline comparison on the dataset for joint phase and step recognition. Accuracy (ACC) (%) is reported as mean ± standard deviation over 4-fold cross-validation.

| Stage | Model | Phase ACC | Step ACC | Phase-Step ACC |
|---|---|---|---|---|
| No TCN | ResNet | 82.1 ± 2.9 | 65.5 ± 1.8 | 54.9 ± 2.6 |
| | MT-ResNet | 81.7 ± 2.3 | 66.6 ± 2.1 | 64.8 ± 2.0 |
| | ResNetLSTM | | 71.3 ± 2.0 | 68.5 ± 2.3 |
| | MT-ResNetLSTM | 88.6 ± 2.3 | | |
| Stage I | TeCNO | 89.8 ± 3.0 | 75.1 ± 2.1 | 72.3 ± 3.0 |
| | MTMS-TCN | | | |
| Stage II | TeCNO | 89.9 ± 2.8 | 74.8 ± 2.2 | 71.9 ± 2.7 |
| | MTMS-TCN | | | |
Bold numbers denote best performance for each metric
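The Phase-Step ACC column reads naturally as a joint metric in which a frame counts as correct only when the predicted phase and the predicted step both match the ground truth; note it is consistently lower than either single-task accuracy. A sketch under that assumption (hypothetical function, not the paper's code):

```python
def joint_accuracy(phases_true, phases_pred, steps_true, steps_pred):
    """Fraction of frames where BOTH the phase and the step
    prediction match the ground truth."""
    correct = sum(
        pt == pp and st == sp
        for pt, pp, st, sp in zip(phases_true, phases_pred,
                                  steps_true, steps_pred)
    )
    return correct / len(phases_true)

# Toy example: 3 of 4 phases and 3 of 4 steps are right,
# but only 2 frames have both right -> joint accuracy 0.5.
ja = joint_accuracy([0, 0, 1, 1], [0, 0, 1, 0],
                    [2, 3, 3, 4], [2, 3, 4, 4])
```

This also explains why the multi-task models close the gap between single-task and joint accuracy: their two heads are trained on a shared representation, making the paired predictions more consistent.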
TeCNO vs MTMS-TCN: precision (PR), recall (RE), and F1-score (F1) (%) for the surgically critical steps, averaged over 4-fold cross-validation.

| ID | TeCNO PR | TeCNO RE | TeCNO F1 | MTMS-TCN PR | MTMS-TCN RE | MTMS-TCN F1 |
|---|---|---|---|---|---|---|
| S4 | 84.2 ± 5.7 | 85.6 ± 4.1 | 88.3 ± 3.9 | | | |
| S5 | 77.4 ± 6.7 | 79.2 ± 6.8 | | | | |
| S6 | 64.7 ± 22.3 | 76.4 ± 15.8 | | | | |
| S7 | 72.1 ± 8.0 | 66.4 ± 9.8 | | | | |
| S8 | 75.6 ± 7.0 | | | | | |
| S16 | 76.4 ± 7.1 | 67.7 ± 4.0 | | | | |
| S18 | 89.8 ± 4.9 | 80.5 ± 3.1 | 83.4 ± 3.6 | | | |
| S25 | 39.4 ± 18.6 | 40.6 ± 16.1 | 47.6 ± 6.6 | | | |
| S30 | 62.3 ± 4.8 | 62.0 ± 13.5 | 57.5 ± 10.3 | | | |
| S32 | 85.4 ± 4.4 | 85.1 ± 5.4 | | | | |
| S39 | 46.2 ± 27.1 | 39.0 ± 22.2 | 42.9 ± 27.2 | | | |
Bold numbers denote best performance per step per metric
Fig. 6: Phase recognition on complete videos in Bypass40 for qualitative assessment. The top row shows the 3 videos on which our model performs best; the bottom row shows the 3 videos with the worst performance.
Fig. 7: Step recognition on complete videos in Bypass40 for qualitative assessment. The figure shows the best (top) and worst (bottom) performance of our model. The 44 distinct steps are mapped onto the same 20-color categorical colormap.