Ricardo Sanchez-Matilla, Maria Robu, Maria Grammatikopoulou, Imanol Luengo, Danail Stoyanov.
Abstract
PURPOSE: Surgical workflow estimation techniques aim to divide a surgical video into temporal segments based on predefined surgical actions or objectives, which can be of different granularity, such as steps or phases. Potential applications range from real-time intra-operative feedback to automatic post-operative reports and analysis. A common approach in the literature for automatic surgical phase estimation is to decouple the problem into two stages: feature extraction from a single frame and temporal feature fusion. The problem is split into two stages because of computational restrictions when processing large spatio-temporal sequences.
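The two-stage decoupling described above can be illustrated with a minimal sketch. It assumes stage 1 has already produced per-frame class scores (the `scores` list is made-up data), and it uses a centred moving average as a simple stand-in for a learned temporal model such as an LSTM or MS-TCN. The function names are hypothetical and are not taken from the paper.

```python
from itertools import groupby


def smooth_scores(frame_scores, window=5):
    """Stage 2 stand-in: centred moving average over per-frame class scores."""
    half = window // 2
    n = len(frame_scores)
    smoothed = []
    for t in range(n):
        lo, hi = max(0, t - half), min(n, t + half + 1)
        span = frame_scores[lo:hi]
        # Average each class column over the frames inside the window.
        smoothed.append([sum(col) / len(span) for col in zip(*span)])
    return smoothed


def to_segments(labels):
    """Collapse per-frame labels into (label, first_frame, last_frame) runs."""
    segments, start = [], 0
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((label, start, start + length - 1))
        start += length
    return segments


def estimate_phases(frame_scores, window=5):
    """Smooth the frame-wise scores over time, pick a label per frame, segment."""
    smoothed = smooth_scores(frame_scores, window)
    labels = [max(range(len(s)), key=s.__getitem__) for s in smoothed]
    return to_segments(labels)


# Hypothetical stage-1 output: scores for 2 phases over 7 frames.
# Frame 2 is a noisy outlier that a frame-wise argmax would mislabel.
scores = [[0.9, 0.1], [0.8, 0.2], [0.4, 0.6], [0.9, 0.1],
          [0.3, 0.7], [0.1, 0.9], [0.1, 0.9]]
segments = estimate_phases(scores, window=3)
```

With this toy input, the temporal stage removes the isolated flip at frame 2, so the video is divided into two contiguous segments rather than four.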
Keywords: Multi-task; Scene segmentation; Surgical data science; Surgical phases
Year: 2022 PMID: 35505149 PMCID: PMC9110447 DOI: 10.1007/s11548-022-02616-0
Source DB: PubMed Journal: Int J Comput Assist Radiol Surg ISSN: 1861-6410 Impact factor: 3.421
Comparison of existing literature for surgical phase estimation regarding the proposed encoder and temporal model architecture, and the type of annotations used during the training of the encoder
| Ref. | Model | Encoder backbone | Phase annotations | Instrument presence annotations | Scene segmentation annotations | Temporal model |
|---|---|---|---|---|---|---|
| [ | EndoNet | AlexNet | | | | HMM |
| [ | MTRCNet-CL | Residual CNN | | | | LSTM |
| [ | TeCNO | ResNet50 | | | | MS-TCN |
| [ | OperA | ResNet50 | | | | Transformers |
| | Proposed | ResNet50 | | | | MS-TCN |
Scene refers to segmentation of both instrument and anatomy. KEY: HMM, hidden Markov model; LSTM, long short-term memory; MS-TCN, multi-stage temporal convolutional network
Fig. 1 Proposed multi-task encoder. KEY: GAP, global average pooling; FC, fully connected layer; BN, batch-norm layer; x upscaling feature map i times. The numbers on the arrows indicate the dimensionality of the feature maps for a sample input image
Fig. 2 Surgical phase visualisation. The first bar shows the annotation and the second the prediction of the proposed model. KEY: preparation, Calot triangle dissection, clipping and cutting, gallbladder dissection, gallbladder packaging, cleaning and coagulation, and gallbladder retraction
Comparison of the results of the proposed model against state-of-the-art models for surgical phase estimation on the Cholec80 dataset
| Split | Model | Phase metric | |
|---|---|---|---|
| Accuracy | F1-Score | ||
| 40:40 | [ | 0.8190 ± 0.0440 | – |
| [ | |||
| [ | 0.8856 ± 0.0027 | – | |
| 48:20 | ResNet50* | 0.8121 ± 0.0116 | 0.7298 ± 0.0117 |
| ResNet+LSTM* | 0.8794 ± 0.0080 | 0.8229 ± 0.0078 | |
| [ | 0.8564 ± 0.0021 | 0.8094 ± 0.0095 | |
| [ | 0.8905 ± 0.0079 | 0.8404 ± 0.0064 | |
| [ | 0.8449 ± 0.0064 | ||
| Proposed | 0.8951 ± 0.0270 | ||
Bold indicates the highest score in each metric and each split
*Results reported in [5]
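For reference, frame-wise accuracy and F1-score can be computed as in the sketch below. It assumes that F1 is macro-averaged over the phases present in the ground truth. That averaging convention is an assumption, since this record does not state how the reported scores were aggregated, and the labels in the example are made up.

```python
def accuracy(y_true, y_pred):
    """Fraction of frames whose predicted phase matches the annotation."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-phase F1 over phases present in y_true (assumed)."""
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up frame labels for three phases.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

Because macro-averaging weights every phase equally, short phases affect F1 much more than they affect accuracy, which may explain why the F1 values in the table are consistently lower than the accuracies.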
Phase estimation performance of the proposed model when using different backbones and annotations during the training of the encoder
| Backbone | Phase annotations | Scene segmentation annotations | Instrument presence annotations | Accuracy | F1-Score |
|---|---|---|---|---|---|
| ResNet50 | | | | 0.8991 ± 0.0146 | 0.8382 ± 0.0252 |
| | | | | 0.9143 ± 0.0174 | 0.8704 ± 0.0138 |
| ResNet18 | | | | 0.9089 ± 0.0036 | 0.8639 ± 0.0073 |
| ResNet152 | | | | 0.9119 ± 0.0027 | 0.8739 ± 0.0079 |
Bold indicates the highest score
Fig. 3 Multi-task fusion modules under comparison: a fusion via concatenation and convolution; b fusion via convolution prior to concatenation; c proposed multi-task fusion with linear combination and learnable weights. Numbers in the figure indicate the dimensionality of the feature map for reference. KEY: multiplication operator; learnable scalar value; addition operator; Cat, concatenation operator; conv, 1 × 1 convolution; BN, batch normalisation layer
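The idea behind fusion module c, a linear combination of task features with one scalar weight per task, can be sketched as follows. This is a forward-pass illustration only, under stated assumptions: the feature maps are flattened to equal-length lists, the weights are fixed numbers rather than trained parameters, and the skip connection and any normalisation layers from the figure are not modelled. The name `fuse_linear` and all values are hypothetical.

```python
def fuse_linear(features, weights):
    """Weighted element-wise sum: fused[j] = sum_i weights[i] * features[i][j]."""
    if len(features) != len(weights):
        raise ValueError("one scalar weight is needed per feature map")
    length = len(features[0])
    if any(len(f) != length for f in features):
        raise ValueError("all feature maps must have the same shape")
    return [sum(w * f[j] for w, f in zip(weights, features))
            for j in range(length)]


# Made-up, flattened task features with identical shapes.
phase_feat = [1.0, 2.0]
instrument_feat = [0.0, 4.0]
segmentation_feat = [2.0, 2.0]

# In training, these scalars would be learnable parameters.
task_weights = [0.5, 0.25, 0.25]
fused = fuse_linear([phase_feat, instrument_feat, segmentation_feat], task_weights)
```

Unlike the concatenation-based variants a and b, a linear combination keeps the output the same size as each input and adds only one parameter per task.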
Multi-task fusion comparison when using phase, instrument presence, and scene segmentation as described in the text and in Fig. 3
| Fusion | Skip connection | Accuracy | F1-Score |
|---|---|---|---|
| (a) Concat. and convolution | | 0.9141 | 0.8509 |
| | | 0.9126 | 0.8407 |
| (b) Convolution, concat., and convolution | | 0.9159 | 0.8583 |
| | | 0.9130 | 0.8450 |
| (c) Proposed (linear combination) | | | |
| | | 0.9216 | 0.8609 |
Bold indicates the highest score
Results of the proposed model for the task of scene segmentation on the Cholec80 dataset
| mPA | mIOU | mDICE |
|---|---|---|
KEY: mPA, mean Pixel Accuracy; mIOU, mean Intersection Over Union; mDICE, mean DICE
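The three segmentation metrics in the key can be computed from per-class pixel counts, as in the sketch below. It assumes each metric is averaged over the classes that occur in the ground truth, and that pixel accuracy for a class is the fraction of its ground-truth pixels predicted correctly. The paper's exact averaging convention is not given in this record, and the labels in the example are made up.

```python
def segmentation_metrics(y_true, y_pred, num_classes):
    """Return class-averaged pixel accuracy, IoU and Dice for flat label lists."""
    pa, iou, dice = [], [], []
    for c in range(num_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        if tp + fn == 0:
            continue  # class absent from the ground truth: skipped (assumption)
        pa.append(tp / (tp + fn))
        iou.append(tp / (tp + fp + fn))
        dice.append(2 * tp / (2 * tp + fp + fn))
    n = len(pa)
    return {"mPA": sum(pa) / n, "mIOU": sum(iou) / n, "mDICE": sum(dice) / n}


# Made-up flattened masks with two classes.
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
metrics = segmentation_metrics(y_true, y_pred, num_classes=2)
```

For any single class, Dice equals 2·IoU / (1 + IoU), so the two metrics rank predictions identically per class, although their class averages can differ.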