Lalithkumar Seenivasan, Mobarakol Islam, Chi-Fai Ng, Chwee Ming Lim, Hongliang Ren.
Abstract
Surgical scene understanding is a key barrier for situation-aware robotic surgeries and the associated surgical training. With the presence of domain shifts and the inclusion of new instruments and tissues, learning domain generalization (DG) plays a pivotal role in expanding instrument-tissue interaction detection to new domains in robotic surgery. Mimicking the ability of humans to incrementally learn new skills without forgetting their old skills in a similar domain, we employ incremental DG on scene graphs to predict instrument-tissue interaction during robot-assisted surgery. To achieve incremental DG, we incorporate incremental learning (IL) to accommodate new instruments and knowledge-distillation-based student-teacher learning to tackle domain shifts in the new domain. Additionally, we designed an enhanced curriculum by smoothing (E-CBS) based on Laplacian of Gaussian (LoG) and Gaussian kernels, and integrated it with the feature extraction network (FEN) and graph network to improve the instrument-tissue interaction performance. Furthermore, the FEN's and graph network's logits are normalized by temperature normalization (T-Norm), and its effect on model calibration was studied. Quantitative and qualitative analysis proved that our incrementally-domain generalized interaction detection model was able to adapt to the target domain (transoral robotic surgery) while retaining its performance in the source domain (nephrectomy surgery). Additionally, the graph model enhanced by E-CBS and T-Norm outperformed other state-of-the-art models, and the incremental DG technique performed better than the naive domain adaptation and DG techniques.
Keywords: curriculum learning; domain generalization; scene graph; surgical scene understanding
Year: 2022 PMID: 35735584 PMCID: PMC9220121 DOI: 10.3390/biomimetics7020068
Source DB: PubMed Journal: Biomimetics (Basel) ISSN: 2313-7673
Figure 1. Graph network for surgical scene understanding: Given a surgical scene with bounding boxes, visual features of tissues and instruments are extracted using the FEN. Visual features are embedded in the visual graph nodes, and node names are embedded in the semantic graph nodes. To improve graph network performance, E-CBS is appended to node aggregation in the graph network. Upon node aggregation in both graphs, they are combined to form a single graph whose edges are embedded with spatial features. Upon aggregation, the readout function processes the edge features to predict interaction logits, which are then temperature normalized. Incremental DG: Given a model trained on the source domain, DG is achieved in 2 tiers. Firstly, the FEN is naively trained using IL [12] to include novel instruments and domain shifts. Secondly, the graph network is domain generalized based on the student–teacher training regime in knowledge distillation. A graph network initially trained in the source domain is considered the teacher model. A copy is then taken as a student model and further trained on the target domain and random samples from the source domain. Finally, the student model is fine-tuned on a balanced distribution from the source and target domains.
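The student–teacher regime described in the caption can be illustrated with a small sketch. The source does not state the exact loss used, so the block below swaps in the standard temperature-softened distillation loss (cross-entropy on the label plus a KL term against the teacher). The function names, `temperature`, and `alpha` are assumptions made for illustration, not the authors' implementation.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (T > 1 softens the output)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Negative log-likelihood of the true class index."""
    return -math.log(probs[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions, assuming q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Hypothetical KD objective: weighted hard-label and soft-target terms."""
    hard = cross_entropy(softmax(student_logits), label)
    soft = kl_divergence(softmax(teacher_logits, temperature),
                         softmax(student_logits, temperature))
    # The T^2 factor keeps the soft-term gradient scale comparable across T.
    return alpha * hard + (1 - alpha) * temperature ** 2 * soft
```

When the student reproduces the teacher's logits exactly, the soft term vanishes and only the weighted hard-label term remains; any disagreement with the teacher adds a penalty.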
Comparison of our proposed model’s performance in the source and target domains against the state-of-the-art scene graph models when trained solely on source domain.
| Graph Network | Feature Extractor (ResNet18 [13]) | Source Domain Acc ↑ | Source Domain mAP ↑ | Source Domain Recall ↑ | Target Domain Acc ↑ | Target Domain mAP ↑ | Target Domain Recall ↑ |
|---|---|---|---|---|---|---|---|
| GAT | Vanilla | 33.21 | 0.0973 | - | 33.25 | 0.0773 | - |
| G-Hpooling | Vanilla | 33.21 | 0.1523 | - | 33.25 | 0.0790 | - |
| GPNN | Vanilla | 55.00 | 0.1934 | - | 29.52 | 0.1980 | - |
| Islam et al. | Label-smoothing | 48.02 | 0.2157 | - | 29.52 | | - |
| VS-GAT | Vanilla | 62.96 | 0.2682 | 0.2888 | 35.49 | 0.0999 | 0.1327 |
| Ours (VS-GAT with E-CBS and T-Norm) | IL [12] | 63.31 | 0.2975 | | 39.25 | 0.1009 | 0.1268 |
Comparison of the model performance trained using our proposed knowledge distillation (KD)-based incremental DG technique against the performances of models trained using naive domain adaptation and DG techniques on the source domain (SD) and target domain (TD).
| Technique | Phase 1: SD | Phase 2: TD | Phase 2: KD | Fine-Tuning | Source Domain Acc ↑ | Source Domain mAP ↑ | Source Domain Recall ↑ | Target Domain Acc ↑ | Target Domain mAP ↑ | Target Domain Recall ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| Domain adaptation | ✓ | ✓ | ✕ | ✕ | 42.89 | 0.3122 | 0.1994 | 32.76 | 0.1211 | 0.1489 |
| DG | ✓ | ✓ | ✕ | ✓ | 44.10 | 0.3273 | | 32.42 | 0.1185 | |
| Ours (incremental DG) | ✓ | ✓ | ✓ | ✓ | 56.59 | | 0.2138 | 33.11 | 0.1407 | 0.1515 |
Figure 2. Qualitative analysis: (a) Source domain surgical scene with annotated bounding box (Bbox). (b) Ground truth (GT) interaction vs. our model’s prediction (in red and green text). (c) Target domain surgical scene with annotated Bbox and (d) GT interaction vs. our model’s prediction.
Ablation study of our proposed model trained using unsupervised DG and incremental DG in instrument–tissue interaction detection in the source and target domains.
| Training | Model | ResNet18 [13] + Label-Smoothing | ResNet18 [13] + IL [12] | E-CBS | T-Norm | Source Domain Acc ↑ | Source Domain mAP ↑ | Source Domain Recall ↑ | Target Domain Acc ↑ | Target Domain mAP ↑ | Target Domain Recall ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Unsupervised DG | Base | ✕ | ✕ | ✕ | ✕ | 62.96 | 0.2682 | 0.2888 | 35.49 | 0.0999 | 0.1327 |
| Unsupervised DG | Base | ✓ | ✕ | ✕ | ✕ | 63.82 | 0.2649 | 0.2922 | 35.15 | 0.0988 | 0.1171 |
| Unsupervised DG | Base | ✕ | ✓ | ✕ | ✕ | 63.57 | 0.2673 | 0.2650 | 36.86 | | 0.1223 |
| Unsupervised DG | Base | ✕ | ✓ | ✓ | ✕ | 63.65 | | 0.2986 | | | |
| Unsupervised DG | Base | ✕ | ✓ | ✕ | ✓ | | 0.2594 | 0.2987 | 35.84 | 0.0965 | 0.1223 |
| Unsupervised DG | Ours | ✕ | ✓ | ✓ | ✓ | 63.31 | 0.2975 | | 39.25 | 0.1009 | 0.1268 |
| Incremental DG | Base | ✕ | ✓ | ✕ | ✕ | 55.47 | 0.3072 | 0.2025 | | 0.1178 | |
| Incremental DG | Base | ✕ | ✓ | ✓ | ✕ | | 0.2869 | 0.2088 | 33.11 | | 0.1876 |
| Incremental DG | Base | ✕ | ✓ | ✕ | ✓ | 54.87 | 0.3123 | 0.2000 | 34.47 | 0.1070 | 0.1877 |
| Incremental DG | Ours | ✕ | ✓ | ✓ | ✓ | 56.59 | | 0.2138 | 33.11 | 0.1407 | 0.1515 |
Figure 3. Reliability diagram: (a) Base + (ResNet18 [13] + IL [12]), (b) Base + (ResNet18 [13] + IL [12]) + E-CBS and (c) Base + (ResNet18 [13] + IL [12]) + E-CBS + T-Norm.
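A reliability diagram plots, for each confidence bin, the observed accuracy against the mean predicted confidence; a well-calibrated model lies on the diagonal. The sketch below shows one common way to compute the bins and the expected calibration error (ECE) that summarizes them. The equal-width binning, the bin count, and the function names are assumptions, since the source does not describe how the diagrams were produced.

```python
def reliability_bins(confidences, correct, n_bins=10):
    """Group predictions into equal-width confidence bins.

    Returns (mean_confidence, accuracy, count) for each non-empty bin.
    """
    # Per bin: [sum of confidences, number correct, number of samples]
    sums = [[0.0, 0, 0] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 joins the last bin
        sums[idx][0] += conf
        sums[idx][1] += int(ok)
        sums[idx][2] += 1
    return [(s / n, c / n, n) for s, c, n in sums if n > 0]


def expected_calibration_error(confidences, correct, n_bins=10):
    """Sample-weighted mean gap between accuracy and confidence across bins."""
    total = len(confidences)
    return sum(n / total * abs(acc - conf)
               for conf, acc, n in reliability_bins(confidences, correct, n_bins))
```

Dividing logits by a temperature above 1 before the softmax lowers the predicted confidences, which is how temperature-based normalization can shrink the gap between the bars and the diagonal for an overconfident model.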
Ablation study of the FEN trained in classifying the tissues and instruments found in the source and target domains.
| ResNet18 [13] | IL [12] | E-CBS | T-Norm | Acc ↑: Source Domain (9 Classes) | Acc ↑: Source and Target Domain (11 Classes) |
|---|---|---|---|---|---|
| ✓ | ✕ | ✕ | ✕ | 35.24 | - |
| ✓ | ✕ | ✕ | ✓ | 35.24 | - |
| ✓ | ✕ | ✓ | ✕ | | - |
| ✓ | ✕ | ✓ | ✓ | | - |
| ✓ | ✓ | ✕ | ✕ | - | 31.85 |
| ✓ | ✓ | ✕ | ✓ | - | 28.90 |
| ✓ | ✓ | ✓ | ✕ | - | 32.19 |
| ✓ | ✓ | ✓ | ✓ | - | |
Comparison of the proposed E-CBS and CBS (LoG) against the CBS [15].
| | Decay | Model | Initial Conv Layer | Residual Blocks | Acc ↑ |
|---|---|---|---|---|---|
| 1.0 | 0.9 | CBS [15] | Gaussian | Gaussian | 89.23 |
| 1.0 | 0.9 | CBS (LoG) | LoG | LoG | 88.17 |
| 1.0 | 0.9 | E-CBS | LoG | Gaussian | |
| 1.0 | 1.0 | CBS [15] | Gaussian | Gaussian | 84.70 |
| 1.0 | 1.0 | CBS (LoG) | LoG | LoG | |
| 1.0 | 1.0 | E-CBS | LoG | Gaussian | 87.02 |
| 2.0 | 0.9 | CBS [15] | Gaussian | Gaussian | 86.48 |
| 2.0 | 0.9 | CBS (LoG) | LoG | LoG | |
| 2.0 | 0.9 | E-CBS | LoG | Gaussian | 87.01 |
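The table contrasts Gaussian and LoG smoothing kernels, with E-CBS using LoG in the initial convolution layer and Gaussian in the residual blocks. As a minimal sketch of what those two kernels look like, the block below builds both in plain Python and adds a geometric decay schedule for the kernel width. The kernel size, the per-epoch decay rule, the zero-sum correction on the LoG kernel, and all function names are assumptions for illustration; the source does not specify them.

```python
import math


def gaussian_kernel(size, sigma):
    """2D Gaussian kernel (odd size), normalized to sum to 1."""
    half = size // 2
    k = [[math.exp(-(x * x + y * y) / (2 * sigma ** 2))
          for x in range(-half, half + 1)]
         for y in range(-half, half + 1)]
    total = sum(map(sum, k))
    return [[v / total for v in row] for row in k]


def log_kernel(size, sigma):
    """2D Laplacian-of-Gaussian kernel (odd size), shifted to sum to 0."""
    half = size // 2
    k = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            r2 = x * x + y * y
            scale = r2 / (2 * sigma ** 2)
            row.append(-(1 - scale) * math.exp(-scale) / (math.pi * sigma ** 4))
        k.append(row)
    # Subtract the mean so a constant image region gives a zero response.
    mean = sum(map(sum, k)) / (size * size)
    return [[v - mean for v in row] for row in k]


def sigma_schedule(sigma0, decay, epochs):
    """Assumed curriculum: kernel width shrinks geometrically each epoch."""
    return [sigma0 * decay ** e for e in range(epochs)]
```

With a decay of 1.0 the width never shrinks, so the smoothing stays constant throughout training; a decay of 0.9 gradually sharpens the features the network sees.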