Kevin Kasa, David Burns, Mitchell G. Goldenberg, Omar Selim, Cari Whyne, Michael Hardisty.
Abstract
This paper introduces a new dataset of a surgical knot-tying task, together with a multi-modal deep learning model that achieves performance comparable to expert human raters on this skill-assessment task. Seventy-two surgical trainees and faculty were recruited for the knot-tying task and were recorded using video, kinematic, and image data. Three expert human raters conducted the skills assessment using the Objective Structured Assessment of Technical Skill (OSATS) Global Rating Scale (GRS). We also designed and developed three deep learning models: a ResNet-based image model, a ResNet-LSTM kinematic model, and a multi-modal model leveraging both the image and time-series kinematic data. All three models demonstrate performance comparable to the expert human raters on most GRS domains. The multi-modal model demonstrates the best overall performance, as measured by the mean squared error (MSE) and intraclass correlation coefficient (ICC). This work is significant because it demonstrates that multi-modal deep learning has the potential to replicate human raters on a challenging, human-performed knot-tying task. The study demonstrates an algorithm with state-of-the-art performance in surgical skill assessment. As objective assessment of technical skill continues to be a growing, but resource-heavy, element of surgical education, this study is an important step towards automated surgical skill assessment, ultimately reducing the burden on training faculty and institutions.
Keywords: biomedical engineering; computer vision; deep learning; human activity recognition; machine learning; multi-modal; surgical education; surgical skills assessment
Year: 2022 PMID: 36236424 PMCID: PMC9571767 DOI: 10.3390/s22197328
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.847
Figure 1. The trials were recorded using three modalities. Top: an image of the final product; middle: a screen capture of the video data with a visualization of the joints tracked by the Leap sensor; bottom: an example of the kinematic time-series data, representing the temporal 3-dimensional movement of the hand joints during the knot-tying task.
Rating scale used when evaluating surgical skill on the GRS domains.

| Domain | Rating Scale |
|---|---|
| Respect for Tissue | 1—Very poor: Frequent or excessive pulling or sawing of tissue; 3—Competent: Careful handling of tissue with occasional sawing or pulling; 5—Clearly superior: Consistent atraumatic handling of tissue |
| Time and Motion | 1—Very poor: Many unnecessary movements; 3—Competent: Efficient time/motion but some unnecessary moves; 5—Clearly superior: Clear economy of movement and maximum efficiency |
| Quality of Final Product | 1—Very poor; 3—Competent; 5—Clearly superior |
| Overall Performance | 1—Very poor; 3—Competent; 5—Clearly superior |
Figure 2. Participants came from 10 surgical divisions, with experience ranging from PGY1 to Fellow.
Figure 3. Images were analyzed using a ResNet-based network, and the kinematic data were analyzed using a 1D ResNet-18 as a 'feature extractor', followed by two bidirectional LSTM layers. The combined multi-modal network is trained concurrently on both the image and kinematic inputs and predicts all four GRS domains.
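The two-branch design described above can be sketched as follows. This is a minimal illustrative stand-in, not the authors' implementation: a tiny CNN and a single 1D convolution stand in for the ResNet-18 and 1D ResNet-18 backbones, and the class name and layer sizes (`kin_channels`, `lstm_hidden`) are assumptions. Only the overall pattern follows the paper: an image branch, a kinematic branch with two bidirectional LSTM layers, feature concatenation, and a four-output regression head.

```python
import torch
import torch.nn as nn

class MultiModalGRSNet(nn.Module):
    """Illustrative stand-in for the multi-modal network: an image branch and
    a kinematic (time-series) branch whose features are concatenated to
    regress the four GRS domain scores."""

    def __init__(self, kin_channels=6, lstm_hidden=32):
        super().__init__()
        # Image branch: tiny CNN standing in for the ResNet-18 backbone.
        self.image_branch = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),                              # -> (B, 16)
        )
        # Kinematic branch: 1D conv 'feature extractor' + 2 bidirectional LSTMs.
        self.kin_conv = nn.Conv1d(kin_channels, 16, kernel_size=3, padding=1)
        self.kin_lstm = nn.LSTM(16, lstm_hidden, num_layers=2,
                                batch_first=True, bidirectional=True)
        # Fusion head: concatenated features -> 4 GRS domain scores.
        self.head = nn.Sequential(
            nn.Dropout(0.5),
            nn.Linear(16 + 2 * lstm_hidden, 4),
        )

    def forward(self, image, kinematics):
        img_feat = self.image_branch(image)             # (B, 16)
        k = torch.relu(self.kin_conv(kinematics))       # (B, 16, T)
        k, _ = self.kin_lstm(k.transpose(1, 2))         # (B, T, 2*hidden)
        kin_feat = k[:, -1, :]                          # last timestep summary
        return self.head(torch.cat([img_feat, kin_feat], dim=1))

model = MultiModalGRSNet()
scores = model(torch.randn(2, 3, 64, 64), torch.randn(2, 6, 50))
```

Taking the last LSTM timestep as the sequence summary is one common choice; temporal pooling would work equally well in this sketch.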
Summary of the hyper-parameters used to train the multi-modal network. Hyper-parameters were tuned heuristically.
| Hyperparameter | Value |
|---|---|
| Learning rate | |
| Optimizer | Adam |
| Batch size | 16 |
| Dropout | 0.50 |
| Epochs (frozen backbone) | 50 |
| Epochs (fine-tuning backbone) | 50 |
| Loss function | Mean squared error |
| Image dimensions | (1024, 1024) |
| Time-series length | 4223 timestamps |
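The two-phase schedule in the table (50 epochs with a frozen backbone, then 50 epochs fine-tuning the backbone) can be sketched as below. The toy model, the `set_backbone_frozen` helper, and the learning rates are all illustrative assumptions — in particular, the table's learning-rate value did not survive extraction, so the values here are placeholders, not the paper's.

```python
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    """Stand-in with a pretrained-style 'backbone' and a task 'head'."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(8, 16), nn.ReLU())
        self.head = nn.Linear(16, 4)

    def forward(self, x):
        return self.head(self.backbone(x))

def set_backbone_frozen(model, frozen):
    # Freezing = excluding backbone weights from gradient updates.
    for p in model.backbone.parameters():
        p.requires_grad = not frozen

model = ToyModel()
loss_fn = nn.MSELoss()  # matches the table's loss function

# Phase 1: train only the head, backbone frozen (50 epochs in the paper).
set_backbone_frozen(model, True)
opt = torch.optim.Adam((p for p in model.parameters() if p.requires_grad),
                       lr=1e-3)  # placeholder learning rate

# Phase 2: unfreeze and fine-tune end to end (another 50 epochs),
# typically at a lower learning rate.
set_backbone_frozen(model, False)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)  # placeholder
```

Rebuilding the optimizer after unfreezing ensures the newly trainable backbone parameters are actually registered with it.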
The expert human raters demonstrate moderate to good agreement on their evaluations, as measured against the mean rating. The AI model was trained and evaluated on the mean value of the ratings.
| GRS Domain | ICC (2,3) | SEM (2,3) | ICC (2,1) | SEM (2,1) |
|---|---|---|---|---|
| Respect for Tissue | 0.71 | 0.45 | 0.47 | 0.62 |
| Time and Motion | 0.70 | 0.47 | 0.44 | 0.64 |
| Quality of Final Product | 0.83 | 0.40 | 0.63 | 0.61 |
| Overall Performance | 0.73 | 0.39 | 0.47 | 0.55 |
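The ICC variants reported here — ICC(2,1) for a single rater and ICC(2,k) for the mean of k raters — come from a two-way random-effects ANOVA model with absolute agreement. A minimal NumPy sketch (the function names are ours, and the SEM helper assumes the common SEM = SD·sqrt(1 − ICC) definition; the paper's exact SEM computation is not spelled out here):

```python
import numpy as np

def icc_2(ratings):
    """ICC(2,1) and ICC(2,k): two-way random effects, absolute agreement.

    ratings: array of shape (n_targets, k_raters).
    """
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)   # per-target means
    col_means = ratings.mean(axis=0)   # per-rater means
    # Mean squares from the two-way ANOVA decomposition.
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)   # between targets
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)   # between raters
    resid = ratings - row_means[:, None] - col_means[None, :] + grand
    mse = np.sum(resid ** 2) / ((n - 1) * (k - 1))         # residual
    icc_single = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    icc_avg = (msr - mse) / (msr + (msc - mse) / n)
    return icc_single, icc_avg

def sem_from_icc(ratings, icc):
    """Standard error of measurement: rating SD scaled by sqrt(1 - ICC)."""
    sd = np.std(np.asarray(ratings, dtype=float), ddof=1)
    return float(sd * np.sqrt(1.0 - icc))
```

With perfectly agreeing raters both ICC forms equal 1 and the SEM is 0; disagreement inflates the residual mean square and pulls the ICC down.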
Test-retest performance of the human raters on the forty repeated trials. Although the raters' performance varies, all show moderate to good consistency.
| GRS Domain | Rater 1 ICC | Rater 1 SEM | Rater 2 ICC | Rater 2 SEM | Rater 3 ICC | Rater 3 SEM |
|---|---|---|---|---|---|---|
| Respect for Tissue | 0.84 | 0.43 | 0.49 | 0.55 | 0.55 | 0.54 |
| Time and Motion | 0.83 | 0.46 | 0.57 | 0.58 | 0.62 | 0.48 |
| Quality of Final Product | 0.88 | 0.40 | 0.79 | 0.47 | 0.69 | 0.43 |
| Overall Performance | 0.85 | 0.37 | 0.60 | 0.49 | 0.58 | 0.48 |
Human raters show good to excellent agreement on the held-out test set. Measuring agreement on the same test set on which the AI model is evaluated provides a better baseline for expected performance.
| GRS Domain | ICC (2,3) | SEM (2,3) | ICC (2,1) | SEM (2,1) |
|---|---|---|---|---|
| Respect for Tissue | 0.78 | 0.44 | 0.54 | 0.63 |
| Time and Motion | 0.81 | 0.41 | 0.58 | 0.61 |
| Quality of Final Product | 0.93 | 0.30 | 0.82 | 0.49 |
| Overall Performance | 0.86 | 0.30 | 0.68 | 0.30 |
Figure 4. Participant experience and rating on the 'Overall Performance' domain. A significant difference was found between the Beginner and Intermediate groups.
Performance metrics, including mean squared error (MSE), of the AI predictions and human ratings, compared to the ground truth (mean of human scores).

| Model | Metric | Respect for Tissue | Time and Motion | Quality of Final Product | Overall Performance |
|---|---|---|---|---|---|
| Image Model | MSE | - | - | | - |
| | RMSE | - | - | 0.392 | - |
| | MAE | - | - | 0.293 | - |
| | R2 | - | - | 0.778 | - |
| Kinematic Model | MSE | | 0.420 | - | 0.373 |
| | RMSE | 0.579 | 0.648 | - | 0.610 |
| | MAE | 0.523 | 0.456 | - | 0.431 |
| | R2 | | 0.244 | - | 0.453 |
| Multi-modal Model | MSE | 0.480 | | 0.186 | |
| | RMSE | 0.693 | 0.597 | 0.431 | 0.440 |
| | MAE | 0.545 | 0.459 | 0.331 | 0.315 |
| | R2 | 0.136 | | | |
| Rater 1 | MSE | 0.464 | | 0.531 | 0.505 |
| | RMSE | 0.681 | 0.590 | 0.729 | 0.710 |
| | MAE | 0.528 | 0.474 | 0.449 | 0.407 |
| Rater 2 | MSE | 0.546 | 0.553 | 0.545 | 0.466 |
| | RMSE | 0.739 | 0.744 | 0.738 | 0.683 |
| | MAE | 0.586 | 0.483 | 0.425 | 0.436 |
| Rater 3 | MSE | | 0.363 | 0.193 | 0.290 |
| | RMSE | 0.537 | 0.602 | 0.439 | 0.539 |
| | MAE | 0.409 | 0.426 | 0.291 | 0.336 |
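For reference, the regression metrics in the table relate to each other as RMSE = sqrt(MSE) and R² = 1 − SS_res/SS_tot. A small NumPy helper (the function name is ours) computing all four from predicted and ground-truth mean scores:

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """MSE, RMSE, MAE, and R^2 between ground truth and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mse = float(np.mean(err ** 2))
    rmse = float(np.sqrt(mse))
    mae = float(np.mean(np.abs(err)))
    # R^2: one minus residual sum of squares over total sum of squares.
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "R2": r2}
```

For example, `regression_metrics([1, 2, 3, 4], [2, 2, 3, 4])` yields MSE 0.25, RMSE 0.5, MAE 0.25, and R² 0.8.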
Figure 5. Graphical comparison of the MSE on the GRS domains (lower MSE is better).
Intraclass correlation coefficient (ICC) and standard error of measurement (SEM) scores between the ground truth and the AI models and human raters.

| Model | Metric | Respect for Tissue | Time and Motion | Quality of Final Product | Overall Performance |
|---|---|---|---|---|---|
| Image Model | ICC(2,1) | - | - | 0.888 | - |
| | SEM(2,1) | - | - | | - |
| Kinematic Model | ICC(2,1) | | | - | 0.534 |
| | SEM(2,1) | | 0.441 | - | 0.416 |
| Multi-modal Model | ICC(2,1) | 0.301 | 0.591 | | |
| | SEM(2,1) | 0.499 | | 0.309 | |
| Rater 1 | ICC(2,1) | 0.717 | 0.779 | 0.823 | 0.616 |
| | SEM(2,1) | 0.476 | 0.414 | 0.512 | 0.502 |
| Rater 2 | ICC(2,1) | 0.606 | 0.627 | 0.758 | 0.508 |
| | SEM(2,1) | 0.516 | 0.524 | 0.521 | 0.689 |
| Rater 3 | ICC(2,1) | | | | |
| | SEM(2,1) | | | | 0.379 |
Spearman correlation coefficient between the multi-modal AI predictions and the ground truth. The best-performing model on the JIGSAWS dataset is included as a reference [13].

| GRS Domain | Multi-Modal Model (Ours) | FCN [13] |
|---|---|---|
| Respect for Tissue | 0.18 | - |
| Time and Motion | 0.73 | - |
| Quality of Final Product | 0.95 | - |
| Overall Performance | 0.82 | - |
| Mean | 0.67 | 0.65 |
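Spearman's ρ is the Pearson correlation of the rank-transformed scores. A minimal NumPy sketch (the function name is ours; there is no tie correction, so it assumes distinct scores):

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman rank correlation as the Pearson r of the ranks (no ties)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # argsort of argsort gives each element's rank 0..n-1 when values are distinct.
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return float(np.corrcoef(rx, ry)[0, 1])
```

Perfectly monotone agreement gives ρ = 1 and a perfectly reversed ordering gives ρ = −1; for tied scores a tie-aware implementation (average ranks) should be used instead.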
Accuracy of the multi-modal model, determined by first rounding the continuous ground-truth and predicted scores to the nearest integer. The best-performing model on the JIGSAWS dataset is included as a reference [11].

| GRS Domain | Multi-Modal Model (Ours) | Embedding Analysis [11] |
|---|---|---|
| Time and Motion | | 0.32 |
| Quality of Final Product | | 0.51 |
| Overall Performance | | 0.41 |
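The rounding-based accuracy described in the caption can be computed as below (a sketch; the function name is ours, and `np.rint` rounds exact halves to the nearest even integer):

```python
import numpy as np

def rounded_accuracy(y_true, y_pred):
    """Accuracy after rounding continuous scores to the nearest integer level."""
    t = np.rint(np.asarray(y_true, dtype=float))
    p = np.rint(np.asarray(y_pred, dtype=float))
    return float(np.mean(t == p))
```

For example, ground truth (2.3, 3.7, 4.1) rounds to (2, 4, 4) and predictions (2.6, 3.9, 4.4) to (3, 4, 4), giving an accuracy of 2/3.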