Jason D Kelly (1), Ashley Petersen (2), Thomas S Lendvay (3), Timothy M Kowalewski (4)
1. Department of Mechanical Engineering, University of Minnesota, Minneapolis, MN, USA. kell1917@umn.edu
2. Division of Biostatistics, University of Minnesota, Minneapolis, MN, USA.
3. Department of Urology, Seattle Children's Hospital, Seattle, WA, USA.
4. Department of Mechanical Engineering, University of Minnesota, Minneapolis, MN, USA.
Abstract
PURPOSE: Most prior surgical skill research analyzes holistic, task-level summary metrics to classify the skill of a performance. Recent advances in machine learning enable time-series classification at the sub-task level, so that predictions can be made on segments of a task, which could improve task-level technical skill assessment.
METHODS: A bidirectional long short-term memory (LSTM) network was applied to 8-s windows of multidimensional time-series data from the Basic Laparoscopic Urologic Skills dataset. The network was trained on expert and novice performances of four common surgical tasks. Stratified cross-validation with regularization was used to limit overfitting. Misclassified cases were re-submitted to crowds on Amazon Mechanical Turk for surgical technical skill assessment, and the level of agreement with the previous scores was analyzed.
RESULTS: Performance was best for the suturing task: the network predicted whether a performance was expert or novice with 96.88% accuracy (1 misclassification) when compared with previously obtained crowd evaluations. When compared with expert surgeon ratings, the LSTM predictions yielded a Spearman coefficient of 0.89 for the suturing task. When crowds re-evaluated the misclassified performances, they agreed more with the LSTM model than with the previously obtained crowd scores for all 5 misclassified cases from the peg transfer and suturing tasks.
CONCLUSION: The presented technique produces results comparable to crowd-sourced labels of surgical tasks. However, these results raise questions about the reliability of crowd-sourced labels for videos of surgical tasks. As a research community, we should scrutinize crowd labeling more closely, systematically examine its biases, and quantify label noise.
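The windowing step described in METHODS can be illustrated with a minimal sketch. It only shows how a multidimensional recording might be cut into 8-s windows and how per-window predictions could be combined into one task-level label. The sampling rate, the non-overlapping stride, the 0.5 threshold, the majority-vote aggregation, and the function names `sliding_windows` and `task_level_label` are all assumptions made for illustration; the abstract does not state how the authors segmented recordings or aggregated window predictions, and the LSTM itself is not reproduced here.

```python
import numpy as np


def sliding_windows(signal, sample_rate_hz, window_s=8.0, stride_s=8.0):
    """Split a (T, D) time series into an (N, W, D) array of fixed-length windows.

    window_s = 8.0 matches the 8-s windows named in the abstract; the stride
    and the sampling rate are assumptions, not values from the paper.
    """
    signal = np.asarray(signal)
    win = int(round(window_s * sample_rate_hz))
    stride = int(round(stride_s * sample_rate_hz))
    if win <= 0 or stride <= 0:
        raise ValueError("window and stride must span at least one sample")
    # Only full windows are kept; a trailing partial window is dropped.
    starts = range(0, signal.shape[0] - win + 1, stride)
    windows = [signal[s:s + win] for s in starts]
    if not windows:
        return np.empty((0, win, signal.shape[1]))
    return np.stack(windows)


def task_level_label(window_probs, threshold=0.5):
    """Combine per-window expert probabilities into one label by majority vote.

    Majority voting is an assumed aggregation rule, chosen for simplicity.
    """
    votes = np.asarray(window_probs, dtype=float) >= threshold
    return "expert" if votes.mean() > 0.5 else "novice"


# Hypothetical 20-s recording at an assumed 30 Hz with 3 channels.
recording = np.zeros((600, 3))
windows = sliding_windows(recording, sample_rate_hz=30)
label = task_level_label([0.9, 0.8, 0.2])
```

In this sketch each window would be passed to a classifier separately, so a single recording yields several predictions, which is what makes sub-task-level assessment possible.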