Carolyn Chen1, Lee White2, Timothy Kowalewski3, Rajesh Aggarwal4, Chris Lintott5, Bryan Comstock6, Katie Kuksenok7, Cecilia Aragon7, Daniel Holst8, Thomas Lendvay9. 1. University of Washington, School of Medicine, Seattle, Washington. 2. Department of Bioengineering, University of Washington, Seattle, Washington. 3. Department of Mechanical Engineering, University of Minnesota, Minneapolis, Minnesota. 4. Department of Surgery, University of Pennsylvania, Philadelphia, Pennsylvania. 5. Department of Physics, University of Oxford, Oxford, United Kingdom. 6. Department of Biostatistics, University of Washington, Seattle, Washington. 7. Department of Computer Science & Engineering, University of Washington, Seattle, Washington. 8. University of Washington, School of Medicine, Seattle, Washington. Electronic address: dholst12@gmail.com. 9. Department of Urology, University of Washington, Seattle Children's Hospital, Seattle, Washington.
Abstract
BACKGROUND: Validated methods for the objective assessment of surgical skill are resource intensive. We sought to test a web-based grading tool that uses crowdsourcing, called Crowd-Sourced Assessment of Technical Skill. MATERIALS AND METHODS: Institutional Review Board approval was granted to compare the accuracy of Amazon.com's Mechanical Turk and Facebook crowdworkers against experienced surgical faculty in grading a recorded dry-laboratory robotic surgical suturing performance, using three performance domains from a validated assessment tool. Assessors' free-text comments describing their rating rationale were used to explore the relationship between the language used by the crowd and grading accuracy. RESULTS: On a global performance scale with a possible range of 3-15, 10 experienced surgeons graded the suturing video at a mean score of 12.11 (95% confidence interval [CI], 11.11-13.11). Mechanical Turk and Facebook graders rated the video at mean scores of 12.21 (95% CI, 11.98-12.43) and 12.06 (95% CI, 11.57-12.55), respectively. It took 24 h to obtain responses from 501 Mechanical Turk subjects, whereas it took 24 d for the 10 faculty surgeons to complete the 3-min survey. The 110 Facebook subjects responded within 25 d. Language analysis indicated that crowdworkers who used negation words (e.g., "but," "although," and so forth) scored the performance more closely in line with the experienced surgeons than crowdworkers who did not (P < 0.00001). CONCLUSIONS: For a robotic suturing performance, we have shown that surgery-naive crowdworkers using Crowd-Sourced Assessment of Technical Skill can rapidly produce skill assessments equivalent to those of experienced faculty surgeons. It remains to be seen whether crowds can discriminate between different levels of skill and can accurately assess live human surgical performances.
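To make the two analyses summarized above concrete, the following is a minimal Python sketch, not the study's actual pipeline: it computes a rater group's mean global performance score with a t-based 95% confidence interval, flags free-text rationales containing negation words, and compares how closely flagged versus unflagged raters match the expert mean. The toy data, the negation-word list, and the use of a Mann-Whitney U test are illustrative assumptions; the paper does not specify these details.

    # Sketch of (1) mean score with 95% CI per rater group and
    # (2) a negation-word comparison of rating accuracy.
    # Word list, toy data, and test choice are assumptions, not the study's methods.
    import re
    import numpy as np
    from scipy import stats

    NEGATION_WORDS = {"but", "although", "however", "though", "not"}  # assumed list

    def mean_with_ci(scores, confidence=0.95):
        """Return (mean, lower, upper) using a t-based confidence interval."""
        scores = np.asarray(scores, dtype=float)
        mean = scores.mean()
        sem = stats.sem(scores)
        lower, upper = stats.t.interval(confidence, df=len(scores) - 1,
                                        loc=mean, scale=sem)
        return mean, lower, upper

    def uses_negation(comment):
        """True if the free-text rationale contains any flagged word."""
        tokens = set(re.findall(r"[a-z']+", comment.lower()))
        return bool(tokens & NEGATION_WORDS)

    def compare_negation_groups(scores, comments, expert_mean):
        """Compare |score - expert mean| for raters with vs. without negation words."""
        deviations = np.abs(np.asarray(scores, dtype=float) - expert_mean)
        flagged = np.array([uses_negation(c) for c in comments])
        # One-sided test: are flagged raters' deviations smaller (more accurate)?
        return stats.mannwhitneyu(deviations[flagged], deviations[~flagged],
                                  alternative="less")

    if __name__ == "__main__":
        # Toy scores on the 3-15 global scale with matching rationale comments.
        turk_scores = [12, 13, 11, 12, 14, 12, 13, 11, 12, 12]
        comments = ["Smooth but slow needle handling", "Great control",
                    "Good, although tension varied", "Very precise", "Fast hands",
                    "Confident moves", "Nice suturing", "Steady camera work",
                    "Clean knots", "Efficient"]
        print(mean_with_ci(turk_scores))
        print(compare_negation_groups(turk_scores, comments, expert_mean=12.11))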