Analysis of executional and procedural errors in dry-lab robotic surgery experiments.

Kay Hutchinson¹, Zongyu Li¹, Leigh A Cantrell², Noah S Schenkman³, Homa Alemzadeh¹.

Abstract

BACKGROUND: Analysing kinematic and video data can help identify potentially erroneous motions that lead to sub-optimal surgeon performance and safety-critical events in robot-assisted surgery.
METHODS: We develop a rubric for identifying task and gesture-specific executional and procedural errors and evaluate dry-lab demonstrations of suturing and needle passing tasks from the JIGSAWS dataset. We characterise erroneous parts of demonstrations by labelling video data, and use distribution similarity analysis and trajectory averaging on kinematic data to identify parameters that distinguish erroneous gestures.
RESULTS: Executional error frequency varies by task and gesture, and correlates with skill level. Some predominant error modes in each gesture are distinguishable by analysing error-specific kinematic parameters. Procedural errors could lead to lower performance scores and increased demonstration times but also depend on surgical style.
CONCLUSIONS: This study provides insights into context-dependent errors that can be used to design automated error detection mechanisms and improve training and skill assessment.

Keywords: MIS; computer-assisted surgery; kinematics; minimal invasive surgery; robotic surgery; safety; training

Year: 2022 PMID: 35114732 PMCID: PMC9285717 DOI: 10.1002/rcs.2375

Source DB: PubMed Journal: Int J Med Robot ISSN: 1478-5951 Impact factor: 2.483

INTRODUCTION

With advances in sensing and computing technology, artificial intelligence, and data science, the next generation of robot‐assisted surgery (RAS) systems is envisioned to benefit from new capabilities for context‐specific monitoring and virtual coaching during simulation training as well as decision support and cognitive assistance during actual surgery to improve safety, efficiency, and quality of care. State‐of‐the‐art RAS systems and simulators are designed with data logging mechanisms to collect system logs, kinematics, and video data from surgical procedures. The recorded data has mostly been used for offline surgical skill evaluation, with the aim of improving surgeons' performance and making evaluations objective and scalable. Current methods for objective assessment of robotic technical skills can be classified into two general categories: manual assessment and automated assessment. Manual skill evaluation is usually performed globally, assessing performance over an entire demonstration using frameworks such as OSATS (Objective Structured Assessment of Technical Skills), GOALS (Global Operative Assessment of Laparoscopic Skills), GEARS, and R‐OSATS. However, manual assessment methods are subjective, cognitively demanding, and prone to errors. In response, automated assessment methods utilising kinematic, video, and system event data are being developed to provide objective and quantitative metrics and explainable feedback. Automated methods also allow demonstrations to be subdivided into subtasks or gestures, so that performance assessment and technical skill evaluation can be based on the quality and/or sequence of these components. Further, some gestures are more indicative of skill level than others.
The metrics used for surgical skill assessment can be classified into three broad categories: (i) efficiency (e.g., path length, completion time), (ii) safety (e.g., instrument collisions, instruments out of view, excessive force, needle drops, and tissue damage), and (iii) task/procedure specific metrics (e.g., task outcome metrics, camera movement, and energy activation). While most previous works focussed on skill evaluation for distinguishing between expertise levels, less attention has been paid to identifying specific erroneous surgical motions that contribute to sub‐optimal performance and potential safety‐critical events. The closest related works are Moorthy et al. and Guni et al., which proposed objective gesture‐based checklists for laparoscopic and robot‐assisted suturing. Others have proposed general and custom rubrics for evaluation of human errors and technical errors in laparoscopic surgery. Related works on errors in RAS mainly focussed on analysing adverse events and system malfunctions as reported by the surgical teams and institutions. Augmenting RAS systems and simulators with mechanisms for monitoring the progress of surgical tasks and providing early and context‐specific feedback to surgeons on potentially sub‐optimal or unsafe motions could help improve performance scores in training and prevent safety‐critical events in actual surgery. In this study, we take a step towards developing automated safety monitoring mechanisms for RAS by defining a rubric for identifying task and gesture‐specific errors based on video and kinematic data. We use this rubric to analyse recorded dry‐lab demonstrations of two common tasks (suturing and needle passing) performed on the da Vinci Surgical System (dVSS). We focus on identifying which parts of a trajectory (spanning one or more gestures) are potentially erroneous (sub‐optimal) versus error‐free (optimal).
We then characterise the erroneous trajectories by identifying the most common types of errors for each task and gesture, and the kinematic parameters and surgeon‐specific signatures that distinguish between optimal and sub‐optimal performance. The results from this study can aid in designing more efficient training modules, curricula, and simulation tools that reinforce optimal performance by providing detailed, quantitative, and context‐specific feedback to surgeons. In summary, we propose a novel framework for objective evaluation of RAS procedures with the following key contributions (our labels, code, and data are available at https://github.com/UVA-DSA/ExecProc_Error_Analysis):
(i) A task and gesture‐specific rubric for identification of executional and procedural errors using data collected from real or simulated surgical demonstrations.
(ii) A set of executional error labels based on manual annotation of video data to augment the suturing and needle passing tasks of the JIGSAWS dataset.
(iii) Quantitative analysis methods to characterise gesture‐specific executional and task‐specific procedural errors using pre‐collected kinematic data and gesture labels.
(iv) Insights on the types, frequencies, and durations of executional and procedural errors across tasks and gestures and their correlations with skill levels, which can provide a basis for the design of automated error detection mechanisms.

METHODS

Sources of errors in RAS are diverse and domain‐specific, including faults in the robotic system software and hardware, or human errors. In this study, we focus on errors in the execution of procedures that can be observed in video recordings and detected in kinematic data. Surgical procedures follow the hierarchy of levels defined in Neumuth et al. which provides context for actions during the operation, as shown in Figure 1. A surgical operation can involve multiple procedures which are divided into steps. Each step is subdivided into tasks comprised of gestures (also called sub‐tasks or surgemes) which are made of motions such as moving an instrument or closing the graspers. Errors can occur at any level of this hierarchy and can propagate and cause errors at other levels. We specifically focus on studying the quality of the task demonstrations at the gesture level to answer the following research questions:

FIGURE 1

Surgical hierarchy (adopted from ) for an example urological procedure of partial nephrectomy (based on and ) along with example gesture‐specific executional and procedural errors

RQ1: Which tasks and gestures are most prone to errors?
RQ2: Are there common error modes or patterns across gestures and tasks?
RQ3: Are erroneous gestures distinguishable from normal gestures?
RQ4: What kinematic parameters can be used to distinguish between normal and erroneous gestures?
RQ5: Do errors impact the duration of the trajectory?
RQ6: Are there any correlations between errors and surgical skill levels?

Rubric for objective assessment of errors in robotic surgery

Our goal is to define a rubric for identification of errors based on video and/or kinematic data, which can also be used for automated error detection using quantitative measures such as instrument position, amount of force, travelling distance, and system events. We adopt a previous categorisation of human errors in laparoscopic surgery and define two types of errors in our rubric: Procedural errors and Executional errors. Procedural errors involve ‘the omission or re‐arrangement of correctly undertaken steps within the procedure,’ while executional errors are ‘the failure of a specific motor task within the procedure.’ Technical errors are the ‘failure of a planned action to achieve a goal’, including inadequate (too much/too little) use of force or distance, inadequate visualization and wrong orientation of instruments or dissection plane, and are considered a subtype of Executional errors that can be quantified with thresholds. In order to generalise these definitions to different procedures and tasks, we define Executional and Procedural errors at the gesture level. More specifically, we define a set of Executional error modes for each gesture as listed in the rubric in Table 1. Some errors are gesture‐specific, such as ‘needle orientation’, which is only defined for G4 and G8 as those gestures specifically manipulate the needle in preparation for positioning the needle (G2) and throwing the next suture (G3), as shown in the grammar graph of Figure 1. The standard acceptable practice for those gestures is to hold the needle in the grasper half to two‐thirds of the way from the tip of the needle and with the needle perpendicular to the jaws of the grasper. Other gestures that do not purposely alter the orientation of the needle in the grasper cannot have this error mode.
For G3, the definition of a ‘multiple attempts’ error also includes ‘not moving along the curve’ of the needle since these two errors are very difficult to distinguish and often happen simultaneously. Other error modes, including ‘multiple attempts’, ‘needle drop’, and ‘out of view’, could occur at any time during a task and are considered for every gesture.

TABLE 1

Gesture‐specific executional errors for suturing and needle passing in the JIGSAWS dataset

			Suturing		Needle passing	
Gesture	Description	Error mode	Total number of errors	Erroneous gestures (%)	Total number of errors	Erroneous gestures (%)
G1	Reaching for needle with right hand	Multiple attempts	7	8/29 (28%)	N/A	11/30 (37%)
		Needle drop	0		2
		Out of view	1		10
G2	Positioning needle	Multiple attempts	21	22/166 (13%)	51	55/117 (47%)
		Needle drop	0		0
		Out of view	1		6
G3	Pushing needle through tissue	Not moving along the curve/multiple attempts	80	82/164 (51%)	17	17/111 (15%)
		Needle drop	0		0
		Out of view	2		0
G4	Transferring needle from left to right	Multiple attempts	19	71/119 (60%)	15	23/83 (28%)
		Needle orientation	53		9
		Needle drop	0		0
		Out of view	14		3
G5	Moving to centre with needle in grip	Needle drop	1	2/37 (5%)	0	3/31 (10%)
		Out of view	1		3
G6	Pulling suture with left hand	Multiple attempts	8	121/163 (74%)	14	46/112 (41%)
		Needle drop	2		0
		Out of view	120		37
G8	Orienting needle	Multiple attempts	18	28/48 (58%)	1	3/28 (11%)
		Needle orientation	22		1
		Needle drop	0		0
		Out of view	4		2
G9	Using right hand to help tighten suture	Multiple attempts	3	11/24 (46%)	1	1/1 (100%)
		Needle drop	0		1
		Out of view	11		0
All gestures	Total number of errors across all gestures	Multiple attempts	156	345/750 (46%)	99	159/513 (31%)
		Needle drop	3		3
		Needle orientation	75		10
		Out of view	154		61

Note: Example videos for each error mode can be found at https://www.youtube.com/watch?v=I7jQ6U9jaoc, https://www.youtube.com/watch?v=V-NJjgRu2OI, https://www.youtube.com/watch?v=-UNNWQ3j0yU, and https://www.youtube.com/watch?v=LhNg8uLRQzI.

We define Procedural errors as any deviation in the sequence of gestures performed in a demonstration from the standard accepted gesture sequences defined for that task and shown in the grammar graphs in Figures 1 and 2. Previous work defined several sub‐categories for procedural errors, including adding an unexpected step, skipping a step, out of order transition, and repetition of steps. These subcategories are included in our analysis of procedural errors as discussed in Section 2.4.

FIGURE 2

Overall methodology for analysis of executional and procedural errors

JIGSAWS dataset

The JHU‐ISI Gesture and Skill Assessment Working Set (JIGSAWS) is a publicly available dataset, collected using the Research API for the da Vinci Surgical System (dVSS) from eight surgeons of varying skill levels performing three dry‐lab surgical tasks: Suturing, Knot Tying, and Needle Passing. These tasks are among the standard modules in most surgical skills training curricula. The JIGSAWS dataset includes kinematic and video data from up to 39 demonstrations (or trials) per task along with manually annotated gesture transcripts (indicating the sequence of gestures, with the beginning and end of each gesture and its type) and surgical skill levels for each demonstration. The vocabulary of surgical gestures used for labelling is shown in Table 1. Surgical skills were characterised using both self‐proclaimed expertise levels and Global Rating Scale (GRS) score for each demonstration. Self‐proclaimed (SP) expertise levels were based on the number of hours of robotic surgical experience, divided into: SP‐Expert (>100 h), SP‐Intermediate (10–100 h), and SP‐Novice (<10 h). GRS scores were given using a modified Objective Structured Assessments of Technical Skills (OSATS) approach based on six elements (on a rating‐scale of 1–5 per element): Respect for tissue, suture/needle handling, time and motion, flow of operation, overall performance, and quality of final product. We also classified the demonstrations into three groups based on the GRS scores: GRS‐Novice (0 ≤ GRS ≤ 9), GRS‐Intermediate (10 ≤ GRS ≤ 19), and GRS‐Expert (20 ≤ GRS ≤ 30). Figure 2 shows our overall pipeline for the analysis of executional and procedural errors in the JIGSAWS dataset. Due to the limited number of demonstrations for the Knot Tying task in the dataset, our analysis only focussed on suturing and needle passing.
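The two skill groupings described above can be written as simple threshold functions. This is a minimal sketch, not the authors' code: the function names and the string labels are illustrative assumptions, and only the thresholds come from the text.

```python
def sp_group(robotic_hours: float) -> str:
    """Self-proclaimed expertise group from hours of robotic surgical experience."""
    if robotic_hours < 0:
        raise ValueError("hours must be non-negative")
    if robotic_hours < 10:
        return "SP-Novice"          # < 10 h
    if robotic_hours <= 100:
        return "SP-Intermediate"    # 10-100 h
    return "SP-Expert"              # > 100 h


def grs_group(grs_total: int) -> str:
    """GRS-based group from the total Global Rating Scale score."""
    if not 0 <= grs_total <= 30:
        raise ValueError("total GRS score is expected to lie in [0, 30]")
    if grs_total <= 9:
        return "GRS-Novice"         # 0 <= GRS <= 9
    if grs_total <= 19:
        return "GRS-Intermediate"   # 10 <= GRS <= 19
    return "GRS-Expert"             # 20 <= GRS <= 30
```

Note that the two groupings are independent of each other, so a demonstration by an SP‐Expert subject can still fall into a lower GRS group.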

Executional error analysis

Kinematic and video data for each trial were first segmented into gestures based on the gesture transcript annotations. The video clip for each gesture was then reviewed and labelled by two to three independent annotators (with experience in robotic surgery and/or suturing) as normal or erroneous for each error mode. Final labels for each error mode were obtained by taking the consensus among annotators. A gesture example that exhibited one or more errors was marked as erroneous, otherwise, it was labelled as normal. We then proceeded with the analysis of the patient‐side manipulator (PSM) kinematic data corresponding to each gesture for all the normal and erroneous demonstrations of each task.
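The labelling step can be sketched as follows. The text only states that a consensus among annotators was taken; the strict majority vote used here (ties resolve to "no error") and the error mode identifiers are assumptions made for illustration.

```python
# Hypothetical identifiers for the error modes of Table 1.
ERROR_MODES = ("multiple_attempts", "needle_drop", "needle_orientation", "out_of_view")


def consensus_labels(annotations):
    """Combine per-annotator labels for one gesture clip.

    `annotations` is a list with one dict per annotator, mapping an error
    mode to True/False. A mode is kept only if a strict majority of the
    annotators marked it (an assumed consensus rule).
    """
    n = len(annotations)
    return {
        mode: 2 * sum(bool(a.get(mode, False)) for a in annotations) > n
        for mode in ERROR_MODES
    }


def is_erroneous(labels):
    """A gesture is erroneous if it exhibits one or more error modes."""
    return any(labels.values())
```

For example, if two of three annotators mark ‘out of view’ and only one marks ‘needle drop’, the gesture is labelled erroneous with the single error mode ‘out of view’.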

Dynamic time warping

We used dynamic time warping (DTW) to measure the similarity between normal and erroneous trajectories for each gesture. DTW is an effective method for aligning two temporal sequences, independent of the non‐linear variations in time, by minimising the Euclidean distance between the two signals. In our analysis, we performed independent DTW on each variable before summing the returned distances for each parameter listed in Table 2. We found no significant difference between this method and dependent DTW, where all variables in each parameter group were warped together yielding a single distance instead of a sum of distances (similar observations were made in Shokoohi‐Yekta et al.). DTW was performed on every combination of two example trajectories for each gesture. From this, we obtained comparisons of normal examples to other normal examples (‘Nor‐Nor’) and comparisons of erroneous examples to normal examples (‘Err‐Nor’). For each parameter, this resulted in two sets of DTW distance samples, one per comparison subset, each representing a distribution of distances as shown in the histogram of Figure 2.
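The independent DTW comparison can be sketched in plain Python. This is an illustrative sketch under stated assumptions rather than the implementation used in the study: the function names are hypothetical, each trajectory is assumed to be a list of per‐variable sequences for one parameter group (for example the x, y, and z components of a position), and no warping window or step constraint is applied.

```python
import math
from itertools import combinations


def dtw_distance(a, b):
    """Classic DTW between two 1-D sequences with absolute-difference cost."""
    n, m = len(a), len(b)
    prev = [math.inf] * (m + 1)
    prev[0] = 0.0
    for i in range(1, n + 1):
        cur = [math.inf] * (m + 1)
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor cells.
            cur[j] = cost + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return prev[m]


def independent_dtw(traj_a, traj_b):
    """Warp each variable separately and sum the per-variable distances."""
    if len(traj_a) != len(traj_b):
        raise ValueError("trajectories must have the same number of variables")
    return sum(dtw_distance(va, vb) for va, vb in zip(traj_a, traj_b))


def pairwise_distances(normal, erroneous):
    """Return the 'Nor-Nor' and 'Err-Nor' distance samples for one parameter."""
    nor_nor = [independent_dtw(x, y) for x, y in combinations(normal, 2)]
    err_nor = [independent_dtw(e, x) for e in erroneous for x in normal]
    return nor_nor, err_nor
```

The two returned lists are the samples whose distributions are compared in the next step.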

TABLE 2

Kinematic variables in the JIGSAWS dataset (adopted from )

Index	Description of variables	Parameter name
39–41	Right PSM1 tool tip position (xyz)	R Pos
42–50	Right PSM1 tool tip rotation matrix (R)	R Rot Mat
51–53	Right PSM1 tool tip linear velocity (x′ y′ z′)	R Lin Vel
54–56	Right PSM1 tool tip rotational velocity (α′ β′ γ′)	R Rot Vel
57	Right PSM1 gripper angle (Θ)	R Grip Ang
58–60	Left PSM2 tool tip position (xyz)	L Pos
61–69	Left PSM2 tool tip rotation matrix (R)	L Rot Mat
70–72	Left PSM2 tool tip linear velocity (x′ y′ z′)	L Lin Vel
73–75	Left PSM2 tool tip rotational velocity (α′ β′ γ′)	L Rot Vel
76	Left PSM2 gripper angle (Θ)	L Grip Ang

Kullback–Leibler divergence

Kullback–Leibler (KL) divergence, also called relative entropy, is a non‐symmetric measure of the difference between two probability distributions. The KL divergence between two identical distributions is zero. As shown in Equation (1), KL divergence was used to compare the ‘Err‐Nor’ and ‘Nor‐Nor’ DTW distance distributions for each gesture to determine which parameters had a significant difference between the two distributions.
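For reference, the standard discrete form of the KL divergence is D_KL(P ‖ Q) = Σ_x P(x) log(P(x)/Q(x)). The sketch below applies it to two sets of DTW distance samples by binning them into a shared histogram. Several details are assumptions made for illustration and are not stated in the text: the choice of P as ‘Err‐Nor’ and Q as ‘Nor‐Nor’, the bin edges, the natural logarithm, and the small constant `eps` that avoids division by zero in empty bins.

```python
import math


def histogram(samples, edges):
    """Count samples per bin; bins are [edges[k], edges[k+1]), last bin closed."""
    counts = [0] * (len(edges) - 1)
    last = len(counts) - 1
    for s in samples:
        for k in range(len(counts)):
            if edges[k] <= s < edges[k + 1] or (k == last and s == edges[-1]):
                counts[k] += 1
                break
    return counts


def kl_divergence(p_counts, q_counts, eps=1e-9):
    """D_KL(P || Q) for two histograms defined over the same bins."""
    if len(p_counts) != len(q_counts):
        raise ValueError("histograms must share the same bins")
    p = [c + eps for c in p_counts]   # smoothing (assumption)
    q = [c + eps for c in q_counts]
    p_sum, q_sum = sum(p), sum(q)
    return sum(
        (pi / p_sum) * math.log((pi / p_sum) / (qi / q_sum))
        for pi, qi in zip(p, q)
    )
```

Swapping the two arguments generally gives a different value, which is the non‐symmetry noted above.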

Trajectory averaging

We examined the kinematic data for important parameters to verify differences between normal and erroneous gestures using a method based on prior work. Each signal was time‐normalized by downsampling it by a factor of 3 (keeping only every third sample) and then linearly interpolating it to stretch it to the average duration of the normal or erroneous gesture examples of that task (supported by our analysis of gesture durations in Section 3.1.4). Then, fuzzy c‐means clustering was performed on each variable and its normalized time index to obtain the average normal and erroneous trajectories (represented by 15 cluster centres), shown with blue (normal) and red (erroneous) dots in Figure 4B.
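The time normalisation step (downsampling followed by linear interpolation) can be sketched as below. The function names are hypothetical, and expressing the target duration as a number of samples, `target_len`, is an assumption; the clustering step is not shown.

```python
def downsample(signal, factor=3):
    """Keep only every `factor`-th sample."""
    return signal[::factor]


def resample_linear(signal, target_len):
    """Linearly interpolate `signal` onto `target_len` evenly spaced points."""
    n = len(signal)
    if n == 0 or target_len < 1:
        raise ValueError("need a non-empty signal and a positive target length")
    if n == 1 or target_len == 1:
        return [float(signal[0])] * target_len
    out = []
    for k in range(target_len):
        pos = k * (n - 1) / (target_len - 1)   # fractional index into signal
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append(signal[lo] * (1 - frac) + signal[hi] * frac)
    return out


def time_normalise(signal, target_len, factor=3):
    """Downsample, then stretch or shrink to the common target length."""
    return resample_linear(downsample(signal, factor), target_len)
```

After this step all examples of a gesture share the same length, so a time index can be paired with each sample before clustering.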

FIGURE 4

G1 in suturing: (A) KL divergence of kinematic parameters, (B) right gripper angle trajectories for normal and erroneous gestures

Procedural error analysis

Previous works proposed modelling the standard acceptable gesture sequences for a task using a grammar graph that shows the relationship, order, and flow of gestures. The grammar graph of a task is a digraph with the vertices representing the set of gestures for the task and each edge representing a common transition between two gestures. We adopted the grammar graphs for suturing and needle passing from prior work and included an additional directed link from G1 to G2 in suturing (see Figures 1 and 2). We acquired the gesture sequences performed for suturing and needle passing from the JIGSAWS transcripts. Then we developed a method for checking if each gesture sequence follows the standard acceptable sequence of gestures in the grammar graph. As shown in Algorithm 1, for each gesture we check if it is in the grammar graph for the task and if it is a valid successor of the previous gesture; otherwise, it is marked as a procedural error. Each transcript can have multiple, possibly sequential, procedural errors. This algorithm, combined with a gesture segmentation algorithm, can be used for automated detection of procedural errors in real‐time. Deviations from the grammar graph might also happen because of variations in surgical style and expertise, as discussed in Section 3.2.
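The check described above can be sketched as a walk over an adjacency map. This is an illustration of the idea, not a transcription of Algorithm 1: the function name and the "Start" and "End" markers are assumptions, and `TOY_GRAMMAR` is a made‐up graph for demonstration, not the suturing or needle passing grammar graph.

```python
# Hypothetical toy grammar: gesture -> set of valid successor gestures.
TOY_GRAMMAR = {
    "Start": {"G1"},
    "G1": {"G2"},
    "G2": {"G3"},
    "G3": {"G6"},
    "G6": {"G4", "G11"},
    "G4": {"G2"},
    "G11": {"End"},
}


def find_procedural_errors(sequence, grammar, start="Start"):
    """Return (index, previous gesture, gesture) for each invalid transition."""
    known = set(grammar) | {g for succ in grammar.values() for g in succ}
    errors = []
    prev = start
    for idx, gesture in enumerate(sequence):
        valid_successors = grammar.get(prev, set())
        # Flag gestures outside the graph and transitions that have no edge.
        if gesture not in known or gesture not in valid_successors:
            errors.append((idx, prev, gesture))
        prev = gesture
    return errors
```

With the toy graph, the sequence G1, G3, G6, G11 yields one error at index 1 because the graph has no edge from G1 to G3. Since only the previous gesture is needed, the same check can run incrementally as gestures are segmented.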

RESULTS

This section presents a summary of results and observations from our analysis of the JIGSAWS dataset.

Executional errors

Table 1 lists the number of examples of each error mode as well as the total number of erroneous examples for each gesture. Note that a gesture example could exhibit multiple error modes, so the sum of the total number of errors does not necessarily equal the number of erroneous gestures.

Distribution of executional errors among gestures

Figure 3 shows the distribution of errors of each type for each gesture from Table 1. If a gesture example had more than one error label, it was counted under the ‘multiple errors’ category.

FIGURE 3

Distribution of errors for each gesture

We made the following observations:
G5 for both tasks and G1, G8, and G9 for Needle Passing did not have enough examples of executional errors, so further analysis was not performed on these gestures.
G8 from needle passing and G5 from both tasks had the lowest percentage of errors, possibly because they are less challenging than other gestures.
G2 and G3 have the most ‘multiple attempts’ errors in both tasks because they require a high level of accuracy in positioning and driving the needle through the tissue, respectively. G2 has more errors in Needle Passing because the eye of the ring is a smaller target than the dot on the fabric. G3 has more errors in Suturing because surgeons often tried multiple times to align the tip of the needle with the exit point while the needle was not visible beneath the fabric. By comparison, in needle passing the needle only had to pass through one point and was always visible.
G4 and G6 from both tasks, and G8 from Suturing, have the most gestures with multiple errors. G4 and G8 both involve manipulating the needle between the graspers, and the predominant error modes were ‘needle orientation’ and ‘multiple attempts’, likely due to issues with hand coordination. For G6, the main error modes were ‘out of view’ and ‘multiple attempts’, due to multiple attempts at grasping the needle and pulling it through the ring or tissue and then moving off‐camera to pull the suture through.
G6 has a large number of ‘out of view’ errors, especially in suturing, possibly because surgeons could not move the camera for the trials in the JIGSAWS dataset. However, a different technique to pull the suture, such as hand‐over‐hand or the pulley method, could have been used to keep the tools within view.

Kinematic parameters for distinguishing errors in each gesture

We performed a comparative analysis of KL divergence values for parameters in each gesture and identified the kinematic parameters that are associated with error occurrence as listed in Table 3. The following are key observations from this analysis:

TABLE 3

Kinematic parameters with the greatest KL divergence distinguishing errors in different gestures

Task	Gesture	Parameters
Suturing	G1	Right gripper angle
		Right linear velocity
		Right position
	G3	Right linear velocity
		Right rotational velocity
		Right gripper angle
	G6	Left position
	G8	Right position
		Left gripper angle
		Left linear velocity
		Right gripper angle
	G9	Left gripper angle
Needle passing	G2	Left rotational velocity
		Left linear velocity
	G3	Left rotational velocity
		Right rotation matrix
		Right gripper angle

For G1 in suturing, the predominant error mode was ‘multiple attempts’ at picking up the needle. Figures 4B and 5 show that erroneous gestures exhibited a second opening and closing of the grasper and a large difference in Y position trajectories. This explains the large KL divergences for those right hand parameters in Figure 4A.

FIGURE 5

Right tooltip XYZ position for normal and erroneous G1 in suturing

For G2 in needle passing, Figure 6 shows a large difference in KL divergence for left rotational and linear velocities which may be due to the active role the left hand plays in stabilizing the ring unlike in suturing. This is an important contextual difference between tasks.

FIGURE 6

KL divergence of kinematic parameters for G2

The main error mode for G3 was ‘not moving along the curve/multiple attempts’. Erroneous gestures in suturing were caused by lateral, instead of characteristically rotational, movements of the needle while in the fabric. In surgery, lateral movements may tear tissue and contribute to a safety‐critical event. This explains the high KL divergences for the parameters listed in Table 3 and shown in Figure 7A, and is consistent with prior work which found that the rate of orientation change during needle insertion (i.e., rotational velocity during G3) was higher for experienced surgeons.

FIGURE 7

KL divergence of kinematic parameters for G3

However, needle passing shows nearly the opposite result in Figure 7B. Upon reviewing the gesture clips for both tasks, we noticed that clips for suturing showed the right grasper driving the needle through the fabric and the left grasper pulling it through, but clips for needle passing began with the needle halfway through the ring and only showed the left grasper pulling the needle through. Given the large difference in KL divergences between the two tasks, we see that the part of G3 that involves driving the needle with the right grasper is important to this gesture's correct execution.
In both tasks, G4 had KL divergences below 0.6 for all parameters, meaning that normal and erroneous examples have very similar kinematics.
G6 in suturing had the most errors, primarily ‘out of view’ errors. Figure 8 shows that final Y and Z positions for the left grasper were much larger for erroneous gestures as the left grasper exceeded the threshold for visibility while pulling the suture. This explains the large KL divergence for the left position parameter in Figure 9.

FIGURE 8

Left Tooltip XYZ position for normal and erroneous G6 in suturing

FIGURE 9

KL divergence of kinematic parameters for G6 in suturing

There were two main error modes for G8 in suturing: ‘multiple attempts’ and ‘needle orientation’. Figure 10 shows a comparison of DTW and KL divergence analysis for G8 from suturing for all errors, for ‘multiple attempts’ versus all other examples, and for ‘needle orientation’ versus all other examples. The ‘needle orientation’ error alone had the greatest KL divergence and contributed the most to the results for all errors. For the ‘multiple attempts’ error, both the left and right position parameters had the highest KL divergence which suggests that hand coordination is important in this gesture. Since this gesture includes the right gripper moving to grasp the needle, we see that right position is an important parameter in the ‘multiple attempts’ error both in G1 and G8.

FIGURE 10

KL divergence of kinematic parameters for G8 in suturing

Executional errors and skill levels

We analysed the relationship between executional errors and surgical skill levels. Based on self‐proclaimed expertise levels, Figure 11A shows a clear difference in the number of errors across different self‐proclaimed expertise groups for Suturing. However, no similar pattern was seen in needle passing. This might be because suturing is a more difficult task so the number of executional errors is more reflective of self‐proclaimed skill levels in suturing. For GRS‐defined skill levels, the total number of executional errors per trial was larger for GRS‐Novices than for GRS‐experts in needle passing (Figure 11B), which is consistent with our expectation that experts with high GRS scores make fewer executional errors than novices. However, since there was only one GRS‐Novice trial for Suturing, we did not observe clear differences.

FIGURE 11

Total number of executional errors across surgical skill levels: (A) Self‐proclaimed skill levels in suturing, (B) GRS skill levels in needle passing

Executional errors and gesture duration

We compared erroneous and normal gesture durations using a one‐tailed t‐test. The null hypothesis is that normal and erroneous gestures have similar durations. The alternative hypothesis is that erroneous gestures are longer than normal gestures. Figures 12 and 13 respectively show average durations and several examples of differences in durations (along with the p‐values from the hypothesis test) for normal and erroneous gestures in both tasks.

FIGURE 12

Average normal and erroneous gesture durations

FIGURE 13

Erroneous versus normal gesture durations for suturing and needle passing

We observed that some error types increase the gesture duration, for example, ‘multiple attempts’ for G1, G2, G3, and G8 in suturing, and G2 and G3 in needle passing; and ‘out of view’ for G6 and G9 in suturing, and G4 in needle passing. Erroneous gestures with ‘out of view’ errors are longer because the distance travelled by the tool is larger, while the speed is similar. We rejected the null hypothesis and found that erroneous gestures are longer than normal gestures for all gestures of both tasks. There is a relatively large p‐value (p = 0.308) for G4 compared to other p‐values. This could be because ‘needle orientation’ is the primary error mode in G4 and an erroneous needle orientation takes a comparable amount of time to a normal needle orientation.

Executional errors and trial duration

For each trial, we summed the executional errors of all gestures in the trial. Then we analysed the correlation between the total number of executional errors per trial and the duration of the trial (in number of frames). Figure 14 shows that there is a significant positive correlation for suturing (r = 0.837, p = 6.18e−12), but no significant correlation for needle passing. This is likely due to the limited number of trials and fewer errors for needle passing in the JIGSAWS dataset (see Table 1).
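The per‐trial correlation can be sketched as follows. The text reports a correlation coefficient r without naming the estimator, so the use of Pearson's r here is an assumption, and the function name is hypothetical. Inputs are two equal‐length lists with one entry per trial: the total number of executional errors and the trial duration in frames.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)
```

A value near 1 indicates that trials with more executional errors tend to be longer; a significance test on r would be needed to obtain the associated p‐value.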

FIGURE 14

Correlation between executional errors and durations of trials

Procedural errors

We analysed the numbers and patterns of procedural errors by task, skill level, and subject. We hypothesise that the number of procedural errors is inversely related to surgical experience and positively correlated with the demonstration duration.

Procedural errors and self‐proclaimed skill levels

We compared the percentage of erroneous trials for SP‐Novice, SP‐Intermediate, and SP‐Expert groups. As shown in Table 4, we observe that for both tasks, SP‐Expert surgeons on average had more procedural errors compared to SP‐Intermediate surgeons. For Needle Passing, SP‐Intermediate surgeons made more errors than SP‐Novice surgeons. This could be due to variations in surgical style especially in more experienced surgeon groups. For example, our analysis of error patterns by subject showed that one of the SP‐Expert subjects consistently made G9–G11 transitions in different trials of suturing (see Figure 1). This is a unique non‐safety‐critical pattern that was not observed in the trials by other subjects. However, procedural errors by SP‐Novice subjects were more random and did not follow specific patterns.

TABLE 4

Procedural errors and self‐proclaimed skill levels

Task—skill level	Total number of procedural errors	Percentage of erroneous trials	Longest erroneous gesture sequences
Suturing—SP‐Expert	11	6/10	G9–G6–G2
Suturing—SP‐Intermediate	2	2/10	G3–G11
Suturing—SP‐Novice	23	10/19	G4–G5–G6–G2
Needle passing—SP‐Expert	11	6/9	G2–G6–G10
Needle passing—SP‐Intermediate	9	5/8	G6–G8–G6
Needle passing—SP‐Novice	7	4/11	G6–G5–G6
Total	63	33/67	–

Across the two tasks, the longest erroneous gesture sequence is G4–G5–G6–G2, performed in suturing by an SP‐Novice surgeon. Upon review of the video, G5 may be a typo in the transcript.
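
The per-task and overall totals in Table 4 can be re-derived from its rows. The sketch below copies the counts from Table 4; the dictionary layout and the helper name `summarise` are illustrative assumptions, not part of the original analysis.

```python
# (task, skill level) -> (procedural errors, erroneous trials, total trials),
# copied from Table 4.
table4 = {
    ("Suturing", "SP-Expert"): (11, 6, 10),
    ("Suturing", "SP-Intermediate"): (2, 2, 10),
    ("Suturing", "SP-Novice"): (23, 10, 19),
    ("Needle passing", "SP-Expert"): (11, 6, 9),
    ("Needle passing", "SP-Intermediate"): (9, 5, 8),
    ("Needle passing", "SP-Novice"): (7, 4, 11),
}


def summarise(rows):
    """Sum errors, erroneous trials, and total trials over a list of rows."""
    errors = sum(row[0] for row in rows)
    erroneous = sum(row[1] for row in rows)
    total = sum(row[2] for row in rows)
    return errors, erroneous, total


# Group the rows by task, ignoring the skill level.
rows_by_task = {}
for (task, _level), row in table4.items():
    rows_by_task.setdefault(task, []).append(row)

summary = {task: summarise(rows) for task, rows in rows_by_task.items()}
overall = summarise(list(table4.values()))

for task, (errors, erroneous, total) in summary.items():
    print(f"{task}: {errors} errors, {erroneous}/{total} erroneous trials "
          f"({100 * erroneous / total:.1f}%)")
print("Total:", overall)
```

The grouped counts reproduce the Total row of Table 4 (63 errors, 33/67 trials).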

Procedural errors and GRS skill levels

We analysed the correlation between the number of procedural errors and the GRS score and its sub‐scores (Table 5). The negative correlations are strongest in suturing, where overall performance has the strongest negative correlation with procedural errors (−0.62). This could be because an inefficient procedure has the greatest impact on overall performance in suturing. Needle passing shows weaker negative correlations, none of which is statistically significant (all p > 0.18). Within needle passing, the suture and needle handling sub‐score has the strongest negative correlation with the number of procedural errors (−0.26). This is expected, as needle handling is the main component of the needle passing task and poor performance due to procedural errors will lead to a lower score.

TABLE 5

Correlation between number of procedural errors and GRS sub‐scores for suturing and needle passing

GRS sub‐score	Suturing		Needle passing
	Correlation coefficient	p value	Correlation coefficient	p value
Respect for tissue	−0.41	0.009	−0.12	0.528
Suture and needle handling	−0.50	0.001	−0.26	0.184
Time and motion	−0.55	<0.001	−0.11	0.594
Flow of operation	−0.43	0.006	−0.22	0.268
Overall performance	−0.62	<0.001	−0.16	0.412
Quality of final product	−0.26	0.115	−0.02	0.920
GRS score	−0.51	<0.001	−0.15	0.434

Procedural errors and trial duration

In suturing, there is a significant positive correlation between the number of procedural errors and trial duration, so trials with more procedural errors tend to be longer. In needle passing the correlation is not significant, possibly because needle passing is an easier task (Table 6).

TABLE 6

Correlation between procedural errors and trial durations

Task	r	p value
Suturing	0.71	<0.001
Needle passing	0.17	0.399
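
As a rough plausibility check on Table 6, the correlation coefficients can be converted to t‐statistics. This sketch assumes a Pearson correlation tested with the usual t‐test on n − 2 degrees of freedom, and assumes that all 39 suturing and 28 needle passing trials counted in Table 4 were used; neither assumption is stated in the text. The function name `t_statistic` is hypothetical.

```python
import math


def t_statistic(r, n):
    """t-statistic for testing a Pearson correlation r from n samples against zero."""
    if n < 3:
        raise ValueError("need at least 3 samples")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# r values from Table 6; n values are assumed from the trial counts in Table 4.
t_suturing = t_statistic(0.71, 39)
t_needle_passing = t_statistic(0.17, 28)
print(round(t_suturing, 2), round(t_needle_passing, 2))
```

Under these assumptions the suturing statistic lies far above the two‐sided 5% critical value of about 2.03 for 37 degrees of freedom, while the needle passing statistic lies well below the corresponding value of about 2.06 for 26 degrees of freedom, which agrees with the significance pattern reported in the table.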

DISCUSSION

We used our insights from the analysis of executional and procedural errors in the JIGSAWS dataset to answer the research questions posed in Section 2:

RQ1: Which tasks and gestures are most prone to errors? More challenging gestures in each task that require a high level of accuracy and hand coordination were more prone to executional errors. As shown in Table 1, suturing is more difficult than needle passing and had a greater number of executional errors. G6, G3, and G4 had the greatest number of executional errors in suturing, while G2 and G6 had the greatest number of executional errors in needle passing. However, procedural errors were almost equally likely in both tasks: 18/39 suturing trials and 15/28 needle passing trials contained procedural errors (Table 4).

RQ2: Are there common error modes or patterns across gestures and tasks? Within each task, each gesture had a different predominant error mode that correlated with the challenging aspects of performing the gesture. For both tasks, G2 and G3 had a large number of ‘multiple attempts’ errors, G5 had the fewest errors, G6 had the largest number of ‘out of view’ errors, and G4 and G6 had the greatest number of gestures with multiple errors. Thus, executional errors are context‐specific and their type and frequency depend on both task and gesture.

RQ3: Are erroneous gestures distinguishable from normal gestures? KL divergence magnitude provides insight into which gestures have the greatest difference between normal and erroneous examples. We found that G9 from suturing, G2 from needle passing, and G3 from suturing had the three greatest KL divergences for any parameter. However, upon examination of kinematic data for the left gripper angle of G9 from suturing, the large KL divergence for this gesture could be due to the effect of three outlying gestures on an already relatively small sample of only 24 examples.

RQ4: What kinematic parameters can be used to distinguish between normal and erroneous gestures? Table 3 lists the parameters with the greatest KL divergence for each gesture and task, which can be used to develop automated error detectors. Focussing on a subset of variables for a given task and gesture may enable and improve real‐time error detection and skill assessment by reducing processing time and providing context. Our KL divergence analysis approximated the DTW distance distributions as Gaussian, which might not always be accurate. Future work will focus on further refining our analysis method to address this limitation.

RQ5: Do errors impact the duration of the trajectory? Executional and procedural errors often lead to lengthier trials, especially during more complicated tasks such as suturing. Timely detection and correction during training or surgery will enable more efficient and safer patient care, and aid in reducing learning curves and time to certification.

RQ6: Are there any correlations between errors and surgical skill levels? The total number of executional errors made per trial could help differentiate skill levels. We found this to be true for self‐proclaimed skill levels in suturing and GRS skill levels in needle passing. In suturing, there was a significant negative correlation between the total number of procedural errors made per trial and the overall GRS score and most of its sub‐scores, meaning that a greater number of procedural errors contributes to a lower GRS score. After examining procedural error patterns, we noticed that self‐proclaimed novice surgeons tend to closely follow the grammar graph, but experts have unique signatures that deviate from the graph. This motivates developing automated gesture identification and procedural error detection techniques based on grammar graphs for training novice surgeons in simulation experiments.

Further verification of the correlation between errors and skill levels requires access to larger datasets representing more tasks and surgeons. Additionally, the grammar graphs cannot completely capture all possible valid gesture sequences and surgeon‐specific signatures, and manual labelling may introduce errors in the gesture transcripts (e.g., incorrectly adding or missing some gestures) that might lead to incorrect detection of errors.
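
The Gaussian approximation noted under RQ4 admits a closed‐form KL divergence, which the sketch below illustrates. It is not the authors' implementation: the direction of the divergence (erroneous relative to normal), the way DTW distances are collected, and the names `fit_gaussian` and `gaussian_kl` are assumptions, and the sample values are made up.

```python
import math
import statistics


def fit_gaussian(samples):
    """Return the sample mean and sample standard deviation of a list of values."""
    return statistics.fmean(samples), statistics.stdev(samples)


def gaussian_kl(mu_p, sigma_p, mu_q, sigma_q):
    """KL(P || Q) for univariate Gaussians P = N(mu_p, sigma_p^2), Q = N(mu_q, sigma_q^2)."""
    if sigma_p <= 0 or sigma_q <= 0:
        raise ValueError("standard deviations must be positive")
    return (math.log(sigma_q / sigma_p)
            + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2 * sigma_q ** 2)
            - 0.5)


# Hypothetical DTW distances for one kinematic parameter of one gesture.
normal_dists = [1.0, 1.2, 0.9, 1.1, 1.0]
erroneous_dists = [1.8, 2.4, 2.0, 2.9, 2.2]

# Fit one Gaussian per group, then compare erroneous (P) against normal (Q).
kl = gaussian_kl(*fit_gaussian(erroneous_dists), *fit_gaussian(normal_dists))
print(round(kl, 2))
```

Because the sample mean and standard deviation are sensitive to outliers, a few extreme gestures in a small sample can inflate the divergence, as discussed for G9 under RQ3.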

CONCLUSION

We presented a new rubric and method for objective evaluation of RAS procedures with a focus on gesture and task‐specific executional and procedural errors. We used the proposed rubric to evaluate dry‐lab demonstrations of Suturing and Needle Passing tasks. Our analysis identified the most common error modes and their correlations with skill levels and demonstration times as well as important error‐specific kinematic parameters that distinguish erroneous gestures. This study is a step towards developing methods for automated error detection and providing real‐time context‐dependent feedback for performance improvement. Future work will extend the error rubric and analytic methods to larger datasets and other surgical tasks.

CONFLICT OF INTEREST

The authors declare that they have no conflict of interest.
