
Evaluating validity evidence for 2 instruments developed to assess students' surgical skills in a simulated environment.

Robin M Farrell¹, Gregory E Gilbert²,³, Larry Betance⁴, Jennifer Huck⁵, Julie A Hunt⁶, James Dundas⁷, Eric Pope⁴.

Abstract

OBJECTIVE: To gather and evaluate validity evidence in the form of content and reliability of scores produced by 2 surgical skills assessment instruments: 1) a checklist, and 2) a modified form of the Objective Structured Assessment of Technical Skills (OSATS) global rating scale (GRS).
STUDY DESIGN: Prospective randomized blinded study.
SAMPLE POPULATION: Veterinary surgical skills educators (n = 10) evaluated content validity. Scores from students in their third preclinical year of veterinary school (n = 16) were used to assess reliability.
METHODS: Content validity was assessed using Lawshe's method to calculate the Content Validity Index (CVI) for the checklist and modified OSATS GRS. The importance and relevance of each item was determined in relation to skills needed to successfully perform supervised surgical procedures. The reliability of scores produced by both instruments was determined using generalizability (G) theory.
RESULTS: Based on the results of the content validation, 39 of 40 checklist items were included. The 39-item checklist CVI was 0.81. One of the 6 OSATS GRS items was included. The 1-item GRS CVI was 0.80. The G-coefficients for the 40-item checklist and 6-item GRS were 0.85 and 0.79, respectively.
CONCLUSION: Content validity was very good for the 39-item checklist and good for the 1-item OSATS GRS. The reliability of scores from both instruments was acceptable for a moderate stakes examination.
IMPACT: These results provide evidence to support the use of the checklist described and a modified 1-item OSATS GRS in moderate stakes examinations when evaluating preclinical third-year veterinary students' technical surgical skills on low-fidelity models.

Year: 2022 PMID: 35261056 PMCID: PMC9314123 DOI: 10.1111/vsu.13791

Source DB: PubMed Journal: Vet Surg ISSN: 0161-3499 Impact factor: 1.618

INTRODUCTION

Veterinary graduates are expected to be competent in basic surgical skills. The preclinical veterinary surgical skills curriculum is continuously evolving, with educators incorporating models and new methods of clinical skills training to ensure students attain competency in core skills. As clinical skills training programs evolve, so do the assessment instruments used to evaluate educational interventions and students' performance on models, cadavers, and live animals. While veterinary educators have made significant progress in developing clinical skills assessments, relatively few reports including validity evidence for instruments to assess veterinary students' basic surgical skills using models have been published. The use of modified forms of the Objective Structured Assessment of Technical Skills (OSATS) global rating scale (GRS), or another validated assessment instrument adapted from medical education, has also been explored, but to date, limited validity evidence has been reported in the literature to support their use in veterinary education.
Validity is “the degree to which evidence and theory support the interpretations of test scores entailed by the proposed uses of the tests.” Validity evidence for assessments informs the interpretation of results and decisions that educators and curriculum committees make regarding the consequences those results have for the students and program. For instance, a must‐pass clinical skills examination required to complete a course would be considered a high‐stakes examination with significant consequences for the student and program if students did not pass or, perhaps even more crucially, passed despite a lack of competence. High‐stakes examinations can be defined as those that either allow or prevent students from progressing in their program, such as examinations that serve as progression hurdles. Creating a strong validity argument using validity evidence to support the high‐stakes nature of an examination instills confidence in both the assessor and student that the assessment and the scores produced are valid and reliable.
The OSATS, a generally accepted assessment instrument used for human surgical residents, mimics an objective structured clinical examination (OSCE), except that it consists of a benchtop model simulating a procedure or task, as opposed to an individual skill or part of a task. Learners' performance on an OSATS station is typically evaluated using a customized checklist for the task and a GRS suitable for any surgical task. Hatala et al. (2015) found reasonable validity evidence supporting the use of the OSATS in a low stakes environment to provide formative feedback to physicians in surgical residency. A low stakes environment or assessment is one that does not impact students' progression through a program and is usually formative in nature, where outcomes such as rubric scores are used to provide feedback to help students improve their performance.
Literature in veterinary and medical education has reviewed and evaluated simulation‐based instructional design and assessments and has suggested that better designed research projects are needed to collect data supporting the use of specific training methods, simulators, and assessments. To collect accurate data, validated assessment instruments must be identified or developed to facilitate larger research projects, across multiple veterinary schools, that can answer questions about simulation‐based teaching and assessment in veterinary education.
The aim of this study was to gather and evaluate validity evidence in the areas of content and reliability of scores produced when using a task‐specific checklist assessment instrument developed at Ross University School of Veterinary Medicine (RUSVM) and an OSATS GRS assessment instrument adapted from medical education to assess preclinical third‐year veterinary students' surgical technical skills performed on a low‐fidelity ovariohysterectomy model.

MATERIAL AND METHODS

This study was approved by the institutional review board at Ross University School of Veterinary Medicine (approval number 493) and was conducted according to the tenets of the Declaration of Helsinki.

Context

Ross University School of Veterinary Medicine (Basseterre, St. Kitts and Nevis, WI) was established in 1982. In 2008, the school began a curriculum revision, which incorporated models and simulation to enhance teaching, learning, and assessment of surgical skills. The curriculum revision expanded to include medical and professional skills and resulted in a task‐based vertically integrated spiral professional and clinical skills curriculum in which students were introduced to skills training in the first year and built on those skills through exposure to simulated tasks and procedures at increasing levels of complexity. Students' skills development was assessed through low‐stakes formative assessments during learning activities and through regularly scheduled OSCEs. Surgical skills simulation‐based training was nested within the professional and clinical skills curriculum. The training program culminated with a 15‐week compulsory surgery laboratory course in which students practiced basic surgical skills learned in the early curriculum, including aseptic technique, instrument handling, ligature placement, and suturing in simulated tasks, and procedures they would encounter in general practice such as wound closure, ovariohysterectomy, and cystotomy.

Development of the surgical skills examination

To assess students' surgical skills gained in the simulation‐based curriculum prior to students moving on to supervised live animal surgeries, a summative surgical technical skills examination was developed to be delivered within the surgical skills laboratory course. Students were required to pass the comprehensive surgical skills examination consisting of a simulated ovariohysterectomy (OVH) performed on a model developed at the university and evaluated in a previous study. The OVH model consisted of a wood and polyvinyl chloride (PVC) frame covered by a replaceable foam and fabric 3‐layer closure pad that was rotated for reuse with a replaceable latex reproductive tract (Figure 1). The use of an OVH model for practice and assessment has subsequently been supported by research demonstrating that students benefit from presurgical skills practice on OVH models. The 40‐item checklist used to assess the examination was developed through a collaborative process among RUSVM faculty and has gone through several revisions to improve the clarity of the criteria and feasibility of administering this time‐intensive examination to a large class. Although the technical surgical skills examination has been revised based on an iterative consensus process, expert review, and student performance review, a formal evaluation of validity evidence has not previously been performed.

FIGURE 1

Ross ovariohysterectomy surgical simulation (ROSSie) model

Content validity evidence

A panel of expert surgical skills educators (n = 10) was recruited from multiple institutions through the authors' international networks to validate the examination content. Experts were defined as veterinarians who self‐reported at least 2 years' experience teaching surgical skills to veterinary students in a simulated and/or live surgical environment. Panelists received an email stating the purpose of the study and requesting their participation. If they agreed to participate, they were sent a demographic survey, data collection form, and instructions on how to perform the content review. The content review required the experts to spend approximately 1 hour rating each item on the surgical skills checklist and modified OSATS GRS as “essential,” “useful,” or “not necessary.” The OSATS GRS used in this study was derived from the OSATS GRS developed by Martin et al. (Table 1). Panelists were asked to consider each item on the basis of relevance to teaching and evaluating third‐year preclinical veterinary students' performance of surgical technical skills. They were not incentivized or compensated for their time.

TABLE 1

Checklist items meeting Wilson's criterion for inclusion

Item	Item description	Item content validity ratio as assessed by expert raters
1	Ovarian pedicle #1 – clamp placement	1.00
2	Excessive force on the pedicle is avoided	0.60
3	Ligature placement	1.00
4	Absorbable suture used	0.60
5	Two secure knots placed (surgeon's knot followed by a square knot)	1.00
6	Ligatures tight	0.80
7	Appropriate spacing between ligatures (2‐7 mm)	0.60
8	Pedicle severed just distal to middle forcep	0.60
9	Ovarian pedicle #2 – spacing between forceps (2‐5 mm inside distance)	0.60
10	Excessive force on the pedicle is avoided	0.56
11	Ligature placement	1.00
12	Absorbable suture used	0.60
13	Two secure knots placed (miller's knot followed by a square knot)	1.00
14	Ligature tight	1.00
15	Pedicle severed just distal to middle forcep	0.60
16	Uterine body – clamp placement	1.00
17	Ligature placement	1.00
18	Absorbable suture used	0.60
19	Two secure knots placed on each ligature (surgeon's or miller's knot followed by a square knot)	1.00
20	Ligatures tight	1.00
21	Appropriate spacing between ligatures (2‐7 mm)	0.60
22	Pedicle severed just distal to middle forcep	0.60
23	Body wall closure – place a minimum of 2 simple interrupted sutures ^a	1.00
24	Absorbable suture used	0.80
25	Full thickness bites of the fascia, muscle not included in suture	1.00
26	Sutures should be snug (tips of mosquito hemostats cannot easily slip underneath suture)	0.60
27	Two secure knots placed for each suture	0.80
28	Subcutaneous closure – technique of burying the knot correctly performed at beginning of pattern	1.00
29	Simple continuous pattern placed correctly (place a minimum of 3 stitches with bites 0.4‐1.3 cm apart, no backhanding)	0.80
30	Only subcutaneous tissue engaged in the pattern (no fascia or skin)	0.60
31	Continuous pattern ended correctly burying the knot	1.00
32	Knots are secure	1.00
33	Skin closure – 2 secure knots placed for each suture	0.80
34	Skin edges apposed	0.80
35	Sutures not too tight (tips of hemostat can slip easily into suture loop)	0.56
36	General – holds instruments correctly; uses correct instruments	0.80
37	Refrains from grasping suture with instrument, other than tag to be discarded; does not damage suture	0.80
38	Does not engage any tissue other than the pedicle in their hemostat	1.00
39	No major breaks in asepsis or multiple minor breaks in aseptic technique. Note: 2 or more breaks in asepsis (not corrected properly) will result in failure of the entire examination.	1.00
	Scale (entire checklist) Content Validity Index	0.81

^a Simple interrupted sutures were chosen for the body wall due to novice surgeons' potential for flaws in knot quality that may lead to dehiscence if a simple continuous pattern was used.

The panelists' ratings were used to calculate the content validity ratio (CVR) for each item and the content validity index (CVI) for the overall rubric. Interrater reliability was evaluated using Gwet's AC2. Lawshe's method using CVR and CVI is an international standard for establishing content validity, providing concrete measurements to identify rubric items for acceptance or rejection, and allowing for generalizability of findings as it requires at least 10 reviewers of varying backgrounds to participate in the content review. The CVI is a rubric‐level statistic equal to the mean CVR of all items included on the rubric. CVR values between 0 and 1 indicate that more than half of the experts considered the item to be essential, and negative values mean that fewer than half of the experts considered the item to be essential. Wilson et al.'s recommended CVR cut‐off values were used for rubric item inclusion.
Gwet's AC2 was chosen over other interrater reliability statistics, such as Cohen's kappa, because Gwet's AC2 can be used for categorical data and is more stable than kappa, being less subject to fluctuations resulting from different outcome values and marginal probability. Reliability measures for Gwet's AC2 were interpreted using George and Mallery's guidelines, which state that values over 0.9 are excellent, 0.8‐0.89 are good, 0.7‐0.79 are acceptable, 0.6‐0.69 are questionable, 0.5‐0.59 are poor, and less than 0.5 are unacceptable. Reviewers' data were entered into a spreadsheet. Validity and internal consistency analyses were completed using R v3.3.1 (Vienna, Austria).
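The CVR and CVI calculations described above can be sketched in a few lines of Python. This is an illustration only, not the authors' R analysis: the function names are hypothetical, and the CVR is written in its commonly cited Lawshe form, CVR = (n_e − N/2) / (N/2), where n_e is the number of panelists rating an item "essential" and N is the number of panelists rating that item. The list of CVR values is transcribed from Table 1.

```python
from statistics import mean


def content_validity_ratio(n_essential: int, n_panelists: int) -> float:
    """Lawshe's CVR: (n_e - N/2) / (N/2). Ranges from -1 to 1."""
    if n_panelists <= 0 or not 0 <= n_essential <= n_panelists:
        raise ValueError("need 0 <= n_essential <= n_panelists and n_panelists > 0")
    half = n_panelists / 2
    return (n_essential - half) / half


def content_validity_index(item_cvrs: list[float]) -> float:
    """CVI as the mean CVR of the items retained on the rubric."""
    return mean(item_cvrs)


# CVR values for the 39 retained checklist items, in Table 1 order.
TABLE1_CVR = [
    1.00, 0.60, 1.00, 0.60, 1.00, 0.80, 0.60, 0.60, 0.60, 0.56,
    1.00, 0.60, 1.00, 1.00, 0.60, 1.00, 1.00, 0.60, 1.00, 1.00,
    0.60, 0.60, 1.00, 0.80, 1.00, 0.60, 0.80, 1.00, 0.80, 0.60,
    1.00, 1.00, 0.80, 0.80, 0.56, 0.80, 0.80, 1.00, 1.00,
]

if __name__ == "__main__":
    # With a 10-member panel, 8, 9, and 10 "essential" ratings give 0.6, 0.8, 1.0.
    for n_e in (8, 9, 10):
        print(n_e, round(content_validity_ratio(n_e, 10), 2))
    print("CVI:", round(content_validity_index(TABLE1_CVR), 2))
```

Averaging the 39 tabulated CVR values in this way gives a CVI that rounds to 0.81, consistent with the value reported for the checklist.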

Reliability of student performance scores

Digital recording was used to allow multiple raters to rate each student's performance on the surgical skills examination. The optimal camera angle to record the examination was determined by setting up a mock surgery examination station (Figure 2). A research assistant performed a simulated examination while being digitally recorded to establish a camera position and angle that allowed viewers to see most of the important aspects of each skill performed without discerning the student's identity. Masking tape marked the locations of the model, instrument table, and camera to ensure standardization. A Samsung HMX‐F80 camcorder positioned on a stand was used to capture the recordings.

FIGURE 2

Examination table setup

Five raters with at least 1 year of experience teaching surgical skills and assessing student performance on the surgical skills examination were recruited from the RUSVM faculty. Raters completed a 1‐hour interactive training session to review and discuss the criteria for the checklist items and global rating scales. Following the training session, raters reviewed the simulated examination digital recording to determine their ability to rate student performance on each checklist item. Raters identified 2 items that proved difficult to adequately assess from the digital recording alone: the security of the ligatures placed on the pedicles and the sutures in the body wall. Based on the raters' experience assessing live examinations, it was decided that ligature security on the pedicles would be determined by physically examining knot security and the indentation the ligatures created on the actual cut pedicles, and that body wall sutures would be evaluated by looking at apposition and suture placement through a zoomed‐in view of the model provided at the end of the digital recording.
Veterinary students enrolled in the third‐year surgery laboratory course (n = 136) participated in the surgical skills examination and were examined by an in‐person rater as part of the normal delivery of the course. The students were made aware that the digital recordings of their performance would be used as part of a research study but would have no bearing on their examination grades. Each examination was digitally recorded using the method described above. A research assistant entered the student examination roster into an Excel spreadsheet and assigned a random number to each student using the randomization function in Excel. At the start of each examination, a technician briefly placed an index card with the student's assigned number on the model to identify the recording. At the conclusion of each examination, a technician removed the skin and subcutaneous sutures, held the model up to the camera to provide a close‐up view of the body‐wall sutures, and removed the ligated pedicles from the model, taping them to the numbered index card for later physical examination by the examiners. The digital recordings were uploaded to a secure password‐protected network drive and labeled by number.
Twenty of the 136 digital recordings were randomly selected using the randomization function in Excel and reviewed by the primary researcher for use in the generalizability study. Four of the 20 digital recordings could not be used as the recordings were incomplete. Three months following the live assessment, raters were given access to a network drive folder holding the digital recordings for review (n = 16), and index cards with cut pedicles. Three months was chosen to reduce potential bias that may be introduced by raters' memory of the live assessments. A longer period of time could not be facilitated due to a risk that the absorbable sutures would degrade, compromising the ability of the raters to evaluate the ligatures on the cut pedicles. Each rater scored all 16 recordings using the 40‐item checklist and 6‐item OSATS GRS. The G‐study was completed using the data collected on all 40 checklist items and six OSATS items. Reliability measures from the G‐study were interpreted using George and Mallery's guidelines.
Generalizability (G) theory was used to assess the reliability of the scores produced. In this fully crossed 2‐facet G theory study (participants × raters × items), five raters independently rated all 16 of the digitally recorded student performances, evaluating no more than four recordings a day to minimize rater fatigue. A decision study (D study) was used to determine the relationship between the number of raters and the resultant G coefficient.
Decision studies help to determine how many raters must rate a single student to maintain an adequate reliability of scores. The generalizability analysis was performed using GENOVA (Iowa City, Iowa, USA).
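The logic of a D study can be illustrated with a short Python sketch. It uses the textbook relative‐error G coefficient for a fully crossed persons × raters × items random‐effects design, Eρ² = σ²p / (σ²p + σ²pr/n_r + σ²pi/n_i + σ²pri,e/(n_r·n_i)). This is an assumption made for illustration: the exact GENOVA specification used in the study is not described here, so the sketch is not necessarily the computation the authors ran, and the variance components below are hypothetical values, not study data.

```python
def g_coefficient(var: dict[str, float], n_raters: int, n_items: int) -> float:
    """Relative G coefficient for a crossed p x r x i random-effects design.

    `var` holds variance components keyed by:
      "p"     persons (students)
      "pr"    person-by-rater interaction
      "pi"    person-by-item interaction
      "pri_e" three-way interaction confounded with residual error
    """
    relative_error = (
        var["pr"] / n_raters
        + var["pi"] / n_items
        + var["pri_e"] / (n_raters * n_items)
    )
    return var["p"] / (var["p"] + relative_error)


def d_study(var: dict[str, float], rater_counts, n_items: int) -> dict[int, float]:
    """Project the G coefficient for each candidate number of raters."""
    return {n: g_coefficient(var, n, n_items) for n in rater_counts}


# Hypothetical variance components, chosen only to show the shape of the curve.
EXAMPLE_COMPONENTS = {"p": 0.20, "pr": 0.05, "pi": 0.10, "pri_e": 0.65}

if __name__ == "__main__":
    for n, g in d_study(EXAMPLE_COMPONENTS, range(1, 6), n_items=10).items():
        print(f"{n} rater(s): G = {g:.2f}")
```

Because every rater‐related error term is divided by the number of raters, the projected coefficient rises with each added rater but with diminishing returns, which is the trade‐off a D study is meant to expose.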

RESULTS

Ten veterinary surgical skills educators from veterinary teaching institutions in North America, Europe, and Australia participated in the study. Thirty‐nine of the 40 items on the checklist used in the G‐study met Wilson's criterion for inclusion based on their CVR (Table 1). The redundancy in some items is due to the students being required to perform certain skills multiple times during the examination, such as clamping and ligating an ovarian pedicle. The CVI for the modified 39‐item checklist was 0.81. Only 1 of the six items on the modified OSATS GRS, respect for tissue, met Wilson's criterion for inclusion, indicating that an inadequate number of expert reviewers deemed the other OSATS GRS items to be essential or useful. The CVI for the modified 1‐item GRS was 0.80. Gwet's AC2, a measure of interrater reliability, was 0.84 (95% CI: 0.81, 0.86) for the checklist, which was good, and 0.77 (95% CI: 0.63, 0.92) for the GRS, which was acceptable (Table 2).

TABLE 2

Modified OSATS global rating scale rubric

Item	1	2	3	4	5
Time and motion	Not efficient, many unnecessary moves	Somewhat efficient, moderate amount of unnecessary moves	Efficient time/motion but some unnecessary moves	Efficient, good economy of movement	Maximum efficiency, great economy of movement
Instrument handling	Novice, repeatedly makes tentative awkward moves with instruments	Advanced beginner, makes some tentative or awkward moves with instruments	Competent use of instruments although occasionally appeared stiff or awkward	Proficient use of instruments, fluid moves	Expert use of instruments, very fluid moves with instruments, no awkwardness
Tissue handling	Extremely rough with the tissue, repeatedly causing unnecessary trauma to the tissue	Moderately rough with the tissue, sometimes causing unnecessary trauma to the tissue	Competent tissue handling, occasionally handles it roughly	Proficient tissue handling, gentle use of hands and instruments	Expertly handled tissue with no unnecessary trauma
Knowledge of instruments	Frequently used the incorrect instruments for the task	Sometimes used the incorrect instruments for the task	Used appropriate instruments for the task but hesitated at times	Used the appropriate instruments for the task	Obviously familiar with the instruments required
Flow of procedure and forward planning	Frequently stopped the procedure or hesitated to perform the next step	Stopped or hesitated a few times to perform the next step of the procedure	Demonstrated ability to progress through the task at a slow pace ^a	Demonstrated ability for forward planning with steady progression through the task	Obviously planned course of task with effortless flow from one move to the next
Knowledge of specific procedure	Deficient knowledge, needed specific instruction at most operative steps	Deficient knowledge, needed guidance at some of the operative steps	Knew all important aspects of the task but lacks confidence in knowledge ^b	Knew all important aspects of the task	Demonstrated familiarity with all aspects of the operation
Overall rating	Needs significant amount of development in basic technical skills, tissue handling and/or procedural knowledge	Needs moderate amount of development in basic technical skills, tissue handling and/or procedural knowledge	Needs minimal to moderate amount of development in basic technical skills, tissue handling and/or procedural knowledge	Needs minimal development in basic technical skills, tissue handling and/or procedural knowledge	Has mastered basic technical skills, tissue handling and has a good understanding of procedural knowledge

^a Slow pace could be defined as a rate of action during parts or all of the assessment that appeared too slow for the student to meet the overall time limit placed on the assessment.

^b A lack of confidence could be inferred based on students delaying the next step of the procedure while thinking or making tentative movements.


Reliability of scores – checklist

The G study for the 40‐item checklist revealed students accounted for minimal variance (5%), suggesting individual students performed similarly to one another. Raters accounted for almost no variance (0.4%), suggesting excellent interrater reliability. Minimal variance (7%) was attributable to items, suggesting individual rubric items were rated similarly. Very little variance was attributable to the student by rater interaction (0.6%), indicating the rating was fair and free of bias. Moderate variance was attributable to the student by item interaction (17%) and rater by item interaction (14%), suggesting some students and/or raters may have identified certain items as more difficult than others and vice versa. A considerable variance was attributable to student by rater by item (sri) interaction and residual or unknown factors (55%), suggesting a number of other factors not assessed in this two facet G‐study contributed to the variance. The overall G‐coefficient was good at 0.85 when five raters evaluated each student's performance (Table 3). The G‐study was run using the original 40‐item checklist and was not repeated using only the 39 items that met the CVR threshold for inclusion; however, if the study were repeated, the impact of dropping a single item would likely be minimal. The D‐study results demonstrated that one rater per student would result in a G‐coefficient of 0.64, which is considered to indicate questionable reliability, and two raters would generate a G‐coefficient of 0.76, which is considered acceptable reliability. If 3, 4, or 5 raters were used, the G‐coefficients would be 0.81, 0.83, and 0.85, respectively, which indicate good reliability.
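The interpretation bands used above can be written as a small lookup, shown here as a Python sketch. The function name is hypothetical, and the handling of a value of exactly 0.9 (treated as "excellent") is an assumption, since the guideline as quoted in this article says only "over 0.9". The D‐study coefficients are those reported in the preceding paragraph.

```python
def interpret_reliability(value: float) -> str:
    """Map a reliability coefficient to the George and Mallery label
    as quoted in this article (0.9 itself is assumed to be 'excellent')."""
    bands = [
        (0.9, "excellent"),
        (0.8, "good"),
        (0.7, "acceptable"),
        (0.6, "questionable"),
        (0.5, "poor"),
    ]
    for lower_bound, label in bands:
        if value >= lower_bound:
            return label
    return "unacceptable"


# G coefficients reported for the checklist D study, keyed by number of raters.
D_STUDY_CHECKLIST = {1: 0.64, 2: 0.76, 3: 0.81, 4: 0.83, 5: 0.85}

LABELS = {n: interpret_reliability(g) for n, g in D_STUDY_CHECKLIST.items()}

if __name__ == "__main__":
    for n, label in LABELS.items():
        print(f"{n} rater(s): G = {D_STUDY_CHECKLIST[n]:.2f} ({label})")
```

Applied to the reported coefficients, the lookup labels one rater as questionable, two raters as acceptable, and three to five raters as good, matching the classification given in the text.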

TABLE 3

Estimated variance components and G‐coefficient for the surgical skills checklist

Factor	Variance (%)	G coefficient
Students	5.4	0.85
Raters	0.4
Item	7.4
Students by rater	0.6
Students by items	17.1
Raters by item	14.2
Students by rater by item, and residual	54.8


Reliability of scores – modified OSATS global rating scale

The G study for the OSATS GRS revealed students accounted for about one quarter (24%) of the variation in scores, suggesting student performance varied moderately. The items (4%) and raters (13%) accounted for a minimal amount of the variance. The student by rater interaction accounted for a moderate amount of variance (16%). Students scored consistently across the items, as evidenced by the low percentage of variance due to the student by item interaction (0.5%). Similarly, items were rated consistently by the raters, with the rater by item interaction contributing just 8% of the variance. Thirty percent of the variance was accounted for by the student by rater by item (sri) interaction confounded with all other sources of error. The G coefficient was 0.79, approaching the 0.80 cutoff for “good” (Table 4).

TABLE 4

Estimated variance components and G‐coefficient for the modified Objective Structured Assessment of Technical Skills global rating scale

Factor	Variance (%)	G coefficient
Students	24.0	0.79
Raters	13.0
Item	4.0
Students by rater	16.0
Students by items	0.5
Raters by item	8.0
Students by rater by item, and residual	30.0


DISCUSSION

We presented content validity evidence and reported the evidence of interrater reliability and generalizability of the scores produced. Each of these measures supported the use of the checklist in evaluating veterinary students' surgical skills during the third year of their preclinical studies, with surgical skills educators deeming 98% of the checklist essential and G study results demonstrating adequate interrater reliability. While the modified OSATS GRS did not have adequate content validity as assessed by the study's 10 experts, with only 1 of the six items deemed essential by the expert panelists, the reliability of scores was adequate. Furthermore, the generalizability results for the OSATS GRS attributed a moderate amount of variance to students, suggesting the OSATS GRS may help raters differentiate student performance better than the checklist, which attributed minimal variance to students. This finding is similar to results reported by Read et al., which demonstrated that global rating scales in general (but not the OSATS GRS specifically) are reliable when used for scoring student performance in a clinical skills OSCE. The use of a checklist in conjunction with a GRS may therefore better differentiate student performance than a checklist alone, as the GRS allows the rater to evaluate more qualitative aspects of performance. While the results of the content review suggest that the OSATS GRS, outside of the item respect for tissue, was not suitable for assessing preclinical veterinary students' simulated surgical skills examination, more research is necessary to determine at what stage of a veterinary student's education learners are experienced enough to be expected to be competent at the more qualitative OSATS GRS items of time and motion, flow of procedure and forward planning, and knowledge of specific procedure, items that are probably not expected of preclinical third‐year students performing surgical skills on models.
The OSATS GRS could be modified to a competency‐based veterinary education (CBVE) assessment by defining how the 1‐5 values correspond with expectations of each level of learner, similar to how milestones have been developed for some CBVE competencies. For example, students may be expected to be at an OSATS GRS of 3 upon entering their clinical phase of training and a GRS of 4 upon graduation, with a GRS of 5 only being achieved after further postgraduation surgical experience. Several pieces of validity evidence are necessary to determine how to interpret assessment results and set the consequences those results have for the students and program. Reliability evidence is a crucial piece of validity evidence for any assessment method. G theory is a robust measure of reliability, allowing investigators to evaluate a number of facets of variance at once, and the associated D study allows researchers to measure what impact a change will have on reliability, for example, how increasing the number of raters will impact the reliability. Therefore, a G study is useful for planning improvements to existing assessments and the way in which the scores are used. The G study results indicated that both the checklist and global rating scale produced scores reliable enough for moderate stakes testing. A number of things can be done to reduce the high‐stakes nature of an examination, such as offering in‐course resits or reducing the weighting of an assessment component. In this case, while the examination is a must‐pass examination, students are allowed more than 1 attempt within the course and receive detailed feedback and support to help them prepare for a resit if they are unsuccessful on their first attempt. The D‐study results suggested the current 40‐item checklist would require two raters to score each student's performance in order to maintain acceptable reliability for a high‐stakes examination.
Increasing the number of raters for the surgical skills examination may not be feasible due to faculty workload, and it is unlikely that additional rater training would substantially improve reliability given the minimal contribution of the raters to the variance. Instead, reliability values may be further improved by investigating and standardizing subtle differences in students' examination environment and experience, which were not facets included in this G study yet contributed to the residual (sri) variance. The minimal variance attributed to student performance on the checklist scores may have been due to the small random sample of students performing similarly well, or it may reflect the overall success of the students' extended clinical skills training program at building surgical skills. Surgical skills are learned through deliberate practice, which may be best delivered spaced out over a longer period of time to facilitate improved retention. Clinical skills programs that allow students to participate in regular surgical skills practice with feedback from instructors over a period of weeks or months, as the RUSVM skills training program does, are most likely to see students demonstrate consistent skills gains on assessments.
The generalizability findings also provided support for the reliability of scoring surgical skills examinations via digital recordings. This is important to consider when it is not possible to bring students together for in‐person assessment, as with the recent COVID‐19 pandemic, or when an inadequate number of raters are available for real‐time assessment of students. Assessing digital recordings can take more time on the part of the raters as compared with live assessment of student performance. Raters in this study reported they spent a considerable amount of time rating the videos and welcomed the prescribed break after rating four recordings. Similar findings were reported by Tan et al. (2020) in a study evaluating rating of digitally recorded OSCE stations. Digitally recorded evaluations of surgical skills have also been assessed for evidence of reliability in other veterinary studies, so this remains a feasible option. While the digital recordings in this study were used solely to collect data for the reliability study, digital recordings can be a powerful tool to provide feedback to students on their performance to help them enhance their proficiency and will be considered for this purpose in the future.
The results of this study supported and enhanced the use of the comprehensive surgical skills examination on an OVH model in the local context and provided some validity evidence to support use of the checklist instrument in other veterinary programs. Relatively few rubrics with evidence of validity for assessing veterinary students' surgical skills have been published; however, these rubrics exist for live canine ovariohysterectomy, simulated canine ovariohysterectomy, live canine castration, simulated canine castration, and celiotomy closure in a canine cadaver. The previously existing simulated canine ovariohysterectomy rubric was an operative component rating scale, a task‐specific rubric requiring raters to score each step of the procedure on a 0‐3 point scale for a total of 102 points. Our study collected validation evidence for a dichotomous checklist having 39 points (Table 1), which may be easier for a rater to use in a busy teaching environment.
This study had several limitations. Although the content review panel included experts from North America, Europe, and Australia, a more diverse panel with representation from other continents would have been preferable. Additionally, only 16 students' digital recordings were assessed due to technical errors and time constraints on the part of the raters.
While specific guidelines on minimal sample size for generalizability studies have not been established, a minimum of 20 persons for a 1‐facet design has been suggested. Studies in veterinary and nursing education have reported successfully using fewer than 20 persons in conjunction with a larger number of conditions per facet. The small sample size may have contributed to the observed low variance in student scores as assessed by the checklist. If the surgical skills examination on a model, scored using the checklist, is to be used as a high‐stakes assessment, particularly at other institutions, further validity evidence and additional reliability data should be gathered to maintain a solid validity argument for its use. Furthermore, using a global rating scale in conjunction with the checklist may help differentiate student performance better than the checklist alone, as it allows the rater to evaluate more qualitative aspects of the students' performance.

In conclusion, content validity was very good for the 39‐item checklist and good for the 1‐item OSATS GRS, as tested here. The reliability of scores from both instruments was acceptable for a moderate‐stakes examination. These results provide evidence to support the use of the checklist described over the OSATS GRS in a moderate‐stakes examination when evaluating preclinical third‐year veterinary students' technical surgical skills on low‐fidelity models. Additional research is necessary to understand at what point in a veterinary student's education the OSATS GRS becomes suitable for assessing surgical skills.
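The content validity figures summarized in the conclusion rest on Lawshe's content validity ratio (CVR), computed per item as (n_e - N/2) / (N/2), where n_e is the number of panelists rating the item essential and N is the panel size; the CVI is the mean CVR of the retained items. The sketch below illustrates that arithmetic only. The function names and panel counts are hypothetical, and the 0.62 cutoff is an assumption (the value commonly cited from Lawshe's table for a 10‐member panel), not a threshold confirmed by this article.

```python
from typing import Dict, Iterable


def content_validity_ratio(n_essential: int, n_panelists: int) -> float:
    """Lawshe's CVR for one item: (n_e - N/2) / (N/2)."""
    if not 0 <= n_essential <= n_panelists or n_panelists < 1:
        raise ValueError("need 0 <= n_essential <= n_panelists and n_panelists >= 1")
    half = n_panelists / 2
    return (n_essential - half) / half


def retained_items(
    essential_counts: Dict[str, int], n_panelists: int, cutoff: float
) -> Dict[str, float]:
    """Return the CVR of every item whose CVR meets or exceeds the cutoff."""
    cvrs = {
        item: content_validity_ratio(count, n_panelists)
        for item, count in essential_counts.items()
    }
    return {item: cvr for item, cvr in cvrs.items() if cvr >= cutoff}


def content_validity_index(cvrs: Iterable[float]) -> float:
    """CVI: the mean CVR across the retained items."""
    values = list(cvrs)
    if not values:
        raise ValueError("no retained items")
    return sum(values) / len(values)


if __name__ == "__main__":
    # Invented "essential" vote counts from a 10-member panel, for illustration.
    votes = {"item_a": 10, "item_b": 9, "item_c": 6}
    kept = retained_items(votes, n_panelists=10, cutoff=0.62)  # cutoff is an assumption
    print("retained:", sorted(kept))
    print("CVI:", round(content_validity_index(kept.values()), 2))
```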

ETHICAL APPROVAL

Approval was granted by the institutional review board at Ross University School of Veterinary Medicine, 7 July 2015 (approval no. 493).

CONFLICT OF INTEREST

The authors declare no conflicts of interest related to this report.
