Chiara M Santomauro, Andrew Hill, Tara McCurdie, Hannah L McGlashan.
Abstract
STATEMENT: Simulation is increasingly being used in healthcare improvement projects. The aims of such projects can be extremely diverse. Accordingly, the outcomes or participant attributes that need to be measured can vary dramatically from project to project and may include a wide range of nontechnical skills, technical skills, and psychological constructs. Consequently, there is a growing need for simulation practitioners to be able to identify suitable measurement tools and incorporate them into their work. This article provides a practical introduction and guide to the key considerations for practitioners when selecting and using such tools. It also offers a substantial selection of example tools, both to illustrate the key considerations in relation to choosing a measure (including reliability and validity) and to serve as a convenient resource for those planning a study. By making well-informed choices, practitioners can improve the quality of the data they collect and the likelihood that their projects will succeed.
Year: 2020 PMID: 32520766 PMCID: PMC7531509 DOI: 10.1097/SIH.0000000000000442
Source DB: PubMed Journal: Simul Healthc ISSN: 1559-2332 Impact factor: 2.690
Potential Sources of Unreliability Commonly Considered in Published Studies of Relevant Measures, Typical Traditional Forms of Reliability Evidence, Terminology, Tips for Interpretation, and Examples From Relevant Literature
| Potential Source of Unreliability | Typical Traditional Form of Reliability Evidence | Relevant Terms Used in the Literature | Tips for Interpreting Reliability Evidence for Research/Evaluation Purposes | Example From Relevant Literature |
|---|---|---|---|---|
| Items | The degree to which participants completing a measure give consistent responses to items intended to tap into the same construct. | • Internal consistency | If the tool incorporates several subscales, each measuring a different construct (or subconstruct), the internal consistency of each subscale should be evaluated. | Participants completed the 12-item HuFSHI. Scores were consistent across items (α = 0.92), suggesting that the 12 items were assessing the same underlying construct.[ |
| Occasions | The degree to which participants' scores on a measure are similar when they complete it multiple (≥2) times. | • Test-retest reliability (or consistency) | The time points should be sufficiently spaced out to prevent participants from relying on memory. | Participants completed the NTS on 2 occasions, 2 weeks apart. Scores were consistent across the 2 time points ( |
| Raters | The degree to which multiple observers (≥2) using a measure provide similar ratings of participants. | • Interrater (or rater) reliability (or agreement) | Observers should be appropriate for the context and may require training to use the measure. | Three observers rated the same participants on teamwork using the CTS. Their ratings were similar (ICC = 0.98), suggesting that the measure allows for consistent ratings.[ |
* DeVellis[46] suggests the following ranges for interpreting α: <0.60, unacceptable; 0.60–0.65, undesirable; 0.65–0.70, minimally acceptable; 0.70–0.80, respectable; 0.80–0.90, very good; and >0.90, excellent but indicating possible redundancy.
† McHugh[50] suggests the following ranges for interpreting Cohen κ: 0–0.20, no agreement; 0.21–0.39, minimal agreement; 0.40–0.59, weak agreement; 0.60–0.79, moderate agreement; 0.80–0.90, strong agreement; and >0.90, almost perfect agreement. In addition, McHugh[50] argues that κ values of less than 0.60 indicate inadequate agreement.
‡ Koo and Li[51] suggest the following ranges for interpreting an ICC: <0.50, poor agreement; 0.50–0.75, moderate agreement; 0.75–0.90, good agreement; and >0.90, excellent agreement.
CTS, Clinical Teamwork Scale; HuFSHI, Human Factors Skills for Healthcare Instrument; ICC, intraclass correlation; NTS, Nursing Teamwork Survey.
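
As a rough illustration of how the reliability statistics in the table above might be computed and read against the suggested ranges, the following Python sketch calculates Cronbach α and Cohen κ from raw data and attaches the interpretive labels from the table notes. It is a minimal sketch, not part of the original article: the function names are hypothetical, the handling of values that fall exactly on a boundary between two ranges is an assumption (the published ranges overlap or leave small gaps at the boundaries), and negative κ values are assumed to count as "no agreement".

```python
from collections import Counter
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha; `scores` is a list of participants, each a list of item scores."""
    k = len(scores[0])
    if len(scores) < 2 or k < 2:
        raise ValueError("need at least 2 participants and 2 items")
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    total_var = variance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total scores do not vary; alpha is undefined")
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters giving categorical ratings to the same cases."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("both raters must rate the same non-empty set of cases")
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    if p_chance == 1:
        raise ValueError("both raters used a single identical category; kappa is undefined")
    return (p_observed - p_chance) / (1 - p_chance)


def interpret_alpha(alpha):
    """Label alpha using the DeVellis ranges (boundary handling is an assumption)."""
    if alpha < 0.60:
        return "unacceptable"
    if alpha < 0.65:
        return "undesirable"
    if alpha < 0.70:
        return "minimally acceptable"
    if alpha < 0.80:
        return "respectable"
    if alpha <= 0.90:
        return "very good"
    return "excellent, but possibly redundant"


def interpret_kappa(kappa):
    """Label kappa using the McHugh ranges (boundary handling is an assumption)."""
    if kappa <= 0.20:  # assumption: negative values are treated as no agreement
        return "no agreement"
    if kappa < 0.40:
        return "minimal agreement"
    if kappa < 0.60:
        return "weak agreement"
    if kappa < 0.80:
        return "moderate agreement"
    if kappa <= 0.90:
        return "strong agreement"
    return "almost perfect agreement"


def interpret_icc(icc):
    """Label an ICC using the Koo and Li ranges (boundary handling is an assumption)."""
    if icc < 0.50:
        return "poor agreement"
    if icc < 0.75:
        return "moderate agreement"
    if icc <= 0.90:
        return "good agreement"
    return "excellent agreement"
```

Applied to the values reported in the table, `interpret_alpha(0.92)` returns the "excellent, but possibly redundant" label for the HuFSHI example, and `interpret_icc(0.98)` returns "excellent agreement" for the CTS example. The ICC itself is not computed here because several ICC forms exist and the appropriate one depends on the study design.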
Common Forms of Validity Evidence Based on Relations to Other Variables, Terminology, Tips for Interpretation, and Examples From Relevant Literature
| Form of Validity Evidence | Relevant Terms Used in the Literature* | Tips for Interpreting Validity Evidence for Research/Evaluation Purposes | Example From Relevant Literature |
|---|---|---|---|
| The degree to which scores on the measure are associated with scores on an | • Convergent evidence (of validity) | The previously established measure must itself produce reliable scores that can be argued to be valid for the proposed score interpretation. | Scores on the NOTSS positively correlated with scores on the ANTS—a previously established measure of nontechnical skills ( |
| The degree to which scores on a measure are associated with scores on an | • Convergent evidence (of validity) | The previously established measure must itself produce reliable scores that can be argued to be valid for the proposed score interpretation. | Scores on the SAGAT positively correlated with scores on a traditional checklist assessment of task performance ( |
| The degree to which scores on a measure are associated with performance on a | • Criterion-related evidence (of validity) | The criterion measure must itself produce reliable scores that can be argued to be valid for the proposed score interpretation. | Scores on the NASA-TLX—a measure of subjective workload—positively correlated with the time taken to complete a task ( |
| The degree to which scores on a measure can distinguish between | • Contrasted groups | The difference between the groups must be in the expected direction (eg, participants who are more experienced in the content domain would be predicted to receive better scores than novices). | An ANOVA on |
* Many of the terms listed in this column are now regarded as outdated and potentially misleading because they imply that there are multiple “types” of validity (eg, predictive validity),[70] which is inconsistent with the contemporary unitary conceptualisation.[65–68] They have been included only to assist with the interpretation of published work that uses these terms and not as an endorsement of their continued use.
† Cohen[75] suggests the following benchmarks for classifying the strength of a correlation in the context of validity testing: |0.10| = small, |0.30| = medium, and |0.50| = large.
‡ If the outcome (or external criterion) is measured at a meaningfully later point in time, the association may be regarded as predictive evidence of validity, rather than concurrent evidence.
§ This term has been used inconsistently over time, shifting from its traditional constrained meaning (as a “type” of validity) to a general term encompassing all validity evidence,[71,72] and then to the present situation where it no longer appears in the Standards and its use is discouraged entirely.[64,70] Unfortunately, inconsistent usage of this term continues to afflict the literature.[70]
ANOVA, analysis of variance; ANTS, Anesthetists' Non-Technical Skills; NOTSS, Non-Technical Skills for Surgeons; SAGAT, Situation Awareness Global Assessment Technique.
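
To make the correlation-based validity evidence in the table above more concrete, the sketch below computes a Pearson correlation between scores on two measures and classifies its strength using the Cohen benchmarks quoted in the table notes. It is an illustrative sketch only: the function names and the example scores are hypothetical, and treating each benchmark as the lower bound of a band (with values under |0.10| labeled "negligible") is an assumption, because the benchmarks are given as single points rather than ranges.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between paired scores on two measures."""
    n = len(x)
    if n < 2 or n != len(y):
        raise ValueError("need at least 2 paired scores")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("scores on one measure do not vary; r is undefined")
    return s_xy / sqrt(s_xx * s_yy)


def interpret_correlation(r):
    """Classify |r| against Cohen's benchmarks (band edges are an assumption)."""
    size = abs(r)
    if size >= 0.50:
        return "large"
    if size >= 0.30:
        return "medium"
    if size >= 0.10:
        return "small"
    return "negligible"  # assumption: no label is given below |0.10|


# Hypothetical scores for five participants on a new measure and an
# established measure; these numbers are invented for illustration.
new_measure = [1, 2, 3, 4, 5]
established_measure = [2, 4, 5, 4, 5]
r = pearson_r(new_measure, established_measure)
print(f"r = {r:.2f} ({interpret_correlation(r)})")
```

The sign of the correlation matters as well as its size: convergent evidence requires an association in the predicted direction, so a large negative correlation between two measures expected to agree would count against, not for, the proposed score interpretation.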