Andres Jung, Julia Balzer, Tobias Braun, Kerstin Luedtke.
Abstract
BACKGROUND: Internal and external validity are the most relevant components when critically appraising randomized controlled trials (RCTs) for systematic reviews. However, there is no gold standard to assess external validity. This might be related to the heterogeneity of the terminology as well as to unclear evidence on the measurement properties of available tools. The aim of this review was to identify tools to assess the external validity of RCTs, to evaluate the quality of the identified tools, and to recommend individual tools for assessing the external validity of RCTs in future systematic reviews.
Keywords: Applicability; External validity; Generalizability; Measurement properties; Randomized controlled trial; Tools
Year: 2022 PMID: 35387582 PMCID: PMC8985274 DOI: 10.1186/s12874-022-01561-5
Source DB: PubMed Journal: BMC Med Res Methodol ISSN: 1471-2288 Impact factor: 4.615
Fig. 1: Flow diagram “of systematic search strategy used to identify clinimetric papers” [24]
Experimental/statistical methods to evaluate the EV of RCTs
Abbreviations: EV external validity, NNT numbers needed to treat, RCT randomized controlled trial
For non-experimental methods, please refer to Table 2
Characteristics of included tools
| Dimension and/or tool | Authors | Construct(s), as described by the authors | Target population | Domains, no. of items | Response options | Development and validation |
|---|---|---|---|---|---|---|
| “Applicability”-dimension of LEGEND | Clark et al. [ | Applicability of results to treating patients | P1: RCTs and CCTs P2: reviewers and clinicians | 3 items | 3-point-scale | Deductive and inductive item-generation. Tool was pilot tested among an interprofessional group of clinicians. |
| “Applicability”-dimension of Carr’s evidence-grading scheme | Carr et al. [ | Generalizability of study population | P1: clinical trials P2: authors of SRs | 1 item | 3-point-classification-scale | No specific information on tool development. |
| Bornhöft’s checklist | Bornhöft et al. [ | External validity (EV) and Model validity (MV) of clinical trials | P1: clinical trials P2: authors of SRs | 4 domains with 26 items for EV and MV each | 4-point-scale | Development with a comprehensive, deductive item-generation from the literature. Pilot-tests were performed, but not for the whole scales. |
| Clegg’s external validity assessment | Clegg et al. [ | Generalizability of clinical trials to England and Wales | P1: clinical trials P2: authors of SRs and HTAs | 5 items | 3-point-scale | No specific information on tool development |
| Clinical applicability | Haraldsson et al. [ | Report quality and applicability of intervention, study population and outcomes | P1: RCTs P2: reviewers | 6 items | 3-point-scale and 4-point-scale | No specific information on tool development |
| Clinical Relevance Instrument | Cho & Bero [ | Ethics and Generalizability of outcomes, subjects, treatment and side effects | P1: clinical trials P2: reviewers | 7 items | 3-point-scale | Tool was pilot tested on 10 drug studies. Content validity was confirmed by 7 reviewers with research experience. - interrater reliability: ICC = 0.56 ( |
| “Clinical Relevance” according to the CCBRG | Van Tulder et al. [ | Applicability of patients, interventions and outcomes | P1: RCTs P2: authors of SRs | 5 items | 3-point-scale (Staal et al., 2008) | Deductive item-generation for Clinical Relevance. Results were discussed in a workshop. After two rounds, a final draft was circulated for comments among editors of the CCBRG. |
| Clinical Relevance Score | Karjalainen et al. [ | Report quality and applicability of results | P1: RCTs P2: reviewers | 3 items | 3-point-scale | No specific information on tool development. |
| Estrada’s applicability assessment criteria | Estrada et al. [ | Applicability of population, intervention, implementation and environmental context to Latin America | P1: RCTs P2: reviewers | 5 domains with 8 items | 3-point-scale for each domain | Deductive item generation from the review by Munthe-Kaas et al. [ |
| EVAT (External Validity Assessment Tool) | Khorsan & Crawford [ | External validity of participants, intervention, and setting | P1: RCTs and non-randomized studies P2: reviewers | 3 items | 3-point-scale | Deductive item-generation. Tool developed based on the GAP-checklist [ |
| “External validity”-dimension of the Downs & Black-Checklist | Downs & Black [ | Representativeness of study participants, treatments and settings to source population or setting | P1: RCTs and non-randomised studies P2: reviewers | 3 items | 3-point-scale | Deductive item-generation, pilot test and content validation of pilot version. Final version tested for: - internal consistency: KR-20 = 0.54 ( - reliability: test-retest: interrater reliability: ICC = 0.76 ( |
| “External validity”-dimension of Foy’s quality checklist | Foy et al. [ | External validity of patients, settings, intervention and outcomes | P1: intervention studies P2: reviewers | 6 items | not clearly described | Deductive item-generation. No further information on tool development. |
| “External validity”-dimension of Liberati’s quality assessment criteria | Liberati et al. [ | Report quality and generalizability | P1: RCTs P2: reviewers | 9 items | dichotomous and 3-point-scale | Tool is a modified version of a previously developed checklist [ |
| “External validity”-dimension of Sorg’s checklist | Sorg et al. [ | External validity of population, interventions, and endpoints | P1: RCTs P2: reviewers | 4 domains with 11 items | not clearly described | Developed based on Bornhöft et al. [ |
| “external validity”-criteria of the USPSTF | USPSTF Procedure manual [ | Generalizability of study population, setting and providers for US primary care | P1: clinical studies P2: USPSTF reviewers | 3 items | Sum-score rating: 3-point-scale | Tool developed for USPSTF reviews. No specific information on tool development. - interrater reliability: ICC = 0.84 ( |
| FAME (Feasibility, Appropriateness, Meaningfulness and Effectiveness) scale | Averis et al. [ | Grading of recommendation for applicability and ethics of intervention | P1: intervention studies P2: reviewers | 4 items | 5-point-scale | The FAME framework was created by a national group of nursing research experts. Deductive and inductive item-generation. No further information on tool development. |
| GAP (Generalizability, Applicability and Predictability) checklist | Fernandez-Hermida et al. [ | External validity of population, setting, intervention and endpoints | P1: RCTs P2: Reviewers | 3 items | 3-point-scale | No specific information on tool development. |
| Gartlehner’s tool | Gartlehner et al. [ | To distinguish between effectiveness and efficacy trials | P1: RCTs P2: reviewers | 7 items | Dichotomous | Deductive and inductive item-generation. - criterion validity testing with studies selected by 12 experts as gold standard: specificity = 0.83, sensitivity = 0.72 ( - measurement error: 78.3% agreement ( - interrater reliability: |
| Green & Glasgow’s external validity quality rating criteria | Green & Glasgow [ | Report quality for generalizability | P1: trials (not explicitly described) P2: reviewers | 4 domains with 16 items | Dichotomous | Deductive item-generation. Mainly based on the RE-AIM framework [ - interrater reliability: ICC = 0.86 ( - discriminative validity: TREND studies report on 77% and non-TREND studies report on 54% of scale items ( - ratings across included studies ( |
| “Indirectness”-dimension of the GRADE handbook | Schünemann et al. [ | Differences of population, interventions, and outcome measures to research question | P1: intervention studies P2: authors of SRs, clinical guidelines and HTAs | 4 items | Overall: 3-point-scale (downgrading options) | Deductive and inductive item-generation, pilot-testing with 17 reviewers ( - interrater reliability: ICC = 0.00–0.13 ( |
| Loyka’s external validity framework | Loyka et al. [ | Report quality for generalizability of research in psychological science | P1: intervention studies P2: researchers | 4 domains with 15 items | Dichotomous | Deductive item generation (including Green & Glasgow [ - measurement error: 60-100% agreement ( |
| Modified “Indirectness” of the Checklist for GRADE | Meader et al. [ | Differences of population, interventions, and outcome measures to research question. | P1: meta-analysis of RCTs P2: authors of SRs, clinical guidelines and HTAs | 5 items | Item-level: 2- and 3-point-scale Overall: 3-point-scale (grading options) | Developed based on GRADE method, two phase pilot-tests, - interrater reliability: kappa was poor to almost perfect on item-level [ |
| External validity checklist of the NHMRC handbook | NHMRC handbook [ | External validity of an economic study | P1: clinical studies P2: clinical guideline developers, reviewers | 6 items | 3-point-scale | No specific information on tool development. |
| revised GATE in NICE manual (2012) | NICE manual [ | Generalizability of population, interventions and outcomes | P1: intervention studies P2: reviewers | 2 domains with 4 items | 3-point-scale and 5-point-scale | Based on Jackson et al. [ |
| RITES (Rating of Included Trials on the Efficacy-Effectiveness Spectrum) | Wieland et al. [ | To characterize RCTs on an efficacy-effectiveness continuum. | P1: RCTs P2: reviewers | 4 items | 5-point Likert scale | Deductive and inductive item-generation, modified Delphi procedure with 69–72 experts, pilot testing in 4 Cochrane reviews, content validation with Delphi procedure and core expert group ( - interrater reliability: ICC = 0.54-1.0 ( - convergent validity with PRECIS 2 tool: |
| Section A (Selection Bias) of EPHPP (Effective Public Health Practice Project) tool | Thomas et al. [ | Representativeness of population and participation rate. | P1: clinical trials P2: reviewers | 2 items | Item-level: 4-point-scale and 5-point-scale Overall: 3-point-scale | Deductive item-generation, pilot-tests, content validation by 6 experts, - convergent validity with Guide to Community Preventive Services (GCPS) instrument: 52.5–87.5% agreement ( - test-retest reliability: |
| Section D of the CASP checklist for RCTs | CASP Programme [ | Applicability to local population and outcomes | P1: RCTs P2: participants of workshops, reviewers | 2 items | 3-point-scale | Deductive item-generation, development and pilot-tests with group of experts. |
| Whole Systems research considerations’ checklist | Hawk et al. [ | Applicability of results to usual practice | P1: RCTs P2: Reviewers (developed for review) | 7 domains with 13 items | Item-level: dichotomous Overall: 3-point-scale | Deductive item-generation. No specific information on tool development. |
Abbreviations: CASP Critical Appraisal Skills Programme, CCBRG Cochrane Collaboration Back Review Group, CCT controlled clinical trial, GATE Graphical Appraisal Tool for Epidemiological Studies, GRADE Grading of Recommendations Assessment, Development and Evaluation, HTA Health Technology Assessment, ICC intraclass correlation coefficient, LEGEND Let Evidence Guide Every New Decision, NICE National Institute for Health and Care Excellence, PRECIS PRagmatic Explanatory Continuum Indicator Summary, RCT randomized controlled trial, TREND Transparent Reporting of Evaluations with Nonrandomized Designs, USPSTF U.S. Preventive Services Task Force
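The inter-rater reliability figures quoted throughout the table (e.g. ICC = 0.76 for Downs & Black, ICC = 0.84 for the USPSTF criteria) are intraclass correlation coefficients. As a hedged illustration only (the review does not state which ICC form each validation study used; ICC(2,1), a two-way random-effects, absolute-agreement, single-rater model, is a common choice), the coefficient can be computed from ANOVA mean squares. The function name and the ratings below are invented for this sketch:

```python
# Sketch: ICC(2,1) -- two-way random effects, absolute agreement, single rater.
# All names and data here are illustrative, not taken from the paper.

def icc2_1(scores):
    """ICC(2,1) for scores[target][rater] (each target rated by every rater)."""
    n = len(scores)          # number of rated targets (e.g. trials)
    k = len(scores[0])       # number of raters
    grand = sum(map(sum, scores)) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between targets
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_err = ss_total - ss_rows - ss_cols                    # residual

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    # Shrout & Fleiss ICC(2,1) formula from the mean squares:
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Two raters scoring five trials on a 3-point scale (invented data):
perfect = [[2, 2], [1, 1], [3, 3], [2, 2], [1, 1]]
noisy = [[2, 3], [1, 1], [3, 2], [2, 2], [1, 2]]
print(round(icc2_1(perfect), 3))  # perfect agreement -> 1.0
print(round(icc2_1(noisy), 3))    # disagreement pulls the value down
```

Perfect agreement yields 1.0, while rater disagreement moves the coefficient toward 0, which is the sense in which values such as 0.00–0.13 (GRADE indirectness) versus 0.84 (USPSTF) indicate poor versus good inter-rater reliability.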
Methodological quality of included studies based on COSMIN risk of bias (RoB) checklist
| Tool or dimension | Report | Content validity | Internal structure | Remaining measurement properties | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Clark et al. [ | doubtful |
| Carr et al. [ | inadequate |
| Bornhöft et al. [ | inadequate |
| Clegg et al. [ | inadequate |
| Haraldsson et al. [ | inadequate |
| Cho & Bero [ | doubtful | doubtful | doubtful | doubtful |
| Cho & Bero [ | adequate |
| Van Tulder et al. [ | inadequate | doubtful | doubtful | doubtful |
| Karjalainen et al. [ | inadequate |
| Estrada et al. [ | doubtful |
| Khorsan & Crawford [ | doubtful |
| Downs & Black [ | doubtful | doubtful | doubtful | doubtful | doubtful | very good^a | inadequate^a | adequate |
| very good^a | inadequate^a |
| O’Connor et al. [ | very good |
| Foy et al. [ | inadequate |
| Liberati et al. [ | inadequate |
| Sorg et al. [ | inadequate |
| USPSTF manual [ | inadequate |
| O’Connor et al. [ | very good |
| Averis et al. [ | inadequate |
| Fernandez-Hermida et al. [ | inadequate |
| Gartlehner et al. [ | inadequate | very good | adequate | adequate |
| Zettler et al. [ | very good |
| Green & Glasgow [ | inadequate |
| Laws et al. [ | doubtful |
| Mirza et al. [ | adequate | doubtful |
| Atkins et al. [ | adequate |
| Wu et al. [ | inadequate |
| Loyka et al. | doubtful | adequate |
| Meader et al. [ | adequate | adequate^b |
| Llewellyn et al. [ |
| NHMRC Handbook [ | inadequate |
| NICE Guideline [ | inadequate |
| Wieland et al. [ | adequate | adequate | very good | very good |
| Aves et al. [ | inadequate | very good |
| Thomas et al. [ | inadequate | doubtful | doubtful | doubtful | doubtful | doubtful |
| Armijo-Olivo et al. [ | doubtful |
| Critical Appraisal Skills Programme [ | inadequate |
| Hawk et al. [ | inadequate |
Fields left blank indicate that those measurement properties were not assessed by the study authors
Abbreviations: CB comprehensibility, RE relevance, CV comprehensiveness, CCBRG Cochrane Collaboration Back Review Group, EPHPP Effective Public Health Practice Project, EVAT External Validity Assessment Tool, FAME Feasibility, Appropriateness, Meaningfulness and Effectiveness, GAP Generalizability, Applicability and Predictability, GATE Graphical Appraisal Tool for Epidemiological Studies, GRADE Grading of Recommendations Assessment, Development and Evaluation, LEGEND Let Evidence Guide Every New Decision, NHMRC National Health & Medical Research Council, NICE National Institute for Health and Care Excellence, RITES Rating of Included Trials on the Efficacy-Effectiveness Spectrum, USPSTF U.S. Preventive Services Task Force
^a Two studies on reliability (test-retest and inter-rater reliability) reported in the same article
^b Results from the same study on reliability reported in two articles [94, 95]
Criteria for good measurement properties & certainty of evidence according to the modified GRADE method
| Tool or dimension | Content validity | Internal consistency | Reliability | Measurement error | Criterion validity | Construct validity |
|---|---|---|---|---|---|---|
| CGMP | (?) | |||||
| GRADE | Low | |||||
| CGMP | (?) | |||||
| GRADE | Very Low | |||||
| CGMP | (?) | |||||
| GRADE | Very Low | |||||
| CGMP | (?) | |||||
| GRADE | Very Low | |||||
| CGMP | (?) | |||||
| GRADE | Very Low | |||||
| CGMP | (?) | (-) | ||||
| GRADE | Moderate | Moderate | ||||
| CGMP | (?) | |||||
| GRADE | Moderate | |||||
| CGMP | (?) | |||||
| GRADE | Very Low | |||||
| CGMP | (?) | |||||
| GRADE | Very Low | |||||
| CGMP | (?) | |||||
| GRADE | Low | |||||
| CGMP | (?) | (?) | (±)^a | (?) | (-) | |
| GRADE | Moderate | Very Low | Moderate | Very Low | Very Low | |
| CGMP | (?) | |||||
| GRADE | Very Low | |||||
| CGMP | (?) | |||||
| GRADE | Very Low | |||||
| CGMP | (?) | |||||
| GRADE | Very Low | |||||
| CGMP | (?) | (+) | ||||
| GRADE | Very Low | Very Low | ||||
| CGMP | (?) | |||||
| GRADE | Very Low | |||||
| CGMP | (?) | |||||
| GRADE | Very Low | |||||
| CGMP | (?) | (-) | (?) | (+) | ||
| GRADE | Very Low | Moderate | Very Low | Very Low | ||
| CGMP | (?) | (+) | (-) | |||
| GRADE | Very Low | Very Low | Very Low | |||
| CGMP | (?) | (-) | ||||
| GRADE | Moderate | Very Low | ||||
| CGMP | (?) | (?) | ||||
| GRADE | Very Low | Low | ||||
| CGMP | (?) | (-) | ||||
| GRADE | Low | Very Low | ||||
| CGMP | (?) | |||||
| GRADE | Very Low | |||||
| CGMP | (?) | |||||
| GRADE | Very Low | |||||
| CGMP | (+) | (+) | (+) | |||
| GRADE | Moderate | Very Low | Low | |||
| CGMP | (?) | (-) | (+) | |||
| GRADE | Moderate | Low | Very Low | |||
| CGMP | (?) | |||||
| GRADE | Very Low | |||||
| CGMP | (?) | |||||
| GRADE | Very Low | |||||
Abbreviations: CCBRG Cochrane Collaboration Back Review Group, CGMP criteria for good measurement properties, EPHPP Effective Public Health Practice Project, GRADE Grading of Recommendations Assessment, Development and Evaluation, LEGEND Let Evidence Guide Every New Decision, NICE National Institute for Health and Care Excellence, USPSTF U.S. Preventive Services Task Force
Criteria for good measurement properties: (+) = sufficient; (?) = indeterminate; (-) = insufficient; (±) = inconsistent
Level of evidence according to the modified GRADE approach: high, moderate, low, or very low evidence. Note: the measurement properties “structural validity” and “cross-cultural validity” are not presented in this table, since they were not assessed in any of the included studies
Fields left blank indicate that those measurement properties were not assessed by the study authors
^a Please refer to Table S4 for more information on reliability of the “external validity”-dimension of the Downs & Black checklist