Daniël A Korevaar, Gowri Gopalakrishna, Jérémie F Cohen, Patrick M Bossuyt.
Abstract
Most randomized controlled trials evaluating medical interventions have a pre-specified hypothesis, which is statistically tested against the null hypothesis of no effect. In diagnostic accuracy studies, study hypotheses are rarely pre-defined and sample size calculations are usually not performed, which may jeopardize scientific rigor and can lead to over-interpretation or "spin" of study findings. In this paper, we propose a strategy for defining meaningful hypotheses in diagnostic accuracy studies. Based on the role of the index test in the clinical pathway and the downstream consequences of test results, the consequences of test misclassifications can be weighed, to arrive at minimally acceptable criteria for pre-defined test performance: levels of sensitivity and specificity that would justify the test's intended use. Minimally acceptable criteria for test performance should form the basis for hypothesis formulation and sample size calculations in diagnostic accuracy studies.
Year: 2019 PMID: 31890896 PMCID: PMC6921417 DOI: 10.1186/s41512-019-0069-2
Source DB: PubMed Journal: Diagn Progn Res ISSN: 2397-7523
Commonly used terminology in statistics of randomized controlled trials
| Term | Explanation |
|---|---|
| Null hypothesis | Claims that there is no difference in outcome across two or more groups (e.g., drug A is as good as placebo) |
| Alternative hypothesis | Claims that there is a difference in outcome across two or more groups (e.g., drug A is better than placebo) |
| Type 1 error (α) | Rejection of a true null hypothesis (i.e., a false-positive result) |
| Type 2 error (β) | Failure to reject a false null hypothesis (i.e., a false-negative result) |
| Effect size | A quantitative measure of the magnitude of the effect (e.g., mean difference, relative risk, or odds ratio) |
| P value | Probability of obtaining the identified result (or something more extreme) under the assumption that the null hypothesis is true |
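The terminology above can be made concrete with a small example. Below is a hedged sketch, using entirely hypothetical trial data, of a two-sided two-proportion z-test (normal approximation); if the resulting p value is below the chosen α, the null hypothesis of no difference is rejected:

```python
# Hedged sketch (not from the paper): a two-proportion z-test illustrating
# null/alternative hypotheses, alpha, and the p value. Counts are hypothetical.
import math

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (z, p_value) for H0: the two group proportions are equal."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)  # pooled proportion under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))      # two-sided normal tail area
    return z, p_value

# Hypothetical example: 60/100 improve on drug A vs 45/100 on placebo.
z, p = two_proportion_z_test(60, 100, 45, 100)
# The effect size here is the difference in proportions (0.60 - 0.45 = 0.15);
# reject H0 at alpha = 0.05 if p < 0.05.
```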
Diagnostic accuracy studies
In diagnostic accuracy studies, a series of patients suspected of having a target condition undergo both an index test (i.e., the test that is being evaluated) and the clinical reference standard (i.e., the best available method for establishing whether a patient does or does not have the target condition). Assuming that the results of the index test and reference standard are dichotomous, either positive or negative, we can present the results of the study in a contingency table (or "2 × 2 table"), which shows the extent to which both tests agree (Fig. 1). Although it is possible to generate a single estimate of the index test's accuracy, such as the diagnostic odds ratio, accuracy is more commonly expressed as a pair of statistics: the test's sensitivity and specificity.
Fig. 1 Typical output of a diagnostic accuracy study: the contingency table (or "2 × 2 table")
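A minimal sketch of how sensitivity and specificity follow from the four cells of such a contingency table (the counts below are hypothetical, not from the paper):

```python
# Hedged sketch: sensitivity and specificity from the four cells of a
# 2 x 2 contingency table (Fig. 1). All counts are hypothetical.
def accuracy_from_2x2(tp, fp, fn, tn):
    sensitivity = tp / (tp + fn)  # proportion of diseased with a positive index test
    specificity = tn / (tn + fp)  # proportion of non-diseased with a negative index test
    return sensitivity, specificity

sens, spec = accuracy_from_2x2(tp=90, fp=20, fn=10, tn=80)  # → (0.9, 0.8)
```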
Fig. 2 Receiver operating characteristic (ROC) space with "target region" based on minimally acceptable criteria for accuracy. ROC space has two dimensions: sensitivity (y-axis) and 1 − specificity (x-axis). When the sum of sensitivity and specificity is ≥ 1.0, the test's accuracy will be a point somewhere in the upper left triangle. The "target region" of a diagnostic accuracy study will always touch the upper left corner of ROC space, which is the point for perfect tests, where both sensitivity and specificity are 1.0. From there, the rectangle extends down, towards the MAC for sensitivity, and to the right, towards the MAC for specificity. The gray square represents the target region of a diagnostic accuracy study with a MAC (sensitivity) of 0.70 and a MAC (specificity) of 0.60. MAC, minimally acceptable criteria
Fig. 3 Defining minimally acceptable criteria (MAC) for diagnostic accuracy
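Because the target region is rectangular and anchored at the upper left corner of ROC space, checking whether an accuracy estimate falls inside it reduces to two comparisons. A hedged sketch, using the MAC values from the Fig. 2 example (sensitivity ≥ 0.70, specificity ≥ 0.60):

```python
# Hedged sketch: membership test for the rectangular "target region" in ROC
# space. MAC values are taken from the Fig. 2 example; the test estimates
# passed in below are hypothetical.
def in_target_region(sensitivity, specificity, mac_sens=0.70, mac_spec=0.60):
    return sensitivity >= mac_sens and specificity >= mac_spec

in_target_region(0.86, 0.95)  # True: meets both minimally acceptable criteria
in_target_region(0.65, 0.95)  # False: sensitivity falls below its MAC
```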
Working example on how to define minimally acceptable criteria (MAC) for diagnostic accuracy
| Step | Working example |
|---|---|
| Identify the existing clinical pathway in which the index test will be used | In children with pharyngitis, about one third of cases are due to bacterial infection with group A Streptococcus (GAS). |
| Define the role of the index test in the clinical pathway | In case of GAS pharyngitis, clinical guidelines recommend treatment with antibiotics. Misdiagnosis of GAS pharyngitis, however, could lead to unnecessary initiation of antibiotic treatment. Rapid antigen detection testing has a high specificity, but a sensitivity around 86%, which may lead to false-negative results. |
| Define the expected proportion of patients with the target condition | In establishing MAC for sensitivity and specificity, the authors assumed "a prevalence of group A streptococcal infection of 35%". |
| Identify the downstream consequences of test results | The aim of the study is to identify a clinical decision rule that is able to accurately detect patients at low risk or at high risk of GAS pharyngitis. |
| Weigh the consequences of test misclassifications | In weighing the consequences of test misclassifications for sensitivity, the authors refer to expert opinion in previous literature: "Clinicians do not want to miss GAS cases that could transmit the bacterium to other individuals and/or lead to complications. […] Several clinical experts consider that diagnostic strategies for sore throat in children should be at least 80–90% sensitive". |
| Define the study hypothesis by setting minimally acceptable criteria (MAC) for sensitivity and specificity | The authors define MAC for sensitivity and specificity as follows: "After reviewing the literature and discussing until consensus within the review team, and assuming a prevalence of GAS infection of 35% and a maximally acceptable antibiotics prescription rate of 40%, we defined the target zone of accuracy as sensitivity and specificity greater than 85%. For each rules-based selective testing strategy, we used a graphical approach to test whether the one-sided rectangular 95% confidence region for sensitivity and specificity lay entirely within the target zone of accuracy". |
| Perform a sample size calculation | Since the aim of the study was to externally validate clinical prediction rules in an existing dataset, no sample size calculation was performed, which the authors acknowledge as a limitation in their discussion section: "A further limitation lies in the absence of an a priori sample size calculation. One of the clinical prediction rules met our target zone of accuracy based on the point estimates alone (Attia's rule), but it was considered insufficient because the boundaries of the confidence intervals for sensitivity and specificity went across the prespecified limits for significance. This could be due to lack of power, and our results should be considered with caution until they are confirmed with a larger sample of patients". When using the calculator proposed in Additional file … |
| Arrive at meaningful conclusions | In their article, the authors graphically illustrate the performance of the investigated clinical prediction rules in ROC space (Fig. 4). |
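The "graphical approach" quoted in the working example, checking whether the one-sided rectangular 95% confidence region lies entirely within the target zone, amounts to checking that the one-sided 95% lower confidence bounds for sensitivity and specificity both exceed the MAC of 0.85. A hedged sketch with hypothetical counts; a Wilson score lower bound is used here, and the authors' exact interval method is not assumed:

```python
# Hedged sketch (not the authors' code): target-zone check via one-sided
# 95% Wilson score lower bounds for sensitivity and specificity.
# All counts below are hypothetical.
import math

def wilson_lower(successes, n, z=1.645):  # z = 1.645 for a one-sided 95% bound
    p = successes / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / denom

def meets_target_zone(tp, fn, tn, fp, mac=0.85):
    """True if both one-sided lower bounds exceed the MAC of 0.85."""
    return wilson_lower(tp, tp + fn) > mac and wilson_lower(tn, tn + fp) > mac

meets_target_zone(tp=180, fn=20, tn=360, fp=40)  # large sample: bounds clear the MAC
meets_target_zone(tp=90, fn=10, tn=360, fp=40)   # same sensitivity (0.90) but fewer
                                                 # diseased: the lower bound falls short
```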
Fig. 4 External validation of the diagnostic accuracy of rules-based selective testing strategies (figure derived from Cohen and colleagues [16]). Graph shows sensitivity and specificity estimates with their one-sided rectangular 95% confidence regions. Numbers indicate the rules-based selective testing strategies
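For studies that can plan recruitment in advance, the sample size step can be sketched with a common normal-approximation formula for testing a single proportion (here, sensitivity) against a minimally acceptable value. This is an illustrative calculation, not the calculator from the paper's additional file; it reuses the working example's prevalence (0.35) and MAC (0.85), with a hypothetical anticipated sensitivity of 0.95:

```python
# Hedged sketch: a priori sample size for showing sensitivity > MAC, using a
# standard one-sample-proportion normal approximation (one-sided alpha = 0.05,
# power = 80%). The anticipated sensitivity of 0.95 is a hypothetical input.
import math

def n_diseased(anticipated_sens, mac_sens, z_alpha=1.645, z_beta=0.84):
    """Diseased participants needed so the test of H0: sens <= MAC has 80% power."""
    if anticipated_sens <= mac_sens:
        raise ValueError("anticipated sensitivity must exceed the MAC")
    num = (z_alpha * math.sqrt(mac_sens * (1 - mac_sens))
           + z_beta * math.sqrt(anticipated_sens * (1 - anticipated_sens)))
    return math.ceil((num / (anticipated_sens - mac_sens)) ** 2)

def total_sample_size(anticipated_sens, mac_sens, prevalence):
    """Total suspected patients to recruit, inflating by the expected prevalence."""
    return math.ceil(n_diseased(anticipated_sens, mac_sens) / prevalence)

n_dis = n_diseased(0.95, 0.85)                 # diseased participants needed
n_total = total_sample_size(0.95, 0.85, 0.35)  # total recruits at 35% prevalence
```

A symmetrical calculation for specificity (against the non-diseased fraction, 1 − prevalence) would be run alongside this, with the larger of the two totals taken as the study's sample size.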