Sonal Singh, Stephanie M Chang, David B Matchar, Eric B Bass.
INTRODUCTION: Grading the strength of a body of diagnostic test evidence involves challenges over and above those related to grading the evidence from health care intervention studies. This chapter identifies challenges and outlines principles for grading the body of evidence related to diagnostic test performance. CHALLENGES: Diagnostic test evidence is challenging to grade because standard tools for grading evidence were designed for questions about treatment rather than diagnostic testing; and the clinical usefulness of a diagnostic test depends on multiple links in a chain of evidence connecting the performance of a test to changes in clinical outcomes. PRINCIPLES: Reviewers grading the strength of a body of evidence on diagnostic tests should consider the principle domains of risk of bias, directness, consistency, and precision, as well as publication bias, dose response association, plausible unmeasured confounders that would decrease an effect, and strength of association, similar to what is done to grade evidence on treatment interventions. Given that most evidence regarding the clinical value of diagnostic tests is indirect, an analytic framework must be developed to clarify the key questions, and strength of evidence for each link in that framework should be graded separately. However if reviewers choose to combine domains into a single grade of evidence, they should explain their rationale for a particular summary grade and the relevant domains that were weighed in assigning the summary grade.Entities:
Year: 2012 PMID: 22648675 PMCID: PMC3364356 DOI: 10.1007/s11606-012-2021-9
Source DB: PubMed Journal: J Gen Intern Med ISSN: 0884-8734 Impact factor: 5.128
Example of the Impact of Precision of Sensitivity on Negative Predictive Value
Each cell shows the post-biopsy probability of having cancer after a negative core-needle biopsy result.

| Type of biopsy | Analysis results | Analysis overestimated sensitivity by 1% (e.g., sensitivity 97% rather than 98%) | Analysis overestimated sensitivity by 5% (e.g., sensitivity 93% rather than 98%) | Analysis overestimated sensitivity by 10% (e.g., sensitivity 88% rather than 98%) |
|---|---|---|---|---|
| Freehand automated gun | 6% | 6% | 8% | 9% |
| Ultrasound guidance automated gun | 1% | 1% | 3% | 5% |
| Stereotactic guidance automated gun | 1% | 1% | 3% | 5% |
| Ultrasound guidance vacuum-assisted | 2% | 2% | 3% | 6% |
| Stereotactic guidance vacuum-assisted | 0.4% | 0.8% | 3% | 5% |
*For a woman with a BI-RADS® 4 score following mammography and expected to have an approximate prebiopsy risk of malignancy of 30%. Note that an individual woman’s risk may differ from these estimates depending on her own individual characteristics [11]
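The post-biopsy probabilities above follow from Bayes’ theorem: P(cancer | negative result) = prevalence × (1 − sensitivity) / [prevalence × (1 − sensitivity) + (1 − prevalence) × specificity]. A minimal Python sketch shows how a 10-point overestimate of sensitivity shifts the post-negative-test risk; the 95% specificity is an illustrative assumption, not a value taken from the table:

```python
def post_neg_prob(prev: float, sens: float, spec: float) -> float:
    """Probability of disease after a negative test result (Bayes' theorem)."""
    false_neg = prev * (1 - sens)   # diseased patients the test misses
    true_neg = (1 - prev) * spec    # non-diseased patients correctly cleared
    return false_neg / (false_neg + true_neg)

# 30% pre-biopsy risk (BI-RADS 4); specificity assumed at 95% for illustration
for sens in (0.98, 0.93, 0.88):
    p = post_neg_prob(prev=0.30, sens=sens, spec=0.95)
    print(f"sensitivity {sens:.0%} -> post-negative-test risk {p:.1%}")
```

Under these assumed inputs the risk after a negative result climbs from roughly 1% at 98% sensitivity to roughly 5% at 88% sensitivity, mirroring the pattern in the table: a modest error in the sensitivity estimate materially changes the negative predictive value.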
Required and Additional Domains and their Definitions*
| Domain | Definition and Elements | Application to evaluation of diagnostic test performance |
|---|---|---|
| Risk of Bias | Risk of bias is the degree to which the included studies for a given outcome or comparison have a high likelihood of adequate protection against bias (i.e., good internal validity), assessed through main elements: | Use one of three levels of aggregate risk of bias: |
| | • Study design (e.g., RCTs or observational studies) | • Low risk of bias |
| | • Aggregate quality of the studies under consideration from the rating of quality (good/fair/poor) done for individual studies | • Medium risk of bias |
| | | • High risk of bias |
| | | Well-designed and executed studies of new tests compared to an adequate criterion standard are rated as “low risk of bias” |
| Consistency | Consistency is the degree to which reported study results (e.g., sensitivity, specificity, likelihood ratios) from included studies are similar. This can be assessed through two main elements: | Use one of three levels of consistency: |
| | • The range of study results is narrow | • Consistent (i.e., no inconsistency) |
| | • | • Inconsistent |
| | | • Unknown or not applicable (e.g., single study) |
| | | Single-study evidence bases should be considered as “consistency unknown (single study)” |
| Directness | Directness relates to whether the evidence links the interventions directly to outcomes. For a comparison of two diagnostic tests, directness implies head-to-head comparisons against a common criterion standard | Score dichotomously as one of two levels of directness: |
| | Directness may be contingent on the outcomes of interest | • Direct |
| | | • Indirect |
| | | When assessing the directness of the overarching question, if there are no studies linking the test to a clinical outcome, then evidence that only provides diagnostic accuracy outcomes would be considered indirect. If indirect, specify which of the two types of indirectness accounts for the rating (or both, if that is the case): namely, use of intermediate/surrogate outcomes rather than health outcomes, and use of indirect comparisons. If the decision is made to grade the strength of evidence of an intermediate outcome such as diagnostic accuracy, then the reviewer does not need to automatically “downgrade” that outcome for being indirect |
| Precision | Precision is the degree of certainty surrounding an effect estimate with respect to a given outcome (i.e., for each outcome separately) | Score dichotomously as one of two levels of precision: |
| | If a meta-analysis was performed, this will be the confidence interval around the summary measure(s) of test performance (e.g., sensitivity, true-positive rate) | • Precise |
| | | • Imprecise |
| | | A precise estimate is an estimate that would allow a clinically useful conclusion. An imprecise estimate is one for which the confidence interval is wide enough to include clinically distinct conclusions |
| Publication bias† | Publication bias indicates that studies may have been published selectively, with the result that the estimate of test performance based on published studies does not reflect the true effect. Methods to detect publication bias for medical test studies are not robust. Evidence from small studies of new tests or asymmetry in funnel plots should raise suspicion for publication bias | Publication bias can influence ratings of consistency, precision, magnitude of effect (and, to a lesser degree, risk of bias and directness). Reviewers should comment on publication bias when circumstances suggest that relevant empirical findings, particularly negative or no-difference findings, have not been published or are unavailable |
| Dose-response association | This association, either across or within studies, refers to a pattern of a larger effect with greater exposure (dose, duration, and adherence) | The dose-response association may support an underlying mechanism of detection and potential relevance for some tests that have continuous outcomes and possibly multiple cutoffs [e.g., gene expression, serum PSA (prostate-specific antigen) levels, and ventilation/perfusion scanning] |
| Plausible unmeasured confounding and bias that would decrease an observed effect or increase an effect if none was observed | Occasionally, in an observational study, plausible confounding factors would work in the direction opposite to that of the observed effect. Had these confounders not been present, the observed effect would have been larger. In such cases the evidence can be upgraded | The impact of plausible unmeasured confounders may be relevant to testing strategies that predict outcomes. A study may be biased to find low diagnostic accuracy via spectrum bias and yet despite this find very high diagnostic accuracy |
| Strength of association (magnitude of effect) | Strength of association refers to the likelihood that the observed effect or association is large enough that it cannot have occurred solely as a result of bias from potential confounding factors | The strength of association may be relevant when comparing the accuracy of two different medical tests with one being more accurate than the other |
| | It is possible that the accuracy of a test is better than the reference standard because of an imperfect reference standard | It is important to consider this and modify the analysis to take into consideration alternative assumptions about the best reference standard |
*Adapted from the Methods Guide for Effectiveness and Comparative Effectiveness Reviews [3]
†The GRADE approach is moving towards considering publication bias a GRADE principal domain
Abbreviations: EPC = Evidence-based Practice Center
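To make the precision domain above concrete: judging whether a confidence interval around a sensitivity estimate “allows a clinically useful conclusion” requires computing that interval. A sketch using the Wilson score interval for a proportion, with hypothetical study counts (not data from this chapter):

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion such as sensitivity (z=1.96 -> ~95% CI)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Hypothetical single study: 45 true positives among 50 diseased patients
lo, hi = wilson_ci(45, 50)
print(f"point estimate 90%, 95% CI {lo:.1%} to {hi:.1%}")
```

With these counts the interval spans roughly 79% to 96%; a reviewer might rate such an estimate imprecise if clinical decisions would differ at the two extremes of the interval.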
Steps in Grading a Body of Evidence on Diagnostic Test Accuracy Outcomes
*Adapted from the Methods Guide for Effectiveness and Comparative Effectiveness Reviews [3]
Illustration of the Approach to Grading a Body of Evidence on Diagnostic Tests: Identifying Norovirus in a Healthcare Setting*
The columns from Risk of Bias through Publication Bias record the decrease applied to the starting grade.

| Outcome | Quantity and type of evidence | Findings | Starting grade | Risk of Bias‡ | Consistency‡ | Directness‡ | Precision‡ | Publication Bias‡ | GRADE of Evidence for Outcome | Overall GRADE§ |
|---|---|---|---|---|---|---|---|---|---|---|
| Sensitivity† | 1 DIAG | 68% | High | 0 | 0 | 0 | -1 | 0 | Moderate | Moderate |
| Specificity† | 1 DIAG | 99% | High | 0 | 0 | 0 | -1 | 0 | Moderate | |
| PPV† | 1 DIAG | 97% | High | 0 | 0 | 0 | -1 | 0 | Moderate | |
| NPV† | 1 DIAG | 82% | High | 0 | 0 | 0 | -1 | 0 | Moderate | |
*Adapted from MacCannell T, Umscheid CA, Agarwal RK, Lee I, Kuntz G, Stevenson KB, and the Healthcare Infection Control Practices Advisory Committee. Guideline for the prevention and control of norovirus gastroenteritis outbreaks in healthcare settings. Infection Control and Hospital Epidemiology. 2011; 32(10): 939-969 [20, 21]
†These outcomes were considered the most critical by the guideline developers
‡These modifiers can impact the GRADE by 1 or 2 points
§Consider the additional domains of strength of association, dose-response and impact of plausible confounders if applicable
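The arithmetic in the worked example above, starting at High and subtracting the modifier columns, can be sketched as follows. The four-level scale and the floor at Insufficient reflect the EPC strength-of-evidence convention; treat this mapping as an illustration of the bookkeeping, not the official grading algorithm:

```python
LEVELS = ["Insufficient", "Low", "Moderate", "High"]  # EPC strength-of-evidence scale

def grade_outcome(deductions: dict[str, int], start: str = "High") -> str:
    """Apply per-domain deductions (0, -1, or -2) to a starting grade."""
    score = LEVELS.index(start) + sum(deductions.values())
    return LEVELS[max(score, 0)]  # never drop below Insufficient

# Sensitivity row from the norovirus example: only precision is downgraded
mods = {"risk_of_bias": 0, "consistency": 0, "directness": 0,
        "precision": -1, "publication_bias": 0}
print(grade_outcome(mods))  # prints "Moderate"
```

Per footnote § above, the additional domains (strength of association, dose-response, plausible confounders) can also upgrade the result; that step is omitted here for brevity.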