
Grading Evidence for Laboratory Test Studies Beyond Diagnostic Accuracy: Application to Prognostic Testing.

Andrew C Don-Wauchope1, Pasqualina L Santaguida2.   

Abstract

BACKGROUND: Evidence-based guideline development requires transparent methodology for gathering, synthesizing and grading the quality and strength of evidence behind recommendations. The Grading of Recommendations Assessment, Development and Evaluation (GRADE) project has addressed diagnostic test use in many of their publications. Most of the work has been directed at diagnostic tests and no consensus has been reached for prognostic biomarkers. AIM OF THIS PAPER: The GRADE system for rating the quality of evidence and the strength of a recommendation is described. The application of GRADE to diagnostic testing is discussed and a description of application to prognostic testing is detailed. Some strengths and limitations of the GRADE process in relation to clinical laboratory testing are presented.
CONCLUSIONS: The GRADE system is applicable to clinical laboratory testing and if correctly applied should improve the reporting of recommendations for clinical laboratory tests by standardising the style of recommendation and by encouraging transparent reporting of the actual guideline process.


Year:  2015        PMID: 27683492      PMCID: PMC4975301     

Source DB:  PubMed          Journal:  EJIFCC        ISSN: 1650-3414


INTRODUCTION

The Grading of Recommendations Assessment, Development and Evaluation (GRADE) project was initiated to standardise the grading of guideline recommendations (1). The GRADE system addresses both the quality of evidence as well as the level of recommendation (2). Numerous systems exist for grading the evidence and recommendations, generated by a range of organisations representing professional societies and national/provincial/international bodies amongst others (3). The GRADE project has published two sets of papers with the most recent series still appearing in the literature (4). These provide a combination of general guidance and examples of specific application to a range of areas in medicine. This article will briefly describe the GRADE approach to evaluating the quality of evidence for diagnostic testing with a focus on laboratory tests. Figure 1 gives an overview of how this fits into the overall GRADE process that includes a number of other factors in the formation of a recommendation classified as strong or weak. Subsequently, we will describe how this can be applied to prognostic testing using our previous work on natriuretic peptides as the example. Finally, the strengths and limitations of the GRADE approach will be considered in the context of laboratory medicine.
Figure 1

The GRADE domains – the basis for the evaluation of quality of evidence

OVERVIEW OF THE GRADE SYSTEM OF RATING THE QUALITY OF THE EVIDENCE

The GRADE system uses four major domains to evaluate the quality of the evidence for a research question (Figure 1), together with several minor domains that act as modifiers of the final quality of evidence (6). Research questions would typically be expected to follow the Population-Intervention-Comparator-Outcome (PICO) format (5).

The first major domain investigates the risk of bias, or limitations, of the primary papers considered for answering the specific PICO research question behind the guideline recommendations (7). This is based on evaluation of the study design (i.e. cohort or randomised trials), the application of the study design (identification of any threats to internal validity), the reporting and analysis of the results, and the conclusions presented. A range of validated tools is available to assist researchers and guideline teams in evaluating the risk of bias in the primary papers. Systematic reviewers should include their GRADE assessment and the supporting data in the results of the systematic review.

The second major domain investigates the inconsistency of the evidence (8). This domain considers all the primary papers related to each outcome (defined in the PICO) and evaluates the direction of the effect for consistency. Inconsistency in the direction or magnitude of the effect (e.g. specificity) would result in a downward grading of the evidence for that outcome. It is evaluated by considering the range of point estimates, the confidence interval around each point estimate and the statistical testing for heterogeneity. When several outcomes are considered, inconsistency is evaluated separately for each outcome.

The third major domain investigates the indirectness of evidence in relation to outcomes (9). This domain considers the plausible or proven link between the factor being considered (e.g. the diagnostic intervention) and the outcome being evaluated. This requires consideration of potential differences in population, type of intervention, outcome measures and the comparisons made. The overall indirectness is judged against the PICO and, if present, downgrades the quality of evidence. As with inconsistency, each outcome is evaluated for indirectness.

The fourth major domain addresses the imprecision of the evidence (10). Ideally, this domain evaluates outcomes for which a summary pooled estimate is calculated in a meta-analysis to provide a measure of overall effect across different studies; the width of the 95% confidence interval (CI) in this context gives an estimate of the imprecision of the summarised data. If an intervention is being compared to a control, the individual point estimates for each included study would be judged precise if their 95% CIs do not overlap, and imprecise if they do. When the study effects cannot be meta-analysed, a number of factors (such as sample size) are considered across the literature being evaluated and graded for imprecision.

Several minor domains can also be considered when grading evidence and recommendations. One minor domain is publication bias (11), which is generally evaluated using statistical techniques to assess its probability. A sufficient number of studies must be included for the statistical test to have validity; where there are too few studies, publication bias may have to be assumed. Other aspects to consider when assessing publication bias are small numbers of studies with small populations, and predominant funding from industry sponsors whose role within the study is not specified. Other minor domains include any evidence for a dose response, the magnitude of the effect size and plausible residual confounding (12). Using the GRADE approach, the quality of evidence is reported as one of four levels: High (++++); Moderate (+++o); Low (++oo); or Very Low (oooo) (13).
The use of symbols to convey the strength of evidence is becoming more common in clinical practice guidelines and assists readers in quickly assessing the quality of the evidence upon which the recommendations are based. The definitions of these categories have been well described for therapeutic interventions (13) and we have suggested some additional descriptions applicable to diagnostic accuracy and prognostic studies. Table 1 is an adaptation of the practical interpretation of the quality of the evidence when considering intervention (13), diagnostic accuracy (14), and prognostic studies (15).
Table 1

Interpretation of the quality of evidence for GRADE

Quality | Interventions (13) | Diagnostic test for diagnostic accuracy (14) | Prognostic use of diagnostic test (15)
High Quality | We are confident that the true effect lies close to the estimate of the effect. | We are confident that the diagnostic accuracy estimates are accurate. | We are confident that the test makes an important contribution to the determination of outcome (predictive strength).
Moderate Quality | We are moderately confident in the effect estimate. The true effect is likely to be close to the estimate of the effect, but there is a possibility that it is substantially different. | We are moderately confident in the estimates of accuracy. The true accuracy estimate is likely to be close to the observed accuracy, but there is a possibility that it is substantially different. | We are moderately confident that the test makes an important contribution to the determination of the outcome. The estimate of the observed predictive strength is likely close to the true effect, but there is a possibility that it is substantially different.
Low Quality | Our confidence in the effect estimate is limited: the true effect may be substantially different from the estimate of the effect. | Our confidence in the accuracy estimate is limited: the true accuracy may be substantially different from the accuracy observed. | Our confidence in the predictive strength is limited: the true predictive strength may be substantially different from the estimate of predictive strength observed.
Very Low Quality | We have very little confidence in the effect estimate: the true effect is likely to be substantially different from the estimate of effect. | We have very little confidence in the accuracy. The true accuracy is likely to be substantially different from the observed accuracy. | We have very little confidence in the predictive estimate of the test. The true predictive strength is likely to be substantially different from the estimate of predictive strength.
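The down-rating logic behind these quality levels can be illustrated with a toy function. This is a deliberately simplified sketch and not the official GRADE algorithm: the domain names follow the text, but the numeric scoring (one step down per serious concern, two per very serious concern) is our assumption for the example.

```python
# Illustrative sketch of GRADE down-rating; not the official algorithm.
LEVELS = ["Very Low (oooo)", "Low (++oo)", "Moderate (+++o)", "High (++++)"]

def grade_quality(start_level, downgrades):
    """start_level: index into LEVELS (3 = High, e.g. for RCTs or, in
    diagnostic accuracy, well-conducted cross-sectional/cohort designs).
    downgrades: dict mapping each domain to 0 (no concern),
    1 (serious concern) or 2 (very serious concern)."""
    score = start_level - sum(downgrades.values())
    return LEVELS[max(score, 0)]  # quality cannot fall below Very Low

rating = grade_quality(3, {
    "risk_of_bias": 0,
    "inconsistency": 1,   # e.g. inconsistent specificity across studies
    "indirectness": 0,
    "imprecision": 0,
    "publication_bias": 0,
})
print(rating)  # Moderate (+++o)
```

A single serious concern in one domain drops the evidence one level; under this simplification, concerns accumulate across domains.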

GRADE FOR DIAGNOSTIC TESTING USING LABORATORY TESTS

Diagnostic testing was considered a separate category when the GRADE project published its first set of articles describing the process for evaluating the quality of evidence and recommendations (16). This was received with some scepticism by the laboratory community but has been applied successfully in some situations, albeit with a number of limitations. The challenge for diagnostic testing often lies in the nature of the study designs providing data to support the PICO question. The Oxford Centre for Evidence-Based Medicine (CEBM) has articulated this well in its table of levels of evidence for diagnostic accuracy testing (17). Within this hierarchy, the highest order (i.e. most rigorous and valid) of study designs are cohort and case-control studies, quite different from therapeutic interventions, where randomised controlled trials are considered the highest order of study design. This is noted in the GRADE description for diagnostic test strategies, where an exception is made for diagnostic accuracy studies: cross-sectional or cohort designs are accepted as study types with no downgrading for the domain of study limitations. However, the evidence is quickly down-ranked when considering the indirectness and imprecision often associated with these study design types. As more experience with the use of GRADE was gained, the approach to evaluating diagnostic accuracy studies was further developed (18, 19). The same general principles and categories apply, and it remains essential to set the question well with consideration of the PICO elements. There is some evidence to suggest that many clinical questions posed in diagnostic test studies do not distinguish between the population being tested and the problem (disease) of interest (20). The PICO format for interventions typically combines the problem with the population, while for diagnosis it may be important to define these two components separately.
For diagnostic accuracy studies the outcomes are typically the classification of the results into the proportions of true positives, true negatives, false positives and false negatives (21). This assumes that the patient-relevant clinical outcome is the correct diagnosis, which encourages a focus on diagnostic accuracy data. However, there is debate about what constitutes the most appropriate clinical outcome of testing: it has been argued that more emphasis should be placed on the role of testing in clinical pathways, and that the purpose of the test (diagnosis, monitoring, screening, prognosis, risk stratification and guiding therapy) and the clinical effectiveness of testing should be considered in the wider context of health care (22). If the clinically important outcome includes appropriate management and improvement in patient health, then it is very difficult to link the diagnostic test directly to the health outcome, and the assessment of imprecision requires that multiple other factors be considered (22, 23). A number of outcome options could be considered for diagnostic testing, and the most appropriate of these should be defined as part of the PICO (22, 24). Thus far most of the published literature has focused on diagnostic accuracy studies. The STARD document has helped improve the reporting of diagnostic accuracy studies (25). The comparator could be a “gold” standard test, but this may not be available; other options are mentioned in the STARD document. This concept has been explored further by the Agency for Healthcare Research and Quality (AHRQ) in its methods guide for medical test reviews (26). Other parts of the extended PICO question definition may include the timing and setting for the question (i.e. PICOTS) (27). Timing is often considered critical for diagnostic testing, as the interval between the test being investigated and the comparator test is essential.
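The classification of results into true positives, true negatives, false positives and false negatives maps directly onto the standard diagnostic accuracy estimates. As a short worked example (the counts below are invented for illustration):

```python
# Computing diagnostic accuracy outcomes from the 2x2 classification
# described in the text. The counts are invented for the example.

def diagnostic_accuracy(tp, fp, fn, tn):
    sensitivity = tp / (tp + fn)   # proportion of diseased correctly identified
    specificity = tn / (tn + fp)   # proportion of non-diseased correctly identified
    return sensitivity, specificity

sens, spec = diagnostic_accuracy(tp=90, fp=30, fn=10, tn=170)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")  # sensitivity=0.90, specificity=0.85
```

Note that sensitivity and specificity are calculated within the diseased and non-diseased groups respectively, which is why prevalence (and hence the testing setting) matters when these estimates are translated into predictive values for practice.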
Timing plays an important role, particularly if the investigators are not blinded to the index and reference test results. It is also important if the two tests are carried out at different time points in the disease process. For index and reference tests that require samples or procedures other than blood (for example tissue or diagnostic imaging), the two tests must be conducted within a time frame in which change in the disease process would not affect the interpretation of the test result. For laboratory testing based on blood samples, the ideal situation is collection of all samples at the same point in time. The setting often helps define the population more clearly. When the prevalence of the diagnosis changes with the setting (e.g. primary care versus a specialist clinic), the setting becomes an important component, as prevalence will impact the diagnostic accuracy data. This can be illustrated by two of the questions asked in the AHRQ comparative effectiveness review on the use of natriuretic peptides in heart failure (28, 29). Two diagnostic settings were considered, which allowed the primary papers to be grouped correctly and evaluated in the appropriate context (Table 2).
Table 2

Grading of evidence for the diagnostic use of B-type Natriuretic peptides

PICO | Diagnostic measure | Risk of bias | Inconsistency | Indirectness | Imprecision | Publication bias | Strength of evidence
Use of B-type natriuretic peptides for the diagnosis of heart failure in the emergency department (28) | Sensitivity | Low | Consistent for BNP; inconsistent for NT-proBNP | Direct | Imprecise | n/a | For both BNP and NT-proBNP: High or ++++
 | Specificity | Low | Consistent for BNP; inconsistent for NT-proBNP | Direct | Imprecise | n/a | BNP: High or ++++; NT-proBNP: Moderate or +++o
Diagnostic performance of B-type natriuretic peptide for the diagnosis of heart failure in primary care (27) | Sensitivity | Low | Consistent | Direct | Imprecise | No evidence | High or ++++
 | Specificity | Low | Inconsistent | Direct | Imprecise | No evidence | Moderate or +++o
Assessing risk of bias for diagnostic accuracy studies is discussed extensively in the GRADE papers, as this is seen as particularly challenging (18, 30). The AHRQ Methods Guide describes the challenges of assessing risk of bias in more detail (31). Validated tools such as the QUADAS II (32) tool or its predecessor QUADAS (33) can help in carefully considering the range of important factors that impact the evaluation of risk of bias. For any new systematic reviews or clinical practice guidelines the use of QUADAS II is recommended, as it improves on the earlier version. QUADAS II focuses on four aspects of risk of bias (patient selection; conduct or interpretation of the index test; conduct or interpretation of the reference test; flow and timing of the tests) together with aspects of applicability (whether the study is applicable to the population and settings of interest). In the AHRQ Methods Guide, the domain of indirectness, which concerns the link between diagnostic accuracy and clinical outcome, and the domain of imprecision were identified as challenging to assess (34). This section has provided an overview of the theoretical framework identifying ways in which the domains of risk of bias/study limitations, inconsistency, indirectness, imprecision and publication bias can be considered when evaluating the evidence for diagnostic tests. This approach has been applied successfully to diagnostic applications of laboratory tests, and Table 2 provides an example of how GRADE was applied in the recent AHRQ systematic review of natriuretic peptides in the diagnosis of heart failure (28, 29).

APPLICATION OF GRADE TO PROGNOSTIC TESTING

Although GRADE has been widely adopted for assessing the quality of the evidence in both studies of interventions and of diagnostic accuracy, it has not yet been applied to studies evaluating prognosis. In large part, this is because GRADE has not reached consensus on how to apply the criteria in the four major domains and the minor domains specifically to prognosis research. Prognosis is defined as the probable course and outcome of a health condition over time. A prognostic factor is any measure in people with a health condition that, from a specific start point, is associated with a subsequent clinical outcome (endpoint) (35). Prognostic factors, if well established, serve to stratify individuals with the health condition into categories of risk or probability for the outcomes of interest. Research into prognostic factors aims to establish which factors are modifiable, which should be included in more complex models predicting outcome, which can be used to monitor disease progression, and which show differential responses to treatment. We had the opportunity to explore the application of the GRADE approach in a systematic review in which three prognostic questions were addressed (36). In the diagnostic examples (Table 2), we considered the use of natriuretic peptides with respect to diagnosing heart failure. In addition, our systematic review considered natriuretic peptides as potential markers predicting mortality and morbidity in both acutely ill and chronic heart failure patients (37-40) as well as in the general population (41). Our review showed that both BNP and NT-proBNP generally functioned as independent predictors of subsequent mortality and morbidity at different time frames. Huguet et al. (2013) have recently proposed some guidance for adapting GRADE to prognostic studies based on their work identifying factors associated with chronic pain (15).
The main differences from GRADE as applied to intervention studies occur with respect to study limitations and to factors that may increase overall quality. With regard to study limitations, the phases of prognostic research are considered. This differs from evaluating evidence from intervention and diagnostic accuracy studies, where specific designs (e.g. RCT or cohort study) are given specific weighting. In the context of prognostic studies, there is no consensus on the taxonomy for phases of prognosis research (Table 3). The simplest approach considers three phases of prognostic research. At the lowest level of prediction (PHASE 1), prognosis studies are designed to identify potential associations of the factors of interest and are termed “exploration” (42), “predictor finding” (43) or “developmental” studies (44). PHASE 2 explanatory studies typically establish or confirm independent associations between prognostic factors and outcomes, and are also labelled “validation” studies (44). The highest level of evidence comes from PHASE 3 studies, in which the prognosis study attempts to evaluate the underlying processes that link the prognostic factor with the outcome. High quality evidence is likely to be found in PHASE 3 studies (15); conversely, moderate to very low quality evidence is based on PHASE 1 and 2 studies.
Table 3

Frameworks for sequential development of prediction models that assess the contribution of potential prognostic factors

Framework of an explanatory approach to studying prognosis (42) | Consecutive phases of multivariate prognostic research (44) | Types of multivariate prediction research (43)
PHASE 1: Identifying associations | | Predictor Finding Studies
PHASE 2: Testing independent associations | Developmental Studies (a new prognostic model is designed) | Model Development Studies without external validation
PHASE 3: Understanding Prognostic Pathways | Validation Studies (external replication of the model) | Model Development Studies with external validation
 | | External validation with or without model updating
 | Impact Studies (prognostic models are technologies which require assessment of their impact on health outcomes) | Model Impact Studies
In prognostic research, setting the clinical question is still the most important aspect, as patient-important outcomes need to be addressed in the appropriate context. Using the PICOTS format is central to this process to adequately define the population, the intervention, the timing and the setting. The comparator and the outcome are also critical but often challenging to define. The comparator could be any of a wide range of items when it comes to delineating probable course and outcome. In our examples we included the full range of reported comparators, in the form of any type of diagnosis of heart failure. This could prove challenging if one form of confirmation is clearly better than another, or if the different confirmatory tests capture different sub-populations. For the heart failure populations we did not attempt to divide these out, apart from the division between acute decompensated and chronic stable heart failure. However, we could have tried to use different diagnostic criteria, such as echocardiography findings, to delineate severity and to distinguish diastolic from systolic dysfunction. As discussed in the diagnostic accuracy section, the range of clinically relevant outcomes can be quite diverse. For prognostic outcomes, the use of clinical pathways and clinical effectiveness should be considered in addition to the more traditional mortality and morbidity outcomes. The length of time from the test to the evaluation of outcome status may be an important consideration, as the predictive strength may change with differing lengths of follow-up. Bearing all these concepts in mind is important when defining the outcome, as the applicability of the findings will depend on patient-important outcomes. Risk of bias for the prognostic studies in the natriuretic peptide systematic review was evaluated using the underlying principles of the Quality in Prognosis Studies (QUIPS) tool (45).
The elements of the QUIPS tool had been published previously, and we adapted them slightly for the prognostic questions in our study (46). The tool considers six domains that may introduce bias into a prognostic study: participation; attrition; prognostic factor measurement; confounding measurement and control; outcome measurement; and analysis and reporting (45). The study designs used for prognostic evaluation are largely cohort studies, primarily prospective in nature. However, in many reports the original study was a prospective study or randomised controlled trial and the analysis of the prognostic factor was performed as an afterthought; in such cases the study design should be classified as a retrospective cohort. There are randomised controlled trials that could be considered true evaluations of prognostic testing, but these are rare. One additional advantage of using QUIPS is its thorough assessment of the potential for confounding bias. When applying GRADE to intervention studies, the quality of evidence can be upgraded where plausible confounding in cohort studies would be expected to reduce the observed effect size. However, this assumption may not be applicable to prognostic studies, which are predominantly observational in design; residual confounding can affect predictions in either direction (over- or under-estimation of the predictive strength) or have no effect at all (15). Our systematic review of natriuretic peptides and heart failure showed that most studies had many plausible confounders (biases) that were not accounted for in the adjusted analysis (i.e. residual confounders) (38, 40). The methods used in our comparative effectiveness review attempted to establish whether a minimum of three critical confounders (age, renal function, and BMI or another measure of height and weight) had been considered in the study design or in the analysis.
As an example, to evaluate confounding from renal function we considered multiple terms to identify the relevant tests and conditions (Table 4). Our findings showed consistent problems, with studies measuring these three plausible confounders but not considering several other potential confounders. However, it was not clear which, if any, of these affected our estimates of prediction, or the direction of any impact. The domain of confounder measurement and control is essential in prognostic studies because the link between the prognostic test and the outcome is most often not direct, so all other known factors that influence the outcome need to be taken into account. This evaluation of primary papers allowed us to judge the overall bias of the papers included for each sub-question we addressed, as well as to gain some insight into the other relevant domains of GRADE. Huguet et al. (2013) have also made use of the QUIPS tool in their experience with chronic pain systematic reviews (15).
Table 4

Example of the range of terms used to identify renal dysfunction in the prognostic evaluation of natriuretic peptides

Terms used for renal function | Test used for renal function
renal failure | urea or BUN
acute renal failure | blood (serum or plasma) creatinine
ARF | creatinine clearance
primary acute renal failure | urine creatinine
chronic renal failure |
CRF |
acute interstitial nephritis |
acute tubular necrosis |
azotemia |
dialysis |
glomerulonephritis |
hemodialysis |
obstructive renal failure |
renal insufficiency |
kidneys |
acute kidney failure |
diabetes |
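To make the QUIPS assessment concrete, the six domains listed earlier can be imagined as per-study ratings rolled up into an overall risk-of-bias judgement. The aggregation rule below (the worst-rated domain determines the overall judgement) is our assumption for illustration, not part of the QUIPS tool itself.

```python
# Hypothetical roll-up of QUIPS domain ratings; the "worst domain wins"
# rule is an illustrative assumption, not specified by QUIPS.

QUIPS_DOMAINS = [
    "participation", "attrition", "prognostic factor measurement",
    "confounding measurement and control", "outcome measurement",
    "analysis and reporting",
]

ORDER = {"low": 0, "moderate": 1, "high": 2}

def overall_risk_of_bias(ratings):
    """ratings: dict mapping each QUIPS domain to 'low'/'moderate'/'high'."""
    missing = [d for d in QUIPS_DOMAINS if d not in ratings]
    if missing:
        raise ValueError(f"unrated domains: {missing}")
    return max(ratings.values(), key=lambda r: ORDER[r])

study = dict.fromkeys(QUIPS_DOMAINS, "low")
study["confounding measurement and control"] = "high"  # e.g. residual confounders
print(overall_risk_of_bias(study))  # high
```

This mirrors the pattern seen in the natriuretic peptide review: a single weak domain, most often confounding measurement and control, can dominate the overall judgement for a study.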
Inconsistency can be estimated from the summary tables with the point estimates and 95% CIs of odds ratios (OR), hazard ratios (HR) and relative risks (RR). This follows the description from the GRADE group, and the application of this category does not differ from tests of intervention or diagnostic tests (8). The proposed adaptation of GRADE to prognostic studies asks raters to consider indirectness in the context of the population, the prognostic factor and the outcome. The less generalisable the results for each of these contexts, the higher the likelihood of down-rating this category. Indirectness is typically present when one considers the prognostic use of a test, as there is very seldom a direct link between the test and the outcome of interest. There are typically numerous steps in the process, and many of these are completely independent of the test being evaluated. If the factors described by the GRADE group (population, intervention, outcome and comparator) are well described in the PICOTS, it may be possible to find a group of primary studies that match all factors in the same way; if such a group of studies can be found, indirectness may not be present. In the natriuretic peptide systematic review, primary studies differed in outcomes and comparators, which clearly made the evidence-to-outcomes link indirect (38, 40). Imprecision shows some interesting differences between its application in guidelines and in systematic reviews (10). For systematic reviews the goal is estimating the effect size, while for guidelines the goal is to support a recommendation. Thus in a systematic review precision is interpreted from the width of the 95% CI, while in guidelines it is interpreted from the ability to separate the test from the comparator. When possible, the pooled effect size and its confidence limits are the ideal tool with which to evaluate imprecision. Consideration should also be given to the sample size of studies (10).
However, meta-analysis is not always available, as its appropriate application requires that the included studies match the PICOTS closely. When meta-analysis is not possible, the range of effect sizes and the spread of the 95% CIs need to be considered. Publication bias follows the same principles described in the GRADE papers (11). Although the issue has been noted in recent literature in the context of prognostic studies (47), there is currently no registry of prognostic studies, or of studies related to laboratory testing, so it is difficult to make informed judgements about the likelihood of publication bias. Careful consideration and description of all the GRADE domains need to be made by the guideline developers or systematic reviewers. This should be documented and written up as an appendix, to allow users of the guideline to consider the details used by the guideline writers and to allow methodologists the opportunity to further develop the concepts around the evaluation of diagnostic tests.
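The overlap check used above when judging imprecision can be sketched as a simple interval comparison. The interval values below are invented for illustration, and treating any CI overlap as "imprecise" is the simplified reading of the rule described in the text.

```python
# Hedged sketch of an imprecision check: two effect estimates are treated
# as "imprecise" here if their 95% CIs overlap. Intervals are invented.

def cis_overlap(ci_a, ci_b):
    """Each CI is a (lower, upper) tuple on the same effect scale."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]

intervention_ci = (1.4, 2.6)   # e.g. 95% CI of an odds ratio
comparator_ci = (0.8, 1.5)
print("imprecise" if cis_overlap(intervention_ci, comparator_ci) else "precise")  # imprecise
```

In practice a guideline panel would also weigh the position of the interval relative to a decision threshold, not just overlap, before down-rating for imprecision.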

STRENGTHS AND LIMITATIONS OF GRADE FOR LABORATORY TESTS

The major strengths of using the GRADE approach for evaluating the strength of evidence and recommendations are the explicitness and reproducibility of the process (48). An advantage is the requirement to define a useful and appropriate clinical question that includes the necessary components of PICOTS. The GRADE system takes into account key domains to assess the quality and strength of evidence, and the process allows for transparency when users of the guideline review the evidence behind the recommendations (49). Limitations can be grouped into a number of areas. Firstly, guideline writers often do not fully understand the GRADE system. Methodological experts are usually aware of the system, but many of those invited to participate in a guideline team will not have had sufficient exposure to GRADE, or sufficient training, to incorporate the GRADE assessment of the strength of evidence or the GRADE process for making recommendations. The GRADE system has been available for a number of years, but as it continues to develop it can be difficult for non-methodologists to keep pace with the changes. The application of GRADE requires judgement of the evidence in the domains as well as judgement of the factors that help form the recommendation. Such judgement is often construed as expert opinion, which has formed the core of clinical practice guidelines in many instances. The GRADE process is designed to move away from expert opinion alone towards an evidence-informed judgement. If the team is well versed in the GRADE literature and suitably trained, the judgement aspect will be a strength; it could be a limitation if the team is unable to consider the evidence sufficiently and is unduly influenced by its own expert opinion.
The second group of limitations relates to the challenges guideline teams face in meeting the explicit criteria required for developing structured clinical questions and for evaluating the evidence as described in the GRADE process. Although the domains of GRADE, and how to apply them, are well defined, the heterogeneity of evidence presents practical challenges to guideline development teams. For example, defining the appropriate type of study design for the highest rank of evidence can be challenging. As noted previously, the designs considered to have greater rigour (i.e. a higher form of evidence) will depend on the actual purpose of the study. For diagnostic testing and prognostic testing these will differ, and these nuances require careful reflection from the guideline developers. Initially, researchers may consider the currently published models (for example the CEBM tables and Table 3) and use these if deemed appropriate (17, 42-44). If an alternative system is used, it should be justified in the methods description. The aspects of PICOTS require careful consideration to make the question applicable to the target audience. This is reasonably straightforward for diagnostic testing (19), but definitions may be more challenging for prognostic questions, as the distinction between population and disease becomes even more important. Often more than a single outcome should be considered in order to capture the complexity of the contribution of diagnostic testing in relation to patient-important outcomes. There are practical challenges when judgements are based on a patient-relevant rather than a test accuracy perspective (19). Similarly, there are challenges in adequately judging imprecision, as statistical approaches for assessing heterogeneity in diagnostic tests are somewhat limited. The complexity and diversity of clinical care pathways may complicate the assessment of indirectness.
Here, the factors that may affect the clinical care pathway need to be accounted for when rating the directness or indirectness of the evidence. The choice of outcome measures will further influence the considered judgement process of the GRADE approach.

CONCLUSIONS

The GRADE system can be used to rate the evidence for diagnostic and prognostic use of laboratory testing. There are numerous challenges and the results may not always be seen as consistent between different guideline groups. However, the GRADE evidence rating system allows users of the guideline to compare and contrast guidelines covering the same or similar content. The transparency of the approach also allows better-informed adaptation and implementation of guideline recommendations to local practice.
REFERENCES (10 of 46 shown)

1. Bossuyt PM, Reitsma JB, Bruns DE, Gatsonis CA, Glasziou PP, Irwig LM, Moher D, Rennie D, de Vet HCW, Lijmer JG. The STARD statement for reporting studies of diagnostic accuracy: explanation and elaboration. Clin Chem. 2003.

2. Hayden JA, Côté P, Bombardier C. Evaluation of the quality of prognosis studies in systematic reviews. Ann Intern Med. 2006.

3. Schünemann HJ, Oxman AD, Brozek J, Glasziou P, Bossuyt P, Chang S, Muti P, Jaeschke R, Guyatt GH. GRADE: assessing the quality of evidence for diagnostic recommendations. Evid Based Med. 2008.

4. Horvath AR, Lord SJ, StJohn A, Sandberg S, Cobbaert CM, Lorenz S, Monaghan PJ, Verhagen-Kamerbeek WDJ, Ebert C, Bossuyt PMM. From biomarkers to medical tests: the changing landscape of test evaluation. Clin Chim Acta. 2013.

5. Guyatt GH, Oxman AD, Vist G, Kunz R, Brozek J, Alonso-Coello P, Montori V, Akl EA, Djulbegovic B, Falck-Ytter Y, Norris SL, Williams JW, Atkins D, Meerpohl J, Schünemann HJ. GRADE guidelines: 4. Rating the quality of evidence--study limitations (risk of bias). J Clin Epidemiol. 2011.

6. Guyatt GH, Oxman AD, Sultan S, Glasziou P, Akl EA, Alonso-Coello P, Atkins D, Kunz R, Brozek J, Montori V, Jaeschke R, Rind D, Dahm P, Meerpohl J, Vist G, Berliner E, Norris S, Falck-Ytter Y, Murad MH, Schünemann HJ. GRADE guidelines: 9. Rating up the quality of evidence. J Clin Epidemiol. 2011.

7. Booth RA, Hill SA, Don-Wauchope A, Santaguida PL, Oremus M, McKelvie R, Balion C, Brown JA, Ali U, Bustamam A, Sohel N, Raina P. Performance of BNP and NT-proBNP for diagnosis of heart failure in primary care patients: a systematic review. Heart Fail Rev. 2014.

8. Santaguida PL, Don-Wauchope AC, Ali U, Oremus M, Brown JA, Bustamam A, Hill SA, Booth RA, Sohel N, McKelvie R, Balion C, Raina P. Incremental value of natriuretic peptide measurement in acute decompensated heart failure (ADHF): a systematic review. Heart Fail Rev. 2014.

9. Whiting PF, Rutjes AWS, Westwood ME, Mallett S, Deeks JJ, Reitsma JB, Leeflang MMG, Sterne JAC, Bossuyt PMM. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med. 2011.

10. Atkins D, Eccles M, Flottorp S, Guyatt GH, Henry D, Hill S, Liberati A, O'Connell D, Oxman AD, Phillips B, Schünemann H, Edejer TT, Vist GE, Williams JW. Systems for grading the quality of evidence and the strength of recommendations I: critical appraisal of existing approaches. The GRADE Working Group. BMC Health Serv Res. 2004.
