
The reliability of a quality appraisal tool for studies of diagnostic reliability (QAREL).

Nicholas Lucas, Petra Macaskill, Les Irwig, Robert Moran, Luke Rickards, Robin Turner, Nikolai Bogduk.

Abstract

BACKGROUND: The aim of this project was to investigate the reliability of a new 11-item quality appraisal tool for studies of diagnostic reliability (QAREL). The tool was tested on studies reporting the reliability of any physical examination procedure. The reliability of physical examination is a challenging area to study given the complex testing procedures, the range of tests, and lack of procedural standardisation.
METHODS: Three reviewers used QAREL to independently rate 29 articles, comprising 30 studies, published during 2007. The articles were identified from a search of relevant databases using the following string: "Reproducibility of results (MeSH) OR reliability (t.w.) AND Physical examination (MeSH) OR physical examination (t.w.)." A total of 415 articles were retrieved and screened for inclusion. The reviewers undertook an independent trial assessment prior to data collection, followed by a general discussion about how to score each item. At no time did the reviewers discuss individual papers. Reliability was assessed for each item using multi-rater kappa (κ).
RESULTS: Multi-rater reliability estimates ranged from κ = 0.27 to 0.92 across all items. Six items were recorded with good reliability (κ > 0.60), three with moderate reliability (κ = 0.41 - 0.60), and two with fair reliability (κ = 0.21 - 0.40). Raters found it difficult to agree about the spectrum of patients included in a study (Item 1) and the correct application and interpretation of the test (Item 10).
CONCLUSIONS: In this study, we found that QAREL was a reliable assessment tool for studies of diagnostic reliability when raters agreed upon criteria for the interpretation of each item. Nine out of 11 items had good or moderate reliability, and two items achieved fair reliability. The heterogeneity in the tests included in this study may have resulted in an underestimation of the reliability of these two items. We discuss these and other factors that could affect our results and make recommendations for the use of QAREL.

Year: 2013 PMID: 24010406 PMCID: PMC3847619 DOI: 10.1186/1471-2288-13-111

Source DB: PubMed Journal: BMC Med Res Methodol ISSN: 1471-2288 Impact factor: 4.615

Background

The Quality Appraisal for Reliability Studies (QAREL) checklist is an appraisal tool recently developed to assess the quality of studies of diagnostic reliability [1]. When QAREL was first accepted for publication in 2009, no other quality appraisal tool was widely accepted for use in systematic reviews of reliability studies, and QAREL was therefore developed to fill this gap. Since then, both the COSMIN [2] and GRRAS [3] checklists have been published. COSMIN deals with the methodological quality of agreement and reliability studies, whereas GRRAS deals with the reporting of such studies. This paper focuses specifically on the evaluation of the reliability of QAREL.
QAREL is an 11-item checklist that covers 7 key domains: the spectrum of subjects; the spectrum of examiners; examiner blinding; the order effects of examination; the suitability of the time interval between repeated measurements; appropriate test application and interpretation; and appropriate statistical analysis. Using this checklist, reviewers are able to evaluate individual studies of diagnostic reliability in the preparation of systematic reviews.
QAREL was developed in consultation with a reference group of individuals with expertise in diagnostic research and quality appraisal [1]. This panel identified specific areas of bias and error in reliability studies to derive relevant items for potential inclusion on a new quality appraisal tool. Systematic reviews of reliability studies were also examined to identify existing quality appraisal tools [4-10]. In addition, the STARD [11] and QUADAS [12] resources were reviewed for additional items not already identified. Using an iterative process, members of the panel reviewed the proposed items and reduced the list to those considered essential for assessing study quality. We also developed an instruction document and data extraction form for use in systematic reviews [1]. The data extraction form is to be used in conjunction with QAREL to help systematic reviewers extract relevant information from primary studies.
It is necessary to evaluate the reliability of QAREL, where reliability is a measure of the chance-corrected agreement between different reviewers who independently rate the same set of papers. The aim of this study was to investigate the inter-rater reliability of each item on the QAREL checklist. The reliability of physical examination was chosen as the topic area for this study because there is high variability in the performance, interpretation and reporting of physical examination procedures, and this provided a challenging context in which to evaluate the reliability of QAREL.

Methods

Three reviewers (NL, RM, LR) participated in this study, which was designed to evaluate the inter-rater reliability of each item on QAREL. The University of Sydney Human Research Ethics Committee granted approval for the study. All reviewers were qualified health professionals and had experience in physical examination procedures. Each had experience in the critical appraisal of research papers, and had participated in formally reviewing papers for systematic reviews. Two reviewers (NL, RM) were involved in the development of QAREL.
A search of MEDLINE, CINAHL, AMED and SCOPUS was conducted to locate papers on the reliability of physical examination published from January 2007 through December 2007. The search string used to locate potential papers was “Reproducibility of results (MeSH) OR reliability (t.w.) AND Physical examination (MeSH) OR physical examination (t.w.)”. No limits were placed on the source title for the published paper, nor on the type of physical examination procedure reported. A total of 415 records were retrieved and screened for potential inclusion in the study. Only articles that reported on the reliability of physical examination procedures were included. A total of 29 articles, comprising 30 studies, were retrieved and included in this study [13-40].
The reviewers received basic written instructions regarding the use of QAREL [1]. Each item on the checklist can be rated as ‘Yes’, ‘No’, or ‘Unclear’, and certain items can be rated as ‘Not Applicable’. Reviewers independently performed a trial assessment of each paper, followed by a meeting with members of the reference group involved in the development of QAREL to establish baseline criteria for the interpretation of each item. At no time did the reviewers discuss individual studies, which ensured that each reviewer remained blinded to the opinions and findings of other reviewers for each study. Reviewers discussed the general interpretation of individual items on QAREL and outlined general areas of ambiguity for certain items.
Following the meeting between reviewers and the reference group, each reviewer independently rated each paper. Reviewers were not permitted to communicate about the checklist or about the individual papers being reviewed. Completed data collection forms were returned for reliability (κ) analysis.

Analysis

Data were analysed for reliability using kappa (κ) for multiple raters [41]. Each response option was recorded as a category, including ‘unclear’ and ‘not applicable’. All computations were performed using STATA 8.2 (StataCorp TX, USA). Kappa is a chance-corrected measure of inter-rater reliability, and ranges from −1 to +1, with +1 being perfect agreement, −1 being perfect disagreement, and zero being agreement no better than chance. In this study, kappa was interpreted as unreliable (κ < 0.00), poor (κ = 0.01 – 0.20), fair (κ = 0.21 – 0.40), moderate (κ = 0.41 – 0.60), good (κ = 0.61 – 0.80) and very good (κ = 0.81 – 1.00). A 95% confidence interval for kappa was computed using the test-based standard error. For this study, reliability was considered acceptable if it was moderate or higher.
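The multi-rater kappa described above can be illustrated with a short sketch. It assumes a Fleiss-type estimator, in which chance agreement is taken from the pooled category proportions; the paper cites reference [41] and STATA but does not spell out the formula, so this is an illustration rather than the authors' exact computation. The function name `fleiss_kappa` and the six-study data set are hypothetical and are not the study data.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Chance-corrected agreement for a fixed number of raters per study.

    ratings: one list per study, holding each rater's category label.
    Assumption: a Fleiss-type estimator (pooled category proportions).
    """
    n_studies = len(ratings)
    n_raters = len(ratings[0])
    if any(len(row) != n_raters for row in ratings):
        raise ValueError("every study needs the same number of ratings")

    counts = [Counter(row) for row in ratings]

    # Observed agreement: share of rater pairs that agree, averaged over studies.
    pairs = n_raters * (n_raters - 1)
    p_obs = sum(
        (sum(v * v for v in c.values()) - n_raters) / pairs for c in counts
    ) / n_studies

    # Chance agreement: sum of squared overall category proportions.
    totals = Counter()
    for c in counts:
        totals.update(c)
    p_exp = sum((t / (n_studies * n_raters)) ** 2 for t in totals.values())

    if p_exp == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_obs - p_exp) / (1.0 - p_exp)


# Hypothetical ratings for one checklist item: 3 reviewers, 6 studies.
example = [
    ["Yes", "Yes", "Yes"],
    ["Yes", "Yes", "Yes"],
    ["No", "No", "No"],
    ["Unclear", "Unclear", "Unclear"],
    ["Yes", "Yes", "No"],
    ["Unclear", "Unclear", "Yes"],
]
kappa = fleiss_kappa(example)
print(f"kappa = {kappa:.2f}")
```

Every response option, including ‘Unclear’ and ‘Not Applicable’, is treated as its own category here, which matches the way the responses were recorded in this study.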

Results

The estimates of multi-rater reliability for each item are presented in Table 1. The multi-rater scores for individual items ranged from κ = 0.27 to κ = 0.92, with one item reaching very good reliability (Item 3), eight achieving good or moderate reliability (Items 2, 4–9, 11), and two reaching fair reliability (Items 1, 10).

Table 1

Multi-rater reliability for reviewers' ratings of 30 studies of diagnostic reliability using QAREL

Item	Item description (abbreviated)	Subsequent evaluation
		κ	95% CI
1	Was the sample of subjects representative?	0.27	(0.11, 0.42)
2	Was the sample of raters representative?	0.59	(0.43, 0.74)
3	Were raters blinded to the findings of other raters?	0.92	(0.76, 1.00)
4	Were raters blinded to their own prior findings?	0.78	(0.62, 0.94)
5	Were raters blinded to the accepted reference standard?	0.66	(0.49, 0.82)
6	Were raters blinded to clinical information not part of the test?	0.51	(0.37, 0.64)
7	Were raters blinded to additional non-clinical cues?	0.59	(0.39, 0.78)
8	Was the order of examination varied?	0.71	(0.58, 0.84)
9	Was the time interval between repeated measures appropriate?	0.69	(0.50, 0.88)
10	Was the test applied correctly and interpreted appropriately?	0.35	(0.18, 0.51)
11	Were appropriate statistical measures of agreement used?	0.73	(0.54, 0.92)

κ = multi-rater kappa. 95% CI = 95% confidence interval.
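As a cross-check on the summary in the text, the point estimates in Table 1 can be sorted into the interpretation bands defined in the Analysis section. This is a minimal sketch; the names `TABLE_1`, `BANDS`, `interpret` and `acceptable` are hypothetical, and the treatment of values that fall between the published band edges (for example κ = 0.00, or κ = 0.205) is an assumption: each band is read as extending up to and including its upper bound.

```python
# Kappa point estimates copied from Table 1 (item number -> kappa).
TABLE_1 = {1: 0.27, 2: 0.59, 3: 0.92, 4: 0.78, 5: 0.66, 6: 0.51,
           7: 0.59, 8: 0.71, 9: 0.69, 10: 0.35, 11: 0.73}

# Upper bound of each band, taken from the Analysis section.
# Assumption: values between published band edges fall into the band above.
BANDS = [(0.20, "poor"), (0.40, "fair"), (0.60, "moderate"),
         (0.80, "good"), (1.00, "very good")]


def interpret(kappa):
    """Return the verbal label for a kappa value."""
    if kappa < 0:
        return "unreliable"
    for upper, label in BANDS:
        if kappa <= upper:
            return label
    raise ValueError("kappa cannot exceed 1")


def acceptable(kappa):
    """The study treats moderate or higher reliability as acceptable."""
    return interpret(kappa) in {"moderate", "good", "very good"}


by_band = {}
for item, kappa in TABLE_1.items():
    by_band.setdefault(interpret(kappa), []).append(item)

for label, items in by_band.items():
    print(label, items)
print("below the acceptability threshold:",
      [item for item, kappa in TABLE_1.items() if not acceptable(kappa)])
```

Under these assumptions the tally is one item in the very good band, five in the good band, three in the moderate band and two in the fair band, which agrees with the counts reported in the text.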

Reliability of each item

Item 1, regarding the representativeness of subjects, was reported with fair reliability (κ = 0.27). The reviewers identified “subject representativeness” as a difficult item to rate because each paper in this study presented a different diagnostic test procedure. Under normal circumstances, the scope of a systematic review would limit the number of tests, making it possible for reviewers to identify and agree upon appropriate criteria and thereby making judgments for this item more straightforward. In this evaluation, 10 studies were classified as “Yes” and three studies were classified as “No” by all 3 raters. Two raters agreed on “Yes” for 12 studies, “No” for 3 studies and “Unclear” for 1 study.
Reviewers also expressed difficulty rating Item 2, regarding the representativeness of the raters. This item, however, achieved moderate reliability (κ = 0.59). All three raters agreed on “Yes” for 15 studies, “No” for 2 studies and “Unclear” for 4 studies. Two raters agreed on “Yes” for 5 studies, “No” for 1 study, and “Unclear” for 2 studies.
For Item 3, reviewers reliably reported whether the raters in a given study were blinded to the findings of other raters. This item, which only has relevance to studies of inter-rater reliability, was reported with very good reliability (κ = 0.92). All three reviewers selected “Yes” for 18 studies, “Unclear” for 5 studies and “Not Applicable” for 5 studies. “No” was not recorded for any study.
The purpose of Item 4 is to identify whether raters had any prior knowledge of the test outcome for a particular subject before rating them in the study. There are two possible situations in which this might occur. First, in studies of intra-rater reliability, the rater may recall their findings from the first ‘rating’ when they rate the subject a second time. The second possibility is that the rater may have performed the test on a subject prior to their enrolment in the study. For example, subjects may have been recruited from the rater’s own list of patients, and the rater may recall examination findings from their prior assessment of the patient. This item achieved good reliability (κ = 0.78). All three reviewers selected “Not Applicable” for 20 studies, “Yes” for 5 studies and “Unclear” for one study. “No” was not recorded for any study.
Item 5 concerns the blinding of raters to the results of the accepted reference standard. This item achieved good reliability (κ = 0.66). All three reviewers selected “Not Applicable” for 22 studies, “Yes” for 2 studies and “Unclear” for one study. “No” was not recorded for any study.
Item 6 refers to whether raters were blinded to clinical information that was not intended to form part of the test procedure. This item was found to be moderately reliable (κ = 0.51). All three raters agreed on “Yes” for five studies and “Unclear” for 13 studies. The remaining responses were spread across all categories.
The purpose of Item 7 is to identify whether raters had access to non-clinical information that was not intended to form part of the test procedure. Reliability may be influenced by the recognition of additional cues such as tattoos, scars, voice accent and unique identifying features on imaging films. The reviewers noted that they could think of a large number of potential ‘additional cues’ that might be important for each study, and found it difficult to judge this item without predetermined criteria. Reliability for this item was moderate (κ = 0.59). All three reviewers classified 22 studies as “Unclear” for this item and three studies as “Yes”. Only a single reviewer selected “No” for a single study.
Item 8 requires reviewers to consider the order of examination and whether it was varied during the study. This item was reported with good reliability (κ = 0.71). All three raters agreed on “Yes” for 10 studies, “No” for one study, “Unclear” for 7 studies and “Not Applicable” for 3 studies.
Item 9 considers the time interval between repeated test applications. This item achieved good reliability (κ = 0.69). All three raters agreed on “Yes” for 24 studies and “Unclear” for 3 studies. Only a single reviewer selected “No” for a single study.
Item 10 requires reviewers to consider whether the test has been applied correctly and interpreted appropriately. This item was reported with fair reliability (κ = 0.35). Interpretation of these results should take into account that each study reported a different physical examination test. Under more typical systematic review conditions, only one or a small number of related tests would be reported. All 3 reviewers selected “Yes” for 23 studies and “No” for one study. A single reviewer selected “Unclear” for 4 studies, “Yes” for one study and “No” for one study.
Item 11 requires reviewers to consider whether the statistical analysis used was appropriate. Reliability for this item was found to be good (κ = 0.73). All three reviewers agreed on “Yes” for 26 studies and “No” for 2 studies.

Discussion

In this study we evaluated the reliability of individual items on the QAREL checklist in the area of physical examination. We found that the majority of items were reported with either moderate or good reliability, with two items achieving fair reliability. From these results, we consider that QAREL is a reliable tool for the assessment of studies of diagnostic reliability, and we emphasize that reviewers should have the opportunity to discuss the criteria by which to rate individual studies, as is typical in the preparation of systematic reviews. We also recommend further studies to evaluate the reliability of QAREL as used by different examiners and in different contexts.
As mentioned in the background, COSMIN is a related tool that has also been published and assessed for reliability [42]. COSMIN was developed to evaluate the measurement properties of health measurement instruments, of which reliability is one property, whereas QAREL was developed specifically to evaluate reliability. COSMIN has been evaluated for inter-rater reliability [42] in a study comprising 88 examiners who used COSMIN to rate a total of 75 papers. Of the 14 COSMIN reliability items, good reliability (κ = 0.72) was achieved for one item, and moderate reliability (κ = 0.41-0.60) was achieved for 5 items. For the reliability of items on QAREL, 6 of 11 items had good reliability, and 3 had moderate reliability. The QAREL and COSMIN reliability studies differ markedly in their design, however, which makes it difficult to compare reliability between the items or constructs that they have in common.
Four main factors should be taken into consideration in the interpretation of the results. First, reliability of physical examination is a challenging area to investigate. Physical examination procedures are subject to variability in both test application and interpretation. In addition, many of the disorders that are evaluated by physical examination procedures do not have an accepted reference standard by which to confirm test results. This absence makes it difficult for reviewers to determine whether any differences observed in repeated test outcomes are attributable to real changes in the underlying disorder or to variability in the test application and interpretation. For example, Item 9 is concerned with whether the time interval between repeated applications of the same test was appropriate, yet this knowledge can only be determined by application of an accepted reference standard. This example highlights the need for reviewers to agree upon criteria for rating this item prior to undertaking reviews of individual studies.
Second, this study is atypical because each of the articles reports the reliability of a different physical examination procedure, with no two articles reporting on the same test. This introduced an unusually high level of variability in this study in terms of the test procedures, type of patients or subjects, type of examiners, and types of disorder. Under normal conditions, QAREL would more likely be used to evaluate a group of related papers, each reporting the reliability of the same test in different patient groups and as performed by different examiners. In that context, reviewers would establish agreed criteria by which to rate each item on QAREL prior to evaluating the papers. This study, therefore, evaluated QAREL under challenging circumstances, and this may have led to lower reliability estimates.
A third factor that should be mentioned is that the estimated reliability (kappa) for each item is affected by the distribution of responses across the available categories for that item. A large imbalance in the number of responses across categories, as occurred for Item 10, can result in a low estimate for reliability (kappa) even when observed agreement between raters is high.
Lastly, this study comprised three reviewers and 29 papers reporting studies of reliability in the area of physical medicine. Further evaluation is warranted to assess the reliability of QAREL in other contexts, and the effect of training. A larger study would provide scope to investigate the effect of reviewer experience and training.
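The third factor discussed above, the dependence of kappa on how responses are distributed across categories, can be made concrete with a small numerical sketch. The two data sets below are hypothetical (they are not the ratings from this study) and the function `kappa_from_counts` is an illustrative name; it assumes a Fleiss-type multi-rater kappa computed from per-study category counts. Both data sets contain 24 unanimous studies and 6 two-against-one splits, so observed agreement is identical, and only the balance between categories differs.

```python
def kappa_from_counts(rows):
    """rows[i][j] = number of raters who placed study i in category j.

    Returns (observed agreement, kappa). Assumes a Fleiss-type estimator
    and the same number of raters for every study.
    """
    n_studies = len(rows)
    n_raters = sum(rows[0])
    pairs = n_raters * (n_raters - 1)
    # Share of agreeing rater pairs, averaged over studies.
    p_obs = sum(
        (sum(c * c for c in row) - n_raters) / pairs for row in rows
    ) / n_studies
    # Chance agreement from the pooled category proportions.
    col_totals = [sum(col) for col in zip(*rows)]
    p_exp = sum((t / (n_studies * n_raters)) ** 2 for t in col_totals)
    return p_obs, (p_obs - p_exp) / (1 - p_exp)


# Hypothetical data: 30 studies, 3 raters, two categories per row.
skewed = [[3, 0]] * 24 + [[2, 1]] * 6                      # almost all in one category
balanced = [[3, 0]] * 12 + [[0, 3]] * 12 + [[2, 1]] * 6    # evenly spread

for name, rows in (("skewed", skewed), ("balanced", balanced)):
    p_obs, kappa = kappa_from_counts(rows)
    print(f"{name}: observed agreement = {p_obs:.2f}, kappa = {kappa:.2f}")
```

In this sketch observed agreement is about 0.87 in both cases, yet kappa is roughly 0.73 for the balanced data and slightly below zero for the skewed data, because chance agreement is already very high when nearly all responses fall in a single category.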

Conclusion

In this study, we found that QAREL was a reliable assessment tool for studies of diagnostic reliability when reviewers had the opportunity to discuss the criteria by which to interpret each item. Reliability for 9 out of 11 items was moderate or good, and fair for 2 (Items 1 and 10). The results for these two items were likely affected by the heterogeneous group of papers evaluated in this study and the challenges inherent in the field of physical examination. If reviewers utilize QAREL after agreement on the criteria by which they will make judgments for each item, they can expect the tool to be reliable. Further testing of QAREL in different contexts is needed to establish the reliability of this tool more firmly.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

The authors of this paper are Nicholas Lucas (NL), Petra Macaskill (PM), Les Irwig (LI), Rob Moran (RM), Luke Rickards (LR), Robin Turner (RT), and Nikolai Bogduk (NB). The author contributions were: NL conceived of the study, designed the initial study protocol and implemented the study. PM, LI and NB provided advice on the study protocol and participated in the study as the reference group. NL, RM, and LR undertook the reliability study and rated all papers. NL wrote the first draft of the paper. All authors contributed to and approved the final version of the paper.

Pre-publication history

The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1471-2288/13/111/prepub

36 in total

Review 1. Are chiropractic tests for the lumbo-pelvic spine reliable and valid? A systematic critical literature review.

Authors: L Hestbaek; C Leboeuf-Yde
Journal: J Manipulative Physiol Ther Date: 2000-05 Impact factor: 1.437

Review 2. Inter-examiner reliability of passive assessment of intervertebral motion in the cervical and lumbar spine: a systematic review.

Authors: E van Trijffel; Q Anderegg; P M M Bossuyt; C Lucas
Journal: Man Ther Date: 2005-07-01

Review 3. Reliability of procedures used in the physical examination of non-specific low back pain: a systematic review.

Authors: Stephen May; Chris Littlewood; Annette Bishop
Journal: Aust J Physiother Date: 2006

4. Assessment of forearm pronation strength in C6 and C7 radiculopathies.

Authors: James Rainville; Damon J Noto; Cristin Jouve; Louis Jenis
Journal: Spine (Phila Pa 1976) Date: 2007-01-01 Impact factor: 3.468

Review 5. Methodological quality and outcomes of studies addressing manual cervical spine examinations: a review.

Authors: Dieter Hollerwöger
Journal: Man Ther Date: 2006-02-17

6. The Colorado Haemophilia Paediatric Joint Physical Examination Scale: normal values and interrater reliability.

Authors: M R Hacker; S M Funk; M J Manco-Johnson
Journal: Haemophilia Date: 2007-01 Impact factor: 4.287

Review 7. Manual examination of the spine: a systematic critical literature review of reproducibility.

Authors: Mette Jensen Stochkendahl; Henrik Wulff Christensen; Jan Hartvigsen; Werner Vach; Mitchell Haas; Lise Hestbaek; Alan Adams; Gert Bronfort
Journal: J Manipulative Physiol Ther Date: 2006 Jul-Aug Impact factor: 1.437

Review 8. Reliability of spinal palpation for diagnosis of back and neck pain: a systematic review of the literature.

Authors: Michael A Seffinger; Wadie I Najm; Shiraz I Mishra; Alan Adams; Vivian M Dickerson; Linda S Murphy; Sibylle Reinsch
Journal: Spine (Phila Pa 1976) Date: 2004-10-01 Impact factor: 3.468

Review 9. Towards complete and accurate reporting of studies of diagnostic accuracy: The STARD Initiative.

Authors: Patrick M Bossuyt; Johannes B Reitsma; David E Bruns; Constantine A Gatsonis; Paul P Glasziou; Les M Irwig; Jeroen G Lijmer; David Moher; Drummond Rennie; Henrica C W de Vet
Journal: Ann Intern Med Date: 2003-01-07 Impact factor: 25.391

10. The development of QUADAS: a tool for the quality assessment of studies of diagnostic accuracy included in systematic reviews.

Authors: Penny Whiting; Anne W S Rutjes; Johannes B Reitsma; Patrick M M Bossuyt; Jos Kleijnen
Journal: BMC Med Res Methodol Date: 2003-11-10 Impact factor: 4.615
