
Selection and Reporting of Statistical Methods to Assess Reliability of a Diagnostic Test: Conformity to Recommended Methods in a Peer-Reviewed Journal.

Ji Eun Park¹, Kyunghwa Han², Yu Sub Sung¹, Mi Sun Chung³, Hyun Jung Koo¹, Hee Mang Yoon¹, Young Jun Choi¹, Seung Soo Lee¹, Kyung Won Kim¹, Youngbin Shin¹, Suah An¹, Hyo-Min Cho⁴, Seong Ho Park¹.

Abstract

OBJECTIVE: To evaluate the frequency and adequacy of statistical analyses in a general radiology journal when reporting a reliability analysis for a diagnostic test.
MATERIALS AND METHODS: Sixty-three studies of diagnostic test accuracy (DTA) and 36 studies reporting reliability analyses published in the Korean Journal of Radiology between 2012 and 2016 were analyzed. Studies were judged using the methodological guidelines of the Radiological Society of North America-Quantitative Imaging Biomarkers Alliance (RSNA-QIBA), and COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) initiative. DTA studies were evaluated by nine editorial board members of the journal. Reliability studies were evaluated by study reviewers experienced with reliability analysis.
RESULTS: Thirty-one (49.2%) of the 63 DTA studies did not include a reliability analysis when deemed necessary. Among the 36 reliability studies, proper statistical methods were used in all (5/5) studies dealing with dichotomous/nominal data, 46.7% (7/15) of studies dealing with ordinal data, and 95.2% (20/21) of studies dealing with continuous data. Statistical methods were described in sufficient detail regarding weighted kappa in 28.6% (2/7) of studies and regarding the model and assumptions of intraclass correlation coefficient in 35.3% (6/17) and 29.4% (5/17) of studies, respectively. Reliability parameters were used as if they were agreement parameters in 23.1% (3/13) of studies. Reproducibility and repeatability were used incorrectly in 20% (3/15) of studies.
CONCLUSION: Greater attention to the importance of reporting reliability, thorough description of the related statistical methods, efforts not to neglect agreement parameters, and better use of relevant terminology is necessary.


Keywords: Agreement; Reliability; Repeatability; Repeatability coefficient; Reproducibility; Software program; Statistical analysis; Statistical method


Year: 2017 PMID: 29089821 PMCID: PMC5639154 DOI: 10.3348/kjr.2017.18.6.888

Source DB: PubMed Journal: Korean J Radiol ISSN: 1229-6929 Impact factor: 3.500

INTRODUCTION

In addition to accuracy, reliability is an important performance metric of a diagnostic test (1, 2). In this article, reliability is used as an umbrella term covering concepts such as reproducibility, repeatability, and agreement, except in the fixed expression “reliability parameter,” which is explained in the Materials and Methods section. The problem of omitting a proper analysis of reliability in diagnostic research studies has previously been recognized (1, 2). However, this issue was still cited as one of the top 10 statistical errors seen in submissions to one prominent medical imaging journal in the recent past (3). A lack of familiarity among investigators and peer reviewers with the statistical tools designed for this purpose was among the main reasons for the suboptimal reporting of reliability analysis in diagnostic research studies (1). To help guide the proper use of these statistical tools, the Radiological Society of North America-Quantitative Imaging Biomarkers Alliance (RSNA-QIBA) (https://www.rsna.org/QIBA) and the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) initiative (http://www.cosmin.nl) have recently provided methodological guides (4, 5, 6). Furthermore, investigators, and perhaps journals themselves, appear to be less attentive to reporting reliability analysis than accuracy analysis. For example, although the Guidelines for Reporting Reliability and Agreement Studies (GRRAS) (7) exist, they do not seem to be as well known or as often referred to as the STAndards for Reporting of Diagnostic accuracy (STARD) (8). According to a study of a general radiology journal, the Korean Journal of Radiology, many more studies reporting diagnostic accuracy than studies reporting reliability were published in the same period (9).
Furthermore, in contrast with the multiple secondary research studies analyzing the reporting quality of diagnostic test accuracy (DTA) (9, 10, 11, 12, 13, 14), similar secondary research studies of reliability analyses are scarce. We therefore performed this study to evaluate the frequency of reporting a reliability analysis in DTA studies. In addition, we aimed to assess how appropriately the statistical methods for reliability analysis were selected and reported in published studies, using the methodological guides provided by the RSNA-QIBA and COSMIN initiative as the adjudication tool and studies from a general radiology journal as a sample.

MATERIALS AND METHODS

Article Search Strategy and Study Selection

We searched the PubMed MEDLINE database to identify all potentially relevant original research papers published in a single peer-reviewed journal, the Korean Journal of Radiology, during the 5-year period between January 1, 2012 and December 31, 2016. The search terms used to find DTA studies were “sensitivity” OR “specificity” OR “accuracy” OR “performance” OR “receiver operating” OR “ROC.” The search terms used to find studies that analyzed reliability were “reliability” OR “repeatability” OR “reproducibility” OR “agreement” OR “precision” OR “biomarker.” Retrieved articles were screened for eligibility. For the DTA studies, one reviewer experienced in DTA studies selected eligible articles according to criteria established elsewhere (9), with additional confirmation by another DTA expert in cases of ambiguity. Of the initial 124 candidate articles, 63 articles (15-77) were finally included. For the studies that analyzed reliability, eligible articles were chosen by consensus after review by two of four independent reviewers experienced in the relevant methodology. When the two reviewers disagreed or in cases of ambiguity, a third reviewer experienced in the related methodology was invited as an adjudicator. We excluded studies that investigated the agreement between continuous or ordinal outcomes/test results and fixed reference standard results (78, 79, 80). These studies could be viewed as extensions of DTA analysis to non-binary data, which require statistical analyses (81) different from the standard analysis used for reliability, although some published studies seem to have failed to distinguish between them. Of the initial 71 candidate articles, 36 articles (15, 19, 42, 45, 53, 57, 58, 64, 66, 67, 69, 82-106) were finally included.
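As an illustration only, the two searches described above could be written as PubMed-style query strings along the following lines. The article lists the search terms, journal, and date range but not the exact syntax, so the `[Journal]` and `[dp]` field tags and the way the clauses are combined are assumptions, and `build_query` is a hypothetical helper.

```python
# Hypothetical reconstruction of the two searches. The terms, journal, and
# dates come from the article; the field tags and clause layout are assumptions.
JOURNAL = '"Korean J Radiol"[Journal]'
DATE_RANGE = "2012/01/01:2016/12/31[dp]"

DTA_TERMS = ["sensitivity", "specificity", "accuracy",
             "performance", "receiver operating", "ROC"]
RELIABILITY_TERMS = ["reliability", "repeatability", "reproducibility",
                     "agreement", "precision", "biomarker"]


def build_query(terms):
    """Join the terms with OR and restrict to the journal and date range."""
    any_term = " OR ".join(f'"{term}"' for term in terms)
    return f"{JOURNAL} AND {DATE_RANGE} AND ({any_term})"


dta_query = build_query(DTA_TERMS)
reliability_query = build_query(RELIABILITY_TERMS)

print(dta_query)
print(reliability_query)
```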

Data Extraction for DTA Studies

Diagnostic test accuracy studies were evaluated regarding whether they also analyzed the reliability of the investigated tests/methods and, when reliability was not assessed, whether a reliability analysis was deemed necessary. We considered reliability analysis unnecessary if the tests/methods investigated in a DTA study were only a minor component of the study or if their reliability was already well established. This information was extracted by nine independent editorial board members of the journal (names are listed in the acknowledgment section). Each reviewer was assigned the articles in his/her area of expertise (two to ten articles per reviewer). When there was doubt, a second reviewer additionally reviewed the article to reach a consensus decision with the original reviewer.

Data Extraction for Reliability Studies

Before data extraction, we first established the recommended statistical methods for the analysis of the reliability of a test/method (Table 1) according to the methodological guides provided by the RSNA-QIBA and COSMIN initiative (4, 6, 107, 108). We then used the table as the reference when evaluating whether the articles conformed to the recommended statistical methods. Each article was evaluated by two of four independent reviewers experienced in the statistical methodology. Disagreements between two reviewers were adjudicated by two additional reviewers (a biostatistician) both of whom were also experienced in the statistical methodology. The reviewers extracted the data using a predetermined standardized questionnaire, which was intended to address the following issues. First, whether the authors used the proper statistical methods according to the suggestions that we established for this study (Table 1). Second, whether the authors provided a detailed description of the statistical methods. Third, for studies assessing the reliability of a continuous outcome, whether the authors distinguished between the “reliability parameter” and the “agreement parameter” (Table 1) and used them appropriately with respect to the study purpose and conclusion. Fourth, when the terms “reproducibility” and “repeatability” were used, whether the authors used the correct definitions.

Table 1

Recommended Statistical Methods for Analysis of Reliability

Dichotomous or Nominal Data (e.g., Benign vs. Malignant)	Ordinal Data (e.g., Grades I, II, III, and IV)	Continuous Data (e.g., Tumor Volume in mL)
Kappa	Weighted kappa	Reliability parameters:
Proportion of agreement	ICC	ICC
		CCC
		Agreement parameters:
		Within-subject standard deviation
		Repeatability coefficient and reproducibility coefficient
		Coefficient of variation
		Bland-Altman limits of agreement

ICC has three different models including one-way random, two-way random, and two-way mixed models, and can use either consistency or absolute agreement assumptions. As ICC value for same set of data may change according to model and assumption used, it is desirable to describe model and assumption, for example, as shown in study by Yoo et al. (86). ICC calculated using one-way random model is appropriate for assessing repeatability (112). CCC or ICC calculated using two-way model, random or mixed according to data and setting (6), are appropriate for analyzing reproducibility. Intraobserver reliability could be regarded as similar to repeatability depending on study setting, whereas interobserver reliability should be regarded as reproducibility. CCC = concordance correlation coefficient, ICC = intraclass correlation coefficient
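The difference between kappa and weighted kappa in Table 1 can be illustrated with a minimal Python sketch. It assumes two raters, uses the common linear and quadratic disagreement weights, and runs on invented grades; the function name `cohen_kappa` and the example data are hypothetical and not taken from any of the reviewed studies.

```python
def cohen_kappa(rater_a, rater_b, categories, weights="unweighted"):
    """Cohen's kappa for two raters.

    weights: "unweighted" (nominal data), or "linear" / "quadratic"
    (ordinal data; `categories` must then be listed in rank order).
    """
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)

    # Observed proportions in the k x k cross-table of the two raters.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def disagreement(i, j):
        if weights == "unweighted":
            return 0.0 if i == j else 1.0
        distance = abs(i - j) / (k - 1)
        return distance if weights == "linear" else distance ** 2

    cells = [(i, j) for i in range(k) for j in range(k)]
    d_observed = sum(disagreement(i, j) * observed[i][j] for i, j in cells)
    d_expected = sum(disagreement(i, j) * row[i] * col[j] for i, j in cells)
    return 1.0 - d_observed / d_expected


# Invented ordinal grades (I-IV coded 1-4) from two hypothetical readers.
reader_1 = [1, 2, 3, 4, 1, 2, 3, 4, 2, 3]
reader_2 = [1, 2, 3, 4, 2, 3, 4, 3, 2, 1]
for scheme in ("unweighted", "linear", "quadratic"):
    print(scheme, round(cohen_kappa(reader_1, reader_2, [1, 2, 3, 4], scheme), 3))
```

Because the three schemes generally give different values for the same ratings, a reported "kappa" for ordinal data is hard to interpret unless the weighting method is stated.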

The “reliability parameter” is a term that has a specific meaning as defined elsewhere (4, 108), unlike reliability, which is used as a general umbrella term. Reliability parameters, such as the intraclass correlation coefficient (ICC) or concordance correlation coefficient, explain how well the subjects in a study set can be distinguished from each other (108), but they do not show the exact measurement uncertainties. Small measurement uncertainties (as opposed to large measurement uncertainties) would allow for a clear distinction between the subjects, yielding a large reliability parameter score. However, a clear distinction between subjects can also be obtained even with large measurement uncertainties if there are large differences between subjects (statistically referred to as a large between-subject variance). Therefore, although reliability parameters are useful in making a relative comparison between different tests/methods regarding their levels of reliability, i.e., a higher score means greater reliability (109), they are not helpful if one wants to know what specific range of measurement differences should be considered true changes instead of mere measurement uncertainties in a longitudinal follow-up. On the other hand, “agreement parameters” assess exactly how close the results for repeated measurements are (108). Therefore, agreement parameters can be used both for the relative comparison of reliability and for the assessment of absolute measurement uncertainties. Agreement parameters are needed when investigating a test/method for potential use in a longitudinal follow-up setting.
Repeatability, as defined by RSNA-QIBA, concerns repeated measurements of the same or similar experimental units under identical or near-identical conditions, using the same measurement procedure, same operators, same measuring system, same operating conditions, and same physical location over a short period (5, 6). Reproducibility, on the other hand, applies to rerunning a measurement in slightly different settings, for example, different locations, operators, scanners, etc. (5, 6).
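Two of the points above can be made concrete with a short Python sketch: the ICC for the same data depends on the model and assumption, and a reliability parameter rises with between-subject variance even when the measurement error is unchanged. The sketch uses the standard single-measurement ICC formulas based on ANOVA mean squares; the function name `icc_single_measure` and all data are hypothetical. The two-way random and two-way mixed models share the same single-measurement formulas and differ in interpretation, so they are not computed separately here.

```python
def icc_single_measure(ratings):
    """Single-measurement ICCs for an n-subjects x k-raters table.

    Returns the one-way random ICC and the two-way ICCs under the
    absolute-agreement and consistency assumptions.
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    subject_means = [sum(row) / k for row in ratings]
    rater_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares of the two-way ANOVA decomposition.
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_subjects = k * sum((m - grand) ** 2 for m in subject_means)
    ss_raters = n * sum((m - grand) ** 2 for m in rater_means)
    ss_within = ss_total - ss_subjects
    ss_error = ss_within - ss_raters

    ms_subjects = ss_subjects / (n - 1)
    ms_within = ss_within / (n * (k - 1))
    ms_raters = ss_raters / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return {
        "one_way": (ms_subjects - ms_within)
        / (ms_subjects + (k - 1) * ms_within),
        "two_way_agreement": (ms_subjects - ms_error)
        / (ms_subjects + (k - 1) * ms_error + k * (ms_raters - ms_error) / n),
        "two_way_consistency": (ms_subjects - ms_error)
        / (ms_subjects + (k - 1) * ms_error),
    }


# Rater 2 always reads 2 units higher than rater 1 (systematic offset).
offset = [[1, 3], [2, 4], [3, 5], [4, 6]]
print(icc_single_measure(offset))

# Same within-subject differences, different between-subject spread.
narrow = [[10, 11], [11, 10], [12, 13], [13, 12]]
wide = [[10, 11], [21, 20], [32, 33], [43, 42]]
print(icc_single_measure(narrow)["one_way"])
print(icc_single_measure(wide)["one_way"])
```

In the offset example the consistency ICC ignores the systematic difference between raters while the absolute-agreement ICC penalizes it. In the second example the repeated measurements differ by exactly 1 unit in both data sets, yet the ICC is much higher for the wider-ranging subjects, which is why an ICC alone cannot define the size of a true change.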

Statistical Analysis

We obtained the study outcomes descriptively as proportions, i.e., the percentage of articles out of all eligible articles, for each of the following outcome categories:
Reporting of reliability along with accuracy.
Use of the recommended statistical methods. We considered that a study satisfied this item if it used at least one method listed in Table 1; we did not require any further details for this item (for example, explanations of weighting methods for weighted kappa or descriptions of the ICC model and assumption were not considered). The results were obtained for each of three different data types (dichotomous/nominal, ordinal, and continuous data).
Reporting of the weighting method when weighted kappa was used.
Reporting of the model and assumption when ICC was used.
Appropriate use/interpretation of reliability parameters.
Correct use of the terms reproducibility and repeatability.

RESULTS

Reporting of Reliability along with Accuracy

Of the 63 DTA studies (15-77), 32 studies (50.8%) included an analysis of reliability (n = 22) or did not include reliability analysis when the analysis was not necessary (n = 10). Thirty-one articles (49.2%) did not include a reliability analysis in cases where the analysis was deemed necessary.

Selection and Reporting of Statistical Methods to Assess Reliability

The results obtained from the 36 eligible studies (15, 19, 42, 45, 53, 57, 58, 64, 66, 67, 69, 82-106) are summarized in Table 2.

Table 2

Selection and Reporting of Statistical Methods to Assess Reliability

Items	No. of Eligible Articles (Denominator)	Yes (%)	No or Uncertain (%)
Use of recommended statistical methods
Analysis of dichotomous/nominal data	5	5 (100.0)	0 (0.0)
Analysis of ordinal data	15	7 (46.7)	8 (53.3)
Analysis of continuous data	21	20 (95.2)	1 (4.8)
Reporting of weighting method for weighted kappa	7	2 (28.6)	5 (71.4)
Reporting of model for ICC	17	6 (35.3)	11 (64.7)
Reporting of assumption for ICC	17	5 (29.4)	12 (70.6)
Appropriate use/interpretation of reliability parameters	13	10 (76.9)	3 (23.1)
Correct meaning of reproducibility and repeatability	15	12 (80.0)	3 (20.0)

Data are numbers of articles with proportion of eligible articles for each item described as percentage in parentheses.
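The percentages in Table 2 are simple proportions of the eligible articles for each item. As a quick arithmetic check, the following Python sketch recomputes them from the counts in the table; the dictionary keys are shortened labels and `percent` is a hypothetical helper.

```python
# Counts copied from Table 2: item -> (number of "yes", eligible articles).
TABLE_2 = {
    "Recommended methods, dichotomous/nominal data": (5, 5),
    "Recommended methods, ordinal data": (7, 15),
    "Recommended methods, continuous data": (20, 21),
    "Weighting method for weighted kappa reported": (2, 7),
    "ICC model reported": (6, 17),
    "ICC assumption reported": (5, 17),
    "Reliability parameters appropriately used": (10, 13),
    "Reproducibility/repeatability correctly used": (12, 15),
}


def percent(count, total):
    """Percentage rounded to one decimal place, as reported in the table."""
    return round(100 * count / total, 1)


for item, (yes, total) in TABLE_2.items():
    no = total - yes
    print(f"{item}: yes {yes}/{total} ({percent(yes, total)}%), "
          f"no or uncertain {no}/{total} ({percent(no, total)}%)")
```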

Of the five studies that reported an analysis of dichotomous/nominal data, four studies used kappa, and one study used both kappa and proportion of agreement. Of the 15 studies that reported an analysis of ordinal data, six studies used weighted kappa, and one study used both weighted kappa and ICC, whereas eight studies used kappa without clarifying if they calculated weighted kappa. Of the 21 studies that reported an analysis of continuous data, one study used Pearson's correlation coefficient instead of the recommended methods. The 20 other studies used the recommended methods, including reliability parameters alone (n = 13, 65%), agreement parameters alone (n = 2, 10%), and both reliability and agreement parameters (n = 5, 25%). Of the 17 studies that used ICC, 11 studies (64.7%) did not report the ICC model, and 12 studies (70.6%) did not explain the assumptions made for the ICC. Of the 13 studies that used reliability parameters alone, ten studies properly used and interpreted the analysis for the study purpose and conclusion, whereas three studies (23.1%) inappropriately considered the reliability parameters as if they were agreement parameters. Among the 15 studies that used reproducibility or repeatability, three studies did not use them accurately, with two studies incorrectly using reproducibility instead of repeatability and one study incorrectly using repeatability instead of reproducibility.

DISCUSSION

In our study, approximately half of the DTA studies did not include a reliability analysis when it was deemed necessary. Most of the reliability studies seem to have selected the proper statistical methods for the analysis. However, the description of further details of the statistical methods, including the weighting method for weighted kappa and the specific model and assumption for ICC, was generally poor. This study is limited in that we analyzed a single peer-reviewed journal and did not have specific data from other journals. However, in the current authors' experience, other radiology journals seem to show similar trends. Another notable observation was that studies used reliability parameters more frequently than agreement parameters for analyzing the reliability of continuous data, and a small but notable fraction (23.1%) of studies imprecisely interpreted the reliability parameters. Lastly, the distinction between repeatability and reproducibility was not always made correctly. These weaknesses found in the published papers indicate the areas that require improvement in the future.
The importance of reporting reliability along with accuracy needs to be further emphasized because the two are necessary and complementary parameters of technical performance and clinical utility for an imaging biomarker (110). It is reassuring that the published studies overall selected the proper methods for reliability analysis. For investigators who are not familiar with the statistical methods, the table of suggested methods we made for this study (Table 1) could be a useful reference, as it succinctly summarizes the well-thought-out methodological guides by the RSNA-QIBA and COSMIN initiative (4, 6, 107, 108).
Regarding the suboptimal reporting of the details of the statistical methods, some user-friendly statistical software programs, which authors frequently cite as having been used for their analysis, include these details as optional parameters and report them in their output (Fig. 1). Paying closer attention to these features would facilitate reporting the details more clearly and would also help investigators select the most appropriate statistical analysis.
The use of agreement parameters, when applicable, should also be encouraged more strongly. It has been reported that agreement parameters are often neglected in medical research studies (108), as was also seen in our study. Among these parameters, the repeatability coefficient (RC) is particularly important because it is the smallest detectable change based on the intrinsic technical uncertainties of a quantitative measurement method, and its importance is highlighted by the RSNA-QIBA (6, 108). One of the reasons why agreement parameters are underutilized compared with reliability parameters may be the lack of readily available user-friendly software programs, except for the Bland-Altman analysis. In this regard, we have developed a web calculator to compute the RC and its 95% confidence interval for two or more repeat measurements of a continuous parameter (available at http://datasharing.aim-aicro.com/reliability) according to the methods proposed elsewhere (6, 111). A software tool like this would help promote the use of agreement parameters such as RC in analyzing the reliability of quantitative imaging parameters.
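To show what these agreement parameters involve computationally, the following Python sketch estimates the within-subject standard deviation (wSD), the RC as commonly defined (1.96 × √2 × wSD, about 2.77 × wSD), and the Bland-Altman 95% limits of agreement. It is not the authors' web calculator: it assumes the same number of repeat measurements per subject, omits the 95% confidence interval of the RC, and uses invented tumor volumes; all function names are hypothetical.

```python
import math
from statistics import mean, stdev, variance


def within_subject_sd(replicates):
    """wSD: square root of the mean within-subject variance.

    Assumes every subject has the same number (>= 2) of repeat measurements.
    """
    return math.sqrt(mean([variance(subject) for subject in replicates]))


def repeatability_coefficient(replicates):
    """RC = 1.96 * sqrt(2) * wSD, i.e. about 2.77 * wSD."""
    return 1.96 * math.sqrt(2) * within_subject_sd(replicates)


def bland_altman_limits(first, second):
    """95% limits of agreement for paired measurements: bias +/- 1.96 SD."""
    differences = [a - b for a, b in zip(first, second)]
    bias = mean(differences)
    spread = 1.96 * stdev(differences)
    return bias - spread, bias + spread


# Invented test-retest tumor volumes (mL) for five hypothetical subjects.
volumes = [[10.2, 10.8], [25.1, 24.3], [40.0, 41.1], [15.6, 15.0], [33.3, 34.0]]
scan_1 = [pair[0] for pair in volumes]
scan_2 = [pair[1] for pair in volumes]

print("wSD:", round(within_subject_sd(volumes), 3))
print("RC:", round(repeatability_coefficient(volumes), 3))
print("Limits of agreement:", bland_altman_limits(scan_1, scan_2))
```

Under these assumptions, a difference between two measurements of the same subject that is smaller than the RC cannot be distinguished from measurement uncertainty, which is the information an ICC alone does not provide.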

Fig. 1

Display of detailed options associated with statistical tests used for reliability analysis in some user-friendly software programs.

A. Selection of weighting method to calculate weighted kappa with MedCalc Version 17.6 (MedCalc Software BVBA; https://www.medcalc.org). B. Selection of model and assumption to calculate ICC with IBM SPSS Statistics for Windows Version 21 (IBM Corp.). C. Selection of model and assumption to calculate ICC with MedCalc Version 17.6 (MedCalc Software BVBA). This software program does not distinguish between random and fixed effects models. ICC = intraclass correlation coefficient

Limitations of this study include the fact that the eligible articles were selected from a single journal and, therefore, there could be an issue regarding generalizability. Nevertheless, the journal, the Korean Journal of Radiology, is a representative general journal in the radiology/medical imaging field ranked 53rd out of 126 journals in the field according to the 2016 Journal Citation Reports by Clarivate Analytics. Given its rank and the coverage of topics, the Korean Journal of Radiology may be a suitable litmus test for journals in general in the radiology/medical imaging field. Second, as we focused on the quality of the reporting of the statistical analysis, our results do not necessarily reflect the overall reporting quality or quality of the research. In conclusion, the quality of reporting the reliability analysis of a diagnostic test can be improved through greater attention to the importance of reporting the reliability of a test, more thorough description of the related statistical methods, efforts not to neglect agreement parameters, and a clearer distinction of reproducibility and repeatability. Some of the tips discussed in this article, including the software tool to calculate the RC, may be helpful.

111 in total

1. Use of methodological standards in diagnostic test research. Getting better but still not good.

Authors: M C Reid; M S Lachs; A R Feinstein
Journal: JAMA Date: 1995 Aug 23-30 Impact factor: 56.272

2. The COSMIN checklist for assessing the methodological quality of studies on measurement properties of health status measurement instruments: an international Delphi study.

Authors: Lidwine B Mokkink; Caroline B Terwee; Donald L Patrick; Jordi Alonso; Paul W Stratford; Dirk L Knol; Lex M Bouter; Henrica C W de Vet
Journal: Qual Life Res Date: 2010-02-19 Impact factor: 4.147

3. Imaging findings of brain death on 3-tesla MRI.

Authors: Chul-Ho Sohn; Hwa-Pyung Lee; Jun Beom Park; Hyuk Won Chang; Ealmaan Kim; Eunhee Kim; Ui Jun Park; Hyoung-Tae Kim; Jeonghun Ku
Journal: Korean J Radiol Date: 2012-08-28 Impact factor: 3.500

4. True progression versus pseudoprogression in the treatment of glioblastomas: a comparison study of normalized cerebral blood volume and apparent diffusion coefficient by histogram analysis.

Authors: Yong Sub Song; Seung Hong Choi; Chul-Kee Park; Kyung Sik Yi; Woong Jae Lee; Tae Jin Yun; Tae Min Kim; Se-Hoon Lee; Ji-Hoon Kim; Chul-Ho Sohn; Sung-Hye Park; Il Han Kim; Geon-Ho Jahng; Kee-Hyun Chang
Journal: Korean J Radiol Date: 2013-07-17 Impact factor: 3.500

5. Differentiating benign from malignant bone tumors using fluid-fluid level features on magnetic resonance imaging.

Authors: Hong Yu; Jian-Ling Cui; Sheng-Jie Cui; Ying-Cai Sun; Feng-Zhen Cui
Journal: Korean J Radiol Date: 2014-11-07 Impact factor: 3.500

6. Diffusion-Weighted MRI of Malignant versus Benign Portal Vein Thrombosis.

Authors: Jhii-Hyun Ahn; Jeong-Sik Yu; Eun-Suk Cho; Jae-Joon Chung; Joo Hee Kim; Ki Whang Kim
Journal: Korean J Radiol Date: 2016-06-27 Impact factor: 3.500

7. Optimized Performance of FlightPlan during Chemoembolization for Hepatocellular Carcinoma: Importance of the Proportion of Segmented Tumor Area.

Authors: Seung-Moon Joo; Yong Pyo Kim; Tae Jun Yum; Na Lae Eun; Dahye Lee; Kwang-Hun Lee
Journal: Korean J Radiol Date: 2016-08-23 Impact factor: 3.500

8. CT-guided core needle biopsy of deep suprahyoid head and neck lesions.

Authors: En-Haw Wu; Yao-Liang Chen; Yi-Ming Wu; Yu-Ting Huang; Ho-Fai Wong; Shu-Hang Ng
Journal: Korean J Radiol Date: 2013-02-22 Impact factor: 3.500

9. Bronchopulmonary dysplasia: new high resolution computed tomography scoring system and correlation between the high resolution computed tomography score and clinical severity.

Authors: Su-Mi Shin; Woo Sun Kim; Jung-Eun Cheon; Han Suk Kim; Whal Lee; Ah Young Jung; In-One Kim; Jung Hwan Choi
Journal: Korean J Radiol Date: 2013-02-22 Impact factor: 3.500

10. Detection of recurrent hepatocellular carcinoma in cirrhotic liver after transcatheter arterial chemoembolization: value of quantitative color mapping of the arterial enhancement fraction of the liver.

Authors: Dong Ho Lee; Jeong Min Lee; Ernst Klotz; Soo Jin Kim; Kyung Won Kim; Joon Koo Han; Byung Ihn Choi
Journal: Korean J Radiol Date: 2012-12-28 Impact factor: 3.500
