Correlation and agreement: overview and clarification of competing concepts and measures.

Jinyuan Liu, Wan Tang, Guanqin Chen, Yin Lu, Changyong Feng, Xin M Tu.

Abstract

Agreement and correlation are widely used concepts that assess the association between variables. Although similar and related, they represent completely different notions of association. Assessing agreement between variables assumes that the variables measure the same construct, while correlation of variables can be assessed for variables that measure completely different constructs. This conceptual difference requires the use of different statistical methods, and when assessing agreement or correlation, the statistical method may vary depending on the distribution of the data and the interest of the investigator. For example, the Pearson correlation, a popular measure of correlation between continuous variables, is only informative when applied to variables that have linear relationships; it may be non-informative or even misleading when applied to variables that are not linearly related. Likewise, the intraclass correlation, a popular measure of agreement between continuous variables, may not provide sufficient information for investigators if the nature of poor agreement is of interest. This report reviews the concepts of agreement and correlation and discusses differences in the application of several commonly used measures.

Keywords:  Kendall's tau; Pearson's correlation; Spearman's rho; concordance correlation; intraclass correlation; non-linear association

Year:  2016        PMID: 27605869      PMCID: PMC5004097          DOI: 10.11919/j.issn.1002-0829.216045

Source DB:  PubMed          Journal:  Shanghai Arch Psychiatry        ISSN: 1002-0829


Introduction

Agreement and correlation are widely used concepts in the medical literature. Both are used to indicate the strength of association between variables of interest, but they are conceptually distinct and, thus, require the use of different statistics. Correlation focuses on the association of changes in two outcomes, outcomes that often measure quite different constructs such as cancer and depression. The Pearson correlation is the most popular measure of the association between two continuous outcomes, but it is only useful when measuring linear relationships between variables. If the relationship is non-linear, the Pearson correlation generally does not provide a good indication of association between the variables. Another problem is that using the standard interpretation of Pearson correlation coefficients can, in some circumstances, lead to incorrect conclusions. Agreement, also known as reproducibility, is a concept closely related to, but fundamentally different from, correlation. Like correlation, agreement also assesses the relationships between outcomes of interest, but, as the name indicates, the emphasis is on the degree of concordance in the opinions between two or more individuals or in the results between two or more assessments of the variable of interest. An example of agreement in mental health research is the consensus between multiple clinicians about the psychiatric diagnoses of a group of patients. In biomedical sciences agreement can also include measures of the reproducibility (i.e., reliability) of a laboratory test result when repeated in the same center or when conducted in multiple centers under the same conditions. It is not sensible to speak of agreement (reproducibility) between variables that measure different constructs; so when measuring the association between different variables - such as weight and height - one can assess correlation but not agreement. 
For continuous outcomes, the intraclass correlation (ICC) is a popular measure of agreement. Like the Pearson correlation, the ICC is an estimate of the magnitude of the relationship between variables (in this case, between multiple assessments of the same variable). However, the ICC also takes into account rater bias, the element that distinguishes agreement from correlation; that is, good agreement (reproducibility) not only requires good correlation, it also requires small rater bias. In this report, we provide an overview of popular measures and statistical methods for assessing the two different notions of association between variables. We also clarify the key differences between the measures and between the methods used to assess the measures. We focus on continuous outcomes and assume all variables are continuous unless stated otherwise.

Correlation measures

Pearson correlation

Consider a sample of n subjects and a bivariate continuous outcome, (u_i, v_i), observed on each subject within the sample (1 ≤ i ≤ n). The Pearson correlation is the most popular statistic for measuring the association between the two variables u and v:

ρ̂ = Σ_i (u_i - ū)(v_i - v̄) / √[Σ_i (u_i - ū)² · Σ_i (v_i - v̄)²],   (1)

where ū (v̄) denotes the sample mean of the u_i (v_i). The Pearson correlation ranges between -1 and 1, with 1 (-1) indicating perfect positive (negative) correlation and 0 indicating no association between the variables. As popular as it is, the Pearson correlation is only appropriate for measuring correlation between u and v when the two variables follow a linear relationship. If the bivariate outcome (u, v) follows a non-linear relationship, ρ̂ is not an informative measure and is difficult to interpret. To see this, let μ_u (μ_v) and σ_u² (σ_v²) denote the (population) mean and (population) variance of the variable u (v). The Pearson correlation is an estimate of the following product-moment correlation:

ρ = Cov(u, v) / (σ_u σ_v).   (2)

Unlike ρ̂, which measures correlation between u and v based on the sample, the product-moment correlation ρ is the population-level correlation, which cannot be calculated directly but is estimated by ρ̂. Thus, ρ̂ may also be referred to as the 'sample product-moment correlation'. If u and v have a linear relationship, then u = av + b + ε, where a and b are constants and ε denotes a random error with mean 0 and variance σ_ε². By centering u (v) at its mean, we have u - μ_u = a(v - μ_v) + ε, from which it follows that

σ_u² = a²σ_v² + σ_ε².   (3)

If u and v are perfectly correlated, that is, σ_ε² = 0, then Cov(u, v) = aσ_v² and σ_u = |a|σ_v, and it follows from Equation (2) that ρ = 1 (or -1), depending on whether a is positive or negative. Also, if u and v are uncorrelated, or independent, that is, a = 0, then ρ = 0, and vice versa. If u and v have a non-linear relationship, the product-moment correlation generally does not provide an informative measure of correlation; the example below shows that the Pearson correlation in this case can be quite misleading.

Example 1.
Suppose that u and v are perfectly correlated and follow the non-linear relationship u = v⁹. Further, assume that v follows the standard normal distribution N(0, 1), with mean 0 and variance 1. Then Cov(u, v) = E(v¹⁰) = 945 and σ_u² = Var(v⁹) = E(v¹⁸) = 34459425, so the product-moment correlation is:

ρ = 945 / √34459425 ≈ 0.161.

The poor association between u and v indicated by the product-moment correlation contradicts the conceptually perfect correlation between the two variables. Thus, the product-moment correlation and its sample counterpart, the Pearson correlation, generally do not apply to non-linear relationships.
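Example 1 can be checked numerically. The sketch below (a minimal illustration in Python with NumPy; the variable names and the simulation size are ours) simulates v ~ N(0, 1), sets u = v⁹, and compares the sample Pearson correlation with the analytic product-moment correlation ρ = E(v¹⁰)/√E(v¹⁸) = 945/√34459425 ≈ 0.161.

```python
import numpy as np

rng = np.random.default_rng(12345)
v = rng.standard_normal(100_000)
u = v ** 9                       # perfectly, but non-linearly, related to v

# Sample Pearson correlation, Equation (1)
r_hat = np.corrcoef(u, v)[0, 1]

# Population product-moment correlation, Equation (2):
# Cov(u, v) = E(v^10) = 9!! = 945, Var(u) = E(v^18) = 17!! = 34459425
rho = 945 / np.sqrt(34459425)

print(round(rho, 3))             # 0.161, despite the perfect relationship
```

The sample value r_hat is very unstable here because v⁹ is extremely heavy-tailed, which is itself a symptom of how poorly the product-moment correlation summarizes this non-linear relationship.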

Spearman's Rho

Spearman's rho is also a popular measure of association. Unlike the Pearson correlation, it also applies to non-linear relationships, thereby addressing the aforementioned limitation of the Pearson correlation. Let q_i (r_i) denote the rankings of the u_i (v_i) (1 ≤ i ≤ n). Spearman's rho is defined as:

ρ̂_s = Σ_i (q_i - q̄)(r_i - r̄) / √[Σ_i (q_i - q̄)² · Σ_i (r_i - r̄)²].   (4)

By comparing (1) and (4), it is clear that ρ̂_s is really the Pearson correlation applied to the rankings (q_i, r_i) of the original variables (u_i, v_i). Since the rankings only concern the ordering of the observations, relationships among the rankings are always linear, regardless of whether the original variables are linearly related. Thus, Spearman's rho not only has the same interpretation as the Pearson correlation, but also applies to non-linear relationships. The Spearman ρ̂_s ranges between -1 and 1, with 1 (-1) indicating perfect positive (negative) correlation; when ρ̂_s = 0 there is no association between the variables u and v. If ρ̂_s = 1, then

q_i = r_i, 1 ≤ i ≤ n,   (5)

and if ρ̂_s = -1, then

q_i = n - r_i + 1, 1 ≤ i ≤ n.   (6)

Any two pairs of bivariate outcomes (u_i, v_i) and (u_j, v_j) from data satisfying (5) are said to be concordant; that is, u_i and v_i are either both larger or both smaller than u_j and v_j. Pairs from data satisfying (6) are said to be discordant. Thus, perfect positive (negative) correlation by Spearman's rho corresponds to perfect concordance (discordance); that is, to concordant (discordant) pairs (u_i, v_i) and (u_j, v_j) for all 1 ≤ i < j ≤ n.

Example 2. Table 1 shows 12 observations of the bivariate outcome (u, v) as described in Example 1, along with the ranks associated with these observations. Note that u and v are perfectly related, so their rankings are identical; that is, q_i = r_i.
Table 1

A sample of 12 bivariate outcomes (u, v) simulated with u = v⁹ and v drawn from the standard normal N(0, 1).

v_i:        0.26   1.49   1.39   0.65  -0.49  -1.38  1.168   0.87  -0.96   2.15  -0.03  -1.08
u_i:           0   38.1   19.4   0.02 -0.002  -18.5   4.06   0.29 -0.689  971.6      0  -2.10
q_i (r_i):     6     11     10      7      4      1      9      8      3     12      5      2
In this example the Pearson correlation is ρ̂ = 0.531, while Spearman's ρ̂_s = 1. Thus, only Spearman's rho captures the perfect non-linear relationship between u and v. Note that the Pearson correlation ρ̂ = 0.531 is biased far upward relative to the product-moment correlation ρ = 0.161; this occurs because of the small sample size, n = 12. As the sample size increases, ρ̂ becomes closer to ρ, a property known as 'consistency' in statistics. For example, we also simulated (u, v) with n = 1000 and obtained ρ̂ = 0.173, much closer to ρ. Like the Pearson correlation, the Spearman's rho in (4) is a statistic based on a sample. This sample Spearman's rho is an estimate of the following population Spearman's rho:

ρ_s = 12E[F(u)G(v)] - 3.   (7)

In Equation (7), F(t) = E[I(u ≤ t)] and G(t) = E[I(v ≤ t)] denote the cumulative distribution functions of u and v, where I(·) is the indicator function. Note that the sample Spearman's rho in (4) is simply referred to as Spearman's rho in the literature; unlike the Pearson correlation, there is no formal name for the population Spearman's rho in (7). In general, the lack of a formal name for the population version does not cause confusion, since it is usually clear from the context which one is meant. The population version of a statistic is called a parameter in statistical parlance, and the statistic and the parameter serve different purposes. For example, only the parameter can be used in stating statistical hypotheses, such as the null hypothesis H0: ρ_s = 0 for testing whether the population Spearman's rho is 0. Values of Spearman's rho reported in studies are always the sample Spearman's rho.
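The rank-based construction of Spearman's rho can be sketched directly from its definition as the Pearson correlation of the ranks. The snippet below (assuming Python with NumPy) uses the v values from Table 1 and recomputes u = v⁹ exactly, so the Pearson value may differ slightly from the 0.531 reported for the rounded table entries.

```python
import numpy as np

def ranks(x):
    """Rank observations from smallest (1) to largest (n); assumes no ties."""
    out = np.empty(len(x), dtype=float)
    out[np.argsort(x)] = np.arange(1, len(x) + 1)
    return out

# v values from Table 1; u recomputed as v^9
v = np.array([0.26, 1.49, 1.39, 0.65, -0.49, -1.38,
              1.168, 0.87, -0.96, 2.15, -0.03, -1.08])
u = v ** 9

q, r = ranks(u), ranks(v)
rho_s = np.corrcoef(q, r)[0, 1]   # Spearman's rho = Pearson correlation of ranks
rho_p = np.corrcoef(u, v)[0, 1]   # Pearson correlation of the raw values

print(round(rho_s, 10))           # 1.0: ranks agree exactly since u = v^9 is monotone
```

Because u = v⁹ is strictly increasing, q_i = r_i for every subject, so condition (5) holds and ρ̂_s = 1 while the raw Pearson correlation stays well below 1.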

Kendall's Tau

Another alternative for non-linear association is Kendall's tau. Like Spearman's rho, Kendall's tau also exploits the concept of concordance and discordance to derive a measure of association for bivariate outcomes. Unlike Spearman's rho, it uses the notion of concordant and discordant pairs directly in the definition of the correlation measure. Specifically, the sample Kendall's tau is defined as:

τ̂ = (n_c - n_d) / n_t,   (8)

where n_c (n_d) is the number of concordant (discordant) pairs and n_t = n(n - 1)/2 is the total number of pairs in the sample. If n_c = n_t (n_d = n_t), then τ̂ = 1 (-1), and vice versa. Also, if there is no association between u and v, then n_c and n_d should be close to each other and τ̂ should be close to 0 (not exactly 0 because of sampling variability). Thus, like Spearman's rho, τ̂ = 1 (-1) corresponds to perfect concordance (discordance), and a value of τ̂ close to 0 indicates weak or no association between the variables u and v. Like the Pearson and Spearman correlations, the sample Kendall's tau in (8) estimates the following population parameter:

τ = E[I((u_1 - u_2)(v_1 - v_2) > 0)] - E[I((u_1 - u_2)(v_1 - v_2) < 0)],

where (u_1, v_1) and (u_2, v_2) denote two independent copies of (u, v). Like its sample counterpart, τ also ranges between -1 and 1. If (5) holds true for all pairs, then E[I((u_1 - u_2)(v_1 - v_2) > 0)] = 1 and τ = 1; likewise, if (6) holds for all pairs, then τ = -1.

Example 3. Consider the data in Example 2. The sample Kendall's tau is τ̂ = 1. Thus, like Spearman's rho, Kendall's tau also provides a sensible measure of association for non-linearly related variables.
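The pair-counting definition in (8) can be sketched by brute force over all n(n-1)/2 pairs (a minimal illustration in Python with NumPy; for large n a library routine would be preferable, since this loop is O(n²)).

```python
import itertools
import numpy as np

def kendall_tau(u, v):
    """Sample Kendall's tau, Equation (8): (n_c - n_d) / (n(n-1)/2)."""
    n_c = n_d = 0
    for (u_i, v_i), (u_j, v_j) in itertools.combinations(zip(u, v), 2):
        s = (u_i - u_j) * (v_i - v_j)
        if s > 0:
            n_c += 1        # concordant pair
        elif s < 0:
            n_d += 1        # discordant pair
    n = len(u)
    return (n_c - n_d) / (n * (n - 1) / 2)

v = np.array([0.26, 1.49, 1.39, 0.65, -0.49, -1.38,
              1.168, 0.87, -0.96, 2.15, -0.03, -1.08])
u = v ** 9

print(kendall_tau(u, v))        # 1.0: every pair is concordant
print(kendall_tau(-u, v))       # -1.0: every pair is discordant after reversing u
```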

Agreement and measures of agreement

Agreement, or reproducibility, is another widely used concept for assessing the relationship among outcomes. As indicated in the Introduction, unlike the variables considered in correlation analysis, variables considered for agreement must measure the same construct. Conversely, the measures of correlation discussed above generally do not apply to agreement.

Example 4. Consider two judges who each rate the subjects in a study of 5 subjects sampled from a population of interest, using a scale from 1 to 10. Let u_i and v_i denote the two judges' ratings of the ith subject (1 ≤ i ≤ 5), and suppose the ratings are such that u and v are linearly related. The Pearson correlation can then be applied, yielding ρ̂ = 1, indicating perfect correlation. However, the data clearly do not indicate perfect agreement; in fact, the two judges hardly agree with one another. The poor agreement in this hypothetical example is due to bias in the judges' ratings: the mean ratings for the two judges are 3 (for u) and 8 (for v). Thus, despite the perfect correlation between the ratings, the two judges do not have good agreement because of bias in their ratings of the subjects; either u is biased downward or v is biased upward (or both). The issue of bias does not apply to correlation because the variables considered for correlation generally measure different constructs and, thus, typically have different means. For the Pearson correlation, the sample means ū and v̄ are removed in the calculation of the correlation in (1); thus, the Pearson correlation is unaffected by differences between the (sample) means of the variables being correlated.
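Example 4's individual ratings are not reproduced in the text, so the sketch below uses a hypothetical pair of rating vectors consistent with the stated summary (perfect correlation, means 3 and 8), assuming Python with NumPy. It shows how the centering in Equation (1) makes the Pearson correlation blind to a constant rater bias.

```python
import numpy as np

# Hypothetical ratings consistent with Example 4: judge 2 always scores
# 5 points higher than judge 1, so correlation is perfect but agreement is poor.
u = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # judge 1, mean 3
v = u + 5.0                               # judge 2, mean 8

r_hat = np.corrcoef(u, v)[0, 1]
print(round(r_hat, 6))          # 1.0: centering in Equation (1) hides the bias
print(u.mean(), v.mean())       # 3.0 8.0: a 5-point systematic disagreement
```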

Intraclass correlation

Intraclass correlation (ICC) is a popular measure of agreement for continuous outcomes. Like the Pearson correlation, the ICC requires a linear relationship between the variables. However, it differs from the Pearson correlation in one key respect: the ICC also takes into account differences in the means of the measures being considered. In addition, the ICC can be applied to situations where there are three or more raters. Consider a study with n subjects and assume each subject is rated by a different group of K judges. Let y_ik denote the rating of the ith subject by the kth judge (1 ≤ i ≤ n, 1 ≤ k ≤ K). The ICC is defined based on the following linear mixed-effects model [1]:

y_ik = μ + β_i + ε_ik,   β_i ~ N(0, σ_β²),   ε_ik ~ N(0, σ_ε²).   (9)

In the above model, the fixed effect μ is the (population) mean rating of the study population over all possible K judges from the population of judges. The random effect, or latent variable, β_i represents the difference between the mean rating of the ith subject and the mean rating μ of the study population; thus, the sum μ + β_i represents the mean rating of the ith subject. The intraclass correlation is defined as the variance ratio

ρ_ICC = σ_β² / (σ_β² + σ_ε²),

the ratio of the variance σ_β² of the subjects' mean ratings (μ + β_i) to the total variance, consisting of σ_β² plus the variance σ_ε² attributable to the judges. If there are only two judges (K = 2), then under the linear mixed-effects model in (9) the product-moment correlation between y_i1 and y_i2 is the same as the ICC; that is, ρ = ρ_ICC. Moreover, y_i1 and y_i2 have the same mean (μ) and variance (σ_β² + σ_ε²). Thus, in this special case, the ICC is the same as the product-moment correlation (ρ_ICC = ρ). Note that this result does not contradict the data in Example 4: there u and v do not have the same mean, so the linear mixed-effects model in (9) does not apply to the data and the ICC no longer serves its intended purpose in that case.
However, since differences in the means of judges' ratings decrease the ICC, this agreement index may still be applied in this situation to indicate poorer agreement. Follow-up analyses are then necessary to determine whether poor agreement is due to bias, large variability, or both.

Example 5. Consider again Example 4 and let y_i1 = u_i and y_i2 = v_i. By fitting the model in (9) to the data, we obtain the estimates σ̂_β² = 0 and σ̂_ε² = 9.167. Thus, the (sample) ICC based on the data is ρ̂_ICC = 0, which is quite different from the Pearson correlation ρ̂ = 1: although the judges' ratings are perfectly correlated, agreement between the judges is extremely poor. Note that ρ̂_ICC is not a valid measure of agreement between y_i1 and y_i2 for the data in Example 5, since the assumption of a common mean between y_i1 and y_i2 is not met by the data. However, it is precisely this assumption that makes ρ̂_ICC totally different from the Pearson correlation ρ̂ = 1. We may revise the model in (9) to account for the bias in the judges' ratings:

y_ik = μ + μ_k + β_i + ε_ik,   μ_1 = 0,   (10)

where the added fixed effect μ_k accounts for the difference between the two judges. By fitting the above model, we obtain the estimates σ̂_β² = 1.256, σ̂_ε² = 0, μ̂ = 3 and μ̂_2 = 5. Once bias is accounted for, the two judges have perfect agreement. The model in (10) also provides mean ratings for the judges, and the positive estimate σ̂_β² describes the variability among the subjects. Although (10) is the correct model for the data, the ICC calculated from it no longer has the interpretation of a measure of agreement; in fact, ρ̂_ICC = 1, the same as the Pearson correlation ρ̂ = 1 calculated in Example 4. Note that since ρ_ICC ≥ 0, the ICC cannot represent perfect disagreement; we can either reverse-code some of the judges' ratings or use a different index, such as the concordance correlation discussed below.
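The text fits model (9) by likelihood methods; a simpler sketch uses the classical one-way ANOVA moment estimator of the ICC (assuming Python with NumPy, and hypothetical ratings consistent with Example 4's stated means). The intermediate variance estimates therefore differ from those quoted in Example 5, but the ICC conclusions at the extremes (0 with an uncorrected 5-point bias, 1 for identical ratings) match.

```python
import numpy as np

def icc_oneway(Y):
    """One-way random-effects ICC for model (9), via ANOVA moment estimators.
    Y has one row per subject and one column per judge."""
    n, K = Y.shape
    subj_means = Y.mean(axis=1)
    msb = K * np.sum((subj_means - Y.mean()) ** 2) / (n - 1)      # between subjects
    msw = np.sum((Y - subj_means[:, None]) ** 2) / (n * (K - 1))  # within subjects
    sigma_b2 = max(0.0, (msb - msw) / K)   # subject variance, truncated at 0
    return sigma_b2 / (sigma_b2 + msw)

u = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical judge-1 ratings (mean 3)
v = u + 5.0                               # judge 2 = judge 1 + 5 (mean 8)

print(icc_oneway(np.column_stack([u, v])))        # 0.0: bias wrecks agreement
print(icc_oneway(np.column_stack([u, u + 0.0])))  # 1.0: identical ratings agree
```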

Concordance correlation

The concordance correlation (CCC) is another measure of agreement which, unlike the ICC, does not assume a common mean for the judges' ratings at the outset, so it can be used to assess both the level of agreement and the level of disagreement. However, a major limitation of the CCC is that it only applies to two judges at a time. Consider a study with n subjects and assume each subject is rated by a different group of two judges. Let y_ik again denote the rating of the ith subject by the kth judge (1 ≤ i ≤ n, 1 ≤ k ≤ 2). Let μ_k = E(y_ik) and σ_k² = Var(y_ik) denote the mean and variance of y_ik, and let σ_12 = Cov(y_i1, y_i2) denote the covariance between y_i1 and y_i2. The CCC is defined as [2]:

ρ_CCC = 2σ_12 / [σ_1² + σ_2² + (μ_1 - μ_2)²].   (11)

Unlike the ICC, no statistical model is assumed in the definition of ρ_CCC. Further, the two judges can come from two different populations of judges with different means and variances. The CCC has a convenient decomposition, ρ_CCC = ρ·C_b, where ρ is the product-moment correlation in (2) and C_b is the bias correction factor given by:

C_b = 2 / (ω + 1/ω + δ²),   where ω = σ_1/σ_2 and δ = (μ_1 - μ_2)/√(σ_1σ_2).   (12)

It can be shown that ρ_CCC = 1 (-1) if and only if ρ = 1 (-1), μ_1 = μ_2 and σ_1² = σ_2² [2]. Thus, ρ_CCC = 1 (-1) if and only if y_i1 = y_i2 (y_i1 = -y_i2); that is, when there is perfect agreement (disagreement). The bias correction factor C_b (0 < C_b ≤ 1) in (12) assesses the level of bias, with smaller C_b indicating larger bias. Thus, unlike the ICC, poor agreement can result from low correlation (small ρ) or large bias (small C_b).

Example 6. Consider again Example 5. The (sample) means and variances of y_i1 and y_i2, and the (sample) correlation between them, are: μ̂_1 = 3, μ̂_2 = 8, σ̂_1² = 2.5, σ̂_2² = 2.5 and ρ̂ = 1. Thus, it follows from (11) that ρ̂_CCC = 0.0533. We can also obtain the CCC by using the decomposition result, which in our case yields ρ̂ = 1, Ĉ_b = 0.0533 and ρ̂_CCC = ρ̂·Ĉ_b = 0.0533. Note that, unlike correlation, the issue of linear versus non-linear association does not arise when assessing agreement. This is because good agreement requires an approximately linear relationship between the outcomes. For example, in the case of two raters, good agreement requires that y_i1 and y_i2 be close to each other, with y_i1 = y_i2 in the case of perfect agreement.
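Equation (11) is straightforward to sketch from sample moments (assuming Python with NumPy and divide-by-n moments). Because the ratings below are hypothetical vectors consistent only with Example 4's stated means, the numeric CCC differs from the value reported in Example 6, but the qualitative point is the same: perfect correlation combined with a large mean difference yields a small bias correction factor and hence a small CCC.

```python
import numpy as np

def ccc(y1, y2):
    """Concordance correlation, Equation (11):
    2*s12 / (s1^2 + s2^2 + (m1 - m2)^2), with divide-by-n moments."""
    m1, m2 = np.mean(y1), np.mean(y2)
    s1, s2 = np.var(y1), np.var(y2)
    s12 = np.mean((y1 - m1) * (y2 - m2))
    return 2 * s12 / (s1 + s2 + (m1 - m2) ** 2)

y1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical judge-1 ratings (mean 3)
y2 = y1 + 5.0                              # judge 2 = judge 1 + 5 (mean 8)

print(round(ccc(y1, y1), 6))    # 1.0: identical ratings give perfect agreement
print(round(ccc(y1, y2), 3))    # small despite a Pearson correlation of 1
```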

Discussion

We discussed the concepts of agreement and correlation and described various measures that can be used to assess the relationships among variables of interest. We focused on measures and methods for continuous outcomes; for non-continuous outcomes, different methods must be applied. For example, for categorical outcomes a different version of Kendall's tau, known as Kendall's tau-b, can be used for assessing correlation, and Kappa can be used for assessing agreement.
References

1. Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull. 1979.
2. Lin LI. A concordance correlation coefficient to evaluate reproducibility. Biometrics. 1989.
