Abstract
Rank correlation is invariant to bijective marginal transformations, but it is not immune to confounding. Assuming a categorical confounding variable is observed, the author proposes weighted coefficients of correlation for continuous variables developed within a larger framework based on copulas. While the weighting is clear under the assumption that the dependence is the same within each group implied by the confounder, the author extends the Minimum Averaged Mean Squared Error (MAMSE) weights to borrow strength between groups when the dependence may vary across them. Asymptotic properties of the proposed coefficients are derived and simulations are used to assess their finite sample properties.
Keywords: Confounding; Copulas; MAMSE weights; Rank statistics; Weighted methods
Year: 2017 PMID: 32010547 PMCID: PMC6961503 DOI: 10.1186/s40488-017-0076-1
Source DB: PubMed Journal: J Stat Distrib Appl ISSN: 2195-5832
Fig. 1 Scatter plot of height vs salary for 150 men and 150 women that are simulated as independent variables conditional on gender. A Spearman correlation of 0.137 (p-value = 0.018) provides (false) evidence against independence.
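As a rough check on the numbers in this caption, the p-value can be approximated with the large-sample result that, under independence, Spearman's coefficient multiplied by sqrt(n - 1) is approximately standard normal. This is only a sketch: the normal approximation and the helper name `spearman_pvalue_normal` are assumptions made here, and the test actually used for Fig. 1 may differ.

```python
import math


def spearman_pvalue_normal(r, n):
    """Two-sided p-value for H0: independence, assuming the large-sample
    approximation r * sqrt(n - 1) ~ N(0, 1). Illustrative helper only."""
    z = abs(r) * math.sqrt(n - 1)
    # erfc(z / sqrt(2)) equals 2 * (1 - Phi(z)), with Phi the standard normal CDF.
    return math.erfc(z / math.sqrt(2))


if __name__ == "__main__":
    # Values taken from the caption: r = 0.137 on 150 + 150 = 300 observations.
    print(round(spearman_pvalue_normal(0.137, 300), 3))
```

Under this approximation the result lands close to the reported p-value, which suggests the caption's figures are consistent with a standard asymptotic test applied to the pooled data.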
Fig. 2 Histograms of 10,000 p-values of a test of independence. Salary and Height are simulated as independent random variables with gender-specific marginal distributions. On the left panel, the tests of independence are based on Spearman’s rho calculated on the pooled data. On the right panel, a weighted version of the coefficient of correlation leads to an apparently unbiased test.
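The contrast between the two panels can be illustrated with a small simulation. The sketch below is not the author's MAMSE-weighted coefficient: it assumes the simplest weighting, a sample-size-weighted average of within-group Spearman coefficients, and the marginal parameters (height means and standard deviations, log-salary means) are hypothetical values chosen only to create gender-specific margins.

```python
import random


def ranks(values):
    """1-based ranks; ties are not handled (continuous data assumed)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        out[idx] = rank
    return out


def spearman(x, y):
    """Spearman's rho via the no-ties formula 1 - 6 * sum(d^2) / (n (n^2 - 1))."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


def simulate(n_per_group, seed):
    """Return (pooled rho, size-weighted average of within-group rho)."""
    rng = random.Random(seed)
    # Hypothetical margins per group: (height mean, height sd, log-salary mean).
    groups = [(176.0, 7.0, 11.0), (163.0, 6.5, 10.5)]
    heights, salaries, within, sizes = [], [], [], []
    for mu_h, sd_h, mu_s in groups:
        h = [rng.gauss(mu_h, sd_h) for _ in range(n_per_group)]
        # Salary is drawn independently of height within each group.
        s = [rng.lognormvariate(mu_s, 0.4) for _ in range(n_per_group)]
        within.append(spearman(h, s))
        sizes.append(n_per_group)
        heights += h
        salaries += s
    pooled = spearman(heights, salaries)
    total = sum(sizes)
    # Assumed weighting: proportional to group size (not the MAMSE weights).
    weighted = sum(n_k / total * r_k for n_k, r_k in zip(sizes, within))
    return pooled, weighted


if __name__ == "__main__":
    pooled, weighted = simulate(150, seed=1)
    print(f"pooled rho = {pooled:.3f}, weighted rho = {weighted:.3f}")
```

Because both margins shift with the group, the pooled ranks inherit a positive association that is absent within each group, whereas the weighted average only uses within-group ranks and stays near zero.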
Fig. 3 Side-by-side boxplots for the marginal data of the Iris data set by species.
Spearman’s correlation matrices for sepal length, sepal width, petal length and petal width
The species-specific matrices contain Spearman’s coefficients for each of the three species of iris, namely Setosa, Versicolor and Virginica, while the pooled matrix contains Spearman’s correlation for the 150 irises taken as a single dataset, hence ignoring marginal discrepancies.
Relative MSE of Spearman’s correlation matrices for sepal length (SL), sepal width (SW), petal length (PL) and petal width (PW)
| Species of interest | MAMSE | Matrix | SL-SW | SL-PL | SL-PW | SW-PL | SW-PW | PL-PW |
|---|---|---|---|---|---|---|---|---|
| Setosa | Global | 76 | 34 | 38 | 170 | 103 | 92 | 132 |
| Setosa | Pairwise | | 25 | 38 | 174 | 77 | 68 | 104 |
| Versicolor | Global | 170 | 139 | 120 | 250 | 175 | 169 | 272 |
| Versicolor | Pairwise | | 154 | 139 | 350 | 225 | 217 | 284 |
| Virginica | Global | 181 | 208 | 207 | 159 | 140 | 149 | 79 |
| Virginica | Pairwise | | 198 | 169 | 169 | 136 | 141 | 95 |
The values listed are relative MSEs based on 10,000 repetitions. Each species of iris is in turn the target group. The MAMSE weights are calculated based on a global or pairwise strategy. Relative MSEs are reported for each pairwise correlation, as well as for the correlation matrix in the case of global weights.
Fig. 4 Histograms of 1000 p-values of a resampling test for the homogeneity of the copulas. Simulated iris datasets are generated from two scenarios. On the left panel, the three species of iris share the same copula. On the right panel, each of the three species is generated from a multivariate normal distribution with parameters estimated from that species in the original dataset.
Performance of different weighted measures of dependence, reported as ratios
|  | 20 | 50 |  | 20 | 50 |  | 20 | 50 |  |
|---|---|---|---|---|---|---|---|---|---|
|  | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
|  | 93 | 93 | 93 | 94 | 95 | 95 | 99 | 99 | 99 |
|  | 93 | 94 | 98 | 78 | 87 | 95 | 35 | 48 | 69 |
|  | 60 | 64 | 67 | 53 | 63 | 72 | 33 | 46 | 68 |
|  | 79 | 86 | 95 | 76 | 86 | 95 | 66 | 77 | 91 |
|  | 52 | 59 | 65 | 54 | 64 | 72 | 61 | 73 | 88 |
|  | 45 | 56 | 63 | 46 | 59 | 70 | 39 | 50 | 72 |
In a practical situation, the confounding would make it impossible to calculate some of these coefficients on the whole dataset, but they are used here as unattainable ideal benchmarks. Five samples of size n are simulated from a Clayton distribution with Spearman’s correlation ρ. Each scenario is repeated 10,000 times.
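To make the simulation design more concrete, one common way to draw from a Clayton copula is conditional inversion. The sketch below assumes that sampler and checks the draws against the known relation between the Clayton parameter θ and Kendall's τ, namely τ = θ/(θ + 2). The function names and the choice θ = 2 are illustrative assumptions; the study itself indexes the scenarios by Spearman's ρ, which has no simple closed-form link to θ, and its exact sampling procedure is not stated here.

```python
import random


def sample_clayton(n, theta, rng):
    """Draw n pairs (u, v) from a Clayton copula (theta > 0) by conditional inversion."""
    pairs = []
    for _ in range(n):
        u = 1.0 - rng.random()  # uniform on (0, 1], avoids u == 0
        w = 1.0 - rng.random()  # uniform on (0, 1], plays the role of C(v | u)
        # Invert the conditional CDF dC/du = w for v.
        inner = (w ** (-theta / (1.0 + theta)) - 1.0) * u ** (-theta) + 1.0
        pairs.append((u, inner ** (-1.0 / theta)))
    return pairs


def kendall_tau(pairs):
    """Kendall's tau by direct O(n^2) counting of concordant minus discordant pairs."""
    n = len(pairs)
    score = 0
    for i in range(n):
        xi, yi = pairs[i]
        for j in range(i + 1, n):
            xj, yj = pairs[j]
            prod = (xi - xj) * (yi - yj)
            score += (prod > 0) - (prod < 0)
    return 2.0 * score / (n * (n - 1))


if __name__ == "__main__":
    theta = 2.0  # illustrative value; implies tau = theta / (theta + 2) = 0.5
    sample = sample_clayton(500, theta, random.Random(0))
    print(f"empirical tau = {kendall_tau(sample):.3f}, theoretical tau = {theta / (theta + 2):.3f}")
```

Repeating such draws for five groups, applying group-specific marginal transformations, and recomputing the coefficients over many repetitions would give a simulation of the same general shape as the one summarized in the table, under the assumptions stated above.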
Fig. 5 Power of a test of independence based on different coefficients of correlation. The two columns of plots are for estimates of Spearman’s ρ and Kendall’s τ, respectively; the rows correspond to different scenarios described with equations on the right. Equal samples of size n = 20 are drawn from five groups from a Clayton distribution with correlation ρ. The null hypothesis is H₀: ρ₁ = 0. The power is simulated with 1000 repetitions on 51 different values of ρ₁ to yield a curve that is then smoothed.